no big data, though
approx. 6 million records
> wc -l 2001.csv
5967781 2001.csv
approx. 600 MB of data
> ls -hl 2001.csv
573M Jan 10 2016 2001.csv
29 columns, data has gaps, but looks consistent
> head 2001.csv
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2001,1,17,3,1806,1810,1931,1934,US,375,N700��,85,84,60,-3,-4,BWI,CLT,361,5,20,0,NA,0,NA,NA,NA,NA,NA
2001,1,18,4,1805,1810,1938,1934,US,375,N713��,93,84,64,4,-5,BWI,CLT,361,9,20,0,NA,0,NA,NA,NA,NA,NA
2001,1,19,5,1821,1810,1957,1934,US,375,N702��,96,84,80,23,11,BWI,CLT,361,6,10,0,NA,0,NA,NA,NA,NA,NA
2001,1,20,6,1807,1810,1944,1934,US,375,N701��,97,84,66,10,-3,BWI,CLT,361,4,27,0,NA,0,NA,NA,NA,NA,NA
You simply cannot
too large
and even if you could: way too slow
even if it were fast enough: does not run in the browser
even if it did (Office 365 or Google Sheets): is it really interactive?
Handling large amounts of data
while maintaining the interactivity and the intuitive UI
But: data needs to be (physically) close to processing / interaction to make it fast and thus most useful
Like: benefits of Excel and Access at the same time
You can get inspiration from the data sources listed on the previous slide
After 5-10 minutes, report interesting results to the full audience
Good opportunity to discuss these questions with your neighbors
Preparing the data
Kudos to Chi Nhan Nguyen for technical support on this section
All live in the Python world
Working on a reduced number of random samples
Reading sample data
import numpy as np
import pandas as pd

s = 10000     # desired sample size
n = 5967780   # complete data size (number of rows without the header)
# randomly choose which rows to skip so that exactly s rows remain
skip_rows = sorted(np.random.choice(np.arange(1, n + 1), n - s, replace=False))
# pandas DataFrame (2-dimensional table structure)
df = pd.read_csv('../../data/2001/2001.csv', skiprows=skip_rows)
> df.count()
DepTime 9611
CRSDepTime 10000
ArrTime 9590
CRSArrTime 10000
UniqueCarrier 10000
ActualElapsedTime 9590
CRSElapsedTime 10000
AirTime 9590
ArrDelay 9590
DepDelay 9611
Origin 10000
Dest 10000
Distance 10000
Cancelled 10000
CancellationCode 0
Diverted 10000
CarrierDelay 0
WeatherDelay 0
NASDelay 0
SecurityDelay 0
LateAircraftDelay 0
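The columns with a trailing underscore used in the next snippet are not part of the raw CSV; presumably they are numeric encodings of the categorical columns, since Pearson correlation needs numbers. A minimal sketch of one way to derive them (alphabetical category codes are an assumption, not taken from the slides):

```python
# hypothetical encoding step: map categorical columns to integer codes
# so they can take part in the correlation matrix below
for col in ['UniqueCarrier', 'Origin', 'Dest']:
    df[col + '_'] = df[col].astype('category').cat.codes
```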
import seaborn

cols_for_correlation = [
    'DayOfWeek', 'DepTime', 'ArrTime', 'ArrDelay', 'Distance',
    'UniqueCarrier_', 'Origin_', 'Dest_'
]
corrmat = df[cols_for_correlation].corr()
seaborn.heatmap(corrmat, annot=True)
Interesting: Distance / Origin to Unique Carrier
Pearson's r is a measure of the degree of linear(!) dependence between two variables
By Kiatdd - Own work, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=37108966
The correlation coefficient (r) is shown above each figure
Notice: the figures in the bottom row are obviously correlated, just not linearly
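A tiny numeric illustration of that caveat (nothing beyond numpy assumed): a perfect but nonlinear dependence can still produce a Pearson's r close to zero.

```python
import numpy as np

# y is fully determined by x, but the relationship is quadratic, not linear
x = np.linspace(-1, 1, 101)
y = x ** 2
print(np.corrcoef(x, y)[0, 1])  # approximately 0 despite the perfect dependence
```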
Pearson's r: 0.77, plausible but trivial
seaborn.jointplot(df['ArrTime'], df['DepTime'], kind="hex")
4 = DL (Delta)
1 = AQ (Aloha Airlines)
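If the carriers were encoded with the category codes sketched earlier, the code-to-carrier mapping can be checked directly (a convenience snippet, not from the slides):

```python
# assumes the hypothetical category encoding shown above
categories = df['UniqueCarrier'].astype('category').cat.categories
print(dict(enumerate(categories)))  # e.g. which carriers sit behind codes 1 and 4
```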
Get the complete code from GitHub: https://github.com/DJCordhose/big-data-visualization/tree/gh-pages/code
Try out the analysis part (requires Python with numpy, pandas, matplotlib, seaborn, and sklearn installed)
python big-data-visualization/code/es/analysis_pandas.py
Easiest way to get started with Python: https://docs.continuum.io/anaconda/
working on random samples only
time series data not exploited (would be possible with Pandas)
not interactive
https://www.elastic.co/products/elasticsearch
| Relational Database | Databases | Tables | Rows      | Columns |
|---------------------|-----------|--------|-----------|---------|
| Elasticsearch       | Indices   | Types  | Documents | Fields  |
PUT /megacorp/employee/1
{
"first_name" : "John",
"last_name" : "Smith",
"age" : 25
}
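For reference, the same index/type/id coordinates retrieve the document again; a minimal sketch using pyelasticsearch (the host is an assumption):

```python
from pyelasticsearch import ElasticSearch

es = ElasticSearch('http://localhost:9200')  # assumption: local Elasticsearch
doc = es.get(index='megacorp', doc_type='employee', id=1)
print(doc['_source'])  # {'first_name': 'John', 'last_name': 'Smith', 'age': 25}
```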
Create an index
from pyelasticsearch import ElasticSearch, bulk_chunks

ES_HOST = 'http://localhost:9200'  # assumption: locally running Elasticsearch
INDEX_NAME = 'expo2009'            # hypothetical index name
es = ElasticSearch(ES_HOST)
es.create_index(INDEX_NAME)
Create a mapping to make types clear to Elasticsearch
DOC_TYPE = 'flight'
mapping = {
    'flight': {
        'properties': {
            'SecurityDelay': {
                'type': 'integer',
                'null_value': -1
            },
            'Origin': {
                'type': 'string'
            },
            'Cancelled': {
                'type': 'boolean'
            },
            # ... remaining fields omitted
        }
    }
}
es.put_mapping(index=INDEX_NAME, doc_type=DOC_TYPE, mapping=mapping)
Good for performance, query hints, and index size
import pandas as pd

# reads all data into a data frame in memory
df = pd.read_csv('../../data/2001/2001.csv')
# missing values will be filled with -1
df = df.fillna(-1)
# add a timestamp to each row (date plus scheduled departure time)
df['@timestamp'] = df.apply(lambda row:
    pd.Timestamp('%s-%s-%s;%04d'
                 % (row['Year'], row['Month'],
                    row['DayofMonth'], row['CRSDepTime'])),
    axis=1)
# convert Cancelled (0/1) into a proper boolean
df['Cancelled'] = df.apply(lambda row:
    False if row['Cancelled'] == 0 else True,
    axis=1)
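The indexing step itself is not shown on the slides; a sketch of how the prepared rows could be streamed into Elasticsearch using the bulk_chunks helper imported earlier (chunk size and per-row document structure are assumptions):

```python
# es, INDEX_NAME, DOC_TYPE and df as defined above
from pyelasticsearch import bulk_chunks

def documents():
    # one index operation per prepared row
    for _, row in df.iterrows():
        yield es.index_op(row.to_dict())

for chunk in bulk_chunks(documents(), docs_per_chunk=500):
    es.bulk(chunk, doc_type=DOC_TYPE, index=INDEX_NAME)
```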
Already possible with Kibana
Low-latency interaction (latency is more obvious when Elasticsearch and the browser are not co-located on the same machine)
(Limited) offline access
Charts limited to what Kibana offers
Data-based vector graphics (SVG)
but is it also useful?
Be careful not to overdo things
Try not to display too many dimensions in order not to overwhelm the user
Don't make it too hard, though
http://strataconf.com/big-data-conference-ny-2015/public/schedule/detail/43411
http://blog.visual.ly/12-great-visualizations-that-made-history/
// pie chart of the carriers with the most flights
// (carrier is a crossfilter dimension defined elsewhere)
var pieChartCarriers = dc.pieChart("#pie");
pieChartCarriers
    .slicesCap(5)
    .dimension(carrier)
    .group(carrier.group().reduceCount());

// time series line chart with brushing enabled
// (date is a crossfilter dimension defined elsewhere)
var chart = dc.lineChart("#series");
chart
    .dimension(date)
    .group(date.group())
    .brushOn(true);
Experiment with our Dashboard
http://bit.ly/dataviz-map-demo
It should even work on your smartphone
Did you find out anything of interest?
Give it 5 minutes max.
If you need a clear assignment:
What are the origins of Delta flights to STL (St Louis)? Why?
Will not work on your local machine (needs Elasticsearch running)
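The assignment can also be answered directly against the index instead of the dashboard; a hedged sketch with pyelasticsearch (host, index name, and the exact field values after string analysis are assumptions):

```python
from pyelasticsearch import ElasticSearch

es = ElasticSearch('http://localhost:9200')  # assumption: reachable Elasticsearch
body = {
    'size': 0,
    'query': {'bool': {'must': [
        {'match': {'UniqueCarrier': 'DL'}},
        {'match': {'Dest': 'STL'}}
    ]}},
    # count Delta flights into STL per origin airport
    'aggs': {'origins': {'terms': {'field': 'Origin'}}}
}
result = es.search(body, index='expo2009')   # hypothetical index name
for bucket in result['aggregations']['origins']['buckets']:
    print(bucket['key'], bucket['doc_count'])
```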
Aloha Airlines (AQ) mainly flies between Hawaii and West Coast (California/Nevada)
Delta (DL) mainly flies from Atlanta (ATL), its headquarters and largest hub
Slides: http://bit.ly/data-exploration-odsc
Complete Code: https://github.com/DJCordhose/big-data-visualization/tree/gh-pages/code
Ping me for questions / help / comments: http://zeigermann.eu / @DJCordhose