approx. 6 million data sets
> wc -l 2001.csv
5967781 2001.csv
approx. 600 MB of data
> ls -hl 2001.csv
573M Jan 10 2016 2001.csv
29 columns; data has gaps, but looks consistent
> head 2001.csv
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2001,1,17,3,1806,1810,1931,1934,US,375,N700��,85,84,60,-3,-4,BWI,CLT,361,5,20,0,NA,0,NA,NA,NA,NA,NA
2001,1,18,4,1805,1810,1938,1934,US,375,N713��,93,84,64,4,-5,BWI,CLT,361,9,20,0,NA,0,NA,NA,NA,NA,NA
2001,1,19,5,1821,1810,1957,1934,US,375,N702��,96,84,80,23,11,BWI,CLT,361,6,10,0,NA,0,NA,NA,NA,NA,NA
2001,1,20,6,1807,1810,1944,1934,US,375,N701��,97,84,66,10,-3,BWI,CLT,361,4,27,0,NA,0,NA,NA,NA,NA,NA
You simply cannot
too large
and even if you could: way too slow
even if it were fast enough: does not run in the browser
even if it does (Office 365 or Google Sheets): is it really interactive?
Preparing the data
Kudos to Chi Nhan Nguyen for technical support on this section
All of this lives in the Python world
Working on a reduced number of random samples
Reading sample data
import numpy as np
import pandas as pd

s = 10000    # desired sample size
n = 5967780  # complete data size
# choose n - s distinct row numbers to skip (replace=False avoids duplicates)
skip_rows = sorted(np.random.choice(np.arange(1, n + 1), n - s, replace=False))
# pandas DataFrame (2-dim table structure)
df = pd.read_csv('../../data/2001/2001.csv', skiprows=skip_rows)
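The skip-list trick in a minimal stdlib sketch with toy sizes (names and numbers here are illustrative, not from the talk):

```python
import random

n_total = 100   # total number of data rows (toy value)
n_sample = 10   # desired sample size (toy value)
# skip n_total - n_sample distinct random row numbers;
# whatever remains is a uniform random sample of size n_sample
skip_rows = sorted(random.sample(range(1, n_total + 1), n_total - n_sample))
skip = set(skip_rows)
kept = [i for i in range(1, n_total + 1) if i not in skip]
print(len(kept))  # 10
```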
cols_for_correlation = [
'DayOfWeek', 'DepTime','ArrTime', 'ArrDelay', 'Distance',
'UniqueCarrier_', 'Origin_', 'Dest_'
]
seaborn.corrplot(df[cols_for_correlation])
Interesting: Distance / Origin to Unique Carrier
Pearson's r is a measure of the degree of linear(!) dependence between two variables
https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
Pearson's r: 0.77, plausible but trivial
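The definition is simple enough to compute by hand; a minimal stdlib sketch (helper name is mine, not from the talk):

```python
def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# perfectly linear data gives r = 1.0
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```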
seaborn.jointplot(df['ArrTime'], df['DepTime'], kind="hex")
4 = DL (Delta)
1 = AQ (Aloha Airlines)
working on random samples only
time series data not exploited
not interactive
https://www.elastic.co/products/elasticsearch
| Relational Database | Databases | Tables | Rows | Columns |
|---|---|---|---|---|
| Elasticsearch | Indices | Types | Documents | Fields |
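To make the mapping concrete: one CSV row becomes one document of type `flight`. A sketch of such a document as JSON, with field names from the CSV header and values taken from the sample rows shown earlier (the selection of fields is mine):

```python
import json

# one CSV row as an Elasticsearch document of type 'flight'
flight_doc = {
    'Year': 2001, 'Month': 1, 'DayofMonth': 17,
    'UniqueCarrier': 'US', 'FlightNum': 375,
    'Origin': 'BWI', 'Dest': 'CLT', 'Distance': 361,
    'ArrDelay': -3, 'Cancelled': False,
}
print(json.dumps(flight_doc, indent=2))
```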
Create an index
from pyelasticsearch import ElasticSearch, bulk_chunks
es = ElasticSearch(ES_HOST)
es.create_index(INDEX_NAME)
Create a mapping to make types clear to Elasticsearch
mapping = {
    'flight': {
        'properties': {
            'SecurityDelay': {
                'type': 'integer',
                'null_value': -1
            },
            'Origin': {
                'type': 'string'
            },
            'Cancelled': {
                'type': 'boolean'
            },
            ...
es.put_mapping(index=INDEX_NAME, doc_type=DOC_TYPE, mapping=mapping)
# reads all data into a data frame in memory
df = pd.read_csv('../../data/2001/2001.csv')
# missing values are filled with -1 (fillna returns a new frame)
df = df.fillna(-1)
# add a timestamp to each row (axis=1 applies the function row-wise)
df['@timestamp'] = df.apply(lambda row:
    pd.Timestamp('%s-%s-%s;%04d'
                 % (row['Year'], row['Month'],
                    row['DayofMonth'], row['CRSDepTime'])), axis=1)
# convert Cancelled from 0/1 into a boolean
df['Cancelled'] = df.apply(lambda row:
    row['Cancelled'] != 0, axis=1)
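The rows then go to Elasticsearch in chunks via `bulk_chunks` (imported above). The idea behind it, as a stdlib-only sketch: split a stream of documents into fixed-size lists, one bulk request per list (`chunked` is my stand-in name, not the pyelasticsearch API):

```python
from itertools import islice

def chunked(docs, docs_per_chunk):
    """Split a stream of documents into lists for bulk requests
    (a stdlib stand-in for pyelasticsearch's bulk_chunks)."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, docs_per_chunk))
        if not chunk:
            return
        yield chunk

# 1200 toy documents go out as three bulk requests: 500 + 500 + 200
docs = ({'FlightNum': i} for i in range(1200))
sizes = [len(c) for c in chunked(docs, 500)]
print(sizes)  # [500, 500, 200]
```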
Data-based vector graphics (SVG)
// creating a new crossfilter (multi-dimensional dataset)
// flights array contains all flights for a single month
flight = crossfilter(flights);
console.log(flight.size()); // 490698
// creating a dimension for all carriers of all flights
carrier = flight.dimension(d => d.UniqueCarrier);
// group() reduces this dimension to distinct values
var distinctCarriersGroup = carrier.group();
console.log("Distinct carriers: " + distinctCarriersGroup.size());
// 12
// filter just for TW
carrier.filter("TW");
// groupAll() creates a group intersecting all current filters
// Trans World Airlines, became part of American Airlines in 2001
console.log("TW flights", flight.groupAll().value()); // 19427
// display first 10 matches
console.log(flight.top(10));
Could even be used as an interactive REPL in the browser
http://blog.visual.ly/12-great-visualizations-that-made-history/
var pieChartCarriers = dc.pieChart("#pie");
pieChartCarriers
    .slicesCap(5)
    .dimension(carrier)
    .group(carrier.group().reduceCount());
// draw all registered charts
dc.renderAll();
Aloha Airlines (AQ) mainly flies between Hawaii and the West Coast (California/Nevada)
Delta (DL) mainly flies from Atlanta (ATL), its headquarters and largest hub
Do you have a (possibly large) data set worth exploring?
Best to have it as a CSV or similar (because that might be the hardest and most boring part)
Maybe try some of the techniques described here
Complete Code: https://github.com/DJCordhose/big-data-visualization/tree/gh-pages/code
Slides: http://bit.ly/data-exploration-javazone
Ping me for questions / help / comments: Oliver Zeigermann / @DJCordhose
Try out analysis part (requires numpy, pandas, matplotlib, seaborn, sklearn to be installed)
python big-data-visualization/code/es/analysis_pandas.py
Try crossfilter as an interactive console
Like a Jupyter Notebook, but in the browser, no installation required
Open the interactive console (Chrome is my recommendation)
Question: What is the longest departure and the longest arrival delay?