approx. 6 million records
> wc -l 2001.csv
5967781 2001.csv
approx. 600 MB of data
> ls -hl 2001.csv
573M Jan 10 2016 2001.csv
29 columns, data has gaps, but looks consistent
> head 2001.csv
Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay
2001,1,17,3,1806,1810,1931,1934,US,375,N700��,85,84,60,-3,-4,BWI,CLT,361,5,20,0,NA,0,NA,NA,NA,NA,NA
2001,1,18,4,1805,1810,1938,1934,US,375,N713��,93,84,64,4,-5,BWI,CLT,361,9,20,0,NA,0,NA,NA,NA,NA,NA
2001,1,19,5,1821,1810,1957,1934,US,375,N702��,96,84,80,23,11,BWI,CLT,361,6,10,0,NA,0,NA,NA,NA,NA,NA
2001,1,20,6,1807,1810,1944,1934,US,375,N701��,97,84,66,10,-3,BWI,CLT,361,4,27,0,NA,0,NA,NA,NA,NA,NA
Certainly no challenge in storing
Big Data in the sense of: too big for Excel or RAM, hard to process
Best to bring the data to the domain expert with minimal hassle
Google Sheets allows up to 2 million cells
We have 9 * 400,000 = 3.6 million cells: too many for Google Sheets
> cut -f2,3,4,5,6,7,8,9 -d, 09.csv >09_no_month.csv
> awk -F, '$1 >= 10 && $1 <= 15' 09_no_month.csv > 09_very_small.csv
> ls -lh 09_very_small.csv
-rw-r--r-- 1 olli staff 1.0M Jul 21 21:46 09_very_small.csv
> wc -l 09_very_small.csv
38272 09_very_small.csv
Physically changes the data; only one version possible
Just a view on the data; more than one view possible
Departure Delay > 200 minutes
Origins of Delta flights to STL
How many flights per day?
Flights per day / carriers / delay
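How these questions could look as code; a minimal pandas sketch (not how the Sheets demo works), assuming the 2001.csv header shown above and Delta's carrier code DL:

```python
# Sketch only: the Sheets questions expressed in pandas.
# Column names follow the 2001.csv header shown above; "DL" is Delta's carrier code.
import pandas as pd

df = pd.read_csv("2001.csv")

# Departure Delay > 200 minutes
long_delays = df[df["DepDelay"] > 200]

# Origins of Delta flights to STL
dl_origins = df[(df["UniqueCarrier"] == "DL") & (df["Dest"] == "STL")]["Origin"].value_counts()

# How many flights per day?
flights_per_day = df.groupby(["Month", "DayofMonth"]).size()
```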
The best is yet to come
Sheets can automatically find this
Things get more interactive and connected
Using the full data set for September (10x the data)
Data-driven vector graphics (SVG)
Again: Origins of Delta flights to STL
// minimal setup sketch: `flights` (the parsed CSV rows), crossfilter and dc.js are assumed to be loaded
var ndx = crossfilter(flights);
var carrier = ndx.dimension(function (d) { return d.UniqueCarrier; });
var pieChartCarriers = dc.pieChart("#pie");
pieChartCarriers
    .slicesCap(5)                             // show only the 5 biggest slices, rest becomes "Others"
    .dimension(carrier)                       // crossfilter dimension on the carrier column
    .group(carrier.group().reduceCount());    // count flights per carrier
dc.renderAll();
Not very hard, but definitely requires programming skills
Things scale
Using all of 2001 (once more 10x the data)
Data needs to be (physically) close to the interaction to make it fast and thus most useful
Again: Origins of Delta flights to STL
Same number of records, but almost 8x the size
Can be optimized by transferring only the fields you really need (sketched below)
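One possible way to do that, sketched against Elasticsearch (see the ELK column below): the `_source` filter makes the query return only the fields the chart needs. Host, index name and field mappings here are assumptions, not the setup from this talk.

```python
# Sketch only: query Elasticsearch for Delta flights to STL, transferring just the Origin field.
# "http://localhost:9200" and the index name "flights" are assumptions.
import requests

query = {
    "query": {"bool": {"filter": [
        {"term": {"UniqueCarrier": "DL"}},
        {"term": {"Dest": "STL"}},
    ]}},
    "_source": ["Origin"],   # only the field the chart really needs crosses the wire
    "size": 10000,
}
response = requests.post("http://localhost:9200/flights/_search", json=query)
origins = [hit["_source"]["Origin"] for hit in response.json()["hits"]["hits"]]
```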
Working on a small sample (just 10,000 records)
Everything lives in the Python world
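A minimal sketch of that notebook workflow, assuming pandas and matplotlib are available; the concrete analysis is just an illustration:

```python
# Sketch only: draw a 10,000-record sample and keep exploring inside the notebook.
import pandas as pd

df = pd.read_csv("2001.csv")
sample = df.sample(n=10000, random_state=42)   # small enough for instant feedback

# e.g. mean departure delay per carrier, plotted inline in the notebook
sample.groupby("UniqueCarrier")["DepDelay"].mean().sort_values(ascending=False).plot(kind="bar")
```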
| | Google Sheets | D3 | ELK | Segmented D3 | iPython |
|---|---|---|---|---|---|
| One tool for designer / user | | | | | |
| Required effort | | | | | |
| Easy to create new dashboard | | | | | |
| Interactivity | | | | | |
| Unlimited data size | | | | | |
| Offline | | | | | |
| Unrestricted widgets | | | | | |
| Auto Refresh | | | | | |
| REPL | | | | | |
| Turn Query into Dashboard | | | | | |
| Add your category here | | | | | |
Code for all examples: https://github.com/DJCordhose/big-data-visualization/code
Slides: http://bit.ly/data2day-explore
Oliver Zeigermann / @DJCordhose