The presentation was on how analysis can evolve to better take advantage of real-time data streams. The community currently does lots of fascinating analysis of real-time data from Twitter, mobile devices, sensors, etc., but it is inevitably a post mortem. By that I mean we do the analysis well after the event itself is over. If we think of a data stream as a living organism that is constantly changing, we focus our analysis on the history that has already passed.
Partially this is due to technology. Until recently, most of what we use to munge large data streams, like Hadoop, ran as batch operations. Lots of great presentations at Where went something like: I collected 5 million Tweets from Twitter over 2 days, pushed them into EC2, ran some map/reduce queries, ran analytics, and produced a really nice visualization. Lots of fascinating results come from these approaches, but the downside is that they all arrive well after the event itself happens. While this is useful for lots of purposes, it misses the opportunity inherent in data that is persistently updating. As the data changes, so do the answers to our queries and equations. So the premise of the presentation was: how can we make our analysis as dynamic as our data?
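To make that contrast concrete, here's a minimal Python sketch (the event dicts, the "user" field, and both function names are hypothetical, not from the presentation). The batch version can only answer once the collection window closes; the streaming version folds each event into the result as it arrives, so the answer is always current.

```python
# Hypothetical events with a "user" field, standing in for any data stream.

def batch_count_by_user(events):
    """Batch style: collect everything first, compute once at the end."""
    counts = {}
    for e in events:                       # runs only after collection stops
        counts[e["user"]] = counts.get(e["user"], 0) + 1
    return counts

def stream_count_by_user(event_iter):
    """Streaming style: fold each event into the result as it arrives."""
    counts = {}
    for e in event_iter:                   # the iterator may never end
        counts[e["user"]] = counts.get(e["user"], 0) + 1
        yield dict(counts)                 # an up-to-date answer per event

events = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
print(batch_count_by_user(events))         # one answer: {'a': 2, 'b': 1}
for snapshot in stream_count_by_user(iter(events)):
    print(snapshot)                        # a fresh answer after every event
```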
The analogy at the beginning of the presentation was to "just in time" manufacturing, originated by Toyota in the 1950s under the far cooler sounding term kanban. In short, a kanban approach treats manufacturing like a supermarket: as you see demand from consumers depleting the store shelves, you restock (just in time). Previously you'd do historical analysis to try and predict demand, and then schedule your inventories accordingly. Demand can be hard to predict in many industries, and kanban ended up being a great innovation; just look at Toyota's global growth after the 1950s.
On the Web, predicting global demand can be quite tricky as well. Yet a good chunk of the big data analysis we do currently takes this approach: get a big hunk o' data, look for patterns, and try to forecast forward to gain insight. Could we use a kanban-type approach instead, where we look at real-time demand and update our analyses as the data streams in, to optimize response?
This is by no means a unique idea on our part. Lots of companies have been building real-time analysis platforms, like Backtype (now Twitter Storm), EsperTech, StreamBase, Yahoo S4, Hstreaming, etc. So far, though, I've not seen anyone looking at it from a geographic perspective or using non-reductionist approaches to the data analysis. This looked like a good opportunity to try and add to the conversation. In our Where presentation we also talked a good bit about the biases and shortcomings of reductionist approaches to data science, and alternatives like entropy calculations, but we'll save that for a follow-on post so this one doesn't get too bloated.
To test the concept we took data from a mobile app used during the NYC Marathon to track runners. We then treated it like a live feed: as the data streamed out, we intersected each event from the device with a grid over NYC. As each event intersected a grid cell, we updated the equation result on the map. Calculating a sum for each grid cell is just one of many possible equations you could run against the data. We also calculated spatial entropy, to try and detect certainty/deviations in the data, but we'll save that for the follow-on post.
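A minimal sketch of that intersect-and-update loop, assuming simple lon/lat events and a fixed-size grid in degrees (the cell size, function names, and sample coordinates below are all made up for illustration):

```python
import math

CELL_SIZE_DEG = 0.001   # assumed grid resolution in degrees, not from the talk

def cell_for(lon, lat, size=CELL_SIZE_DEG):
    """Map a lon/lat point to the (col, row) index of its grid cell."""
    return (math.floor(lon / size), math.floor(lat / size))

def stream_grid_counts(events):
    """Fold each (lon, lat) fix into a running count per grid cell.

    Yields the touched cell and its new count after every event, so the
    map can be repainted incrementally instead of after the run ends.
    """
    counts = {}
    for lon, lat in events:
        cell = cell_for(lon, lat)
        counts[cell] = counts.get(cell, 0) + 1
        yield cell, counts[cell]

# A replayed feed standing in for the live one (coordinates are made up):
replayed = [(-73.9857, 40.7484), (-73.9851, 40.7480), (-73.9680, 40.7850)]
for cell, count in stream_grid_counts(replayed):
    print(cell, count)   # in practice, push this update to the map layer
```

The per-cell sum is just the fold function; swapping in a different equation (a mean, or the spatial entropy we'll cover later) changes only what gets accumulated, not the streaming loop itself.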
While the visualization of the data is fun, it really doesn't get to the true potential of having perpetually updating analysis results. The real potential we see is setting a threshold and having the analysis alert you when a result you care about has been reached. Let's say for the NYC Marathon data above we took the aggregation count for each cell and looked at the density of people per square meter over a time period. If the density reaches threshold X, send a promotion to each mobile device; if it reaches Y, send a text message to safety officials that there may be a crowd-control problem. The ability to make analysis immediately actionable opens up whole new opportunities that create value for users. Instead of holding a post mortem on what happened during the event so we can be better prepared next year, we can respond immediately to make it better right now.
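A sketch of what that alerting layer might look like, wired onto the grid counts above. The cell area, both thresholds, and both action hooks are hypothetical placeholders; real values would depend on the grid resolution and the deployment.

```python
CELL_AREA_M2 = 10_000.0    # assumed cell area; depends on the grid resolution
PROMO_THRESHOLD = 0.5      # hypothetical threshold X, people per square meter
SAFETY_THRESHOLD = 2.0     # hypothetical threshold Y

def send_promotion(cell):
    print(f"promo -> devices in cell {cell}")            # placeholder action

def notify_safety_officials(cell, density):
    print(f"ALERT: {density:.2f} p/m^2 in cell {cell}")  # placeholder action

def check_density(cell, count):
    """Turn a per-cell count into a density and fire the matching action."""
    density = count / CELL_AREA_M2
    if density >= SAFETY_THRESHOLD:
        notify_safety_officials(cell, density)
    elif density >= PROMO_THRESHOLD:
        send_promotion(cell)

# Wired into the stream, every update gets checked as it lands:
# for cell, count in stream_grid_counts(replayed):
#     check_density(cell, count)
```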
Welcome to the Esri DC Development Center blog. We write about our work on big data analytics, open platforms, and open data; what's new and exciting at Esri and in the community; and general industry thought leadership and discussions of geospatial data visualization and analysis.
Please explore what we're working on and let us know if you have any questions or ideas!