Statistical Challenges of Data at Scale: Bringing Back the Science
I believe there has always been a bit of a myth in GIS circles that the quickly growing consumer side of maps/data eschews science. Whether you call it Neogeography, the GeoWeb or NoGIS, there is a perception that traditional GIS is where you do hardcore science, and the new stuff is a vacuous world of check-ins and slippy maps. While there is often a little truth in stereotypes, I think there is more fault than truth in this characterization. No doubt many of the early mashups were whimsical and trended toward the trivial. The consumer focus simplified much of what was complex in GIS, and led many traditionalists to view it as a cheap knockoff of their profession.
You can’t always judge a book by its cover, though. While the interface to most social/local/mobile apps is simple, that does not mean there is no science sitting behind it. The number of GeoNerds I’ve seen with a copy of “Multidimensional and Metric Data Structures” is non-trivial. Take the work Schuyler Erle did creating geometric shapes from Flickr photos in his Shape of Alpha project. On the surface it is a playful map of the shape created by photos tagged with a WOE ID for Texas:
Underneath is a good bit of science using alpha shapes, a technique built on Delaunay triangulation. GIS has used Delaunay triangulation specifically and computational geometry generally since its inception, and the purists would say there is nothing new here. I’d argue the point, but what I’m really after here is that this kind of cross-pollination is critical. Both new and old are better off for it.
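To make the idea concrete, here is a minimal brute-force sketch of an alpha shape in Python. It is not the Shape of Alpha code, just an illustration under stated assumptions: the function names, the sample points, and the `max_radius` parameter are all made up for this example, and the all-triples search only makes sense for a handful of points. It finds the Delaunay triangles (those whose circumcircle contains no other point), keeps the ones with a small enough circumradius, and returns the edges that belong to exactly one kept triangle, which form the outline.

```python
from itertools import combinations
from math import hypot

def circumcircle(a, b, c):
    """Return (center_x, center_y, radius) of the circle through a, b, c, or None if collinear."""
    ax, ay = a
    bx, by = b
    cx, cy = c
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ux, uy, hypot(ax - ux, ay - uy)

def alpha_shape_edges(points, max_radius):
    """Boundary edges (as index pairs) of the alpha shape; brute force, small inputs only."""
    kept = []
    for i, j, k in combinations(range(len(points)), 3):
        circle = circumcircle(points[i], points[j], points[k])
        if circle is None:
            continue
        ux, uy, r = circle
        # Delaunay test: no other point lies strictly inside the circumcircle.
        is_delaunay = all(
            hypot(px - ux, py - uy) >= r - 1e-9
            for n, (px, py) in enumerate(points)
            if n not in (i, j, k)
        )
        # Alpha test: discard triangles whose circumcircle is too large.
        if is_delaunay and r <= max_radius:
            kept.append((i, j, k))
    edge_count = {}
    for tri in kept:
        for edge in combinations(tri, 2):
            edge_count[edge] = edge_count.get(edge, 0) + 1
    # Outline edges belong to exactly one kept triangle.
    return sorted(e for e, n in edge_count.items() if n == 1)

# Hypothetical points: a small cluster plus one far-away stray (index 4).
points = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.5), (1.0, 0.5), (10.0, 0.2)]
print(alpha_shape_edges(points, max_radius=2.0))  # [(0, 1), (0, 2), (1, 2)]
```

With a small `max_radius` the stray point is left out of the outline; with a very large one the result relaxes to the convex hull. A real implementation would start from a proper Delaunay triangulation routine rather than testing every triple.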
I first ran into the statistical challenges of data at scale in grad school at GMU. We were doing a lot of mass traceroute analysis to infer router and autonomous system networks, then trying to tease geography out of it. In academic geography there is a tradition of applying graph theory to geographic problems, and many GIS packages have network analysis tools. The problem was that we had millions of links and nodes. It crashed every GIS application we tried. Also, the statistical constructs from the geography literature did not really work for dynamic and evolving networks (i.e., the Internet). We ended up digging into statistical mechanics to find the methods we needed to solve the problem. Physicists have dealt with similarly sized data problems and had a great tool set. It just needed to be tweaked to work for data with geographic dimensions. There has also been good work in GIS applying complexity science concepts like complex adaptive systems to geographic phenomena. Paul Torrens’s work has always been my favorite.
My point in this rambling is that the data volumes emerging from social/local/mobile sources will require some new thinking around the science we bring to bear. I believe it will need to be interdisciplinary. We should look beyond our traditional literature for novel approaches that can be repurposed. We also need to rethink some of the fundamental premises we have around GIS and data. Traditional concepts like error bounds will fundamentally change because data collection is no longer happening on an annual basis, but will occur persistently from millions of globally distributed sensors. Error will be a fluid concept and not a static measure. Metadata also needs to become a fluid concept. The requirement for dedicated GIS metadata librarians with hundreds of metadata elements will not scale.
Most importantly, I think we need to stop thinking of the crowd as volunteers and amateurs. We should think of them as data collection points, just as we do with data arrays. Sometimes our sensors malfunction and send us bad data, but we can use statistics to control for this. This new reality is going to require innovative concepts around not only leveraging the crowd for data, but also using the crowd to ascertain the veracity of data. The crowd needs to be leveraged to verify and update metadata. This has been done with great success for “point of interest” (POI) and road data in the commercial sector by projects from Factual and OpenStreetMap, respectively.
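As one illustration of using statistics to control for bad readings, here is a small Python sketch that flags suspect values with a modified z-score based on the median and the median absolute deviation (MAD). This is just one common robust approach, not something prescribed above; the function name, the readings, and the 3.5 cutoff are assumptions chosen for the example.

```python
from statistics import median

def flag_outliers(readings, threshold=3.5):
    """Return a list of booleans, True where a reading looks like an outlier.

    Uses the modified z-score: 0.6745 * |x - median| / MAD.
    """
    center = median(readings)
    mad = median(abs(x - center) for x in readings)
    if mad == 0:
        # At least half the readings are identical; this sketch flags nothing.
        return [False] * len(readings)
    return [0.6745 * abs(x - center) / mad > threshold for x in readings]

# Hypothetical readings reported by six contributors for the same quantity.
readings = [20.1, 19.8, 20.3, 20.0, 85.0, 19.9]
print(flag_outliers(readings))  # [False, False, False, False, True, False]
```

The median and MAD are used instead of the mean and standard deviation because a single wild reading barely moves them, so the bad value cannot hide itself by inflating the spread.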
Further, the concepts of sample size and margin of error are being turned upside down. Previously a small cadre of highly trained professionals made a small number of highly precise observations, and these were extrapolated to an entire population. Now, sample sizes come close to the size of the actual population but are also incredibly biased (i.e., Twitter provides a massive sample but is biased to only those using Twitter). Novel ways of leveraging SMS and location are being developed by projects like DARPA’s “More Eyes” initiative and the Ushahidi project, but there is much work still to be done. This work, though, needs to be interdisciplinary, recognizing that in a rapidly expanding data ecosystem the best geographic data will often not come through authoritative channels. To build the best science and technology we need to embrace the fact that the future is not going to look like the past, and we are going to have to rethink many fundamentals. At least that is my biased and humble opinion.
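The point about large but biased samples can be sketched with a toy post-stratification example in Python. This is only an illustration: the groups, the values, and the population shares are invented, and the approach assumes you know each group's true share of the population and that every group shows up in the sample.

```python
def poststratified_mean(sample, population_share):
    """Reweight group means by known population shares.

    sample: list of (group, value) pairs.
    population_share: dict mapping group -> share of the real population.
    """
    totals, counts = {}, {}
    for group, value in sample:
        totals[group] = totals.get(group, 0.0) + value
        counts[group] = counts.get(group, 0) + 1
    # Each group's mean counts in proportion to its real-world share.
    return sum(population_share[g] * totals[g] / counts[g] for g in counts)

# Hypothetical sample: group "a" is heavily over-represented (8 of 10 responses).
sample = [("a", 1.0)] * 8 + [("b", 0.0)] * 2
shares = {"a": 0.3, "b": 0.7}  # assumed true population shares

raw_mean = sum(value for _, value in sample) / len(sample)
print(raw_mean)                             # 0.8
print(poststratified_mean(sample, shares))  # 0.3
```

The raw mean reflects who happened to be in the sample, while the reweighted mean reflects the assumed population. Reweighting only corrects for the groups you can identify; it does nothing about bias within a group.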
One Response to Statistical Challenges of Data at Scale: Bringing Back the Science
About Us
Welcome to the GeoIQ blog. We write about features of our GeoIQ analytics engine, what is new and exciting in the GeoCommons community, and general industry thought leadership and discussions of geospatial data visualization and analysis.
Please explore what we're working on and let us know if you have any questions or ideas!
Wow! So I understand that we can quantify inter-dimensional data, but the true problem comes when people try to quantify data based off of opinions. I recently saw a demo at the DC-Tech-Meet-Up of a program that quantifies values, opinions, and feelings in some cases. It was overall a great product, and you can quantify how many people feel a certain way about something by letting users pick from a list, but the data can’t be factual because people might not say how they really feel, which creates a margin of error. Thus the problem lies where data is not factual, and when you are dealing with different dimensions of data, one level of incorrect information creates a domino effect…