In the last blog post we talked about monolithic vs. distributed architectures. A key part of the argument is recognizing the massive value in the GIS monolith while also realizing it is just one component of a larger whole. The new reality is a data ecosystem with many components and users. That said, one of the big factors that will drive change is the pair of challenges, scope and scale, that GIS will face in a new and bigger data world. The challenge of scope is simple: there is too much data covering too many topics for GIS professionals to have enough bodies and subject matter expertise to cover it all. The challenge of scale is the ability of GIS professionals and their technology architectures to deal with the volume and speed of data creation.

Problems of Scope

The scope problem is really driven by the GIS dogma of authoritative data: data either needs to be created by a GIS professional or validated by one. This means you need a human in the loop for all data flowing through the system. The breadth of subject matter expertise GIS professionals would need to cover all the emerging varieties of data makes that model fundamentally hard to sustain. This does not mean that data will not require quality assurance; it is a question of who is best placed to determine quality.

As location data comes from an increasing variety of devices and contributors, the GIS professional will frequently not be in a position to determine the accuracy and veracity of the data. Just consider the range of expertise needed to cover only the data we see today: anthropology, sociology, economics, political science, social media, disaster response, etc. Is the disaster response professional on the ground in a better position to determine the quality of data being reported by citizens, or are the GIS professionals back at headquarters? This does not mean that GIS professionals don’t play a critical role, but it should not be a monopoly.

Should an organization be dependent on having geography/geomatics academic departments generate GIS curricula to create a new generation of social media analysts before responding to the pressing need to analyze a new source of data? The inherent problem in having the discipline of geography create a specialty practice for every aspect of science that has a geographic/location component has long been recognized as “the recurring identity crisis that plagues modern geography and its practitioners” (Tuason 1987).

The Problem of Human Scale

This is where the problem of scale becomes potentially insurmountable. As the volume of location-enabled data increases at an exponential rate, the real question is how the number of GIS professionals can scale to keep pace with the speed and volume of new data that must be verified. The structure of GIS as a technology and a profession was not built to handle massive volumes of external data. Because of its monolithic structure, data was always envisioned to be generated by professionals solely within the GIS workflow. Now, Twitter alone is generating millions of location-enabled messages per day. Simply put, there are not enough trained professionals to verify each new piece of data even if they did have the tools. It is a supply and demand problem. The demand created by the data being generated has far outstripped the supply of trained professionals to verify it, and adapting will require a new paradigm. This is not to say the concept of verified and unverified data is not critical to effective operations. It is saying that in order to keep up with the rapidly growing volume of data, verification and validation cannot continue to depend purely on trained human professionals doing the work by hand or with current tools. Innovation, automation, statistical inference and the use of crowdsourcing to empower verification and validation of data are greatly needed.
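To make the idea of automated, crowd-assisted verification a little more concrete, here is a minimal sketch in Python. It is purely illustrative and not a method described in this post: it assumes each report is a hypothetical `(source, lon, lat)` tuple, and it treats a report as corroborated when enough other, independent sources have reported something nearby. The function names, the 1 km radius and the two-source threshold are all assumptions chosen for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, good enough for a rough check


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in kilometres between two lon/lat points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def corroborated(reports, radius_km=1.0, min_sources=2):
    """Flag each (source, lon, lat) report that is backed by other sources.

    A report counts as corroborated when at least `min_sources` distinct
    sources, other than its own, reported within `radius_km` of it.
    This is an O(n^2) scan, fine for a sketch but not for millions of rows.
    """
    flags = []
    for source, lon, lat in reports:
        nearby_sources = {
            other_source
            for other_source, other_lon, other_lat in reports
            if other_source != source
            and haversine_km(lon, lat, other_lon, other_lat) <= radius_km
        }
        flags.append(len(nearby_sources) >= min_sources)
    return flags


reports = [
    ("a", 0.0, 0.0),
    ("b", 0.001, 0.0),
    ("c", 0.0, 0.001),
    ("a", 0.0005, 0.0),   # same source as the first report
    ("d", 10.0, 10.0),    # isolated report, nobody else nearby
]
flags = corroborated(reports)
```

In this toy data the four clustered reports back each other up while the isolated one stays unverified. A real system would likely need a spatial index, a time window and some notion of source reliability, but the point is that a first pass like this needs no human in the loop.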

Limitations in GIS Technology Scaling

In addition to the problem of scaling the number of professionals to meet demand, there is also a problem with computational scaling. As the amount of data has increased, so has the demand for greater computational power. In the architecture of traditional GIS technology, additional computational power depends on processor (CPU) speed. A GIS application is only as powerful as the CPU of the desktop machine or the single server doing the processing. The problem with this is that around 2004 the industry started to hit what is called “Peak MHz”:

That’s the point when processor speed effectively peaked as chip manufacturers began competing along other dimensions. Those other dimensions–energy efficiency, size and cost–are driving ubiquitous computing, as their chips become more efficient, smaller and cheaper, thus making them increasingly easier to include into everyday objects (Kuniavsky 2010).

Figure: Peak CPU MHz Processing Power Achieved by Year

The application of this trend to GIS was called out by Mike Migurski in a recent talk: “The practical effect of this (“Peak MHz”) for GIS was that you could no longer rely on next year’s computers making your work faster by default, and for developers it became necessary to respond to the change by modifying development tactics” (Migurski 2011). One response was using cheap storage to pre-compute tiles for large data sources, resulting in the modern “slippy” map made popular by Google Maps. A second adaptation has been distributing computation across multiple processors and using NoSQL[1] data stores. The architecture of most GIS applications prevents them from distributing computation across multiple servers. The amount of processing power that can be thrown at any computation is limited by the capacity of a single server's processor. So, while it is straightforward to deploy a GIS server to a cloud environment like Amazon, it is not possible to distribute processing across multiple servers in the cloud, which is the primary advantage of a cloud architecture.
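Both adaptations can be illustrated with a short Python sketch. It is a hedged example, not code from any particular GIS product. `tile_index` uses the widely documented Web Mercator tile-naming formula behind slippy maps. `tile_counts` then splits a list of `(lon, lat)` points into chunks, counts points per tile in each chunk independently, and merges the partial results. The function names and the pluggable `map_fn` parameter are assumptions made for illustration. By default the built-in `map` runs the chunks one after another, but because each chunk is processed independently, the `map` method of a process pool or cluster executor could be passed in instead.

```python
import math
from collections import Counter
from functools import partial


def tile_index(lon, lat, zoom):
    """Return the (x, y) slippy-map tile that contains a WGS84 lon/lat point."""
    n = 2 ** zoom  # number of tiles along each axis at this zoom level
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so lon = 180 and latitudes beyond the Mercator limit stay in the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def count_chunk(chunk, zoom):
    """Map step: count the points in one chunk by tile."""
    return Counter(tile_index(lon, lat, zoom) for lon, lat in chunk)


def tile_counts(points, zoom, n_chunks=4, map_fn=map):
    """Count points per tile by splitting the work into independent chunks.

    `map_fn` defaults to the serial built-in `map`; a parallel map with the
    same calling convention can be substituted without changing this code.
    """
    chunks = [points[i::n_chunks] for i in range(n_chunks)]
    partial_counts = map_fn(partial(count_chunk, zoom=zoom), chunks)
    total = Counter()
    for counts in partial_counts:  # reduce step: merge the partial results
        total.update(counts)
    return total


points = [(0.0, 0.0), (10.0, 10.0), (12.0, 11.0), (-10.0, -10.0)]
counts = tile_counts(points, zoom=1, n_chunks=2)
```

The design choice worth noticing is that no chunk needs to know about any other chunk until the final merge. That independence is what lets the same computation spread over many processors or servers, and it is what a single-server GIS architecture cannot exploit.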

The problems of scope and scale really come down to culture and technology. The culture of GIS needs to adapt to an emerging new reality while technology changes the fundamentals of how we do computation. It is easy to see this post and others as an attack on GIS professionals and the industry at large. My intent, though, is to raise awareness about the areas where the industry needs to grow and evolve to become relevant to a much larger market. I see huge potential for all the people who have dedicated their work lives to understanding geography and technology. That said, as long as we let vested interests throttle the evolution of geospatial technologies for parochial reasons, we could well miss the opportunities in front of us. In the next post we’ll talk about the potential to use science to solve the scope and scale problems of validating big and dispersed data.

[1] NoSQL is a movement in database management approaches that ditched the classic relational database approach for new innovations that strived to create distributed datastores (minus ACID).  This included techniques like key-value stores, document databases, and graph databases (Wikipedia 2011).

 

5 Responses to The Challenges of Scope and Scale that Face GIS

  1. MicahWilli says:

    I LOVE this Post. Even though geocommons is a for-profit company you’re touching on a whole host of industry-wide topics. Adaptability and remaining relevant is something that we cannot ignore as GIS Pros.

  2. GIS professional says:

    As Jeff Jonas (IBM) has said, “Big data new physics”. One vital component to data generation is quality. Most veteran GIS professionals have always used authoritative data that’s been collected by trained professionals. Since the paradigm shift towards exploiting ubiquitous location data, it is somewhat expected to see some trepidation from the geospatial community. Although my impression from the GIS community is there exists a level of ignorance regarding the source of this new data. If these distinctions were better defined I think you would see an acceptance shift in the industry. What GIS professionals typically want to know is the method in which the data was collected. Volunteered Geographic Information (VGI) needs to be classified. Lumping foundation data (OSM, Wikimapia) and activity data (Twitter, Flickr, etc.) together and calling it the same VGI is a mistake. I take more credence in activity based location data because the data provider is typically helpless to improve the accuracy of their offered spatial data. On the other hand, the majority of volunteers who create foundation data have full control over quality but lack the training to recognize and control it.

    I must agree that there is a definite level of apprehension among seasoned GIS professionals to use VGI data. Although I imagine over time people will eventually realize it is an excellent source for information.

  3. How true… it seems like three directions are emerging:
    1) Oracle Spatial massive parallelisation in their Exadata machines on Sun boxes
    2) SAP uses virtualisation throwing RAM at it and also parallelising but in memory
    3) noSQL and Google/Amazon etc. hashing in post-RDBMS realm across the net
    I can just see Sun Micro’s Gage and McNealy smiling as their prescient “the network is the computer” came to be, only morphed into “the internet is the computer”

  4. gogeo says:

    Good article, it certainly makes you think and reminds me of a presentation I listened to about 10 years ago that asked “did geography have a rotten core?” The conclusion was no, but new technologies (GIS/Remote Sensing) were degrading traditional geography. You could see this going the same way with traditional GIS on one side and the new VGI on the other.

    I agree with the comment above that VGI should separate things such as OSM and flickr/facebook/twitter as they represent a completely different kind of volunteered data. Perhaps the OSM’s should be considered GIS enthusiasts or the data they create “Informed VGI”?

    I didn’t agree that GIS/Remote Sensing were degrading geography and see the new technologies and VGI as an opportunity btw……

  5. [...] been unleashed upon GIS software and GIS users. Slashgeo’s take on a recent GeoIQ blog about “The Challenges of Scope and Scale that Face GIS” highlighted the data explosion (his workplace alone pumps out 450GB per week of new public [...]
