The Wikipedia for Data Meme and Federating the Data Commons
Bret Taylor had a great blog post today on the need for a Wikipedia for data. In short, Bret feels that the difficulty of accessing data for web development projects seriously hinders innovation. The post resonated with a lot of people: it is already up to 51 comments and has spawned several related posts, including ReadWriteWeb’s excellent list of web applications offering open source data and related services.
We’ve spent a good bit of time working on the geospatial side of producing a large repository of open data, specifically on handling structured quantitative data. I thought this might be a good time to talk a bit about the challenges we ran into and possible solutions to the bigger problem Bret poses in his blog post.
When we launched GeoCommons just short of a year ago, we had grand ambitions of creating a community around geospatial data and map making. As with most first attempts, we got overly ambitious, tried to pull off too many features in the first go, and ran into a serious case of overall mediocrity.
That aside, the big technical hurdle that eventually led us to take GeoCommons down was a data scaling issue. The constant reading from and writing to the database caused performance to drag as the database got bigger and bigger. In the first go we used MySQL and hit a wall at about one billion data entities, and then hit another with PostGIS when we got up to around five billion.
This problem was further complicated by the size of the files we were working with, which were often over 10 MB. Geospatial data files tend to be large and can become quite monstrous. As a result, we had upload and download times that far exceeded the patience of most general web users. Executing operations (mathematical, query, mining, etc.) on data sets that large, and producing visual results in an acceptable time, created yet another challenge.
In short, when Bret said, “DataWiki seems like an extremely hard problem,” he was correct, and we were tackling just one data niche. The good news is that the problems are solvable given enough time and resources, and we were able to find solutions for all the barriers we ran into with the first attempt at GeoCommons.
While this is good news for us, I think it illustrates a bigger point: it is unlikely that one company is going to be able to solve the whole DataWiki problem. The list of current open data projects in the ReadWriteWeb post supports this hunch. Even a project as comprehensive and as well funded as Freebase is “better equipped to handle conceptual rather than statistical information on topics”.
Given the vast amount of data out there and the limited resources any one company has, it seems that connecting multiple repositories through a federation approach would be a clever way to go forward. This of course opens up an entire can of worms on best approaches and possible standards.
There are some important prerequisites, like establishing metadata and schemas and agreeing on common formats, but I think a practical real-world implementation will get us a long way. We’ve been talking with some folks in the GeoWeb space about doing some federation test beds with GeoCommons, and I believe there are some interesting approaches to look at.
I’m curious what other folks out there think about the possibility of federation, and who is open to doing test bed implementations. I believe that realizing the potential of a DataWiki, and of a semantic web driven by data, is going to require interconnecting all the new data resources coming online instead of leaving them as the current islands of innovation.
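To make the prerequisites a little more concrete, here is a minimal sketch of what federation over a shared metadata schema could look like. Everything in it is an assumption for illustration: the `DatasetRecord` fields, the repository names, and the in-memory "repositories" are hypothetical stand-ins, not how GeoCommons or any other project actually works. The point is only that once each repository can describe its datasets in the same small schema, searching across all of them becomes a simple merge.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass(frozen=True)
class DatasetRecord:
    # Hypothetical shared metadata schema that every repository agrees to emit.
    repository: str
    identifier: str
    title: str
    data_format: str
    keywords: tuple = ()

# A repository is modeled as a function: keyword in, matching records out.
Searcher = Callable[[str], Iterable[DatasetRecord]]

def make_in_memory_repo(name: str, rows: List[dict]) -> Searcher:
    """Build a stand-in repository from plain dicts (illustration only)."""
    records = [
        DatasetRecord(
            repository=name,
            identifier=row["id"],
            title=row["title"],
            data_format=row["format"],
            keywords=tuple(k.lower() for k in row.get("keywords", ())),
        )
        for row in rows
    ]

    def search(keyword: str) -> List[DatasetRecord]:
        kw = keyword.lower()
        # Match on a substring of the title or an exact keyword.
        return [r for r in records if kw in r.title.lower() or kw in r.keywords]

    return search

def federated_search(keyword: str, repos: Iterable[Searcher]) -> List[DatasetRecord]:
    """Ask every repository the same question and merge the answers."""
    results: List[DatasetRecord] = []
    for search in repos:
        results.extend(search(keyword))
    # Sort by title, then repository, so the merged order is predictable.
    return sorted(results, key=lambda r: (r.title.lower(), r.repository))

# Two hypothetical repositories with different specialties.
geo_repo = make_in_memory_repo("geo-repo", [
    {"id": "g1", "title": "Census Tract Boundaries", "format": "shapefile",
     "keywords": ["census", "boundaries"]},
    {"id": "g2", "title": "Coffee Shop Locations", "format": "kml"},
])
stats_repo = make_in_memory_repo("stats-repo", [
    {"id": "s1", "title": "Population by County", "format": "csv",
     "keywords": ["census", "demographics"]},
])

hits = federated_search("census", [geo_repo, stats_repo])
for hit in hits:
    print(hit.repository, hit.identifier, hit.title, hit.data_format)
```

In a real test bed each `Searcher` would wrap a network call to a different repository, and the hard part would be agreeing on the schema and formats rather than writing the merge.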
4 Responses to The Wikipedia for Data Meme and Federating the Data Commons
How is this federation concept different from the concept of linked data?
http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/
Hi Sean,
Thanks for the pointer. I think conceptually there is a lot in common with what is proposed in the link. What I am less sure of is whether the full stack of W3C semantic web standards is the best way to go about it. They bring a lot of robustness, but they also bring a lot of overhead. We’ve talked with the Mapufacture folks about possibly using Atom as a testbed with them, but I think it will take working through some implementation approaches before we get a better sense of what works well. I do think whatever approach we try should be able to interoperate with existing standards like W3C, Dublin Core, OGC, etc., but I’m not sure yet what the best approach to solving the problem is. I do think it will need to be a community solution based on existing standards, not a proprietary one.
Yes, there’s growing interest in using Atom to do linked data without RDF. I’m part of an NEH-funded project to do this among digital humanities projects at UNC, NYU, and King’s College. ORE-Atom is another example: http://www.openarchives.org/ore/0.3/atom-implementation.
Cool, thanks for the link. It would be great to get your feedback and ideas as the implementation is fleshed out. Hopefully we can add something to the cause. Maybe there is a way to link the data and ideas from the two projects together? I’m meeting with Andrew on Thursday, so I will have more substance then.