Over the President’s Day weekend, Code for America held a DataCamp at Big Windows Labs in DC. There was a great, diverse group of participants who came up with lots of innovative ideas for remixing government data and creating cool apps.

Travis Pinney, Kate Chapman and I decided to work on a project to structure the DC health inspection data and fuse it with Yelp restaurant reviews. We were curious about the relationship between restaurants’ health inspection ratings and their popularity. DC is deservedly famous for being a leading innovator in making government data available to the public. Restaurant health inspections are no different, and there is a clean interface for browsing the full inspection for each food establishment. The problem is that the data is in HTML forms or, worse, PDF. Travis did some brilliant scripting to scrape the data out of the HTML forms. Unfortunately this means we could not get the PDF data, but on the upside the majority was in the HTML forms. I experimented with doing the same in Needlebase, which was fun to learn, but I couldn’t keep up with Travis’s skills on some of the edge cases in the data. Next, Travis used the Yelp API to grab the user ratings for restaurants in DC. The good news is we got a merged list of 515 restaurants with both health inspection and Yelp ratings. The bad news is that there are 2500+ health reports and 3700+ Yelp-reviewed restaurants. So we missed a lot, but we were also pretty conservative in joining the data.

Without getting into the gory details, joining locations can be an ugly affair. You can have multiple locations at the same address (e.g. an airport) and multiple restaurants with the same name, so a unique ID shared between disparate data sets is quite elusive. There are a good number of tricks you can use, but the overall problem is difficult. It will be interesting to see how startups like Factual, SimpleGeo and Cloudmade tackle the problem as they build out large POI repositories and look to cross-link data. Pivoting on location (lat/long) seems like a good solution, and I’m sure we’ll start seeing location-based place graphs in the near future.
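To make the idea of a conservative join concrete, here is a minimal Python sketch. It is not the script Travis wrote; the record layout (dicts with `name` and `address` fields) and the helper names `normalize` and `conservative_join` are assumptions made up for illustration. Two records are paired only when their normalized name and address agree and that key is unique in both data sets, so ambiguous cases such as several restaurants at one airport address are dropped rather than guessed at.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return " ".join(text.split())


def conservative_join(inspections, yelp):
    """Pair records whose (name, address) key is unique in both data sets.

    Both arguments are lists of dicts with hypothetical "name" and
    "address" fields. Keys that appear more than once on either side
    are skipped, which loses matches but avoids wrong ones.
    """
    def index(records):
        buckets = defaultdict(list)
        for rec in records:
            key = (normalize(rec["name"]), normalize(rec["address"]))
            buckets[key].append(rec)
        return buckets

    left, right = index(inspections), index(yelp)
    matches = []
    for key, recs in left.items():
        candidates = right.get(key, [])
        if len(recs) == 1 and len(candidates) == 1:
            matches.append((recs[0], candidates[0]))
    return matches
```

Exact matching on a cleaned-up key is about the strictest rule you can pick, which is one plausible reason a join like this ends up with far fewer rows than either source. Fuzzy string matching or a distance threshold on lat/long would recover more pairs at the cost of some false matches.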

Back to our data experiment: while we did not get all the data merged, 515 restaurants is a pretty good sample for looking at potential trends. Once Travis had done all the hard work, I uploaded the merged data set to GeoCommons to geocode it. You can find the raw data here, available in a variety of formats to remix yourself. Next I plotted the points in GeoCommons and ran an analysis correlating the health inspection risk score with the Yelp ratings. I wanted to see if health inspection scores were a good predictor of popularity on Yelp. If they were, I should see a strong negative correlation, i.e. low risk scores going along with high Yelp ratings.
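For anyone who wants to run the same kind of check outside GeoCommons, here is a small Python sketch of a Pearson correlation and the share of variation it explains (r squared). I don’t know exactly how GeoCommons computes its correlation, so treat this as the textbook version rather than a reproduction of that analysis; the function names and the numbers in the example are made up for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


def variance_explained(xs, ys):
    """Fraction of the variation in ys explained by a straight-line fit on xs."""
    return pearson_r(xs, ys) ** 2


# Illustrative numbers only, not the DC data.
risk_scores = [1, 2, 3, 4, 5]
yelp_ratings = [4.5, 4.0, 4.5, 3.5, 4.0]
r = pearson_r(risk_scores, yelp_ratings)
print(f"r = {r:.2f}, variance explained = {variance_explained(risk_scores, yelp_ratings):.0%}")
```

A negative r means ratings tend to fall as risk scores rise, and squaring it gives the fraction of the variation in ratings that a straight-line relationship with risk score accounts for.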

The correlation did come back negative, but only 2% of the variation in Yelp ratings was explained by health inspection risk scores. You can see the results on the map below. While the correlation was not meaningful, if you click the gray triangle in the layers palette for the correlation you will see a scatter plot. Click the circles in the upper right-hand quadrant to see restaurants that are popular on Yelp but have high health risk scores. Is your favorite restaurant throwing some serious health violations?

View full map

You can play around by clicking on any of the quadrants to see patterns of health risk vs. popularity across DC. If you would like to see the raw data for the correlation analysis, it is available here. Feel free to remix the data and come up with another perspective, like animating it over time. DataCamp was awesome and the project was a fun one. Many thanks to Code for America for setting it up and Big Window Labs for hosting us.

 
