Slate.com published a fun article and set of maps about the languages spoken in the U.S., other than English and Spanish.
One of the maps struck me as somewhat surprising:
Is New York really the only state where Chinese is the most spoken language after English and Spanish? And why did no African languages make it onto the map?
Being the nerd I am, I looked up the original data from the American Community Survey (the data source referred to in the original article) using the Census Bureau’s American FactFinder. And it would indeed seem that the data on the map is (partially) wrong, or at least it doesn’t match the data I could find.
The table below has the correct most-spoken non-English, non-Spanish language (or group of languages) for each state, with the ones that were wrong in the original map highlighted:
| State | Most-spoken language other than English and Spanish |
|---|---|
| Alaska | Other Native North American languages |
| Hawaii | Other Pacific Island languages |
| Louisiana | French (incl. Patois, Cajun) |
| Maine | French (incl. Patois, Cajun) |
| Massachusetts | Portuguese or Portuguese Creole |
| Montana | Other Native North American languages |
| New Hampshire | French (incl. Patois, Cajun) |
| Rhode Island | Portuguese or Portuguese Creole |
| South Dakota | Other Native North American languages |
| Utah | Other Pacific Island languages |
| Vermont | French (incl. Patois, Cajun) |
| West Virginia | German/French (exact same number of speakers) |
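For anyone who wants to reproduce a table like this, here is a minimal Python sketch of the selection step. It assumes the data has already been flattened into (state, language, total speakers) rows. The row layout, the state names, the excluded category labels and all the numbers are invented for illustration and are not real ACS estimates. Ties are kept rather than broken arbitrarily, which is how a case like West Virginia would show up.

```python
from collections import defaultdict

# Hypothetical rows: (state, language category, total speakers).
# The figures are invented for illustration, not real ACS estimates.
ROWS = [
    ("Examplestate", "Spanish or Spanish Creole", 5000),
    ("Examplestate", "German", 1200),
    ("Examplestate", "French (incl. Patois, Cajun)", 1200),
    ("Examplestate", "Chinese", 900),
    ("Otherstate", "Spanish or Spanish Creole", 8000),
    ("Otherstate", "Chinese", 3000),
    ("Otherstate", "German", 700),
]

# Categories to leave out of the ranking (labels are assumptions).
EXCLUDED = {"Speak only English", "Spanish or Spanish Creole"}


def top_languages(rows, excluded=EXCLUDED):
    """Return {state: [languages tied for the most total speakers]}."""
    by_state = defaultdict(dict)
    for state, language, speakers in rows:
        if language not in excluded:
            by_state[state][language] = speakers

    result = {}
    for state, counts in by_state.items():
        best = max(counts.values())
        # Keep every language that reaches the maximum, so ties stay visible.
        result[state] = sorted(
            lang for lang, n in counts.items() if n == best
        )
    return result


if __name__ == "__main__":
    for state, languages in sorted(top_languages(ROWS).items()):
        print(state, "/".join(languages))
```

Returning a list per state instead of a single name is a deliberate choice here: it avoids silently picking one language when two categories have identical counts.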
What could explain the errors? For starters, I’m probably using a slightly different data set from the original author, as I couldn’t find a data set that broke down the “Other” categories at the same level of detail as the Slate article. (I’m using the data set “LANGUAGE SPOKEN AT HOME BY ABILITY TO SPEAK ENGLISH FOR THE POPULATION 5 YEARS AND OVER, 2008-2012 American Community Survey 5-Year Estimates”, which should be the most reliable current data available on the FactFinder web site.) So if the original article is using older but more detailed data, e.g. from 2005–07, that could explain at least some of the difference.
Another plausible scenario is that Slate used the wrong data column in the same or a similar data set. The data I used includes three values for each language: the total number of speakers, those who “speak English ‘very well’”, and those who “speak English less than ‘very well’”. From a quick glance at the data, the original map seems to show the language with the most speakers in the “very well” column rather than the most speakers in total, but I didn’t test this hypothesis thoroughly.
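The wrong-column hypothesis could be tested more thoroughly along these lines. This is only a sketch: the `Row` type, its field names and the sample figures are all hypothetical stand-ins for a cleaned-up version of the data, not real ACS numbers. The idea is to rank languages per state once by total speakers and once by the “very well” column, then list the states where the two rankings disagree and compare those against the map.

```python
from typing import NamedTuple


class Row(NamedTuple):
    # Hypothetical layout for one cleaned-up record.
    state: str
    language: str
    total: int        # all speakers of the language
    very_well: int    # speakers who also speak English "very well"


# Invented figures for illustration only.
ROWS = [
    Row("Examplestate", "Chinese", 1000, 400),
    Row("Examplestate", "German", 800, 700),
    Row("Otherstate", "French", 500, 450),
    Row("Otherstate", "Vietnamese", 300, 100),
]


def top_by(rows, field):
    """Return {state: language} with the largest value in `field`."""
    best = {}
    for row in rows:
        current = best.get(row.state)
        if current is None or getattr(row, field) > getattr(current, field):
            best[row.state] = row
    return {state: row.language for state, row in best.items()}


def disagreements(rows):
    """States where ranking by 'very_well' gives a different winner."""
    by_total = top_by(rows, "total")
    by_very_well = top_by(rows, "very_well")
    return {
        state: (by_total[state], by_very_well[state])
        for state in by_total
        if by_total[state] != by_very_well[state]
    }


if __name__ == "__main__":
    for state, (total, very_well) in disagreements(ROWS).items():
        print(f"{state}: total says {total}, 'very well' says {very_well}")
```

If the states flagged this way matched the states where the map differs from the table above, that would support the hypothesis. A mismatch would point back toward a different underlying data set.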
Whatever the problem here, I can’t really blame the original author. The Census Bureau’s several websites are awfully difficult to use, the categorizations used are confusing and the data formats are a mess. It was hard work to simply get the data for all the states and clean it up into a usable format. (Now that I’ve done the job once, you can download the data here in a more user-friendly format if you want to play with it.)
This unfortunately seems to be typical of a lot of open government data all around the world. A few magnificent exceptions aside, too much of the world’s open data is in an obscure or messy data format, hidden behind a crappy interface, accessible only to the most dedicated of hacks and wonks. As happy as I am for Gapminder, Google Public Data, and the like, I would rather see governments themselves clean up their act and start thinking seriously about how Joe Public can actually access their data. It isn’t enough that the data exists somewhere in some format. It needs to be accessible to regular people.