Slate’s language map and messy census data

Slate.com published a fun article and set of maps about the languages spoken in the U.S., other than English and Spanish.

One of the maps struck me as somewhat surprising:

Is New York really the only state where Chinese is the most spoken language after English and Spanish? And why did no African languages make it onto the map?

Being the nerd I am, I looked up the original data from the American Community Survey (the data source referred to in the original article) using the Census Bureau’s American FactFinder. And it would indeed seem that the data on the map is (partially) wrong, or at least it doesn’t match the data I could find.

The table below has the correct most-spoken non-English, non-Spanish language (or group of languages) for each state, with the ones that were wrong in the original map highlighted:

Alabama	German
Alaska	Other Native North American languages
Arizona	Navajo
Arkansas	German
California	Chinese
Colorado	German
Connecticut	Polish
Delaware	Chinese
Florida	French Creole
Georgia	Korean
Hawaii	Other Pacific Island languages
Idaho	German
Illinois	Polish
Indiana	German
Iowa	German
Kansas	German
Kentucky	German
Louisiana	French (incl. Patois, Cajun)
Maine	French (incl. Patois, Cajun)
Maryland	African languages
Massachusetts	Portuguese or Portuguese Creole
Michigan	Arabic
Minnesota	African languages
Mississippi	Vietnamese
Missouri	German
Montana	Other Native North American languages
Nebraska	Vietnamese
Nevada	Tagalog
New Hampshire	French (incl. Patois, Cajun)
New Jersey	Chinese
New Mexico	Navajo
New York	Chinese
North Carolina	Chinese
North Dakota	German
Ohio	German
Oklahoma	Vietnamese
Oregon	Chinese
Pennsylvania	Chinese
Rhode Island	Portuguese or Portuguese Creole
South Carolina	German
South Dakota	Other Native North American languages
Tennessee	German
Texas	Vietnamese
Utah	Other Pacific Island languages
Vermont	French (incl. Patois, Cajun)
Virginia	Korean
Washington	Chinese
West Virginia	German/French (exact same number of speakers)
Wisconsin	Hmong
Wyoming	German

What could explain the errors? For starters, I’m probably using at least a slightly different data set from the original author, as I couldn’t find a data set that had the “Other” categories broken down at the same level of detail as in the Slate article. (I’m using the data set “LANGUAGE SPOKEN AT HOME BY ABILITY TO SPEAK ENGLISH FOR THE POPULATION 5 YEARS AND OVER, 2008-2012 American Community Survey 5-Year Estimates”, which should be the most reliable current data available on the FactFinder web site.) So if the original article is using older but more detailed data, e.g. from 2005–07, that could explain at least some of the difference.

Another plausible scenario is that Slate used the wrong data column in the same or a similar data set. The data I used includes three values for each language: the total number of speakers, those who “speak English ‘very well’”, and those who “speak English less than ‘very well’”. From a quick glance at the data, it seems to me that the original map actually shows the language with the largest number of speakers who speak English “very well”, rather than the largest total number of speakers, but I didn’t test this hypothesis thoroughly.
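For anyone who wants to test that hypothesis properly, here is a minimal Python sketch of the comparison. It picks the top language per state twice, once by total speakers and once by the “very well” column, and lists the states where the two answers differ. The row layout, the function name `top_language_by` and all the numbers are made up for illustration; they are not American Community Survey figures, and the real export would need to be cleaned into this shape first.

```python
from typing import Dict, List, Tuple

# Each row: (state, language, total speakers, speakers who speak English "very well").
# ASSUMPTION: this layout and every number below are invented for illustration only;
# they are NOT real American Community Survey figures.
Row = Tuple[str, str, int, int]

SAMPLE_ROWS: List[Row] = [
    ("State A", "Language X", 1000, 400),
    ("State A", "Language Y", 900, 800),
    ("State B", "Language X", 500, 450),
    ("State B", "Language Y", 300, 100),
]

TOTAL_COLUMN = 2      # index of the total-speakers value in a row
VERY_WELL_COLUMN = 3  # index of the "very well" value in a row


def top_language_by(rows: List[Row], column: int) -> Dict[str, str]:
    """Return {state: language} for the language with the largest value in `column`.

    On a tie, the language that appears first in `rows` is kept.
    """
    best: Dict[str, Tuple[str, int]] = {}
    for row in rows:
        state, language, value = row[0], row[1], row[column]
        if state not in best or value > best[state][1]:
            best[state] = (language, value)
    return {state: language for state, (language, _) in best.items()}


by_total = top_language_by(SAMPLE_ROWS, TOTAL_COLUMN)
by_very_well = top_language_by(SAMPLE_ROWS, VERY_WELL_COLUMN)

# States where ranking by the "very well" column gives a different answer
# than ranking by total speakers.
mismatches = sorted(s for s in by_total if by_total[s] != by_very_well[s])
print(mismatches)
```

If the states in `mismatches` turned out to be the same ones where the Slate map disagrees with my table, that would support the wrong-column explanation. If not, the difference in data sets is the more likely culprit.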

Whatever the problem here, I can’t really blame the original author. The Census Bureau’s several websites are awfully difficult to use, the categorizations used are confusing and the data formats are a mess. It was hard work to simply get the data for all the states and clean it up into a usable format. (Now that I’ve done the job once, you can download the data here in a more user-friendly format if you want to play with it.)

This seems to be unfortunately typical of a lot of open government data all around the world. A few magnificent exceptions aside, too much of the world’s open data is in an obscure or messy data format, hidden behind a crappy interface, accessible only to the most dedicated of hacks and wonks. As happy as I am for Gapminder, Google Public Data, and the like, I would rather see governments themselves clean up their act and start thinking seriously about how Joe Public can actually access their data. It isn’t enough that the data exists somewhere in some format. It needs to be accessible to regular people.
