Slate’s language map and messy census data

Slate published a fun article and set of maps about the languages spoken in the U.S. other than English and Spanish.

One of the maps struck me as somewhat surprising:



Is New York really the only state where Chinese is the most spoken language after English and Spanish? And why did no African languages make it onto the map?

Being the nerd I am, I looked up the original data from the American Community Survey (the data source referred to in the original article) using the Census Bureau’s American FactFinder. And it would indeed seem that the data on the map is (partially) wrong – or at least it doesn’t match the data I could find.

The table below has the correct most-spoken non-English, non-Spanish language (or group of languages) for each state, with the ones that were wrong in the original map highlighted:

State           Language
Alabama         German
Alaska          Other Native North American languages
Arizona         Navajo
Arkansas        German
California      Chinese
Colorado        German
Connecticut     Polish
Delaware        Chinese
Florida         French Creole
Georgia         Korean
Hawaii          Other Pacific Island languages
Idaho           German
Illinois        Polish
Indiana         German
Iowa            German
Kansas          German
Kentucky        German
Louisiana       French (incl. Patois, Cajun)
Maine           French (incl. Patois, Cajun)
Maryland        African languages
Massachusetts   Portuguese or Portuguese Creole
Michigan        Arabic
Minnesota       African languages
Mississippi     Vietnamese
Missouri        German
Montana         Other Native North American languages
Nebraska        Vietnamese
Nevada          Tagalog
New Hampshire   French (incl. Patois, Cajun)
New Jersey      Chinese
New Mexico      Navajo
New York        Chinese
North Carolina  Chinese
North Dakota    German
Ohio            German
Oklahoma        Vietnamese
Oregon          Chinese
Pennsylvania    Chinese
Rhode Island    Portuguese or Portuguese Creole
South Carolina  German
South Dakota    Other Native North American languages
Tennessee       German
Texas           Vietnamese
Utah            Other Pacific Island languages
Vermont         French (incl. Patois, Cajun)
Virginia        Korean
Washington      Chinese
West Virginia   German/French (exact same number of speakers)
Wisconsin       Hmong
Wyoming         German

What could explain the errors? For starters, I’m probably using at least a slightly different data set from the original author, as I couldn’t find a data set that had the “Other” categories broken down to the same level of detail as in the Slate article. (I’m using the data set “LANGUAGE SPOKEN AT HOME BY ABILITY TO SPEAK ENGLISH FOR THE POPULATION 5 YEARS AND OVER, 2008-2012 American Community Survey 5-Year Estimates”, which should be the most reliable current data available on the FactFinder web site.) So if the original article is using older but more detailed data, e.g. from 2005–07, that could explain at least some of the difference.

Another plausible scenario is that Slate used the wrong data column in the same or a similar data set. The data I used includes three values for each language: the total number of speakers, those who “speak English ‘very well’”, and those who “speak English less than ‘very well’”. At a quick glance, it seems to me that the original map actually shows the language with the largest number of speakers who speak English “very well”, not the largest total number of speakers, but I didn’t test this hypothesis thoroughly.
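For anyone who wants to test the wrong-column hypothesis properly, here is a minimal sketch in Python. It assumes the cleaned-up data has one row per state and language with a "total" and a "very_well" count; those field names, the function names and the numbers in ROWS are all made up for illustration and are not taken from the ACS tables. The idea is simply to pick the top language per state once by each column and list the states where the two answers differ.

```python
# Hypothetical sample rows; the figures are invented, NOT real ACS data.
ROWS = [
    {"state": "Examplestate", "language": "Spanish", "total": 9000, "very_well": 5000},
    {"state": "Examplestate", "language": "Chinese", "total": 1000, "very_well": 300},
    {"state": "Examplestate", "language": "German", "total": 800, "very_well": 700},
    {"state": "Otherstate", "language": "Polish", "total": 500, "very_well": 400},
    {"state": "Otherstate", "language": "Korean", "total": 450, "very_well": 200},
]

# Languages left out of the ranking, as in the article.
EXCLUDED = {"English", "Spanish"}


def top_language_by(rows, column):
    """Return {state: [languages]} with the highest value in `column`.

    Ties are kept, so a state can map to more than one language
    (as with German/French in West Virginia in the table above).
    """
    best = {}  # state -> (highest value seen so far, languages with that value)
    for row in rows:
        if row["language"] in EXCLUDED:
            continue
        state, value = row["state"], row[column]
        current = best.get(state)
        if current is None or value > current[0]:
            best[state] = (value, [row["language"]])
        elif value == current[0]:
            current[1].append(row["language"])
    return {state: sorted(langs) for state, (_, langs) in best.items()}


def states_where_columns_disagree(rows):
    """Return {state: (top by total, top by 'very well')} where the two differ."""
    by_total = top_language_by(rows, "total")
    by_very_well = top_language_by(rows, "very_well")
    return {
        state: (by_total[state], by_very_well[state])
        for state in by_total
        if by_total[state] != by_very_well[state]
    }


if __name__ == "__main__":
    for state, (total, very_well) in states_where_columns_disagree(ROWS).items():
        print(state, "by total:", total, "by 'very well':", very_well)
```

If the states flagged by a check like this on the real data were the same ones where the map disagrees with my table, that would be decent evidence for the wrong-column explanation.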

Whatever the problem here, I can’t really blame the original author. The Census Bureau’s several websites are awfully difficult to use, the categorizations used are confusing and the data formats are a mess. It was hard work to simply get the data for all the states and clean it up into a usable format. (Now that I’ve done the job once, you can download the data here in a more user-friendly format if you want to play with it.)

This seems to be unfortunately typical of a lot of open government data all around the world. A few magnificent exceptions aside, too much of the world’s open data is in an obscure or messy data format, hidden behind a crappy interface, accessible only to the most dedicated of hacks and wonks. As happy as I am for Gapminder, Google Public Data, and the like, I would rather see governments themselves clean up their act and start thinking seriously about how Joe Public can actually access their data. It isn’t enough that the data exists somewhere in some format. It needs to be accessible to regular people.

