Data Is
(via my O’Reilly blog)
I know that “data” is technically the plural of “datum”. But I find it jarring when I read that “the data are transmitted” somewhere. In common usage (both speech and informal writing), the data “is transmitted.”
It’s not that “data” is singular; it’s more like a nonspecific collective noun, like “air”. It has come to mean—and I’m going to really massacre the language here, just to emphasize the distinction I’m trying to make—“some datums”. We say “the data is corrupt” in the same way we say “the air is polluted.”
At this point you may be thinking I’m just upset that the dictionary doesn’t agree with the way I do things. For the record, though, I’m a careful speaker and writer who usually argues for the rules people have forgotten rather than the common, often sloppy usage. This time I think the change in usage has happened for good reasons.
One reason, I think, is that “datum” is so rarely a useful word. I’m not sure why, but we rarely need to distinguish between singular and plural with respect to data; it’s almost never important to talk about a single datum.
A related reason is that it’s unclear what constitutes a datum. Is it always a bit? Or some larger group of data? (See how slippery it is? Is it reasonable to say that a datum is composed of a group of smaller data?)
My “air” analogy illustrates that problem quite well. Is a molecule of oxygen also an “air molecule”? Air is a mixture, so identifying the smallest unit of air is a tricky thing.
There are contexts, perhaps, where data are discrete and well structured so that the distinction makes sense. But in most cases, data is complex, with an almost fractal structure, and the line between data and datum is almost impossible to draw. (This paragraph is a test, by the way. Which of those sentences seemed most natural to you?)
I think it’s time to acknowledge that the old rule, in this case, is obsolete. Circumstance and usage have turned “data” into a collective, singular noun. It refers to “some data” (and in the tradition of computer science, “some” can mean “zero or more”). “Datum” can still be useful on the rare occasions where you need to emphasize a singular unit that can’t be described as a bit, byte, octet, scalar, etc.
Update: a respondent, “gojomo”, points out that the correct linguistic term for the common usage of “data” is “mass noun”. Other examples of mass nouns include water, blood, light, money, and cheese.
Update 2 (2017-05-15): I recently re-watched Guy Steele’s brilliant talk from 1998, Growing a Language (transcript here). In that talk, Steele defines “data” like this:
A datum is a set of bits that has a meaning; data is the mass noun for a set of datums.
So in 1998, it was accepted usage (accepted by Guy Steele, anyway, which is good enough for me) to treat “data” as a mass noun.