Data Is
(via my O’Reilly blog)
I know that “data” is technically the plural of “datum”. But I find it jarring when I read that “the data are transmitted” somewhere. In common usage (both speech and informal writing), the data “is transmitted.”
It’s not that “data” is singular; it’s more like a nonspecific collective noun, like “air”. It has come to mean—and I’m going to really massacre the language here, just to emphasize the distinction I’m trying to make—“some datums”. We say “the data is corrupt” in the same way we say “the air is polluted.”
At this point you may be thinking I’m just upset that the dictionary doesn’t agree with the way I do things. For the record, though, I’m a careful speaker and writer who usually argues for the rules people have forgotten rather than the common, often sloppy usage. This time I think the change in usage has happened for good reasons.
One reason, I think, is that “datum” is so rarely a useful word. I’m not sure why, but we rarely need to distinguish between singular and plural with respect to data; it’s almost never important to talk about a single datum.
A related reason is that it’s unclear what constitutes a datum. Is it always a bit? Or some larger group of data? (See how slippery it is? Is it reasonable to say that a datum is composed of a group of smaller data?)
My “air” analogy illustrates that problem quite well. Is a molecule of oxygen also an “air molecule”? Air is a mixture, so identifying the smallest unit of air is a tricky thing.
There are contexts, perhaps, where data are discrete and well structured so that the distinction makes sense. But in most cases, data is complex, with an almost fractal structure, and the line between data and datum is almost impossible to draw. (This paragraph is a test, by the way. Which of those sentences seemed most natural to you?)
I think it’s time to acknowledge that the old rule, in this case, is obsolete. Circumstance and usage have turned “data” into a collective, singular noun. It refers to “some data” (and in the tradition of computer science, “some” can mean “zero or more”). “Datum” can still be useful on the rare occasions where you need to emphasize a singular unit that can’t be described as a bit, byte, octet, scalar, etc.
Update: a respondent, “gojomo”, points out that the correct linguistic term for the common usage of “data” is “mass noun”. Other examples of mass nouns include water, blood, light, money, and cheese.
Update 2 (2017-05-15): I recently re-watched Guy Steele’s brilliant talk from 1998, Growing a Language (transcript here). In that talk, Steele defines “data” like this:
A datum is a set of bits that has a meaning; data is the mass noun for a set of datums.
So in 1998, it was accepted usage (accepted by Guy Steele, anyway, which is good enough for me) to treat “data” as a mass noun.
Christmas Books: The Parrot’s Theorem
I really enjoy books, and my taste is broad and often a little strange. Plus, I pay attention to books (using resources like the bookshelf section of Rael Dornfest’s page). So it’s really unusual for someone to give me a book that’s both right up my alley and also unknown to me. This year, it happened twice. Here’s the first one.
We lived in Australia for a few years in the early ’90s, and naturally made some wonderful friends there. For Christmas this year, our friends Doug and Trisha Paice sent me a copy of The Parrot’s Theorem, by Denis Guedj. It’s a novel about the history of mathematics, and it fits my criteria for great gifts: I wouldn’t have bought it for myself, but I’m delighted to have it. I’m halfway through, and it’s a lot of fun.
If you’re looking for just a good page-turner of a novel, you can safely skip it; the story probably won’t grab you if you don’t have at least a passing interest in the history of mathematics. And there are some distinct weaknesses in the writing (which I think may be due to a sloppy translation from the original French). Still, it’s fantastic for me. I find the basic theme interesting, and I would love to know more about it, but I probably wouldn’t bother to slog through a serious book about the history of mathematics. The fictional story of The Parrot’s Theorem gives the topic a narrative structure that makes it a fun and easy read.
(Additionally, through this book I was reminded of another book that I had heard of but forgotten: Sophie’s World: A Novel About the History of Philosophy. Supposedly it is a terrific book, working better as a novel than The Parrot’s Theorem. I’ll have to add it to my wish list.)
Pluggable optimizations
I enjoy reading Bill Venners’ interviews with software development luminaries. Bill himself is (from what I’ve seen) a talented and tasteful developer, and he picks some of the best to interview. Plus, he makes sure he’s familiar with each person’s work, and asks intelligent questions. This week I read the final part of his interview with Martin Fowler, and it really resonated with some lessons I’ve learned over the past few years.
Last year I gave a talk at JavaOne (and later for two other audiences) called Stalking Your Shadow: Adventures in Garbage Collection Optimization. (Although it sounds like an arcane optimization talk, in reality it’s sort of a “stealth agile” talk—the firmest recommendation in it is to do tightly iterative development with performance testing beginning very early in the process, so you can catch poor decisions early, while they’re easy to change.) In that talk, I point out that the right optimization strategies are strongly dependent on your choice of platform, that different optimization strategies might either conflict with each other or reinforce each other, and that you must measure the effect of your changes to see whether they help or hurt performance.
The implication there, of course, is that if you are thinking about multiple optimizations in your system, then you must (if you want to avoid what Mike Clark calls “thrash tuning”) have ways to mix and match those optimizations as you measure, to see which combination produces an acceptable result. One trick I recommend is to implement each of your optimizations as an aspect using AspectJ. AspectJ makes it very easy to choose, from build to build, which aspects are included in your system.
I was focusing on optimizing at a particular point in time, but Martin talked to Bill Venners about the ongoing lifecycle of software. He discusses how advances in platform implementation technology can turn today’s optimizations into performance drags (he’s talking specifically about VMs, but the same things apply to the OS and compiler). He recommends that optimizations be revisited with each platform upgrade. To do that, of course, you need to keep the design simple and the optimizations well encapsulated.
My strategy of using AspectJ to insert the optimizations from the outside could make this a breeze. Start with the clean design. After profiling, leave the clean code in place and let an aspect that implements the optimization take over for it in the build. Later, you can easily build with optimizations included or excluded and run your performance tests again.
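Here is a rough sketch of that idea, with one big caveat: it is plain Java, not AspectJ. A decorator stands in for the aspect, and a boolean flag stands in for the build-time choice of which aspects to include. Every name in it (Pricer, SimplePricer, CachingPricer, PricerFactory) is made up for illustration, and the "expensive calculation" is just a placeholder.

```java
// Hypothetical names throughout. Plain Java stands in for AspectJ here:
// a decorator plays the role of an "around" aspect, and a flag decides
// whether it is included.

interface Pricer {
    long priceOf(int itemId);
}

// The clean design: simple, obviously correct, and deliberately naive.
final class SimplePricer implements Pricer {
    int calls = 0; // counts real computations, so a test can observe caching

    @Override
    public long priceOf(int itemId) {
        calls++;
        long total = 0;
        for (int i = 1; i <= itemId; i++) {
            total += (long) i * i; // placeholder for an expensive calculation
        }
        return total;
    }
}

// The optimization, kept outside the clean code: it caches results and
// delegates to the wrapped Pricer on a miss.
final class CachingPricer implements Pricer {
    private final Pricer delegate;
    private final long[] cache;
    private final boolean[] cached;

    CachingPricer(Pricer delegate, int maxItemId) {
        this.delegate = delegate;
        this.cache = new long[maxItemId + 1];
        this.cached = new boolean[maxItemId + 1];
    }

    @Override
    public long priceOf(int itemId) {
        if (itemId < 0 || itemId >= cache.length) {
            return delegate.priceOf(itemId); // outside the cache range: pass through
        }
        if (!cached[itemId]) {
            cache[itemId] = delegate.priceOf(itemId);
            cached[itemId] = true;
        }
        return cache[itemId];
    }
}

// One place decides which optimizations are part of this "build".
final class PricerFactory {
    static Pricer build(Pricer base, boolean withCache) {
        return withCache ? new CachingPricer(base, 1_000) : base;
    }
}
```

The point of the shape, not the details: SimplePricer never learns that caching exists, so after a platform upgrade you can call PricerFactory.build with the flag off, rerun the performance tests, and find out whether the optimization still earns its keep.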
(This also reminds me of a story I’ve heard several times about the development of Microsoft Excel. They have a very simple evaluator that they are certain is correct, but it’s very slow. They also have a completely separate, highly optimized and fiendishly complex evaluator. During development and testing they run it with both evaluators turned on, with the simple, slow evaluator checking the work of the fast one. Then they turn the slow evaluator off for the production build. This strategy seems to work well—I’ve certainly encountered numerous bugs in Excel, but none have involved the evaluator, and evaluation is fast.)
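The two-evaluator trick is easy to sketch in miniature. To be clear, this has nothing to do with Excel's actual code, which I have never seen; the names (Evaluator, SlowEvaluator, FastEvaluator, CheckedEvaluator, Evaluators) and the toy problem of summing 1 through n are my own assumptions, chosen only to show the shape of the idea.

```java
// Hypothetical sketch: a slow reference evaluator checks a fast one
// during development, and is left out of the production build.

interface Evaluator {
    long sumUpTo(long n); // sum of the integers 1..n, or 0 when n <= 0
}

// Simple and obviously correct, but linear in n.
final class SlowEvaluator implements Evaluator {
    @Override
    public long sumUpTo(long n) {
        long total = 0;
        for (long i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }
}

// The optimized version: constant time via the closed-form formula.
final class FastEvaluator implements Evaluator {
    @Override
    public long sumUpTo(long n) {
        return n <= 0 ? 0 : n * (n + 1) / 2;
    }
}

// Runs both evaluators and fails loudly if the fast one disagrees
// with the reference.
final class CheckedEvaluator implements Evaluator {
    private final Evaluator fast;
    private final Evaluator reference;

    CheckedEvaluator(Evaluator fast, Evaluator reference) {
        this.fast = fast;
        this.reference = reference;
    }

    @Override
    public long sumUpTo(long n) {
        long result = fast.sumUpTo(n);
        long expected = reference.sumUpTo(n);
        if (result != expected) {
            throw new IllegalStateException(
                "fast evaluator returned " + result + " but reference returned " + expected);
        }
        return result;
    }
}

final class Evaluators {
    // Development and test builds pay for the cross-check; production does not.
    static Evaluator forBuild(boolean production) {
        Evaluator fast = new FastEvaluator();
        return production ? fast : new CheckedEvaluator(fast, new SlowEvaluator());
    }
}
```

In a test run, Evaluators.forBuild(false) gives you the checked pair, so any input where the fast path drifts from the reference blows up immediately instead of producing a quietly wrong answer.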
Christmas Books: Victorian Science Fiction
At any given moment, I’m in the middle of more than one book. I just tend to get fiercely interested in something else before I finish with one book, so I go off on a tangent for a while with a new book, eventually coming back to finish the first.
About a week before Christmas I was rereading Neal Stephenson’s The Diamond Age. It’s a really interesting book, in part because of its form: it is quite definitely a science fiction novel, but it is structured as a Victorian novel, complete with the distinctive chapter headings: a little graphical ornament, and a short synopsis of the events that will happen in the chapter.
Then I spent a lunch break at the bookstore, and came across a book I’d heard many good things about: To Say Nothing of the Dog, by Connie Willis. Based on universally good reviews, I’ve been wanting to read it for a few years, and I was in a hurry, so I just snatched it from the shelf and bought it on an impulse.
Lo and behold, it’s another Victorian science fiction novel. Without planning it, I find myself in the position of simultaneously reading two Victorian science fiction novels. There can’t be too many books that fit that description; what an interesting coincidence.
Back online
I’ve had a couple of weeks off from work, and I spent most of that time away from my computer, doing things with the boys and having a relaxing time.
Not that I haven’t been thinking about topics I wanted to blog. In fact, I’ve got a whole list of observations about the several wonderful books I got for Christmas; I’ll be blogging them over the next few days under the overall title “Christmas Books.”