Inside the mind of Jason Hunter

Jason’s too-hip AIM icon and iChat combined to give me a good laugh yesterday:

Intellectual Accuracy

I’ve been gradually coming to understand the impact of the Eldred decision, and it’s been fascinating to read Lessig’s blog during the past week or so. He points to a great piece by Doc Searls arguing that many people completely miss the point because they think of copyright as a property right. And in calling for a return to the original 14-year copyright term, The Economist makes the related point that the originators of the idea of copyright didn’t see it as a property right at all.

In this country, at least, calling copyright a property right creates some strange contradictions. In the U.S., property rights are nearly sacred, and can’t be violated by the government except in very limited circumstances, and then only on a case-by-case basis for specific items. The idea of limited terms for intellectual property rights simply doesn’t fit well into the overall view of property rights.

I’m tilting at windmills, I suppose, but I’m going to stop using the term “intellectual property.” (I’m curious, by the way, about the origin of that term and how it came into common use.) For the moment, until I hear a better alternative, I think I’ll call it “creative output” instead.

Data Is

(via my O’Reilly blog)

I know that “data” is technically the plural of “datum”. But I find it jarring when I read that “the data are transmitted” somewhere. In common usage (both speech and informal writing) the data “is transmitted.”

It’s not that “data” is singular; it’s more like a nonspecific collective noun, like “air”. It has come to mean (and I’m going to really massacre the language here, just to emphasize the distinction I’m trying to make) “some datums”. We say “the data is corrupt” in the same way we say “the air is polluted.”

At this point you may be thinking I’m just upset that the dictionary doesn’t agree with the way I do things. For the record, though, I’m a careful speaker and writer who usually argues for the rules people have forgotten rather than the common, often sloppy usage. This time I think the change in usage has happened for good reasons.

One reason, I think, is that “datum” is so rarely a useful word. I’m not sure why, but we rarely need to distinguish between singular and plural with respect to data; it’s almost never important to talk about a single datum.

A related reason is that it’s unclear what constitutes a datum. Is it always a bit? Or some larger group of data? (See how slippery it is? Is it reasonable to say that a datum is composed of a group of smaller data?)

My “air” analogy illustrates that problem quite well. Is a molecule of oxygen also an “air molecule”? Air is a mixture, so identifying the smallest unit of air is a tricky thing.

There are contexts, perhaps, where data are discrete and well structured so that the distinction makes sense. But in most cases, data is complex, with an almost fractal structure, and the line between data and datum is almost impossible to draw. (This paragraph is a test, by the way. Which of those sentences seemed most natural to you?)

I think it’s time to acknowledge that the old rule, in this case, is obsolete. Circumstance and usage have turned “data” into a collective, singular noun. It refers to “some data” and, in the tradition of computer science, “some” can mean “zero or more”. “Datum” can still be useful on the rare occasions where you need to emphasize a singular unit that can’t be described as a bit, byte, octet, scalar, etc.

Update: a respondent, “gojomo”, points out that the correct linguistic term for the common usage of “data” is “mass noun”. Other examples of mass nouns include water, blood, light, money, and cheese.

Update 2 (2017-05-15): I recently re-watched Guy Steele’s brilliant talk from 1998, Growing a Language (transcript here). In that talk, Steele defines “data” like this:

A datum is a set of bits that has a meaning; data is the mass noun for a set of datums.

So in 1998, it was accepted usage (accepted by Guy Steele, anyway, which is good enough for me) to treat “data” as a mass noun.

Christmas Books: The Parrot’s Theorem

I really enjoy books, and my taste is broad and often a little strange. Plus, I pay attention to books (using resources like the bookshelf section of Rael Dornfest’s page). So it’s really unusual for someone to give me a book that’s both right up my alley and also unknown to me. This year, it happened twice. Here’s the first one.

We lived in Australia for a few years in the early ’90s, and naturally made some wonderful friends there. For Christmas this year, our friends Doug and Trisha Paice sent me a copy of The Parrot’s Theorem, by Denis Guedj. It’s a novel about the history of mathematics, and it fits my criteria for great gifts: I wouldn’t have bought it for myself, but I’m delighted to have it. I’m halfway through, and it’s a lot of fun.

If you’re looking for just a good page-turner of a novel, you can safely skip it: the story probably won’t grab you if you don’t have at least a passing interest in the history of mathematics. And there are some distinct weaknesses in the writing (which I think may be due to a sloppy translation from the original French). But it’s fantastic for me … I find the basic theme interesting, and I would love to know more about it, but I probably wouldn’t bother to slog through a serious book about the history of mathematics. But the fictional story of The Parrot’s Theorem gives the topic a narrative structure that makes it a fun and easy read.

(Additionally, through this book I was reminded of another book that I had heard of but forgotten: Sophie’s World: A Novel About the History of Philosophy. Supposedly it is a terrific book, working better as a novel than The Parrot’s Theorem. I’ll have to add it to my wish list.)

Pluggable optimizations

I enjoy reading Bill Venners’ interviews with software development luminaries. Bill himself is (from what I’ve seen) a talented and tasteful developer, and he picks some of the best to interview. Plus, he makes sure he’s familiar with each person’s work, and asks intelligent questions. This week I read the final part of his interview with Martin Fowler, and it really resonated with some lessons I’ve learned over the past few years.

Last year I gave a talk at JavaOne (and later for two other audiences) called Stalking Your Shadow: Adventures in Garbage Collection Optimization. (Although it sounds like an arcane optimization talk, in reality it’s sort of a “stealth agile” talk: the firmest recommendation in it is to do tightly iterative development with performance testing beginning very early in the process, so you can catch poor decisions early, while they’re easy to change.) In that talk, I point out that the right optimization strategies are strongly dependent on your choice of platform, that different optimization strategies might either conflict with each other or reinforce each other, and that you must measure the effect of your changes to see whether they help or hurt performance.

The implication there, of course, is that if you are thinking about multiple optimizations in your system, then you must (if you want to avoid what Mike Clark calls “thrash tuning”) have ways to mix and match those optimizations as you measure, to see which combination produces an acceptable result. One trick I recommend is to implement each of your optimizations as aspects using AspectJ. AspectJ makes it very easy to choose, from build to build, which aspects are included in your system.

I was focusing on optimizing at a particular point in time, but Martin talked to Bill Venners about the ongoing lifecycle of software. He discusses how advances in platform implementation technology can turn today’s optimizations into performance drags (he’s talking specifically about VMs, but the same things apply to the OS and compiler). He recommends that optimizations be revisited with each platform upgrade. To do that, of course, you need to keep the design simple and the optimizations well encapsulated.

My strategy of using AspectJ to insert the optimizations from the outside could make this a breeze. Start with the clean design, and after profiling you can leave the clean code in place, but replace it using an aspect that implements the optimization. Later, you can easily build with optimizations included or excluded and run your performance tests again.
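The shape of that idea can be sketched in plain Java. To be clear, this is not AspectJ and not the code from the talk: a hand-wired wrapper stands in for the aspect, and a system property stands in for the build-time choice of which aspects to include. Every name here (PluggableOptimization, sumOfSquares, cached, the optimize.cache property) is hypothetical and exists only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.IntToLongFunction;

public class PluggableOptimization {

    // The clean design: simple, readable, and left untouched by the optimization.
    static long sumOfSquares(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += (long) i * i;
        }
        return total;
    }

    // The optimization lives outside the clean code, as a wrapper that
    // remembers results it has already computed.
    static IntToLongFunction cached(IntToLongFunction inner) {
        Map<Integer, Long> cache = new HashMap<>();
        return n -> cache.computeIfAbsent(n, k -> inner.applyAsLong(k));
    }

    // One switch decides whether the optimization is included.
    static IntToLongFunction build(boolean useCache) {
        IntToLongFunction base = PluggableOptimization::sumOfSquares;
        return useCache ? cached(base) : base;
    }

    public static void main(String[] args) {
        // Run with -Doptimize.cache=true to include the optimization.
        boolean useCache = Boolean.getBoolean("optimize.cache");
        IntToLongFunction f = build(useCache);
        System.out.println("cache=" + useCache + " result=" + f.applyAsLong(10));
    }
}
```

Running the same performance test once with the flag on and once with it off gives the with/without comparison described above. With AspectJ the switch would happen at build time instead, and the clean code would not need a build method at all.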

(This also reminds me of a story I’ve heard several times about the development of Microsoft Excel. They have a very simple evaluator that they are certain is correct, but it’s very slow. They also have a completely separate, highly optimized and fiendishly complex evaluator. During development and testing they run it with both evaluators turned on, with the simple, slow evaluator checking the work of the fast one. Then they turn the slow evaluator off for the production build. This strategy seems to work well: I’ve certainly encountered numerous bugs in Excel, but none have involved the evaluator, and evaluation is fast.)
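A minimal Java sketch of that checking pattern follows. It only illustrates the story as told, and makes no claim about how Excel is really built. The two "evaluators" are toy stand-ins, and all names (CheckedEvaluator, slowSum, fastSum, evaluate) are hypothetical.

```java
public class CheckedEvaluator {

    // Simple and obviously correct, but slow: add the numbers one at a time.
    static long slowSum(int n) {
        long total = 0;
        for (int i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    // Optimized stand-in for the "fiendishly complex" evaluator: closed form.
    static long fastSum(int n) {
        return (long) n * (n + 1) / 2;
    }

    // With checking on (development and testing), the slow evaluator verifies
    // the fast one. With checking off (production), only the fast one runs.
    static long evaluate(int n, boolean checkWithSlow) {
        long fast = fastSum(n);
        if (checkWithSlow) {
            long slow = slowSum(n);
            if (fast != slow) {
                throw new IllegalStateException(
                    "fast evaluator disagrees: fast=" + fast + " slow=" + slow);
            }
        }
        return fast;
    }

    public static void main(String[] args) {
        System.out.println(evaluate(100, true));   // development: both evaluators run
        System.out.println(evaluate(100, false));  // production: fast evaluator only
    }
}
```

The point of the design is that the slow path never has to be fast, so it can stay simple enough to trust.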
