A position paper for the 2001 OOPSLA Software Archeology Workshop
Glenn Vanderburg, Delphi Consultants, LLC <glv@delphis.com>

Experience Report

There’s nothing more useless than a bored archaeologist.

—Douglas Adams

Late last year, I was brought in by a client to investigate a serious performance problem with a large Java system. Like many of the performance problems I see, this one challenged the usual “profile, identify hot spots, optimize” approach and in places required rather extensive redesign. Here, the worst problem was garbage collector performance, with two systems (the program itself and the garbage collector) competing for resources and affecting each other in subtle ways.

Through the course of the year I had been learning about “aspect-oriented programming” (AOP) and its Java-based incarnation, AspectJ. The AspectJ documentation mentions logging and tracing as an ideal application of AOP, especially for conservative developers who don’t want to commit production code to a new, unfamiliar technology. By using AspectJ for tracing, you can try it out in development but keep it out of the production code. As a consultant, I can’t be more daring than my clients will allow (that is, not much more), so I had been looking for an opportunity to try AspectJ as an archaeological tool.

I began with the usual CPU and memory profiling, of course. Then I began using AspectJ to trace the system as I tried to understand how problem points in the system could be reworked to use less CPU or create fewer objects. I was pleased with the approach. It took just a few minutes to instrument the entire system for exhaustive control-flow tracing, and I could rapidly and easily narrow the focus of the trace to just particular packages, classes, or the subtree below a single method call. I could trace event flow just as easily as control flow.

Space doesn’t permit many examples, but the following code is all it takes to trace method and constructor entry and exit in all com.delphis packages:

aspect Trace {
    pointcut allmethods(): target(com.delphis..*)
                           && (execution(new(..)) || execution(* *(..)));
    before(): allmethods() {
        System.out.println("Entering: " + thisJoinPointStaticPart.getSignature());
    }
    after(): allmethods() {
        System.out.println("Exiting: " + thisJoinPointStaticPart.getSignature());
    }
}

To my delight, I found that aspects allowed me to do more than just trace. The interactions between the program and the garbage collector were complex, and the effects of a change were difficult to predict. I was able in one case to replace all use of an object pool with direct calls to the constructor with just a few lines of AspectJ code. Simply by choosing whether to include that aspect in the build, I was able to turn object pooling on and off, measuring the effects.

In another case, I needed to see what was happening in a particular method, but only in the case where that method was running in the thread that handled communication events. The problem was that the program used a thread pool, and the event thread had a different, undistinguished name (such as “Thread-11”) in each run. I was able to quickly write an aspect that changed the name of the thread as soon as it was chosen to be the event thread. Another aspect implemented the appropriate tracing, so long as the name of the current thread was “CommEventHandler”.
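The logic those two aspects carried is ordinary Java, and a minimal sketch of it may make the trick clearer. This is not the original AspectJ code: the pointcuts that decided where to apply the logic are left out, and every name except the thread name “CommEventHandler” (the class, the methods, the log) is hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of thread-name-gated tracing. In the report, AspectJ
// advice would call logic like this; here it is invoked by hand.
class ThreadNameTraceDemo {
    static final String EVENT_THREAD_NAME = "CommEventHandler";
    static final List<String> LOG =
            Collections.synchronizedList(new ArrayList<String>());

    // What the first aspect did: rename the thread once it is chosen
    // to be the event thread.
    static void markAsEventThread() {
        Thread.currentThread().setName(EVENT_THREAD_NAME);
    }

    // What the second aspect did: trace only when running in that thread.
    static void traceIfEventThread(String message) {
        if (EVENT_THREAD_NAME.equals(Thread.currentThread().getName())) {
            LOG.add(message);
        }
    }

    // Hypothetical stand-in for the method under investigation.
    static void interestingMethod(String caller) {
        traceIfEventThread("interestingMethod called by " + caller);
    }

    // Runs the method in an ordinary pooled-style thread and in the event thread.
    static List<String> runDemo() {
        LOG.clear();
        Thread worker = new Thread(() -> interestingMethod("worker"));
        Thread events = new Thread(() -> {
            markAsEventThread();
            interestingMethod("events");
        });
        worker.start();
        events.start();
        try {
            worker.join();
            events.join();
        } catch (InterruptedException e) {
            throw new IllegalStateException(e);
        }
        return new ArrayList<String>(LOG);
    }

    public static void main(String[] args) {
        System.out.println(runDemo()); // only the event-thread call is traced
    }
}
```

Only the call made from the renamed thread reaches the log, however the pool happens to number its threads in a given run.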

Using AOP for these investigations eliminated a lot of worry. I could implement complex, thorough tracing, and keep it in place through a series of modifications to the production code, without having to dread going through and ripping the tracing out when I was done. The production Makefile didn’t know anything about the aspects, and the production build machine didn’t have AspectJ installed, so there was no danger of my archaeological tools finding their way into a production build.

Position Statement

The most difficult problems of software archaeology involve understanding the runtime behavior of the system, cutting across the boundaries defined by explicit static structures in the source code. As such, the concepts and notations of aspect-oriented programming are natural fits for archaeological tools.

The Task

Making changes to an existing software system usually involves three steps. The first is to understand the structure of the system well enough to see the general shape of the change. The second is to zero in on the parts that will be affected, understanding them in detail. The third is to implement the change, including testing and debugging. (Although it’s rarely done, part of the testing should include researching whether the change has harmed the overall quality of the system.) Just as with the original development, this is an iterative process, except that in my experience changing an existing system involves more false starts, where iteration means abandoning some or all of the progress made and starting again.

Obviously, all software development involves all of those same things to some degree. But in the situations that are the focus of this workshop, there are two aspects that take on greater importance. One is quickly building an understanding of a completely unfamiliar system, how it is structured and how it behaves. The other is focusing, identifying the relevant parts of the system so that we can build a detailed understanding of the parts that are significant for our change, while ignoring the rest.

Understanding the system structure. There are two kinds of structure involved: The static relationships represented explicitly in the structure of the code, and the dynamic structure, the behavior of the system at runtime. There are tools that can do a fairly good job of extracting the static structure semi-automatically. At any rate it’s usually not too hard to do manually, as the structure is fairly explicit, if unnecessarily complex, in the code. It’s the dynamic structure that’s more difficult to see. Object-oriented systems, with their late binding mechanisms, increase this difficulty. It may be all but impossible to discern purely from examination of the code what kind of entity lies behind a variable, or what code will be invoked in response to a method call. In fact, the answers to those questions may vary depending on the input the program receives at runtime.
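A small illustration of that last point, with types invented purely for the example (Handler, PlainHandler, UrgentHandler, and LateBindingDemo are all hypothetical): the declared type of the variable is an interface, and which implementation sits behind it depends on the input.

```java
// Hypothetical types, invented for illustration only.
interface Handler {
    String handle(String message);
}

class PlainHandler implements Handler {
    public String handle(String message) {
        return "plain:" + message;
    }
}

class UrgentHandler implements Handler {
    public String handle(String message) {
        return "urgent:" + message.toUpperCase();
    }
}

class LateBindingDemo {
    // The declared return type says nothing about which class is chosen;
    // that depends on the message content, known only at runtime.
    static Handler handlerFor(String message) {
        return message.startsWith("!") ? new UrgentHandler() : new PlainHandler();
    }

    // Reports the runtime class behind the variable, which is the kind of
    // fact a trace can show and a reading of the call site cannot.
    static String dispatch(String message) {
        Handler h = handlerFor(message);
        return h.getClass().getSimpleName() + " -> " + h.handle(message);
    }

    public static void main(String[] args) {
        System.out.println(dispatch("hello")); // PlainHandler -> plain:hello
        System.out.println(dispatch("!fire")); // UrgentHandler -> urgent:!FIRE
    }
}
```

Reading the call to h.handle(message) in isolation reveals only that some Handler is invoked; the trace line shows which one actually ran for a given input.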

Focusing and pruning. In the static view, we can do a fairly good job identifying coarse-grained structure by looking at package structures, class hierarchies, UML class diagrams, and so forth, and weeding out the parts that seem irrelevant or at too low or high a level. Where we most need help is with handling the much greater volume of information that our tools give us about the dynamic behavior of the system.

So the most difficult problems of software archaeology relate to the runtime behavior of the system.

Existing Tools

There are few good tools for understanding the behavior of a running system. Worse, there are tools that confuse us: many UML modeling tools will happily use code analysis to construct UML “behavior” diagrams, but the code analysis can only produce an incomplete, static view. As a result, wherever the code uses objects that are passed in as parameters, returned as results, or created indirectly by factories, the diagrams are incomplete (when the declared type is an interface or abstract class, so that the tool has no method implementation to analyze) or incorrect (when the runtime type turns out to be a subclass that has overridden methods and modified their behavior).

Debuggers, CPU and memory profilers, software visualization and algorithm animation systems, and tracing packages all provide ways to examine running systems. But they all have serious limitations. Debuggers provide a series of “frozen” views; through them we can build a picture of the system’s behavior, but the process is a lot like trying to understand the plot of a movie by looking at a few dozen frames of the film. Profilers emphasize performance and resource utilization, not structure and relationships. Software visualization tools are usually quite limited; although useful for analyzing data structures and algorithms, they provide little help in understanding larger-scale system design and architecture, or program communication patterns.

Today, tracing alone provides a rich, thorough, tailorable, dynamic view of a system in action. Yet it suffers from a pair of flaws: it is invasive and manual. Tracing requires inserting statements into the code at all of the appropriate places, preceded (of course) by the necessity of finding all of those places. Sometimes it’s fairly easy to build scripts to automate part of the process, but not always. We may or may not want to leave the tracing statements in the code after our archaeological expedition is done; they can impact the performance of the system, and they make the code more complex by cluttering it with “housekeeping” code. Most often, we end up choosing to leave some of the tracing statements in the code.

Where tracing gets even more cumbersome is in providing a focus. As we learn about the system from watching the traces, we want to zoom in on some particular aspect of the system’s behavior. Doing that usually means adding new tracing statements. Often the new tracing entries are swamped by the old, requiring us to either remove the other entries, filter the trace stream, or add some sort of configurable “tracing level” system, which brings its own problems. Combine this with the exploratory, probing, false-start nature of software archaeology, and the result is a mess: the code is full of tracing statements covering multiple levels of detail, the traces are voluminous, verbose, and inconsistent, and there are multiple, overlapping tracing levels.

Whereas the static structure of the system is largely hierarchical, tracing tends to look at layers or events or flows of control that slice across the hierarchy. While it’s certainly possible to stretch the archaeology metaphor too far, it’s hard to miss the correspondence with strata and slices and cores.

Dynamic Structure: A Crosscutting Concern

These views of the system at runtime are what the aspect-oriented programming community calls crosscutting concerns. Therefore it’s natural to apply aspect-oriented techniques and tools to them. AOP has many interesting characteristics, but for our purposes it suffices to think of it as a generalization of Lisp’s “advice” facilities, or as a “meta-object protocol”, or as a sophisticated, pattern-driven, programmatic source code transformer.

The AspectJ language (a superset of Java that compiles to compatible Java bytecodes) offers tremendous advantages when researching Java programs. In a single place, with a few lines of code, we can identify all points in the system that fit some criterion (all method calls, all method returns, all exceptional returns, all event receptions, all thread creations, all calls to methods with a particular name, or declared by a particular interface, or with particular parameter types, or particular return types) and insert tracing statements or other code of our own choosing at those points. We can compose those criteria arbitrarily.

The structuring mechanisms of AOP allow us to separate tracing aspects by intent: one aspect can trace control flow, while another aspect traces event flow, and yet another traces communication between threads. The build mechanisms then allow us to choose which aspects are included in the system for a particular build, and ignore the rest. This makes it easy to keep the tracing code around for later use, including a check for unforeseen effects after the functional change has been made.

These features allow us to quickly build aspects that provide different views into the running system, and to choose at any point which views we want to see.

The code inserted by aspects can do more than tracing. It can alter behavior, provide identifying names to objects (such as GUI components or thread handles) that would otherwise be indistinguishable in the debugger, or add or replace toString methods so that trace messages or debuggers can provide more meaningful information.

Also, because we can easily remove all of the tracing aspects from the production build, we can make the tracing code very dynamic, testing complicated conditions at runtime to determine when and whether to log a trace message. An example included in the AspectJ distribution uses aspects to install a tracing control panel along with the tracing code, allowing particular tracing aspects to be turned on or off while the application is running.

Finally, if some of the aspects should be included in production builds, there are two options: either compile them in using AspectJ, or use a tool supplied with the compiler to identify all of the places in the Java code where the tracing statements should be added.

Do these techniques apply to other languages? Certainly, to a greater or lesser degree. Languages that provide richer support for reflection don’t require a special add-on tool to do this kind of thing. However, AspectJ does provide an important tool that would be applicable (in analogous form) in any language: the pointcut notation, a specialized sublanguage for identifying sets of points in a program that match certain criteria.
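Java itself offers a limited taste of the reflective route. The sketch below uses the standard java.lang.reflect.Proxy class to add entry and exit tracing without touching the traced class. It is an analogue, not a substitute for AspectJ: it intercepts only calls made through an interface on an object that has been explicitly wrapped, and it has nothing like pointcuts. Greeter, SimpleGreeter, and ProxyTraceDemo are hypothetical names.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface and implementation, invented for illustration.
interface Greeter {
    String greet(String name);
}

class SimpleGreeter implements Greeter {
    public String greet(String name) {
        return "Hello, " + name;
    }
}

class ProxyTraceDemo {
    // Wraps target in a dynamic proxy that logs entry and exit for every
    // call made through the given interface.
    static <T> T traced(T target, Class<T> iface, List<String> log) {
        InvocationHandler handler = (proxy, method, args) -> {
            log.add("Entering: " + method.getName());
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow what the target method threw
            } finally {
                log.add("Exiting: " + method.getName());
            }
        };
        return iface.cast(Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[] { iface }, handler));
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<String>();
        Greeter greeter = traced(new SimpleGreeter(), Greeter.class, log);
        System.out.println(greeter.greet("world")); // Hello, world
        System.out.println(log); // [Entering: greet, Exiting: greet]
    }
}
```

What the sketch lacks is precisely what the pointcut notation supplies: a way to say, in one place, which calls across the whole system should be wrapped.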

Next Steps

There are several interesting directions in which we can take this idea.

The AspectJ distribution comes with a generalized tracing aspect that can be applied to any system. Although it’s a good start, it’s a long way from being a full-fledged archaeology tool. With a little work, one could build a richer set of tracing aspects, tracing more than just control flow, with a runtime tracing control panel. Via AspectJ’s aspect inheritance capabilities and named pointcuts, those tracing aspects could easily be tailored to provide customized views into a system.

A logical next step is to build a tool that uses those traces to generate accurate behavior models and diagrams of portions of the system.

Ultimately, it would be useful to apply the notion (and notation) of pointcuts to other tools. It would be very useful to be able to specify conditional breakpoints in a debugger using pointcut syntax, allowing breakpoints that, for example, break just before calling any method with a particular name or signature. It’s also possible to imagine applications to profilers, monitoring tools, and visualization tools.