The Wall: September 2013

You learn more from failure than you do success. In this post, I'd like to list my mistakes and failures from versions 0.01 to 0.99 of the DTraceToolkit, as lessons to learn from. There is much detail here that may only interest a small number of people. And if you are a recruiter, I can assure you that this won't interest you at all, so please stop reading. In fact, I'm another Brendan Gregg – the one who makes many mistakes – since there are several of us, after all. :-)

As a summary of the lessons learned, skip to the end for the "Learning From Mistakes" section.

Background

By 2005 I was sharing a small collection of DTrace scripts – tools – on my homepage. These were popular with a wide audience, including those who didn't have time to write DTrace scripts from scratch, and those who weren't programmers anyway. For these casual DTrace users I created a toolkit, to:

Give people useful tools to run immediately
Present good examples of DTrace to learn from

I was doing performance consulting and training, and had a keen sense of what problems needed solving. So creating useful tools was easy: I already had a laundry list of needs. As for good examples: I made my own coding style and stuck to it. I also created documentation for every script: a man page, and a file showing example usage. And I tested every script carefully.

The toolkit has been successful, helping people solve issues and learn DTrace. The scripts have also have been included, as tools, in multiple operating systems by default, including Mac OS X and Oracle Solaris.

Mistake 1. Missing Scripts

The observability coverage was a little uneven, as it was based on my performance experience at the time. Some areas I didn't write enough scripts for, and some areas I missed by mistake: for example, socket I/O from the syscall interface. I should have drawn a functional diagram of the system and kernel internals, to look for areas that were lacking scripts and observability.

I did cover the missing areas when I wrote numerous new scripts for the DTrace book, which are shared online as its own collection. But I haven't figured out how to include them in the DTraceToolkit, as they were created as example scripts (that likely need tweaking to run on different OSes) and not robust tools. So, now I have two collections of scripts, and over 400 scripts total.

Mistake 2. Too Many Scripts

I think I have too many scripts for the target audience: casual users. I'm reminded of the function/complexity ratio described in The Mythical Man Month, where the addition of some function came at the cost of much more complexity, making computers harder to learn and use. Casual users may only use a handful of the scripts, and having extra scripts around adds complexity to browse. This also becomes a maintenance burden, and testing can fall behind (it already has).

I did use a directory hierarchy to organize the scripts, with "top scripts" in the top level directory, which I think helps. I've also been thinking of creating a separate collection for performance engineers which contains every script (400+), and reducing the DTraceToolkit to 100 or less for the casual users.

Mistake 3. Inventing My Own Style

I should not have invented my own style to begin with. This included double asterisks for block comments, and no spaces after commas. From the original iosnoop:

#!/usr/sbin/dtrace -s
/*
** iosnoop.d - A program to print I/O events as they happen, with useful
** details such as UID, PID, inode, command, etc. 
** Written in DTrace (Solaris 10 build 51).
**
** 29-Mar-2004, ver 0.60.  (check for newer versions)
[...]

At the time I thought it was neat. I now think it looks bad. After about 50 scripts, someone from Sun suggested I follow "cstyle", as that was the standard for Sun's C code. This seemed to be a better idea than my own invented style, but, I had already written 50 scripts! I had to rewrite all the scripts to be cstyled - a nusiance. I should have asked others in the DTrace community before creating my own style, as many would have made the same suggestion: use cstyle.

Mistake 4. Complex scripts

Some of the scripts are too long and too complicated. These make poor examples to learn DTrace from, which was one of the goals of the toolkit. They also are a pain to maintain. In some cases it was necessary, since the point of the script was to resemble a typical Unix *stat tool, and therefore needed shell wrapping (getopts). Eg, iosnoop, iotop, and execsnoop. But in other cases it wasn't necessary, like with tcpsnoop.d.

One reason tcpsnoop.d is complex is the provider it uses (next mistake), but another reason was the aim. I wanted a tool that emitted the exact same output as snoop (Solaris tcpdump), but decorated with a PID column and other kernel context that can't be seen on the wire. This turned out to be very complex without a stable tcp provider to use, especially correctly matching the TCP handshake packets, RST packets from closed port SYNs, and other TCP behavior. I should have stopped and gone back to the problems this would solve, primarily, identifying TCP sessions to their PID, and quantifying their workload. That could have been solved with tracing send and receive packets alone. Other objectives, like tracing closed port packets, should have been handled by separate scripts. Doing everything at once was ideal, but not practical at the time.

The DTrace book taught me discipline for creating short, simple, and useful scripts, that fit on half a textbook page. I think that's a better approach where possible, which may mean several small scripts for specific problems, instead of one long script. That will help with readability, maintenance, and testing.

Mistake 5. The fbt Provider

I suspected I was making this mistake at the time, and I was. The function boundary tracing (fbt) provider does dynamic tracing of kernel functions, arguments, return values, and data structures. While the observability is amazing, any script that is fbt-based is tied to a kernel version (the code it instruments), and may stop working after an upgrade or a patch. This happened with tcpsnoop.d and tcptop, complex scripts that instrumented many TCP internals. An excerpt:

/*
 * TCP Fetch "port closed" ports
 */
fbt:ip:tcp_xchg:entry
/self->reset/
{
#if defined(_BIG_ENDIAN)
 self->lport = (uint16_t)arg0;
 self->fport = (uint16_t)arg1;
#else
[...]

The function this traces, tcp_xchg(), was removed in a kernel update. This broke tcpsnoop.d. The sad part is that this code was only necessary for tracing closed port packets (RST), and most of the time people weren't using tcpsnoop.d for that. See the earlier mistake. Had this been separate scripts, the breakage would be isolated, and, easier to fix and test.

I stopped creating fbt-based TCP scripts after tcpsnoop and tcptop, waiting for a stable DTrace tcp provider to be developed (which I ended up developing). I think it was useful to have them there, even though they usually didn't work, because it has been important to show that DTrace can do kernel-level TCP tracing. See my other post on the value of broken tools.

Mistake 6. The syscall Provider

I knew this mistake was theoretically possible, but never expected it would actually happen. The DTrace syscall provider is technically an unstable interface, due to the way it instruments the kernel trap table implementation. The DTrace manual mentioned the couple of places where this mattered, as probe names differed from syscall names. So any script that traces syscalls could break after a kernel upgrade, if the syscall trap table interface changed. Well, I'd expect syscalls to be added, but I didn't expect much changes to the existing names.

Oracle Solaris 11 went ahead and changed the syscall trap table significantly. This was to slightly simplify the code (housekeeping), making it a little easier for the few kernel engineers who maintain it. A side effect is that it broke many DTrace one-liners and scripts that customers were using. Taking that into consideration, I think this change was a mistake. It's adding more complexity for customers to endure than the function it provides (see the Mythical Man Month). In one use case, customers on Oracle Solaris 11 must include the birthday of an engineer in their scripts (a magic number) in order to have similar functionality as before. This is just madness.

Who's mistake is this ultimately? Everyone's. The DTrace syscall provider implementation was unstable to begin with, and while not a big problem at the time, you could argue that it should have been implemented stable from the very beginning. Perhaps I should not have used it so much, although without a stable alternative, much of the value of the DTraceToolkit would be lost. Avoiding it would also have made DTrace adoption more difficult, as tutorials usually begin by tracing the well-understood system calls. But the biggest mistake may be Oracle making these changes without also fixing the syscall provider. They did fix DTraceToolkit scripts and shipped them with Oracle Solaris 11, but much of the existing documentation for DTrace has not been fixed, nor is it easy to. All documentation needs to note that Oracle Solaris 11 is a special case, where the trap table differences are so large you do need to learn them. Fail.

Mistake 7. Scripts That Build Scripts

I shouldn't have written complex scripts that generate DTrace scripts. I did this once, with errinfo, a Perl program that builds and executes DTrace as a co-process. After finishing that I realized it was a bad idea, and did no more. It makes the script a difficult example to learn from and difficult to maintain. What I did do for many scripts (iosnoop, opensnoop, etc), was to break the program in two halves: the top half is a shell script for processing options, and the bottom half is the D script. This kept it simple, and maintenance wasn't too difficult.

Mistake 8. The Names iosnoop and iotop

I should have called these disksnoop and disktop, to make it clear that they are tracing physical I/O and not logical I/O (syscalls). It's confused a lot of people, and I'm sorry. I should have picked better names. iotop has since been ported to Linux (more than once), where the name has been kept, so I've indirectly confused those users as well.

Mistake 9. Not Testing iosnoop With High Enough Load

I didn't test iosnoop with a high enough disk IOPS rate. My development servers had a limited number of disks, where it ran fine. But on large scale servers, people reported "dynamic variable drops". The fix was to increase the dynvarsize tunable, which I have done in the next version (comes after the "switchrate" line):

 /* boost the following if you get "dynamic variable drops" */
 #pragma D option dynvarsize=16m

I could also change the key to the associative arrays to reduce overhead. It is currently the device and block number (this->dev, this->blk), but using the pre-translated message pointer (arg0) also works for many if not all OSes. The down side is that using arg0 is an unstable interface, which I'd rather avoid.

Mistake 10. Using io:genunix::start

I should not have used this probe at all, or, I should have added a comment to explain why I was. io:::start traces all disk I/O from the block device interface, and, NFS client I/O. Specifying "genunix" as the module name matched the disk I/O only, which is what I wanted, as NFS I/O was from a different module. But that's an unstable way to do it, and these scripts didn't work on other OSes that didn't have "genunix" as the module name. In later versions of my scripts, I removed the genunix, which means they also match NFS I/O.

I explained this in the DTrace book, and mentioned the stable fix: using io:::start with the predicate /args[1]->dev_ name != "nfs"/.

Mistake 11. Using Global Associative Arrays

These can become corrupted, and did for the cputimes script. This one in particular:

        /* determine the name for this thread */
        program[cpu] = pid == 0 ? idle[cpu] ? "IDLE" : "KERNEL" :
            OPT_all ? execname : "PROCESS";

That's saving a string to a global associative array, program[], which is keyed on the CPU id. Unlike aggregations or thread-local variables, these are not multi-CPU safe. cputimes would sometimes print strings that were garbled. Fortunately this wasn't subtle corruption, but was pretty obvious.

Mistake 12. dapptrace

I should have warned about the overheads of dapptrace more clearly. With the -a option, it traces every user-level function, which for some applications will significantly slow the target. Without -a, it traces just the application text segment, which, depending on the application, can also incur high overhead. I warned about it in the example file, but I should have warned about it in the script and man page as well.

Maybe I shouldn't have even written this tool. With applications, I'm usually tracing a specific function or group of functions, which minimizes the overhead.

Mistake 13. Missing Units On Output

I should have always put units on the output. For example, running iosnoop with completion times (-t, for the TIME column) and I/O delta times (start to done, -D, for the DELTA column):

# iosnoop -Dt
TIME           DELTA     UID  PID D    BLOCK   SIZE      COMM PATHNAME
633389165137   1903        0    1 W 749429752   4096   launchd ??/T/etilqs_sa...
633389165187   1461        0    1 W 385608448   4096   launchd ??/log/opendir...
633389165215   2018        0    1 W 749429656   8192   launchd ??/T/etilqs_sa...
[...]

Where they microseconds or milliseconds? I forget. While the units are documented in the script, the USAGE message, and the man page, it's really handy to have it on the output as well. The next version does this:

# iosnoop -Dt
TIME(us)       DELTA(us) UID  PID D    BLOCK   SIZE      COMM PATHNAME
[...]

All scripts should include units on the output.

Mistake 14. Capital Directories

I'm not sure starting directories with a capital letter was a great idea. It helped differentiate the scripts from the directories, and listed directories first. eg:

~/DTraceToolkit-0.99> ls
Apps            Java            Perl            User            install     
Bin             JavaScript      Php             Version         iopattern   
Code            Kernel          Proc            Zones           iosnoop     
Cpu             License         Python          dexplorer       iotop       
Disk            Locks           README          dtruss          opensnoop   
Docs            Man             Ruby            dvmstat         procsystime 
Examples        Mem             Shell           errinfo         rwsnoop     
FS              Misc            Snippits        execsnoop       rwtop       
Guide           Net             System          hotkernel       statsnoop   
Include         Notes           Tcl             hotuser

Well, that seems pretty useful. But then, so is aliasing ls to "ls -F", or the colorized version of ls. Those techniques won't list the directories first, but they will differentiate them. This may be a choice between having directories listed first and together, or, not needing to hit the shift key so much. Or, I rearrange things to not mix scripts with directories.

Mistake 15. Not Organizing For Other OSes or Kernel Versions

I didn't think much about organizing the toolkit to support multiple OSes, as those ports hadn't happened yet. I should have thought harder, since around the same time DTrace was being released as open source, and I honestly expected it to be on Linux by the end of 2005. All I did do was put the OS type and version at the top of the scripts (better than nothing). Organizing the hierarchy is a problem I need to wrestle with now.

Mistake 16. Not Testing New Kernels

I didn't have a test plan for handling numerous new kernel versions. In part because I didn't yet know how much of a problem it would be, but also because I didn't expect the DTraceToolkit to be around for that long (see next mistake). When I began writing the DTraceToolkit, there was only one kernel version to test. Soon there were two, three, and then more than I had servers at home to test on. Pretty quickly people were running the DTraceToolkit on Solaris versions that I hadn't tested at all, and I didn't have the spare equipment to test.

I did get a lot of help from Stephan Parvu, who automated testing and was able to quickly do a smoke test on several kernel versions and platforms. But as the years went by, Sun kept releasing new Solaris versions, and staying on top was hard. There have been 12 versions of Solaris 10 so far (last was Jan this year), on two different platforms (x86, SPARC), meaning there are 24 OS versions to test just for Solaris 10 coverage.

One problem was that I didn't have ready access to the different versions of Solaris to test: ideally, I'd want 24 online servers spanning every version of Solaris 10, and root logins. At one point someone at Sun was going to contribute a pool of test servers to the OpenSolaris project. Unfortunately, I had to sign the OpenSolaris contributor agreement before I could use them. I didn't, and that turned out later to be a good decision.

The other problem was that testing was very time consuming, and was most of the script development time. The smaller scripts would take about 2 hours to develop: 20 minutes to write the script, 20 minutes to write the example file and man page, and then 80 minutes to test it and fix bugs found during testing. Larger scripts took longer - some took weeks. It's easy to write a DTrace script that produces numbers, but it's harder produce accurate numbers. Testing often involved running a variety of "known workloads" and then seeing if the DTrace tool measured the workloads exactly, and investigating when it didn't.

Testing is even harder now than before, as there are multiple OSes to test.

Mistake 17. Not Planning For Success

I didn't anticipate the DTraceToolkit would be this successful, and that it would be used years later on multiple OSes. I half expected Sun to release some new DTrace-related thing, like a DTrace GUI, that would make the toolkit redundant. If I had expected it to last this long, I could have planned testing kernel versions and multiple OSes better. I guess this is a lesson for any project: what if it is really successful in the long term? What architectural choices should be made now, that won't be regretted later?

Mistake 18. Private Development and Testing

I should have made the development version of the DTraceToolkit public, even though it contained partially tested scripts. This would have allowed others to easily participate in testing these scripts, accelerating their development. Some people did help test, but this required me to email around tarballs, when I could have just had a public URL. It didn't seem like a good idea in 2005 when I created the toolkit, but nowadays it is commonplace to find both stable and unstable versions of a project online.

Mistake 19. Soliciting Contributions From Beginners

I encouraged everyone to send me scripts, which was a mistake. Most of the submissions were from DTrace beginners, some noting "this is my first ever DTrace script". Many of these produced invalid or misleading metrics. Most of the beginners were also inexperienced programmers, who didn't test or know how to test. And many didn't follow the toolkit coding style, or any programming style. I was sympathetic as they did have good intentions, and they were trying to help my project. So I would try to explain how to improve or test their script. At one point I also wrote websites on style, dos and don'ts, and hints & tips.

Problem was, most beginners didn't really understand what I was talking about, unless I spent serious time explaining. Since they had already contributed their time, some felt that I was obligated to return the favor, and explain, and keep explaining, until they understood everything. This took hours, many more than it would take to write the scripts myself. And that time didn't pay off: once the beginner realized I was serious about accuracy and testing, or the real complexity of what they were doing, they usually lost interest.

A common misconception with beginners was that the hard work was in writing the script, and that any extra work, like testing and documentation, was minor and could just be done by me. Testing is the hard work, and is where most of the development time is spent.

I should have only encouraged experienced software engineers. Some of whom did send me scripts, which I included. These were often accurate, tested, and styled to begin with. They were usually created when the engineer was debugging a particular software issue, and had created a DTrace script to solve it after reading the source code and learning all the nuances.

Mistake 20. Joining Sun

I joined Sun in late 2006, and in terms of the DTraceToolkit, this was a mistake. My work on the DTraceToolkit mostly stopped, and not just because I was busier. It's not easy or short to explain why. There are a number of pros and cons when working on an open source project like this, and before Sun it seemed that the pros outweighed the cons. Pros included things like helping fellow sysadmins – which I really enjoyed – and creating tools that I'd use in my own career. But there were various cons too, and Sun added more, to the point where it was hard to justify volunteering my spare time for this project.

Just as an example – and only because it's the easiest to summarize – at one point Sun legal became involved, and investigated whether Sun could claim ownership of the DTraceToolkit. As an employee, I was now in the crosshairs of my own company's lawyers. It wasn't fun, and they tried, very hard, for months. Thanks again to those that helped eventually put a stop to that!

I don't have hard feelings towards the lawyers. In fact, I admire how hard they were working for Sun – just like I was, so on some level it was understandable. But other events were much less understandable, and much worse. Again, there's no easy or short way to explain it all, and the problems weren't limited to the DTraceToolkit - DTrace itself came under friendly fire at Sun as well.

Instead of working on the DTraceToolkit I've been writing books, which have similar incentives: I get to help other people in the industry, especially sysadmins, as well as create references for myself, which also include tools. I never stopped contributing - I changed the medium to do so.

Learning From Mistakes

I need to:

Get the DTraceToolkit on github.
Organize it to support different OSes.
Update existing scripts and add missing scripts.
Split the DTraceToolkit into two: a small number of well-tested scripts for casual users, and a large number of less-tested scripts for performance engineers.

And by less-tested I do mean probably broken. This is embracing the reality of dynamic tracing, which exposes interfaces that are not stable. That doesn't make the scripts useless, rather, these scripts need to be treated differently: they are examples of solving problems (even if they don't work), not ready-to-go tools.

The split can be either in the DTraceToolkit as subdirectories, or as separate toolkits. This addresses several of the mistakes listed above. I hope to follow up this post with another explaining how the split is done, which can refer to this post as background.

And the final lesson I should learn: I should stop writing books, to get my spare time back! (My colleagues wouldn't believe that for a second.) Ok, well, maybe I'll a break from writing very long books, how about that? :-)

The Wall

05 September, 2013

DTraceToolkit 0.XX Mistakes