24 June, 2008

DTrace in New York

Back in February I gave several DTrace talks in New York, including one at the New York OpenSolaris User Group meeting (NYCOSUG). I used an updated slide deck and was asked to put the PDF on my blog; I think Isaac must have beaten me to it and put it here (thanks!). I had intended to blog about this in case anyone was looking for it - sorry for the delay.

The NYCOSUG had a good turnout and asked some great questions, allowing me to deviate from the prepared slides and cover other things of interest (which is the value of an in-person presentation). After the presentation I realised there was one point I could have explained better, which would make an interesting blog post.

I started with the following simple demos - the point of these is to build on something commonly understood (such as the behaviour of fork() and exec()) in order to introduce something new - DTrace.

Tracing exec():
# dtrace -n 'syscall::exec*: { trace(execname); }'
dtrace: description 'syscall::exec*: ' matched 4 probes
CPU     ID                    FUNCTION:NAME
  0  98087                      exece:entry   bash
  0  98088                     exece:return   ls
^C

In the above output, we traced an exece() system call - printing the current process name when we entered and returned from that system call. That process name changed from "bash" to "ls" (I executed "ls -l" in another window), which is what exec() does - replaces the current process image with another.

While unsurprising, the significance is that we are able to dynamically trace this kernel activity whenever we would like, along with thousands of other kernel events. I could, for example, trace the time taken for exec() to execute, or the status and error code when exec() returned; I can also trace the internal operation of exec() with enough detail to fill hundreds of pages (I just counted 47396 lines of output when tracing every kernel function entry and return during exec()).
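
As a small aside, here is how the first of those might look - timing exec() and printing the error code on return. This is only a sketch in the style of the one-liners above (it assumes the same syscall provider, and uses DTrace's built-in errno variable):

# dtrace -n 'syscall::exec*:entry { self->start = timestamp; }
    syscall::exec*:return /self->start/ {
        /* time from entry to return, in microseconds, plus the error code */
        printf("%s: %d us, errno %d", execname, (timestamp - self->start) / 1000, errno);
        self->start = 0;
    }'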

Now to trace fork():
# dtrace -n 'syscall::fork*: { trace(pid); }'
dtrace: description 'syscall::fork*: ' matched 6 probes
CPU     ID                    FUNCTION:NAME
  0  98227                    forksys:entry   87417
  0  98228                   forksys:return   90769
  0  98228                   forksys:return   87417
^C

The above system call has one entry and two returns - which is what we expect from the fork() family.

Simple as this is, some interesting behaviour is already visible. Notice that the child returned before the parent? On the fork() entry the parent's PID is traced (87417), yet the child's PID (90769) appears first on return.

It's possible that the output could be shuffled due to how DTrace uses per-CPU buffers to minimise performance impact; to double-check, add a "timestamp" column and post-sort:

# dtrace -n 'syscall::fork*: { printf("%d %d", timestamp, pid); }' -o /tmp/out.dtrace
dtrace: description 'syscall::fork*: ' matched 6 probes
# sort -n +3 /tmp/out.dtrace
CPU     ID                    FUNCTION:NAME
  1  98227                    forksys:entry 268361844462135 87417
  1  98228                   forksys:return 268361844960455 90968
  0  98228                   forksys:return 268361844965924 87417

I asked the audience - why does the child return from fork() before the parent? I added that this was a very difficult question!

Someone responded to say that this was how all operating systems worked - the parent process waits for the child to complete. I said that I was only tracing fork(), and that the parent could be scheduled first - but deliberately isn't - and explained why. My answer left them confused, and it struck me afterwards that I should have explained this better.

Consider this:
# dtrace -n 'syscall::fork*:,syscall::wait*: { trace(pid); }'
dtrace: description 'syscall::fork*:,syscall::wait*: ' matched 10 probes
CPU     ID                    FUNCTION:NAME
  0  98227                    forksys:entry   87417
  0  98228                   forksys:return   91088
  0  98228                   forksys:return   87417
  0  98163                    waitsys:entry   87417
  0  98164                   waitsys:return   87417
^C

In the above output we can see both fork() and wait(), and we can discuss behaviour such as the parent process waiting for the child to complete (since I was in a shell running foreground commands).

But I was actually asking a much deeper question, that of thread scheduling immediately after the fork() system call, and before the parent has called wait(). Immediately after fork() you have two threads - which should go on-CPU first? The parent, so that it can get to wait() sooner and before the child may have exited? Or is there a reason to schedule the child first?
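
The scheduling itself can also be watched, rather than inferred from system call returns. The following is only a sketch (it assumes the Solaris proc and sched providers; the parent and child array names are just illustrative): it remembers both PIDs when a process is created, then prints a timestamp the next time each goes on-CPU, which can be post-sorted as before:

# dtrace -n '
    proc:::create
    {
            /* fires in the parent; args[0] is the psinfo of the new child */
            parent[pid] = 1;
            child[args[0]->pr_pid] = 1;
    }
    sched:::on-cpu
    /child[pid]/
    {
            printf("%d child %d on-cpu", timestamp, pid);
            child[pid] = 0;
    }
    sched:::on-cpu
    /parent[pid]/
    {
            printf("%d parent %d on-cpu", timestamp, pid);
            parent[pid] = 0;
    }' -o /tmp/out.dtrace

Sorting that output numerically on the timestamp column should show which of the two went on-CPU first.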

As DTrace shows, the child is getting scheduled first, and the reason is one of performance. The source code explains why, in uts/common/disp/ts.c:

/*
 * Child is placed at back of dispatcher queue and parent gives
 * up processor so that the child runs first after the fork.
 * This allows the child immediately execing to break the multiple
 * use of copy on write pages with no disk home. The parent will
 * get to steal them back rather than uselessly copying them.
 */
static void
ts_forkret(kthread_t *t, kthread_t *ct)

The fork() system call creates a clone of the parent, but rather than copy all memory pages to a new address space (which would add significant latency during process creation), Solaris bumps a reference counter on those memory pages to remember that two processes refer to the same data. If one of them later writes to a memory page, this triggers a "copy on write" to create a private writable copy for that process. This means that expensive memory copies are only performed when needed - or if needed. Since a child process is likely to call exec(), it is likely to simply drop many existing memory references for the new process image, so copying those bytes would have been wasted cycles anyway.
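
Copy-on-write activity itself is something DTrace can watch. As a quick sketch (assuming the vminfo provider, whose cow_fault probe corresponds to the "cow" statistic reported by vmstat), copy-on-write faults can be counted by process name:

# dtrace -n 'vminfo:::cow_fault { @[execname] = count(); }'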

However, if the parent is scheduled first - before the child has had a chance to exec() - then the parent may continue writing to its address space, triggering copy-on-writes. Then the child runs, calls exec(), and drops those newly copied pages - which were copied in vain. To avoid this, the child is scheduled first, so that it can call exec() as soon as possible, as described in the comment above.
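
How quickly a child gets from fork() to exec() can also be measured. Again, this is just a sketch (it assumes the Solaris proc provider; the born array name is illustrative): it records each child's creation time and prints the delay when that child successfully execs:

# dtrace -n '
    proc:::create { born[args[0]->pr_pid] = timestamp; }
    proc:::exec-success
    /born[pid]/
    {
            /* delay from process creation to a successful exec, in microseconds */
            printf("%s execed %d us after fork", execname, (timestamp - born[pid]) / 1000);
            born[pid] = 0;
    }'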

I learned about this behaviour when reading Solaris Internals 1st edition, but that was before DTrace and OpenSolaris existed. It's great that we can all now both read the code and use DTrace to see it in operation.

18 February, 2008

DTraceToolkit in MacOS X

Apple included DTrace in MacOS X 10.5 (Leopard), released in October 2007. It's great to have DTrace available in MacOS X for powerful application and kernel performance analysis. To think that there is now another kernel we can examine using DTrace is exciting - it's like discovering a new planet in the solar system.

Apart from kernel analysis, DTrace also improves general usability by answering everyday questions such as: why are my disks rattling? or why does my browser keep hanging? The average user may not write DTrace scripts to answer these questions themselves (it's better if they do), but can instead use prewritten scripts.

MacOS X includes a collection of DTrace scripts in /usr/bin, mostly from the DTraceToolkit:
leopard# grep -l DTrace /usr/bin/*
/usr/bin/bitesize.d
/usr/bin/cpuwalk.d
/usr/bin/creatbyproc.d
/usr/bin/dappprof
/usr/bin/dapptrace
/usr/bin/diskhits
/usr/bin/dispqlen.d
/usr/bin/dtruss
/usr/bin/errinfo
/usr/bin/execsnoop
/usr/bin/fddist
/usr/bin/filebyproc.d
/usr/bin/hotspot.d
/usr/bin/httpdstat.d
/usr/bin/iofile.d
/usr/bin/iofileb.d
/usr/bin/iopattern
/usr/bin/iopending
/usr/bin/iosnoop
/usr/bin/iotop
/usr/bin/kill.d
/usr/bin/lastwords
/usr/bin/loads.d
/usr/bin/newproc.d
/usr/bin/opensnoop
/usr/bin/pathopens.d
/usr/bin/pidpersec.d
/usr/bin/plockstat
/usr/bin/priclass.d
/usr/bin/pridist.d
/usr/bin/procsystime
/usr/bin/runocc.d
/usr/bin/rwbypid.d
/usr/bin/rwbytype.d
/usr/bin/rwsnoop
/usr/bin/sampleproc
/usr/bin/seeksize.d
/usr/bin/setuids.d
/usr/bin/sigdist.d
/usr/bin/syscallbypid.d
/usr/bin/syscallbyproc.d
/usr/bin/syscallbysysc.d
/usr/bin/topsyscall
/usr/bin/topsysproc
/usr/bin/weblatency.d

That's 44 DTraceToolkit scripts, plus plockstat from Solaris 10. While the DTraceToolkit now has over 200 scripts, it makes sense to pick out the most useful scripts for inclusion in /usr/bin.

Popular scripts such as iosnoop can now be run by MacOS X users:

leopard# iosnoop
UID PID D BLOCK SIZE COMM PATHNAME
501 130 R 31987472 40960 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 7879952 8192 Terminal ??/SearchManager.nib/keyedobjects.nib
501 130 R 32132304 12288 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 32132528 4096 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 32047696 12288 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 32132592 4096 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 32131512 12288 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 32033296 12288 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 32044488 4096 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 32045064 4096 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 32131344 4096 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 32048680 16384 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 32132544 8192 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 32049296 12288 Terminal ??/dyld/dyld_shared_cache_i386
-1 0 W 32482848 86016 kernel_task ??/vm/swapfile2
-1 0 W 32483040 135168 kernel_task ??/vm/swapfile2
501 130 R 32044672 4096 Terminal ??/dyld/dyld_shared_cache_i386
501 130 R 32132656 12288 Terminal ??/dyld/dyld_shared_cache_i386
[...]

The man pages are conveniently included in /usr/share/man.

I had been making preparations in the latest DTraceToolkit (0.99) for MacOS X DTrace, such as putting an "OS" field into the man pages and figuring out how to support different versions of the same script (tcpsnoop_snv, etc.). Hopefully many scripts will run on both Solaris and MacOS X (especially those that use stable providers); however, I expect there will be some that are specific to each. Now that QNX DTrace also exists, there is additional need for identifying OS specifics in the DTraceToolkit.

It's been great news for DTrace, Sun and Apple - Apple has not only gained the best performance and debugging tool available, but also the existing DTrace community.

17 February, 2008

Browsable DTraceToolkit

Stefan Parvu has created browsable HTML versions of the DTraceToolkit on the DTT test page. See DTraceToolkit ver 0.99 to browse that version.

A goal of the DTraceToolkit is to provide documented examples of DTrace scripting, in addition to what is available in the DTrace Guide. However, these examples have been reaching a limited audience: those who download, unzip, and browse through the text files.

Now that the DTraceToolkit is browsable online, its contents can be found by internet search engines. This should help people not only find examples of DTrace usage, but also solutions to some common observability problems.

There have been a few other items of DTraceToolkit news which I'll blog about soon. Please excuse my infrequent blog postings - I've been busy since joining Sun on a particular project which consumes most of my spare time. It will be worth it, which should be clear once I can start posting about it on my Sun blog.