The Wall: November 2005

30 November, 2005

DTraceToolkit 0.88

I've just uploaded the latest version of the DTraceToolkit, version 0.88 (88 scripts). I've updated the OpenSolaris DTraceToolkit site to point to the new version. This version has many updated scripts and a few new ones.

Between 0.80 and 1.00 I'll be spending more time revisiting and retesting existing code than adding new scripts.

Homepage

Let me explain what is going on with my homepage URL for future reference.

My homepage is at http://www.brendangregg.com. It's a DNS pointer that points to wherever my homepage actually lives, which may well change - however my name will not!

And yes, it is likely that my actual homepage will move at some point. Some people have experienced DNS problems with the current location (users.tpg.com.au), and I've just uploaded a new DTraceToolkit and have almost run out of space:

1070 files used (10%) - authorized: 10000 files
30688 Kbytes used (99%) - authorized: 30720 Kb

If you've linked to www.brendangregg.com, then no problem - it will always point to the right place (which is why I have the thing - I have been through a painful website move in the past).

29 November, 2005

Sys Admin Magazine

The December 2005 issue contains an article on the DTraceToolkit written by Ryan Matteson. Grab a copy! The article is "Observing I/O Behavior with the DTraceToolkit", and is quite good. It was also selected as the feature article - which means it will be available online for some time:

http://www.samag.com/documents/sam0512a/

Thanks Matty, and Sys Admin Magazine!

24 November, 2005

DTrace Translators

While teaching a DTrace class in Sydney, I've been asked about translators. They are quite useful, so I've prepared the following as a quick demo.

This is a DTrace program that traces the time() syscall (which the syscall provider exposes as gtime) and prints the process, its parent, its grand-parent, and so on.

#!/usr/sbin/dtrace -s

/* Declare Translator */

typedef struct ancestory {
        string me;              /* my cmd */
        string p;               /* parent cmd */
        string gp;              /* grand-parent cmd */
        string ggp;             /* great-grand-parent cmd */
        string gggp;            /* great-great-grand-parent cmd */
} ancestory_t;

translator ancestory_t < struct _kthread *T > {

        /* fetch my details */
        me = T->t_procp->p_user.u_comm;

        /* fetch ancestor details if they exist */
        p = T->t_procp->p_parent != NULL ?
            T->t_procp->p_parent->p_user.u_comm :
            "<none>";
        gp = T->t_procp->p_parent != NULL ?
            T->t_procp->p_parent->p_parent != NULL ?
            T->t_procp->p_parent->p_parent->p_user.u_comm :
            "<none>" : "<none>";
        ggp = T->t_procp->p_parent != NULL ?
            T->t_procp->p_parent->p_parent != NULL ?
            T->t_procp->p_parent->p_parent->p_parent != NULL ?
            T->t_procp->p_parent->p_parent->p_parent->p_user.u_comm :
            "<none>" : "<none>" : "<none>";
        gggp = T->t_procp->p_parent != NULL ?
            T->t_procp->p_parent->p_parent != NULL ?
            T->t_procp->p_parent->p_parent->p_parent != NULL ?
            T->t_procp->p_parent->p_parent->p_parent->p_parent != NULL ?
            T->t_procp->p_parent->p_parent->p_parent->p_parent->p_user.u_comm :
            "<none>" : "<none>" : "<none>" : "<none>";
};

inline ancestory_t *ancestors = xlate <ancestory_t *> (curthread);

/* Main Program */

syscall::gtime:entry
{
        printf("%s, %s, %s, %s, %s", ancestors->me,
            ancestors->p, ancestors->gp, ancestors->ggp, ancestors->gggp);
}

The main program at the end is quite concise: it prints the details from "ancestors". The translator has walked the p_parent pointers carefully, returning "<none>" if the pointer is NULL. ("ancestors->me" is unnecessary since we have "execname"; I've included it as a simple demonstration.)

The output is:

# ./transdemo.d
dtrace: script './transdemo.d' matched 1 probe
CPU     ID                    FUNCTION:NAME
  0   6615                      gtime:entry bash, sh, bash, sshd, sshd
  0   6615                      gtime:entry date, bash, sh, bash, sshd
  0   6615                      gtime:entry bash, sh, bash, sshd, sshd
  0   6615                      gtime:entry nscd, init, sched, <none>, <none>
  0   6615                      gtime:entry nscd, init, sched, <none>, <none>
  0   6615                      gtime:entry nscd, init, sched, <none>, <none>
  0   6615                      gtime:entry nscd, init, sched, <none>, <none>
[...]

Without our careful pointer tests, the NULL parent pointers would have caused DTrace to print errors rather than our "<none>" keywords.
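As a rough cross-check from userland, the same ancestry walk can be sketched in shell with ps(1). This is only an approximation, and it makes a few assumptions: the function name "ancestry" is made up for this example, the command names ps reports may not match u_comm exactly, and each step is a separate ps call, so the chain can change underneath it if processes exit.

```shell
#!/bin/sh
# Userland approximation of the translator (hypothetical helper, not part of
# the DTrace script): print a process and up to four ancestors, using
# "<none>" once the parent chain ends.

ancestry() {            # usage: ancestry <pid>
        pid=$1
        out=""
        i=0
        while [ "$i" -lt 5 ]; do
                name=""
                if [ -n "$pid" ]; then
                        name=$(ps -o comm= -p "$pid" 2>/dev/null)
                        next=$(ps -o ppid= -p "$pid" 2>/dev/null | tr -d ' ')
                        # end the chain if a process is listed as its own parent
                        [ "$next" = "$pid" ] && next=""
                        pid=$next
                fi
                # an empty name means there was no process to look up
                out="${out}${out:+, }${name:-<none>}"
                i=$((i + 1))
        done
        echo "$out"
}

ancestry $$
```

Run from an interactive shell, this should print a comma-separated line in the same shape as the DTrace output, for the current shell only rather than for every process calling time().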

The "Declare Translator" section can be cut-n-pasted into a new .d file in /usr/lib/dtrace (e.g., /usr/lib/dtrace/anscestors.d), where it will be automatically imported by every future DTrace script.

Take a look under /usr/lib/dtrace at the existing translator scripts; they are quite fascinating.

19 November, 2005

ZFS

The doors have been flung open for ZFS, Sun's "last word in filesystems". It's now in OpenSolaris and there is a ZFS Community page where you can find introductions, demonstrations, and advanced discussions. We don't know when ZFS will appear in Solaris 10, but its appearance in OpenSolaris shows that the process has begun.

ZFS raises the bar for filesystems to a new height, and even changes the way you think about filesystems. Let me provide a quick demo, although I don't have an array of spare disks handy - you'll need to pretend that each of these 1 GB slices is actually a separate disk:

# zpool create apps mirror c0t1d0s0 c0t1d0s1 mirror c0t1d0s3 c0t1d0s4

That's it - one command for a ZFS pool that is both mirrored and dynamically striped (think RAID 1+0), 256-bit checksummed, remounted on boot, and able to grow to a virtually unlimited size.

Let's run a few status commands to check it worked.

# zpool list
NAME                    SIZE    USED   AVAIL    CAP  HEALTH     ALTROOT
apps                   1.98G   33.0K   1.98G     0%  ONLINE     -
#
# zpool status
  pool: apps
 state: ONLINE
 scrub: none requested
config:

        NAME          STATE     READ WRITE CKSUM
        apps          ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t1d0s0  ONLINE       0     0     0
            c0t1d0s1  ONLINE       0     0     0
          mirror      ONLINE       0     0     0
            c0t1d0s3  ONLINE       0     0     0
            c0t1d0s4  ONLINE       0     0     0
#
# df -h -F zfs
Filesystem             size   used  avail capacity  Mounted on
apps                   2.0G     8K   2.0G     1%    /apps

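The size of 2 GB is correct: each mirror of two 1 GB slices provides 1 GB of usable space, and the two mirrors are striped together. The "zpool status" command neatly prints the layout.

Now perhaps a slightly more realistic demo (although still no separate disks, sorry). Rather than having all the disks combine to one filesystem, ZFS is really intended to combine disks into pools, and then have multiple filesystems share a pool. Since this demo reuses the same slices, the earlier "apps" pool has to be destroyed first (zpool destroy apps). The following quick demo shows the pool approach: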

# zpool create fast mirror c0t1d0s0 c0t1d0s1 mirror c0t1d0s3 c0t1d0s4
# zfs create fast/apps
# zfs create fast/oracle
# zfs create fast/home
# zfs set mountpoint=/export/home fast/home
# zfs set compression=on fast/home
# zfs set quota=500m fast/home
# zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
fast                  91.0K  1.97G   9.5K  /fast
fast/apps                8K  1.97G     8K  /fast/apps
fast/home                8K   500M     8K  /export/home
fast/oracle              8K  1.97G     8K  /fast/oracle

Each filesystem may have different options set, such as quotas, reservations and compression. If a filesystem is running out of space, its quota can be changed live with a single command. If the pool is running out of space, disks can be added live with a single command.
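Here is a minimal sketch of what those two single commands might look like for the "fast" pool from the demo. The new quota size (1g) and the extra slices (c0t1d0s5, c0t1d0s6) are hypothetical, as are the helper function names. To be safe, the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands by default; set DRY_RUN=0 to execute them.
# Pool and filesystem names come from the demo above; the 1g quota and the
# slices c0t1d0s5 and c0t1d0s6 are hypothetical.

run() {
        if [ "${DRY_RUN:-1}" = "1" ]; then
                echo "$*"
        else
                "$@"
        fi
}

# Raise the quota on a live filesystem.
raise_quota() {         # usage: raise_quota <filesystem> <size>
        run zfs set "quota=$2" "$1"
}

# Grow a live pool by adding another mirrored pair of devices.
add_mirror() {          # usage: add_mirror <pool> <device1> <device2>
        run zpool add "$1" mirror "$2" "$3"
}

raise_quota fast/home 1g
add_mirror fast c0t1d0s5 c0t1d0s6
```

Run as-is, it just echoes the "zfs set" and "zpool add" command lines so they can be checked before anything touches the pool.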

As a programmer there are times when you encounter something that is so elegant and obvious that you are struck with a feeling that it is right. For a moment you can clearly see what the developer was thinking and that they achieved it perfectly. ZFS is one of those moments.