User Process Tracing
DTrace is an extremely powerful tool for understanding the behavior of user processes. DTrace can be invaluable when debugging, analyzing performance problems, or simply understanding the behavior of a complex application. This chapter focuses on the DTrace facilities relevant for tracing user process activity and provides examples to illustrate their use.
copyin and copyinstr Subroutines
DTrace's interaction with processes is a little different than most traditional debuggers or observability tools. Many such tools appear to execute within the scope of the process, letting users dereference pointers to program variables directly. Rather than appearing to execute within or as part of the process itself, DTrace probes execute in the Solaris kernel. To access process data, a probe needs to use the copyin or copyinstr subroutines to copy user process data into the address space of the kernel.
For example, consider the following write(2) system call:
ssize_t write(int fd, const void *buf, size_t nbytes);
The following D program illustrates an incorrect attempt to print the contents of a string passed to the write(2) system call:
syscall::write:entry { printf("%s", stringof(arg1)); /* incorrect use of arg1 */ }
If you try to run this script, DTrace will produce error messages similar to the following example:
dtrace: error on enabled probe ID 1 (ID 37: syscall::write:entry): \ invalid address (0x10038a000) in action #1
The arg1 variable, containing the value of the buf parameter, is an address that refers to memory in the process executing the system call. To read the string at that address, use the copyinstr subroutine and record its result with the printf action:
syscall::write:entry { printf("%s", copyinstr(arg1)); /* correct use of arg1 */ }
The output of this script shows all of the strings being passed to the write(2) system call. Occasionally, however, you might see irregular output similar to the following example:
0 37 write:entry mada&^%**&
The copyinstr subroutine acts on an input argument that is the user address of a null-terminated ASCII string. However, buffers passed to the write(2) system call might refer to binary data rather than ASCII strings or to ASCII strings which don't include a terminating null byte. To print only as much of the string as the caller intended, use the two parameter version of the copyinstr subroutine which includes the size of the targeted string buffer:
syscall::write:entry { printf("%s", copyinstr(arg1, arg2)); }
An alternate way to accomplish the same end would be to use the copyin subroutine which takes an address and size:
syscall::write:entry { printf("%s", stringof(copyin(arg1, arg2))); }
Notice that the stringof operator is necessary so that DTrace properly converts the user data retrieved using copyin to a string. The use of stringof is not necessary when using copyinstr because this function always returns type string.
Avoiding Errors
The copyin and copyinstr subroutines cannot read from user addresses which have not yet been touched so even a valid address may cause an error if the page containing that address has not yet been faulted in by being accessed. Consider the following example:
# dtrace -n syscall::openat:entry'{ trace(copyinstr(arg1)); }' dtrace: description 'syscall::openat:entry' matched 1 probe CPU ID FUNCTION:NAME dtrace: error on enabled probe ID 2 (ID 50: syscall::openat:entry): invalid address (0x9af1b) in action #1 at DIF offset 52
In the above example output, the application was functioning properly, and the address in arg0 was valid, but it referred to a page that had not yet been accessed by the corresponding process. To resolve this issue, wait for kernel or application to use the data before tracing it. For example, you might wait until the system call returns to apply copyinstr, as shown in the following example:
# dtrace -n syscall::openat:entry'{ self->file = arg1; }' \ -n syscall::openat:return'{ trace(copyinstr(self->file)); self->file = 0; }' dtrace: description 'syscall::openat:entry' matched 1 probe CPU ID FUNCTION:NAME 2 51 open:return /dev/null
Eliminating dtrace(1M) Interference
If you trace every call to the write(2) system call, you will cause a cascade of output. Each call to write causes the dtrace(1M) command to call write as it displays the output, and so on. This feedback loop is a good example of how the dtrace command can interfere with the desired data. You can use a simple predicate to prevent these unwanted data from being traced:
syscall::write:entry /pid != $pid/ { printf("%s", stringof(copyin(arg1, arg2))); } The $pid macro variable expands to the process identifier of the process that enabled the probes. The pid variable contains the process identifier of the process whose thread was running on the CPU where the probe was fired. Therefore the predicate /pid != $pid/ ensures that the script does not trace any events related to the running of this script itself. |
syscall Provider
The syscall provider enables you to trace every system call entry and return. System calls can be a good starting point for understanding a process's behavior, especially if the process seems to be spending a large amount of time executing or blocked in the kernel. You can use the prstat(1M) command to see where processes are spending time:
$ prstat -m -p 31337 PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/NLWP 13499 user1 53 44 0.0 0.0 0.0 0.0 2.5 0.0 4K 24 9K 0 mystery/6
This example shows that the process is consuming a large amount of system time. One possible explanation for this behavior is that the process is executing a large number of system calls. You can use a simple D program specified on the command-line to see which system calls are happening most often:
# dtrace -n syscall:::entry'/pid == 31337/{ @syscalls[probefunc] = count(); }' dtrace: description 'syscall:::entry' matched 215 probes ^C open 1 lwp_park 2 times 4 fcntl 5 close 6 sigaction 6 read 10 ioctl 14 sigprocmask 106 write 1092
This report shows which system calls are being called most often, in this case, the write(2) system call. You can use the syscall provider to further examine the source of all the write system calls:
# dtrace -n syscall::write:entry'/pid == 31337/{ @writes = quantize(arg2); }' dtrace: description 'syscall::write:entry' matched 1 probe ^C value ------------- Distribution ------------- count 0 | 0 1 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1037 2 |@ 3 4 | 0 8 | 0 16 | 0 32 |@ 3 64 | 0 128 | 0 256 | 0 512 | 0 1024 |@ 5 2048 | 0
The output shows that the process is executing many write system calls with a relatively small amount of data. This ratio could be the source of the performance problem for this particular process. This example illustrates a general methodology for investigating system call behavior.
ustack Action
Tracing a process thread's stack at the time a particular probe is activated is often useful for examining a problem in more detail. The ustack action traces the user thread's stack. If, for example, a process that opens many files occasionally fails in the open(2) system call, you can use the ustack action to discover the code path that executes the failed open:
syscall::openat:entry /pid == $1/ { self->path = copyinstr(arg0); } syscall::openat:return /self->path != NULL && arg1 == -1/ { printf("open for '%s' failed", self->path); ustack(); }
This script also illustrates the use of the $1 macro variable which takes the value of the first operand specified on the dtrace(1M) command-line:
# dtrace -s ./badopen.d 31337 dtrace: script './badopen.d' matched 2 probes CPU ID FUNCTION:NAME 0 40 open:return open for '/usr/lib/foo' failed libc.so.1`__open+0x4 libc.so.1`open+0x6c 420b0 tcsh`dosource+0xe0 tcsh`execute+0x978 tcsh`execute+0xba0 tcsh`process+0x50c tcsh`main+0x1d54 tcsh`_start+0xdc
The ustack action records program counter (PC) values for the stack and dtrace(1M) resolves those PC values to symbol names by looking though the process's symbol tables. If dtrace can't resolve the PC value to a symbol, it will print out the value as a hexadecimal integer.
If a process exits or is killed before the ustack data is formatted for output, dtrace might be unable to convert the PC values in the stack trace to symbol names, and will be forced to display them as hexadecimal integers. To work around this limitation, specify a process of interest with the -c or -p option to dtrace. See Chapter 14, dtrace(1M) Utility for details on these and other options. If the process ID or command is not known in advance, the following example D program that can be used to work around the limitation:
/* * This example uses the open(2) system call probe, but this technique * is applicable to any script using the ustack() action where the stack * being traced is in a process that may exit soon. */ syscall::openat:entry { ustack(); stop_pids[pid] = 1; } syscall::rexit:entry /stop_pids[pid] != 0/ { printf("stopping pid %d", pid); stop(); stop_pids[pid] = 0; }
The above script stops a process just before it exits if the ustack action has been applied to a thread in that process. This technique ensures that the dtrace command will be able to resolve the PC values to symbolic names. Notice that the value of stop_pids[pid] is set to 0 after it has been used to clear the dynamic variable. Remember to set stopped processes running again using the prun(1) command or your system will accumulate many stopped processes.
uregs[] Array
The uregs[] array enables you to access individual user registers. The following tables list indices into the uregs[] array corresponding to each supported Solaris system architecture.
SPARC uregs[] Constants
Constant | Register |
---|---|
R_G0..R_G7 | %g0..%g7 global registers |
R_O0..R_O7 | %o0..%o7 out registers |
R_L0..R_L7 | %l0..%l7 local registers |
R_I0..R_I7 | %i0..%i7 in registers |
R_CCR | %ccr condition code register |
R_PC | %pc program counter |
R_NPC | %npc next program counter |
R_Y | %y multiply/divide register |
R_ASI | %asi address space identifier register |
R_FPRS | %fprs floating-point registers state |
x86 uregs[] Constants
Constant | Register |
---|---|
R_CS | %cs |
R_GS | %gs |
R_ES | %es |
R_DS | %ds |
R_EDI | %edi |
R_ESI | %esi |
R_EBP | %ebp |
R_EAX | %eax |
R_ESP | %esp |
R_EAX | %eax |
R_EBX | %ebx |
R_ECX | %ecx |
R_EDX | %edx |
R_TRAPNO | %trapno |
R_ERR | %err |
R_EIP | %eip |
R_CS | %cs |
R_ERR | %err |
R_EFL | %efl |
R_UESP | %uesp |
R_SS | %ss |
On AMD64 platforms, the uregs array has the same content as it does on x86 platforms, plus the additional elements listed in the following table:
amd64 uregs[] Constants
Constant | Register |
---|---|
R_RSP | %rsp |
R_RFL | %rfl |
R_RIP | %rip |
R_RAX | %rax |
R_RCX | %rcx |
R_RDX | %rdx |
R_RBX | %rbx |
R_RBP | %rbp |
R_RSI | %rsi |
R_RDI | %rdi |
R_R8 | %r8 |
R_R9 | %r9 |
R_R10 | %r10 |
R_R11 | %r11 |
R_R12 | %r12 |
R_R13 | %r13 |
R_R14 | %r14 |
R_R15 | %r15 |
The aliases listed in the following table can be used on all platforms:
Common uregs[] Constants
Constant | Register |
---|---|
R_PC | program counter register |
R_SP | stack pointer register |
R_R0 | first return code |
R_R1 | second return code |
pid Provider
The pid provider enables you to trace any instruction in a process. Unlike most other providers, pid probes are created on demand based on the probe descriptions found in your D programs. As a result, no pid probes are listed in the output of dtrace -l until you have enabled them yourself.
User Function Boundary Tracing
The simplest mode of operation for the pid provider is as the user space analogue to the fbt provider. The following example program traces all function entries and returns that are made from a single function. The $1 macro variable (the first operand on the command line) is the process ID for the process to trace. The $2 macro variable (the second operand on the command line) is the name of the function from which to trace all function calls.
Example: userfunc.d: Trace User Function Entry and Return
pid$1::$2:entry { self->trace = 1; } pid$1::$2:return /self->trace/ { self->trace = 0; } pid$1:::entry, pid$1:::return /self->trace/ { }
Type in the above example script and save it in a file named userfunc.d, and then chmod it to be executable. This script produces output similar to the following example:
# ./userfunc.d 15032 execute dtrace: script './userfunc.d' matched 11594 probes 0 -> execute 0 -> execute 0 -> Dfix 0 <- Dfix 0 -> s_strsave 0 -> malloc 0 <- malloc 0 <- s_strsave 0 -> set 0 -> malloc 0 <- malloc 0 <- set 0 -> set1 0 -> tglob 0 <- tglob 0 <- set1 0 -> setq 0 -> s_strcmp 0 <- s_strcmp ...
The pid provider can only be used on processes that are already running. You can use the $target macro variable (see Chapter 15, Scripting) and the dtrace -c and -p options to create and grab processes of interest and instrument them using DTrace. For example, the following D script can be used to determine the distribution of function calls made to libc by a particular subject process:
pid$target:libc.so::entry { @[probefunc] = count(); }
To determine the distribution of such calls made by the date(1) command, save the script in a file named libc.d and execute the following command:
# dtrace -s libc.d -c date dtrace: script 'libc.d' matched 2476 probes Fri Jul 30 14:08:54 PDT 2004 dtrace: pid 109196 has exited pthread_rwlock_unlock 1 _fflush_u 1 rwlock_lock 1 rw_write_held 1 strftime 1 _close 1 _read 1 __open 1 _open 1 strstr 1 load_zoneinfo 1 ... _ti_bind_guard 47 _ti_bind_clear 94
Tracing Arbitrary Instructions
You can use the pid provider to trace any instruction in any user function. Upon demand, the pid provider will create a probe for every instruction in a function. The name of each probe is the offset of its corresponding instruction in the function expressed as a hexadecimal integer. For example, to enable a probe associated with the instruction at offset 0x1c in function foo of module bar.so in the process with PID 123, you can use the following command:
# dtrace -n pid123:bar.so:foo:1c
To enable all of the probes in the function foo, including the probe for each instruction, you can use the command:
# dtrace -n pid123:bar.so:foo:
This command demonstrates an extremely powerful technique for debugging and analyzing user applications. Infrequent errors can be difficult to debug because they can be difficult to reproduce. Often, you can identify a problem after the failure has occurred, too late to reconstruct the code path. The following example demonstrates how to combine the pid provider with speculative tracing (see Chapter 13, Speculative Tracing) to solve this problem by tracing every instruction in a function.
Example: errorpath.d: Trace User Function Call Error Path
pid$1::$2:entry { self->spec = speculation(); speculate(self->spec); printf("%x %x %x %x %x", arg0, arg1, arg2, arg3, arg4); } pid$1::$2: /self->spec/ { speculate(self->spec); } pid$1::$2:return /self->spec && arg1 == 0/ { discard(self->spec); self->spec = 0; } pid$1::$2:return /self->spec && arg1 != 0/ { commit(self->spec); self->spec = 0; }
Executing errorpath.d results in output similar to the following example:
# ./errorpath.d 100461 _chdir dtrace: script './errorpath.d' matched 19 probes CPU ID FUNCTION:NAME 0 25253 _chdir:entry 81e08 6d140 ffbfcb20 656c73 0 0 25253 _chdir:entry 0 25269 _chdir:0 0 25270 _chdir:4 0 25271 _chdir:8 0 25272 _chdir:c 0 25273 _chdir:10 0 25274 _chdir:14 0 25275 _chdir:18 0 25276 _chdir:1c 0 25277 _chdir:20 0 25278 _chdir:24 0 25279 _chdir:28 0 25280 _chdir:2c 0 25268 _chdir:return