#+TITLE: Dynamic Tracing on OpenBSD 7.3 #+DATE: 2023-05-21T23:43:05-04:00 #+DRAFT: false #+DESCRIPTION: Taking a look at OpenBSD's dynamic tracing framework #+TAGS[]: openbsd btrace #+KEYWORDS[]: openbsd btrace #+SLUG: #+SUMMARY: #+ATTR_HTML: :title Puffy with a stopwatch #+ATTR_HTML: :alt Puffy with a stopwatch [[file:cover.png]] A couple years ago a new device silently entered the OpenBSD source tree. The =dt(4)= device first appearing in OpenBSD 6.7 with [[https://man.openbsd.org/OpenBSD-6.7/dt][very sparse information]]. In the releases since, it's gained quite a bit of functionality and now has tools to manipulate it properly. The documentation has gotten better over the last couple years, but there isn't much info about how to leverage it to do anything useful. The [[https://man.openbsd.org/dt][=dt=]] pseudo device driver is connected to =/dev/dt=, which is used to interact with the device through [[https://man.openbsd.org/ioctl.2][=ioctl=]] calls. =dt= stand for "dynamic tracer", similar to [[https://en.wikipedia.org/wiki/DTrace][=dtrace=]] and [[https://github.com/iovisor/bpftrace][=bpftrace=]]. It's a device that lets you interact with probes on various parts of the system. By default you can't communicate with the device since it can't be opened unless the =kern.allowdt= [[https://man.openbsd.org/sysctl.conf][=sysctl=]] flag is set to =1=. However you can't set this specific =sysctl= flag while the system's =kern.securelevel= is above =0=, which it is under normal circumstances. The [[https://man.openbsd.org/securelevel][=securelevel=]] of the machine is set during the boot process and can't be lowered once the machine is running. That means you need to set the flag before the machine's =securelevel= is raised. You can do this by adding the following line to [[https://man.openbsd.org/sysctl.conf][=/etc/sysctl.conf=]]. #+begin_src conf kern.allowdt=1 #+end_src You can see in =/etc/rc= that =/etc/sysctl.conf= is loaded very early in the boot process, and that the =securelevel= is raised near the end of the boot. #+CAPTION: Taken from =/etc/rc= #+begin_src shell 50 # Apply sysctl.conf(5) settings. 51 sysctl_conf() { 52 # do not use a pipe as limits would only be applied to the subshell 53 set -- $(stripcom /etc/sysctl.conf) 54 while [[ $# > 0 ]] ; do 55 sysctl "$1" 56 57 case "$1" in 58 kern.maxproc=*) 59 update_limit -p maxproc 60 ;; 61 kern.maxfiles=*) 62 update_limit -n openfiles 63 ;; 64 esac 65 shift 66 done 67 } [...] 589 [[ -f /etc/rc.securelevel ]] && sh /etc/rc.securelevel 590 591 # rc.securelevel did not specifically set -1 or 2, so select the default: 1. 592 (($(sysctl -n kern.securelevel) == 0)) && sysctl kern.securelevel=1 #+end_src You could also run the machine in =Permanently insecure mode= by adding ~kern.securelevel=-1~ in =/etc/sysctl.conf=. This would let you change the =kern.allowdt= flag at runtime, but it's not recommended because it greatly reduces the security of your system. Once you've added the line to your =sysctl.conf= and rebooted you should be able to open the device as root. The main way to interact with the device is through the [[https://man.openbsd.org/btrace][=btrace=]] command. =btrace= is the "bug tracer" tool, used to run scripts written in [[https://man.openbsd.org/bt][=bt=]], the bug tracing language. =bt= uses the same syntax as Linux's =bpftrace= tool. Using the =bt= language and =btrace= tool, you can profile system internals and probe the inner workings of many programs. You can see which processes are forking, opening files, reading and writing to file descriptors, using pipes, using chown, pledging, listening on a socket, etc. You can see the full kernel and userland stack traces, process ids, function arguments, command names, thread id, cpu id, and return values. You can see the full list of supported probes with =btrace -l=. #+CAPTION: A sample of the available probes #+begin_src # doas btrace -l profile:hz:97 interval:hz:1 syscall:exit:entry syscall:exit:return syscall:fork:entry syscall:fork:return syscall:read:entry syscall:read:return syscall:write:entry syscall:write:return syscall:open:entry syscall:open:return syscall:close:entry syscall:close:return syscall:getentropy:entry syscall:getentropy:return syscall:__tfork:entry syscall:__tfork:return syscall:unlink:entry syscall:unlink:return syscall:wait4:entry syscall:wait4:return syscall:chdir:entry syscall:chdir:return syscall:fchdir:entry syscall:fchdir:return syscall:mknod:entry syscall:mknod:return syscall:chmod:entry syscall:chmod:return syscall:chown:entry syscall:chown:return [...] tracepoint:raw_syscalls:sys_enter(register_t) tracepoint:raw_syscalls:sys_exit(register_t) tracepoint:uvm:fault(vaddr_t, vm_fault_t, vm_prot_t) tracepoint:uvm:map_insert(vaddr_t, vaddr_t, vm_prot_t) tracepoint:uvm:map_remove(vaddr_t, vaddr_t, vm_prot_t) tracepoint:uvm:malloc(int, void *, size_t, int) tracepoint:uvm:free(int, void *, size_t) tracepoint:uvm:pool_get(void *, void *, int) tracepoint:uvm:pool_put(void *, void *) tracepoint:vfs:bufcache_rel(long, int, int64_t) tracepoint:vfs:bufcache_take(long, int, int64_t) tracepoint:vfs:cleaner(long, int, long, long) tracepoint:vmm:guest_enter(void *, void *) tracepoint:vmm:guest_exit(void *, void *, uint64_t) tracepoint:vmm:inout(void *, uint16_t, uint8_t, uint32_t) [...] #+end_src Currently it seems most probes are for system calls =syscall:*= but there are also a couple trace points for other function calls, like the scheduler =tracepoint:sched:enqueue=, virtual memory system =tracepoint:uvm:malloc=, virtual machine manager (vmm) =tracepoint:vmm:guest_enter=, etc. Trace points are often functions that have arguments that you can inspect as part of your probe. You can print values every time a probe is called using =printf=. You can also collect aggregates values, build simple histograms, calculate sums, or get min/max values. Check [[https://man.openbsd.org/bt.5][=bt=]] man page for full syntax. Since it's based on the =bpftrace= syntax you should also be able to look at those resources for ideas of how to use it. The =bt= language is somewhat similar to awk. =bt= programs have probes, filters, and actions. They're put together like this. #+begin_src PROBE /FILTER/ { ACTION(S) } #+end_src The action is called every time the probe gets activated, and you can optionally filter the probe to only activate on certain conditions like when it's a certain PID or thread ID using the filter format ~/pid == 1234/~ There are special =BEGIN= and =END= probes that are called once at the beginning and end of program execution, which can be used to set and clear variables. Variables are global and take the form of =@var= for a scalars or =@var[key]= for a maps. There are functions like =clear(@map)= and =delete(@map[key])= that operate on map values. You can also do basic math operations on values. Here are a couple examples of how it can be used: - Print the process name and pid every time =fork= is called #+begin_src doas btrace -e 'syscall:fork:entry { printf("%s[%d]\n", comm, pid) }' ksh[68490] ksh[68490] ksh[68490] ksh[68490] sh[83762] ksh[68490] ksh[87548] ksh[87548] ksh[87548] sh[33064] smtpd[32246] smtpd[33076] smtpd[33076] ksh[88916] smtpd[33076] smtpd[33076] smtpd[33076] ksh[52579] smtpd[33076] smtpd[33076] #+end_src - Count how many time processes call =read= (values printed after Ctrl-C) #+begin_src doas btrace -e 'syscall:read:entry { @[comm] = count() }' ^C @[sshd]: 243 @[ksh]: 48 @[less]: 31 @[cat]: 2 #+end_src - See which processes are calling which syscalls, and how many times #+begin_src doas btrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm, arg0] = count() }' ^C @[sshd, 48]: 354 @[top, 49]: 292 @[sshd, 109]: 176 @[top, 86]: 155 @[top, 202]: 124 @[sshd, 3]: 92 @[sshd, 4]: 92 @[less, 86]: 82 @[ksh, 4]: 55 @[less, 49]: 42 @[top, 46]: 36 @[sshd, 54]: 34 @[top, 169]: 31 @[ksh, 3]: 28 @[top, 3]: 27 @[ksh, 202]: 26 @[top, 252]: 22 @[ksh, 46]: 20 @[top, 253]: 19 @[ksh, 40]: 18 @[top, 5]: 18 @[top, 6]: 18 @[top, 74]: 18 @[top, 53]: 16 @[top, 54]: 15 @[ksh, 54]: 14 @[ksh, 99]: 14 @[less, 159]: 14 @[less, 74]: 14 @[top, 159]: 14 @[top, 4]: 11 @[ksh, 48]: 10 @[ksh, 6]: 9 @[sshd, 73]: 9 @[ksh, 38]: 8 #+end_src You can check the syscall numbers here: https://github.com/openbsd/src/blob/master/sys/kern/syscalls.master - See a histogram of the size of =read()= calls for a certain process #+begin_src doas btrace -e 'syscall:read:return /pid == 5353/ { @readsize = hist(retval) }' ^C @readsize: [0, 1) 3 |@@@@@@@@@@@@@@@@@@@ | [2, 4) 3 |@@@@@@@@@@@@@@@@@@@ | [32, 64) 8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| #+end_src - See a histogram of how long it takes for a =read()= to return #+begin_src doas btrace -e 'syscall:read:entry { @start[tid] = nsecs } syscall:read:return /@start[tid]/ { @times = hist(nsecs - @start[tid]); delete(@start[tid]) } END { clear(@start) }' ^C@times: [0] 3 |@@@ | [1K, 2K) 3 |@@@ | [2K, 4K) 50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| [4K, 8K) 24 |@@@@@@@@@@@@@@@@@@@@@@@@ | [8K, 16K) 8 |@@@@@@@@ | [16M, 32M) 2 |@@ | [32M, 64M) 1 |@ | [64M, 128M) 5 |@@@@@ | [128M, 256M) 4 |@@@@ | #+end_src There are also a couple of examples in =/usr/share/btrace= that use sampling and command line arguments. You can also check out the Linux [[https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md][=bpftrace= guide]] for ideas.