#+TITLE: Dynamic Tracing on OpenBSD 7.3
#+DATE: 2023-05-21T23:43:05-04:00
#+DRAFT: false
#+DESCRIPTION: Taking a look at OpenBSD's dynamic tracing framework
#+TAGS[]: openbsd btrace
#+KEYWORDS[]: openbsd btrace
#+SLUG:
#+SUMMARY:

#+ATTR_HTML: :title Puffy with a stopwatch
#+ATTR_HTML: :alt Puffy with a stopwatch
[[file:cover.png]]

A couple years ago a new device silently entered the OpenBSD source
tree.

The =dt(4)= device first appearing in OpenBSD 6.7 with [[https://man.openbsd.org/OpenBSD-6.7/dt][very sparse
information]]. In the releases since, it's gained quite a bit of
functionality and now has tools to manipulate it properly.

The documentation has gotten better over the last couple years, but
there isn't much info about how to leverage it to do anything useful.

The [[https://man.openbsd.org/dt][=dt=]] pseudo device driver is connected to =/dev/dt=, which is
used to interact with the device through [[https://man.openbsd.org/ioctl.2][=ioctl=]] calls. =dt= stand
for "dynamic tracer", similar to [[https://en.wikipedia.org/wiki/DTrace][=dtrace=]] and [[https://github.com/iovisor/bpftrace][=bpftrace=]]. It's a
device that lets you interact with probes on various parts of the
system.

By default you can't communicate with the device since it can't be
opened unless the =kern.allowdt= [[https://man.openbsd.org/sysctl.conf][=sysctl=]] flag is set to =1=.
However you can't set this specific =sysctl= flag while the system's
=kern.securelevel= is above =0=, which it is under normal circumstances.

The [[https://man.openbsd.org/securelevel][=securelevel=]] of the machine is set during the boot process and
can't be lowered once the machine is running. That means you need to
set the flag before the machine's =securelevel= is raised. You can
do this by adding the following line to [[https://man.openbsd.org/sysctl.conf][=/etc/sysctl.conf=]].

#+begin_src conf
kern.allowdt=1
#+end_src


You can see in =/etc/rc= that =/etc/sysctl.conf= is loaded very early
in the boot process, and that the =securelevel= is raised near the end
of the boot.

#+CAPTION: Taken from =/etc/rc=
#+begin_src shell
50 # Apply sysctl.conf(5) settings.
51 sysctl_conf() {
52         # do not use a pipe as limits would only be applied to the subshell
53         set -- $(stripcom /etc/sysctl.conf)
54         while [[ $# > 0 ]] ; do
55                 sysctl "$1"
56
57                 case "$1" in
58                 kern.maxproc=*)
59                         update_limit -p maxproc
60                         ;;
61                 kern.maxfiles=*)
62                         update_limit -n openfiles
63                         ;;
64                 esac
65                 shift
66         done
67 }

[...]

589 [[ -f /etc/rc.securelevel ]] && sh /etc/rc.securelevel
590
591 # rc.securelevel did not specifically set -1 or 2, so select the default: 1.
592 (($(sysctl -n kern.securelevel) == 0)) && sysctl kern.securelevel=1

#+end_src


You could also run the machine in =Permanently insecure mode= by
adding ~kern.securelevel=-1~ in =/etc/sysctl.conf=. This would let
you change the =kern.allowdt= flag at runtime, but it's not
recommended because it greatly reduces the security of your system.

Once you've added the line to your =sysctl.conf= and rebooted you
should be able to open the device as root.

The main way to interact with the device is through the [[https://man.openbsd.org/btrace][=btrace=]]
command. =btrace= is the "bug tracer" tool, used to run scripts
written in [[https://man.openbsd.org/bt][=bt=]], the bug tracing language. =bt= uses the same syntax
as Linux's =bpftrace= tool.

Using the =bt= language and =btrace= tool, you can profile system
internals and probe the inner workings of many programs. You can see
which processes are forking, opening files, reading and writing to
file descriptors, using pipes, using chown, pledging, listening on a
socket, etc. You can see the full kernel and userland stack traces,
process ids, function arguments, command names, thread id, cpu id,
and return values.

You can see the full list of supported probes with =btrace -l=.

#+CAPTION: A sample of the available probes
#+begin_src
# doas btrace -l
profile:hz:97
interval:hz:1
syscall:exit:entry
syscall:exit:return
syscall:fork:entry
syscall:fork:return
syscall:read:entry
syscall:read:return
syscall:write:entry
syscall:write:return
syscall:open:entry
syscall:open:return
syscall:close:entry
syscall:close:return
syscall:getentropy:entry
syscall:getentropy:return
syscall:__tfork:entry
syscall:__tfork:return
syscall:unlink:entry
syscall:unlink:return
syscall:wait4:entry
syscall:wait4:return
syscall:chdir:entry
syscall:chdir:return
syscall:fchdir:entry
syscall:fchdir:return
syscall:mknod:entry
syscall:mknod:return
syscall:chmod:entry
syscall:chmod:return
syscall:chown:entry
syscall:chown:return
[...]
tracepoint:raw_syscalls:sys_enter(register_t)
tracepoint:raw_syscalls:sys_exit(register_t)
tracepoint:uvm:fault(vaddr_t, vm_fault_t, vm_prot_t)
tracepoint:uvm:map_insert(vaddr_t, vaddr_t, vm_prot_t)
tracepoint:uvm:map_remove(vaddr_t, vaddr_t, vm_prot_t)
tracepoint:uvm:malloc(int, void *, size_t, int)
tracepoint:uvm:free(int, void *, size_t)
tracepoint:uvm:pool_get(void *, void *, int)
tracepoint:uvm:pool_put(void *, void *)
tracepoint:vfs:bufcache_rel(long, int, int64_t)
tracepoint:vfs:bufcache_take(long, int, int64_t)
tracepoint:vfs:cleaner(long, int, long, long)
tracepoint:vmm:guest_enter(void *, void *)
tracepoint:vmm:guest_exit(void *, void *, uint64_t)
tracepoint:vmm:inout(void *, uint16_t, uint8_t, uint32_t)
[...]
#+end_src


Currently it seems most probes are for system calls =syscall:*= but
there are also a couple trace points for other function calls, like
the scheduler =tracepoint:sched:enqueue=, virtual memory system
=tracepoint:uvm:malloc=, virtual machine manager (vmm)
=tracepoint:vmm:guest_enter=, etc.

Trace points are often functions that have arguments that you can
inspect as part of your probe.

You can print values every time a probe is called using =printf=.
You can also collect aggregates values, build simple histograms,
calculate sums, or get min/max values. Check [[https://man.openbsd.org/bt.5][=bt=]] man page for full
syntax. Since it's based on the =bpftrace= syntax you should also be
able to look at those resources for ideas of how to use it.

The =bt= language is somewhat similar to awk.

=bt= programs have probes, filters, and actions.

They're put together like this.

#+begin_src
PROBE /FILTER/ { ACTION(S) }
#+end_src

The action is called every time the probe gets activated, and you
can optionally filter the probe to only activate on certain
conditions like when it's a certain PID or thread ID using the
filter format ~/pid == 1234/~

There are special =BEGIN= and =END= probes that are called once at
the beginning and end of program execution, which can be used to set
and clear variables.

Variables are global and take the form of =@var= for a scalars or =@var[key]= for
a maps. There are functions like =clear(@map)= and =delete(@map[key])= that
operate on map values.

You can also do basic math operations on values.

Here are a couple examples of how it can be used:

- Print the process name and pid every time =fork= is called
  #+begin_src
  doas btrace -e 'syscall:fork:entry { printf("%s[%d]\n", comm, pid) }'
  ksh[68490]
  ksh[68490]
  ksh[68490]
  ksh[68490]
  sh[83762]
  ksh[68490]
  ksh[87548]
  ksh[87548]
  ksh[87548]
  sh[33064]
  smtpd[32246]
  smtpd[33076]
  smtpd[33076]
  ksh[88916]
  smtpd[33076]
  smtpd[33076]
  smtpd[33076]
  ksh[52579]
  smtpd[33076]
  smtpd[33076]

  #+end_src

- Count how many time processes call =read= (values printed after Ctrl-C)
  #+begin_src
  doas btrace -e 'syscall:read:entry { @[comm] = count() }'
  ^C
  @[sshd]: 243
  @[ksh]: 48
  @[less]: 31
  @[cat]: 2
  #+end_src

- See which processes are calling which syscalls, and how many times
  #+begin_src
  doas btrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm, arg0] = count() }'
  ^C
  @[sshd, 48]: 354
  @[top, 49]: 292
  @[sshd, 109]: 176
  @[top, 86]: 155
  @[top, 202]: 124
  @[sshd, 3]: 92
  @[sshd, 4]: 92
  @[less, 86]: 82
  @[ksh, 4]: 55
  @[less, 49]: 42
  @[top, 46]: 36
  @[sshd, 54]: 34
  @[top, 169]: 31
  @[ksh, 3]: 28
  @[top, 3]: 27
  @[ksh, 202]: 26
  @[top, 252]: 22
  @[ksh, 46]: 20
  @[top, 253]: 19
  @[ksh, 40]: 18
  @[top, 5]: 18
  @[top, 6]: 18
  @[top, 74]: 18
  @[top, 53]: 16
  @[top, 54]: 15
  @[ksh, 54]: 14
  @[ksh, 99]: 14
  @[less, 159]: 14
  @[less, 74]: 14
  @[top, 159]: 14
  @[top, 4]: 11
  @[ksh, 48]: 10
  @[ksh, 6]: 9
  @[sshd, 73]: 9
  @[ksh, 38]: 8
  #+end_src

  You can check the syscall numbers here:
  https://github.com/openbsd/src/blob/master/sys/kern/syscalls.master

- See a histogram of the size of =read()= calls for a certain process
  #+begin_src
  doas btrace -e 'syscall:read:return /pid == 5353/ { @readsize = hist(retval) }'
  ^C
  @readsize:
  [0, 1)             3 |@@@@@@@@@@@@@@@@@@@                                 |
  [2, 4)             3 |@@@@@@@@@@@@@@@@@@@                                 |
  [32, 64)           8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  #+end_src

- See a histogram of how long it takes for a =read()= to return
  #+begin_src
  doas btrace -e 'syscall:read:entry { @start[tid] = nsecs } syscall:read:return /@start[tid]/ { @times = hist(nsecs - @start[tid]); delete(@start[tid]) } END { clear(@start) }'
  ^C@times:
  [0]                3 |@@@                                                 |
  [1K, 2K)           3 |@@@                                                 |
  [2K, 4K)          50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
  [4K, 8K)          24 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
  [8K, 16K)          8 |@@@@@@@@                                            |
  [16M, 32M)         2 |@@                                                  |
  [32M, 64M)         1 |@                                                   |
  [64M, 128M)        5 |@@@@@                                               |
  [128M, 256M)       4 |@@@@                                                |
  #+end_src

There are also a couple of examples in =/usr/share/btrace= that use
sampling and command line arguments.

You can also check out the Linux [[https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md][=bpftrace= guide]] for ideas.