summaryrefslogtreecommitdiffstats
path: root/content/posts/openbsd-looking-at-btrace/index.org
blob: bfd767404197275311a82a27c2f01f3d1d29e38c (plain) (blame)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
#+TITLE: Looking at btrace on OpenBSD 7.3
#+DATE: 2023-05-21T23:43:05-04:00
#+DRAFT: true
#+DESCRIPTION: Taking a look at OpenBSD's dynamic tracing framework
#+TAGS[]: openbsd btrace
#+KEYWORDS[]: openbsd btrace
#+SLUG:
#+SUMMARY:

* What it is

  A couple years ago a new device silently entered the OpenBSD source
  tree.

  The =dt(4)= device first appearing in OpenBSD 6.7 with [[https://man.openbsd.org/OpenBSD-6.7/dt][very sparse
  information]]. In the releases since, it's gained quite a bit of
  functionality and now has tools to manipulate it properly. After
  playing around with on my test machine running 7.3-current, it's
  obvious that a lot of work has gone into it.

  The documentation has gotten better over the last couple years, but
  there isn't much info about how to leverage it to do anything useful.

  The [[https://man.openbsd.org/dt][=dt=]] pseudo device driver is connected to =/dev/dt=, which is
  used to interact with the device through [[https://man.openbsd.org/ioctl.2][=ioctl=]] calls. =dt= stand
  for "dynamic tracer", similar to [[https://en.wikipedia.org/wiki/DTrace][=dtrace=]] and [[https://github.com/iovisor/bpftrace][=bpftrace=]]. It's a
  device that lets you interact with probes on various parts of the
  system.

* How to enable it

  By default you can't communicate with the device since it can't be
  opened unless the =kern.allowdt= [[https://man.openbsd.org/sysctl.conf][=sysctl=]] flag is set to =1=.
  However you can't set this specific =sysctl= flag while the system's
  =kern.securelevel= is above =0=, which it is under normal circumstances.

  The [[https://man.openbsd.org/securelevel][=securelevel=]] of the machine is set during the boot process and
  can't be lowered once the machine is running. That means you need to
  set the flag before the machine's =securelevel= is raised. You can
  do this by adding the following line to [[https://man.openbsd.org/sysctl.conf][=/etc/sysctl.conf=]].

  #+begin_src conf
  kern.allowdt=1
  #+end_src


  You can see in =/etc/rc= that =/etc/sysctl.conf= is loaded very early
  in the boot process, and that the =securelevel= is raised near the end
  of the boot.

  #+CAPTION: Taken from =/etc/rc=
  #+begin_src shell
  50 # Apply sysctl.conf(5) settings.
  51 sysctl_conf() {
  52         # do not use a pipe as limits would only be applied to the subshell
  53         set -- $(stripcom /etc/sysctl.conf)
  54         while [[ $# > 0 ]] ; do
  55                 sysctl "$1"
  56
  57                 case "$1" in
  58                 kern.maxproc=*)
  59                         update_limit -p maxproc
  60                         ;;
  61                 kern.maxfiles=*)
  62                         update_limit -n openfiles
  63                         ;;
  64                 esac
  65                 shift
  66         done
  67 }

  [...]

  589 [[ -f /etc/rc.securelevel ]] && sh /etc/rc.securelevel
  590
  591 # rc.securelevel did not specifically set -1 or 2, so select the default: 1.
  592 (($(sysctl -n kern.securelevel) == 0)) && sysctl kern.securelevel=1

  #+end_src


  You could also run the machine in =Permanently insecure mode= by
  adding ~kern.securelevel=-1~ in =/etc/sysctl.conf=. This would let
  you change the =kern.allowdt= flag at runtime, but it's not
  recommended because it greatly reduces the security of your system.

  Once you've added the line to your =sysctl.conf= and rebooted you
  should be able to open the device as root.

  The main way to interact with the device is through the [[https://man.openbsd.org/btrace][=btrace=]]
  command. =btrace= is the "bug tracer" tool, used to run scripts
  written in [[https://man.openbsd.org/bt][=bt=]], the bug tracing language. =bt= uses the same syntax
  as Linux's =bpftrace= tool.

* What it can do

  Using the =bt= language and =btrace= tool, you can profile system
  internals and probe the inner workings of many programs. You can see
  which processes are forking, opening files, reading and writing to
  file descriptors, using pipes, using chown, pledging, listening on a
  socket, etc. You can see the full kernel and userland stack traces,
  process ids, function arguments, command names, thread id, cpu id,
  and return values.

  You can see the full list of supported probes with =btrace -l=.

  #+CAPTION: A sample of the available probes
  #+begin_src
  # doas btrace -l
  profile:hz:97
  interval:hz:1
  syscall:exit:entry
  syscall:exit:return
  syscall:fork:entry
  syscall:fork:return
  syscall:read:entry
  syscall:read:return
  syscall:write:entry
  syscall:write:return
  syscall:open:entry
  syscall:open:return
  syscall:close:entry
  syscall:close:return
  syscall:getentropy:entry
  syscall:getentropy:return
  syscall:__tfork:entry
  syscall:__tfork:return
  syscall:unlink:entry
  syscall:unlink:return
  syscall:wait4:entry
  syscall:wait4:return
  syscall:chdir:entry
  syscall:chdir:return
  syscall:fchdir:entry
  syscall:fchdir:return
  syscall:mknod:entry
  syscall:mknod:return
  syscall:chmod:entry
  syscall:chmod:return
  syscall:chown:entry
  syscall:chown:return
  [...]
  tracepoint:raw_syscalls:sys_enter(register_t)
  tracepoint:raw_syscalls:sys_exit(register_t)
  tracepoint:uvm:fault(vaddr_t, vm_fault_t, vm_prot_t)
  tracepoint:uvm:map_insert(vaddr_t, vaddr_t, vm_prot_t)
  tracepoint:uvm:map_remove(vaddr_t, vaddr_t, vm_prot_t)
  tracepoint:uvm:malloc(int, void *, size_t, int)
  tracepoint:uvm:free(int, void *, size_t)
  tracepoint:uvm:pool_get(void *, void *, int)
  tracepoint:uvm:pool_put(void *, void *)
  tracepoint:vfs:bufcache_rel(long, int, int64_t)
  tracepoint:vfs:bufcache_take(long, int, int64_t)
  tracepoint:vfs:cleaner(long, int, long, long)
  tracepoint:vmm:guest_enter(void *, void *)
  tracepoint:vmm:guest_exit(void *, void *, uint64_t)
  tracepoint:vmm:inout(void *, uint16_t, uint8_t, uint32_t)
  [...]
  #+end_src


  Currently it seems most probes are for system calls =syscall:*= but
  there are also a couple trace points for other function calls, like
  the scheduler =tracepoint:sched:enqueue=, virtual memory system
  =tracepoint:uvm:malloc=, virtual machine manager (vmm)
  =tracepoint:vmm:guest_enter=, etc.

  Trace points are often functions that have arguments that you can
  inspect as part of your probe.

  You can print values every time a probe is called using =printf=.
  You can also collect aggregates values, build simple histograms,
  calculate sums, or get min/max values. Check [[https://man.openbsd.org/bt.5][=bt=]] man page for full
  syntax. Since it's based on the =bpftrace= syntax you should also be
  able to look at those resources for ideas of how to use it.

  The =bt= language is somewhat similar to awk.

  =bt= programs have probes, filters, and actions.

  They're put together like this.

  #+begin_src
  PROBE /FILTER/ { ACTION(S) }
  #+end_src

  The action is called every time the probe gets activated, and you
  can optionally filter the probe to only activate on certain
  conditions like when it's a certain PID or thread ID using the
  filter format ~/pid == 1234/~

  There are special =BEGIN= and =END= probes that are called once at
  the beginning and end of program execution, which can be used to set
  and clear variables.

  Variables are global and take the form of =@var= for a scalars or =@var[key]= for
  a maps. There are functions like =clear(@map)= and =delete(@map[key])= that
  operate on map values.

  You can also do basic math operations on values.

* Examples

  - Print the process name and paid every time =fork= is called
    #+begin_src
    doas btrace -e 'syscall:fork:entry { printf("%s[%d]\n", comm, pid) }'
    ksh[68490]
    ksh[68490]
    ksh[68490]
    ksh[68490]
    sh[83762]
    ksh[68490]
    ksh[87548]
    ksh[87548]
    ksh[87548]
    sh[33064]
    smtpd[32246]
    smtpd[33076]
    smtpd[33076]
    ksh[88916]
    smtpd[33076]
    smtpd[33076]
    smtpd[33076]
    ksh[52579]
    smtpd[33076]
    smtpd[33076]

    #+end_src

  - Count how many time processes call =read= (values printed after Ctrl-C)
    #+begin_src
    doas btrace -e 'syscall:read:entry { @[comm] = count() }'
    ^C
    @[sshd]: 243
    @[ksh]: 48
    @[less]: 31
    @[cat]: 2
    #+end_src

  - See which processes are calling which syscalls, and how many times
    #+begin_src
    doas btrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm, arg0] = count() }'
    ^C
    @[sshd, 48]: 354
    @[top, 49]: 292
    @[sshd, 109]: 176
    @[top, 86]: 155
    @[top, 202]: 124
    @[sshd, 3]: 92
    @[sshd, 4]: 92
    @[less, 86]: 82
    @[ksh, 4]: 55
    @[less, 49]: 42
    @[top, 46]: 36
    @[sshd, 54]: 34
    @[top, 169]: 31
    @[ksh, 3]: 28
    @[top, 3]: 27
    @[ksh, 202]: 26
    @[top, 252]: 22
    @[ksh, 46]: 20
    @[top, 253]: 19
    @[ksh, 40]: 18
    @[top, 5]: 18
    @[top, 6]: 18
    @[top, 74]: 18
    @[top, 53]: 16
    @[top, 54]: 15
    @[ksh, 54]: 14
    @[ksh, 99]: 14
    @[less, 159]: 14
    @[less, 74]: 14
    @[top, 159]: 14
    @[top, 4]: 11
    @[ksh, 48]: 10
    @[ksh, 6]: 9
    @[sshd, 73]: 9
    @[ksh, 38]: 8
    #+end_src

    You can check the syscall numbers here:
    https://github.com/openbsd/src/blob/master/sys/kern/syscalls.master

  - See a histogram of the size of =read()= calls for a certain process
    #+begin_src
    doas btrace -e 'syscall:read:return /pid == 5353/ { @readsize = hist(retval) }'
    ^C
    @readsize:
    [0, 1)             3 |@@@@@@@@@@@@@@@@@@@                                 |
    [2, 4)             3 |@@@@@@@@@@@@@@@@@@@                                 |
    [32, 64)           8 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    #+end_src

  - See a histogram of how long it takes for a =read()= to return
    #+begin_src
    doas btrace -e 'syscall:read:entry { @start[tid] = nsecs } syscall:read:return /@start[tid]/ { @times = hist(nsecs - @start[tid]); delete(@start[tid]) } END { clear(@start) }'
    ^C@times:
    [0]                3 |@@@                                                 |
    [1K, 2K)           3 |@@@                                                 |
    [2K, 4K)          50 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    [4K, 8K)          24 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
    [8K, 16K)          8 |@@@@@@@@                                            |
    [16M, 32M)         2 |@@                                                  |
    [32M, 64M)         1 |@                                                   |
    [64M, 128M)        5 |@@@@@                                               |
    [128M, 256M)       4 |@@@@                                                |
    #+end_src

  There are also a couple of examples in =/usr/share/btrace= that use
  sampling and command line arguments.

  You can also check out the Linux [[https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md][=bpftrace= guide]] for ideas.