Linux perf is a tool that allows counting and sampling of various events in the hardware and in the kernel. Hardware events are available via performance monitoring units (PMUs); they measure CPU cycles, cache misses, branches, etc. Kernel events include scheduling context switches, page faults, block I/O, etc.
perf is good for answering these and similar questions:
sched_yield system call?
Despite claims that perf can also tell you where and why your program was blocking, I was not able to get it to do so after many attempts. For alternative ways to perform off-CPU analysis, read this post by Brendan Gregg or use WiredTiger's Operation Tracking.
What follows is a quick cheat sheet on how to use perf with WiredTiger. Most of the information in this cheat sheet comes from this excellent tutorial by Brendan Gregg. If you need more information, please refer to that source.
Build WiredTiger as usual. You do not need to add any special compilation flags.
To run perf with WiredTiger, insert the perf command before the WiredTiger executable. For example, to record the default perf statistics for a wtperf job, you'd run:
perf stat wtperf <wtperf options>
If you want to profile the CPU time of wtperf, you'd run:
perf record wtperf <wtperf options>
Any environment variables you care about go before the perf command.
To run perf on an already running WiredTiger process, supply the -p <PID> option to either of these commands:
perf record -p <PID>
or
perf stat -p <PID>
The command perf stat reports the total count of events for the entire execution.
perf stat wtperf <wtperf options>
By default it reports the task clock, context switches, CPU migrations, page faults, cycles, instructions, branches, and branch misses.
Here is a sample output:
      55275.953046      task-clock (msec)         #    1.466 CPUs utilized
         7,349,023      context-switches          #    0.133 M/sec
            38,219      cpu-migrations            #    0.691 K/sec
           160,562      page-faults               #    0.003 M/sec
   173,665,597,703      cycles                    #    3.142 GHz
    90,146,578,819      instructions              #    0.52 insn per cycle
    22,199,890,600      branches                  #  401.619 M/sec
       387,770,656      branch-misses             #    1.75% of all branches

      37.710693533 seconds time elapsed
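The annotations after the # sign are ratios of the raw counters, and it can help to recompute them when comparing two runs. The following is a minimal sketch, not part of perf itself: ratio is a hypothetical helper name, and the numbers are copied from the sample output above.

```shell
# ratio NUMERATOR DENOMINATOR [PRINTF_FORMAT]
# Hypothetical helper: strips thousands separators, then prints the quotient.
ratio() {
    awk -v a="$(printf '%s' "$1" | tr -d ,)" \
        -v b="$(printf '%s' "$2" | tr -d ,)" \
        -v fmt="${3:-%.2f}" \
        'BEGIN { if (b == 0) exit 1; printf(fmt "\n", a / b) }'
}

ratio 90,146,578,819 173,665,597,703       # instructions / cycles      -> 0.52
ratio 387,770,656 22,199,890,600 %.4f      # branch-misses / branches   -> 0.0175
ratio 55275.953046 37710.693533 %.3f       # task-clock ms / elapsed ms -> 1.466
```

The three results match the "insn per cycle", "of all branches", and "CPUs utilized" annotations that perf printed for this run.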
Alternatively, you can decide what events you want to include. For example, if you wanted to count last-level cache (LLC) misses, one way to do this is like so:
perf stat -e LLC-load-misses wtperf <wtperf options>
Or if you wanted all default events and all cache events, run:
perf stat -d wtperf <wtperf options>
To learn about all events you can count or sample with perf, use the command:
perf list
To count all system calls by type, run:
perf stat -e 'syscalls:sys_enter_*' <command>
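Counting every system call type produces one line per tracepoint, most of them with a count of zero. A rough way to rank the non-zero ones is sketched below. It assumes the default human-readable perf stat layout (count in the first column, event name in the second); rank_counts is a hypothetical helper, not a perf feature.

```shell
# Read default `perf stat` text on stdin and print "count event" pairs,
# largest count first, skipping events that never fired.
rank_counts() {
    awk '{
        count = $1
        gsub(/,/, "", count)                      # 7,349,023 -> 7349023
        if (count ~ /^[0-9]+$/ && count + 0 > 0)  # keep only "<count> <event>" rows
            print count, $2
    }' | sort -k1,1nr
}
```

Because perf stat writes its report to standard error, the pipeline needs a redirect: perf stat -e 'syscalls:sys_enter_*' -p <PID> 2>&1 | rank_counts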
Unlike perf stat, which counts all events, perf record samples events. The simplest way to invoke it is like so:
perf record <command>
This will sample on-CPU functions.
By default, the output is placed in a file named perf.data. To examine it, run:
perf report
perf report will show, for each sampled event, the call stacks that were recorded at the time of sampling. The call stacks are sorted from the most frequently occurring to the least frequently occurring.
At the time of this writing, the default sampling frequency is 4000 Hz; that is, perf aims to take about 4000 samples per second. (The -c option, by contrast, takes one sample every fixed number of events.) This number may change, so to be sure, check the setting in the header of the perf output, like so:
perf report --header
If you want to change the output file, use the -o option:
perf record -o <new_output_file> <command>
And run the reporting with the -i option:
perf report -i <new_output_file>
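When comparing several runs, it can be convenient to give each recording its own file so that a later run does not replace an earlier one. The sketch below is only one possible naming scheme; perf_data_name and the label evict-btree are made up for illustration.

```shell
# Hypothetical helper: build a distinct data file name for a run,
# of the form perf.<label>.<YYYYmmdd-HHMMSS>.data
perf_data_name() {
    printf 'perf.%s.%s.data\n' "$1" "$(date +%Y%m%d-%H%M%S)"
}

out=$(perf_data_name evict-btree)   # "evict-btree" is an example label
echo "record with: perf record -o $out <command>"
echo "report with: perf report -i $out"
```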
Here is a way to look at the output with all the call stacks expanded:
perf report -n --stdio
Sample on-CPU functions with call stacks:
perf record -g <command>
Sample as above, but using a specific sampling frequency (99 hertz):
perf record -F 99 -g <command>
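To judge whether a given frequency will collect enough samples, a back-of-the-envelope estimate is frequency times on-CPU seconds. The sketch below assumes samples are taken only while the program is on a CPU, so it is a rough guide rather than an exact prediction; estimate_samples is a hypothetical helper, and the CPU time is the task-clock value from the perf stat sample output shown earlier (55275.953046 msec).

```shell
# estimate_samples HZ ON_CPU_SECONDS
# Hypothetical helper: rough number of samples perf record would collect.
estimate_samples() {
    awk -v hz="$1" -v secs="$2" 'BEGIN { printf "%.0f\n", hz * secs }'
}

estimate_samples 99 55.275953046     # roughly 5472 samples at 99 Hz
```

A few thousand samples are usually enough to see which functions dominate; a higher frequency gives finer detail at the cost of more overhead and a larger data file.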
Sample last-level cache load misses with call stacks:
perf record -e LLC-load-misses -g <command>
Sample cache misses for a running process:
perf record -e LLC-load-misses -g -p <PID>
Sample page faults:
perf record -e page-faults -g <command>