As the Profiling HOWTO states, two sets of data about profiled code behavior are recorded independently: function call frequency and time spent in each function.
Profiling a kernel is normally done either to compare the effect of new changes against a previous kernel or to track down some sort of low-level performance problem.
First, take a look at either Chapter 9, Kernel Tuning, or the Kernel FAQ at NetBSD. The only difference in procedure for setting up a kernel with profiling enabled is that when you run config you add the -p option. The build area is ../compile/<KERNEL_NAME>.PROF; for example, a GENERIC kernel would build in ../compile/GENERIC.PROF.
Following is a quick summary of how to compile a kernel with profiling enabled on the i386 port. It assumes that the appropriate sources are available under /usr/src and that the GENERIC configuration is being used, which, of course, may not always be the case.
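A minimal sketch of those steps (assuming the i386 port, sources under /usr/src, and the GENERIC configuration; the exact install step can vary between releases):

quark# cd /usr/src/sys/arch/i386/conf
quark# config -p GENERIC
quark# cd ../compile/GENERIC.PROF
quark# make depend && make
quark# cp /netbsd /netbsd.old
quark# cp netbsd /
quark# reboot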
Once the new kernel is in place and the system has rebooted, it is time to turn on the monitoring and start looking at results.
To start kgmon:
quark# kgmon -b
kgmon: kernel profiling is running.
Next, send the data into the file gmon.out:
quark# kgmon -p
Now, it is time to make the output readable:
quark# gprof /netbsd > gprof.out
Since gprof looks for gmon.out, it should find it in the current working directory.
Just running kgmon alone may not give you the information you need; if you are comparing the differences between two different kernels, then a known good baseline should be used. Note that it is generally a good idea to stress the subsystem in question (if you know what it is), both on the baseline and on the newer (or different) kernel.
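As a rough illustration, a baseline collection around a disk-heavy workload might look like the following; the dd command and the file name are only placeholder stress, not part of any fixed procedure, and the same sequence would then be repeated on the kernel under test:

quark# kgmon -b
kgmon: kernel profiling is running.
quark# dd if=/dev/zero of=/var/tmp/stress.img bs=1m count=512  # hypothetical disk stress
quark# kgmon -p
quark# gprof /netbsd > gprof.baseline.out

Comparing this output against the same run on the modified kernel highlights where the time has shifted.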
Now that kgmon can run, collect, and parse information, it is time to actually look at some of that information. In this particular instance, a GENERIC kernel was run with profiling enabled for about an hour with only system processes and no adverse load; in the fault insertion section, the introduced problem will be large enough that it should be easy to detect even under a minimal load.
The flat profile is a list of functions, the number of times they were called, and how long each took (in seconds). Following is sample output from the quiet system:
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self      total
 time   seconds   seconds    calls   ns/call   ns/call  name
 99.77    163.87   163.87                               idle
  0.03    163.92     0.05      219 228310.50 228354.34  _wdc_ata_bio_start
  0.02    163.96     0.04      219 182648.40 391184.96  wdc_ata_bio_intr
  0.01    163.98     0.02     3412   5861.66   6463.02  pmap_enter
  0.01    164.00     0.02      548  36496.35  36496.35  pmap_zero_page
  0.01    164.02     0.02                               Xspllower
  0.01    164.03     0.01   481968     20.75     20.75  gettick
  0.01    164.04     0.01     6695   1493.65   1493.65  VOP_LOCK
  0.01    164.05     0.01     3251   3075.98  21013.45  syscall_plain
...
As expected, idle was the highest in percentage; however, there were still some things going on. For example, a little further down there is the vn_lock function:
...
  0.00    164.14     0.00     6711      0.00      0.00  VOP_UNLOCK
  0.00    164.14     0.00     6677      0.00   1493.65  vn_lock
  0.00    164.14     0.00     6441      0.00      0.00  genfs_unlock
...
This is to be expected, since locking still has to take place, regardless.
The call graph is an augmented version of the flat profile showing subsequent calls from the listed functions. First, here is some sample output:
Call graph (explanation follows)

granularity: each sample hit covers 4 byte(s) for 0.01% of 164.14 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     99.8  163.87    0.00                 idle [1]
-----------------------------------------------
                                                 <spontaneous>
[2]      0.1    0.01    0.08                 syscall1 [2]
                0.01    0.06    3251/3251        syscall_plain [7]
                0.00    0.01     414/1660        trap [9]
-----------------------------------------------
                0.00    0.09     219/219         Xintr14 [6]
[3]      0.1    0.00    0.09     219         pciide_compat_intr [3]
                0.00    0.09     219/219         wdcintr [5]
-----------------------------------------------
...
Now this can be a little confusing. The index number maps to the bracketed number at the end of each line; for example:
...
                0.00    0.01      85/85          dofilewrite [68]
[72]     0.0    0.00    0.01      85         soo_write [72]
                0.00    0.01      85/89          sosend [71]
...

Here we see that dofilewrite was called first. Now we can look at index number 64 and see what was happening there:

...
                0.00    0.01     101/103         ffs_full_fsync <cycle 6> [58]
[64]     0.0    0.00    0.01     103         bawrite [64]
                0.00    0.01     103/105         VOP_BWRITE [60]
...
And so on; in this way, a "visual trace" can be established.
At the end of the call graph output, right after the terms section, there is an index sorted by function name, which can also help map index numbers to functions.
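For example, assuming the output was saved to gprof.out as above, a plain text search is often the quickest way to jump from a name to its index entry (vn_lock is simply the function from the earlier sample):

quark# grep -n 'vn_lock' gprof.out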
In this example, I have modified an area of the kernel that I know will create a blatantly obvious problem.
Here is the top portion of the flat profile after running the system for about an hour with little interaction from users:
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self      total
 time   seconds   seconds    calls   us/call   us/call  name
 93.97    139.13   139.13                               idle
  5.87    147.82     8.69       23 377826.09 377842.52  check_exec
  0.01    147.84     0.02      243     82.30     82.30  pmap_copy_page
  0.01    147.86     0.02      131    152.67    152.67  _wdc_ata_bio_start
  0.01    147.88     0.02      131    152.67    271.85  wdc_ata_bio_intr
  0.01    147.89     0.01     4428      2.26      2.66  uvn_findpage
  0.01    147.90     0.01     4145      2.41      2.41  uvm_pageactivate
  0.01    147.91     0.01     2473      4.04   3532.40  syscall_plain
  0.01    147.92     0.01     1717      5.82      5.82  i486_copyout
  0.01    147.93     0.01     1430      6.99     56.52  uvm_fault
  0.01    147.94     0.01     1309      7.64      7.64  pool_get
  0.01    147.95     0.01      673     14.86     38.43  genfs_getpages
  0.01    147.96     0.01      498     20.08     20.08  pmap_zero_page
  0.01    147.97     0.01      219     45.66     46.28  uvm_unmap_remove
  0.01    147.98     0.01      111     90.09     90.09  selscan
...
As is obvious, there is a large difference in performance. Right off the bat the idle time is noticeably less. The main difference here is that one particular function, check_exec, has a large time across the board with very few calls. While at first this may not seem strange if a lot of commands had been executed, when compared to the flat profile of the first measurement it does not seem proportionally right:
...
  0.00    164.14     0.00       37      0.00  62747.49  check_exec
...
In the first measurement the call is made 37 times and performs much better. Obviously something in or around that function is wrong. To eliminate other functions, a look at the call graph can help; here is the first instance of check_exec:
...
-----------------------------------------------
                0.00    8.69      23/23          syscall_plain [3]
[4]      5.9    0.00    8.69      23         sys_execve [4]
                8.69    0.00      23/23          check_exec [5]
                0.00    0.00      20/20          elf32_copyargs [67]
...
Notice how the time of 8.69 seconds seems to affect the two functions above it. It is possible that there is something wrong with them; however, the next instance of check_exec seems to prove otherwise:
...
-----------------------------------------------
                8.69    0.00      23/23          sys_execve [4]
[5]      5.9    8.69    0.00      23         check_exec [5]
...
Now we can see that the problem most likely resides in check_exec. Of course, problems are not always this simple; in fact, here is the simpleton code that was inserted into check_exec (the function lives in sys/kern/kern_exec.c):
...
	volatile int x, y;	/* volatile so the compiler cannot optimize the loop away */

	/* A Cheap fault insertion */
	for (x = 0; x < 100000000; x++) {
		y = x;
	}
...
Not exactly glamorous, but enough to register a large change with profiling.
Kernel profiling can be enlightening for anyone and provides a much more refined method of hunting down performance problems that are not easy to find by conventional means. It is also not nearly as hard as most people think: if you can compile a kernel, you can get profiling to work.