Kernel Debugging Tools and Techniques

Choices of debugging tools
• Add debug code, recompile and run
– printk, but bug may disappear if it's timing sensitive and data is
written to a serial console
– Set console log level to 0 and use dmesg instead
• Patch code at runtime to print or gather data
– Ftrace, Kprobes
• Patch code at runtime to stop kernel and analyze
– KDB, KGDB
• Run the kernel under the control of a VM such as QEMU or
VirtualBox

printk()
• Kernel-space equivalent of printf()
• Each kernel message is prefixed with a
string representing its loglevel n
– “<n>Hello world!”
• Loglevel determines the severity of the
message

Printk loglevel
• Messages with level lower than console_loglevel are
shown to the console
• console_loglevel can be changed via
– dmesg -n level
– syslog system call
– echo n > /proc/sys/kernel/printk
Name         | String | Meaning                                                            | Alias macro
-------------|--------|--------------------------------------------------------------------|------------
KERN_EMERG   | "0"    | Emergency messages; the system is about to crash or is unstable    | pr_emerg()
KERN_ALERT   | "1"    | Something bad happened and action must be taken immediately        | pr_alert()
KERN_CRIT    | "2"    | A serious hardware/software failure                                | pr_crit()
KERN_ERR     | "3"    | Often used by drivers to indicate difficulties with the hardware   | pr_err()
KERN_WARNING | "4"    | Nothing serious by itself, but might indicate problems             | pr_warn() (pr_warning() in older kernels)
KERN_NOTICE  | "5"    | Nothing serious. Often used to report security events              | pr_notice()
KERN_INFO    | "6"    | Informational message, e.g. startup info at driver initialization  | pr_info()
KERN_DEBUG   | "7"    | Debug messages                                                     | pr_debug() (only if DEBUG is defined)
KERN_DEFAULT | "d"    | The default kernel loglevel                                        |
KERN_CONT    | ""     | "Continued" line after a line that had no enclosing \n             | pr_cont()
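
The prefix and console rule above can be modelled in a few lines of user-space C. This is only an illustrative sketch, not kernel code: the function names parse_loglevel() and shown_on_console() are made up for this example, and the "<n>" prefix form is the one shown on the printk() slide.

```c
#include <stddef.h>

/*
 * Illustrative user-space model, not kernel code.
 * Parse a "<n>" prefix such as "<3>disk error".
 * Returns the level (0..7) and points *text at the message body;
 * returns -1 and leaves *text at msg when there is no prefix.
 */
int parse_loglevel(const char *msg, const char **text)
{
    if (msg[0] == '<' && msg[1] >= '0' && msg[1] <= '7' && msg[2] == '>') {
        *text = msg + 3;
        return msg[1] - '0';
    }
    *text = msg;
    return -1;
}

/*
 * A message reaches the console only when its level is numerically
 * lower (more severe) than console_loglevel.
 */
int shown_on_console(int level, int console_loglevel)
{
    return level >= 0 && level < console_loglevel;
}
```

With console_loglevel set to 4, for example, a KERN_ERR message (3) is shown while a KERN_WARNING message (4) only lands in the log buffer.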

Kernel log buffer
• kernel log buffer stores kernel messages
• It is a circular buffer. Old messages are
overwritten when the buffer is full
– Use klogd daemon to keep old msgs in a file
– Log buffer size is configurable
• Kernel log buffer can be manipulated via
syslog system call
– or dmesg command line tool
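
The overwrite behaviour of a circular buffer can be sketched in user space as follows. This is a simplified model for illustration only; the struct and function names are hypothetical and the kernel's real log buffer is record-based and considerably more involved.

```c
#include <stddef.h>

#define LOG_BUF_LEN 8   /* tiny on purpose; the real size is configurable */

struct log_ring {
    char buf[LOG_BUF_LEN];
    size_t head;        /* total number of bytes ever written */
};

/* Append a string; once the buffer is full, new bytes overwrite the oldest. */
void ring_write(struct log_ring *r, const char *s)
{
    for (; *s; s++)
        r->buf[r->head++ % LOG_BUF_LEN] = *s;
}

/*
 * Copy the surviving bytes, oldest first, into out and NUL-terminate.
 * out must hold at least LOG_BUF_LEN + 1 bytes. Returns bytes copied.
 */
size_t ring_read(const struct log_ring *r, char *out)
{
    size_t n = r->head < LOG_BUF_LEN ? r->head : LOG_BUF_LEN;
    size_t start = r->head - n;

    for (size_t i = 0; i < n; i++)
        out[i] = r->buf[(start + i) % LOG_BUF_LEN];
    out[n] = '\0';
    return n;
}
```

Writing "abcdef" and then "ghij" into this 8-byte ring leaves "cdefghij": the two oldest bytes are gone, which is why a daemon has to drain the buffer if old messages must be kept.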

syslog system call
• int syslog(int type, char *bufp, int len)
/*
* Commands to sys_syslog:
*
* 0 -- Close the log. Currently a NOP.
* 1 -- Open the log. Currently a NOP.
* 2 -- Read from the log (wait until the buffer is nonempty)
* 3 -- Read all messages remaining in the ring buffer
* 4 -- Read and clear all messages remaining in the ring buffer
* 5 -- Clear ring buffer.
* 6 -- Disable printk to console
* 7 -- Enable printk to console
* 8 -- Set level of messages printed to console
* 9 -- Return number of unread characters in the log buffer
*/
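
From user space, glibc exposes this system call as klogctl(), because the name syslog() is already taken by the C library logging function. A minimal sketch, assuming a Linux system with glibc (or another libc that provides <sys/klog.h>); the wrapper name read_kernel_log() is made up for this example and the command number is command 3 from the list above.

```c
#include <sys/klog.h>

/* Command 3 from the list above: read all messages remaining in the ring buffer. */
enum { SYSLOG_CMD_READ_ALL = 3 };

/*
 * Non-destructive read of up to len bytes of the kernel log, similar to
 * what dmesg does. Returns the number of bytes read, or -1 on error
 * (for example when the caller lacks the required privilege).
 */
int read_kernel_log(char *buf, int len)
{
    return klogctl(SYSLOG_CMD_READ_ALL, buf, len);
}
```

Whether an unprivileged process may call this depends on system configuration, so callers should always handle the -1 case.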

Klogd and syslogd
• Klogd is the “kernel log daemon”. It receives kernel
messages via the syslog system call (or /proc/kmsg) and
redirects them to syslogd
• syslogd differentiates messages by facility.priority (e.g.
LOG_KERN.LOG_ERR) and consults /etc/syslog.conf to
decide how to deal with them (discard or save in a file)
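[Figure: Kernel space: kernel log buffer, read through /proc/kmsg or sys_syslog(). User space: klogd passes kernel messages to syslogd, which writes them to files; other daemons reach syslogd through the C library calls openlog(), syslog() and closelog().]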

Use printk macros
• Do not remove debug printk
– you may need it later to debug another related issue
• Undefine DEBUG to remove debug messages in
a production kernel
• For drivers, use dev_dbg() instead

Limit the rate of your printk
• printk may overwhelm the console if it is called
– in code that is executed very often
– in a frequently triggered IRQ handler (e.g. timer)
• printk_ratelimit() returns 0 when the message to be
printed should be suppressed
• printk_once()
– no matter how often you call it, it prints once and
never again
if (printk_ratelimit())
	printk(KERN_NOTICE "The printer is still on fire\n");

printk_ratelimit() implementation
• The two variables can be modified via
/proc/sys/kernel/ (printk_ratelimit and printk_ratelimit_burst)
/* minimum time in jiffies between messages */
int printk_ratelimit_jiffies = 5*HZ;
/* number of messages we send before ratelimiting */
int printk_ratelimit_burst = 10;
int printk_ratelimit(void)
{
return __printk_ratelimit(printk_ratelimit_jiffies,
printk_ratelimit_burst);
}
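
How an interval plus a burst count limits output can be shown with a small token-bucket model. This is a user-space sketch under stated assumptions, not the kernel's implementation: time is passed in explicitly as a "jiffies" value, and the struct and function names are hypothetical.

```c
/*
 * User-space token-bucket model in the spirit of printk_ratelimit().
 * Each allowed message costs `interval` tokens; tokens accumulate with
 * elapsed time and are capped at interval * burst.
 */
struct ratelimit {
    unsigned long interval;  /* minimum jiffies between messages */
    unsigned long burst;     /* messages allowed before limiting starts */
    unsigned long tokens;    /* accumulated credit, in jiffies */
    unsigned long last;      /* time of the previous call */
    unsigned long missed;    /* messages suppressed so far */
};

void ratelimit_init(struct ratelimit *rl, unsigned long interval,
                    unsigned long burst, unsigned long now)
{
    rl->interval = interval;
    rl->burst = burst;
    rl->tokens = interval * burst;   /* start with a full bucket */
    rl->last = now;
    rl->missed = 0;
}

/* Returns 1 if the caller may print, 0 if the message should be suppressed. */
int ratelimit_ok(struct ratelimit *rl, unsigned long now)
{
    rl->tokens += now - rl->last;
    rl->last = now;
    if (rl->tokens > rl->interval * rl->burst)
        rl->tokens = rl->interval * rl->burst;
    if (rl->tokens >= rl->interval) {
        rl->tokens -= rl->interval;
        return 1;
    }
    rl->missed++;
    return 0;
}
```

With interval 5 and burst 3, three back-to-back calls succeed, the fourth is suppressed, and a further message is allowed once 5 time units have passed.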

/proc file system
• A software-created, pseudo file system
• Exposes a lot of system information, e.g.:
– /proc/<pid>/maps
– /proc/sys/kernel/*
– /proc/interrupts
– /proc/meminfo
• Adding new files to /proc is discouraged; it should
contain only information about processes
• Use sysfs or debugfs instead

debugfs
• A simple way to make kernel information
available to user space
– No strict one-value-per-file rule, unlike sysfs
– NOT a stable API for user space
– mount -t debugfs none /sys/kernel/debug

debugfs example
#include <linux/module.h>
#include <linux/debugfs.h>

#define BUF_LEN 200

static u64 intvalue, hexvalue;
static struct dentry *dirret;
static char _buf[BUF_LEN];

/* read handler: copy the kernel buffer out to user space */
static ssize_t myreader(struct file *fp, char __user *user_buffer,
			size_t count, loff_t *pos)
{
	char *kbuf = file_inode(fp)->i_private;

	return simple_read_from_buffer(user_buffer, count, pos,
				       kbuf, BUF_LEN);
}

/* write handler: copy user data into the kernel buffer */
static ssize_t mywriter(struct file *fp, const char __user *user_buffer,
			size_t count, loff_t *pos)
{
	char *kbuf = file_inode(fp)->i_private;

	return simple_write_to_buffer(kbuf, BUF_LEN, pos,
				      user_buffer, count);
}

static const struct file_operations fops_debug = {
	.owner = THIS_MODULE,
	.read = myreader,
	.write = mywriter,
};

static int __init init_debug(void)
{
	/* create a directory in /sys/kernel/debug */
	dirret = debugfs_create_dir("mydebug", NULL);
	if (IS_ERR_OR_NULL(dirret))
		return -ENODEV;

	/* create a file in the above directory;
	   this requires read and write file operations */
	debugfs_create_file("text", 0644, dirret, _buf, &fops_debug);

	/* create a file backed by a u64, shown in decimal */
	debugfs_create_u64("number", 0644, dirret, &intvalue);

	/* create a file backed by a u64, shown in hexadecimal */
	debugfs_create_x64("hexnum", 0644, dirret, &hexvalue);

	return 0;
}

static void __exit exit_debug(void)
{
	/* remove the mydebug directory recursively */
	debugfs_remove_recursive(dirret);
}

module_init(init_debug);
module_exit(exit_debug);
/* the debugfs API is exported to GPL-compatible modules only */
MODULE_LICENSE("GPL");

strace: system call trace
• Intercepts and records
– system calls issued by a process
– signals a process received
• Where to use
– Gain an in-depth understanding of the exact behavior of a program
– Debug the exact arguments or system calls a program issues
– When you don’t have access to the source code
• Syntax
– strace [option] <command [args]>
• Common options
– -c -- count time, calls, and errors for each syscall and report summary
– -f -- follow forks
– -T -- print time spent in each syscall
– -e expr -- a qualifying expression: option=[!]all or option=[!]val1[,val2]...
(options: trace, abbrev, verbose, raw, signal, read, or write)

strace output example
execve("/bin/dmesg", ["dmesg"], [/* 22 vars */]) = 0
...
syslog(0x3, 0x95d3858, 0x4008) = 16384
write(1, "amily 2\nIP: routing cache hash t"..., 4096amily
write(1, "to accept 2 bytes to c1bd7f9e fr"...,
...
munmap(0xb7d6b000, 4096) = 0
exit_group(0) = ?
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
92.75 0.013263 35 374 write
5.02 0.000718 718 1 syslog
0.51 0.000073 18 4 1 open
0.47 0.000067 34 2 munmap
0.41 0.000058 12 5 old_mmap
0.34 0.000048 24 2 mmap2
0.11 0.000016 4 4 fstat64
0.10 0.000015 15 1 read
0.10 0.000015 8 2 mprotect
0.08 0.000012 3 4 brk
0.04 0.000006 2 3 close
0.04 0.000006 6 1 uname
0.02 0.000003 3 1 set_thread_area
------ ----------- ----------- --------- --------- ----------------
100.00 0.014300 404 1 total

Kernel oops
• Raised when the kernel detects a bug in itself
– Fault: the kernel kills the faulting process and tries to continue
• Some locks and data structures may not be released
properly; the system cannot be trusted anymore
– Panic: the system halts, usually when the fault occurs in interrupt context or in
the idle or init task, where the kernel decides it cannot recover
• Oops message contains
– Error message
– Contents of registers
– Stack dump
– Function call trace
• Enable CONFIG_KALLSYMS at kernel
configuration to have symbolic call trace
(otherwise all you see are binary addresses)

Kernel Oops Example
• Code below will trigger an oops
ssize_t faulty_write(struct file *filp, const char __user *buf,
		     size_t count, loff_t *pos)
{
	/* make a simple fault by dereferencing a NULL pointer */
	*(int *)0 = 0;
	return 0;
}

struct file_operations faulty_fops = {
	.read = faulty_read,
	.write = faulty_write,
	.owner = THIS_MODULE
};

Kernel Oops Example
Unable to handle kernel NULL pointer dereference at virtual address 00000000
Internal error: Oops: 817 [#1] SMP ARM
Modules linked in: faulty(O) bnep hci_uart btbcm bluetooth brcmfmac brcmutil
CPU: 1 PID: 835 Comm: bash Tainted: G O 4.4.21-v7+ #911
task: b6a605c0 ti: b6ae8000 task.ti: b6ae8000
PC is at faulty_write+0x18/0x20 [faulty]
pc : [<7f33c018>] lr : [<8015736c>] sp : b6ae9ed0 ip : b6ae9ee0 fp : b6ae9edc
r10: 00000000 r9 : b6ae8000 r8 : 8000fd08
r7 : b6ae9f80 r6 : 01493c08 r5 : b6ae9f80 r4 : b93953c0
r3 : b6ae9f80 r2 : 00000002 r1 : 01493c08 r0 : 00000000
Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
Control: 10c5383d Table: 36b3806a DAC: 00000055
Process bash (pid: 835, stack limit = 0xb6ae8210)
Stack: (0xb6ae9ed0 to 0xb6aea000)
9ec0: b6ae9f4c b6ae9ee0 8015736c 7f33c00c
9ee0: 00000000 0000000a b934f600 80174fc0 b6ae9f3c b6ae9f00 80174fc0 805b66fc
9f00: b6ae9f3c 801559f8 00000000 80157c34 00000000 00000000 b6ae9f44 b6ae9f28
9f20: 80155a0c 80159158 b93953c0 b93953c0 00000002 01493c08 b6ae9f80 8000fd08
9f40: b6ae9f7c b6ae9f50 80157c64 80157344 80155a0c 801752e8 b93953c0 b93953c0
9f60: 00000002 01493c08 8000fd08 b6ae8000 b6ae9fa4 b6ae9f80 801585d4 80157bd0
[<7f33c018>] (faulty_write [faulty]) from [<8015736c>] (__vfs_write+0x34/0xe8)
[<8015736c>] (__vfs_write) from [<80157c64>] (vfs_write+0xa0/0x1a8)
[<80157c64>] (vfs_write) from [<801585d4>] (SyS_write+0x54/0xb0)
[<801585d4>] (SyS_write) from [<8000fb40>] (ret_fast_syscall+0x0/0x1c)
Code: e24cb004 e52de004 e8bd4000 e3a00000 (e5800000)
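
The "PC is at faulty_write+0x18/0x20" line reads as symbol + offset / function size, and the pc value 7f33c018 is the function's start address plus that offset. The following user-space sketch, with hypothetical helper names, splits such a location string and redoes the arithmetic.

```c
#include <stdio.h>

/*
 * Split a location of the form "symbol+0xOFF/0xSIZE".
 * sym must have room for at least 64 bytes.
 * Returns 1 on success, 0 if the string does not match.
 */
int parse_oops_location(const char *s, char *sym,
                        unsigned long *off, unsigned long *size)
{
    return sscanf(s, "%63[^+]+%lx/%lx", sym, off, size) == 3;
}

/* Faulting address = address where the function starts + offset. */
unsigned long fault_address(unsigned long func_start, unsigned long off)
{
    return func_start + off;
}
```

For the oops above, faulty_write starts at 0x7f33c000, so offset 0x18 gives 0x7f33c018, which matches the pc register; the offset is also below the function size 0x20, as it must be.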

Calling convention
• A low-level scheme for how subroutines
receive parameters from their caller and
how they return a result
• ARM 32 register allocation:
Register  | Use                             | Comment
----------|---------------------------------|------------------------
r15       | Program counter                 |
r14       | Link register                   | Used by BL instruction
r13       | Stack pointer                   | Must be 8-byte aligned
r12       | For intra-procedure calls       |
r4 to r11 | For local variables             | Callee saved
r0 to r3  | For arguments and return values | Caller saved

decodecode
• A script for disassembling oops code
pi@raspberrypi:~/linux $ dmesg | scripts/decodecode
[ 80.573075] Code: e24cb004 e52de004 e8bd4000 e3a00000 (e5800000)
All code
========
0: e24cb004 sub fp, ip, #4
4: e52de004 push {lr} ; (str lr, [sp, #-4]!)
8: e8bd4000 ldmfd sp!, {lr}
c: e3a00000 mov r0, #0
10:* e5800000 str r0, [r0] <-- trapping instruction
Code starting with the faulting instruction
===========================================
0: e5800000 str r0, [r0]

Finding oops code with GDB
• Module should be compiled with “-g”
– Add “ccflags-y := -g” to module’s Makefile
pi@raspberrypi:~/sunplus/oops $ cat /proc/modules
faulty 1367 0 - Live 0x7f33c000 (O)
bnep 10340 2 - Live 0x7f335000
...
pi@raspberrypi:~/sunplus/oops $ gdb
GNU gdb (Raspbian 7.7.1+dfsg-5) 7.7.1
(gdb) add-symbol-file faulty.ko 0x7f33c000
add symbol table from file "faulty.ko" at
.text_addr = 0x7f33c000
(y or n) y
Reading symbols from faulty.ko...done.
(gdb) list *0x7f33c018
0x7f33c018 is in faulty_write (/home/pi/sunplus/oops/faulty.c:51).
46
47 ssize_t faulty_write (struct file *filp, const char __user *buf,
size_t count,
48 loff_t *pos)
49 {
50 /* make a simple fault by dereferencing a NULL pointer */
51 *(int *)0 = 0;
52 return 0;
53 }

gdb – observe kernel variables
• gdb can observe variables in the kernel
• How to use?
– gdb /usr/src/linux/vmlinux /proc/kcore
– p jiffies /* print the value of jiffies variable */
– p jiffies /* you get the same value, since gdb caches values read
from the core file */
– core-file /proc/kcore /* flush gdb cache */
– p jiffies /* you get a different value of jiffies */
• vmlinux is the name of the uncompressed ELF kernel
executable, not bzImage
• kcore represents the kernel executable in the format of a
core file
• Disadvantage
– Read-only access to the kernel

Introduction of KGDB and KDB
● The Linux kernel has two different debugger front ends
(kdb and kgdb) which interface to the debug core
● KDB
– Used on a system console or serial console
– Not a source-level debugger; aimed at simple
analysis or diagnosis
– Functions
● Data: read/write memory, registers
● Linux: process lists, backtrace, dmesg
● Control: set breakpoints, single-step instructions

KGDB
● Source-level debugger, used with GDB to debug a
Linux kernel
● Two machines (physical or virtual) are required for
using KGDB
– They communicate via a network or serial connection
– The target machine runs the kernel to be debugged
– The development machine runs an instance of GDB against
the vmlinux file, which contains the symbols

KGDB Kernel Configuration (1)
● CONFIG_DEBUG_INFO=y
– Required by GDB for source-level debugging. This
adds debug symbols to the kernel and modules (gcc -g)
● CONFIG_KALLSYMS=y
– Required by KDB to access symbols by name
● CONFIG_FRAME_POINTER=y
– Saves frame information in registers or on the stack so that GDB can
construct stack backtraces more accurately
● CONFIG_DEBUG_RODATA=n
– When this option is enabled, page tables disallow writes to kernel
read-only data and software breakpoints cannot be used, so leave it disabled
KGDB Kernel Configuration (2)
● CONFIG_EXPERIMENTAL=y
● CONFIG_KGDB=y
● CONFIG_KGDB_SERIAL_CONSOLE=y
– kgdboc is a KGDB I/O driver for using KGDB/KDB over a
serial console
● CONFIG_SERIAL_8250=y
– Driver for standard serial ports
● CONFIG_SERIAL_8250_CONSOLE=y
– Allows the use of a serial port as the system console

KDB Kernel Configuration
● KGDB must be enabled before KDB can be
enabled. To use KDB on a serial console, kgdboc
and a serial port driver are also needed
● CONFIG_KGDB_KDB=y
– Includes the kdb front end for kgdb
● CONFIG_KDB_KEYBOARD=y
– KDB can use a PS/2 type keyboard as an input device

Kernel Parameters - kgdboc
● kgdboc=[kms][[,]kbd][[,]serial_device][,baud]
– Designed to work with a single serial port which is used
for your primary console and for kernel debugging
– kms (kernel mode setting) integration allows entering
kdb on a graphics console
– Can be configured in kernel boot parameters or at
runtime with sysfs
– Does not support interrupting the target via the gdb
remote protocol. You must manually send a sysrq-g
● Enable / Disable kgdboc
– echo ttyS0,115200 > /sys/module/kgdboc/parameters/kgdboc
– echo "" > /sys/module/kgdboc/parameters/kgdboc

Kernel Parameters - kgdbwait
● Makes the kernel stop during boot, as early as the I/O driver
allows, and wait for a debugger connection
● Useful for debugging kernel initialization
● Note
– A KGDB I/O driver must be compiled into the kernel, and
kgdbwait should always follow the parameter for the
KGDB I/O driver on the kernel command line
● Example
– kgdboc=ttyS0,115200 kgdbwait

Using KDB on serial port
● Configure the I/O driver
– Boot the kernel with kgdboc parameters or
– Configure kgdboc via sysfs
● Enter the kernel debugger manually by sending a
sysrq-g, or wait for an oops or fault
– echo g > /proc/sysrq-trigger
– Minicom: Ctrl-a, f, g
– Telnet: Ctrl-], send break<RET>, g
● At the KDB prompt, enter “help” to see a list of
commands, “go” to resume kernel execution

Some KDB commands
Command Usage Description
----------------------------------------------------------
md <vaddr> Display Memory Contents
mm <vaddr> <contents> Modify Memory Contents
go [<vaddr>] Continue Execution
rd Display Registers
rm <reg> <contents> Modify Registers
bt [<vaddr>] Stack traceback
help Display Help Message
kgdb Enter kgdb mode
ps [<flags>|A] Display active task list
pid <pidnum> Switch to another task
lsmod List loaded kernel modules
dmesg [lines] Display syslog buffer
kill <-signal> <pid> Send a signal to a process
summary Summarize the system
bp [<vaddr>] Set/Display breakpoints
ss Single Step

Using KGDB and GDB (1)
● Configure kgdboc
– kgdb, like kdb, will only hook up to the kernel trap
hooks if a KGDB I/O driver is loaded and configured
● Stop kernel execution
– Send a sysrq-g; if you see a kdb prompt, enter “kgdb”
– or you can use kgdbwait for debugging kernel boot
● Connect from gdb
– Serial port (newer gdb versions use "set serial baud" instead of "set remotebaud"):
$ gdb ./vmlinux
(gdb) set remotebaud 115200
(gdb) target remote /dev/ttyS0
– TCP port:
$ gdb ./vmlinux
(gdb) target remote 192.168.1.99:1234

Using KGDB and GDB (2)
● Reminder
– If you “continue” in gdb, and need to "break in" again,
you need to issue another sysrq-g
– You can put a breakpoint at sys_sync and then run
"sync" from a shell to break into the debugger

Kernel profiling with perf
• perf is a command-line profiling tool based on
perf_events kernel interface
– It’s event-based sampling.
When a PMU counter overflows, a sample is recorded.
usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]
The most commonly used perf commands are:
annotate Read perf.data (created by perf record) and display annotated code
archive Create archive with object files with build-ids found in perf.data file
data Data file related processing
diff Read perf.data files and display the differential profile
evlist List the event names in a perf.data file
kmem Tool to trace/measure kernel memory properties
list List all symbolic event types
lock Analyze lock events
mem Profile memory accesses
record Run a command and record its profile into perf.data
report Read perf.data (created by perf record) and display the profile
sched Tool to trace/measure scheduler properties (latencies)
script Read perf.data (created by perf record) and display trace output
stat Run a command and gather performance counter statistics
timechart Tool to visualize total system behavior during a workload
top System profiling tool.
trace strace inspired tool
probe Define new dynamic tracepoints

Use perf_events for CPU profiling
• Flame Graphs visualize profiled code
$ git clone --depth 1 https://github.com/brendangregg/FlameGraph
$ sudo perf record -F 99 -a -g -- sleep 30
$ perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > perf.svg

Example of perf report
pi@raspberrypi:~/sunplus $ sudo perf record -g -a sleep 10
pi@raspberrypi:~/sunplus $ sudo perf report
Samples: 5K of event 'cycles:ppp', Event count (approx.): 184814613
Children Self Command Shared Object Symbol
+ 83.86% 1.97% swapper [kernel.kallsyms] [k] cpu_startup_entry
+ 70.22% 0.00% swapper [kernel.kallsyms] [k] secondary_start_kernel
+ 70.22% 0.00% swapper [unknown] [k] 0x000095ac
+ 67.09% 0.45% swapper [kernel.kallsyms] [k] default_idle_call
+ 66.11% 61.30% swapper [kernel.kallsyms] [k] arch_cpu_idle
...
pi@raspberrypi:~/sunplus $ sudo perf kmem record
^C[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 0.199 MB perf.data (814 samples) ]
pi@raspberrypi:~/sunplus $ sudo perf kmem stat --caller
Failed to read max nodes, using default of 8
---------------------------------------------------------------------------------------------------------
Callsite | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag
---------------------------------------------------------------------------------------------------------
kthread_create_on_node+5c | 64/64 | 28/28 | 1 | 0 | 56.250%
bcm2835_dma_create_cb_chain+54 | 832/277 | 560/186 | 3 | 3 | 32.692%
alloc_worker+30 | 128/128 | 88/88 | 1 | 0 | 31.250%
alloc_skb_with_frags+58 | 512/512 | 384/384 | 1 | 0 | 25.000%
...
SUMMARY (SLAB allocator)
========================
Total bytes requested: 419,528
Total bytes allocated: 420,216
Total bytes wasted on internal fragmentation: 688
Internal fragmentation: 0.163725%
Cross CPU allocations: 0/326
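
The Frag column and the summary percentage follow from the other two numbers: the share of allocated bytes that was not actually requested. A one-function sketch of that arithmetic (the function name is made up for this example):

```c
/*
 * Internal fragmentation in percent: the part of the allocated bytes
 * that the caller did not request.
 */
double internal_frag_percent(unsigned long allocated, unsigned long requested)
{
    if (allocated == 0)
        return 0.0;
    return 100.0 * (double)(allocated - requested) / (double)allocated;
}
```

For kthread_create_on_node, 64 bytes allocated for 28 requested gives 36/64 = 56.25%, matching the table; the summary line is the same ratio over the totals, 688/420,216.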

ftrace
• Useful for event tracing, analyzing latencies and performance issues
• The sysctl ftrace_enabled is a global on/off switch for the function tracer. It is enabled by default
– To disable: echo 0 > /proc/sys/kernel/ftrace_enabled
• Summary of /sys/kernel/debug/tracing
Filename            | Description
--------------------|------------------------------------------------------------
current_tracer      | Set or display the tracer that is currently configured
available_tracers   | Tracers listed here can be configured by echoing their name into current_tracer
tracing_on          | Enables or disables writing to the ring buffer (tracing overhead may still occur)
trace               | Output of the trace in a human-readable format
tracing_max_latency | Some tracers record the maximum latency, for example the time interrupts are disabled
tracing_thresh      | Latency tracers record a trace whenever the latency is greater than the number in this file (in microseconds)
set_ftrace_pid      | Have the function tracer trace only a single thread
set_graph_function  | Set a "trigger" function where tracing should start with the function graph tracer
stack_trace         | The stack backtrace of the largest stack encountered while the stack tracer is activated
trace_marker        | Very useful for synchronizing user space with events happening in the kernel. Strings written to this file are written into the ftrace buffer

List of tracers
Name of tracer | Description
---------------|------------------------------------------------------------
function       | Function call tracer to trace all kernel functions
function_graph | Traces both entry and exit of functions, which makes it possible to draw a graph of function calls similar to C source code
irqsoff        | Traces the areas that disable interrupts and saves the trace with the longest max latency. See tracing_max_latency
preemptoff     | Traces and records the amount of time for which preemption is disabled
preemptirqsoff | Traces and records the largest time for which irqs and/or preemption is disabled
wakeup         | Traces and records the max latency that it takes for the highest priority task to get scheduled after it has been woken up
wakeup_rt      | Traces and records the max latency that it takes for just RT tasks
nop            | To remove all tracers from tracing, simply echo "nop" into current_tracer

Example of function tracer
# echo SyS_nanosleep hrtimer_interrupt > set_ftrace_filter
# echo function > current_tracer
# echo 1 > tracing_on
# usleep 1
# cat trace
# tracer: function
#
# entries-in-buffer/entries-written: 5/5 #P:4
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
usleep-2665 [001] .... 4186.475355: sys_nanosleep <-system_call_fastpath
<idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
usleep-2665 [001] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
Note: the function tracer uses ring buffers to store
entries. The newest data may overwrite the
oldest data. Sometimes using echo to stop the
trace is not sufficient, because the tracing could
have overwritten the data that you wanted to
record. For this reason, it is sometimes better
to disable tracing directly from a program.
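
Disabling tracing from a program amounts to writing "0" to the tracing_on file at the point of interest. A minimal user-space sketch, assuming debugfs is mounted at /sys/kernel/debug and the process has permission to write there; the helper names are hypothetical.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a string to a control file. Returns 0 on success, -1 on failure. */
int write_control_file(const char *path, const char *value)
{
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    size_t len = strlen(value);
    ssize_t n = write(fd, value, len);

    close(fd);
    return (n == (ssize_t)len) ? 0 : -1;
}

/*
 * Stop writing to the ring buffer. Call this right after the code of
 * interest so that later events cannot overwrite the entries you want.
 * The path assumes the usual debugfs mount point.
 */
int tracing_off(void)
{
    return write_control_file("/sys/kernel/debug/tracing/tracing_on", "0");
}
```

The same helper can write a string to trace_marker in the same directory to leave a user-space annotation in the trace.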

Example of function-graph tracer
• This tracer can also measure execution time of a function
• To trace only one function and all of its children:
# echo __do_fault > set_graph_function
# echo function_graph > current_tracer
# usleep 1
# cat trace
#
# tracer: function_graph
#
# CPU DURATION FUNCTION CALLS
# | | | | | | |
0) | __do_fault() {
0) | filemap_fault() {
0) 0.408 us | find_get_page();
0) 0.085 us | _cond_resched();
0) 2.462 us | }
0) 0.087 us | _raw_spin_lock();
0) 0.104 us | add_mm_counter_fast();
0) 0.106 us | page_add_file_rmap();
0) 0.090 us | _raw_spin_unlock();
0) | unlock_page() {
0) 0.103 us | page_waitqueue();
0) 0.146 us | __wake_up_bit();
0) 1.508 us | }
0) 8.403 us | }

Example of irqsoff tracer
# tracer: irqsoff
#
# irqsoff latency trace v1.1.5 on 3.8.0-test+
# --------------------------------------------------------------------
# latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
# -----------------
# | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
# -----------------
# => started at: run_timer_softirq
# => ended at: run_timer_softirq
#
#
# _------=> CPU#
# / _-----=> irqs-off
# | / _----=> need-resched
# || / _---=> hardirq/softirq
# ||| / _--=> preempt-depth
# |||| / delay
# cmd pid ||||| time | caller
# / ||||| | /
<idle>-0 0d.s2 0us+: _raw_spin_lock_irq <-run_timer_softirq
<idle>-0 0dNs3 17us : _raw_spin_unlock_irq <-run_timer_softirq
<idle>-0 0dNs3 17us+: trace_hardirqs_on <-run_timer_softirq
<idle>-0 0dNs3 25us : <stack trace>
=> _raw_spin_unlock_irq
=> run_timer_softirq
=> __do_softirq
...
# echo 0 > options/function-trace
# echo irqsoff > current_tracer
# echo 0 > tracing_max_latency
# ls -ltr
[...]
# cat trace
Note the above example had function-trace not set. If we set
function-trace, we get a much larger output

Example of stack tracer
• ftrace makes it convenient to check the stack size at
every function call
# echo 1 > /proc/sys/kernel/stack_tracer_enabled
After running it for a few minutes, the output looks like:
# cat stack_max_size
2928
# cat stack_trace
Depth Size Location (18 entries)
----- ---- --------
0) 2928 224 update_sd_lb_stats+0xbc/0x4ac
1) 2704 160 find_busiest_group+0x31/0x1f1
2) 2544 256 load_balance+0xd9/0x662
3) 2288 80 idle_balance+0xbb/0x130
4) 2208 128 __schedule+0x26e/0x5b9
5) 2080 16 schedule+0x64/0x66
6) 2064 128 schedule_timeout+0x34/0xe0
7) 1936 112 wait_for_common+0x97/0xf1
8) 1824 16 wait_for_completion+0x1d/0x1f
9) 1808 128 flush_work+0xfe/0x119
10) 1680 16 tty_flush_to_ldisc+0x1e/0x20
11) 1664 48 input_available_p+0x1d/0x5c
12) 1616 48 n_tty_poll+0x6d/0x134
13) 1568 64 tty_poll+0x64/0x7f
14) 1504 880 do_select+0x31e/0x511
15) 624 400 core_sys_select+0x177/0x216
16) 224 96 sys_select+0x91/0xb9
17) 128 128 system_call_fastpath+0x16/0x1b

ftrace homework
• Read https://www.kernel.org/doc/Documentation/trace/events.txt
This document is about event tracing (static tracepoints)
• perf-tools is a collection of performance analysis tools for Linux
ftrace and perf_events. Try to find a good use of it in your work. You
can download it from https://github.com/brendangregg/perf-tools.git
• Write a small program using ftrace to track the number of context
switches per second for each CPU.
$ sudo ./ftrace_ctxt_switches.py
...
Duration (sec): 61.386, Context switches (per sec): CPU0: 1130 ( 18) CPU1: 5875 ( 96) CPU2: 183 ( 3) CPU3: 230 ( 4)
Duration (sec): 63.784, Context switches (per sec): CPU0: 1138 ( 18) CPU1: 6028 ( 95) CPU2: 188 ( 3) CPU3: 230 ( 4)
...
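
One possible building block for the last exercise is a helper that pulls the CPU number out of a trace line and counts matching events per CPU. This is only a starting sketch: it assumes lines in the "TASK-PID [CPU] flags TIMESTAMP: ..." layout shown in the function tracer example and that context switches appear as lines containing "sched_switch"; the function names and MAX_CPUS are made up here.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_CPUS 64   /* arbitrary bound for this sketch */

/* Extract the CPU number from the "[001]" field; -1 if absent or malformed. */
int trace_line_cpu(const char *line)
{
    const char *open = strchr(line, '[');
    char *end;
    long cpu;

    if (!open)
        return -1;
    cpu = strtol(open + 1, &end, 10);
    if (end == open + 1 || *end != ']' || cpu < 0 || cpu >= MAX_CPUS)
        return -1;
    return (int)cpu;
}

/* Count the line toward its CPU if it mentions the given event name. */
void count_event(const char *line, const char *event,
                 unsigned long counts[MAX_CPUS])
{
    int cpu = trace_line_cpu(line);

    if (cpu >= 0 && strstr(line, event))
        counts[cpu]++;
}
```

Feeding every line of the trace through count_event() and dividing by the elapsed time gives a per-CPU rate like the one in the sample output above.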

References
• Linux Device Drivers, 3rd Edition, Jonathan
Corbet
• Linux kernel source,
http://lxr.free-electrons.com
• Choose a Linux tracer, Brendan Gregg
– http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html
• KDB and KGDB kernel documentation
– http://kernel.org/pub/linux/kernel/people/jwessel/kdb/
