
Linux kernel debugging


  1. Kernel Debugging, Hao-Ran Liu
  2. Choices of debugging tools
     • Add debug code, recompile and run
       – printk, but the bug may disappear if it is timing sensitive and data is written to a serial console
       – Set the console log level to 0 and use dmesg instead
     • Patch code at runtime to print or gather data
       – Ftrace, Kprobes
     • Patch code at runtime to stop the kernel and analyze
       – KDB, KGDB
     • Run the kernel under the control of a VM such as QEMU or VirtualBox
  3. printk()
     • Kernel-space equivalent of printf()
     • Each kernel message is prefixed with a string representing its loglevel n
       – "<n>Hello world!"
     • The loglevel determines the severity of the message
  4. Printk loglevel
     • Messages with a level lower than console_loglevel are shown on the console
     • console_loglevel can be changed via
       – dmesg -n level
       – the syslog system call
       – echo n > /proc/sys/kernel/printk

     Name          String  Meaning                                                            Alias macro
     KERN_EMERG    "0"     Emergency messages; system is about to crash or is unstable        pr_emerg()
     KERN_ALERT    "1"     Something bad happened and action must be taken immediately        pr_alert()
     KERN_CRIT     "2"     A serious hardware/software failure                                pr_crit()
     KERN_ERR      "3"     Often used by drivers to indicate difficulties with the hardware   pr_err()
     KERN_WARNING  "4"     Nothing serious by itself, but might indicate problems             pr_warning(), pr_warn()
     KERN_NOTICE   "5"     Nothing serious. Often used to report security events              pr_notice()
     KERN_INFO     "6"     Informational message, e.g. startup info at driver initialization  pr_info()
     KERN_DEBUG    "7"     Debug messages                                                     pr_debug() if DEBUG is defined
     KERN_DEFAULT  "d"     The default kernel loglevel
     KERN_CONT     ""      "Continued" line after a line that had no enclosing \n             pr_cont()
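The console filtering rule on this slide can be sketched in a few lines of user-space Python. This is only an illustration of the comparison, not kernel code: the function names and the `default_level=4` fallback are assumptions, and the sketch handles the textual "<n>" prefix shown on the previous slide.

```python
import re

# Loglevel names from the table above, indexed by their numeric value.
LEVEL_NAMES = ["KERN_EMERG", "KERN_ALERT", "KERN_CRIT", "KERN_ERR",
               "KERN_WARNING", "KERN_NOTICE", "KERN_INFO", "KERN_DEBUG"]


def split_loglevel(raw, default_level=4):
    """Split '<n>text' into (level, text).

    default_level is an assumed fallback for messages without a prefix.
    """
    m = re.match(r"<([0-7])>", raw)
    if m is None:
        return default_level, raw
    return int(m.group(1)), raw[m.end():]


def shown_on_console(level, console_loglevel):
    """A message reaches the console only if its level is numerically lower."""
    return level < console_loglevel


def parse_printk_sysctl(text):
    """The first field of /proc/sys/kernel/printk is the console loglevel."""
    return int(text.split()[0])
```

For example, with console_loglevel set to 4, a KERN_ERR message (3) is printed on the console while a KERN_INFO message (6) only lands in the log buffer.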
  5. Kernel log buffer
     • The kernel log buffer stores kernel messages
     • It is a circular buffer. Old messages are overwritten when the buffer is full
       – Use the klogd daemon to keep old messages in a file
       – The log buffer size is configurable
     • The kernel log buffer can be manipulated via the syslog system call
       – or the dmesg command-line tool
  6. syslog system call
     • int syslog(int type, char *bufp, int len)

     /*
      * Commands to sys_syslog:
      *
      *      0 -- Close the log.  Currently a NOP.
      *      1 -- Open the log. Currently a NOP.
      *      2 -- Read from the log (wait until the buffer is nonempty)
      *      3 -- Read all messages remaining in the ring buffer
      *      4 -- Read and clear all messages remaining in the ring buffer
      *      5 -- Clear ring buffer.
      *      6 -- Disable printk to console
      *      7 -- Enable printk to console
      *      8 -- Set level of messages printed to console
      *      9 -- Return number of unread characters in the log buffer
      */
  7. Klogd and syslogd
     • Klogd is the "kernel log daemon". It receives kernel messages via the syslog system call (or /proc/kmsg) and redirects them to syslogd
     • syslogd differentiates messages by facility.priority (e.g. LOG_KERN.LOG_ERR) and consults /etc/syslog.conf to decide how to deal with them (discard or save in a file)

     [Figure: message flow. Kernel space: kernel log buffer, exposed through /proc/kmsg and sys_syslog(). User space: klogd, syslogd, log files, and other daemons using the C library calls openlog(), closelog(), syslog().]
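The facility.priority pair that syslogd matches on is carried as a single integer: the facility code with the severity in the low three bits, as in the C library's <syslog.h>. A minimal sketch using the constants from Python's standard syslog module (the helper names encode_pri and decode_pri are made up for this example):

```python
import syslog


def encode_pri(facility, severity):
    """Combine a facility code (already shifted, e.g. LOG_DAEMON) and a severity."""
    return facility | severity


def decode_pri(pri):
    """Split a combined value back into (facility code, severity)."""
    return pri & ~0x7, pri & 0x7
```

Since LOG_KERN is 0, the LOG_KERN.LOG_ERR example from the slide encodes to the same number as the printk loglevel KERN_ERR, namely 3.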
  8. Use printk macros
     • Do not remove debug printk calls
       – you may need them later to debug another related issue
     • Undefine DEBUG to remove debug messages in a production kernel
     • For drivers, use dev_dbg() instead
  9. Limit the rate of your printk
     • printk may overwhelm the console if it is called
       – in code that gets executed very often
       – in a frequently triggered IRQ handler (e.g. timer)
     • printk_ratelimit() returns 0 when the message to be printed should be suppressed
     • printk_once()
       – no matter how often you call it, it prints once and never again

     if (printk_ratelimit())
             printk(KERN_NOTICE "The printer is still on fire\n");
  10. printk_ratelimit() implementation
      • These two variables can be modified via /proc/sys/kernel/

      /* minimum time in jiffies between messages */
      int printk_ratelimit_jiffies = 5*HZ;

      /* number of messages we send before ratelimiting */
      int printk_ratelimit_burst = 10;

      int printk_ratelimit(void)
      {
              return __printk_ratelimit(printk_ratelimit_jiffies,
                                        printk_ratelimit_burst);
      }
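The effect of the two tunables can be modelled in user space. The sketch below is a simplified model of interval-plus-burst limiting, not the kernel's source: the class name RateLimit, its fields, and the explicit `now` argument (standing in for jiffies) are all assumptions made for illustration.

```python
class RateLimit:
    """Allow at most `burst` messages per `interval` ticks (simplified model)."""

    def __init__(self, interval, burst):
        self.interval = interval
        self.burst = burst
        self.begin = None      # start of the current window
        self.printed = 0       # messages allowed in this window
        self.missed = 0        # messages suppressed in this window

    def allow(self, now):
        """Return True if a message at time `now` may be printed."""
        if self.begin is None:
            self.begin = now
        if now > self.begin + self.interval:
            # The window has expired: start a new one.
            self.begin = now
            self.printed = 0
            self.missed = 0
        if self.printed < self.burst:
            self.printed += 1
            return True
        self.missed += 1
        return False
```

With interval = 5*HZ and burst = 10 as on the slide, the first ten calls inside a window return True and every further call returns False until the window has passed.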
  11. /proc file system
      • A software-created pseudo file system
      • Contains a lot of system information, e.g.
        – /proc/<pid>/maps
        – /proc/sys/kernel/*
        – /proc/interrupts
        – /proc/meminfo
      • Adding new files to /proc is discouraged; it should contain only information about processes
      • You should use sysfs or debugfs instead
  12. debugfs
      • A simple way to make information available to user space
        – Unlike sysfs, which has strict one-value-per-file rules
        – NOT a stable API for user space
        – mount -t debugfs none /sys/kernel/debug
  13. debugfs example

      #include <linux/module.h>
      #include <linux/debugfs.h>

      #define len 200

      u64 intvalue, hexvalue;
      struct dentry *dirret, *fileret, *u64int, *u64hex;
      char _buf[len];

      /* copy from the kernel buffer stored in i_private to user space */
      static ssize_t myreader(struct file *fp, char __user *user_buffer,
                              size_t count, loff_t *pos)
      {
              char *kbuf = (char *)file_inode(fp)->i_private;

              return simple_read_from_buffer(user_buffer, count, pos, kbuf, len);
      }

      /* copy from user space into the kernel buffer stored in i_private */
      static ssize_t mywriter(struct file *fp, const char __user *user_buffer,
                              size_t count, loff_t *pos)
      {
              char *kbuf = (char *)file_inode(fp)->i_private;

              return simple_write_to_buffer(kbuf, len, pos, user_buffer, count);
      }

      static const struct file_operations fops_debug = {
              .read = myreader,
              .write = mywriter,
      };

      static int __init init_debug(void)
      {
              /* create a directory in /sys/kernel/debug */
              dirret = debugfs_create_dir("mydebug", NULL);
              if (IS_ERR_OR_NULL(dirret))
                      return -ENODEV;

              /* create a file in the above directory;
                 this requires read and write file operations */
              fileret = debugfs_create_file("text", 0644, dirret, _buf, &fops_debug);

              /* create a file which takes a 64-bit integer value */
              u64int = debugfs_create_u64("number", 0644, dirret, &intvalue);

              /* takes a hexadecimal value */
              u64hex = debugfs_create_x64("hexnum", 0644, dirret, &hexvalue);

              return 0;
      }

      static void __exit exit_debug(void)
      {
              /* remove mydebug dir recursively */
              debugfs_remove_recursive(dirret);
      }

      module_init(init_debug);
      module_exit(exit_debug);
      /* debugfs symbols are exported GPL-only, so the module needs a GPL license tag */
      MODULE_LICENSE("GPL");
  14. strace: system call trace
      • Intercepts and records
        – system calls issued by a process
        – signals a process received
      • Where to use
        – To gain an in-depth understanding of the exact behavior of a program
        – To debug the exact arguments or system calls a program issues
        – When you don't have access to the source code
      • Syntax
        – strace [option] <command [args]>
      • Common options
        – -c -- count time, calls, and errors for each syscall and report a summary
        – -f -- follow forks
        – -T -- print time spent in each syscall
        – -e expr -- a qualifying expression: option=[!]all or option=[!]val1[,val2]... (options: trace, abbrev, verbose, raw, signal, read, or write)
  15. strace output example

      execve("/bin/dmesg", ["dmesg"], [/* 22 vars */]) = 0
      ...
      syslog(0x3, 0x95d3858, 0x4008) = 16384
      write(1, "amily 2\nIP: routing cache hash t"..., 4096amily
      write(1, "to accept 2 bytes to c1bd7f9e fr"..., ...
      munmap(0xb7d6b000, 4096) = 0
      exit_group(0) = ?

      % time     seconds  usecs/call     calls    errors syscall
      ------ ----------- ----------- --------- --------- ----------------
       92.75    0.013263          35       374           write
        5.02    0.000718         718         1           syslog
        0.51    0.000073          18         4         1 open
        0.47    0.000067          34         2           munmap
        0.41    0.000058          12         5           old_mmap
        0.34    0.000048          24         2           mmap2
        0.11    0.000016           4         4           fstat64
        0.10    0.000015          15         1           read
        0.10    0.000015           8         2           mprotect
        0.08    0.000012           3         4           brk
        0.04    0.000006           2         3           close
        0.04    0.000006           6         1           uname
        0.02    0.000003           3         1           set_thread_area
      ------ ----------- ----------- --------- --------- ----------------
      100.00    0.014300                   404         1 total
  16. Kernel oops
      • Raised when the kernel detects a bug in itself
        – Fault: the kernel kills the faulting process and tries to continue
          • Some locks and data structures may not be released properly; the system cannot be trusted anymore
        – Panic: the system halts, usually in interrupt context or in the idle or init task, where the kernel thinks it cannot recover
      • An oops message contains
        – Error message
        – Contents of registers
        – Stack dump
        – Function call trace
      • Enable CONFIG_KALLSYMS at kernel configuration to get a symbolic call trace (otherwise all you see are raw addresses)
  17. Kernel Oops Example
      • The code below will trigger an oops

      ssize_t faulty_write (struct file *filp, const char __user *buf, size_t count,
                            loff_t *pos)
      {
              /* make a simple fault by dereferencing a NULL pointer */
              *(int *)0 = 0;
              return 0;
      }

      struct file_operations faulty_fops = {
              .read = faulty_read,
              .write = faulty_write,
              .owner = THIS_MODULE
      };
  18. Kernel Oops Example

      Unable to handle kernel NULL pointer dereference at virtual address 00000000
      Internal error: Oops: 817 [#1] SMP ARM
      Modules linked in: faulty(O) bnep hci_uart btbcm bluetooth brcmfmac brcmutil
      CPU: 1 PID: 835 Comm: bash Tainted: G O 4.4.21-v7+ #911
      task: b6a605c0 ti: b6ae8000 task.ti: b6ae8000
      PC is at faulty_write+0x18/0x20 [faulty]
      pc : [<7f33c018>] lr : [<8015736c>]
      sp : b6ae9ed0 ip : b6ae9ee0 fp : b6ae9edc
      r10: 00000000 r9 : b6ae8000 r8 : 8000fd08
      r7 : b6ae9f80 r6 : 01493c08 r5 : b6ae9f80 r4 : b93953c0
      r3 : b6ae9f80 r2 : 00000002 r1 : 01493c08 r0 : 00000000
      Flags: nzCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment user
      Control: 10c5383d Table: 36b3806a DAC: 00000055
      Process bash (pid: 835, stack limit = 0xb6ae8210)
      Stack: (0xb6ae9ed0 to 0xb6aea000)
      9ec0: b6ae9f4c b6ae9ee0 8015736c 7f33c00c
      9ee0: 00000000 0000000a b934f600 80174fc0 b6ae9f3c b6ae9f00 80174fc0 805b66fc
      9f00: b6ae9f3c 801559f8 00000000 80157c34 00000000 00000000 b6ae9f44 b6ae9f28
      9f20: 80155a0c 80159158 b93953c0 b93953c0 00000002 01493c08 b6ae9f80 8000fd08
      9f40: b6ae9f7c b6ae9f50 80157c64 80157344 80155a0c 801752e8 b93953c0 b93953c0
      9f60: 00000002 01493c08 8000fd08 b6ae8000 b6ae9fa4 b6ae9f80 801585d4 80157bd0
      [<7f33c018>] (faulty_write [faulty]) from [<8015736c>] (__vfs_write+0x34/0xe8)
      [<8015736c>] (__vfs_write) from [<80157c64>] (vfs_write+0xa0/0x1a8)
      [<80157c64>] (vfs_write) from [<801585d4>] (SyS_write+0x54/0xb0)
      [<801585d4>] (SyS_write) from [<8000fb40>] (ret_fast_syscall+0x0/0x1c)
      Code: e24cb004 e52de004 e8bd4000 e3a00000 (e5800000)
  19. Calling convention
      • A low-level scheme for how subroutines receive parameters from their caller and how they return a result
      • ARM 32 register allocation:

      Register    Use                               Comment
      r15         Program counter
      r14         Link register                     Used by the BL instruction
      r13         Stack pointer                     Must be 8-byte aligned
      r12         Intra-procedure call scratch
      r4 to r11   Local variables                   Callee saved
      r0 to r3    Arguments and return values       Caller saved
  20. ARM32 Calling convention
  21. decodecode
      • A script for disassembling oops code

      pi@raspberrypi:~/linux $ dmesg | scripts/decodecode
      [ 80.573075] Code: e24cb004 e52de004 e8bd4000 e3a00000 (e5800000)
      All code
      ========
         0:   e24cb004    sub   fp, ip, #4
         4:   e52de004    push  {lr}          ; (str lr, [sp, #-4]!)
         8:   e8bd4000    ldmfd sp!, {lr}
         c:   e3a00000    mov   r0, #0
        10:*  e5800000    str   r0, [r0]      <-- trapping instruction

      Code starting with the faulting instruction
      ===========================================
         0:   e5800000    str   r0, [r0]
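How the trapping instruction is located in the "Code:" line can be sketched as follows. This is not the decodecode script itself, only a small Python illustration of the marker convention visible above: the faulting word is the one wrapped in parentheses (the sketch also accepts angle brackets), and its byte offset is the total size of the words before it. The function name find_trapping is made up.

```python
def find_trapping(line):
    """Return (byte_offset, hex_word) of the marked word in an oops Code: line.

    Returns None if the line has no Code: section or no marked word.
    """
    _, _, rest = line.partition("Code:")
    offset = 0
    for tok in rest.split():
        marked = tok[0] in "(<" and tok[-1] in ")>"
        hexval = tok.strip("()<>")
        if marked:
            return offset, hexval
        # Two hex digits per byte: 8 digits is one 4-byte ARM instruction.
        offset += len(hexval) // 2
    return None
```

For the oops shown on slide 18 this yields offset 0x10 and the word e5800000, matching the line flagged with "10:*" in the disassembly.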
  22. Finding oops code with GDB
      • The module should be compiled with "-g"
        – Add "ccflags-y := -g" to the module's Makefile

      pi@raspberrypi:~/sunplus/oops $ cat /proc/modules
      faulty 1367 0 - Live 0x7f33c000 (O)
      bnep 10340 2 - Live 0x7f335000
      ...
      pi@raspberrypi:~/sunplus/oops $ gdb
      GNU gdb (Raspbian 7.7.1+dfsg-5) 7.7.1
      (gdb) add-symbol-file faulty.ko 0x7f33c000
      add symbol table from file "faulty.ko" at
              .text_addr = 0x7f33c000
      (y or n) y
      Reading symbols from faulty.ko...done.
      (gdb) list *0x7f33c018
      0x7f33c018 is in faulty_write (/home/pi/sunplus/oops/faulty.c:51).
      46
      47      ssize_t faulty_write (struct file *filp, const char __user *buf, size_t count,
      48                      loff_t *pos)
      49      {
      50              /* make a simple fault by dereferencing a NULL pointer */
      51              *(int *)0 = 0;
      52              return 0;
      53      }
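The address arithmetic behind this lookup (faulting PC minus the module load address from /proc/modules) can be sketched in Python. The helper names parse_proc_modules and locate are hypothetical, and the sketch assumes the field layout shown above: name, size, reference count, dependencies, state, load address.

```python
def parse_proc_modules(text):
    """Map module name -> (load address, size) from /proc/modules content."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip "..." or otherwise incomplete lines
        modules[fields[0]] = (int(fields[5], 16), int(fields[1]))
    return modules


def locate(pc, modules):
    """Return (module name, offset into module) for an address, or None."""
    for name, (base, size) in modules.items():
        if base <= pc < base + size:
            return name, pc - base
    return None
```

Applied to the pc value 0x7f33c018 from the oops, this gives module faulty at offset 0x18, which agrees with "PC is at faulty_write+0x18/0x20 [faulty]" because faulty_write sits at the start of the module text in this example.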
  23. gdb – observe kernel variables
      • gdb can observe variables in the kernel
      • How to use?
        – gdb /usr/src/linux/vmlinux /proc/kcore
        – p jiffies /* print the value of the jiffies variable */
        – p jiffies /* you get the same value, since gdb caches values read from the core file */
        – core-file /proc/kcore /* flush the gdb cache */
        – p jiffies /* you get a different value of jiffies */
      • vmlinux is the name of the uncompressed ELF kernel executable, not bzImage
      • kcore represents the kernel executable in the format of a core file
      • Disadvantage
        – Read-only access to the kernel
  24. Introduction of KGDB and KDB
      • The Linux kernel has two different debugger front ends (kdb and kgdb) which interface to the debug core
      • KDB
        – Used on a system console or serial console
        – Not a source-level debugger; aimed at simple analysis or diagnosis
        – Functions
          • Data: read/write memory, registers
          • Linux: process lists, backtrace, dmesg
          • Control: set breakpoints, single-step instructions
  25. KGDB
      • Source-level debugger, used with GDB to debug a Linux kernel
      • Two machines (physical or virtual) are required for using KGDB
        – They communicate via a network or serial connection
        – The target machine runs the kernel to be debugged
        – The development machine runs an instance of GDB against the vmlinux file, which contains the symbols
  26. KGDB Kernel Configuration (1)
      • CONFIG_DEBUG_INFO=y
        – Required by GDB for source-level debugging. This adds debug symbols to the kernel and modules (gcc -g)
      • CONFIG_KALLSYMS=y
        – Required by KDB to access symbols by name
      • CONFIG_FRAME_POINTER=y
        – Saves frame information in registers or on the stack to allow GDB to construct stack backtraces more accurately
      • CONFIG_DEBUG_RODATA=n
        – Page tables will disallow writes to kernel read-only data. If this is enabled, you cannot use software breakpoints
  27. KGDB Kernel Configuration (2)
      • CONFIG_EXPERIMENTAL=y
      • CONFIG_KGDB=y
      • CONFIG_KGDB_SERIAL_CONSOLE=y
        – kgdboc is a KGDB I/O driver for using KGDB/KDB over a serial console
      • CONFIG_SERIAL_8250=y
        – Driver for standard serial ports
      • CONFIG_SERIAL_8250_CONSOLE=y
        – Allows the use of a serial port as the system console
  28. KDB Kernel Configuration
      • KGDB must be enabled before KDB can be enabled. To use KDB on a serial console, kgdboc and a serial port driver are also needed
      • CONFIG_KGDB_KDB=y
        – Include the kdb front end for kgdb
      • CONFIG_KDB_KEYBOARD=y
        – KDB can use a PS/2-type keyboard as an input device
  29. Kernel Parameters - kgdboc
      • kgdboc=[kms][[,]kbd][[,]serial_device][,baud]
        – Designed to work with a single serial port which is used for your primary console and for kernel debugging
        – kms (kernel mode setting) integration allows entering kdb on a graphics console
        – Can be configured in the kernel boot parameters or at runtime with sysfs
        – Does not support interrupting the target via the gdb remote protocol. You must manually send a sysrq-g
      • Enable / disable kgdboc
        – echo ttyS0,115200 > /sys/module/kgdboc/parameters/kgdboc
        – echo "" > /sys/module/kgdboc/parameters/kgdboc
  30. Kernel Parameters - kgdbwait
      • Makes the kernel stop as early as the I/O driver supports and wait for a debugger connection during boot
      • Useful for debugging kernel initialization
      • Note
        – A KGDB I/O driver must be compiled into the kernel, and kgdbwait should always follow the KGDB I/O driver parameter on the kernel command line
      • Example
        – kgdboc=ttyS0,115200 kgdbwait
  31. Using KDB on serial port
      • Configure the I/O driver
        – Boot the kernel with kgdboc parameters, or
        – Configure kgdboc via sysfs
      • Enter the kernel debugger manually by sending a sysrq-g, or by waiting for an oops or fault
        – echo g > /proc/sysrq-trigger
        – Minicom: Ctrl-a, f, g
        – Telnet: Ctrl-], send break<RET>, g
      • At the KDB prompt, enter "help" to see a list of commands, or "go" to resume kernel execution
  32. Some KDB commands

      Command   Usage                  Description
      ----------------------------------------------------------
      md        <vaddr>                Display Memory Contents
      mm        <vaddr> <contents>     Modify Memory Contents
      go        [<vaddr>]              Continue Execution
      rd                               Display Registers
      rm        <reg> <contents>       Modify Registers
      bt        [<vaddr>]              Stack traceback
      help                             Display Help Message
      kgdb                             Enter kgdb mode
      ps        [<flags>|A]            Display active task list
      pid       <pidnum>               Switch to another task
      lsmod                            List loaded kernel modules
      dmesg     [lines]                Display syslog buffer
      kill      <-signal> <pid>        Send a signal to a process
      summary                          Summarize the system
      bp        [<vaddr>]              Set/Display breakpoints
      ss                               Single Step
  33. Screenshot of KDB with GDB
  34. Using KGDB and GDB (1)
      • Configure kgdboc
        – kgdb, like kdb, will only hook up to the kernel trap hooks if a KGDB I/O driver is loaded and configured
      • Stop kernel execution
        – Send a sysrq-g; if you see a kdb prompt, enter "kgdb"
        – or use kgdbwait for debugging kernel boot
      • Connect from gdb

      Serial port (newer gdb releases use "set serial baud" instead of "set remotebaud"):
      $ gdb ./vmlinux
      (gdb) set remotebaud 115200
      (gdb) target remote /dev/ttyS0

      TCP port:
      $ gdb ./vmlinux
      (gdb) target remote 192.168.1.99:1234
  35. Using KGDB and GDB (2)
      • Reminder
        – If you "continue" in gdb and need to "break in" again, you need to issue another sysrq-g
        – You can put a breakpoint at sys_sync and then run "sync" from a shell to break into the debugger
  36. Screenshot of KGDB and GDB
  37. Kernel profiling with perf
      • perf is a command-line profiling tool based on the perf_events kernel interface
        – It uses event-based sampling: when a PMU counter overflows, a sample is recorded

      usage: perf [--version] [--help] [OPTIONS] COMMAND [ARGS]

      The most commonly used perf commands are:
        annotate    Read perf.data (created by perf record) and display annotated code
        archive     Create archive with object files with build-ids found in perf.data file
        data        Data file related processing
        diff        Read perf.data files and display the differential profile
        evlist      List the event names in a perf.data file
        kmem        Tool to trace/measure kernel memory properties
        list        List all symbolic event types
        lock        Analyze lock events
        mem         Profile memory accesses
        record      Run a command and record its profile into perf.data
        report      Read perf.data (created by perf record) and display the profile
        sched       Tool to trace/measure scheduler properties (latencies)
        script      Read perf.data (created by perf record) and display trace output
        stat        Run a command and gather performance counter statistics
        timechart   Tool to visualize total system behavior during a workload
        top         System profiling tool.
        trace       strace inspired tool
        probe       Define new dynamic tracepoints
  38. Use perf_events for CPU profiling
      • Flame Graphs visualize profiled code

      $ git clone --depth 1 https://github.com/brendangregg/FlameGraph
      $ sudo perf record -F 99 -a -g -- sleep 30
      $ perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > perf.svg
  39. Example of perf report

      pi@raspberrypi:~/sunplus $ sudo perf record -g -a sleep 10
      pi@raspberrypi:~/sunplus $ sudo perf report
      Samples: 5K of event 'cycles:ppp', Event count (approx.): 184814613
        Children      Self  Command  Shared Object      Symbol
      +   83.86%     1.97%  swapper  [kernel.kallsyms]  [k] cpu_startup_entry
      +   70.22%     0.00%  swapper  [kernel.kallsyms]  [k] secondary_start_kernel
      +   70.22%     0.00%  swapper  [unknown]          [k] 0x000095ac
      +   67.09%     0.45%  swapper  [kernel.kallsyms]  [k] default_idle_call
      +   66.11%    61.30%  swapper  [kernel.kallsyms]  [k] arch_cpu_idle
      ...

      pi@raspberrypi:~/sunplus $ sudo perf kmem record
      ^C[ perf record: Woken up 1 times to write data ]
      [ perf record: Captured and wrote 0.199 MB perf.data (814 samples) ]
      pi@raspberrypi:~/sunplus $ sudo perf kmem stat --caller
      Failed to read max nodes, using default of 8
      ---------------------------------------------------------------------------------------------
       Callsite                        | Total_alloc/Per | Total_req/Per | Hit | Ping-pong | Frag
      ---------------------------------------------------------------------------------------------
       kthread_create_on_node+5c       |           64/64 |         28/28 |   1 |         0 | 56.250%
       bcm2835_dma_create_cb_chain+54  |         832/277 |       560/186 |   3 |         3 | 32.692%
       alloc_worker+30                 |         128/128 |         88/88 |   1 |         0 | 31.250%
       alloc_skb_with_frags+58         |         512/512 |       384/384 |   1 |         0 | 25.000%
       ...

      SUMMARY (SLAB allocator)
      ========================
      Total bytes requested: 419,528
      Total bytes allocated: 420,216
      Total bytes wasted on internal fragmentation: 688
      Internal fragmentation: 0.163725%
      Cross CPU allocations: 0/326
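The Frag column and the summary percentage in the perf kmem output are consistent with a simple ratio: bytes allocated but not requested, divided by bytes allocated. The short Python check below reproduces the figures shown above; the function name internal_fragmentation is made up, and the formula is inferred from the numbers rather than taken from the perf source.

```python
def internal_fragmentation(requested, allocated):
    """Percentage of allocated bytes that the caller did not ask for."""
    return 100.0 * (allocated - requested) / allocated
```

For example, kthread_create_on_node requested 28 bytes and received a 64-byte object, so 36 of 64 bytes (56.25%) are wasted inside the allocation.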
  40. ftrace
      • Useful for event tracing and for analyzing latencies and performance issues
      • The proc sysctl ftrace_enabled is a big on/off switch. It is enabled by default
        – To disable: echo 0 > /proc/sys/kernel/ftrace_enabled
      • Summary of /sys/kernel/debug/tracing

      Filename              Description
      current_tracer        Set or display the current tracer that is configured
      available_tracers     Tracers listed here can be configured by echoing their name into current_tracer
      tracing_on            Enables or disables writing to the ring buffer (tracing overhead may still be occurring)
      trace                 Output of the trace in a human-readable format
      tracing_max_latency   Some of the tracers record the max latency, for example the time interrupts are disabled
      tracing_thresh        Latency tracers will record a trace whenever the latency is greater than the number (in microseconds) in this file
      set_ftrace_pid        Have the function tracer only trace a single thread
      set_graph_function    Set a "trigger" function where tracing should start with the function graph tracer
      stack_trace           The stack backtrace of the largest stack that was encountered when the stack tracer is activated
      trace_marker          Very useful for synchronizing user space with events happening in the kernel. Strings written into this file are written into the ftrace buffer
  41. List of tracers

      Name of tracer    Description
      function          Function call tracer to trace all kernel functions
      function_graph    Traces both entry and exit of the functions. It then provides the ability to draw a graph of function calls like C source code
      irqsoff           Traces the areas that disable interrupts and saves the trace with the longest max latency. See tracing_max_latency
      preemptoff        Traces and records the amount of time for which preemption is disabled
      preemptirqsoff    Traces and records the largest time for which irqs and/or preemption is disabled
      wakeup            Traces and records the max latency that it takes for the highest priority task to get scheduled after it has been woken up
      wakeup_rt         Traces and records the max latency that it takes for just RT tasks
      nop               To remove all tracers from tracing, simply echo "nop" into current_tracer
  42. Example of function tracer

      # echo SyS_nanosleep hrtimer_interrupt > set_ftrace_filter
      # echo function > current_tracer
      # echo 1 > tracing_on
      # usleep 1
      # echo 0 > tracing_on
      # cat trace
      # tracer: function
      #
      # entries-in-buffer/entries-written: 5/5   #P:4
      #
      #                              _-----=> irqs-off
      #                             / _----=> need-resched
      #                            | / _---=> hardirq/softirq
      #                            || / _--=> preempt-depth
      #                            ||| /     delay
      #           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
      #              | |       |   ||||       |         |
                usleep-2665  [001] ....  4186.475355: sys_nanosleep <-system_call_fastpath
                <idle>-0     [001] d.h1  4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
                usleep-2665  [001] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
                <idle>-0     [003] d.h1  4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt
                <idle>-0     [002] d.h1  4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt

      Note: the function tracer uses ring buffers to store entries, so the newest data may overwrite the oldest data. Sometimes using echo to stop the trace is not sufficient, because the tracing could have overwritten the data that you wanted to record. For this reason, it is sometimes better to disable tracing directly from a program.
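Trace output in this layout is easy to post-process. The following Python sketch parses function tracer lines and decodes the four flag columns named in the header (irqs-off, need-resched, hardirq/softirq, preempt-depth). The regular expression and the helper names are assumptions fitted to the sample above; other kernel versions may print additional columns.

```python
import re
from collections import Counter

# One entry, e.g.:
#   <idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt
LINE = re.compile(
    r"^\s*(?P<task>.+)-(?P<pid>\d+)\s+\[(?P<cpu>\d+)\]\s+(?P<flags>\S+)\s+"
    r"(?P<ts>\d+\.\d+):\s+(?P<func>\S+)\s+<-(?P<caller>\S+)"
)

CONTEXT = {"h": "hardirq", "s": "softirq"}


def decode_flags(flags):
    """Decode the first four flag columns described in the trace header."""
    flags = flags.ljust(4, ".")
    return {
        "irqs_off": flags[0] == "d",
        "need_resched": flags[1] != ".",
        "context": CONTEXT.get(flags[2], "task"),
        "preempt_depth": 0 if flags[3] == "." else int(flags[3], 16),
    }


def parse_trace(text):
    """Return one dict per trace entry, skipping '#' header lines."""
    entries = []
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue
        m = LINE.match(line)
        if m is None:
            continue
        entry = m.groupdict()
        entry["pid"] = int(entry["pid"])
        entry["cpu"] = int(entry["cpu"])
        entry["ts"] = float(entry["ts"])
        entry["flags"] = decode_flags(entry["flags"])
        entries.append(entry)
    return entries


def calls_per_function(entries):
    """Count how often each traced function appears."""
    return Counter(e["func"] for e in entries)
```

Run over the five entries above, calls_per_function reports one sys_nanosleep call and four hrtimer_interrupt calls, and every hrtimer_interrupt entry decodes as hardirq context with interrupts off.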
  43. Example of function-graph tracer
      • This tracer can also measure the execution time of a function
      • To trace only one function and all of its children:

      # echo __do_fault > set_graph_function
      # echo function_graph > current_tracer
      # echo 1 > tracing_on
      # usleep 1
      # echo 0 > tracing_on
      # cat trace
      #
      # tracer: function_graph
      #
      # CPU  DURATION            FUNCTION CALLS
      # |     |   |               |   |   |   |
       0)               |  __do_fault() {
       0)               |    filemap_fault() {
       0)   0.408 us    |      find_get_page();
       0)   0.085 us    |      _cond_resched();
       0)   2.462 us    |    }
       0)   0.087 us    |    _raw_spin_lock();
       0)   0.104 us    |    add_mm_counter_fast();
       0)   0.106 us    |    page_add_file_rmap();
       0)   0.090 us    |    _raw_spin_unlock();
       0)               |    unlock_page() {
       0)   0.103 us    |      page_waitqueue();
       0)   0.146 us    |      __wake_up_bit();
       0)   1.508 us    |    }
       0)   8.403 us    |  }
  44. Example of irqsoff tracer

      # echo 0 > options/function-trace
      # echo irqsoff > current_tracer
      # echo 1 > tracing_on
      # echo 0 > tracing_max_latency
      # ls -ltr
      [...]
      # echo 0 > tracing_on
      # cat trace
      # tracer: irqsoff
      #
      # irqsoff latency trace v1.1.5 on 3.8.0-test+
      # --------------------------------------------------------------------
      # latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4)
      #    -----------------
      #    | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0)
      #    -----------------
      #  => started at: run_timer_softirq
      #  => ended at:   run_timer_softirq
      #
      #
      #                  _------=> CPU#
      #                 / _-----=> irqs-off
      #                | / _----=> need-resched
      #                || / _---=> hardirq/softirq
      #                ||| / _--=> preempt-depth
      #                |||| /     delay
      #  cmd     pid   ||||| time  |   caller
      #     \   /      |||||  \    |   /
        <idle>-0       0d.s2    0us+: _raw_spin_lock_irq <-run_timer_softirq
        <idle>-0       0dNs3   17us : _raw_spin_unlock_irq <-run_timer_softirq
        <idle>-0       0dNs3   17us+: trace_hardirqs_on <-run_timer_softirq
        <idle>-0       0dNs3   25us : <stack trace>
       => _raw_spin_unlock_irq
       => run_timer_softirq
       => __do_softirq
       ...

      Note: the above example had function-trace not set. If we set function-trace, we get a much larger output.
  45. Example of stack tracer
      • ftrace makes it convenient to check the stack size at every function call

      # echo 1 > /proc/sys/kernel/stack_tracer_enabled

      After running it for a few minutes, the output looks like:

      # cat stack_max_size
      2928
      # cat stack_trace
              Depth    Size   Location    (18 entries)
              -----    ----   --------
        0)     2928     224   update_sd_lb_stats+0xbc/0x4ac
        1)     2704     160   find_busiest_group+0x31/0x1f1
        2)     2544     256   load_balance+0xd9/0x662
        3)     2288      80   idle_balance+0xbb/0x130
        4)     2208     128   __schedule+0x26e/0x5b9
        5)     2080      16   schedule+0x64/0x66
        6)     2064     128   schedule_timeout+0x34/0xe0
        7)     1936     112   wait_for_common+0x97/0xf1
        8)     1824      16   wait_for_completion+0x1d/0x1f
        9)     1808     128   flush_work+0xfe/0x119
       10)     1680      16   tty_flush_to_ldisc+0x1e/0x20
       11)     1664      48   input_available_p+0x1d/0x5c
       12)     1616      48   n_tty_poll+0x6d/0x134
       13)     1568      64   tty_poll+0x64/0x7f
       14)     1504     880   do_select+0x31e/0x511
       15)      624     400   core_sys_select+0x177/0x216
       16)      224      96   sys_select+0x91/0xb9
       17)      128     128   system_call_fastpath+0x16/0x1b
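In the stack_trace listing, Depth is the running total of stack use from the outermost frame up to that entry, and Size is the contribution of that frame alone. A small Python sketch (helper names invented for this example) that parses the listing, checks that relationship, and picks out the heaviest frame:

```python
import re

ROW = re.compile(r"^\s*(\d+)\)\s+(\d+)\s+(\d+)\s+(\S+)")


def parse_stack_trace(text):
    """Return a list of (depth, size, location) tuples, innermost frame first."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append((int(m.group(2)), int(m.group(3)), m.group(4)))
    return rows


def is_consistent(rows):
    """Each depth must equal the next row's depth plus this row's own size."""
    depths = [depth for depth, _, _ in rows] + [0]
    return all(depths[i] - depths[i + 1] == rows[i][1] for i in range(len(rows)))


def largest_frame(rows):
    """Return (location, size) of the frame using the most stack."""
    depth, size, location = max(rows, key=lambda row: row[1])
    return location, size
```

For the listing above, the heaviest frame is do_select at 880 bytes, which is the first place to look when trimming stack usage on this call path.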
  46. ftrace homework
      • Read https://www.kernel.org/doc/Documentation/trace/events.txt
        This document is about event tracing (static tracepoints)
      • perf-tools is a collection of performance analysis tools for Linux ftrace and perf_events. Try to find a good use for it in your work. You can download it from https://github.com/brendangregg/perf-tools.git
      • Write a small program using ftrace to track the number of context switches per second for each CPU.

      $ sudo ./ftrace_ctxt_switches.py
      ...
      Duration (sec): 61.386, Context switches (per sec):
      CPU0: 1130 ( 18) CPU1: 5875 ( 96) CPU2: 183 ( 3) CPU3: 230 ( 4)
      Duration (sec): 63.784, Context switches (per sec):
      CPU0: 1138 ( 18) CPU1: 6028 ( 95) CPU2: 188 ( 3) CPU3: 230 ( 4)
      ...
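One possible starting point for the counting part of the last exercise is sketched below. It does not configure ftrace; it only counts sched_switch events per CPU in trace text that has already been captured and converts the counts to rates. The event line shape in the comment is an assumption modelled on the function tracer output of slide 42, and all names are hypothetical.

```python
import re
from collections import Counter

# Assumed event line shape (CPU in brackets, flags, timestamp, event name):
#   bash-1234 [002] d... 100.000100: sched_switch: prev_comm=bash ...
SWITCH = re.compile(r"\[(\d+)\]\s+\S+\s+(\d+\.\d+):\s+sched_switch:")


def count_switches(lines):
    """Return (per-CPU event counts, seconds between first and last event)."""
    counts = Counter()
    first = last = None
    for line in lines:
        m = SWITCH.search(line)
        if m is None:
            continue
        counts[int(m.group(1))] += 1
        ts = float(m.group(2))
        if first is None:
            first = ts
        last = ts
    duration = 0.0 if first is None else last - first
    return counts, duration


def per_second(counts, duration):
    """Convert per-CPU counts to rounded events per second."""
    if duration <= 0:
        return {cpu: 0 for cpu in counts}
    return {cpu: round(n / duration) for cpu, n in counts.items()}
```

The rounded rates match the sample report above: 1130 switches over 61.386 seconds is about 18 per second for CPU0, and 5875 is about 96 per second for CPU1.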
  47. References
      • Linux Device Drivers, 3rd Edition, Jonathan Corbet
      • Linux kernel source, http://lxr.free-electrons.com
      • Choose a Linux tracer, Brendan Gregg
        – http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html
      • KDB and KGDB kernel documentation
        – http://kernel.org/pub/linux/kernel/people/jwessel/kdb/
