The Linux 4.x series introduced a powerful programmable tracing engine (BPF) that makes it possible to look inside the kernel at runtime. This talk covers these Linux “superpowers” and shows how to exploit them to measure and trace complex events, debug problems, and identify performance bottlenecks in a complex environment such as a cloud. For example, we will see how to measure the latency distribution of filesystem I/O, examine the details of storage device operations (such as individual block I/O request timeouts) or TCP buffer allocations, investigate the stack traces of specific events, identify memory leaks, find performance bottlenecks, and a whole lot more.
Linux kernel tracing superpowers in the cloud
1. Linux kernel tracing superpowers in the cloud
Andrea Righi
andrea@betterservers.com
@arighi
2. Who am I?
● Andrea Righi
● Performance engineer @ BetterServers.com
● My main activities:
  ● Linux kernel stuff
  ● Virtualization
  ● Storage
  ● Cloud computing
10. CPU sampling vs tracing
● Sampling
  ● Create a periodic timer interrupt that collects the current program counter, function address and the entire stack backtrace
● Tracing
  ● Record the times and invocations of specific events
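The two approaches above can be contrasted in plain userspace Python (the kernel does the equivalent with timer interrupts and tracepoints). This is an illustrative sketch only; `busy` and all other names are invented for the demo:

```python
import sys
import threading
import time
from collections import Counter

def busy(n):
    # Stand-in for a CPU-bound workload we want to observe.
    s = 0
    for i in range(n):
        s += i * i
    return s

# --- Tracing: record every invocation of a specific event ---
trace_log = []
def tracer(frame, event, arg):
    if event == "call" and frame.f_code.co_name == "busy":
        trace_log.append(time.monotonic())  # timestamp of each hit

sys.setprofile(tracer)
for _ in range(5):
    busy(10_000)
sys.setprofile(None)

# --- Sampling: periodically grab the current stack of the main thread ---
samples = Counter()
stop = threading.Event()
main_id = threading.get_ident()

def sampler():
    while not stop.is_set():
        frame = sys._current_frames().get(main_id)
        if frame is not None:
            samples[frame.f_code.co_name] += 1  # "program counter" proxy
        time.sleep(0.001)  # the periodic "interrupt"

t = threading.Thread(target=sampler)
t.start()
busy(5_000_000)
stop.set()
t.join()

print(len(trace_log))          # 5: tracing saw every single call
print(samples.most_common(1))  # sampling sees a statistical picture,
                               # typically dominated by busy()
```

Tracing gives an exact event log (at a per-event cost); sampling gives a statistical population at a fixed, bounded overhead.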
11. Generic performance analysis tools
● uptime → system uptime and load averages
● top → overall system stats
● vmstat 1 → system/memory stats over time
● mpstat -P ALL 1 → CPU load balancing
● pidstat 1 → per-process usage
● iostat -kxd 1 → disk I/O
● free -m → memory usage
● sar -n DEV 1 → network I/O
● dmesg | tail → last kernel messages
13. perf
● perf is a powerful multi-tool and profiler:
  ● interval sampling
  ● CPU performance counter events
  ● user + kernel sampling and tracing
  ● event filtering
● perf top → best tool to get an idea of what’s going on in the system
14. Visualizing traces: flame graphs
● CPU flame graphs
  ● x-axis: sample population
  ● y-axis: stack depth
● Wider boxes = more samples = more time spent on CPU
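The aggregation step behind a flame graph can be sketched in a few lines: collapse each sampled stack into a `frame1;frame2;... count` line, as the stackcollapse scripts do before flamegraph.pl draws the boxes. The sampled stacks here are made up for illustration:

```python
from collections import Counter

# Pretend these stacks were captured by a periodic sampler (leaf frame last).
sampled_stacks = [
    ["main", "parse", "read_file"],
    ["main", "parse", "read_file"],
    ["main", "parse"],
    ["main", "render"],
]

# Fold each stack into a single "a;b;c" key and count occurrences.
folded = Counter(";".join(stack) for stack in sampled_stacks)
for stack, count in sorted(folded.items()):
    print(stack, count)
```

The wider a box in the final flame graph, the larger its count in this folded output.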
16. strace
● strace(1): the system call tracer in Linux
● It uses the ptrace() system call, which pauses the target process at each syscall so that the debugger can read its state
● And it does this twice: when the syscall begins and when it ends!
17. strace overhead
### Regular execution ###
righiandr@Dell:~$ dd if=/dev/zero of=/dev/null bs=1 count=500k
512000+0 records in
512000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 0,201641 s, 2,5 MB/s
### Strace execution (tracing a syscall that is never called) ###
righiandr@Dell:~$ strace -eaccept dd if=/dev/zero of=/dev/null bs=1 count=500k
512000+0 records in
512000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 11,7989 s, 43,4 kB/s
+++ exited with 0 +++
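A quick back-of-the-envelope check of the slowdown implied by the numbers above (same dd workload, ~0.20 s natively vs ~11.8 s under strace, even though the filtered syscall is never called):

```python
plain = 0.201641   # seconds, regular execution
traced = 11.7989   # seconds, under strace -eaccept

# The throughput figures (2.5 MB/s vs 43.4 kB/s) tell the same story.
print(round(traced / plain))  # ≈ 59x slower
```

That roughly 60x penalty comes entirely from the two ptrace() stops per syscall, which is why strace is unsuitable for production tracing.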
18. Tracepoint
● A tracepoint is special code statically placed in your program (the programmer decides where to put it)
● Anyone who wants to see when the tracepoint is hit, and extract data from it, can “enable” or “activate” the tracepoint through a specific interface
● Two elements are required:
  ● Tracepoint definition (placed in a header file)
  ● Tracepoint statement (in C code)
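The two elements above (a definition plus a statement embedded in the code path, nearly free until enabled) can be sketched as a userspace analogy. Kernel tracepoints use TRACE_EVENT() in a header plus a trace_*() call in C; everything below, including the names `Tracepoint` and `do_io`, is invented for illustration:

```python
class Tracepoint:
    """The 'tracepoint definition': a named hook, disabled by default."""
    def __init__(self, name):
        self.name = name
        self.handler = None

    def enable(self, handler):
        self.handler = handler

    def disable(self):
        self.handler = None

    def hit(self, **data):
        """The 'tracepoint statement': placed where the event happens."""
        if self.handler is not None:  # nearly free when disabled
            self.handler(self.name, data)

trace_io_start = Tracepoint("io_start")

def do_io(nbytes):
    trace_io_start.hit(bytes=nbytes)  # statement embedded in the code path
    return b"\0" * nbytes

events = []
trace_io_start.enable(lambda name, data: events.append((name, data)))
do_io(512)
trace_io_start.disable()
do_io(128)     # tracepoint still present, but inactive
print(events)  # [('io_start', {'bytes': 512})]
```

Only the enabled hit is recorded; the disabled hit costs a single pointer check, which is the property that makes static tracepoints cheap enough to ship in production kernels.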
20. Kprobes (kernel probes)
● Trap almost any kernel code address, specifying a handler routine to be invoked when the breakpoint is hit
● How does it work?
  ● Make a copy of the probed instruction and replace the original with a breakpoint instruction (int3 on x86)
  ● When the breakpoint is hit, a trap occurs, the CPU registers are saved and control passes to the Kprobes pre-handler
  ● The saved instruction is executed in single-step mode
  ● The Kprobes post-handler is executed
  ● The rest of the original function is executed
● The same mechanism can be applied to user space: uprobes
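The kprobe control flow (pre-handler, original code, post-handler) can be mimicked in userspace with a decorator. This is only a conceptual analogy; real kprobes patch an int3 breakpoint over the live instruction, and `vfs_read_like` is an invented stand-in for a probed kernel function:

```python
import functools

def probe(pre=None, post=None):
    """Wrap a function with kprobe-style pre/post handlers."""
    def wrap(fn):
        @functools.wraps(fn)
        def probed(*args, **kwargs):
            if pre:
                pre(fn.__name__, args)     # kprobe pre-handler
            ret = fn(*args, **kwargs)      # the single-stepped original code
            if post:
                post(fn.__name__, ret)     # kprobe post-handler
            return ret
        return probed
    return wrap

log = []

@probe(pre=lambda name, args: log.append(("enter", name, args)),
       post=lambda name, ret: log.append(("exit", name, ret)))
def vfs_read_like(nbytes):
    return nbytes  # stands in for the probed function's real work

vfs_read_like(4096)
print(log)  # [('enter', 'vfs_read_like', (4096,)), ('exit', 'vfs_read_like', 4096)]
```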
26. eBPF: definition
● eBPF: a highly efficient virtual machine that lives in the kernel
● Ingo Molnar described eBPF as:
  “One of the more interesting features in this cycle is the ability to attach eBPF programs (user-defined, sandboxed bytecode executed by the kernel) to kprobes. This allows user-defined instrumentation on a live kernel image that can never crash, hang or interfere with the kernel negatively”
27. eBPF history
● Initially it was BPF: the Berkeley Packet Filter
  ● It has its roots in BSD in the very early 1990s
  ● Originally designed as a mechanism for fast filtering of network packets
  ● First used in Linux by tcpdump, to implement the filtering engine behind its complex command-line syntax
● Linux introduced eBPF, the extended Berkeley Packet Filter (3.18, December 2014)
  ● More efficient and more generic than the original BPF
● Kernel 4.9: eBPF programs can be attached to perf_events
  ● Timed samples can now run BPF programs!
28. eBPF as a VM
● Example assembly of a simple BPF filter:
  ● Load the 16-bit quantity at offset 12 in the packet into the accumulator (the Ethernet type)
  ● Compare the value to see if the packet is an IP packet
  ● If the packet is IP, return TRUE (packet is accepted)
  ● Otherwise return 0 (packet is rejected)
● Only 4 VM instructions to filter IP packets!
ldh [12]
jeq #ETHERTYPE_IP, l1, l2
l1: ret #TRUE
l2: ret #0
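What those 4 instructions do can be replayed with a toy interpreter, a minimal sketch that hard-codes this one program rather than implementing the full BPF instruction set:

```python
ETHERTYPE_IP = 0x0800  # EtherType for IPv4
TRUE = 0xFFFF          # classic BPF returns the number of bytes to accept

def run_filter(packet: bytes) -> int:
    # ldh [12]: load the 16-bit big-endian halfword at offset 12 (EtherType)
    acc = int.from_bytes(packet[12:14], "big")
    # jeq #ETHERTYPE_IP, l1, l2
    if acc == ETHERTYPE_IP:
        return TRUE    # l1: ret #TRUE -> packet accepted
    return 0           # l2: ret #0    -> packet rejected

# Minimal Ethernet headers: 6-byte dst MAC, 6-byte src MAC, 2-byte EtherType.
ip_frame  = b"\xff" * 6 + b"\x00" * 6 + b"\x08\x00"   # IPv4
arp_frame = b"\xff" * 6 + b"\x00" * 6 + b"\x08\x06"   # ARP

print(run_filter(ip_frame) != 0)   # True: IPv4 frame accepted
print(run_filter(arp_frame) != 0)  # False: ARP frame rejected
```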
29. eBPF context
● eBPF is not tied to any particular context:
  ● packet filtering: the context is a packet
  ● tracing: the context is a snapshot of the processor registers when the tracepoint is hit
● JIT:
  ● every BPF instruction is mapped to an x86 instruction sequence
  ● the accumulator and index registers are stored directly in processor registers
  ● the program is placed in vmalloc() space and executed directly when a context is processed
30. How to write an eBPF filter
● A filter can be written in C
● There is a GCC backend as well as an LLVM backend
● The compiler generates eBPF bytecode, which resides in an ELF file
● Load the program into the kernel using the bpf() syscall
/*
 * tracing filter example to print events
 * for the loopback device only, if attached to
 * netif_receive_skb()
 */
#include <linux/skbuff.h>
#include <linux/netdevice.h>
#include <linux/bpf.h>
#include <trace/bpf_trace.h>

void filter(struct bpf_context *ctx)
{
	char devname[4] = "lo";
	struct net_device *dev;
	struct sk_buff *skb = NULL;

	skb = (struct sk_buff *)ctx->regs.si;
	dev = bpf_load_pointer(&skb->dev);
	if (bpf_memcmp(dev->name, devname, 2) == 0) {
		char fmt[] = "skb %p dev %p\n";
		bpf_trace_printk(fmt, sizeof(fmt),
				 (long)skb, (long)dev, 0);
	}
}
33. Parasite thread injection
● The concept of parasite thread injection was introduced in Linux 3.4 (via PTRACE_SEIZE)
● Attach to the target pid without stopping it, becoming a “parasite” thread of that pid
● Original goal: freeze and restore TCP connections during checkpoint/restart
● Example:
  ● python-pyrasite: injecting code into running Python programs
34. References
● Brendan Gregg’s blog: http://brendangregg.com/blog/
● BCC tools: https://github.com/iovisor/bcc
● Perf-tools: https://github.com/brendangregg/perf-tools
● Perf-labs: https://github.com/brendangregg/perf-labs
● Linux documentation:
  ● http://lxr.linux.no/linux/Documentation/trace
  ● http://lxr.linux.no/linux/Documentation/kprobes.txt
● The BSD Packet Filter: A New Architecture for User-level Packet Capture, S. McCanne and V. Jacobson: http://www.tcpdump.org/papers/bpf-usenix93.pdf
● Linux Weekly News: http://lwn.net