Debugging Linux issues with eBPF
One incident from start to finish with dynamic tracing applied
Ivan Babrou
Performance @ Cloudflare
What does Cloudflare do
● CDN: moving content physically closer to visitors with our CDN. Intelligent caching, unlimited DDoS mitigation, unlimited bandwidth at flat pricing with free plans, edge access control, IPFS gateway, Onion service.
● Website Optimization: making the web fast and up to date for everyone. TLS 1.3 (with 0-RTT), HTTP/2 + QUIC, server push, AMP, origin load-balancing, smart routing, Serverless / Edge Workers, post-quantum crypto.
● DNS: Cloudflare is the fastest managed DNS provider in the world. 1.1.1.1, 2606:4700:4700::1111, DNS over TLS.
160+ data centers globally
4.5M+ DNS requests/s across authoritative, recursive and internal
10% of Internet requests every day
10M+ HTTP requests/second
10M+ websites, apps & APIs in 150 countries
20Tbps network capacity across Cloudflare’s anycast network
350B+ DNS requests/day across authoritative, recursive and internal
800B+ HTTP requests/day
1.73Ebpd network capacity across Cloudflare’s anycast network (daily ironic numbers)
Link to slides with speaker notes
Slideshare doesn’t allow links on the first 3 slides
Cloudflare is a Debian shop
● All machines were running Debian Jessie on bare metal
● OS boots over PXE into memory, packages and configs are ephemeral
● Kernel can be swapped as easy as OS
● New Stable (stretch) came out, we wanted to keep up
● Very easy to upgrade:
○ Build all packages for both distributions
○ Upgrade machines in groups, look at metrics, fix issues, repeat
○ Gradually phase out Jessie
○ Pop a bottle of champagne and celebrate
Cloudflare core Kafka platform at the time
● Kafka is a distributed log with multiple producers and consumers
● 3 clusters: 2 small (dns + logs) with 9 nodes, 1 big (http) with 106 nodes
● 2 x 10C Intel Xeon E5-2630 v4 @ 2.2GHz (40 logical CPUs), 128GB RAM
● 12 x 800GB SSD in RAID0
● 2 x 10G bonded NIC
● Mostly network bound at ~100Gbps ingress and ~700Gbps egress
● Check out our blog post on Kafka compression
● We also blogged about our Gen 9 edge machines recently
Small clusters went OK, the big one did not
[Graph: one node upgraded to Stretch shows roughly 5x the system CPU usage of its peers]
Perf to the rescue: “perf top -F 99”
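For reference, a fuller invocation we might use (a sketch; everything beyond -F 99 is an assumption, not from the slide):

$ sudo perf top -F 99 -g --sort symbol
# -F 99 samples at 99 Hz to avoid lockstep with the timer tick
# -g collects call graphs so parent frames show up, not just the hottest leaf functions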
RCU stalls in dmesg
[ 4923.462841] INFO: rcu_sched self-detected stall on CPU
[ 4923.462843] 13-...: (2 GPs behind) idle=ea7/140000000000001/0 softirq=1/2 fqs=4198
[ 4923.462845] (t=8403 jiffies g=110722 c=110721 q=6440)
Error logging issues
Aug 15 21:51:35 myhost kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Aug 15 21:51:35 myhost kernel: 26-...: (1881 ticks this GP) idle=76f/140000000000000/0
softirq=8/8 fqs=365
Aug 15 21:51:35 myhost kernel: (detected by 0, t=2102 jiffies, g=1837293, c=1837292, q=262)
Aug 15 21:51:35 myhost kernel: Task dump for CPU 26:
Aug 15 21:51:35 myhost kernel: java R running task 13488 1714 1513 0x00080188
Aug 15 21:51:35 myhost kernel: ffffc9000d1f7898 ffffffff814ee977 ffff88103f410400 000000000000000a
Aug 15 21:51:35 myhost kernel: 0000000000000041 ffffffff82203142 ffffc9000d1f78c0 ffffffff814eea10
Aug 15 21:51:35 myhost kernel: 0000000000000041 ffffffff82203142 ffff88103f410400 ffffc9000d1f7920
Aug 15 21:51:35 myhost kernel: Call Trace:
Aug 15 21:51:35 myhost kernel: [<ffffffff814ee977>] ? scrup+0x147/0x160
Aug 15 21:51:35 myhost kernel: [<ffffffff814eea10>] ? lf+0x80/0x90
Aug 15 21:51:35 myhost kernel: [<ffffffff814eecb5>] ? vt_console_print+0x295/0x3c0
Page allocation failures
Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages
Aug 16 01:14:51 myhost kernel: [<ffffffff81171754>] shrink_inactive_list+0x1f4/0x4f0
Aug 16 01:14:51 myhost kernel: [<ffffffff8117234b>] shrink_node_memcg+0x5bb/0x780
Aug 16 01:14:51 myhost kernel: [<ffffffff811725e2>] shrink_node+0xd2/0x2f0
Aug 16 01:14:51 myhost kernel: [<ffffffff811728ef>] do_try_to_free_pages+0xef/0x310
Aug 16 01:14:51 myhost kernel: [<ffffffff81172be5>] try_to_free_pages+0xd5/0x180
Aug 16 01:14:51 myhost kernel: [<ffffffff811632db>] __alloc_pages_slowpath+0x31b/0xb80
...
[78991.546088] systemd-network: page allocation stalls for 287000ms, order:0,
mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
Downgrade and investigate
● System CPU was up, so it must be the kernel upgrade
● Downgrade Stretch to Jessie
● Downgrade Linux 4.9 to 4.4 (known good, but no allocation stall logging)
● Investigate without affecting customers
● Bisection pointed at OS upgrade, kernel was not responsible
Make a flamegraph with perf
#!/bin/sh -e
# flamegraph-perf [perf args here] > flamegraph.svg
# Explicitly setting output and input to perf.data is needed to make perf work over ssh without TTY.
perf record -o perf.data "$@"
# Fetch JVM stack maps if possible, this requires -XX:+PreserveFramePointer
export JAVA_HOME=/usr/lib/jvm/oracle-java8-jdk-amd64 AGENT_HOME=/usr/local/perf-map-agent
/usr/local/flamegraph/jmaps 1>&2
IDLE_REGEXPS="^swapper;.*(cpuidle|cpu_idle|cpu_bringup_and_idle|native_safe_halt|xen_hypercall_sched_op|xen_hypercall_vcpu_op)"
perf script -i perf.data | /usr/local/flamegraph/stackcollapse-perf.pl --all | grep -E -v "$IDLE_REGEXPS" |
  /usr/local/flamegraph/flamegraph.pl --colors=java --hash --title="$(hostname)"
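Example usage of the wrapper above (a sketch; the sampling options are illustrative and not taken from the slides):

$ flamegraph-perf -a -g -F 99 -- sleep 30 > flamegraph.svg
# -a samples all CPUs, -g captures stacks, -F 99 samples at 99 Hz, for 30 seconds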
Full system flamegraphs point at sendfile
[Flamegraphs: Jessie vs Stretch; the sendfile stacks are far wider on Stretch]
Enhance
Stretch sendfile flamegraph: spinlocks at the tip
eBPF and BCC tools
Latency of sendfile on Jessie: < 31us
$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:25
usecs : count distribution
0 -> 1 : 9 | |
2 -> 3 : 47 |**** |
4 -> 7 : 53 |***** |
8 -> 15 : 379 |****************************************|
16 -> 31 : 329 |********************************** |
32 -> 63 : 101 |********** |
64 -> 127 : 23 |** |
128 -> 255 : 50 |***** |
256 -> 511 : 7 | |
Latency of sendfile on Stretch: < 511us
usecs : count distribution
0 -> 1 : 1 | |
2 -> 3 : 20 |*** |
4 -> 7 : 46 |******* |
8 -> 15 : 56 |******** |
16 -> 31 : 65 |********** |
32 -> 63 : 75 |*********** |
64 -> 127 : 75 |*********** |
128 -> 255 : 258 |****************************************|
256 -> 511 : 144 |********************** |
512 -> 1023 : 24 |*** |
1024 -> 2047 : 27 |**** |
2048 -> 4095 : 28 |**** |
4096 -> 8191 : 35 |***** |
Number of mod_timer runs
# Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1
mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C
to end.
00:33:36
FUNC COUNT
mod_timer 60482
00:33:37
FUNC COUNT
mod_timer 58263
00:33:38
FUNC COUNT
mod_timer 54626
# Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1
mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C
to end.
00:33:28
FUNC COUNT
mod_timer 149068
00:33:29
FUNC COUNT
mod_timer 155994
00:33:30
FUNC COUNT
mod_timer 160688
Number of lock_timer_base runs
# Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1
lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit
Ctrl-C to end.
00:32:36
FUNC COUNT
lock_timer_base 15962
00:32:37
FUNC COUNT
lock_timer_base 16261
00:32:38
FUNC COUNT
lock_timer_base 15806
# Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1
lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit
Ctrl-C to end.
00:32:32
FUNC COUNT
lock_timer_base 119189
00:32:33
FUNC COUNT
lock_timer_base 196895
00:32:34
FUNC COUNT
lock_timer_base 140085
We can trace timer tracepoints with perf
$ sudo perf list | fgrep timer:
timer:hrtimer_cancel [Tracepoint event]
timer:hrtimer_expire_entry [Tracepoint event]
timer:hrtimer_expire_exit [Tracepoint event]
timer:hrtimer_init [Tracepoint event]
timer:hrtimer_start [Tracepoint event]
timer:itimer_expire [Tracepoint event]
timer:itimer_state [Tracepoint event]
timer:tick_stop [Tracepoint event]
timer:timer_cancel [Tracepoint event]
timer:timer_expire_entry [Tracepoint event]
timer:timer_expire_exit [Tracepoint event]
timer:timer_init [Tracepoint event]
timer:timer_start [Tracepoint event]
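Before recording full events, a cheap sanity check is to just count them; a sketch with perf stat (not shown on the slides):

$ sudo perf stat -a -e timer:timer_start -- sleep 10
# counts timer:timer_start events system-wide for 10 seconds with minimal overhead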
Number of timers per function
# Jessie
$ sudo perf record -e timer:timer_start -p 23485 -- sleep 10 && \
  sudo perf script | sed 's/.*function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 54 times to write data ]
[ perf record: Captured and wrote 17.778 MB
perf.data (173520 samples) ]
2 clocksource_watchdog
5 cursor_timer_handler
2 dev_watchdog
10 garp_join_timer
2 ixgbe_service_timer
4769 tcp_delack_timer
171 tcp_keepalive_timer
168512 tcp_write_timer
# Stretch
$ sudo perf record -e timer:timer_start -p 3416 -- sleep 10 && \
  sudo perf script | sed 's/.*function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 671 times to write data ]
[ perf record: Captured and wrote 198.273 MB
perf.data (1988650 samples) ]
6 clocksource_watchdog
12 cursor_timer_handler
2 dev_watchdog
18 garp_join_timer
4 ixgbe_service_timer
4622 tcp_delack_timer
1 tcp_keepalive_timer
1983978 tcp_write_timer
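The same per-function breakdown can be done in-kernel without writing a large perf.data file; a sketch with BCC’s argdist (the probe spec is from memory and may need adjusting; it also groups by the raw function pointer rather than the resolved symbol name):

$ sudo /usr/share/bcc/tools/argdist -C 't:timer:timer_start():u64:args->function'
# -C counts occurrences of each distinct value of the tracepoint's function field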
Timer flamegraphs comparison
[Flamegraphs: Jessie vs Stretch; tcp_push_one stands out on Stretch]
Number of calls for hot functions
# Jessie
$ sudo /usr/share/bcc/tools/funccount -T -i 1
tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C
to end.
03:33:33
FUNC COUNT
tcp_sendmsg 21166
$ sudo /usr/share/bcc/tools/funccount -T -i 1
tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-
C to end.
03:37:14
FUNC COUNT
tcp_push_one 496
# Stretch
$ sudo /usr/share/bcc/tools/funccount -T -i 1
tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C
to end.
03:33:30
FUNC COUNT
tcp_sendmsg 53834
$ sudo /usr/share/bcc/tools/funccount -T -i 1
tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-
C to end.
03:37:10
FUNC COUNT
tcp_push_one 64483
Count stacks leading to tcp_push_one
$ sudo stackcount -i 10 tcp_push_one
Stacks for tcp_push_one (stackcount)
tcp_push_one
inet_sendpage
kernel_sendpage
sock_sendpage
pipe_to_sendpage
__splice_from_pipe
splice_from_pipe
generic_splice_sendpage
direct_splice_actor
splice_direct_to_actor
do_splice_direct
do_sendfile
sys_sendfile64
do_syscall_64
return_from_SYSCALL_64
4950
tcp_push_one
inet_sendmsg
sock_sendmsg
kernel_sendmsg
sock_no_sendpage
tcp_sendpage
inet_sendpage
kernel_sendpage
sock_sendpage
pipe_to_sendpage
__splice_from_pipe
splice_from_pipe
generic_splice_sendpage
...
return_from_SYSCALL_64
735110
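The diff on the next slide compares the most popular stack from each host; a sketch of how the inputs can be captured (the exact post-processing isn’t shown in the talk):

$ sudo stackcount -i 10 tcp_push_one > jessie.txt    # run on the Jessie node, Ctrl-C after one interval
$ sudo stackcount -i 10 tcp_push_one > stretch.txt   # run on the Stretch node, Ctrl-C after one interval
$ diff -u jessie.txt stretch.txt                     # then trim to the hottest stack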
Diff of the most popular stack
--- jessie.txt 2017-08-16 21:14:13.000000000 -0700
+++ stretch.txt 2017-08-16 21:14:20.000000000 -0700
@@ -1,4 +1,9 @@
tcp_push_one
+inet_sendmsg
+sock_sendmsg
+kernel_sendmsg
+sock_no_sendpage
+tcp_sendpage
inet_sendpage
kernel_sendpage
sock_sendpage
Let’s look at tcp_sendpage
int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size, int flags) {
ssize_t res;
if (!(sk->sk_route_caps & NETIF_F_SG) ||
!sk_check_csum_caps(sk))
return sock_no_sendpage(sk->sk_socket, page, offset, size,
flags);
lock_sock(sk);
tcp_rate_check_app_limited(sk); /* is sending application-limited? */
res = do_tcp_sendpages(sk, page, offset, size, flags);
release_sock(sk);
return res;
}
(Callouts on the slide: sock_no_sendpage is what we see on the stack; the NETIF_F_SG check is the segmentation offload / scatter-gather capability test.)
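A quick way to confirm which branch is taken (a sketch using the same funccount tool as earlier): count calls to the sock_no_sendpage fallback; with scatter-gather enabled on the socket’s route it should stay near zero.

$ sudo /usr/share/bcc/tools/funccount -T -i 1 sock_no_sendpage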
Cloudflare network setup
eth2 -->| |--> vlan10
|---> bond0 -->|
eth3 -->| |--> vlan100
Missing offload settings
eth2 -->| |--> vlan10
|---> bond0 -->|
eth3 -->| |--> vlan100
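To see where the offloads get lost along the chain, check each device (a sketch; device names match the diagram above):

$ for dev in eth2 eth3 bond0 vlan10 vlan100; do
    echo "== $dev"
    ethtool -k "$dev" | grep -E 'scatter-gather:|tcp-segmentation-offload:|tx-checksumming:'
  done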
Compare ethtool -k settings on vlan10
-tx-checksumming: off
+tx-checksumming: on
- tx-checksum-ip-generic: off
+ tx-checksum-ip-generic: on
-scatter-gather: off
- tx-scatter-gather: off
+scatter-gather: on
+ tx-scatter-gather: on
-tcp-segmentation-offload: off
- tx-tcp-segmentation: off [requested on]
- tx-tcp-ecn-segmentation: off [requested on]
- tx-tcp-mangleid-segmentation: off [requested on]
- tx-tcp6-segmentation: off [requested on]
-udp-fragmentation-offload: off [requested on]
-generic-segmentation-offload: off [requested on]
+tcp-segmentation-offload: on
+ tx-tcp-segmentation: on
+ tx-tcp-ecn-segmentation: on
+ tx-tcp-mangleid-segmentation: on
+ tx-tcp6-segmentation: on
+udp-fragmentation-offload: on
+generic-segmentation-offload: on
Ha! Easy fix, let’s just enable it:
$ sudo ethtool -K vlan10 sg on
Actual changes:
tx-checksumming: on
tx-checksum-ip-generic: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
udp-fragmentation-offload: on
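To confirm the effect, the same measurement from earlier can be repeated (a sketch); sendfile latency should drop back to the Jessie-like distribution, although, as the next slide notes, Kafka still had to be restarted.

$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile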
R in SRE stands for Reboot
Kafka restarted
It was a bug in systemd all along
Logs cluster effect
[Graph annotated with two events: Stretch upgrade, offload fixed]
DNS cluster effect
[Graph annotated with two events: Stretch upgrade, offload fixed]
Lessons learned
● It’s important to pay closer attention to seemingly unrelated metrics
● Linux kernel can be easily traced with perf and bcc tools
○ Tools work out of the box
○ You don’t have to be a developer
● TCP offload is incredibly important and applies to vlan interfaces
● Switching OS on reboot proved to be useful
But really it was just an excuse
● Internal blog post about this is from Aug 2017
● External blog post in Cloudflare blog is from May 2018
● All to show where ebpf_exporter can be useful
○ Our tool to export hidden kernel metrics with eBPF
○ Can trace any kernel function and hardware counters
○ IO latency histograms, timer counters, TCP retransmits, etc.
○ Exports data in Prometheus (OpenMetrics) format
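A sketch of how the exporter’s output is typically consumed (the port is ebpf_exporter’s conventional default and the metric prefix is an assumption; neither is taken from the slides):

$ curl -s http://localhost:9435/metrics | grep '^ebpf' | head
# scrapes the Prometheus-format endpoint and shows a few eBPF-derived metrics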
Can be nicely visualized with new Grafana
[Grafana screenshot: a disk upgrade in production]
Thank you
● Blog post this talk is based on
● Github for ebpf_exporter: https://github.com/cloudflare/ebpf_exporter
● Slides for ebpf_exporter talk with presenter notes (and a blog post)
○ Disclaimer: contains statistical dinosaur gifs
● Training on ebpf_exporter with Alexander Huynh
○ Look for “Hidden Linux Metrics with Prometheus eBPF Exporter”
○ Wednesday, Oct 31st, 11:45 - 12:30, Cumberland room 3-4
● We’re hiring
Ivan on twitter: @ibobrik
Editor's notes
  1. Hello. Today we’re going to go through one production issue from start to finish and see how we can apply dynamic tracing to get to the bottom of the problem.
  2. My name is Ivan and I work for a company called Cloudflare, where I focus on performance and efficiency of our products.
  3. To give you some context, these are some key areas Cloudflare specializes in. In addition to being a good old CDN service with free unlimited DDoS protection, we try to be at the front of innovation with technologies like TLS v1.3, QUIC and edge workers, making the internet faster and more secure for end users and website owners. We’re also the fastest authoritative and recursive DNS provider. Our resolver 1.1.1.1 is privacy oriented and supports things like DNS over TLS, stopping intermediaries from knowing your DNS requests, not to mention DNSSEC. If you have a website of any size, you should totally put this behind Cloudflare.
  4. Here are some numbers to give you an idea of the scale we operate on. We have 160 datacenters around the world and plan to grow to at least 200 next year. At peak these datacenters process more than 10 million HTTP requests per second. At the same time the very same datacenters serve 4.5 million DNS requests per second across internal and external DNS. That’s a lot of data to analyze and we collect logs into core datacenters for processing and analytics.
  5. I often get frustrated when people show numbers that are not scaled to seconds. I figured I cannot beat them, so I may as well just join them. Here you see numbers per day. My favorite one is network capacity, which is 1.73 exabytes per day. As you can see, these numbers make no sense. It gets even weirder when different metrics are scaled to different time units. Please don’t use this as a reference, always scale down to seconds.
  6. Now to set the scene for this talk specifically, it makes sense to say a little about our hardware and software stack. All machines serving traffic and doing backend analytics are bare metal servers running Debian; at that point in time we were running Jessie. We’re big fans of ephemeral stuff and not a single machine has an OS installed on persistent storage. Instead, we boot a minimal immutable initramfs from the network and install all packages and configs on top of that into ramfs with a configuration management system. This means that on reboot every machine is clean, and the OS and kernel can be swapped with just a reboot. The story starts with my personal desire to update Debian to the latest Stable release, which was Stretch at that time. Our plan for this upgrade was quite simple because of our setup. We can just build all necessary packages for both distributions, switch some group of machines to Stretch, fix what’s broken and carry on to the next group of machines. No need to wipe disks, reinstall anything or deal with dependency issues. We only needed to build one OS image, as opposed to one image per workload. On the edge every machine is the same, so that part was trivial. In core datacenters, where backend out-of-band processing happens, we have different machines doing different workloads, which means we have a more diverse set of metrics to look at, but we can also switch some groups completely faster.
  7. One such group was the set of our Kafka clusters. If you’re not familiar with Kafka, it’s basically a distributed log system. Multiple producers append messages to topics and then multiple consumers read those logs. For the most part we’re using it as a queue with a large on-disk buffer that buys us time to fix issues in consumers without losing data. We have three major clusters: DNS and Logs are small with just 9 nodes each, and HTTP is massive with 106 nodes. You can see the specs for the HTTP cluster at that time on the slides: 128GB of RAM and two Broadwell Xeon CPUs in a NUMA setup with 40 logical CPUs. We opted for 12 SSDs in RAID0 to prevent IO thrashing from consumers falling out of page cache. Disk level redundancy is absent in favor of larger usable disk space and higher throughput; we rely on 3x replication instead. In terms of network we had 2x10G NICs in a bonded setup for maximum network throughput. It was not intended to provide any redundancy. We used to have a lot of issues with being network bound, but in the end that was solved by aggressive compression with zstd. Funnily enough, we also opted to have 2x25G NICs, just because they are cheaper, even though we are not network bound anymore. Check out our blog post about Kafka compression or a recent one about Gen 9 edge servers if you want to learn more.
  8. So we did our upgrade on the small Kafka clusters and it went pretty well; at least nobody said anything and user-facing metrics looked good. If you were listening to talks yesterday, that’s what apparently should be alerted on, so no alerts fired. On the big HTTP cluster, however, we started seeing issues with consumers timing out and lagging, so we looked closer at the metrics we had. And this is what we saw: one upgraded node was using a lot more CPU than before, 5x more in fact. By itself this is not as big of an issue, you can see that we’re not stressing out the CPUs that much. Typical Kafka CPU usage before this upgrade was around 3 logical CPUs out of 40, which leaves a lot of room. Still, having 5x CPU usage was definitely an unexpected outcome. For control datapoints, we compared the problematic machine to another machine where no upgrade happened, and an intermediary node that received a full software stack upgrade on reboot, but not an OS upgrade, which we optimistically bundled with a minor kernel upgrade. Neither of these two nodes experienced the same CPU saturation issues, even though their setups were practically identical.
  9. For debugging CPU saturation issues, we depend on the linux perf command to find the cause. It’s included with the kernel, and on end user distributions you can install it with a package like linux-base or something. The first question that comes to mind when we see CPU saturation issues is what is using the CPU. In tools like top we can see which processes occupy the CPU, but with perf you can see which functions inside these processes sit on the CPU the most. This covers kernel and user space for well behaved programs that have a way to decode stacks. That includes C/C++ with frame pointers and Go. Here you can see top-like output from perf with the most expensive functions in terms of CPU time. Sorting is a bit confusing, because it sorts by inclusive time, but we’re mostly interested in the “self” column, which shows how often the very tip of the stack is on the CPU. In this case most of the time is taken by some spinlock slowpath. Spinlocks in the kernel exist to protect critical sections from concurrent access. There are two reasons to use them: the critical section is small and not contended, or the lock owner cannot sleep (interrupts cannot do that, for example). If a spinlock cannot be acquired, the caller burns CPU until it can get hold of the lock. While it may sound like a questionable idea at first, there are legitimate uses for this mechanism. In our situation it seems like a spinlock is really contended and half of the CPU cycles are not doing useful work. From this output, however, we don’t know which lock is causing this. There were also other symptoms, so let’s look at them first.
10. If anything bad happens in production, it’s always a good idea to have a look at dmesg. Messages there can be cryptic, but they can at least point you in the right direction; fixing an issue is 95% knowing where to find it. In this particular case we saw RCU stalls, where RCU stands for read-copy-update. I’m not exactly an expert in this, but it sounds like another synchronization mechanism, and it can be affected by the spinlocks we saw before. We had seen rare RCU stalls before, and our (suboptimal) solution was to reboot the machine if no other issues could be found; 99% of the time a reboot fixed the issue for a long time. However, one can only handle so many reboots before the problem becomes severe enough to warrant a deep dive. In this case we had other clues.
11. While looking deeper into dmesg, we noticed issues around writing messages to the console. This suggested that we were logging too many errors and that the actual failure might be earlier in the process. Armed with this knowledge, we looked at the very beginning of the message chain.
12. And this is what we saw. If you work with NUMA machines, you may immediately see “shrink_node” and have a minor PTSD episode. What you should be looking at is the number of missed kernel messages: there were so many errors that journald wasn’t able to keep up. We have console access to work around that, and that’s where we saw page allocation stalls in the second log excerpt. You don’t want your page allocations to stall for 5 minutes, especially when it’s an order-zero allocation, which is the smallest possible allocation of one 4KiB page.
13. Compared to our control nodes, the only two possible explanations were a minor kernel upgrade and the switch from Debian Jessie to Debian Stretch. We suspected the former, since system CPU usage points at a kernel issue. Just to be safe, we rolled the kernel back from 4.9 to a known-good 4.4 and downgraded the affected nodes back to Debian Jessie. This was a reasonable compromise, since we needed to minimize downtime on production nodes. Then we proceeded to look into the issue in isolation. To our surprise, after some bisecting we found that the OS upgrade alone was responsible for our issues; the kernel was off the hook. Now all that remained was to find out what exactly was going on.
  14. Flamegraphs are a great way to visualize stacks that cause CPU usage in the system. We have a wrapper around Brendan Gregg’s flamegraph scripts that removes idle time and enables JVM stacks out of the box. This gives us a way to get an overview of CPU usage in one command.
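As a rough sketch, the standard FlameGraph workflow our wrapper builds on looks something like this (the sample rate, duration, and script paths are illustrative; our wrapper additionally strips idle time and handles JVM stacks):
    perf record -F 99 -a -g -- sleep 30                                      # sample all CPUs at 99Hz with call graphs
    perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > flamegraph.svg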
15. And this is how full-system flamegraphs look. We have Jessie in the background on the left and Stretch in the foreground on the right. This may be hard to see, but the idea is that each bar is a stack frame and its width corresponds to how frequently that stack appears, which is a proxy for CPU usage. You can see a fat column of frames on the left on Stretch that is not present on Jessie. It’s the sendfile syscall, highlighted in purple. It’s also present and highlighted on Jessie, but it’s tiny and quite hard to see. Flamegraphs let you click on a frame, which zooms into the stacks containing that frame, generating a sort of sub-flamegraph.
  16. So let’s click on sendfile on Stretch and see what’s going on.
  17. This is what we saw. For somebody who’s not a kernel developer this just looks like a bunch of TCP stuff, which is exactly what I saw. Some colleagues suggested that the differences in the graphs may be due to TCP offload being disabled, but upon checking our NIC settings, we found that the feature flags were identical. You can also see some spinlocks at the tip of the flamegraph, which reinforces our initial findings with perf top. Let’s see what else we can figure out from here.
18. To find out what’s going on with the system, we’ll be using bcc tools. The Linux kernel has a VM that allows us to attach lightweight and safe probes to trace the kernel. eBPF itself is a hot topic and there are talks that explore it in great detail; the slides for this talk link to them if you are interested. To clarify, the VM here is more like the JVM that provides a runtime, not like KVM that provides hardware virtualization. You can compile code down to this VM from any language, so don’t be surprised when one day you see JavaScript running in the kernel. I warned you. For the sake of brevity, let’s just say that there is a collection of readily available utilities that can help you debug various parts of the kernel and the underlying hardware. That collection is called BCC tools, and we’re going to use some of them to get to the bottom of our issue. On this slide you can see how different subsystems can be traced with different tools.
19. To compare latency distributions of sendfile syscalls between Jessie and Stretch, we’re going to use funclatency. It takes a function name and prints an exponential latency histogram for calls to that function. Here we print the latency histogram for do_sendfile, which is the kernel function behind the sendfile syscall, in microseconds, every second. You can see that most of the calls on Jessie hover between 8 and 31 microseconds. Is that good or bad? I don’t know, but a good way to find out is to compare against another system.
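The invocation looks roughly like this (stock bcc flags; on Debian the tool may be packaged as funclatency-bpfcc, and this is a sketch rather than the exact command from the slide):
    funclatency -u -i 1 do_sendfile    # -u: histogram in microseconds, -i 1: print every second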
20. Now let’s look at what’s going on with Stretch. I had to cut some parts because the histogram did not fit on the slide. If on Jessie we saw most calls complete in under 31 microseconds, here that number is 511 microseconds, a whopping 16x jump in latency.
21. In the flamegraphs you can see timers being set at the tip (the mod_timer function is responsible for that), with these timers taking locks. Instead of measuring latency we can count function calls, and this is where the funccount tool comes in. Feeding mod_timer to it as an argument, we can see how many calls there were every second. Here we have Jessie on the left and Stretch on the right: on Stretch we installed 3x more timers than on Jessie. That’s not a 16x difference, but it’s still something.
22. If we look at the number of locks taken for these timers by running funccount on the lock_timer_base function, we can see an even bigger difference, around 10x this time. To sum up: on Stretch we installed 3x more timers, resulting in 10x the amount of contention. It definitely seems like we’re onto something.
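The two counts come from invocations along these lines (again stock bcc flags; on Debian the tool names may carry a -bpfcc suffix):
    funccount -i 1 mod_timer          # how many timers are installed per second
    funccount -i 1 lock_timer_base    # how often the timer base lock is taken per second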
23. We could read the kernel source code to figure out which timers are being scheduled based on the flamegraph, but that seems like a tedious task. Instead, we can use the perf tool again to gather some statistics for us. There is a bunch of tracepoints in the kernel that provide insight into the timer subsystem; we’re going to use timer_start for our needs.
24. Here we record all timers started during 10 seconds and then print the function names they would trigger, with their respective counts. On Stretch we install 12x more tcp_write_timer timers, and that sounds like something that can cause issues. Remember: we are on a bandwidth-bound workload where the interface is 20G; that’s a lot of bytes to move.
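One way to get these numbers with plain perf, using the timer:timer_start tracepoint (this pipeline is a sketch, not necessarily the exact commands behind the slide):
    perf record -e timer:timer_start -a -- sleep 10
    perf script | grep -o 'function=[^ ]*' | sort | uniq -c | sort -rn | head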
25. Taking flamegraphs of the timer stacks specifically revealed the differences in their operation. It’s probably hard to see, but tcp_push_one really stands out on Stretch. Let’s dig in.
26. The traces showed huge variations in tcp_sendmsg and tcp_push_one latency within sendfile, which is expected given the earlier flamegraphs.
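Measuring those two functions is the same funclatency exercise as before (a sketch with stock bcc flags, not necessarily the exact invocation from the slides):
    funclatency -u tcp_sendmsg
    funclatency -u tcp_push_one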
27. To introspect further, we leveraged a kernel feature available since 4.9: the ability to count and aggregate stacks in the kernel. BCC tools include a stackcount tool that does exactly that, so let’s take advantage of it.
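A sketch of how it can be used (the traced function here is illustrative, picked because tcp_push_one stood out earlier; stock bcc flags):
    stackcount -D 10 tcp_push_one    # count unique kernel stacks leading to tcp_push_one over 10 seconds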
28. The most popular Jessie stack is on the left and the most popular Stretch stack is on the right. There were a few much less popular stacks too, but there’s only so much one can fit on the slides. The Stretch stack was too long, so “…” stands for the same frames as the highlighted section of the Jessie stack. The two are mostly the same, and it’s not exactly fun to spot the difference by eye, so let’s just look at the diff on the next slide.
29. We see 5 extra functions in the middle of the stack, starting with tcp_sendpage. Time to look at the source code. Usually I just google the function name and it gives me a result on elixir.bootlin.com, where I swap “latest” for my kernel version. The source browser there lets you click on identifiers and jump around the code to navigate.
30. This is how the tcp_sendpage function looks; I pasted it verbatim from the kernel source. From tcp_sendpage our stack jumps into sock_no_sendpage. If you look up what NETIF_F_SG means, you’ll find it’s the scatter-gather feature flag, which segmentation offload depends on. Segmentation offload is a technique where the kernel doesn’t split the TCP stream into packets itself, but instead offloads that job to the NIC. This makes a big difference when you want to send large chunks of data over high-speed links. That’s exactly what we are doing, and we definitely want offload enabled.
31. Let’s take a pause and see how we configure the network on our machines. Our 2x10G NICs show up as eth2 and eth3, which we bond into a bond0 interface. On top of bond0 we create two vlan interfaces, one for the public internet and one for the internal network.
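In iproute2 terms the layering looks roughly like this (our interfaces are actually set up by systemd-networkd; the bond mode, vlan names, and vlan IDs here are made up for illustration):
    ip link add bond0 type bond mode 802.3ad
    ip link set eth2 down; ip link set eth2 master bond0
    ip link set eth3 down; ip link set eth3 master bond0
    ip link add link bond0 name vlan10 type vlan id 10    # public internet
    ip link add link bond0 name vlan20 type vlan id 20    # internal network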
32. It turned out that we had segmentation offload enabled only on some of our interfaces: eth2, eth3, and bond0. When we checked the NIC settings for offload earlier, we only checked the physical interfaces and the bonded one, but ignored the vlan interfaces, where offload was indeed missing.
33. We compared ethtool output for the vlan interface, and there was our issue in plain sight.
34. We can just enable TCP offload by enabling scatter-gather (which is what “sg” stands for) and be done with it. Easy, right? Imagine our disappointment when this did not work. So much work, a clear indication that this was the cause, and yet the fix did not work.
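Checking and flipping the flag is a one-liner with ethtool (the interface name here is illustrative):
    ethtool -k vlan10 | grep -E 'scatter-gather|tcp-segmentation-offload'    # inspect offload features
    ethtool -K vlan10 sg on                                                  # enable scatter-gather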
35. The last missing piece we found was that offload changes only apply to newly established connections. We turned Kafka off and back on again to pick up the offload and immediately saw a positive effect, which is the green line. This is not the 5x change I mentioned at the beginning, because we were experimenting on a lightly loaded node to avoid disruptions.
36. Our network interfaces are managed by systemd-networkd, so the missing offload settings turned out to be a systemd bug in the end. It’s not clear whether upstream or Debian patches are responsible for it, however. In the meantime, we work around the upstream issue by automatically enabling offload features on boot if they are disabled on vlan interfaces.
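A minimal sketch of such a boot-time workaround, assuming it runs once at boot (for example from a oneshot unit) and using the made-up interface names from before:
    for dev in vlan10 vlan20; do
        ethtool -k "$dev" | grep -q 'scatter-gather: off' && ethtool -K "$dev" sg on
    done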
37. With the fix in place, we rebooted our Logs Kafka cluster to upgrade to the latest kernel, and on the 5-day CPU usage history you can see the clear positive result.
38. On the DNS cluster the results were even more dramatic because of the higher load. On this screenshot only one node is fixed, but you can see how much better it behaves compared to the rest.
39. The first lesson here is to pay closer attention to metrics during major upgrades. We did not see major CPU changes on a moderately loaded cluster and did not expect to see any on fully loaded machines. In the end we were upgrading neither Kafka, which was the main consumer of user CPU, nor the kernel, which was consuming the system CPU. The second lesson is how useful perf and bcc tools were at pointing us to where the issue was. These tools work out of the box, they are safe, and they do not require any third-party kernel modules. More importantly, they do not require the operator to be a kernel expert; you just need a basic understanding of the concepts. Another lesson is how important TCP offload is and how its importance grows non-linearly with traffic. It was unexpected that supposedly purely virtual vlan interfaces could be affected by offload, but it turned out they were. Challenge your assumptions often, I guess. Lastly, we used our ability to swap the OS and kernel on reboot to the fullest. Not having to reinstall anything meant we could iterate quickly.
40. The internal blog post about this incident was published in August 2017; a heavily truncated external blog post went out in May 2018, and that external post is what this talk is based on. All of it is here to illustrate how the tool we wrote can be used. During debugging we used bcc tools to count timers firing in the kernel ad hoc; if we had had a permanent metric for this, we could have noticed the issue sooner just by seeing an increase on a graph. That is what ebpf_exporter allows you to have: you can trace any function in the kernel (and in userspace) at very low overhead and create metrics in Prometheus format from it. For example, you can have a latency histogram for disk IO as a metric, which is not normally possible with procfs or anything else.
41. Here’s a slide from my presentation of ebpf_exporter which shows the level of detail you can get. On the left you can see IO wait time from /proc/diskstats, which is what Linux provides, and on the right you can see a heatmap of IO latency, which is what ebpf_exporter enables. With the histograms you can see how many IOs landed in a particular bucket, and things like multimodal distributions become visible. You can also see how many IOs went above some threshold, which lets you alert on it. The same goes for timers: the kernel does not keep counts of which timer functions fire anywhere you could collect them from.
42. That’s all I had to talk about today. On the slides you have some links on the topic. Slides with speaker notes will be available on the LISA18 website, and I’ll also tweet the link. I encourage you to look at my talk on ebpf_exporter itself, which goes into detail about why histograms are so great. It involves dinosaur gifs in a very scientific way you probably do not expect, so make sure to check that out. My colleague Alex will be doing a training on ebpf_exporter tomorrow; if you want to learn more, please come and talk to us. The slides have the information on time and location. If you want to learn more about eBPF itself, you can find Brendan Gregg around and ask him, as well as myself.