Presented at LISA18: https://www.usenix.org/conference/lisa18/presentation/babrou
This is a technical dive into how we used eBPF to solve real-world issues uncovered during an innocent OS upgrade. We'll see how we debugged a 10x CPU increase in Kafka after a Debian upgrade and what lessons we learned. We'll go from high-level effects like increased CPU, to flamegraphs showing us where the problem lies, to tracing timers and function calls in the Linux kernel.
The focus is on tools that operational engineers can use to debug performance issues in production. This particular issue happened at Cloudflare on a Kafka cluster doing 100Gbps of ingress and many multiples of that in egress.
3. What does Cloudflare do
CDN
Moving content physically
closer to visitors with
our CDN.
Intelligent caching
Unlimited DDoS
mitigation
Unlimited bandwidth at
flat pricing with free
plans
Edge access control
IPFS gateway
Onion service
Website Optimization
Making web fast and up to
date for everyone.
TLS 1.3 (with 0-RTT)
HTTP/2 + QUIC
Server push
AMP
Origin load-balancing
Smart routing
Serverless / Edge Workers
Post quantum crypto
DNS
Cloudflare is the fastest
managed DNS provider
in the world.
1.1.1.1
2606:4700:4700::1111
DNS over TLS
4. 160+
Data centers globally
4.5M+
DNS requests/s
across authoritative, recursive
and internal
10%
of Internet requests
every day
10M+
HTTP requests/second
10M+
Websites, apps & APIs
in 150 countries
20Tbps
Network capacity of
Cloudflare's anycast network
6. Link to slides with speaker notes
Slideshare doesn’t allow links on the first 3 slides
7. Cloudflare is a Debian shop
● All machines were running Debian Jessie on bare metal
● OS boots over PXE into memory, packages and configs are ephemeral
● Kernel can be swapped as easily as the OS
● New Stable (stretch) came out, we wanted to keep up
● Very easy to upgrade:
○ Build all packages for both distributions
○ Upgrade machines in groups, look at metrics, fix issues, repeat
○ Gradually phase out Jessie
○ Pop a bottle of champagne and celebrate
8. Cloudflare core Kafka platform at the time
● Kafka is a distributed log with multiple producers and consumers
● 3 clusters: 2 small (dns + logs) with 9 nodes, 1 big (http) with 106 nodes
● 2 x 10C Intel Xeon E5-2630 v4 @ 2.2GHz (40 logical CPUs), 128GB RAM
● 12 x 800GB SSD in RAID0
● 2 x 10G bonded NIC
● Mostly network bound at ~100Gbps ingress and ~700Gbps egress
● Check out our blog post on Kafka compression
● We also blogged about our Gen 9 edge machines recently
11. RCU stalls in dmesg
[ 4923.462841] INFO: rcu_sched self-detected stall on CPU
[ 4923.462843] 13-...: (2 GPs behind) idle=ea7/140000000000001/0 softirq=1/2 fqs=4198
[ 4923.462845] (t=8403 jiffies g=110722 c=110721 q=6440)
12. Error logging issues
Aug 15 21:51:35 myhost kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Aug 15 21:51:35 myhost kernel: 26-...: (1881 ticks this GP) idle=76f/140000000000000/0
softirq=8/8 fqs=365
Aug 15 21:51:35 myhost kernel: (detected by 0, t=2102 jiffies, g=1837293, c=1837292, q=262)
Aug 15 21:51:35 myhost kernel: Task dump for CPU 26:
Aug 15 21:51:35 myhost kernel: java R running task 13488 1714 1513 0x00080188
Aug 15 21:51:35 myhost kernel: ffffc9000d1f7898 ffffffff814ee977 ffff88103f410400 000000000000000a
Aug 15 21:51:35 myhost kernel: 0000000000000041 ffffffff82203142 ffffc9000d1f78c0 ffffffff814eea10
Aug 15 21:51:35 myhost kernel: 0000000000000041 ffffffff82203142 ffff88103f410400 ffffc9000d1f7920
Aug 15 21:51:35 myhost kernel: Call Trace:
Aug 15 21:51:35 myhost kernel: [<ffffffff814ee977>] ? scrup+0x147/0x160
Aug 15 21:51:35 myhost kernel: [<ffffffff814eea10>] ? lf+0x80/0x90
Aug 15 21:51:35 myhost kernel: [<ffffffff814eecb5>] ? vt_console_print+0x295/0x3c0
13. Page allocation failures
Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages
Aug 16 01:14:51 myhost kernel: [<ffffffff81171754>] shrink_inactive_list+0x1f4/0x4f0
Aug 16 01:14:51 myhost kernel: [<ffffffff8117234b>] shrink_node_memcg+0x5bb/0x780
Aug 16 01:14:51 myhost kernel: [<ffffffff811725e2>] shrink_node+0xd2/0x2f0
Aug 16 01:14:51 myhost kernel: [<ffffffff811728ef>] do_try_to_free_pages+0xef/0x310
Aug 16 01:14:51 myhost kernel: [<ffffffff81172be5>] try_to_free_pages+0xd5/0x180
Aug 16 01:14:51 myhost kernel: [<ffffffff811632db>] __alloc_pages_slowpath+0x31b/0xb80
...
[78991.546088] systemd-network: page allocation stalls for 287000ms, order:0,
mode:0x24200ca(GFP_HIGHUSER_MOVABLE)
14. Downgrade and investigate
● System CPU was up, so it must be the kernel upgrade
● Downgrade Stretch to Jessie
● Downgrade Linux 4.9 to 4.4 (known good, but no allocation stall logging)
● Investigate without affecting customers
● Bisection pointed at OS upgrade, kernel was not responsible
15. Make a flamegraph with perf
#!/bin/sh -e
# flamegraph-perf [perf args here] > flamegraph.svg
# Explicitly setting output and input to perf.data is needed to make perf work over ssh without TTY.
perf record -o perf.data "$@"
# Fetch JVM stack maps if possible, this requires -XX:+PreserveFramePointer
export JAVA_HOME=/usr/lib/jvm/oracle-java8-jdk-amd64 AGENT_HOME=/usr/local/perf-map-agent
/usr/local/flamegraph/jmaps 1>&2
IDLE_REGEXPS="^swapper;.*(cpuidle|cpu_idle|cpu_bringup_and_idle|native_safe_halt|xen_hypercall_sched_op|xen_hypercall_vcpu_op)"
perf script -i perf.data | /usr/local/flamegraph/stackcollapse-perf.pl --all | grep -E -v "$IDLE_REGEXPS" |
/usr/local/flamegraph/flamegraph.pl --colors=java --hash --title=$(hostname)
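The folding step that stackcollapse-perf.pl performs can be sketched in a few lines of Python. This is a simplified illustration of the "folded" format that flamegraph.pl consumes, not the real script, and the sample stacks are made up:

```python
from collections import Counter

def collapse(stacks):
    # Fold raw stack samples (root-first lists of frame names) into
    # the folded format flamegraph.pl consumes: "root;child;leaf count"
    counts = Counter(";".join(frames) for frames in stacks)
    return [f"{stack} {n}" for stack, n in sorted(counts.items())]

# Made-up samples standing in for `perf script` output
samples = [
    ["java", "write", "sys_sendfile", "do_sendfile"],
    ["java", "write", "sys_sendfile", "do_sendfile"],
    ["java", "read"],
]
for line in collapse(samples):
    print(line)
# java;read 1
# java;write;sys_sendfile;do_sendfile 2
```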
30. Diff of the most popular stack
--- jessie.txt 2017-08-16 21:14:13.000000000 -0700
+++ stretch.txt 2017-08-16 21:14:20.000000000 -0700
@@ -1,4 +1,9 @@
tcp_push_one
+inet_sendmsg
+sock_sendmsg
+kernel_sendmsg
+sock_no_sendpage
+tcp_sendpage
inet_sendpage
kernel_sendpage
sock_sendpage
31. Let’s look at tcp_sendpage
int tcp_sendpage(struct sock *sk, struct page *page, int offset, size_t size, int flags) {
ssize_t res;
if (!(sk->sk_route_caps & NETIF_F_SG) ||
!sk_check_csum_caps(sk))
return sock_no_sendpage(sk->sk_socket, page, offset, size,
flags);
lock_sock(sk);
tcp_rate_check_app_limited(sk); /* is sending application-limited? */
res = do_tcp_sendpages(sk, page, offset, size, flags);
release_sock(sk);
return res;
}
sock_no_sendpage is what we see on the stack
NETIF_F_SG is the scatter-gather capability that segmentation offload depends on
34. Compare ethtool -k settings on vlan10
-tx-checksumming: off
+tx-checksumming: on
- tx-checksum-ip-generic: off
+ tx-checksum-ip-generic: on
-scatter-gather: off
- tx-scatter-gather: off
+scatter-gather: on
+ tx-scatter-gather: on
-tcp-segmentation-offload: off
- tx-tcp-segmentation: off [requested on]
- tx-tcp-ecn-segmentation: off [requested on]
- tx-tcp-mangleid-segmentation: off [requested on]
- tx-tcp6-segmentation: off [requested on]
-udp-fragmentation-offload: off [requested on]
-generic-segmentation-offload: off [requested on]
+tcp-segmentation-offload: on
+ tx-tcp-segmentation: on
+ tx-tcp-ecn-segmentation: on
+ tx-tcp-mangleid-segmentation: on
+ tx-tcp6-segmentation: on
+udp-fragmentation-offload: on
+generic-segmentation-offload: on
35. Ha! Easy fix, let’s just enable it:
$ sudo ethtool -K vlan10 sg on
Actual changes:
tx-checksumming: on
tx-checksum-ip-generic: on
tcp-segmentation-offload: on
tx-tcp-segmentation: on
tx-tcp-ecn-segmentation: on
tx-tcp-mangleid-segmentation: on
tx-tcp6-segmentation: on
udp-fragmentation-offload: on
40. Lessons learned
● It’s important to pay closer attention to seemingly unrelated metrics
● Linux kernel can be easily traced with perf and bcc tools
○ Tools work out of the box
○ You don’t have to be a developer
● TCP offload is incredibly important and applies to vlan interfaces
● Switching OS on reboot proved to be useful
41. But really it was just an excuse
● Internal blog post about this is from Aug 2017
● External blog post in Cloudflare blog is from May 2018
● All to show where ebpf_exporter can be useful
○ Our tool to export hidden kernel metrics with eBPF
○ Can trace any kernel function and hardware counters
○ IO latency histograms, timer counters, TCP retransmits, etc.
○ Exports data in Prometheus (OpenMetrics) format
42. Can be nicely visualized with new Grafana
Disk upgrade in production
43. Thank you
● Blog post this talk is based on
● Github for ebpf_exporter: https://github.com/cloudflare/ebpf_exporter
● Slides for ebpf_exporter talk with presenter notes (and a blog post)
○ Disclaimer: contains statistical dinosaur gifs
● Training on ebpf_exporter with Alexander Huynh
○ Look for “Hidden Linux Metrics with Prometheus eBPF Exporter”
○ Wednesday, Oct 31st, 11:45 - 12:30, Cumberland room 3-4
● We’re hiring
Ivan on twitter: @ibobrik
Speaker notes
Hello,
Today we’re going to go through one production issue from start to finish and see how we can apply dynamic tracing to get to the bottom of the problem.
My name is Ivan and I work for a company called Cloudflare, where I focus on performance and efficiency of our products.
To give you some context, these are some key areas Cloudflare specializes in.
In addition to being a good old CDN service with free unlimited DDoS protection, we try to be at the front of innovation with technologies like TLS 1.3, QUIC and edge workers, making the internet faster and more secure for end users and website owners.
We’re also the fastest authoritative and recursive DNS provider. Our resolver 1.1.1.1 is privacy oriented and supports things like DNS over TLS, preventing intermediaries from knowing your DNS requests, not to mention DNSSEC.
If you have a website of any size, you should totally put this behind Cloudflare.
Here are some numbers to give you an idea of the scale we operate on.
We have 160 datacenters around the world and plan to grow to at least 200 next year.
At peak these datacenters process more than 10 million HTTP requests per second. At the same time the very same datacenters serve 4.5 million DNS requests per second across internal and external DNS.
That’s a lot of data to analyze and we collect logs into core datacenters for processing and analytics.
I often get frustrated when people show numbers that are not scaled to seconds. I figured if I can't beat them, I may as well join them.
Here you see numbers per day. My favorite one is network capacity, which is 1.73 exabytes per day.
As you can see, these numbers make no sense. It gets even weirder when different metrics are scaled to different time units.
Please don’t use this as a reference, always scale down to second.
Now to set a scene for this talk specifically, it makes sense to tell a little on our hardware and software stack.
All machines serving traffic and doing backend analytics are bare-metal servers running Debian; at that point in time we were running Jessie.
We’re big fans of ephemeral stuff, and not a single machine has an OS installed on persistent storage. Instead, we boot a minimal immutable initramfs over the network and install all packages and configs on top of that into ramfs with a configuration management system. This means that on reboot every machine is clean, and the OS and kernel can be swapped with just a reboot.
And the story starts with my personal desire to update Debian to the latest Stable release, which was Stretch at that time.
Our plan for this upgrade was quite simple because of our setup. We can just build all necessary packages for both distributions, switch some group of machines into Stretch, fix what’s broken and carry on to the next group of machines. No need to wipe disks, reinstall anything or deal with dependency issues. We even only needed to build just one OS image as opposed to one image per workload.
On the edge every machine is the same, so that part was trivial. In core datacenters, where out-of-band backend processing happens, we have different machines doing different workloads, which means a more diverse set of metrics to look at, but also means we can switch whole groups over faster.
One such group was the set of our Kafka clusters. If you’re not familiar with Kafka, it’s basically a distributed log system. Multiple producers append messages to topics and then multiple consumers read those logs. For the most part we’re using it as a queue with a large on-disk buffer that buys us time to fix issues in consumers without losing data.
We have three major clusters: DNS and Logs are small with just 9 nodes each, and HTTP is massive with 106 nodes.
You can see the specs for the HTTP cluster at that time on the slide: 128GB of RAM and two Broadwell Xeon CPUs in a NUMA setup with 40 logical CPUs.
We opted for 12 SSDs in RAID0 to prevent IO thrashing from consumers falling out of page cache. Disk-level redundancy is absent in favor of larger usable disk space and higher throughput; we rely on 3x replication instead.
In terms of network we had 2x10G NIC in bonded setup for maximum network throughput. It was not intended to provide any redundancy.
We used to have a lot of issues with being network bound, but in the end that was solved by aggressive compression with zstd. Funnily enough, we later went with 2x25G NICs, just because they are cheaper, even though we are not network bound anymore.
Check out our blog post about Kafka compression or a recent one about Gen 9 edge servers if you want to learn more.
So we did our upgrade on small Kafka clusters and it went pretty well, at least nobody said anything and user facing metrics looked good. If you were listening to talks yesterday, that’s what apparently should be alerted on, so no alerts fired.
On the big HTTP cluster, however, we started seeing issues with consumers timing out and lagging, so we looked closer at the metrics we had.
And this is what we saw: one upgraded node was using a lot more CPU than before, 5x more in fact. By itself this is not that big of an issue; you can see that we’re not stressing the CPUs that much. Typical Kafka CPU usage before this upgrade was around 3 logical CPUs out of 40, which leaves a lot of room.
Still, having 5x CPU usage was definitely an unexpected outcome. For control datapoints, we compared the problematic machine to another machine where no upgrade happened, and an intermediary node that received a full software stack upgrade on reboot, but not an OS upgrade, which we optimistically bundled with a minor kernel upgrade. Neither of these two nodes experienced the same CPU saturation issues, even though their setups were practically identical.
For debugging CPU saturation issues, we depend on linux perf command to find the cause. It’s included with the kernel and on end user distributions you can install it with package like linux-base or something.
The first question that comes to mind when we see CPU saturation issues is what is using the CPU. In tools like top we can see what processes occupy CPU, but with perf you can see which functions inside these processes sit on CPU the most. This covers kernel and user space for well behaved programs that have a way to decode stacks. That includes C/C++ with frame pointers and Go.
Here you can see top-like output from perf with the most expensive functions in terms of CPU time. Sorting is a bit confusing, because it sorts by inclusive time, but we’re mostly interested in “self” column, which shows how often the very tip of the stack is on CPU. In this case most of the time is taken by some spinlock slowpath.
Spinlocks in the kernel exist to protect critical sections from concurrent access. There are two reasons to use them:
* Critical section is small and is not contended
* Lock owner cannot sleep (interrupt handlers, for example, cannot sleep)
If spinlock cannot be acquired, caller burns CPU until it can get hold of the lock. While it may sound like a questionable idea at first, there are legitimate uses for this mechanism.
In our situation it seems like spinlock is really contended and half of CPU cycles are not doing useful work.
We don’t know what lock is causing this to happen from this output, however.
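To make the busy-waiting concrete, here is a toy spin-acquire in Python built on a non-blocking try-lock. This is an illustration of the concept only, not how kernel spinlocks are implemented:

```python
import threading

def spin_acquire(lock: threading.Lock) -> None:
    # Busy-wait until the lock is free. Unlike a blocking acquire, the
    # waiting thread keeps burning CPU the whole time: cheap when the
    # critical section is tiny, wasteful when the lock is contended.
    while not lock.acquire(blocking=False):
        pass

lock = threading.Lock()
counter = 0

def worker():
    global counter
    for _ in range(10_000):
        spin_acquire(lock)
        counter += 1  # critical section protected by the spinlock
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```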
There were also other symptoms, so let’s look at them first.
If anything bad happens in production, it’s always a good idea to have a look at dmesg. Messages there can be cryptic, but they can at least point you in the right direction. Fixing an issue is 95% knowing where to find the issue.
In that particular case we saw RCU stalls, where RCU stands for read-copy-update. I’m not exactly an expert in this, but it sounds like another synchronization mechanism and it can be affected by spinlocks we saw before.
We had seen rare RCU stalls before, and our (suboptimal) solution was to reboot the machine if no other issues could be found. 99% of the time a reboot fixed the issue for a long time.
However, one can only handle so many reboots before the problem becomes severe enough to warrant a deep dive. In this case we had other clues.
While looking deeper into dmesg, we noticed issues around writing messages to the console.
This suggested that we were logging too many errors, and the actual failure may be earlier in the process. Armed with this knowledge, we looked at the very beginning of the message chain.
And this is what we saw.
If you work with NUMA machines, you may immediately see “shrink_node” and have a minor PTSD episode.
What you should be looking at is the number of missed kernel messages. There were so many errors that journald wasn’t able to keep up. We have console access to work around that, and that’s where we saw page allocation stalls in the second log excerpt.
You don't want your page allocations to stall for 5 minutes, especially when it's order zero allocation, which is the smallest allocation of one 4 KiB page.
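For reference, an order-n allocation asks the kernel for 2^n physically contiguous pages, so with 4 KiB pages an order-0 request is the smallest possible one:

```python
PAGE_SIZE = 4096  # 4 KiB pages, as on x86-64

def alloc_size(order: int) -> int:
    # An order-n allocation requests 2**n physically contiguous pages
    return (1 << order) * PAGE_SIZE

print(alloc_size(0))  # 4096 bytes: the order-0 allocation that stalled
print(alloc_size(3))  # 32768
```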
Comparing to our control nodes, the only two possible explanations were: a minor kernel upgrade, and the switch from Debian Jessie to Debian Stretch. We suspected the former, since CPU usage implies a kernel issue.
Just to be safe, we both rolled the kernel back from 4.9 to a known-good 4.4 and downgraded the affected nodes back to Debian Jessie. This was a reasonable compromise, since we needed to minimize downtime on production nodes.
Then we proceeded to look into the issue in isolation.
To our surprise, after some bisecting we found that the OS upgrade alone was responsible for our issues; the kernel was off the hook.
Now all that remained is to find out what exactly was going on.
Flamegraphs are a great way to visualize stacks that cause CPU usage in the system.
We have a wrapper around Brendan Gregg’s flamegraph scripts that removes idle time and enables JVM stacks out of the box.
This gives us a way to get an overview of CPU usage in one command.
And this is what full-system flamegraphs look like. We have Jessie in the background on the left and Stretch in the foreground on the right.
This may be hard to see, but the idea is that each bar is a stack frame and width corresponds to frequency of this stack’s appearance, which is a proxy for CPU usage.
You can see a fat column of frames on the left on Stretch, that’s not present on Jessie. We can see it’s the sendfile syscall and it’s highlighted in purple. It’s also present and highlighted on Jessie, but it’s tiny and quite hard to see.
Flamegraphs allow you to click on the frame, which will zoom into stacks containing this frame, generating some sort of a sub-flamegraph.
So let’s click on sendfile on Stretch and see what’s going on.
This is what we saw. For somebody who’s not a kernel developer this just looks like a bunch of TCP stuff, which is exactly what I saw.
Some colleagues suggested that the differences in the graphs may be due to TCP offload being disabled, but upon checking our NIC settings, we found that the feature flags were identical.
You can also see some spinlocks at the tip of the flamegraph, which reinforces our initial findings with perf top.
Let’s see what else we can figure out from here.
To find out what’s going on with the system, we’ll be using bcc tools. Linux kernel has a VM that allows us to attach lightweight and safe probes to trace the kernel. eBPF itself is a hot topic and there are talks that explore it in great detail, slides for this talk link to them if you are interested.
To clarify, VM here is more like JVM that provides runtime and not like KVM that provides hardware virtualization. You can compile code down to this VM from any language, so don’t look surprised when one day you’ll see javascript running in the kernel. I warned you.
For the sake of brevity let’s just say that there’s a collection of readily available utilities that can help you debug various parts of the kernel and underlying hardware. That collection is called BCC tools and we’re going to use some of these to get to the bottom of our issue.
On this slide you can see how different subsystems can be traced with different tools.
To trace latency distributions of sendfile syscalls between Jessie and Stretch, we’re going to use funclatency. It takes a function name and prints an exponential latency histogram of the calls to that function. Here we print a latency histogram for do_sendfile, the kernel function behind the sendfile syscall, in microseconds, every second.
You can see that most of the calls on Jessie hover between 8 and 31 microseconds. Is that good or bad? I don’t know, but a good way to find out is to compare against another system.
Now let’s look at what’s going on with Stretch. I had to cut some parts, because histogram was not fitting into the slide.
If on Jessie we saw most of the calls complete in under 31 microseconds, here we see that number is 511 microseconds, a whopping 16x jump in latency.
In the flamegraphs, you can see timers being set at the tip (mod_timer function is responsible for that), with these timers taking locks.
We can count the number of function calls instead of measuring their latency, and this is where the funccount tool comes in. Feeding mod_timer to it as an argument, we can see how many calls there were every second.
Here we have Jessie on the left and Stretch on the right. On Stretch we installed 3x more timers than on Jessie. That’s not a 16x difference, but it’s still something.
If we look at the number of locks taken for these timers by running funccount on lock_timer_base function, we can see an even bigger difference, around 10x this time.
To sum up: on Stretch we installed 3x more timers, resulting in 10x the amount of contention. It definitely seems like we’re onto something.
We can look at the kernel source code to figure out which timers are being scheduled based on the flamegraph, but that seems like a tedious task. Instead, we can use perf tool again to gather some stats on this for us.
There’s a bunch of tracepoints in the kernel that provide insight into timer subsystem. We’re going to use timer_start for our needs.
Here we record all timers started for 10s and then print function names they were triggering with respective counts.
On Stretch we install 12x more tcp_write_timer timers; that sounds like something that could cause issues. Remember: we are in a bandwidth-bound workload where the interface is 20G. That’s a lot of bytes to move.
Taking specific flamegraphs of the timers revealed the differences in their operation.
It’s probably hard to see, but tcp_push_one really stands out on Stretch.
Let’s dig in.
The traces showed huge variations of tcp_sendmsg and tcp_push_one within sendfile, which is expected from the flamegraphs before.
To introspect further, we leveraged a kernel feature available since 4.9: the ability to count and aggregate stacks in the kernel. BCC tools include the stackcount tool that does exactly that, so let’s take advantage of it.
The most popular Jessie stack is on the left and the most popular Stretch stack is on the right. There were a few much less popular stacks too, but there’s only so much one can fit on the slides.
Stretch stack was too long, “…” is the same as highlighted section in Jessie stack.
These are mostly the same and it’s not exactly fun to find the difference, so let’s just look at the diff on the next slide.
We see 5 extra functions in the middle of the stack, starting with tcp_sendpage. Time to look at the source code.
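The diff on the slide is an ordinary unified diff of the two collapsed stacks. With difflib and the (abbreviated) frame lists from the slide it can be reproduced like this:

```python
import difflib

# Abbreviated frame lists following the slide (leaf-first order)
jessie = ["tcp_push_one", "inet_sendpage", "kernel_sendpage", "sock_sendpage"]
stretch = ["tcp_push_one", "inet_sendmsg", "sock_sendmsg", "kernel_sendmsg",
           "sock_no_sendpage", "tcp_sendpage", "inet_sendpage",
           "kernel_sendpage", "sock_sendpage"]

diff = difflib.unified_diff(jessie, stretch,
                            fromfile="jessie.txt", tofile="stretch.txt",
                            lineterm="")
print("\n".join(diff))
```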
Usually I just google the function name and it gives me a result to elixir.bootlin.com, where I swap “latest” to my kernel version. Source code there allows you to click on identifiers and jump around the code to navigate.
This is what the tcp_sendpage function looks like; I pasted it verbatim from the kernel source.
From tcp_sendpage our stack jumps into sock_no_sendpage. If you look up what NETIF_F_SG means, you’ll find it’s scatter-gather, the capability that segmentation offload depends on.
Segmentation offload is a technique where kernel doesn’t split TCP stream into packets, but instead offloads this job to a NIC. This makes a big difference when you want to send large chunks of data over high speed links. That’s exactly what we are doing and we definitely want to have offload enabled.
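A toy model of what segmentation means: without offload the host carves the stream into MSS-sized segments itself; with TSO the NIC gets one large buffer. The 1448-byte MSS here is just a typical value for a 1500-byte MTU, not a measured one:

```python
def segment(stream: bytes, mss: int = 1448):
    # Without TSO the host CPU carves the stream into MSS-sized segments
    # (and checksums each one); with TSO the NIC receives one large
    # buffer and does the splitting in hardware.
    return [stream[i:i + mss] for i in range(0, len(stream), mss)]

payload = b"x" * 65536          # one 64 KiB chunk, as sendfile might hand over
print(len(segment(payload)))    # 46 segments the host must produce without TSO
```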
Let’s take a pause and see how we configure network on our machines. Our 2x10G NIC provides eth2 and eth3, which we then bond into bond0 interface. On top of that bond0 we create two vlan interfaces, one for public internet and one for internal network.
It turned out that we had segmentation offload enabled only on some of our interfaces: eth2, eth3, and bond0. When we checked NIC settings for offload earlier, we only checked the physical interfaces and the bonded one, but ignored the vlan interfaces, where offload was indeed missing.
We compared ethtool output for vlan interface and there was our issue in plain sight.
We can just enable TCP offload by enabling scatter-gather (which is what “sg” stands for) and be done with it. Easy, right?
Imagine our disappointment when this did not work. So much work with clear indication that this is the cause and the fix did not work.
The last missing piece we found was that offload changes are applied only during connection initiation. We turned Kafka off and back on again to start offloading and immediately saw positive effects, which is the green line.
This is not 5x change I mentioned at the beginning, because we were experimenting on a lightly loaded node to avoid disruptions.
Our network interfaces are managed by systemd-networkd, and it turned out that the missing offload settings were a bug in systemd in the end. It’s not clear whether upstream or Debian patches are responsible, however.
In the meantime, we work around our upstream issue by enabling offload features automatically on boot if they are disabled on VLAN interfaces.
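A sketch of what such a boot-time check could look for: features that ethtool reports as off even though they were requested on. The parsing and the sample output are illustrative, not our actual workaround script:

```python
import re

def requested_but_off(ethtool_k_output: str):
    # Features ethtool reports as "off [requested on]" -- the stuck
    # state we found on the vlan interfaces
    return re.findall(r"^\s*([\w-]+): off \[requested on\]",
                      ethtool_k_output, re.M)

sample = """\
tx-checksumming: off
tcp-segmentation-offload: off
        tx-tcp-segmentation: off [requested on]
generic-segmentation-offload: off [requested on]
"""
print(requested_but_off(sample))
# ['tx-tcp-segmentation', 'generic-segmentation-offload']
```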
Having a fix enabled, we rebooted our logs Kafka cluster to upgrade to the latest kernel, and on 5 day CPU usage history you can see clear positive results.
On DNS cluster results were more dramatic because of the higher load. On this screenshot only one node is fixed, but you can see how much better it behaves compared to the rest.
The first lesson here is to pay closer attention to metrics during major upgrades. We did not see major CPU changes on a moderately loaded cluster and did not expect to see any effects on fully loaded machines. In the end, we were not upgrading Kafka, which was the main consumer of user CPU, or the kernel, which was consuming system CPU.
The second lesson is how useful perf and bcc tools were at pointing us to where the issue was. These tools work out of the box, they are safe, and they do not require any third-party kernel modules. More importantly, they do not require the operator to be a kernel expert; you just need some basic understanding of the concepts.
Another lesson is how important TCP offload is and how its importance grows non-linearly with traffic. It was unexpected that supposedly purely virtual vlan interfaces could be affected by offload, but it turned out they were. Challenge your assumptions often, I guess.
Lastly, we used our ability to swap OS and kernels on reboot to the fullest. Since the OS is never installed on disk, there was nothing to reinstall and we could iterate quickly.
Internal blog post about this incident was published in August 2017, heavily truncated external blog post went out in May 2018. That external blog post is what this talk is based on.
All of this is to illustrate how the tool we wrote can be used. During debugging we used bcc tools to count timers firing in the kernel ad hoc; if we had had a metric for this, we could have noticed the issue sooner by just seeing an increase on a graph. That’s what ebpf_exporter allows you to have: you can trace any function in the kernel (and in userspace) at very low overhead and create metrics in Prometheus format from it.
For example, you can have latency histogram for disk io as a metric, which is not normally possible with procfs or anything else.
Here’s a slide from my presentation of ebpf_exporter, which shows the level of detail you can get. On the left you can see IO wait time from /proc/diskstats, which is what Linux provides, and on the right you can see heatmap of IO latency, which is what ebpf_exporter enables.
With the histograms you can see how many IOs landed in a particular bucket and things like multimodal distributions can be seen. You can also see how many IOs went above some threshold, allowing you to have alerts on this.
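For the curious, the cumulative bucket shape of a Prometheus histogram can be rendered like this. The metric name and bucket bounds are hypothetical, and the _sum series a real histogram also carries is omitted:

```python
def render_histogram(name: str, buckets: dict) -> str:
    # Prometheus histograms are cumulative: each `le` bucket counts all
    # observations at or below its upper bound
    lines, total = [], 0
    for le in sorted(buckets):
        total += buckets[le]
        lines.append(f'{name}_bucket{{le="{le}"}} {total}')
    lines.append(f'{name}_bucket{{le="+Inf"}} {total}')
    lines.append(f"{name}_count {total}")
    return "\n".join(lines)

# Hypothetical IO latency bucket counts (upper bounds in milliseconds)
print(render_histogram("bio_latency_ms", {1: 10, 2: 4, 4: 1}))
```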
Same goes for timers: the kernel does not keep a count of which timers fire anywhere you could collect it from.
That’s all I had to talk about today. On the slides you have some links on the topic. Slides with speaker notes will be available on the LISA18 website and I’ll also tweet the link.
I encourage you to look at my talk on ebpf_exporter itself, which goes into details about why histograms are so great. It involves dinosaur gifs in a very scientific way you probably do not expect, so make sure to check that out.
My colleague Alex will be doing a training on ebpf_exporter tomorrow if you want to learn more about that, please come and talk to us. Slides have the information on time and location.
If you want to learn more about eBPF itself, you can find Brendan Gregg around and ask him as well as myself.