We describe a modification to the Linux Kernel which gives an SRE control over the combined bandwidth of logging on a node of a distributed system, while providing a way for the logging source owner (container or service) to control what happens when the bandwidth limit is hit.
Let’s Fix Logging Once and for All
1. Brought to you by
Peter Portante
Senior Principal Software Engineer at Red Hat, Inc.
2. Abstract … and Why?
“We describe a modification to the Linux Kernel which gives an SRE control over
the combined bandwidth of logging on a node of a distributed system, while
providing a way for the logging source owner (container or service) to control what
happens when the bandwidth limit is hit.”
■ Why do we care?
● Because a node can become unstable when one or more processes consume disk or network
resources due to bugs (or unintended behaviors) or malicious code
■ Why separation of policy from rate-limit?
● So that the SREs can provide a stable platform, while application / service owners maintain
behavior in the face of limits
3. Peter Portante
Senior Principal Software Engineer at Red Hat, Inc.
■ Something cool I’ve done - 7 club passing
■ My perspective on P99s - New and hopeful
■ Another thing about me - I enjoy yard work and puttering
■ What I do away from work - I love to juggle clubs
5. First Principles
■ Restore behavioral control for logging on a node to the SRE
● An SRE should be able to set a limit for the total logging rate of a node
■ Applications retain control of their behavior when limits are hit
● Should the application slow to meet the logging rate?
● Should the application ignore the limit by dropping logs?
6. Node Rate-Limit for SRE
■ Implement an opt-in “bandwidth gate” for file descriptors
■ SRE sets bandwidth limit for the gate
● System-wide
● Amount per interval (100 MB/sec, 10 MB/min, etc.)
■ write() system call does not move data if bandwidth limit is hit during interval
■ SRE directs participating frameworks (systemd, podman/conmon, etc.) to use
the gate
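The gate on this slide can be modeled in user space as a fixed-window byte budget. The following is a minimal sketch of that accounting only, not the kernel implementation; the class name `BandwidthGate`, its fields, and the `admit` method are hypothetical, and the fixed-window scheme is an assumption about how "amount per interval" is counted.

```python
from dataclasses import dataclass


@dataclass
class BandwidthGate:
    """Hypothetical model: one byte budget shared by all opted-in descriptors."""
    max_bytes: int             # budget per interval, set by the SRE
    interval: float            # interval length in seconds
    window_start: float = 0.0  # start time of the current interval
    used: int = 0              # bytes already admitted in this interval

    def admit(self, nbytes: int, now: float) -> int:
        """Return how many of nbytes may move in the interval containing now."""
        if now - self.window_start >= self.interval:
            # One or more intervals have elapsed: realign the window, reset the budget.
            elapsed = int((now - self.window_start) // self.interval)
            self.window_start += elapsed * self.interval
            self.used = 0
        allowed = min(nbytes, self.max_bytes - self.used)
        self.used += allowed
        return allowed


gate = BandwidthGate(max_bytes=100, interval=1.0)
first = gate.admit(60, now=0.0)    # fits entirely in the budget
second = gate.admit(60, now=0.5)   # only the remaining 40 bytes fit
third = gate.admit(10, now=0.9)    # budget exhausted for this interval
fourth = gate.admit(10, now=1.0)   # a new interval starts with a fresh budget
```

Because the budget is a single shared counter, a noisy source and a quiet source draw from the same pool, which is what makes the limit node-wide rather than per-process.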
7. Behavioral Policies for the Application
■ Add policy associated with the application
● Policy is either “drop” or “block” (default set by the SRE for the system)
■ For “drop”, the write() system call always reports the full number of bytes it was given as written
● But it only actually writes the amount that fits in that interval’s bandwidth
■ For “block”, the write() system call returns the number of bytes it was able to write in the interval,
and blocks once the interval’s byte total has been reached
● The key is that write() will block before any data is transferred from the user’s buffer
when the limit is hit
● Frameworks that create processes (systemd, podman/conmon, etc.) set requested policy
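The two policies can be modeled as a pure function of the bytes left in the current interval. This is a sketch of the semantics described on this slide, not kernel or eBPF code; `Policy`, `gated_write`, and `budget_left` are hypothetical names introduced for illustration.

```python
from enum import Enum
from typing import Tuple


class Policy(Enum):
    DROP = "drop"
    BLOCK = "block"


def gated_write(data: bytes, budget_left: int, policy: Policy) -> Tuple[int, int, bool]:
    """Model one write() against the bytes left in the current interval.

    Returns (reported, moved, blocks):
      reported - byte count the caller sees for this interval
      moved    - bytes actually transferred out of the caller's buffer
      blocks   - True if the call would sleep until the next interval
    """
    moved = min(len(data), budget_left)
    if policy is Policy.DROP:
        # The caller is told everything was written; the excess is discarded.
        return len(data), moved, False
    if budget_left == 0 and data:
        # BLOCK with no budget: sleep before any data leaves the caller's buffer.
        return 0, 0, True
    # BLOCK with some budget: behaves like a short write.
    return moved, moved, False


drop_result = gated_write(b"x" * 10, budget_left=4, policy=Policy.DROP)
short_result = gated_write(b"x" * 10, budget_left=4, policy=Policy.BLOCK)
blocked_result = gated_write(b"x" * 10, budget_left=0, policy=Policy.BLOCK)
```

Under “drop” the application never slows down but silently loses the excess; under “block” it sees an ordinary short write and then stalls, so an application that already loops on partial writes needs no changes.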
9. What Changed
■ Container run-times which byte-capture / interpret stdout & stderr by
default, and write the data to disk first
● Podman / CRI-O (conmon), Docker
■ Densification of applications as a node’s memory and compute resources
have grown
● With 10+ cores per socket, and hyper-threads, node concurrency can easily generate more log
data than available local disk or network bandwidth can handle
■ Separation of who writes applications from who runs them, and where
● Containers make it easy to build an app once, and run it anywhere
15. But why in the Kernel?
■ Both conmon and systemd could implement a similar mechanism in
user-space
● BUT data is transferred through a pipe (conmon) and a socket (systemd) before those services
can handle it
■ For systemd
● One can already come close to this solution with the existing behaviors, BUT the application
owner has no control over drop vs block
■ For conmon
● A shared memory segment could be used across all conmon processes, BUT then the SRE has
to consider how to manage each sub-system separately
■ The kernel-based solution avoids unnecessary resource usage and gives the
SRE one place to set the logging limit
16. SRE Sets Node’s Logging Bandwidth Limit
■ A simple agreed-upon sysconfig file containing the bandwidth limit
● /etc/sysconfig/logging-bandwidth
■ INTERVAL = 10 secs
■ MAXIMUM_BYTES = 100 MiB
■ An eBPF script implementing rate-limit and policy enforcement is provided
■ Systemd and Podman (conmon) “opt in” by creating pipes and sockets with the eBPF
hook enabled
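To make the two settings concrete, here is a small sketch that reads the KEY = VALUE UNIT lines shown on this slide and converts them to seconds and bytes. The parser, the function name `parse_logging_bandwidth`, and the unit tables are assumptions; the slide only shows the units "secs" and "MiB" and does not define a file grammar.

```python
import re

# Unit tables are assumptions; the slide only shows "secs" and "MiB".
TIME_UNITS = {"sec": 1, "secs": 1, "min": 60, "mins": 60}
SIZE_UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}

LINE = re.compile(r"^\s*(\w+)\s*=\s*(\d+)\s*(\w+)\s*$")


def parse_logging_bandwidth(text: str) -> dict:
    """Parse KEY = VALUE UNIT lines into an interval in seconds and a byte budget."""
    raw = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized line: {line!r}")
        key, value, unit = match.groups()
        raw[key] = (int(value), unit)

    interval_value, interval_unit = raw["INTERVAL"]
    bytes_value, bytes_unit = raw["MAXIMUM_BYTES"]
    return {
        "interval_seconds": interval_value * TIME_UNITS[interval_unit],
        "maximum_bytes": bytes_value * SIZE_UNITS[bytes_unit],
    }


SAMPLE = """\
INTERVAL = 10 secs
MAXIMUM_BYTES = 100 MiB
"""
limit = parse_logging_bandwidth(SAMPLE)
average_bytes_per_second = limit["maximum_bytes"] // limit["interval_seconds"]
```

With the values on the slide, the node may log 100 MiB in any 10-second interval, an average of 10 MiB per second, although a burst can spend the whole budget early in the interval.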
17. Policy Provided via Systemd & Podman
■ Systemd
● In service file
■ StdoutLoggingPolicy = drop
■ StderrLoggingPolicy = block
■ SyslogLoggingPolicy = block
■ Podman (conmon)
● $ podman run \
    --log-opt stdoutloggingpolicy=drop \
    --log-opt stderrloggingpolicy=block
18. Policy Provided via Kubernetes Container Spec
apiVersion: v1
kind: Pod
metadata:
  name: helloworld
spec:
  containers:
  - name: helloworld
    image: helloworld
    logging:
      policy:
        stdout: drop
        stderr: block
19. Recap
■ Institute a node logging limit controlled by the SRE
■ Give application owners the ability to determine behavior at the limit
● drop vs block
■ Place the gate so data is not transferred from a process
● Avoid unnecessary data movement and resource usage
■ Implement in the Kernel to share among participating sub-systems
● podman/conmon, systemd, etc.
20. Brought to you by
Peter Portante
peter.portante@redhat.com