2. About Me
• Senior Software Engineer at Microsoft.
• Kubernetes Bangalore Meetup organizer.
• Find me at:
@surajd_
suraj.io
surajd.service@gmail.com
3. Threat Model
• An attacker with shell access to a container tries to carry out
malicious activities using the container’s identity, network, cloud
privileges, volumes, secrets, etc.
4. Attack Vector
• The attacker tries to write binaries to the container filesystem and execute them.
• Mitigation/Defense:
• Disallow execution of binaries not shipped with the container image.
5. Existing Solutions
• Read-only filesystem
• /dev partition is still writable.
• Scratch base image
• Only suitable for apps with static binaries.
6. New Solution
Know about binary execution even before it is executed from inside the container!
[Diagram: a Policy Enforcer watches the container rootfs. When a binary such as /bin/foo is about to be executed, the enforcer answers ✅ (allow) or ❌ (deny).]
7. How do you get those notifications?
• Enter fanotify
• A filesystem notification framework in Linux.
8. What is fanotify?
From fanotify man page:
The fanotify API provides notification and interception of filesystem
events. Use cases include virus scanning and hierarchical storage
management.
… monitor all of the objects in a mounted filesystem, make access
permission decisions, and the possibility to read or modify files before
access by other applications.
Source: https://man7.org/linux/man-pages/man7/fanotify.7.html
10. Fanotify as system calls
int fanotify_init(unsigned int flags, unsigned int event_f_flags);
int fanotify_mark(int fanotify_fd, unsigned int flags, uint64_t mask,
int dirfd, const char *pathname);
11. • Watch the rootfs of the container.
• Send events/notifications when permission to open a file for
execution is requested, i.e. FAN_OPEN_EXEC_PERM.
/* Create an fanotify file descriptor with unlimited queue and unlimited
marks */
fd = fanotify_init(FAN_CLASS_CONTENT | FAN_UNLIMITED_QUEUE |
FAN_UNLIMITED_MARKS,
O_RDONLY | O_LARGEFILE | O_CLOEXEC);
/* Place a mark on the container's rootfs, which can be derived from the
container's PID 1 and looks like /proc/PID/root. */
ret = fanotify_mark(fd, FAN_MARK_ADD | FAN_MARK_MOUNT,
FAN_OPEN_EXEC_PERM | FAN_EVENT_ON_CHILD,
AT_FDCWD, path);
12. Design 1: Policy Enforcer for Containers
• Runtime Verifier using diff APIs.
13. Detour: Layers in container image
$ docker run --name=example -it fedora /bin/bash -c 'touch foobar && rm /usr/bin/touch'
$ docker diff example
A /foobar
C /usr
C /usr/bin
D /usr/bin/touch
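[Diagram: a union filesystem merges the container image layer with the container layer. /foobar exists only in the container layer and shows up in the merged view; /usr/bin/touch from the image layer is marked deleted (❌).]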
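To make use of such a diff programmatically, each output line has to be split into a change kind and a path. The sketch below parses the line format shown above (A = added, C = changed, D = deleted). The type and function names are hypothetical, and it assumes the enforcer receives the diff as text lines; a real implementation might query the runtime's API instead.

```c
#include <stddef.h>

/* Change kinds printed by `docker diff`. */
enum change_kind {
    CHANGE_ADDED = 'A',
    CHANGE_CHANGED = 'C',
    CHANGE_DELETED = 'D'
};

struct diff_entry {
    enum change_kind kind;
    const char *path; /* points into the caller's line buffer */
};

/* Parse one line such as "A /foobar".
 * Returns 0 on success, -1 if the line is malformed. */
int parse_diff_line(const char *line, struct diff_entry *out)
{
    if (line == NULL || out == NULL)
        return -1;
    if (line[0] != 'A' && line[0] != 'C' && line[0] != 'D')
        return -1;
    /* Expect exactly one space followed by an absolute path. */
    if (line[1] != ' ' || line[2] != '/')
        return -1;
    out->kind = (enum change_kind)line[0];
    out->path = line + 2;
    return 0;
}
```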
14. Runtime Verifier
Participants: Kernel, App, Container Runtime
1. Kernel → App: Executing /bin/foo
2. App → Container Runtime: Get the current diff in layer 0
3. Container Runtime → App: Current diff in layer 0
4. App: Check if binary path exists in returned diff
5. App → Kernel: Allow if binary not in diff; deny if binary in diff
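The decision in steps 4 and 5 comes down to a membership test: if the binary's path appears in the diff, it was not shipped with the image unchanged, so execution is denied. A minimal sketch, assuming the diff has already been fetched from the runtime and that both the diff paths and the executed path are relative to the container rootfs. The names diff_entry and allow_exec_by_diff are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct diff_entry {
    char kind;        /* 'A', 'C' or 'D', as printed by `docker diff` */
    const char *path; /* path relative to the container rootfs */
};

/* Allow execution only if `exe_path` does not appear in the diff. */
bool allow_exec_by_diff(const struct diff_entry *diff, size_t n,
                        const char *exe_path)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(diff[i].path, exe_path) == 0)
            return false; /* added, changed or deleted since image build */
    }
    return true;
}
```

With the diff from the earlier example, /foobar would be denied while an untouched binary such as /usr/bin/ls would be allowed.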
15. Design 2: Policy Enforcer for Containers
• Pre-run: Trusted Source of Truth
• Runtime Verifier using above “Source of Truth”.
16. Pre-run: Trusted Source of Truth
• The source of truth is a list of binaries and their hashes/signatures.
• Trusted sources:
• An image built with that metadata in a “Secure Software Supply Chain” environment.
• Hashes calculated after the rootfs has been unpacked and before the container's PID 1 starts.
• A service that calculates the hashes of binaries in the image and provides them over a REST call.
42a340d1ff0747a52db7b372eeb906d8a6de6c0d0627265f2f09ddfefd2b0ce2 /usr/bin/ls
598bb15167292c328a9869e5cc301f3d4f92ec0a7f9bc91351203844d70ff94e /usr/bin/cp
6a4a2172c6a818d218bc28384b9ccc068791c2f0d980775287f47ca5d2591cbc /usr/bin/touch
6a4833060350d2434944d5d35693ac02bf0c869623e49184d2ea20adaf47c107 /usr/bin/less
c34dfda3e53b26d9b09ec3b1ac03ea25b04977736ab5b399410ddfb09748ec45 /usr/bin/awk
3f794988bc9b6e734d06c6507b4335054d01760a741b60104a49543cf7a964ed /usr/bin/cd
581975f0d51f108bf51714664406f98fbef25a14da6d29ff84840e2c89fc6350 /usr/bin/cat
6443adf01b4bac47cc87f41a293130431f42b1c31c09568f4a3dc548c5e644f2 /usr/bin/chmod
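A list in this shape can be loaded with a small line parser. The sketch below assumes each entry is a 64-character hex digest (which would fit SHA-256, though the slides do not name the algorithm), followed by whitespace and an absolute path. The struct, the 256-byte path limit and the function name are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

#define DIGEST_HEX_LEN 64 /* assumption: digests are 64 hex characters */

struct manifest_entry {
    char digest[DIGEST_HEX_LEN + 1];
    char path[256]; /* arbitrary limit chosen for this sketch */
};

/* Parse "<digest><whitespace><absolute path>".
 * Returns 0 on success, -1 if the line is malformed. */
int parse_manifest_line(const char *line, struct manifest_entry *out)
{
    size_t i;

    /* The digest must be exactly DIGEST_HEX_LEN hex characters. */
    for (i = 0; i < DIGEST_HEX_LEN; i++) {
        if (!isxdigit((unsigned char)line[i]))
            return -1;
        out->digest[i] = (char)tolower((unsigned char)line[i]);
    }
    out->digest[DIGEST_HEX_LEN] = '\0';

    /* At least one space or tab separates digest and path. */
    const char *p = line + DIGEST_HEX_LEN;
    if (*p != ' ' && *p != '\t')
        return -1;
    while (*p == ' ' || *p == '\t')
        p++;

    /* The path runs to the end of the line and must be absolute. */
    size_t len = strcspn(p, "\r\n");
    if (len == 0 || p[0] != '/' || len >= sizeof(out->path))
        return -1;
    memcpy(out->path, p, len);
    out->path[len] = '\0';
    return 0;
}
```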
17. Runtime Verifier
Participants: Kernel, App, Filesystem
1. Kernel → App: Executing /bin/foo
2. App → Filesystem: Calculate current hash of /bin/foo in rootfs
3. Filesystem → App: Current hash
4. App: Match current hash with existing hash
5. App → Kernel: Allow if hashes match; deny if hashes don't match
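Steps 4 and 5 can be sketched as a lookup in the trusted list followed by a digest comparison. The hashing itself is left out here; current_digest is assumed to be a hex string computed by whatever hash library the enforcer uses. The names are hypothetical, and denying binaries that are missing from the list is an assumption that follows from the goal of only running what shipped with the image.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

struct trusted_entry {
    const char *path;   /* path relative to the container rootfs */
    const char *digest; /* expected hex digest from the source of truth */
};

/* Allow execution only if `path` is listed and its digest matches. */
bool allow_exec_by_hash(const struct trusted_entry *trusted, size_t n,
                        const char *path, const char *current_digest)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(trusted[i].path, path) == 0)
            return strcmp(trusted[i].digest, current_digest) == 0;
    }
    return false; /* assumption: unknown binaries are denied */
}
```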
18. Design 1 vs Design 2
Design 1 | Design 2
Does not need pre-computed hashes. | Needs pre-computed hashes to function.
Depends on the container runtime to provide diffs. | Container runtime agnostic.
Adds a dependency on the runtime being online. | Continues to run regardless of the runtime's uptime.
19. Policy Enforcer App
• Standalone app that runs as a daemon (using systemd or a Kubernetes
DaemonSet).
• Pros: Flexible policy changes, can be backed by an operator,
ease of deployment.
• Cons: A single daemon monitors all containers on the node and can have
late-start issues.
• Code as a part of a custom containerd-shim.
• Pros: One daemon per container.
• Cons: Inflexible policy changes; needs changes to installation
processes.
20. Advantages over the existing solutions
• Compared to a read-only filesystem
• Everything stays writable, but files written at runtime are not executable.
• Compared to a scratch base image
• Apps that need an interpreter can run without fear of being compromised
through the extra baggage in the image.
22. Disadvantages of Fanotify based eventing
• The userspace program could slow down the container application because of
kernel-mode/user-mode transitions, hash calculation, etc.
• Limited by the memory available for storing events.
• Events can be bypassed using memfd.
• Scripts streamed to an interpreter such as Python are not caught.
• Deadlocks are possible.
23. Roadmap
• A fine-grained policy allowing user to provide allowlist and denylist of
paths.
• Disallow apps from using STDIN as input.
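The allowlist/denylist item is not designed yet, so the following is only one possible shape for it. This sketch assumes that entries are path prefixes matched on whole path components, that the denylist wins over the allowlist, and that paths matching neither list fall through to the default verifier. All names and these semantics are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum verdict { VERDICT_ALLOW, VERDICT_DENY, VERDICT_DEFER };

/* True if `path` equals `prefix` or lies underneath it as a directory. */
static bool path_has_prefix(const char *path, const char *prefix)
{
    size_t n = strlen(prefix);
    if (n == 0 || strncmp(path, prefix, n) != 0)
        return false;
    if (prefix[n - 1] == '/')
        return true;
    /* Avoid matching "/usr/binx" against the prefix "/usr/bin". */
    return path[n] == '\0' || path[n] == '/';
}

/* Denylist takes precedence; unmatched paths defer to the default check. */
enum verdict evaluate_policy(const char *const *allow, size_t n_allow,
                             const char *const *deny, size_t n_deny,
                             const char *path)
{
    for (size_t i = 0; i < n_deny; i++)
        if (path_has_prefix(path, deny[i]))
            return VERDICT_DENY;
    for (size_t i = 0; i < n_allow; i++)
        if (path_has_prefix(path, allow[i]))
            return VERDICT_ALLOW;
    return VERDICT_DEFER;
}
```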