2. Nice to meet you
My name is Ching-Hsuan Yen, and my nickname is Mango.
- An R&D engineer on Deep Security at Trend Micro
- A CS master's graduate of NCTU
- A former leader of the Linux team at NCTU CSCC
- A member of Bamboofox
3. Outline
Kubernetes: Secure Container Isolation
- Requirements and use cases
Are containers secure?
- The weakness of containers
Approaches to secure containers
- Kata Container and gVisor
5. Requirements: CIA
Confidentiality - a sandboxed process should not be able to access:
● application data in other pods - e.g. volumes, memory, writable layer, etc.
● application metadata of other pods - e.g. container image names, pod & service names, pod labels, etc.
● system metrics & resource usage
● system metadata - e.g. Kubernetes version, OS version, runtime version
Integrity - a sandboxed process should not be able to:
● alter processes or data outside the sandbox, e.g. mitigate confused deputy attacks, data tampering, etc.
● perform operations not required by the sandboxed application, e.g. a web server may not need to make outgoing connections
Availability - a sandboxed process should not be able to affect the availability of processes or resources outside the sandbox, e.g. mitigate local DoS attacks
6. Use cases
1. Sandbox vulnerable code (media library model)
2. Sandbox untrusted code (vendor blackbox model)
3. Provide maximum defense in depth (financial services model)
4. Sandbox multitenant code (hosting provider model)
5. Sandbox multitenant services
6. Mutually untrusted users want to share a cluster (KaaS model)
7. Sidecar container has distinct privileges
12. Current State of Container Isolation
Namespaces - Isolate kernel data structures such as processes, mount tables, and network interfaces. Not all kernel data structures are namespaced; the clock, audit logs, and keyrings, for example, are not.
cgroups - Limit, control, and account for compute resources and devices. Examples include limiting and accounting for CPU, memory, and network usage, hiding devices, and limiting the number of process IDs.
seccomp-bpf - Whitelist (filter) Linux syscalls & arguments. Useful for restricting non-namespaced syscalls, poorly supported syscalls, and syscalls that don't have associated capabilities. Docker provides a default seccomp profile, which is compatible with most unprivileged container workloads.
AppArmor / SELinux - Linux Security Modules (AppArmor & SELinux are mutually exclusive). Mostly useful for finer-grained control of filesystem access, but recent changes are adding more networking controls.
Users - Core Linux permission model. Mostly used for filesystem permissions (DAC) and process signaling.
Capabilities - Subdivide root user privileges into various capabilities. The Docker defaults drop un-namespaced capabilities (e.g. the ability to install kernel modules, manage network devices, and reboot the machine).
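As a rough illustration of the capabilities point above, the following sketch decodes the effective capability mask (the CapEff field of /proc/<pid>/status) and reports whether the three risky capabilities named on the slide are present. The helper names are made up for this example; the bit numbers are the ones defined in <linux/capability.h>.

```python
# Capability bit numbers from <linux/capability.h>.
CAP_NET_ADMIN = 12   # manage network devices
CAP_SYS_MODULE = 16  # load and unload kernel modules
CAP_SYS_BOOT = 22    # reboot the machine

RISKY_CAPS = {
    "CAP_NET_ADMIN": CAP_NET_ADMIN,
    "CAP_SYS_MODULE": CAP_SYS_MODULE,
    "CAP_SYS_BOOT": CAP_SYS_BOOT,
}


def parse_status(text):
    """Turn the 'Key:<tab>value' lines of /proc/<pid>/status into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def risky_caps_held(status_text):
    """Return the names of risky capabilities set in the effective set (CapEff)."""
    cap_eff = int(parse_status(status_text)["CapEff"], 16)
    return sorted(name for name, bit in RISKY_CAPS.items() if cap_eff & (1 << bit))
```

Inside a container, passing the contents of /proc/self/status to risky_caps_held() should return an empty list if the runtime dropped these capabilities. The "Seccomp" field of the same file indicates whether a seccomp filter is active (2 means filter mode).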
15. Are containers secure?
Is it secure to download random container images and run them on the host?
Is it secure for CaaS providers to let tenants run their own images?
Can containers be as secure as VMs?
16. Are containers secure?
Containers should be treated like standard services, e.g. nginx, postfix, sshd.
As an experienced system administrator, you should:
● Drop privileges as quickly as possible
● Run your services as non-root whenever possible
● Treat root within a container as if it is root outside of the container
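A minimal sketch of the "drop privileges as quickly as possible" advice above, assuming a service that starts as root and then switches to an unprivileged account. The function name and the uid/gid used in the usage note are placeholders for this example, not values from the talk.

```python
import os


def drop_privileges(uid, gid):
    """Permanently switch from root to an unprivileged uid/gid.

    Returns False without changing anything if the process is not root.
    """
    if os.geteuid() != 0:
        return False
    os.setgroups([])   # drop supplementary groups first
    os.setgid(gid)     # then the group, while we still have the right to
    os.setuid(uid)     # finally the user; root cannot be regained after this
    return True
```

A service would typically do its privileged setup first (for example binding a low port) and then call something like drop_privileges(65534, 65534), where 65534 is assumed here to be an unprivileged account such as "nobody". The order matters: once setuid() has run, the process no longer has the right to change its groups.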
18. Normal containers are not secure
Privileged container: too dangerous
Unprivileged container: no root, no life
Namespaced container: sounds good?
NO, not everything is namespaced.
Containers are still vulnerable.
[Figure: Container on top of Kernel, with a label truncated to "Vulner"]
19. Normal containers are not secure
Major kernel subsystems are not namespaced, for example:
1. SELinux
2. Cgroups
3. file systems under /sys
4. /proc/sys, /proc/sysrq-trigger, /proc/irq, /proc/bus
Devices are not namespaced:
1. /dev/mem
2. /dev/sd* file system devices
3. Kernel Modules
Break just one of them and you can own the system, e.g. with Dirty COW.
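To see how much of this shared surface is reachable from inside a given container, a small audit along the following lines can help. It is only a sketch: the path list is copied from the slide, the function name is made up, and a path being visible or writable is a hint rather than proof of an exploitable weakness.

```python
import os

# Paths named on the slide that are shared with the host kernel.
SENSITIVE_PATHS = [
    "/proc/sys",
    "/proc/sysrq-trigger",
    "/proc/irq",
    "/proc/bus",
    "/sys",
    "/dev/mem",
]


def audit_paths(paths):
    """Report, for each path, whether it exists and whether this process may write to it."""
    report = {}
    for path in paths:
        report[path] = {
            "exists": os.path.exists(path),
            "writable": os.access(path, os.W_OK),
        }
    return report
```

Running audit_paths(SENSITIVE_PATHS) inside a container and printing the result shows which of these paths the container can see and which it could modify.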
22. How to protect the host kernel?
Keep containers out of the kernel space.
But how could a container work without the host kernel?
Just forge one for the containers!
23. Two ideas
gVisor: we can forge a kernel!
I mean… a kernel in User Space!
Kata Container: we can forge a kernel!
I mean… a kernel in Virtual Machines!
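The "kernel in user space" idea can be pictured with a toy model: the sandboxed application's calls are served by an ordinary user-space object instead of being passed to the host kernel. The class below is invented for illustration and is not gVisor's code or design; it only shows the shape of the idea, including the fact that an unimplemented call fails inside the sandbox.

```python
import errno


class ToySentry:
    """A toy user-space "kernel": it serves guest calls itself instead of
    forwarding them to the host kernel. Purely illustrative."""

    def __init__(self):
        self._files = {}  # in-memory file contents, private to this sandbox
        self._handlers = {"write": self._write, "read": self._read}

    def syscall(self, name, *args):
        handler = self._handlers.get(name)
        if handler is None:
            # Unimplemented calls fail inside the sandbox; the host never sees them.
            return -errno.ENOSYS
        return handler(*args)

    def _write(self, path, data):
        self._files[path] = self._files.get(path, b"") + data
        return len(data)

    def _read(self, path):
        if path not in self._files:
            return -errno.ENOENT
        return self._files[path]
```

For example, ToySentry().syscall("reboot") returns -errno.ENOSYS because this toy kernel never implemented it, which is also why an application that depends on an unsupported call cannot run in such a sandbox.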
28. gVisor: KVM (experimental)
[Figure: gVisor on the KVM platform. Labels: Container / Application; Sentry, which acts as a kernel; KVM; OCI Platform; Shim; Sentry, which acts as a VM; runsc; Intel VT; AMD-V; VM Entry; VM Exit.]
29. Boot time
Kata Container: 800ms
gVisor: 150ms
Docker runc: 140ms
[Figure: Kata Container booting process. Labels: docker run; Create; Start; VM boot; Kernel; Agent; Start Container; Prepare Image; Prepare Volumes; Hot plug.]
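To put the boot times quoted on this slide in proportion, the short calculation below expresses each one as a multiple of the runc figure. The numbers are the ones from the slide; the function name is just for this example.

```python
# Boot times quoted on the slide, in milliseconds.
BOOT_MS = {"Kata Container": 800, "gVisor": 150, "Docker runc": 140}


def overhead_vs(baseline, boot_ms=BOOT_MS):
    """Return each runtime's boot time as a multiple of the baseline's."""
    base = boot_ms[baseline]
    return {name: round(ms / base, 2) for name, ms in boot_ms.items()}
```

With these figures, overhead_vs("Docker runc") gives roughly 5.71 for Kata Container and 1.07 for gVisor, so gVisor starts almost as fast as runc while a Kata Container VM takes several times longer.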
30. Memory footprint
gVisor consumes only about as much memory as its runtime size.
However, memory footprint is a big issue for virtual machines.
Kata Container reduces it with the following approaches:
● Minimal rootfs
● Minimal kernel
● VM Template
● DAX/nvdimm
● Kernel Samepage Merging (KSM)
32. Kernel Samepage Merging
[Figure: KSM state diagram. States: Initial state, Aggressive, Standard, Slow. Transitions: Trigger, New trigger, No trigger (30s), No trigger (2min), No trigger (30s).]
Kata Container uses KSM to merge identical kernel memory pages across VMs.
KSM is triggered whenever a Kata Container container is created, so the kernels of the different VMs share the same memory pages.
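The effect of same-page merging can be shown with a toy model that keeps one copy per distinct page content. This is only a sketch of the idea: the real KSM works inside the Linux kernel on memory the hypervisor has marked as mergeable and uses copy-on-write when a shared page is modified, none of which is modelled here. All names are invented for the example.

```python
PAGE_SIZE = 4096


def merge_same_pages(vms):
    """Map every page of every VM to one shared copy per distinct content.

    `vms` maps a VM name to its list of pages (bytes objects of PAGE_SIZE).
    Returns (shared_store, pages_before, pages_after).
    """
    shared_store = {}   # page content -> list of (vm, page index) that use it
    pages_before = 0
    for vm, pages in vms.items():
        for index, page in enumerate(pages):
            pages_before += 1
            shared_store.setdefault(page, []).append((vm, index))
    return shared_store, pages_before, len(shared_store)
```

If two VMs each hold the same three kernel pages plus one private application page, the model goes from eight pages to five: the three kernel pages are stored once and shared, which is the saving the slide describes.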
33. Networking: Kata Container
[Figure: Kata Container networking. Labels: VM Network Namespace; Container Network Namespace; Bridge / MacVTap; VM; Tap; Pod; Container; Container; Veth.]
36. Current status
Kata Container has released its first version, which supports OCI platforms such as Docker and Kubernetes and works on ARM and x86 architectures.
gVisor is still in early development and does not yet support some system calls, which makes it unstable.
Even so, some applications already run on it, e.g. httpd, golang, MongoDB, while many others do not, e.g. nginx, elasticsearch.
37. How to use
Kata Container has deb/rpm packages for the x86_64 platform.
gVisor has nightly builds. https://storage.googleapis.com/gvisor/releases/nightly/latest/runsc
Enable nested virtualization, then check that the host can run Kata Container:
$ kata-runtime kata-check
Docker version > 17.0
Kernel version > 3.17
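The two version constraints above can be checked with a few lines of code. This is a sketch with made-up helper names; it assumes you already have the Docker version string and the kernel release string (for instance from platform.release()) and only compares the major and minor numbers.

```python
import re


def version_tuple(text):
    """Extract the leading numeric components of a version string.

    "4.15.0-45-generic" -> (4, 15, 0); "18.06.1-ce" -> (18, 6, 1)
    """
    match = re.match(r"\d+(?:\.\d+)*", text.strip())
    if match is None:
        raise ValueError(f"no version number in {text!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def meets_requirements(docker_version, kernel_release):
    """Check the two version constraints listed above (major.minor only)."""
    return {
        "docker > 17.0": version_tuple(docker_version)[:2] > (17, 0),
        "kernel > 3.17": version_tuple(kernel_release)[:2] > (3, 17),
    }
```

For example, meets_requirements("18.06.1-ce", "4.15.0-45-generic") reports both constraints as satisfied, while a 3.10 kernel fails the kernel check.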
Why Go?
gVisor was written in Go in order to avoid security pitfalls that can plague kernels. With Go, there are strong types, built-in bounds checks, no uninitialized variables, no use-after-free, no stack overflow, and a built-in race detector.
● Direct Device Assignment
● SRIOV
● NVDIMM
● Multi-OS
● KSM throttling
● CRI-O native support
● MacVTap, multi-queue net
● Multi Architecture
● Multi Hypervisor
● Full Hotplug
● K8s Multi Tenancy
● VM templating
● Frakti native support
● Traffic Controller net