5. Linux Namespaces
Category            Clone flag      Kernel version
Mount namespaces    CLONE_NEWNS     Linux 2.4.19
UTS namespaces      CLONE_NEWUTS    Linux 2.6.19
IPC namespaces      CLONE_NEWIPC    Linux 2.6.19
PID namespaces      CLONE_NEWPID    Linux 2.6.24
Network namespaces  CLONE_NEWNET    Linux 2.6.24, completed in 2.6.29
User namespaces     CLONE_NEWUSER   Linux 2.6.23, completed in 3.8
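Each namespace a process belongs to is exposed as a symbolic link under /proc/<pid>/ns. The following is a minimal sketch for inspecting them, assuming /proc is mounted; the helper name ns_identity is made up for this example and is not part of any library.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: copy the identity of one of the calling process's
 * namespaces (a string such as "uts:[4026531838]") into buf.
 * ns_name is an entry under /proc/self/ns: "mnt", "uts", "ipc", "pid",
 * "net" or "user". Returns 0 on success, -1 on failure. */
int ns_identity(const char *ns_name, char *buf, size_t size)
{
    char path[64];
    int n = snprintf(path, sizeof(path), "/proc/self/ns/%s", ns_name);
    if (n < 0 || (size_t)n >= sizeof(path) || size == 0)
        return -1;

    /* readlink() does not NUL-terminate, so leave room and do it here. */
    ssize_t len = readlink(path, buf, size - 1);
    if (len < 0)
        return -1;
    buf[len] = '\0';
    return 0;
}
```

Two processes share a namespace when the strings returned for them are identical, which makes this a quick check that a clone flag actually took effect.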
6. clone()
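#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)
static char container_stack[STACK_SIZE];
char* const container_args[] = {"/bin/bash", NULL};

int container_main(void* arg)
{
    // Open a shell
    execv(container_args[0], container_args);
    // Only reached if execv() fails
    return 1;
}

int main()
{
    // The stack grows downward, so pass the top of the buffer
    int container_pid = clone(container_main, container_stack + STACK_SIZE,
                              SIGCHLD, NULL);
    waitpid(container_pid, NULL, 0);
    return 0;
}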
7. UTS Namespace ( CLONE_NEWUTS )
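Isolates two system identifiers: nodename and domainname.
int container_main(void* arg)
{
    // Takes effect only inside a new UTS namespace, which requires
    // CLONE_NEWUTS | SIGCHLD as the clone() flags in main()
    sethostname("container", 9); // length excludes the terminating NUL
    // Open a shell
    execv(container_args[0], container_args);
    // Only reached if execv() fails
    return 1;
}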
9. PID Namespace ( CLONE_NEWPID )
Isolate the PID space.
Processes in different PID namespaces can have the same PID.
eric@eric-vm:~/linux_namespace$ sudo ./test_pid_ns
Parent (2536) - start a container!
Container (1) - inside the container!
Why does ps aux still show all processes?
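The source of test_pid_ns is not shown on the slide. The following is a minimal sketch of the same idea, not the original program: a child is created with CLONE_NEWPID and reports the PID it sees for itself. The names report_pid and pid_seen_in_new_namespace are made up for this example, and creating the namespace normally requires root (CAP_SYS_ADMIN). The child sees PID 1, yet ps reads /proc, and a /proc mounted by the host still describes the host's PID namespace.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)

/* Runs as the first process of the new PID namespace. */
static int report_pid(void *arg)
{
    (void)arg;
    /* Hand the PID we observe back to the parent via the exit status. */
    return (int)getpid();
}

/* Hypothetical helper: returns the PID the child observes for itself
 * (expected to be 1), or -1 if the namespace could not be created,
 * which is the usual outcome without root / CAP_SYS_ADMIN. */
int pid_seen_in_new_namespace(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    /* The stack grows downward, so pass the top of the buffer. */
    pid_t child = clone(report_pid, stack + STACK_SIZE,
                        CLONE_NEWPID | SIGCHLD, NULL);
    if (child == -1) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(child, &status, 0);
    free(stack);
    if (waited == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The parent still sees the child under an ordinary host PID (the value returned by clone()); only the child's own view starts again at 1.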
10. Mount Namespace ( CLONE_NEWNS )
Isolate the set of filesystem mount points seen by a group of processes.
Processes in different mount namespaces can have different views of the filesystem hierarchy.
mount("proc", "/proc", "proc", 0, NULL);
Inside the container:
/ # ps aux
PID USER TIME COMMAND
1 root 0:00 /bin/sh
3 root 0:00 ps aux
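The single mount() call above leaves out its surroundings. The following is a sketch of how it might be wrapped, under two assumptions that are not on the slide: the child is created with both CLONE_NEWNS and CLONE_NEWPID, and the host may use shared mount propagation (common on systemd hosts), so mounts are made private first. The names remount_proc and private_proc_demo are hypothetical, and the calls normally require root.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <sys/wait.h>

#define STACK_SIZE (1024 * 1024)

/* Child: runs inside the new mount and PID namespaces. */
static int remount_proc(void *arg)
{
    (void)arg;
    /* Assumption: with shared propagation, a mount made here could show
     * up on the host. Mark the whole tree private before mounting. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return 1;
    /* Mount a fresh procfs that reflects the new PID namespace. */
    if (mount("proc", "/proc", "proc", 0, NULL) == -1)
        return 1;
    /* A real container would exec its shell here. */
    return 0;
}

/* Hypothetical helper: returns 0 if the child mounted its own /proc,
 * 1 if a mount() call failed, -1 if the namespaces could not be created. */
int private_proc_demo(void)
{
    char *stack = malloc(STACK_SIZE);
    if (stack == NULL)
        return -1;

    pid_t child = clone(remount_proc, stack + STACK_SIZE,
                        CLONE_NEWNS | CLONE_NEWPID | SIGCHLD, NULL);
    if (child == -1) {
        free(stack);
        return -1;
    }

    int status = 0;
    pid_t waited = waitpid(child, &status, 0);
    free(stack);
    if (waited == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

Because the mount lives only in the child's mount namespace, it disappears when the last process in that namespace exits, and the host's /proc is left untouched.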
12. User namespace ( CLONE_NEWUSER )
Isolates the user and group ID spaces.
A process's UID and GID can be different inside and outside a user namespace.
// Write one mapping line ("inside outside length") to a map file
void set_map(char* file, int inside_id, int outside_id, int len) {
    FILE *fd = fopen(file, "w");
    if (fd == NULL) {
        perror(file); // e.g. the process is gone or permission is denied
        return;
    }
    fprintf(fd, "%d %d %d", inside_id, outside_id, len);
    fclose(fd);
}
void set_uid_map(pid_t pid, int inside_id, int outside_id, int len) {
    char file[256];
    snprintf(file, sizeof(file), "/proc/%d/uid_map", pid);
    set_map(file, inside_id, outside_id, len);
}
void set_gid_map(pid_t pid, int inside_id, int outside_id, int len) {
    char file[256];
    snprintf(file, sizeof(file), "/proc/%d/gid_map", pid);
    set_map(file, inside_id, outside_id, len);
}
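Each line of uid_map and gid_map has three fields: the first ID inside the namespace, the first ID outside it, and the length of the mapped range. The following is a sketch of a variant that builds the line in a buffer and reports failure to the caller, since the kernel can reject a map at write time. The names format_id_map and write_id_map are made up for this example.

```c
#include <stdio.h>

/* Hypothetical helper: format one map line, "<inside> <outside> <length>\n".
 * Returns the number of characters written, or -1 if buf is too small. */
int format_id_map(char *buf, size_t size, unsigned inside_id,
                  unsigned outside_id, unsigned len)
{
    int n = snprintf(buf, size, "%u %u %u\n", inside_id, outside_id, len);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}

/* Hypothetical helper: write a single mapping to a map file such as
 * /proc/<pid>/uid_map. Returns 0 on success, -1 on any failure. */
int write_id_map(const char *path, unsigned inside_id,
                 unsigned outside_id, unsigned len)
{
    char line[64];
    if (format_id_map(line, sizeof(line), inside_id, outside_id, len) < 0)
        return -1;

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = (fputs(line, f) != EOF);
    /* The data reaches the kernel when the stream is flushed, so a
     * rejected map can surface as an fclose() error. */
    if (fclose(f) == EOF)
        ok = 0;
    return ok ? 0 : -1;
}
```

For example, the line "0 1000 1" maps UID 0 inside the namespace to UID 1000 outside it. When the writer is unprivileged, gid_map can only be written after "deny" has been written to /proc/<pid>/setgroups (Linux 3.19 and later).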
13. Network namespace ( CLONE_NEWNET )
Preparation
brctl addbr br0
ifconfig br0 192.168.10.1/24 up
Host
ip link add veth0 type veth peer name veth1
ip link set veth1 netns $PID
brctl addif br0 veth0
ip link set veth0 up
Container
ip link set dev veth1 name eth0
ip link set eth0 up
ip link set lo up
ip addr add 192.168.10.2/24 dev eth0
ip route add default via 192.168.10.1
16. Linux Control Groups
blkio (Disk I/O)
cpu (CPU quota)
cpuset (CPU cores)
devices
memory
net_cls (Network packet class ID)
net_prio (Network packet priority)
hugetlb (HugeTLB)
cpuacct
freezer
17. Glance
root@eric-vm:/sys/fs/cgroup# ls
blkio cpuacct cpuset freezer memory net_cls,net_prio perf_event systemd
cpu cpu,cpuacct devices hugetlb net_cls net_prio pids
root@eric-vm:/sys/fs/cgroup/cpu$ sudo mkdir test
root@eric-vm:/sys/fs/cgroup/cpu/test$ ls
cgroup.clone_children cpuacct.stat cpuacct.usage_percpu cpu.cfs_quota_us cpu.stat
cgroup.procs cpuacct.usage cpu.cfs_period_us cpu.shares notify_o
18. We have a CPU killer
int main()
{
int i = 0;
for (;;) i++;
return 0;
}
top
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3985 eric 20 0 4224 648 576 R 99.9 0.1 0:15.53 deadloop
19. Usage
Create a group. (Yes, just mkdir )
sudo mkdir /sys/fs/cgroup/cpu/test
Set a limit. With the default cpu.cfs_period_us of 100000, a quota of 20000 means 20% of one CPU.
echo 20000 > /sys/fs/cgroup/cpu,cpuacct/test/cpu.cfs_quota_us
Add a process to our group.
echo 3985 >> /sys/fs/cgroup/cpu,cpuacct/test/tasks
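The relationship between the quota and the CPU share is simple arithmetic: quota = period * percent / 100. The following is a sketch of doing the same from C, assuming the cgroup v1 layout shown on these slides; the helper names cfs_quota_for_percent and write_cgroup_value are made up for this example.

```c
#include <stdio.h>

/* Hypothetical helper: quota in microseconds that grants `percent` of one
 * CPU per scheduling period. Values above 100 allow more than one CPU. */
long cfs_quota_for_percent(long period_us, int percent)
{
    return period_us * percent / 100;
}

/* Hypothetical helper: write a number to a cgroup control file, for
 * example <group>/cpu.cfs_quota_us or <group>/tasks.
 * Returns 0 on success, -1 on failure (usually a permissions problem). */
int write_cgroup_value(const char *path, long value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = (fprintf(f, "%ld\n", value) > 0);
    /* The kernel validates the value when the stream is flushed. */
    if (fclose(f) == EOF)
        ok = 0;
    return ok ? 0 : -1;
}
```

With the default period of 100000 microseconds, cfs_quota_for_percent(100000, 20) yields the 20000 used above. Writing that value to cpu.cfs_quota_us and the PID to tasks reproduces the two echo commands.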
21. "Container"
Linux kernel namespaces provide the isolation (hence “container”) in which we place one or more processes.
Linux kernel cgroups (“Control groups”) provide resource limiting and accounting (CPU, memory, I/O bandwidth, etc.).
22. Container Properties
A shared kernel across all containers on a single host.
Unique filesystem, a layered model using CoW (copy‑on‑write) union filesystems.
Linux namespaces are shareable (Kubernetes “pod”)
One process per container
23. Linux Capabilities
Add the capabilities a container needs, and drop the ones it does not.
$ docker run --rm -ti busybox sh
/ # hostname foo
hostname: sethostname: Operation not permitted
$ docker run --rm -ti --cap-add=SYS_ADMIN busybox sh
/ # hostname foo
<hostname changed>
$ docker run --rm -ti --cap-drop=NET_RAW busybox sh
/ # ping 8.8.8.8
ping: permission denied (are you root?)
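A process can check which capabilities it actually holds by reading the CapEff line of /proc/self/status, a hexadecimal bitmask with one bit per capability. The following is a minimal sketch; the helper names and the two *_BIT macros are made up for this example, with the bit numbers taken from the kernel's capability list (CAP_NET_RAW is 13, CAP_SYS_ADMIN is 21).

```c
#include <stdio.h>

/* Bit positions as defined in the kernel's capability list. */
#define CAP_NET_RAW_BIT   13
#define CAP_SYS_ADMIN_BIT 21

/* Hypothetical helper: returns 1 if capability number `cap` is set in
 * `mask`, 0 otherwise (including for out-of-range numbers). */
int cap_is_set(unsigned long long mask, int cap)
{
    if (cap < 0 || cap > 63)
        return 0;
    return (int)((mask >> cap) & 1ULL);
}

/* Hypothetical helper: read the effective capability mask of the calling
 * process from /proc/self/status. Returns 0 on success, -1 on failure. */
int read_effective_caps(unsigned long long *mask)
{
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;

    char line[256];
    int result = -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        /* The line looks like "CapEff:<tab>00000000a80425fb". */
        if (sscanf(line, "CapEff: %llx", mask) == 1) {
            result = 0;
            break;
        }
    }
    fclose(f);
    return result;
}
```

Run inside the containers above, cap_is_set(mask, CAP_SYS_ADMIN_BIT) would be expected to report 1 only with --cap-add=SYS_ADMIN, and cap_is_set(mask, CAP_NET_RAW_BIT) to report 0 with --cap-drop=NET_RAW.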
28. Host <‑> Container
Protecting the host from containers
THREAT: DoS host (use up CPU, memory, disk), fork bomb
MITIGATION: Cgroup controls, disk quotas (1.12), kernel pids limit (1.11 + Kernel 4.3)
THREAT: Access host/private information
MITIGATION: Namespace configuration; AppArmor/SELinux profiles, seccomp (1.10)
THREAT: Kernel modification/insert module
MITIGATION: Capabilities (already dropped); seccomp, LSMs; don’t run in --privileged mode
THREAT: Docker administrative access (API socket access)
MITIGATION: Don’t share the Docker UNIX socket without Authz plugin limitations; use TLS certificates for TCP endpoint configurations
29. Container <‑> Container
Malicious or Multi‑tenant
THREAT: DoS other containers (noisy neighbor using significant % of CPU, memory, disk)
MITIGATION: Cgroup controls, disk quotas (1.12), kernel pids limit (1.11 + Kernel 4.3)
THREAT: Access other container’s information (pids, files, etc.)
MITIGATION: Namespace configuration; AppArmor/SELinux profile for containers
THREAT: Docker API access (full control over other containers)
MITIGATION: Don’t share the Docker UNIX socket without Authz plugin limitations (1.10); use TLS certificates for TCP endpoint configurations
30. External ‑> Container
The big, bad Internet
THREAT: DDoS attacks
MITIGATION: Cgroup controls, disk quotas (1.12), kernel pids limit (1.11 + Kernel 4.3), proactive monitoring infrastructure/operational readiness
THREAT: Malicious (remote) access
MITIGATION: Appropriate application security model; no weak/default passwords; --read-only filesystem (limit blast radius)
THREAT: Unpatched exploits (underlying OS layers)
MITIGATION: Vulnerability scanning (IBM Bluemix, Docker Data Center, CoreOS Clair, Red Hat “SmartState” CloudForms (w/Black Duck))
31. Application Security
Significant container benefit: provided protections are in place (seccomp, LSMs, dropped caps, user namespaces), the exploited application has greatly reduced ability to inflict harm beyond container “walls”.
Proper handling of secrets through the dev/build/deploy process (no passwords in a Dockerfile, as an example).
Unnecessary services not exposed externally (shared namespaces; internal/management networks).
Secure coding/design principles.