4. A simple Docker example
root@boot2docker:/home/docker# ip ad show eth1
4: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 08:00:27:91:99:33 brd ff:ff:ff:ff:ff:ff
inet 192.168.59.103/24 brd 192.168.59.255 scope global eth1
valid_lft forever preferred_lft forever
inet6 fe80::a00:27ff:fe91:9933/64 scope link
valid_lft forever preferred_lft forever
root@boot2docker:/home/docker#
root@boot2docker:/home/docker# docker run -d -P redis
6f858e1563a56574031a61e65fb8ab356752d03440b24d65739eed64f2ef84df
root@boot2docker:/home/docker#
root@boot2docker:/home/docker# docker ps
CONTAINER ID   IMAGE          COMMAND                CREATED         STATUS         PORTS                     NAMES
6f858e1563a5   redis:latest   "/entrypoint.sh redi   3 seconds ago   Up 2 seconds   0.0.0.0:49154->6379/tcp   kickass_colden
root@boot2docker:/home/docker#
root@boot2docker:/home/docker# docker run -it --entrypoint /bin/bash redis
root@63d30ea140b2:/data# redis-cli -h 192.168.59.103 -p 49154
192.168.59.103:49154> set k 123
OK
192.168.59.103:49154> get k
"123"
5. What happened here
• We created a container with its own filesystem,
network stack, process space, and resource limits.
• We started a redis-server in that container.
• We created another container and ran redis-cli in it
to connect to the first container's redis-server through
the host IP and the proxied port.
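The host port that redis-cli used comes from the PORTS column of docker ps. A minimal Go sketch of reading such an entry follows; PortMapping and parsePortMapping are hypothetical names for illustration, not part of Docker's API.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// PortMapping is a hypothetical type describing one PORTS entry.
type PortMapping struct {
	HostIP        string
	HostPort      int
	ContainerPort int
	Proto         string
}

// parsePortMapping parses an entry such as "0.0.0.0:49154->6379/tcp".
func parsePortMapping(s string) (PortMapping, error) {
	var m PortMapping
	arrow := strings.Index(s, "->")
	if arrow < 0 {
		return m, fmt.Errorf("missing \"->\" in %q", s)
	}
	host, container := s[:arrow], s[arrow+2:]

	colon := strings.LastIndex(host, ":")
	if colon < 0 {
		return m, fmt.Errorf("missing host port in %q", s)
	}
	hostPort, err := strconv.Atoi(host[colon+1:])
	if err != nil {
		return m, fmt.Errorf("bad host port in %q: %v", s, err)
	}

	slash := strings.Index(container, "/")
	if slash < 0 {
		return m, fmt.Errorf("missing protocol in %q", s)
	}
	containerPort, err := strconv.Atoi(container[:slash])
	if err != nil {
		return m, fmt.Errorf("bad container port in %q: %v", s, err)
	}

	m = PortMapping{
		HostIP:        host[:colon],
		HostPort:      hostPort,
		ContainerPort: containerPort,
		Proto:         container[slash+1:],
	}
	return m, nil
}

func main() {
	m, err := parsePortMapping("0.0.0.0:49154->6379/tcp")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("host port %d forwards to container port %d/%s\n",
		m.HostPort, m.ContainerPort, m.Proto)
}
```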
6. How this happened
• What is a redis image? How is it made?
• What is a container? How does it get its own
filesystem, network stack, process space, and resource
limits?
• How does a container start?
8. What is a redis image
FROM dockerfile/ubuntu
# Install Redis.
RUN \
  cd /tmp && \
  wget http://download.redis.io/redis-stable.tar.gz && \
  tar xvzf redis-stable.tar.gz && \
  cd redis-stable && \
  make && \
  make install && \
  cp -f src/redis-sentinel /usr/local/bin && \
  mkdir -p /etc/redis && \
  cp -f *.conf /etc/redis && \
  rm -rf /tmp/redis-stable* && \
  sed -i 's/^\(bind .*\)$/# \1/' /etc/redis/redis.conf && \
  sed -i 's/^\(daemonize .*\)$/# \1/' /etc/redis/redis.conf && \
  sed -i 's/^\(dir .*\)$/# \1\ndir \/data/' /etc/redis/redis.conf && \
  sed -i 's/^\(logfile .*\)$/# \1/' /etc/redis/redis.conf
# Define mountable directories.
VOLUME ["/data"]
# Define working directory.
WORKDIR /data
# Define default command.
CMD ["redis-server", "/etc/redis/redis.conf"]
# Expose ports.
EXPOSE 6379
9. Image
• A read-only layer is called an image. An image
never changes.
• Each image may depend on one other image,
which forms the layer beneath it. We say that the
lower image is the parent of the upper image.
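The parent relationship can be pictured as a singly linked list of layers. This is only a sketch: the Image type and the chain function are hypothetical, and the IDs are made-up labels.

```go
package main

import "fmt"

// Image is a hypothetical, simplified model of one read-only layer.
type Image struct {
	ID     string
	Parent *Image // nil for a base image
}

// chain returns the layer IDs from the given image down to its base image.
func chain(img *Image) []string {
	var ids []string
	for cur := img; cur != nil; cur = cur.Parent {
		ids = append(ids, cur.ID)
	}
	return ids
}

func main() {
	base := &Image{ID: "base"}
	middle := &Image{ID: "middle", Parent: base}
	top := &Image{ID: "top", Parent: middle}

	// Walks top -> middle -> base.
	fmt.Println(chain(top))
}
```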
11. How to make an image
• Use a Dockerfile
• Use docker commit manually (not recommended)
14. What is a container?
• A Linux container is a copy of a Linux environment
located in a file system. It resembles a jail environment,
but it is built on Linux namespaces: it runs its own init
process and has a separate process space, a separate
filesystem, and a separate network stack, all
virtualized by the root OS running on the hardware.
15. Concept of image and container
• A Docker image is a layer in the file system
• A container is two layers
- Layer one is the init layer, based on the image
- Layer two is the actual container content
[Figure: layer stack. Image layers (RO): 511136ea3c5a, df7546f9f060, ea13149945cb, 4986bf8c1536. Container layers: 142b6a3eae40-init and 142b6a3eae40 (RW). The init layer provides /dev, /dev/console, /dev/shm, /etc, /etc/hostname, /etc/hosts, and /etc/mtab -> /proc/mounts.]
17. Linux kernel Namespace
• UTS (hostname), Mount (mount points), IPC (System V
IPC), User (UIDs), PID (processes), Net (network stack)
• The kernel namespace API: clone, setns, unshare
• The /proc/[pid]/ns/ directory
$ ls -l /proc/$$/ns
lrwxrwxrwx. 1 mtk mtk 0 Jan 14 01:20 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 mtk mtk 0 Jan 14 01:20 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 mtk mtk 0 Jan 14 01:20 net -> net:[4026531956]
lrwxrwxrwx. 1 mtk mtk 0 Jan 14 01:20 pid -> pid:[4026531836]
lrwxrwxrwx. 1 mtk mtk 0 Jan 14 01:20 user -> user:[4026531837]
lrwxrwxrwx. 1 mtk mtk 0 Jan 14 01:20 uts -> uts:[4026531838]
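The same listing can be read from a program. The sketch below, which only works on Linux, reads the links under /proc/self/ns and splits each target into the namespace type and its inode number; parseNsLink is a hypothetical helper name. Two processes whose links show the same inode number are in the same namespace.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseNsLink splits a link target such as "net:[4026531956]" into the
// namespace type ("net") and the inode number (4026531956).
func parseNsLink(target string) (string, uint64, error) {
	open := strings.Index(target, ":[")
	if open < 0 || !strings.HasSuffix(target, "]") {
		return "", 0, fmt.Errorf("unexpected namespace link %q", target)
	}
	inode, err := strconv.ParseUint(target[open+2:len(target)-1], 10, 64)
	if err != nil {
		return "", 0, err
	}
	return target[:open], inode, nil
}

func main() {
	entries, err := os.ReadDir("/proc/self/ns")
	if err != nil {
		fmt.Println("cannot list namespaces (Linux only):", err)
		return
	}
	for _, e := range entries {
		target, err := os.Readlink("/proc/self/ns/" + e.Name())
		if err != nil {
			continue // skip entries that cannot be read
		}
		kind, inode, err := parseNsLink(target)
		if err != nil {
			continue // skip targets in an unexpected format
		}
		fmt.Printf("%-8s inode %d\n", kind, inode)
	}
}
```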
18. setns
• Reassociates the calling thread with a namespace
• int setns(int fd, int nstype);
• nstype: CLONE_NEWIPC/CLONE_NEWNET/CLONE_NEWNS/
CLONE_NEWPID/CLONE_NEWUSER/CLONE_NEWUTS
• Each process has a /proc/[pid]/ns/ subdirectory containing
one entry for each namespace that supports being
manipulated by setns(2)
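As a rough illustration of how the entries under /proc/[pid]/ns/ pair up with the nstype flags, here is a small Go lookup table. The constant values are those of the CLONE_NEW* flags in the Linux headers. The names nstypes and nstypeFor are hypothetical, and the sketch does not call setns itself.

```go
package main

import "fmt"

// Values of the CLONE_NEW* flags as defined in the Linux headers.
const (
	cloneNewNS   = 0x00020000 // mount namespace
	cloneNewUTS  = 0x04000000
	cloneNewIPC  = 0x08000000
	cloneNewUser = 0x10000000
	cloneNewPID  = 0x20000000
	cloneNewNet  = 0x40000000
)

// nstypes maps a /proc/[pid]/ns entry name to the matching nstype flag.
var nstypes = map[string]int{
	"mnt":  cloneNewNS,
	"uts":  cloneNewUTS,
	"ipc":  cloneNewIPC,
	"user": cloneNewUser,
	"pid":  cloneNewPID,
	"net":  cloneNewNet,
}

// nstypeFor returns the nstype value that matches a namespace entry name.
func nstypeFor(entry string) (int, error) {
	flag, ok := nstypes[entry]
	if !ok {
		return 0, fmt.Errorf("no nstype known for %q", entry)
	}
	return flag, nil
}

func main() {
	for _, entry := range []string{"ipc", "mnt", "net", "pid", "user", "uts"} {
		flag, err := nstypeFor(entry)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("/proc/[pid]/ns/%-4s -> nstype %#x\n", entry, flag)
	}
}
```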
21. Storage Driver
• Docker currently implements the vfs, aufs, device mapper,
btrfs, overlayfs, and zfs storage drivers.
• A storage driver should have the following features
- Copy on write
- Shared memory cache
• Performance: http://developerblog.redhat.com/2014/09/30/overview-storage-scalability-docker/
22. Aufs
• Works at the file level
• Combines multiple branches in a specific order
• Each branch is just a normal directory
• Opening a file
- Look it up in each branch, starting from the top, and open the first match
- On a write attempt, copy the file to the read-write (top) branch, then open the
copy
- That "copy-up" operation can take a while if the file is big!
• Deleting a file
- A whiteout file is created in the top branch to hide the file
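The lookup, copy-up, and whiteout rules above can be modeled in memory. This is a toy sketch, not aufs itself: branch, union, and newUnion are hypothetical names, a branch is a map instead of a directory, a write replaces the whole file, and whiteouts are kept in a set rather than as files in the top branch.

```go
package main

import "fmt"

// branch is a hypothetical stand-in for one directory: path -> contents.
type branch map[string]string

// union models a stack of branches; index 0 is the top, read-write branch.
type union struct {
	branches []branch
	whiteout map[string]bool // paths hidden by a delete
}

// newUnion stacks an empty read-write branch on top of the lower branches.
func newUnion(lower ...branch) *union {
	return &union{
		branches: append([]branch{branch{}}, lower...),
		whiteout: map[string]bool{},
	}
}

// read returns the first match, searching from the top branch down.
func (u *union) read(path string) (string, bool) {
	if u.whiteout[path] {
		return "", false
	}
	for _, b := range u.branches {
		if data, ok := b[path]; ok {
			return data, true
		}
	}
	return "", false
}

// write stores the data in the top branch only; lower branches stay unchanged.
func (u *union) write(path, data string) {
	delete(u.whiteout, path)
	u.branches[0][path] = data
}

// remove hides the path with a whiteout instead of touching lower branches.
func (u *union) remove(path string) {
	delete(u.branches[0], path)
	u.whiteout[path] = true
}

func main() {
	image := branch{"/etc/redis/redis.conf": "bind 127.0.0.1"}
	u := newUnion(image)

	u.write("/etc/redis/redis.conf", "# bind 127.0.0.1")
	data, _ := u.read("/etc/redis/redis.conf")
	fmt.Println("container sees:", data)
	fmt.Println("image still has:", image["/etc/redis/redis.conf"])

	u.remove("/etc/redis/redis.conf")
	_, ok := u.read("/etc/redis/redis.conf")
	fmt.Println("visible after delete:", ok)
}
```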
24. Device Mapper
• Works at the block level
• Each container and each image
gets its own block device
• At any given time, it is possible to
take a snapshot of a container or
an image
• The data and metadata are stored in sparse files
• Putting the data on a real disk is
recommended
[Figure: loop devices backing the data and metadata files form the pool /dev/mapper/docker-{major}:{minor}-{inode}-pool, which holds volume 1 and volume 2.]
25. How to make its own filesystem
1. mount every parent layer and the rw layer
diff/$cid-init on mnt/$cid-init
2. create extra files, directories, and links in mnt/$cid-init
3. mount every parent layer, the rw layer diff/$cid,
and the ro layer diff/$cid-init on mnt/$cid
4. setns to join the existing mount namespace
5. mount proc/sysfs/tmpfs/cgroup…
6. create devices, set up dev symlinks, init
filesystem
7. chdir diff/$cid && chroot .
Note: steps 4 to 7 are done by the init process;
the others are done by the Docker daemon.
More in rootfs_linux.go.
[Figure: /var/lib/docker/aufs/diff holds the layers 511136ea3c5a, df7546f9f060, ea13149945cb, 4986bf8c1536, 142b6a3eae40, and 142b6a3eae40-init; /var/lib/docker/aufs/mnt holds the mount point 142b6a3eae40.]
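The layered mount of step 3 can be sketched as an aufs branch option. The sketch only builds and prints the option string and mounts nothing. The function aufsBranchOption is hypothetical, and the "br:" syntax with "=rw" and "=ro+wh" suffixes is given here as an assumption about how the branches are passed to aufs.

```go
package main

import (
	"fmt"
	"strings"
)

// aufsBranchOption builds a "br:" mount option: the read-write directory
// comes first (top of the stack), followed by the read-only layers in
// top-to-bottom order.
func aufsBranchOption(rw string, ro []string) string {
	var b strings.Builder
	b.WriteString("br:" + rw + "=rw")
	for _, layer := range ro {
		b.WriteString(":" + layer + "=ro+wh")
	}
	return b.String()
}

func main() {
	const diff = "/var/lib/docker/aufs/diff/"
	// Layer IDs taken from the figure; $cid is 142b6a3eae40.
	opt := aufsBranchOption(diff+"142b6a3eae40", []string{
		diff + "142b6a3eae40-init",
		diff + "4986bf8c1536",
		diff + "ea13149945cb",
		diff + "df7546f9f060",
		diff + "511136ea3c5a",
	})
	fmt.Println(opt)
}
```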
28. Bridge mode
1. create the docker0 bridge, add eth1 to docker0,
set up the docker0 iptables rules
2. create a veth pair, attach one end to docker0,
put the other end into the container’s network
namespace
3. allocate a free IP
4. set up iptables rules and the userland proxy
5. setns to join the existing network namespace
6. rename the veth device to eth1 in the
container
7. set the MAC address, IP, and MTU of the veth device
8. set up the default gateway and route
Note: steps 5 to 8 are done by the init process;
the others are done by the Docker daemon.
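[Figure: bridge mode. On the host, the physical device eth1 (10.27.149.90) and the bridge docker0 (172.17.42.1). container0 has eth1 (172.17.0.4), paired with veth device vethdb6e696; container1 has eth1 (172.17.0.5), paired with veth device veth8df64b7.]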
29. Consistent mac address
• Docker generates a MAC address for each veth device
that is consistent for a given IP address.
• This avoids ARP cache issues.
import "net"

func generateMacAddr(ip net.IP) net.HardwareAddr {
	hw := make(net.HardwareAddr, 6)

	// The first byte of the MAC address has to comply with these rules:
	// 1. Unicast: Set the least-significant bit to 0.
	// 2. Address is locally administered: Set the
	//    second-least-significant bit (U/L) to 1.
	// 3. As "small" as possible: The veth address has to be
	//    "smaller" than the bridge address.
	hw[0] = 0x02

	// The first 24 bits of the MAC represent the Organizationally
	// Unique Identifier (OUI). Since this address is locally
	// administered, we can do whatever we want as long as it
	// doesn't conflict with other addresses.
	hw[1] = 0x42

	// Insert the IP address into the last 32 bits of the MAC address.
	// This is a simple way to guarantee the address will be
	// consistent and unique.
	copy(hw[2:], ip.To4())

	return hw
}
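To see the effect, the function can be run against the container addresses from the bridge-mode figure. The sketch repeats generateMacAddr without its comments so that it is self-contained; macForIP is a hypothetical helper added for the demonstration.

```go
package main

import (
	"fmt"
	"net"
)

// generateMacAddr repeats the function shown above, without its comments.
func generateMacAddr(ip net.IP) net.HardwareAddr {
	hw := make(net.HardwareAddr, 6)
	hw[0] = 0x02
	hw[1] = 0x42
	copy(hw[2:], ip.To4())
	return hw
}

// macForIP parses a dotted IPv4 string and returns the derived MAC as text.
func macForIP(s string) (string, error) {
	ip := net.ParseIP(s).To4()
	if ip == nil {
		return "", fmt.Errorf("not an IPv4 address: %q", s)
	}
	return generateMacAddr(ip).String(), nil
}

func main() {
	for _, s := range []string{"172.17.0.4", "172.17.0.5"} {
		mac, err := macForIP(s)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println(s, "->", mac)
	}
}
```

For 172.17.0.4 this prints 02:42:ac:11:00:04: the last four bytes are the IP address in hexadecimal, so the same IP always yields the same MAC.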
30. Port Mapping
• The Docker daemon uses a map to record port and IP mappings
• Connections from the local subnet
- userland proxy: docker-proxy -proto tcp -host-ip 0.0.0.0 -host-port 49153 -container-ip 172.17.0.2 -container-port 6379
- hairpin NAT (newer Docker versions)
- enable /sys/class/net/$vethname/brport/hairpin_mode
• Connections to and from other hosts
- iptables -I POSTROUTING -t nat -s 172.17.42.1/16 ! -o docker0 -j MASQUERADE
- iptables -t nat -A DOCKER -p tcp -d 0/0 --dport 49153 ! -i docker0 -j DNAT --to-destination 172.17.0.2:6379
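A rough sketch of the bookkeeping behind this slide: a map keyed by protocol and host port that rejects a port already in use, plus a helper that formats the DNAT rule shown above. The types mapping and portMapper and the function dnatRule are hypothetical and do not mirror Docker's own code; the sketch prints the rule and never runs iptables.

```go
package main

import (
	"fmt"
	"strings"
)

// mapping is a hypothetical record of one published port.
type mapping struct {
	proto         string
	hostPort      int
	containerIP   string
	containerPort int
}

// portMapper records mappings keyed by "proto/hostPort", e.g. "tcp/49153".
type portMapper struct {
	byHostPort map[string]mapping
}

func newPortMapper() *portMapper {
	return &portMapper{byHostPort: map[string]mapping{}}
}

// add records a mapping and fails if the host port is already taken.
func (pm *portMapper) add(m mapping) error {
	key := fmt.Sprintf("%s/%d", m.proto, m.hostPort)
	if _, taken := pm.byHostPort[key]; taken {
		return fmt.Errorf("host port %s is already mapped", key)
	}
	pm.byHostPort[key] = m
	return nil
}

// dnatRule formats the DNAT rule from the slide for a given mapping.
func dnatRule(m mapping) string {
	return strings.Join([]string{
		"iptables", "-t", "nat", "-A", "DOCKER",
		"-p", m.proto, "-d", "0/0",
		"--dport", fmt.Sprint(m.hostPort),
		"!", "-i", "docker0",
		"-j", "DNAT",
		"--to-destination", fmt.Sprintf("%s:%d", m.containerIP, m.containerPort),
	}, " ")
}

func main() {
	pm := newPortMapper()
	m := mapping{proto: "tcp", hostPort: 49153, containerIP: "172.17.0.2", containerPort: 6379}
	if err := pm.add(m); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println(dnatRule(m))
	fmt.Println(pm.add(m)) // the second attempt is rejected
}
```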
32. Cgroups support by docker
• cgroup subsystems: cpuset, cpu, cpuacct,
memory, devices, freezer, net_cls, blkio
• docker run options: --memory, --cpuset, --cpu-shares,
--device
• docker pause/unpause
• After starting the background “docker native” process,
the Docker daemon writes the limits to cgroup files such as
/cgroup/memory/docker/$cid/memory.limit_in_bytes and adds
the process's PID to the same cgroup
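In essence, applying a limit means writing small text files. The sketch below writes a memory limit and a PID under a directory laid out like the path above, but rooted in a temporary directory so it is safe to run anywhere. applyMemoryLimit and demo are hypothetical names, and using cgroup.procs as the file that receives the PID is an assumption about the cgroup v1 layout.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// applyMemoryLimit writes the limit and the PID under
// root/memory/docker/cid, mimicking the cgroup directory layout.
func applyMemoryLimit(root, cid string, limitBytes int64, pid int) error {
	dir := filepath.Join(root, "memory", "docker", cid)
	if err := os.MkdirAll(dir, 0755); err != nil {
		return err
	}
	limit := []byte(strconv.FormatInt(limitBytes, 10))
	if err := os.WriteFile(filepath.Join(dir, "memory.limit_in_bytes"), limit, 0644); err != nil {
		return err
	}
	// Assumption: the PID goes into cgroup.procs.
	return os.WriteFile(filepath.Join(dir, "cgroup.procs"), []byte(strconv.Itoa(pid)), 0644)
}

// demo applies a 256 MiB limit in a temporary directory and reads it back.
func demo() (string, error) {
	root, err := os.MkdirTemp("", "cgroup-sketch")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(root)

	const cid = "6f858e1563a5"
	if err := applyMemoryLimit(root, cid, 256<<20, os.Getpid()); err != nil {
		return "", err
	}
	data, err := os.ReadFile(filepath.Join(root, "memory", "docker", cid, "memory.limit_in_bytes"))
	return string(data), err
}

func main() {
	limit, err := demo()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("memory.limit_in_bytes =", limit)
}
```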
34. How container starts
1. The daemon creates a socketpair and starts a background
child process, “docker native”
2. The daemon creates network devices and applies cgroup
settings
3. The daemon sends the configuration to “docker native”
4. The daemon receives any error message and waits for
“docker native” to exit
5. “docker native” receives the config and env from the
socketpair
6. “docker native” joins the existing namespaces using the fds in
/proc/$pid/ns/*
7. “docker native” initializes the file system…
8. “docker native” execs the entrypoint
“docker native” is the init process in the container.
[Figure: sequence diagram. client -> daemon: create, start. daemon -> docker native: start, config. docker native -> daemon: errors. docker native -> entrypoint: exec.]
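The configuration hand-off in steps 3 and 5 can be imitated in a single process. This sketch swaps in an os.Pipe for the socketpair and a goroutine for the daemon side, and the config type with its JSON encoding is a hypothetical stand-in for whatever the daemon really sends. It prints the entrypoint instead of exec'ing it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// config is a hypothetical subset of what the daemon sends to the init process.
type config struct {
	Entrypoint []string `json:"entrypoint"`
	Env        []string `json:"env"`
}

// handOff sends cfg through a pipe and returns what the receiving side decoded.
func handOff(cfg config) (config, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return config{}, err
	}
	defer r.Close()

	// "Daemon" side: write the config, then close the write end.
	errc := make(chan error, 1)
	go func() {
		defer w.Close()
		errc <- json.NewEncoder(w).Encode(cfg)
	}()

	// "docker native" side: read the config before doing anything else.
	var got config
	if err := json.NewDecoder(r).Decode(&got); err != nil {
		return config{}, err
	}
	if err := <-errc; err != nil {
		return config{}, err
	}
	return got, nil
}

func main() {
	got, err := handOff(config{
		Entrypoint: []string{"redis-server", "/etc/redis/redis.conf"},
		Env:        []string{"HOME=/"},
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("would exec:", got.Entrypoint, "with env", got.Env)
}
```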