How to Backup Thousands of Containers at Scale
1. Backing up thousands of
containers
OR
How to fail miserably at
copying data
OpenFest 2015
3. Talk about backup systems...Why?
➢First backup system built in 1999
➢Since then, 10 different systems
➢But why build your own?
➢ simple: SCALE
➢I'm very proud of the design of the last two
systems my team and I built
5. Networking....
➢typical transfer speed over 1Gbit/s ~ 24MB/s
➢typical transfer speed over 10Gbit/s ~ 110MB/s
➢Restoring an 80% full 2TB drive
➢ ~21h over 1Gbit/s with 24MB/s
➢ ~4h and a half over 10Gbit/s with 110MB/s
➢Overlapping backups on the same network
equipment
➢Overlapping backups and restores
➢Switch uplinks
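The restore estimates above can be checked with a quick calculation. This is a minimal sketch that assumes decimal units (1 TB = 1,000,000 MB) and a perfectly steady transfer rate; the function name `restore_hours` is made up for illustration. It gives an ideal lower bound, so the slide's figures (~21h and ~4.5h) come out somewhat higher, presumably because real transfers carry overhead this ignores.

```python
def restore_hours(drive_tb: float, fill_ratio: float, speed_mb_s: float) -> float:
    """Ideal time in hours to copy the used part of a drive at a steady rate."""
    used_mb = drive_tb * 1_000_000 * fill_ratio  # assumption: decimal TB and MB
    return used_mb / speed_mb_s / 3600


# Speeds are the "typical" figures quoted on the slide.
for label, speed in (("1Gbit/s", 24), ("10Gbit/s", 110)):
    print(f"{label}: {restore_hours(2, 0.8, speed):.1f} h")
# prints:
# 1Gbit/s: 18.5 h
# 10Gbit/s: 4.0 h
```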
6. Architecture of container backups
➢Designed for 100,000 containers
➢back up each container at least once a day
➢30 incremental copies
➢Now I'll explain HOW :)
7. Host machine architecture
➢We use LVM
➢RAID array which exposes a single drive
➢setup a single Physical Volume on that drive
➢setup a single Volume Group using the above
PV
➢Thin provisioned VG
➢Each container with its own Logical Volume
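The host layout above could be built with a handful of LVM commands. The sketch below only assembles and prints the command lines; it does not run them. The device path, VG name, pool name and sizes are hypothetical placeholders, and the exact options the team used are not given in the slides.

```python
import shlex


def host_setup_commands(drive, vg, pool, pool_size):
    """Commands for one PV, one VG and one thin pool on a single drive."""
    return [
        ["pvcreate", drive],
        ["vgcreate", vg, drive],
        ["lvcreate", "--type", "thin-pool", "-L", pool_size, "-n", pool, vg],
    ]


def container_volume_command(vg, pool, container, virtual_size):
    """Command for a thin-provisioned LV dedicated to one container."""
    return ["lvcreate", "--type", "thin", "-V", virtual_size,
            "-n", container, "--thinpool", pool, vg]


# Hypothetical names and sizes, printed for review rather than executed.
commands = host_setup_commands("/dev/sdb", "vg0", "pool0", "1T")
commands.append(container_volume_command("vg0", "pool0", "container1", "20G"))
for command in commands:
    print(shlex.join(command))
```

Printing instead of executing keeps the sketch safe to try on any machine; the lists can be handed to `subprocess.run` once the names match a real setup.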
8. Backup node architecture
➢Again we use LVM
➢RAID array which exposes a single drive
➢5 equally big Physical Volumes
➢on each PV we create a VG with thin pool
➢each container has a single LV
➢each incremental backup is a new snapshot
from the LV
➢when the max number of incremental backups
is reached, we remove the first LV
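The snapshot rotation described above can be sketched as follows. This is an illustration under assumptions, not the team's code: the function name, the snapshot naming scheme and the default of 30 copies (taken from the design goals) are placeholders. The function returns the LVM command lines instead of running them.

```python
def rotate_commands(vg, lv, snapshots, new_name, max_copies=30):
    """Snapshot `lv`, then drop the oldest snapshots beyond `max_copies`.

    `snapshots` lists existing snapshot names, oldest first.
    Returns (commands, snapshot names kept afterwards).
    """
    commands = [["lvcreate", "-s", "-n", new_name, f"{vg}/{lv}"]]
    kept = list(snapshots) + [new_name]
    while len(kept) > max_copies:
        oldest = kept.pop(0)
        commands.append(["lvremove", "-y", f"{vg}/{oldest}"])
    return commands, kept


# Hypothetical container with 30 existing snapshots: one is added, one removed.
existing = [f"c1-snap{i}" for i in range(30)]
commands, kept = rotate_commands("vg0", "c1", existing, "c1-snap30")
for command in commands:
    print(" ".join(command))
```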
9. For now, there is nothing really
new or very interesting here.
So let me start with the fun
part.
10. ➢We use rsync (nothing revolutionary here)
➢We need the size of the deleted files
➢ https://github.com/kyupltd/rsync/tree/deleted-stats
➢Restore files directly in client's containers, no
SSH into them
➢ https://github.com/kyupltd/rsync/tree/mount-ns
11. Backup system architecture
➢ One central database
➢ Public/Private IP addresses
➢ Maximum slots per machine
➢ Gearman for messaging layer
➢ Scheduler for backups
➢ Backup worker
12. The Scheduler
➢ Check if we have to back up the container
➢ Get the last backup timestamp
➢ Check if the host node has available backup
slots
➢ Schedule a 'start-backup' job in Gearman
on the backup node
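The scheduler's decision can be sketched like this. It is a minimal illustration assuming a once-a-day interval and a per-host slot count; the `Container` class, `pick_backups` and the field names are hypothetical, and the actual Gearman job submission is left out because the slides do not show it.

```python
from dataclasses import dataclass

DAY = 24 * 3600  # seconds


@dataclass
class Container:
    name: str
    host: str
    last_backup: float  # Unix timestamp of the last finished backup


def pick_backups(containers, free_slots, now, interval=DAY):
    """Return the containers to back up now, oldest backup first.

    `free_slots` maps a host node name to its available backup slots.
    """
    remaining = dict(free_slots)  # copy, so the caller's dict is untouched
    due = sorted(
        (c for c in containers if now - c.last_backup >= interval),
        key=lambda c: c.last_backup,
    )
    chosen = []
    for container in due:
        if remaining.get(container.host, 0) > 0:
            remaining[container.host] -= 1
            chosen.append(container)  # a 'start-backup' job would be queued here
    return chosen
```

Sorting by the last backup timestamp means the containers that have waited longest get the free slots first; that ordering is a design choice of this sketch, not something the slides state.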
13. start-backup worker
➢ Works on each backup node
➢ Started as many times as the Backup server
can handle
➢ handles the actual backup
➢ creates snapshots
➢ monitors rsync
➢ removes snapshots
➢ updates database
14. No problems... they say :)
➢ We lost ALL of our backups from TWO nodes
➢ corrupted VG metadata
➢ VG metadata space is not enough (more than 2000
LVs)
➢ create the VGs a little bit smaller than the total size
of the PV
➢ separate the VGs to lose less
15. No problems... they say :)
➢ LV creation becomes sluggish because LVM tries to
scan for devices in /dev
➢ obtain_device_list_from_udev = 1
➢ write_cache_state = 0
➢ specify the devices in scan = [ "/dev" ]
➢lvmetad and dmeventd break...
➢ when they break, they corrupt the metadata of all currently
opened containers
➢lvcreate leaks file descriptors
➢ once lvmetad or dmeventd are out of FDs everything breaks
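Since running out of file descriptors is what breaks everything, one option is to watch the FD usage of lvmetad and dmeventd. The sketch below reads Linux's /proc for a given PID; the function names are hypothetical, and reading another process's entries usually requires root. The `proc_root` parameter exists only so the code can be pointed at a test directory.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count the file descriptors a process currently holds open."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def soft_fd_limit(pid, proc_root="/proc"):
    """Read the soft 'Max open files' limit from /proc/<pid>/limits."""
    with open(os.path.join(proc_root, str(pid), "limits")) as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Columns: name (three words), soft limit, hard limit, units.
                return int(line.split()[3])
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid, proc_root="/proc"):
    """Fraction of the soft FD limit in use, between 0.0 and 1.0."""
    return count_open_fds(pid, proc_root) / soft_fd_limit(pid, proc_root)
```

Calling `fd_usage(pid)` with the PID of dmeventd from a cron job or monitoring agent, and alerting well before the value approaches 1.0, would give time to restart the daemon before it fails.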
16. Then the Avatar came
➢ We wanted to reduce the restore time from 4h to
under 1h, even under 30min
➢ So instead of backing up whole containers...
➢ We now back up accounts
➢ Soon we will be able to do distributed restore
➢ single host node backup
➢ from multiple backup nodes
➢ to multiple host nodes
17. Layered backups
Sparse File
Physical Volume
Volume Group
ThinPool
Logical Volume
Snapshot6
Snapshot5
Snapshot4
Snapshot3
Snapshot2
Snapshot1
Snapshot0
Loop mount
18. Issues here
➢ We can't keep a machine UP for more than 19
hours, LVM kernel BUG
➢ kernels 2.6 through 4.3 - it crashes when discarding data
➢ Removing old snapshots does not discard the
data
➢ LVM umounts a volume when dmeventd
reaches the limit of FDs
➢ It does umount -l, the bastard
19. Issues here
➢ LVM dmeventd tries to extend the volume, but
if you don't have free extents it will silently
umount -l your LV
➢ Monitor your thinpool metadata
➢ Make your thinpool smaller than the VG and
always plan to have a few spare PEs for
extending the pool
➢ kabbi__ irc.freenode.net #lvm
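Monitoring the thin pool metadata could look roughly like this. The sketch parses text produced by `lvs --noheadings --separator : -o lv_name,data_percent,metadata_percent`; the function name, the 80% threshold and the sample output are assumptions made for illustration. Rows without a metadata percentage (ordinary LVs) are skipped.

```python
def pools_over_threshold(lvs_output, threshold=80.0):
    """Return (name, data %, metadata %) for thin pools at or above threshold."""
    alerts = []
    for line in lvs_output.splitlines():
        fields = [field.strip() for field in line.split(":")]
        if len(fields) != 3 or not fields[1] or not fields[2]:
            continue  # not a thin pool row: percentages are missing
        name, data, metadata = fields[0], float(fields[1]), float(fields[2])
        if data >= threshold or metadata >= threshold:
            alerts.append((name, data, metadata))
    return alerts


# Hypothetical lvs output: one thin pool and one thin LV.
sample = "  pool0:71.20:93.50\n  container1:12.00:\n"
print(pools_over_threshold(sample))
# prints: [('pool0', 71.2, 93.5)]
```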