Backy - VM backup beyond bacula

The Flying Circus is an Operations-as-a-Service platform that helps project development teams run their custom-developed software for clients. Earlier in 2014 we experienced a major data loss and had to perform massive disaster recovery. Unfortunately our Bacula setup was not up to the task, and restoring the data took longer and required more effort than we and our customers expected.
In this case study I’d like to present our public and very honest root cause analysis on how we managed to lose a lot of VMs’ data, how the restore happened, what we learned and how we’re trying to get better. After investigating our options for the future we decided to move away from Bacula’s file and VTL-oriented model and are currently implementing a solution based on CoW-filesystems (ZFS/btrfs), block-layer snapshots and diffing, and a small utility to glue things together.


  1. 1. backy VM backup beyond Bacula/Bareos Christian Theune
 @theuni ct@flyingcircus.io
  2. 2. Mea Culpa I should have given this talk last year. But I boarded the wrong train, and only when I inserted the conference logo did I notice why they had written the wrong city on it. It turns out I was heading to Nuremberg because Netways is based there, and I did not realise the conference was in Cologne. Looks like I made it this time. However, I’m happy to have the chance to present what we have now, because a lot happened in the last year and thus I don’t have to give this talk twice. :)
  3. 3. And I almost missed it, again And I would almost have had to say the same again next year. My son-to-be decided to scare my wife and me on Sunday evening, and I almost did not make it here again. Thanks to the organisers for not having me tarred and feathered just yet.
  4. 4. Backup!!11!! We’ve been doing backups for ages. Of our own servers, workstations, applications, databases … and it keeps being painful. We started in 1999 using a Tandberg drive with Amanda, moved on to a small tape library at some point, and started using Bacula quite early. Things were reasonable, but I still hate tape libraries. And I don’t like answering pain with “you should have spent more money” if the basic thing just fails by itself.
  5. 5. • flyingcircus.io • DevOps as a Service • custom, mission-critical web applications Today I work for the Flying Circus, which operates custom web applications on a public cluster (with the option to run private clusters, too). We currently manage a couple of dozen physical servers and several hundred virtual machines to do that. Sizes vary a lot, workload varies a lot. We’ve been using Bacula for a long time with a largish RAID6-based drive array. We use Ceph as the primary storage (we used to use iSCSI) and we have been taking backups directly from the storage servers running the FDs, with the option to restore into an FD on the VM. We also have pre/post scripting to take snapshots and set databases into backup mode. This talk is mostly an excursion into “why the hell did we build our own backup system” and shows what we decided to build. All of this is a perpetual work in progress, but we feel that “build your own” ends up, in our case, with higher quality and less maintenance effort than we had with Bacula. Whether that’s true in the long run, we’ll see. Let’s start with how we came to ask the question “How can we get away from Bacula?”
  6. 6. Part I - Oh the Pain We always have nice times and rough times with our tools, but Bacula has been on our naughty list for a while and I’d like to start with the incident that caused us to take action.
  7. 7. The story unfolds … On a relatively slow day, we suddenly got Nagios warnings about filesystem errors. One, two, and then suddenly a lot of them. It’s 14:50.
  8. 8. We quickly found that a complex bug caused a rogue server to delete all our VM images. I think we lost about 50% of running VMs at that point. Luckily the bug tried to delete the images so quickly that Ceph got confused and not all deletion requests went through. Time to restore!
  9. 9. At T+1.5 hours we had fixed the bug (by disabling any code that deletes stuff) and the restore had started. We prioritised central services and SLA customers and were well on our way.
  10. 10. At T+11 hours we were pretty much done. Except that our most valuable customer’s database VM would not restore. We were already on a shift cycle so that everyone could get some sleep and come back fresh, but with Bacula we had to choose between continuing to restore and figuring out how to get around the inconsistency. Unfortunately, the inconsistency was in the middle of a 100+ GB volume: each restore attempt took a while to start and to forward to the 70 GiB mark, where it would suddenly stop because of the inconsistency. Also, as we were trying different director options to let it ignore the inconsistency, we had to halt other backups …
  11. 11. Finally, around 20 hours later, we had all customer services back online. Our most valuable customer had the most downtime. Great.
  12. 12. Root Cause Analysis After we had caught our breath, we had a long list of things that got in our way while trying to restore. Here’s the list of things that failed us.
  13. 13. http://flyingcircus.io/postmortems/13266.pdf If you’d like to go into the details yourself, here’s a URL. It’s an 8-page document. It starts with the basics and works through the event. Let me know if you read it in the future and have any questions.
  14. 14. Restore script bottleneck: global lock The restore script had evolved in a way that did not allow large-scale restores as we required them. We had to tune it in multiple places to avoid locks and to talk to the director in a different way. This took us about 3 hours.
  15. 15. Undetected inconsistency in important customer database After we ramped up parallelism a large customer database VM showed signs of inconsistency and aborted the restores. While others were busy this took one person many hours to diagnose and fix while other backups were restored. This also stopped other backups every now and then.
  16. 16. Bacula: complexity and the VTL In the aftermath we saw that removing the global lock from the restore script should have been easy. However, the overall complexity of instrumenting Bacula and the mismatch of the VTL when doing disk-based backups caused this to become a long and cumbersome task. I don’t think we can ever come to the point where all our scripts for rare events will be perfect, and thus I’d rather have our scripts in generally good shape on a basic level and allow quick modifications as an event unfolds.
  17. 17. Not “everything” backed up. Many of our VMs carry generated installations: the VMs are provisioned by our platform layer using puppet and customer scripting and are very similar. For services, our customers and we use separate service deployment tools (batou) to put customisations on top. To save disk space and reduce backup load we did not back up all VMs if a customer had, e.g., dozens of identical application servers. This, however, meant we had to re-deploy customer-specific installations during this large-scale event, which was more fiddly than we would have liked (provision the VM from the beginning, re-deploy the service, find glitches under this special scenario).
  18. 18. 24 hours are not a sufficient RPO in quite a few cases Daily backups are OK for many cases. But in quite a few others, they are not. We’d love to give customers the option of making hourly backups (or hell, even on a per-minute basis). With our Bacula setup there was no feasible way to do this as it would have to at least scan all the metadata of a filesystem to determine what has changed. Also, we have some pathological cases where append-only databases would cause insane write-amplification in that case.
  19. 19. Paper cuts • Hard link farms • Boot loaders • The director as a “most valuable bottleneck” Sigh. Yes, we’re running Cyrus. But still. Not restoring hard link farms really makes me hate archive formats that have to replicate all the filesystem logic. Restoring VMs in our case means preparing an empty disk image, putting the standard base image on it, then running a restore into it and reconfiguring the boot loader. This means the restore script has to know at least something about our boot loading process. This breaks backups that come from a different period of boot loader configuration. Sigh. The director. A single process. For hundreds of jobs. And you can’t reload on the fly. And it takes us 2 MiB of auto-generated configuration. And I can never predict how schedules will really work out.
  20. 20. Recap • Restore fiddly to script • Undetected inconsistency that was hard to deal with • Blind spots • Daily Interval • Overall complexity, performance and the VTL • Paper cuts We had a couple of other incidents in the meantime that needed a restore, and I am *never* happy to have them. I’m always immediately mad at having to restore because I know something is going to blow up.
  21. 21. Part II - Make a wish
  22. 22. Simplicity • Restore with basic Unix tools • No VTL • Not mixing data of different VMs
  23. 23. Reliability • Verification / Scrubbing / (Repair) • High frequency • Integration with storage snapshots • Not inventing new formats
  24. 24. Operability • Avoid bottlenecks / head-of-line blocking • Efficient deltas for large files (ZODB) • Parallelisation (multiple jobs and multiple servers) • Simple scripting and environment-specific integration • Coordination: pre/post actions on storage, hypervisor, VM …
  25. 25. Operability II • Simple Nagios integration to ensure we notice RPO/ SLA failures • RTO-compliance during mass-restore • Self-service for customers to restore files or VMs
  26. 26. Part III - Let’s do this!
  27. 27. “One size fits all … not” (probably someone, maybe me) It’s all about size - Bacula / Bareos are general solutions to the big problem of backing up and restoring data. They have advanced capabilities, they have a large installed base and they support many features. - However, the sheer complexity of things is what gives us the big pain. - We want to drive down complexity as much as we can, build on existing tools, and then add some of our own stuff. - So. We’re not trying to solve backup for each and everyone. However, having compute and storage prevalently
  28. 28. It’s all about size: backy ~3050 LOC in Python about 50% of the code are tests! about 94% branch coverage
  29. 29. It’s all about size: Bacula ~150k LOC in C. This is 50x the size. If the intellectual complexity rises at least quadratically with size, this would be 2,500 times more complicated. Well. Ok. I’m being ironic. Nevertheless, this is a lot of code. The amount of test code that I found was about 3000 lines (simply grepping for “test” in the filename). That means that the test/code ratio is about 2%. On a code-base that is 50 times more complicated I see only a 45th of the test coverage ratio. This doesn’t feel good. Please. If anyone is outraged right now because I did not find the tests. Please tell me. I’d love to hear how Bacula or Bareos are doing quality assurance of their code nowadays.
  30. 30. It’s all about size: Bareos Just as Bacula is big, Bareos seems to have some traction and seems to be healthy from a contributions perspective. At least they managed to crank out another 100k lines of code, totalling 250k. My insanely dumb test script tells me there are 5.5k lines of test-related code. That is a test/code ratio of about 2.2%, an increase of 0.2 percentage points. But again, let me know if you’re doing unit or functional tests in a way that was not obvious to me.
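The ratios quoted on the last two slides can be re-derived with a few lines of arithmetic. This is only a restatement of the numbers given in the talk (lines in files with “test” in the name versus total lines); the helper name `line_ratio` is made up for illustration.

```python
def line_ratio(test_lines, total_lines):
    """Share of all lines that live in files with "test" in their name."""
    return test_lines / total_lines


# Figures as quoted in the talk (rough grep-based counts).
bacula = line_ratio(3_000, 150_000)
bareos = line_ratio(5_500, 250_000)

print(f"Bacula: {bacula:.1%}")
print(f"Bareos: {bareos:.1%}")
print(f"Difference: {(bareos - bacula) * 100:.1f} percentage points")
```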
  31. 31. The giants we stand on Ok. We decided to do as little as possible ourselves, which means we’re standing on multiple giants’ shoulders. We only back up Qemu VM images. Luckily, Qemu and Linux have an API (fsfreeze) for ensuring a consistent state of a volume while it is mounted read/write. All our disk images are stored in Ceph, which has cheap snapshots, is networked, and can hand out deltas pretty cheaply, too. We chose btrfs as the storage target for the images. It can store sparse files and we leverage the CoW semantics (cp --reflink) when integrating deltas from Ceph.
  32. 32. Limits • Not a general purpose backup system. • No tapes. No weird hardware. • Restore without tools. • Dead-simple configuration. So. As I said: not one size fits all. Let’s build something that fits us perfectly. We also want to use standard hardware. At the moment we target a RAID6+Hotspare with about 50TB of space. We want to be able to restore into Ceph (or in the worst case into an iSCSI host if need be, or, *whatever*). dd is our tool of choice.
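To make the “copy the previous image, then integrate the delta” idea concrete, here is a minimal sketch. It is not backy’s code: the delta format (a list of offset/data pairs) and the function name `integrate_delta` are assumptions for illustration, and `shutil.copyfile` is only a portable stand-in for `cp --reflink`, which on btrfs would share unchanged blocks instead of copying them.

```python
import os
import shutil
import tempfile


def integrate_delta(previous, new, delta):
    """Create `new` from `previous` and overwrite only the changed extents.

    `delta` is a list of (offset, data) pairs, a hypothetical stand-in for
    the diff exported from a storage snapshot.
    """
    # Plain copy as a portable stand-in for `cp --reflink`.
    shutil.copyfile(previous, new)
    with open(new, "r+b") as image:
        for offset, data in delta:
            image.seek(offset)
            image.write(data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        old = os.path.join(tmp, "rev-old")
        new = os.path.join(tmp, "rev-new")
        with open(old, "wb") as f:
            f.write(b"A" * 16)
        integrate_delta(old, new, [(4, b"BBBB"), (12, b"CC")])
        with open(new, "rb") as f:
            print(f.read())
```

The previous revision stays untouched, so every revision remains a complete image that can be restored with plain Unix tools.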
  33. 33. Let’s take a tour So. Let’s take a tour of what backy looks like. We’ll start with a component overview.
  34. 34. So. Let’s take a tour of what backy looks like. We’ll start with a component overview.
  35. 35. simple!?! - Obviously this is much simpler. Right? - Ok, well. This is still quite complicated and involved. The point is: all of the tools have a very specific infrastructure-oriented job and they *contribute* to our specific solution. We use Qemu anyway. We instrument Qemu with our “fc agent” anyway. We use Ceph anyway. We use Consul for coordination anyway. - Those tools do not come into existence in our environment just for backy. Our environment is quite Unix-oriented in this way. We use composable tools that have specific jobs. Those provide us features or services or functionality that will be useful for many higher-level tasks. - So again, yes, this is not simple. However, plugging those things together, and testing this plugging, proved helpful to us. We invest in tools that we can reuse and we build our own on top. - We stopped shopping for one-stop-solution tools. We don’t want a cooking-oven-microwave-tv-lawnmower. We want some sandpaper, and hammers, and nails, and electricity. - Again, most of this code we don’t have to maintain. And we configure those components anyway. So that’s why we get along with little code that we can test well.
  36. 36. Hello CLI - Backy ships with a single command that provides sub-commands. By design, you should never have to interact with those on a daily basis. - However, I’ll show you how to interact with the CLI to trigger an immediate “random” backup and how to start the scheduler. The check command provides a general Nagios-compatible check that you can use to alert for SLA/scheduling issues on the whole installation.
  37. 37. Running a single backup - You can simply ask backy to run a backup. This uses the job-specific configuration of the current directory “litprod00” and does all the work of getting a snapshot, exporting the diff, integrating it into the CoW copy and running a partial random verification against the original source. - Note that a differential backup of a 10GB volume took 8 seconds real time. Most of the time is spent by Ceph constructing the delta, and the rest on a random check of the volume against the original.
  38. 38. Inspecting a backup - Backy has a small command to inspect the status of this VM’s backup archives. We call each backup a revision, and you see that it has tags. We note how much data we backed up, how long it took and when. The ID is a short UUID. - The summary is an estimation based on the amount of data we backed up for each revision. Note that this only indicates the average backup size for each revision.
  39. 39. Inspecting a backup - An important decision was to keep the data of different machines separate and to store it in a very obvious fashion. - We thus create a simple directory hierarchy: a backy start directory, a directory per VM, and 2 files for each backup. - Backy also keeps a per-machine log in the directory of *all* activities that were done using any backy command for this volume. The .rev files are YAML files that store metadata about each revision. The most current revision can be found using the timestamp or is always available through the “last” symlink.
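The partial random verification mentioned above can be pictured roughly like this. It is a sketch under assumptions, not backy’s implementation: block size, sample count and the function name `verify_random_blocks` are made up, and the two arguments are any seekable binary streams (for example, the source snapshot and the backup image opened in binary mode).

```python
import io
import random


def verify_random_blocks(source, backup, size, block_size=4096, samples=8, rng=None):
    """Compare `samples` randomly chosen blocks of two seekable binary streams.

    Returns True if every sampled block is identical. Being a partial check,
    it can miss damage in blocks that were not sampled.
    """
    rng = rng or random.Random()
    blocks = max(1, (size + block_size - 1) // block_size)
    for _ in range(samples):
        offset = rng.randrange(blocks) * block_size
        source.seek(offset)
        backup.seek(offset)
        if source.read(block_size) != backup.read(block_size):
            return False
    return True


if __name__ == "__main__":
    data = bytes(range(256)) * 64
    print(verify_random_blocks(io.BytesIO(data), io.BytesIO(data), len(data)))
```

Sampling keeps the check cheap enough to run on every backup; it trades completeness for speed, which is why it is combined with other measures such as filesystem scrubbing.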
  40. 40. Hello, daemon! - Initially we intended to run without any special daemon. However, properly doing load management and scheduling with a shifting number of configurations turned out to require a few clever tricks that we put into a little daemon. - The scheduler is based on Python 3.4’s asyncio and allows us to have a relatively simple implementation of running parallel jobs with low overhead. Every VM gets an infinite-loop coroutine that schedules this VM’s next backup, waits for that deadline, and submits a task to a work pool that enforces worker limits, which in turn simply calls the backy shell command to run a single backup. It then cleans up any expired backups and starts from the beginning. - Also, the scheduler is stateless and thus can be stopped and started at any time without losing queues. It doesn’t have to store data for that either. - As the scheduler is completely irrelevant to restoring, we can restore while backing up or we can simply stop the scheduler.
  41. 41. Daemon configuration - The daemon has three types of configuration. Some global options, like limiting the number of parallel jobs and the base directory where to put backups. - Then it defines multiple schedules that are named (I’ll explain those in detail in a moment). - And then it describes jobs by stating their name, their type of source (file is a simple job that we use for low-level testing, others are pure Ceph and Flying-Circus-specific consul-managed jobs), and which schedule they belong to. - That’s it. This file can be computed from our inventory in a very simple fashion, we restart the scheduler and are happy.
  42. 42. Scheduling This was probably one of the harder things to implement. At the moment we’re happy to have a very simple pattern. The general terms are “schedule”, “tag”, “interval”, and “keep”. What we didn’t do: * allow references to absolute times. Those don’t make sense on a broad platform as we have to spread backups evenly throughout the day. And honestly, if it matters whether you make backups at 3 am versus 3 pm then you are actually asking me to do hourly backups instead of daily ones. * allow referencing any special thing like weekdays, holidays, … what … ever. * The way the schedule works is that a predictable pattern like “every 24 hours” can be derived from “which backups do exist” and “are we due for another one yet and if yes, which tags are on it?” * We then run a backup at some point, stick tags to it and are done.
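The per-VM coroutine plus worker pool can be sketched as follows. This is an illustration only: it uses today’s `async`/`await` syntax rather than the Python 3.4 style the talk refers to, the names `job_loop` and `run_backup` are hypothetical, a semaphore stands in for the work pool, and the loop runs a fixed number of rounds so the example terminates.

```python
import asyncio


async def run_backup(name, log):
    """Placeholder for invoking the backup command for one VM."""
    await asyncio.sleep(0)  # a real job would spawn a subprocess here
    log.append(name)


async def job_loop(name, interval, pool, log, rounds):
    """Per-VM loop: wait for the deadline, take a worker slot, back up."""
    for _ in range(rounds):  # the daemon described above loops forever
        await asyncio.sleep(interval)
        async with pool:  # the semaphore caps the number of parallel jobs
            await run_backup(name, log)
        # expired revisions would be purged here


async def main(jobs, worker_limit=2, rounds=2):
    pool = asyncio.Semaphore(worker_limit)
    log = []
    await asyncio.gather(
        *(job_loop(name, interval, pool, log, rounds) for name, interval in jobs.items())
    )
    return log


if __name__ == "__main__":
    print(asyncio.run(main({"vm1": 0.01, "vm2": 0.02, "vm3": 0.03})))
```

Because each loop derives its next deadline on its own, nothing has to be persisted between restarts, which matches the stateless design described above.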
  43. 43. Scheduling Note that we do not have to differentiate the schedule for delta/full/differential backups. This is what Ceph and btrfs give us for free. We just specify a rhythm of RPOs and then we’re done.
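The “which backups exist, and are we due for another one with which tags?” question can be expressed in a few lines. The data layout below (a dict of tags with `interval` and `keep`, and revisions as timestamp/tag-set pairs) is an assumption for illustration, not backy’s actual schedule format.

```python
from datetime import datetime, timedelta


def due_tags(schedule, revisions, now):
    """Return the set of tags for which a new backup is due.

    `schedule` maps tag -> {"interval": timedelta, "keep": int};
    `revisions` is a list of (timestamp, set_of_tags) pairs.
    """
    due = set()
    for tag, spec in schedule.items():
        stamps = [ts for ts, tags in revisions if tag in tags]
        # Due if the tag was never used or its newest revision is too old.
        if not stamps or now - max(stamps) >= spec["interval"]:
            due.add(tag)
    return due


if __name__ == "__main__":
    schedule = {
        "daily": {"interval": timedelta(days=1), "keep": 9},
        "weekly": {"interval": timedelta(days=7), "keep": 5},
    }
    now = datetime(2015, 1, 10, 12, 0)
    revisions = [(now - timedelta(hours=25), {"daily", "weekly"})]
    print(due_tags(schedule, revisions, now))
```

A backup runs whenever the returned set is non-empty, and the new revision simply receives all tags in that set; no absolute times are involved.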
  44. 44. Purging Nothing to see. Really. Well. OK. The actual thing is: this is the backside of scheduling.
  45. 45. Purging Every tag in the schedule has a “keep” value N. It means two things: 1. do not remove this tag from revisions as long as we have fewer than N revisions with this tag; 2. do not remove this tag from revisions as long as the last revision is younger than interval*N. When a revision with a given tag no longer meets those criteria (we have enough revisions with this tag and they are old enough), the tag gets removed. Once a revision has no tags left, the revision is removed. btrfs takes care of any block-level references to the data that need to be deleted at that point.
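One way to read the two “keep” rules is sketched below. The interpretation is an assumption: for each tag, the newest N tagged revisions always keep the tag, and an older revision loses it once its own age reaches interval*N. The data layout and the function name `purge` are illustrative only.

```python
from datetime import datetime, timedelta


def purge(schedule, revisions, now):
    """Strip expired tags in place and return the revisions that still carry a tag.

    `schedule` maps tag -> {"interval": timedelta, "keep": int};
    `revisions` is a list of dicts {"ts": datetime, "tags": set}.
    """
    for tag, spec in schedule.items():
        tagged = sorted(
            (r for r in revisions if tag in r["tags"]),
            key=lambda r: r["ts"],
            reverse=True,
        )
        max_age = spec["interval"] * spec["keep"]
        # The newest `keep` revisions always retain the tag.
        for rev in tagged[spec["keep"]:]:
            if now - rev["ts"] >= max_age:
                rev["tags"].discard(tag)
    # A revision without any tags left is removed.
    return [r for r in revisions if r["tags"]]


if __name__ == "__main__":
    now = datetime(2015, 1, 10)
    schedule = {"daily": {"interval": timedelta(days=1), "keep": 2}}
    revs = [{"ts": now - timedelta(days=d), "tags": {"daily"}} for d in range(4)]
    print(len(purge(schedule, revs, now)))
```

In a real setup, removing a revision would mean deleting its image and metadata files; the filesystem then frees blocks that are no longer referenced.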
  46. 46. Scrubbing • partial, random verification during backup against source • btrfs scrubbing • Raid-6 We do partial verification of a freshly made backup against the original source in Ceph. In addition to that we rely on btrfs scrubbing to warn us of any issues. On top of that we hope to reduce the chance of unrecoverable bitrot with RAID 6. I think we’re relatively safe at the moment for the amount of data we store.
  47. 47. Deleting a VM • rewrite config, reload master • rm -rf
  48. 48. Monitoring • old state is uninteresting • do I have to act? This is something I’m kinda proud of. We tried multiple things in the past to monitor bacula, but it either ended up being too complicated and brittle, or didn’t trigger at the right times, or too often, or … So, I wanted to come up with a single test that tells me whether I have to act or not. What I noticed is that old state is uninteresting as we can’t fix it anyway. If I missed a backup a week ago, then that’s happened. I can’t fix that. I can’t travel in time. (For that reason I’ve built in a way to catch up with recent backups so backy has some limited self-repair here.) When I come to the office in the morning, I want to know: are we good, or not. And that’s what I built.
  49. 49. Monitoring Backy has a simple telnet console that can give an overview of what’s going on. The SLA column is interesting. The SLA being OK means that the last backup is not older than 150% of the time of the smallest interval in our schedule. Done. Backy also has a convenience subcommand that aggregates this for all jobs. To support this backy writes the status output you see here into a status file every 30 seconds and the Nagios check reads that (so it doesn’t have to wait for a crashed daemon). The check validates that the file is fresh and no jobs have exceeded their SLA.
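The SLA rule and the freshness check fit into a short sketch. The 150% rule is taken from the notes above; everything else is assumed for illustration, including the in-memory status layout, the function names, and the 5-minute staleness threshold (the talk only says the status file is written every 30 seconds).

```python
from datetime import datetime, timedelta


def sla_ok(last_backup, intervals, now):
    """SLA rule from the talk: last backup no older than 150% of the smallest interval."""
    return now - last_backup <= min(intervals) * 1.5


def nagios_check(status, status_written, now, max_status_age=timedelta(minutes=5)):
    """Return (exit_code, message) in the Nagios convention (0 OK, 2 CRITICAL).

    `status` maps job name -> (last_backup, list_of_intervals).
    """
    if now - status_written > max_status_age:
        return 2, "CRITICAL: status file is stale"
    failed = sorted(
        name
        for name, (last, intervals) in status.items()
        if not sla_ok(last, intervals, now)
    )
    if failed:
        return 2, "CRITICAL: SLA exceeded for " + ", ".join(failed)
    return 0, "OK: all jobs within SLA"


if __name__ == "__main__":
    now = datetime(2015, 1, 10, 9, 0)
    day = timedelta(days=1)
    status = {
        "vm1": (now - timedelta(hours=30), [day, 7 * day]),
        "vm2": (now - timedelta(hours=40), [day]),
    }
    print(nagios_check(status, now - timedelta(seconds=20), now))
```

Only the current state is evaluated: a backup missed last week does not show up here unless the most recent backup is also too old.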
  50. 50. Ok. The tour has probably been fast and rough. Let’s wrap it up here and call it a day, shall we?
  51. 51. What did we leave out? • Physical host backup • Guesstimating achievable backup storage ratio * I don’t want to care about backup of physical hosts any more. Most stuff is managed automatically anyway. OS installation is mostly automatic, too. Important data can be backed up by rsyncing files into a VM that is backed up with backy in our case. * I don’t really care much about the backup storage ratio. Having to keep 100% of all data for every day or hour for 3 months isn’t feasible. Storing between 2 and 4 times the original volume is fine. Heck. Even 10 would probably be fine. Space is cheap.
  52. 52. Future • trim-ready - waiting for our whole stack (Guest, Hypervisor, Ceph, …) to pass this through • Hot reload of scheduler • Ensuring we can move VM backup directories between different backup hosts * Restarting the scheduler is fine for now. In the future we’ll likely implement a hot-reload feature to avoid accidentally tripping up already running jobs.
  53. 53. Having your backup and eating it! I think the biggest thing I wanted to get off my chest: Bacula has been good at backup to us for a long time. It’s always been a bit annoying when it came to restore. And obviously we all know by now: nobody wants backup, everybody wants restore. Whenever we fail to restore our customers’ data in time and consistently, we fail badly. This is what our backup needs to measure up to. We have grown out of Bacula both in the amount of data (restore calculations take ages) and from an operational perspective. We need to move faster. We need to integrate more. We want to solve policy-oriented issues on a completely different level. We’re used to writing code to solve our issues. We’re developers. We know coding is hard. That’s why we like small reliable tools that we can compose. Bacula isn’t very composable. The only advice that I can give is based on personal experience: I love knowing how the pieces work, and I love contributing to the world by building my own. However, the number of pieces we have to deal with is growing. And that means I want those pieces to be small, multi-purpose, and to do their job very, very well, and then integrate them. From my perspective: big frameworks are dead. That’s why I love nginx over Apache. Or Pyramid over Django. Small is beautiful. But I might be wrong and might say the opposite tomorrow. Caveat emptor.
  54. 54. @theuni ct@flyingcircus.io Thanks for having me and thanks for hearing me out. Do we have time for questions?
  55. 55. Image Sources • https://www.flickr.com/photos/mpa-berlin/14337541104/ • https://www.flickr.com/photos/seattlemunicipalarchives/4777122561 • https://www.flickr.com/photos/jkroll/15314415946/ • https://www.flickr.com/photos/dvids/6956044669/ • https://www.flickr.com/photos/flowtastic/7354146628/
  56. 56. Image Sources • https://www.flickr.com/photos/galeria_stefbu/4781641072/in/pool-fotoszene/ • https://www.flickr.com/photos/dlography/6982668385/ • https://www.flickr.com/photos/127437845@N04/15142216255 • https://www.flickr.com/photos/clement127/15440591160
  57. 57. Image Sources • https://www.flickr.com/photos/clement127/15999160179 • https://www.flickr.com/photos/63433965@N04/5814096531/ • private pictures
