We'll give an update on how Facebook manages CentOS across our fleet, how working with the community helps us solve problems at scale, and touch on some of the tooling and processes we've developed. We'll focus specifically on the challenges of upgrading the fleet to a new major release and discuss how we plan to leverage CentOS Stream in our environment.
6. • OS team manages the bare metal experience of the fleet
• OS as a platform
• Individual teams are responsible for their own hosts
• Built on an Open Source foundation
• Linux, CentOS, rpm/yum/dnf, Chef, systemd
Infrastructure
How does it work?
7. • Community sets the direction
• We move fast; open source often moves faster
• We don’t need to write everything ourselves
• Sharing our code means sharing the maintenance and
having others extend it
• DevConf.CZ 2017 talk: https://tinyurl.com/y7gx6nro
Infrastructure
Upstream first
8. • Stable releases
• Binary compatibility
• Security updates
• Mature and well understood tooling
• EPEL
• Close relationship with Fedora
Infrastructure
Why CentOS?
9. • Backports from Fedora Rawhide for stuff we care about
• Mostly plumbing and low-level packages
• %facebook macro to gate internal stuff
• GitHub: facebookincubator/rpm-backports
• CentOS + FTL = stable distro, moving fast
Infrastructure
FTL – Fast Thin Layer
11. • CentOS 5 → 6 (~2013-2016), 6 → 7 (2016-2018)
• No in-place updates: reprovision the host from scratch
• Clean slate to ensure a good state
• Opportunity to deprecate unwanted features or tools
OS updates
Major updates
12. • Incremental Rolling OS updates
• Every two weeks we sync down the latest updates…
• …and roll them out over two weeks
• ‘yum upgrade’ kicked off via fb_yum in Chef
• Easy stop button and opt out for individual packages
Rolling OS updates
Minor releases and security updates
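The two-week rollout with a stop button can be pictured as a deterministic ramp. The following is a minimal sketch only, not how fb_yum or the actual rollout is implemented: the function names, the hash-based bucketing, and the linear ramp are all assumptions made for illustration.

```python
import hashlib


def host_bucket(hostname: str, buckets: int = 100) -> int:
    """Map a hostname to a stable bucket in [0, buckets) (hypothetical scheme)."""
    digest = hashlib.sha256(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % buckets


def rollout_percent(day: int, rollout_days: int = 14) -> int:
    """Linear ramp (an assumption): 0% before day 1, 100% from the last day on."""
    if day <= 0:
        return 0
    return min(100, (day * 100) // rollout_days)


def should_upgrade(hostname: str, day: int, stopped: bool = False) -> bool:
    """Decide whether a host picks up the current batch of updates on a given day."""
    if stopped:  # the "stop button" halts the rollout for every host
        return False
    return host_bucket(hostname) < rollout_percent(day)
```

Because the bucket depends only on the hostname, a host that has started receiving a batch keeps receiving it on later days, and by the last day every host is included.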
13. • About a year from initial PoC to first production machine
• About two years to migrate 100% of the fleet
• Bulk work: systemd conversion, validation, reprovisioning
• Stateless vs stateful services
• Last hour surprises: regressions and hidden dependencies
• DevConf 2018 talk: https://tinyurl.com/yawmjp74
CentOS 6 to 7
This day in 2018...
14. • Widespread systemd adoption
• More workloads moving to containers
• Switch to image-based provisioning
• Packaging improvements
• Increased community involvement
After CentOS 7
What we’ve been working on
15. • Running our systemd backport on the fleet
• 243 everywhere, 244 in testing
• Internal CI/CD pipeline for regression testing
• GitHub: facebookincubator/systemd-compat-libs
• GitHub: facebookincubator/pystemd
• All Systems Go 2019 talk: https://tinyurl.com/v7lxmq3
After CentOS 7
systemd
16. • Global service.d dropins (PR#13942)
• DefaultMemory{Low,Min} (PR#12211)
• DisableControllers (PR#10567)
• ExecCondition (PR#12933)
• PrivateUsers for unprivileged user managers (PR#13823)
• systemd-internal cgroup limits validation (PR#13690)
After CentOS 7
systemd feature development
17. • dcrpm: automate detection and remediation of issues
• GitHub: facebookincubator/dcrpm
• rpmdb corruption, stuck processes, etc.
• Works on Linux and OSX (!)
• Runs before every Chef run
After CentOS 7
RPM improvements: mitigation
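As an illustration of the "stuck processes" class of problem, here is a minimal sketch of flagging long-running package manager processes. It is not taken from dcrpm: the `Proc` type, the tool list, and the one-hour threshold are hypothetical choices for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proc:
    """Hypothetical snapshot of a running process."""
    pid: int
    name: str
    age_seconds: float


# Assumed set of tools that take the rpmdb lock.
PACKAGE_TOOLS = {"rpm", "yum", "dnf"}


def find_stuck(procs, max_age_seconds: float = 3600.0):
    """Return package-tool processes that have run longer than the threshold."""
    return [
        p for p in procs
        if p.name in PACKAGE_TOOLS and p.age_seconds > max_age_seconds
    ]
```

A remediation step would then decide what to do with the flagged processes; that policy is deliberately left out of the sketch.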
18. • Beyond bdb: A/B testing new database backends
• ndb vs lmdb: goodbye rpmdb corruption!
• lmdb issues: hardcoded size (PR#902), locking, key size
limits (PR#899), ~2x timeouts vs ndb
• CentOS Dojo Boston 2019 talk: https://tinyurl.com/r9txeo7
• Fleet is 100% on ndb as of Jan 2020
After CentOS 7
RPM improvements: database
19. • Experimenting with CoW to speed up package installs
• cpio -> aligned extent data with no compression (kinda)
• RPM plugin uses reflinking to obtain file data
• RPM transcoder proxy to convert prebuilt packages
• Still in heavy development, details tbd
• Also: xz → zstd as default compression
After CentOS 7
RPM improvements: file format
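For readers unfamiliar with reflinking, the sketch below shows the underlying Linux primitive: the `FICLONE` ioctl, which asks the filesystem to share extents between two files instead of copying bytes. This is not the RPM plugin itself, whose details are still to be announced; the function name and the fallback to a plain copy are assumptions for illustration.

```python
import fcntl
import shutil

# Linux ioctl request number for FICLONE (see linux/fs.h).
FICLONE = 0x40049409


def reflink_or_copy(src_path: str, dst_path: str) -> str:
    """Clone src into dst via reflink; fall back to a byte copy if unsupported.

    Returns "reflink" or "copy" so callers can see which path was taken.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # On CoW filesystems such as btrfs this shares the extents.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return "reflink"
        except OSError:
            # Filesystem without reflink support, or files on different mounts.
            shutil.copyfileobj(src, dst)
            return "copy"
```

On a filesystem with reflink support the call completes without reading or writing file data, which is where the install speedup would come from.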
21. • Goal: front-load as much bootstrapping work as possible
• RHEL as a proxy for CentOS
• What’s new, what’s different, what’s going to break
• One month from release to minimal deployment
• Two months from minimal deployment to dev environment
• CentOS Dojo Brussels 2019: https://tinyurl.com/qqkb8ns
RHEL 8 Beta
Bootstrapping a pilot
22. • Importing the package repositories
• Bootstrapping a base image for the installer
• Package changes: grub, network-scripts, python
• Missing packages and CodeReady Linux Builder
• Porting the internal package build pipeline
• Modularity surprises
RHEL 8 Beta
Packaging and provisioning
23. • node.centos? on RHEL
• Chasing hardcoded logic (e.g. node.centos7?)
• Package resources: yum_package vs dnf_package
• Package cache: YumCache vs PythonHelper
• DNF provider teething issues (Chef PR#8005, PR#8754)
RHEL 8 Beta
Chef bringup
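The hardcoded-logic problem (checks like node.centos7? failing on a RHEL 8 pilot) can be illustrated outside Chef. The following Python sketch is not Chef code and not the internal helpers: it parses os-release style text and asks "EL family, at least major N" instead of matching one exact distro and version. The function names and the abbreviated sample data are illustrative assumptions.

```python
def parse_os_release(text: str) -> dict:
    """Parse KEY=value lines in the os-release format into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def is_el_family(fields: dict) -> bool:
    """True for RHEL, CentOS, and anything that declares itself RHEL-like."""
    ids = {fields.get("ID", "")} | set(fields.get("ID_LIKE", "").split())
    return bool(ids & {"rhel", "centos"})


def major_version(fields: dict) -> int:
    """Extract the major version, so "8.0" and "8" both yield 8."""
    return int(fields.get("VERSION_ID", "0").split(".")[0])


def el_at_least(fields: dict, major: int) -> bool:
    """Version check that keeps working on the next major release."""
    return is_el_family(fields) and major_version(fields) >= major


# Abbreviated, illustrative samples (real files carry more fields).
RHEL_8 = 'ID="rhel"\nID_LIKE="fedora"\nVERSION_ID="8.0"\n'
CENTOS_7 = 'ID="centos"\nID_LIKE="rhel fedora"\nVERSION_ID="7"\n'
```

With these samples, `el_at_least(parse_os_release(RHEL_8), 7)` holds on the pilot as well as on CentOS 7, whereas an exact "is this CentOS 7" test would fail on both RHEL and any later release.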
24. • Release notes, internal comms prep work
• Continue productionizing the pilot
• Mostly waiting while obsessively refreshing
https://wiki.centos.org/About/Building_8
RHEL 8 Release
After the pilot
25. • CentOS 8 and CentOS Stream
• New repos: PowerTools and EPEL-playground
• Streamlining rolling OS updates
• About a month from release to open testing
• Began engaging partners and planning migration schedule
• Fleet migration started in earnest in Jan
CentOS 8 and CentOS Stream
Release time!
26. • Based on the CentOS Stream repositories
• Using our kernel and systemd backport
• btrfs on / by default
• cgroup2 only
CentOS 8 at Facebook
What’s different
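A host running "cgroup2 only" can be recognized by the presence of `cgroup.controllers` at the root of the cgroup mount, which exists only on the unified hierarchy. The sketch below is an illustration with hypothetical function names; the root path is a parameter so the check can be exercised against a scratch directory.

```python
import os


def is_cgroup2_only(cgroup_root: str = "/sys/fs/cgroup") -> bool:
    """True when the cgroup mount is the unified (v2) hierarchy."""
    return os.path.isfile(os.path.join(cgroup_root, "cgroup.controllers"))


def available_controllers(cgroup_root: str = "/sys/fs/cgroup") -> list:
    """List the controllers advertised at the root, or [] if not on cgroup2."""
    path = os.path.join(cgroup_root, "cgroup.controllers")
    if not os.path.isfile(path):
        return []
    with open(path) as f:
        return f.read().split()
```

On legacy or hybrid setups the root of `/sys/fs/cgroup` holds per-controller directories instead, so the file is absent there and both helpers report the non-v2 case.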
27. • Sharding for default OS settings in provisioning
• Reuse kernel upgrades tooling to automate host reimaging
• Automated progress tracking
• OS team acts as consulting partner
CentOS 8 at Facebook
Migration process and tooling
28. • No 32-bit altarch release
• Python packaging changes
• Repository layout changes in EPEL
• nobody/nfsnobody UID change
• Modularity: build pipeline, overrides
CentOS 8 migration
Migration issues so far
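The nobody/nfsnobody item refers to the stock UID assignment changing between releases: on CentOS 7 `nobody` is UID 99 and `nfsnobody` is 65534, while on CentOS 8 `nobody` is 65534. Files on persistent storage that were owned by the old UID therefore need auditing. The sketch below shows one way to find them; the function name is hypothetical and this is not the tooling used for the migration.

```python
import os

# UID of "nobody" on a stock CentOS 7 install; 65534 on CentOS 8.
LEGACY_NOBODY_UID = 99


def find_owned_by_uid(root: str, uid: int) -> list:
    """Return sorted paths under root whose owner is the given numeric UID."""
    matches = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat so symlinks are reported by their own owner, not the target's.
            if os.lstat(path).st_uid == uid:
                matches.append(path)
    return sorted(matches)
```

Calling `find_owned_by_uid("/data", LEGACY_NOBODY_UID)` on a preserved data volume (the path is only an example) would list the entries that need a chown after reprovisioning.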
29. • Targeting CentOS 7 EOS by June and EOL by December
• CentOS 8 container base images
• Wrap up the ndb conversion and make it the default
• Productionize and upstream the RPM CoW work
• ???
CentOS 8 migration
What’s next