Learning to Scale OpenStack: A Case Study in Rackspace's Open Cloud Deployment was presented at the OpenStack Design Summit in Portland, OR on April 17, 2013. Watch the recording of the presentation on YouTube: http://www.youtube.com/watch?v=3x8X6f5mnzc
1. Rainya Mosher, Dev Manager, Deploy Infrastructure
IRC: rainya on freenode Twitter: @rainyamosher
Learning to Scale OpenStack:
A Case Study in Rackspace's
Open Cloud Deployment
April 17, 2013 at 4:30pm
2. RACKSPACE® HOSTING | WWW.RACKSPACE.COM
It is not the critic who counts; not the man who points out
how the strong man stumbles, or where the doer of deeds
could have done them better. The credit belongs to the man
who is actually in the arena, whose face is marred by dust
and sweat and blood; who strives valiantly; . . . who at best
knows in the end the triumph of high achievement, and who
at worst, if he fails, at least fails while daring greatly.
Theodore Roosevelt
The Man in the Arena, April 1910
In the Arena
Learning to Scale OpenStack
3.
[Diagram: Global Cloud → Regions → Cells → Hypervisors (HVs), scaling from hundreds of HVs per cell, to thousands per region, to tens and hundreds of thousands globally]
What does “At Scale” Mean?
Learning to Scale OpenStack
4.
Code → Package → Deploy → Verify
What is the Control Plane Release Strategy?
Learning to Scale OpenStack
5.
First Scaling Hurdle – Deploy Mechanism
Learning to Scale OpenStack
• Aug 2012
– Rackspace launches Open Cloud
– Frequent releases to fine-tune
• Sep 2012 through Nov 2012
– Deploying code that is two weeks from trunk takes about two hours
– Begin designing new deploy mechanism at October Summit
• Dec 2012
– Code deploys take 4–6 hours
– Deploy team says, bleary-eyed, they aren't doing it again
• Jan 2013
– Deploy again; takes more than 6 hours
– Accept that it is no longer "reasonable" and temporarily stop deploying code releases
– Focus on the deploy mechanism
[Chart: internal code releases per month vs. capacity, Aug 2012 – Feb 2013, with a linear trend line for internal code releases]
6.
• Package – switched from Debian packages to virtual environments
• Distribute – used torrent for the package, pssh for fact files, and mcollective for actions
• Execute – moved from a centralized puppet master to decentralized masterless puppet
Improving the Deploy Mechanism
Deploying from OpenStack Trunk
7.
Second Scaling Hurdle – Catch up to Trunk
Learning to Scale OpenStack
• March 2013
– Production code is 2 months behind trunk
– Trunk as of 2/28 becomes our "v152" and bakes in preprod
– Prep for impacting DB migrations in production
– Re-enable our CI process
• April 2013
– Deploy v152 to production
– 10x increase in DB traffic
– Community works to fix
– Re-deploy v152 with Community fixes
– Attend Summit in Portland and share the story
[Chart annotations: 1 – Normal DB throughput; 2 – First installation of v152; 3 – Disabled several periodic tasks; 4 – Re-installed v152 with patches from Community & turned periodic tasks back on]
8.
• Testing & Environments
– More robust testing coverage
– Deployer-specific testing further upstream
– Production-like dev environments
– Simulate production compute numbers on non-production hardware
• Database & Code Management
– Non-disruptive DB migration patterns
– DB calls with 6 million rows in mind, not just 60
– Code optimization paths for large datasets
• Process & Community
– Stay close to trunk, even though it is hard
– Explore options for a continuously deployable trunk
How Can We Adapt for Scale Issues?
Learning to Scale OpenStack
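The "non-disruptive DB migration patterns" and "DB calls with 6 million rows in mind" points suggest batched backfills rather than one giant UPDATE that locks the table. A minimal sketch of the batching idea, using sqlite3 for illustration; the `instances`/`deleted_at` table and column names are stand-ins, not the actual Nova schema or migration:

```python
# Sketch of a batched backfill: update rows in small chunks so the
# migration never holds a long lock across millions of rows.
# Table and column names here are hypothetical.
import sqlite3

def backfill_in_batches(conn, batch_size=1000):
    """Set deleted_at for soft-deleted rows, batch_size rows at a time."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE instances SET deleted_at = '1970-01-01' "
            "WHERE deleted = 1 AND deleted_at IS NULL "
            "AND id IN (SELECT id FROM instances "
            "           WHERE deleted = 1 AND deleted_at IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # release locks between batches
        if cur.rowcount == 0:
            return total
        total += cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances (id INTEGER PRIMARY KEY, deleted INT, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO instances (deleted, deleted_at) VALUES (?, NULL)",
    [(i % 2,) for i in range(5000)],
)
updated = backfill_in_batches(conn, batch_size=500)
print(updated)  # 2500
```

The design choice is the commit between batches: each chunk is a short transaction, so normal traffic can interleave with the migration instead of queuing behind it.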
9.
Backup Slides
Learning to Scale OpenStack
Many of these backup slides were first presented on 4/16/2013 during the
OpenStack Summit session “Deploying from OpenStack Trunk” and are
included here for reference.
10.
Merge and Branch Strategy
Learning to Scale OpenStack
• The most recent Rackspace release branch took over 50 minor tags to make it work in production
• The Rackspace development branch is about 40 patches on top of OpenStack trunk for internal service compatibility
11.
• Package – per-project venv; .tar of project venvs + configs
• Distribute – seed .torrent; distribute fact files; verify completion
• Execute – switch version; sync databases; run puppet; verify completion
Package and Distribute Strategy
Learning to Scale OpenStack
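The package step described here (per-project venvs tarred up with configs) can be sketched as a small bundling routine. The directory layout and project names below are made up for illustration; the real pipeline would tar actual virtualenv trees:

```python
# Sketch of the "package" step: bundle per-project virtualenv directories
# plus a configs directory into a single tarball. Paths are hypothetical.
import os
import tarfile
import tempfile

def build_release_tarball(root, projects, out_path):
    """Tar each project's venv dir plus configs; return the member names."""
    with tarfile.open(out_path, "w:gz") as tar:
        for proj in projects:
            tar.add(os.path.join(root, proj), arcname=proj)
        tar.add(os.path.join(root, "configs"), arcname="configs")
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()

# Simulate the build tree with empty directories
root = tempfile.mkdtemp()
for d in ("nova", "glance", "configs"):
    os.makedirs(os.path.join(root, d))

names = build_release_tarball(root, ["nova", "glance"],
                              os.path.join(root, "release.tgz"))
top_level = sorted(n for n in names if "/" not in n)
print(top_level)  # ['configs', 'glance', 'nova']
```

One artifact per release keeps the distribute step simple: a single file can be torrent-seeded and checksummed, rather than syncing thousands of individual packages.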
12.
Deploy and Test Strategy
Learning to Scale OpenStack
• Dev – pre-code check-in validation
• Integration – smoke tests; unit tests
• QA – functional tests; integration tests
• Pre-Prod – regression tests; build tests
• Production – smoke tests; build tests
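The environment-to-test-suite matrix above lends itself to being expressed as data that a promotion pipeline iterates over. A minimal sketch; the runner is a stub and the suite names come straight from the slide:

```python
# Sketch of a stage-gated promotion: each environment's suites must pass
# before code moves to the next environment. Suite names are from the
# slide; the runner itself is a placeholder.

TEST_MATRIX = {
    "Dev": ["pre-code check-in validation"],
    "Integration": ["smoke tests", "unit tests"],
    "QA": ["functional tests", "integration tests"],
    "Pre-Prod": ["regression tests", "build tests"],
    "Production": ["smoke tests", "build tests"],
}

def promote(run_suite):
    """Run each environment's suites in order; stop at the first failure."""
    passed = []
    for env, suites in TEST_MATRIX.items():
        if not all(run_suite(env, suite) for suite in suites):
            return passed
        passed.append(env)
    return passed

envs = promote(lambda env, suite: True)  # stub: every suite passes
print(envs)  # ['Dev', 'Integration', 'QA', 'Pre-Prod', 'Production']
```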
13.
Benefits and Challenges of Trunk Deploys
Learning to Scale OpenStack
Why We Do It (Benefits)
• Issue Resolution
– Early detection of issues and conflicts
– Shorter feedback loop within the community
– Faster resolution of issues
• Early Feature Delivery
– Smaller, incremental periodic releases
– More stable release candidates at end of cycle
Why It's Hard (Challenges)
• Code Management
– Merge conflicts with local patches
– Disruptive DB migrations
– Service restarts
– Temporary version skew
• Testing
– Devstack-based testing vs testing at scale
– Rework when issues found in RAX deploy pipeline
• Process
– CI/CD vs release methodology
– Time to merge patches
14.
Scale of Deploy Pipeline
Learning to Scale OpenStack
Dev (DevStack) → Integration & QA (10s of nodes) → PreProd (100s of nodes) → Production (1,000s of nodes)
What we as a Community are doing is hard. And a little scary. And did I mention hard? We stumble. A lot. We help each other back up, we dust ourselves off, we say “that was hard” and then we dive back in, hopefully realizing that the stumble wasn’t just failure, but growth and learning and opportunity. Over the last year, there has been lots of opportunity for growth at Rackspace in learning to scale OpenStack within our public Open Cloud deployment.
For Rackspace, "at scale" means deploying to our expanding, multi-region global cloud. A region is made of one or more cells, which in turn are each made of hundreds of hypervisors. Rackspace is adding more regions in the next year, and each region is trending quickly toward more than a dozen cells.
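The global cloud → region → cell → hypervisor hierarchy implies some rough capacity math. A quick sketch; the counts below are hypothetical round numbers chosen for illustration, not Rackspace's actual figures:

```python
# Rough capacity model for the Global Cloud -> Region -> Cell -> Hypervisor
# hierarchy. All counts are hypothetical, for illustration only.

def total_hypervisors(regions, cells_per_region, hvs_per_cell):
    """Total hypervisors (HVs) across a global deployment."""
    return regions * cells_per_region * hvs_per_cell

# e.g. 3 regions, each with a dozen cells of ~500 hypervisors
count = total_hypervisors(3, 12, 500)
print(count)  # 18000 -- already in the "tens of thousands of HVs" tier
```

The point of the model: adding a region or a cell multiplies, rather than adds, the node count the deploy mechanism has to reach.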
The basic strategy we use to deploy OpenStack onto our public cloud is simple. We take the OpenStack code, package it up with a few local integration modifications, distribute the package, execute the code in the package, and then verify that it works. Simple in concept, but not necessarily easy in execution.
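The package → distribute → execute → verify flow described above can be sketched as a fail-fast orchestrator. The step names come from the talk; the step bodies are placeholders (assumptions), since the real implementations are venv builds, torrent seeds, puppet runs, and smoke tests:

```python
# Minimal sketch of the package -> distribute -> execute -> verify release
# flow. Step implementations are placeholders; the ordering and the
# fail-fast behavior are the point.

def run_release(steps, log):
    for name, step in steps:
        log.append(name)
        if not step():
            raise RuntimeError("release halted at step: " + name)

log = []
steps = [
    ("package", lambda: True),     # build venvs, tar with configs
    ("distribute", lambda: True),  # push the artifact to every node
    ("execute", lambda: True),     # switch versions, run config mgmt
    ("verify", lambda: True),      # smoke and build tests
]
run_release(steps, log)
print(log)  # ['package', 'distribute', 'execute', 'verify']
```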
Our initial deploy mechanism used pssh to push the deployment package, a Debian package, out to all the nodes. A central puppet master in each region handled configuration management for all the nodes and reported on the status of puppet runs. We prestaged the package earlier in the day, as it could take more than 30 minutes to pssh to all the nodes in the region. Once prestaged, we'd start the deploy scripts. Most nights, we'd be done within 2 hours, including verification through smoke and build tests. We had been working on a new deploy mechanism, knowing that we were going to outgrow our current process eventually, but it was difficult as the core people building the new mechanism were also the experts in the existing process. By January, we accepted we couldn't keep going at this rate and called a halt to code releases to focus on improving the mechanism.
We completed the deploy mechanism improvement project and implemented it in all production regions by early March. We upgraded the mechanism without changing the code so that we could minimize the changed elements. The new mechanism worked! The virtual environment based packaging reduced the dependency issues we would run into during deploys. We used torrent to seed the package out to all the nodes in a matter of seconds. Masterless puppet removed the central bottleneck that puppet master had become. Mcollective actions kicked everything off and reported on progress. We declared it a success and looked to get our code releases back on track.
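One intuition behind the torrent seeding being so much faster than the old pssh push: nodes that already have the artifact re-seed it to peers, so coverage roughly doubles each exchange round instead of growing linearly from one source. A toy model, idealized and not a description of the actual BitTorrent client behavior:

```python
# Toy model contrasting one-source serial pushes with an idealized swarm
# where every node that has the file seeds one new peer per round.
import math

def serial_push_rounds(nodes, pushes_per_round=1):
    """Rounds needed when a single source pushes to a fixed number of nodes per round."""
    return math.ceil(nodes / pushes_per_round)

def swarm_rounds(nodes):
    """Idealized swarm: coverage doubles each round (1, 2, 4, 8, ...)."""
    return math.ceil(math.log2(nodes)) if nodes > 1 else 0

print(serial_push_rounds(1024))  # 1024 rounds from one source
print(swarm_rounds(1024))        # 10 rounds when coverage doubles
```

Even with heavy parallelism on the serial side, the swarm's logarithmic growth is why "all the nodes in a matter of seconds" becomes plausible.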
We'd resolved our deploy mechanism issues, and it was time to catch up to trunk. We were nearly 2 months behind trunk, Grizzly feature freeze had just passed, and we knew it was going to be a challenge getting back into our previous 2-week cycle. We tagged trunk as of 2/28/2013 as "v152" and deployed it to our internal pipeline. Several instance faults and tracebacks were discovered and fixed in the v152 line, with the fixes also submitted back up to trunk. The DB migration for deleted_at was going to be massive due to the size of our databases, so we did some much-needed maintenance on the affected table rows. The migration to include instance type data as key-value pairs in the metadata table was concerning, but we'd been through large migrations before and were confident we could fix whatever issues arose. Once we deployed the new code to our first data center, though, we knew we had a whole new hurdle to overcome.
Check out Wednesday's session at 4:30pm on how Rackspace is "Learning to Scale OpenStack" for the story behind the most recent internal release branch!