Learning to Scale OpenStack: A Case Study in Rackspace's Open Cloud Deployment was presented at the OpenStack Design Summit in Portland, OR on April 17, 2013. Watch the recording of the presentation on YouTube: http://www.youtube.com/watch?v=3x8X6f5mnzc
1. Rainya Mosher, Dev Manager, Deploy Infrastructure
IRC: rainya on freenode Twitter: @rainyamosher
Learning to Scale OpenStack:
A Case Study in Rackspace's
Open Cloud Deployment
April 17, 2013 at 4:30pm
2. RACKSPACE® HOSTING | WWW.RACKSPACE.COM
It is not the critic who counts; not the man who points out
how the strong man stumbles, or where the doer of deeds
could have done them better. The credit belongs to the man
who is actually in the arena, whose face is marred by dust
and sweat and blood; who strives valiantly; . . . who at best
knows in the end the triumph of high achievement, and who
at worst, if he fails, at least fails while daring greatly.
Theodore Roosevelt
The Man in the Arena, April 1910
In the Arena
Learning to Scale OpenStack
3.
[Diagram: Global Cloud → Regions → Cells → Hypervisors (HVs), scaling from hundreds of HVs per cell, to thousands per region, to tens and hundreds of thousands globally]
What does “At Scale” Mean?
Learning to Scale OpenStack
4.
Code → Package → Deploy → Verify
What is the Control Plane Release Strategy?
Learning to Scale OpenStack
5.
First Scaling Hurdle – Deploy Mechanism
Learning to Scale OpenStack
• Aug 2012
– Rackspace launches Open Cloud
– Frequent releases to fine-tune
• Sep 2012 through Nov 2012
– Deploying code that is two weeks from trunk takes about two hours
– Begin designing new deploy mechanism at October Summit
• Dec 2012
– Code deploys take 4–6 hours
– Deploy team says, bleary-eyed, they aren't doing it again
• Jan 2013
– Deploy again; takes more than 6 hours
– Accept that it is no longer "reasonable" and temporarily stop deploying code releases
– Focus on the deploy mechanism
[Chart: internal code releases per month vs. capacity, Aug 2012 – Feb 2013, with a linear trend line for internal code releases]
6.
• Package – switched from Debian packages to virtual environments
• Distribute – used torrent for the package, pssh for fact files, and mcollective for actions
• Execute – moved from a centralized puppet master to decentralized masterless puppet
Improving the Deploy Mechanism
Deploying from OpenStack Trunk
7.
Second Scaling Hurdle – Catch up to Trunk
Learning to Scale OpenStack
• March 2013
– Production code is 2 months behind trunk
– Trunk as of 2/28 becomes our "v152" and bakes in preprod
– Prep for impacting DB migrations in production
– Re-enable our CI process
• April 2013
– Deploy v152 to production
– 10x increase in DB traffic
– Community works to fix
– Re-deploy v152 with Community fixes
– Attend Summit in Portland and share the story
[Chart annotations: 1 – Normal DB throughput; 2 – First installation of v152; 3 – Disabled several periodic tasks; 4 – Re-installed v152 with patches from Community & turned periodic tasks back on]
8.
• Testing & Environments
– More robust testing coverage
– Deployer-specific testing further upstream
– Production-like dev environments
– Simulate production compute numbers on non-production hardware
• Database & Code Management
– Non-disruptive DB migration patterns
– DB calls with 6 million rows in mind, not just 60
– Code optimization paths for large datasets
• Process & Community
– Stay close to trunk, even though it is hard
– Explore options for a continuously deployable trunk
How Can We Adapt for Scale Issues?
Learning to Scale OpenStack
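The "non-disruptive DB migration patterns" and "DB calls with 6 million rows in mind" points suggest batched backfills rather than one giant UPDATE that locks the table. A minimal sketch of the batching idea, using sqlite3 for illustration; the `instances`/`deleted_at` table and column names are stand-ins, not the actual Nova schema or migration:

```python
# Sketch of a batched backfill: update rows in small chunks so the
# migration never holds a long lock across millions of rows.
# Table and column names here are hypothetical.
import sqlite3

def backfill_in_batches(conn, batch_size=1000):
    """Set deleted_at for soft-deleted rows, batch_size rows at a time."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE instances SET deleted_at = '1970-01-01' "
            "WHERE deleted = 1 AND deleted_at IS NULL "
            "AND id IN (SELECT id FROM instances "
            "           WHERE deleted = 1 AND deleted_at IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # release locks between batches
        if cur.rowcount == 0:
            return total
        total += cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instances (id INTEGER PRIMARY KEY, deleted INT, deleted_at TEXT)"
)
conn.executemany(
    "INSERT INTO instances (deleted, deleted_at) VALUES (?, NULL)",
    [(i % 2,) for i in range(5000)],
)
updated = backfill_in_batches(conn, batch_size=500)
print(updated)  # 2500
```

The design choice is the commit between batches: each chunk is a short transaction, so normal traffic can interleave with the migration instead of queuing behind it.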
9.
Backup Slides
Learning to Scale OpenStack
Many of these backup slides were first presented on 4/16/2013 during the
OpenStack Summit session “Deploying from OpenStack Trunk” and are
included here for reference.
10.
Merge and Branch Strategy
Learning to Scale OpenStack
• The most recent Rackspace release branch took over 50 minor tags to make it work in production
• The Rackspace development branch is about 40 patches on top of OpenStack trunk for internal service compatibility
11.
• Package – per-project venv; .tar of project venvs + configs
• Distribute – seed .torrent; distribute fact files; verify completion
• Execute – switch version; sync databases; run puppet; verify completion
Package and Distribute Strategy
Learning to Scale OpenStack
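The package step described here (per-project venvs tarred up with configs) can be sketched as a small bundling routine. The directory layout and project names below are made up for illustration; the real pipeline would tar actual virtualenv trees:

```python
# Sketch of the "package" step: bundle per-project virtualenv directories
# plus a configs directory into a single tarball. Paths are hypothetical.
import os
import tarfile
import tempfile

def build_release_tarball(root, projects, out_path):
    """Tar each project's venv dir plus configs; return the member names."""
    with tarfile.open(out_path, "w:gz") as tar:
        for proj in projects:
            tar.add(os.path.join(root, proj), arcname=proj)
        tar.add(os.path.join(root, "configs"), arcname="configs")
    with tarfile.open(out_path, "r:gz") as tar:
        return tar.getnames()

# Simulate the build tree with empty directories
root = tempfile.mkdtemp()
for d in ("nova", "glance", "configs"):
    os.makedirs(os.path.join(root, d))

names = build_release_tarball(root, ["nova", "glance"],
                              os.path.join(root, "release.tgz"))
top_level = sorted(n for n in names if "/" not in n)
print(top_level)  # ['configs', 'glance', 'nova']
```

One artifact per release keeps the distribute step simple: a single file can be torrent-seeded and checksummed, rather than syncing thousands of individual packages.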
12.
Deploy and Test Strategy
Learning to Scale OpenStack
• Dev – pre-code check-in validation
• Integration – smoke tests; unit tests
• QA – functional tests; integration tests
• Pre-Prod – regression tests; build tests
• Production – smoke tests; build tests
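The environment-to-test-suite matrix above lends itself to being expressed as data that a promotion pipeline iterates over. A minimal sketch; the runner is a stub and the suite names come straight from the slide:

```python
# Sketch of a stage-gated promotion: each environment's suites must pass
# before code moves to the next environment. Suite names are from the
# slide; the runner itself is a placeholder.

TEST_MATRIX = {
    "Dev": ["pre-code check-in validation"],
    "Integration": ["smoke tests", "unit tests"],
    "QA": ["functional tests", "integration tests"],
    "Pre-Prod": ["regression tests", "build tests"],
    "Production": ["smoke tests", "build tests"],
}

def promote(run_suite):
    """Run each environment's suites in order; stop at the first failure."""
    passed = []
    for env, suites in TEST_MATRIX.items():
        if not all(run_suite(env, suite) for suite in suites):
            return passed
        passed.append(env)
    return passed

envs = promote(lambda env, suite: True)  # stub: every suite passes
print(envs)  # ['Dev', 'Integration', 'QA', 'Pre-Prod', 'Production']
```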
13.
Benefits and Challenges of Trunk Deploys
Learning to Scale OpenStack
Why We Do It (Benefits)
• Issue Resolution
– Early detection of issues and conflicts
– Shorter feedback loop within the community
– Faster resolution of issues
• Early Feature Delivery
– Smaller, incremental periodic releases
– More stable release candidates at end of cycle
Why It's Hard (Challenges)
• Code Management
– Merge conflicts with local patches
– Disruptive DB migrations
– Service restarts
– Temporary version skew
• Testing
– Devstack-based testing vs testing at scale
– Rework when issues found in RAX deploy pipeline
• Process
– CI/CD vs release methodology
– Time to merge patches
14.
Scale of Deploy Pipeline
Learning to Scale OpenStack
Dev (DevStack) → Integration & QA (10s of nodes) → PreProd (100s of nodes) → Production (1,000s of nodes)
What we as a Community are doing is hard. And a little scary. And did I mention hard? We stumble. A lot. We help each other back up, we dust ourselves off, we say “that was hard” and then we dive back in, hopefully realizing that the stumble wasn’t just failure, but growth and learning and opportunity. Over the last year, there has been lots of opportunity for growth at Rackspace in learning to scale OpenStack within our public Open Cloud deployment.
For Rackspace, "at scale" means deploying to our expanding, multi-region global cloud. A region is made of one or more cells, which in turn are each made of hundreds of hypervisors. Rackspace is adding more regions in the next year, and each region is trending quickly toward more than a dozen cells.
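The global cloud → region → cell → hypervisor hierarchy implies some rough capacity math. A quick sketch; the counts below are hypothetical round numbers chosen for illustration, not Rackspace's actual figures:

```python
# Rough capacity model for the Global Cloud -> Region -> Cell -> Hypervisor
# hierarchy. All counts are hypothetical, for illustration only.

def total_hypervisors(regions, cells_per_region, hvs_per_cell):
    """Total hypervisors (HVs) across a global deployment."""
    return regions * cells_per_region * hvs_per_cell

# e.g. 3 regions, each with a dozen cells of ~500 hypervisors
count = total_hypervisors(3, 12, 500)
print(count)  # 18000 -- already in the "tens of thousands of HVs" tier
```

The point of the model: adding a region or a cell multiplies, rather than adds, the node count the deploy mechanism has to reach.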
The basic strategy we use to deploy OpenStack onto our public cloud is simple. We take the OpenStack code, package it up with a few local integration modifications, distribute the package, execute the code in the package, and then verify that it works. Simple in concept, but not necessarily easy in execution.
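The package → distribute → execute → verify flow described above can be sketched as a fail-fast orchestrator. The step names come from the talk; the step bodies are placeholders (assumptions), since the real implementations are venv builds, torrent seeds, puppet runs, and smoke tests:

```python
# Minimal sketch of the package -> distribute -> execute -> verify release
# flow. Step implementations are placeholders; the ordering and the
# fail-fast behavior are the point.

def run_release(steps, log):
    for name, step in steps:
        log.append(name)
        if not step():
            raise RuntimeError("release halted at step: " + name)

log = []
steps = [
    ("package", lambda: True),     # build venvs, tar with configs
    ("distribute", lambda: True),  # push the artifact to every node
    ("execute", lambda: True),     # switch versions, run config mgmt
    ("verify", lambda: True),      # smoke and build tests
]
run_release(steps, log)
print(log)  # ['package', 'distribute', 'execute', 'verify']
```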
Our initial deploy mechanism used pssh to push the deployment package, a Debian package, out to all the nodes. A central puppet master in each region handled configuration management for all the nodes and reported on the status of puppet runs. We prestaged the package earlier in the day, as it could take more than 30 minutes to pssh to all the nodes in the region. Once prestaged, we'd start the deploy scripts. Most nights, we'd be done within 2 hours, including verification through smoke and build tests. We had been working on a new deploy mechanism, knowing that we were going to outgrow our current process eventually, but it was difficult as the core people building the new mechanism were also the experts in the existing process. By January, we accepted we couldn't keep going at this rate and called a halt to code releases to focus on improving the mechanism.
We completed the deploy mechanism improvement project and implemented it in all production regions by early March. We upgraded the mechanism without changing the code so that we could minimize the changed elements. The new mechanism worked! The virtual environment based packaging reduced the dependency issues we would run into during deploys. We used torrent to seed the package out to all the nodes in a matter of seconds. Masterless puppet removed the central bottleneck that puppet master had become. Mcollective actions kicked everything off and reported on progress. We declared it a success and looked to get our code releases back on track.
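One intuition behind the torrent seeding being so much faster than the old pssh push: nodes that already have the artifact re-seed it to peers, so coverage roughly doubles each exchange round instead of growing linearly from one source. A toy model, idealized and not a description of the actual BitTorrent client behavior:

```python
# Toy model contrasting one-source serial pushes with an idealized swarm
# where every node that has the file seeds one new peer per round.
import math

def serial_push_rounds(nodes, pushes_per_round=1):
    """Rounds needed when a single source pushes to a fixed number of nodes per round."""
    return math.ceil(nodes / pushes_per_round)

def swarm_rounds(nodes):
    """Idealized swarm: coverage doubles each round (1, 2, 4, 8, ...)."""
    return math.ceil(math.log2(nodes)) if nodes > 1 else 0

print(serial_push_rounds(1024))  # 1024 rounds from one source
print(swarm_rounds(1024))        # 10 rounds when coverage doubles
```

Even with heavy parallelism on the serial side, the swarm's logarithmic growth is why "all the nodes in a matter of seconds" becomes plausible.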
We'd resolved our deploy mechanism issues, and it was time to catch up to trunk. We were nearly 2 months behind trunk, Grizzly feature freeze had just passed, and we knew it was going to be a challenge getting back into our previous 2-week cycle. We tagged trunk as of 2/28/2013 as "v152" and deployed it to our internal pipeline. Several instance faults and tracebacks were discovered and fixed in the v152 line, with the fixes also submitted back up to trunk. The DB migration for deleted_at was going to be massive due to the size of our databases, so we did some much-needed maintenance on the affected table rows. The migration to include instance type data as key-value pairs in the metadata table was concerning, but we'd been through large migrations before and were confident we could fix whatever issues arose. Once we deployed the new code to our first data center, though, we knew we had a whole new hurdle to overcome.
Check out Wednesday's session at 4:30pm on how Rackspace is "Learning to Scale OpenStack" for the story behind the most recent internal release branch!