At the Portland and Hong Kong Summits, Rackspace invited the OpenStack community into their experiences deploying OpenStack trunk to their Public Cloud infrastructure. In this presentation, Rackspace's Deployment System Team will provide an update on the latest challenges, triumphs, and lessons learned deploying and operating a production OpenStack public cloud during the Icehouse cycle. We’ll conclude by sharing the vision for our next steps in OpenStack deployments during the Juno cycle and beyond.
Learning to Scale OpenStack: An Update from the Rackspace Public Cloud
1. An update from the Rackspace Public Cloud
Learning to Scale OpenStack
Rainya Mosher and Jesse Keating – Deployment Engineering
@rainyamosher @iamjkeating
2. #rackstackatl
The Rackspace
Public Cloud
6 Public Regions
3 Pre-Production Regions
10s of Thousands of nodes
Growing continually
Frequent deployments
Staying aligned with upstream
#rackstackatl
3. #rackstackatl
Our Old Challenges
• We could not deploy code in a reasonable window of time
• We did not have confidence in the code we were deploying
• We could not keep up with upstream
4. #rackstackatl
Old Challenges Met
• Deploys taking 6+ hours → deploys take an hour, as short as 10 minutes
• Deploys often failed the first time → deploys rarely fail the first time
• Migrations were an unknown factor → migrations tested upstream and timed downstream
• Deploys roughly 2 months behind upstream → still up to 2 months behind
5. #rackstackatl
It is by riding a bicycle that you learn the contours of a country best, since you have to sweat up the hills and coast down them.
~ Ernest Hemingway
8. #rackstackatl
Scaling Glance
• Scheduled Images feature went live
• Glance saw much more usage
• Glance servers became saturated
• Builds and snapshots slowed down, eventually piling up faster than could be consumed
• Resolved by:
– Scaling number of glance-api nodes
– Scaling size of glance-api nodes
– Scaling use of glance-bypass feature
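The remediation was mostly horizontal. As a rough sketch of the idea (not Rackspace's actual tooling), a caller can spread image traffic across a pool of glance-api nodes; the endpoint names below are hypothetical, and in a real deployment a load balancer would normally do this instead:

```python
import itertools
import requests

# Hypothetical pool of horizontally scaled glance-api endpoints.
GLANCE_ENDPOINTS = itertools.cycle([
    "http://glance-api-01:9292",
    "http://glance-api-02:9292",
    "http://glance-api-03:9292",
])

def list_images(token):
    """Send each image-list request to the next glance-api node in turn."""
    endpoint = next(GLANCE_ENDPOINTS)
    resp = requests.get(
        f"{endpoint}/v2/images",
        headers={"X-Auth-Token": token},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["images"]
```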
9. #rackstackatl
Scaling Nova Cells
• Performance Cells went live
• More and more cells added to regions
• Nova cells service became a single funnel slowing down the exchange of data
• Eventually our single nova-cells service could not consume messages faster than they were being produced
• Resolved by:
– Scaling number of nova-cells services
– Optimizing instance healing calls
– Optimizing database usage from cells service
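A minimal sketch of the first fix, the competing-consumers pattern: several identical workers draining one queue, so the broker spreads messages across them. This uses kombu (the AMQP library OpenStack builds on); the queue and broker names are invented:

```python
from kombu import Connection, Queue
from kombu.mixins import ConsumerMixin

CELLS_QUEUE = Queue("cells.intercell", durable=True)  # hypothetical queue name

class CellsWorker(ConsumerMixin):
    """One of N identical workers draining the same cells queue.

    The broker hands each message to exactly one worker, so adding
    workers raises consume throughput without code changes.
    """

    def __init__(self, connection):
        self.connection = connection

    def get_consumers(self, Consumer, channel):
        return [Consumer(queues=[CELLS_QUEUE], callbacks=[self.handle])]

    def handle(self, body, message):
        # A real nova-cells service would route/aggregate the message here.
        print("processed:", body)
        message.ack()

if __name__ == "__main__":
    # Start one worker; run this script on several hosts to scale out.
    with Connection("amqp://guest:guest@rabbit:5672//") as conn:
        CellsWorker(conn).run()
```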
10. #rackstackatl
How do we anticipate where our growth will hurt and proactively scale to match?
12. #rackstackatl
Higher Form Orchestration
• Pre-staging content outside of deploy window
• Increased tolerance of “downed” hosts
• Targeted bring up of services
– API first, then computes
• More deployment options
– Factonly
– Cellonly
– No migrations
• Reduced complexity
– Single entry point: bin/deploy
– Single orchestration system: Ansible
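A hypothetical sketch of what a single entry point with deploy modes can look like; the playbook names and tags below are invented, not the real bin/deploy:

```python
#!/usr/bin/env python
"""Toy single-entry-point deploy wrapper: one front door that selects
a deploy mode and hands off to one orchestration system (Ansible)."""
import argparse
import subprocess

MODES = {
    "full": "site.yml",       # everything: facts, code, migrations
    "factonly": "facts.yml",  # only refresh host facts/inventory data
    "cellonly": "cells.yml",  # only touch one cell's hosts
}

def main():
    parser = argparse.ArgumentParser(prog="bin/deploy")
    parser.add_argument("--mode", choices=MODES, default="full")
    parser.add_argument("--skip-migrations", action="store_true")
    args = parser.parse_args()

    cmd = ["ansible-playbook", MODES[args.mode]]
    if args.skip_migrations:
        cmd += ["--skip-tags", "migrations"]  # tag name is illustrative
    subprocess.check_call(cmd)

if __name__ == "__main__":
    main()
```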
13. #rackstackatl
We still treat OpenStack as a legacy software deployment. As a community we need to treat it more like a cloud application, but that requires collaboration!
16. #rackstackatl
Scaling Change
• New features coming
• New configurations coming
• Accommodate without interrupting customer experience
• Change faster, change frequently, on an ever growing fleet of systems
• Resolved by:
– Understanding change before it happens
– Scheduling changes to not conflict
– Dedicating release iterations to risky change on top of known good code
– Custom deploy modes per change type
19. #rackstackatl
Zero Perceived Downtime
• Leverage object model in Icehouse for mixed-version services
• Implement Nova conductor service
• Investigate read-only states
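To make the first bullet concrete, here is a toy illustration of the versioned-object backport pattern that Nova's object model enables; the class and field below are invented, not Nova's actual code:

```python
class ToyInstance:
    """Toy versioned object, illustrating the backport pattern.

    A 1.1 sender can strip fields a 1.0 receiver doesn't understand,
    which is what lets mixed-version services talk during a rolling
    upgrade.
    """
    VERSION = "1.1"

    def __init__(self, uuid, host, ephemeral_key=None):
        self.fields = {"uuid": uuid, "host": host,
                       "ephemeral_key": ephemeral_key}  # added in 1.1

    def obj_make_compatible(self, primitive, target_version):
        # Downgrade the serialized form for an older peer.
        if target_version == "1.0":
            primitive.pop("ephemeral_key", None)
        return primitive

    def obj_to_primitive(self, target_version=None):
        primitive = dict(self.fields)
        if target_version and target_version != self.VERSION:
            primitive = self.obj_make_compatible(primitive, target_version)
        return {"version": target_version or self.VERSION,
                "data": primitive}

# A new-side service sending to an old-side peer pins the version:
wire_msg = ToyInstance("abc-123", "compute42").obj_to_primitive("1.0")
```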
20. #rackstackatl
Individual Service Deployment Pipelines
• Can we give Glance its own pipeline and deployment capability, independent of Nova or other services?
• How do we combat the exponential growth of service version combinations?
• Does this actually make the whole pipeline any faster?
21. #rackstackatl
Fully Automated Environments
• Creating not just ephemeral environments, but production ones as well
• Upgrades are easy, initial setups are a lot harder
• Validation is critical
• Developers and Operators need to collaborate on this use case when services are being designed
22. #rackstackatl
I have always struggled to achieve excellence. One thing that cycling has taught me is that if you can achieve something without a struggle it's not going to be satisfying.
~ Greg LeMond
A review of the Rackspace Public Cloud sets the context for the conversation
This is our third summit presenting on this topic. Here is a brief review of some of the scale issues we were facing back at the Havana Summit in Portland
Our window was 30 minutes of perceived downtime inside 4-hour deploy windows
Code coverage wasn't great; lots of errors were discovered in production
Upstream moved very fast, and we couldn't keep up with all the testing downstream
Here is a comparison of how we met some of our challenges
Our deploys are much faster, some as short as 10 minutes total in our largest environment with 3 minutes of API interruption
Deploys are now more reliable
Migration data is known ahead of time (and bad ones blocked upstream)
We still haven't solved keeping up with upstream. Many factors there.
We are also learning the contours of OpenStack by being the largest public cloud operator. We get to sweat up the hills and coast back down.
Our new challenges are about scaling, not just deploying bits onto nodes as fast as we can:
Scaling services
Scaling Deployments
Scaling Frequency
While we are trying to be a thought leader and front runner, collaboration is the key to success. The developer, operator, and testing communities need to be aware of these scaling challenges
Scaling Services – As the size of our cloud grows and its feature set grows, the services we use need to scale along with them. Here we will walk through two scaling scenarios that highlight the challenge.
Glance is an interesting case. Our Glance acts as an intermediary between hypervisors (HVs) and Swift. As Glance got used more, the bottleneck emerged, partly due to our own configuration, but partly due to the nature of Glance.
Once we resolved the Glance issues, Swift could be the next bottleneck; care will be needed to make sure we don't just kick performance problems down the line to the next group.
Nova cells is responsible for interacting between the global cell and all the child cells. Doing this with just a single instance was never going to scale; we just ran out of runway before the pain hit.
Through collaboration with upstream, we are now more able to scale out nova-cells as our cell counts grow.
These challenges will repeat. New bottlenecks will be found and new resource limits will be discovered. Staying ahead of the pain is key. We will not be the only ones to experience this; we are looking for collaboration on how best to manage this kind of scale.
Our next scale challenge involves deployments.
We made great strides around Havana, what have we been doing since?
Orchestration has been our theme around deployments. We continue to iterate on the parts of the deployment causing the most pain, always making improvements for the next time.
Walk through each block and explain why the change was made
Even with the improvements, we still treat OpenStack like a legacy application: upgrading in place, not utilizing load balancers, stopping everything to migrate databases, preventing mixed versions, etc. There are many things preventing us from getting to zero downtime, and that's where we can all work together!
A third scale challenge is frequency. This is the scale of doing things much more often.
A very relevant quote, though unlike bicycling, when you do something more often in the DevOps world it does tend to get easier. Still, there are challenges to going faster!
Change comes from many sources. These changes need to be distributed to the environments, but with as little customer impact as possible. If we can't deploy changes often enough, we fall behind upstream, we fall behind our features, and we have larger deployments to consume. A snowball effect.
Our work on creating multiple new release pipelines, improving our deployment methods, and moving our tests upstream has enabled us to move faster, but not fast enough.
This is our limit. We absolutely have to make this better. This is a global need, throughout the community of developers, operators, and testers.
A quick look at what we've got cooking for the Juno cycle
In Icehouse, Nova made great strides toward live upgrade with the object model and conductor, which give us the ability to run multiple versions of OpenStack at the same time; notably, we could run a newer nova-api against an older version in the rest of the environment and shield nova-compute from migrations. This could allow us to roll the update through with no API downtime and less interruption to the computes.
Investigate putting API nodes in read-only during migrations to satisfy some requests and queue others
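A sketch of what that read-only state might look like as a WSGI middleware in front of the API: reads pass through, writes get a 503 with a Retry-After hint (a simple stand-in for real queueing). Exploratory only, with an arbitrary flag-file path:

```python
# Sketch of a read-only gate for API nodes during DB migrations.
import os

MAINTENANCE_FLAG = "/etc/nova/read_only"   # touched while migrations run
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}

class ReadOnlyMiddleware:
    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        in_maintenance = os.path.exists(MAINTENANCE_FLAG)
        if in_maintenance and environ["REQUEST_METHOD"] not in SAFE_METHODS:
            # Reject writes but tell clients when to try again.
            start_response("503 Service Unavailable",
                           [("Retry-After", "120"),
                            ("Content-Type", "text/plain")])
            return [b"API is read-only during maintenance\n"]
        return self.app(environ, start_response)
```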
This is an ongoing conversation. If we allow each service to work independently, what does that do to the version test matrix? Can we reliably validate anything? While individual projects/services might go faster, does that allow the entire pipeline to go faster? This ties into the discussions happening now at the design summit about cross project interactions.
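The growth itself is easy to quantify: with S independently deployed services, each running one of V possible versions, there are V**S mixed-version states to validate. A quick illustration:

```python
from itertools import product

# Three services, each independently at one of two releases.
services = {"nova": ["icehouse", "juno"],
            "glance": ["icehouse", "juno"],
            "neutron": ["icehouse", "juno"]}

combos = list(product(*services.values()))
print(len(combos))  # 2**3 = 8 mixed-version states to validate
```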
Yeah, we need them. Setting them up is hard; let's work together to make them easier.
The ops meetups are great for collaborating on the issues at hand.
We do a lot of things that are hard, but if it wasn't hard, it wouldn't be as satisfying. That's what keeps us coming back.
Scaling is more than just tossing code on nodes. There are a lot more considerations to take into account.
The development, operator, and tester communities need to collaborate more on where the painful parts are, particularly at scale, and work together on solutions.