An intro to the pipeline & related tools we built for a CI/CD workflow that builds and maintains a package-based OpenStack installation, with realistic, portable multi-machine development environments.
Good afternoon, my name is Simon McCartney & today I would like to talk to you about a continuous delivery pipeline we built to help us maintain an OpenStack private cloud.
There should have been two of us on stage today, but one of my colleagues, Mick Gregg, decided to change jobs at the last minute, so he was unable to join us. I have to thank him for all his hard work on this project & this presentation – thank you Mick, it's unfortunate that you can't be here today.
This project predates HP Helion OpenStack, so it’s not about tripleO.
We're using Ubuntu 12.04 as our base operating system, Ubuntu's OpenStack packages & SaltStack for configuration management & orchestration, but that's largely irrelevant: our real challenge was building a pipeline that worked with packaged OpenStack deploys & realistic multi-node setups for dev & test. Your environment may vary, however many of the principles are transferable.
Why Continuous Integration, Continuous Delivery & the pipeline that comes with a CI/CD environment?
Let's walk through the advantages we saw from previous experience that motivated us to build our pipeline.
Continuous Delivery is a software development & deployment strategy that enables organizations to deliver new features to users as fast and efficiently as possible. The core idea of CD is to create a repeatable, reliable and incrementally improving process for taking software from concept to customer.
Of course, configuration management is software too, it wraps & coordinates your actual payload, in our case, an OpenStack environment built on Ubuntu and packaged OpenStack.
The goal of Continuous Delivery is to enable a constant flow of changes into production via an automated software production line. The Continuous Delivery pipeline is what makes it all happen.
Snowflakes are unique & beautiful things. Services built on snowflake servers are bad, and snowflakes are especially bad when you are running a service that you know will span hundreds or thousands of machines. By making your configuration management & deployment strategy part of a codified & enforced system, we're saying "this is the way things will be, this is the way we will configure everything". That brings dependability & stability, at the inconvenience of not being able to just dive in & fix things in production by hand.
That stability comes from forcing all changes through the same pipeline, all code & configuration changes go through the same test & deployment procedures, nothing jumps from a laptop to production, reducing incidents due to environment configuration errors and getting rid of that “it worked on my laptop” excuse.
Having an automated build & deployment system also means that you have the ability to quickly build test systems to check urgent changes – none of the “oh crap, where can we test that” panic that we often find ourselves in for the latest security fix.
Frequent small batches have many benefits. They force you to automate, out of sheer boredom & frustration at the very least, but when you deliver frequent small changes you also have a much better chance of constantly improving your systems & procedures; the big bang may have worked for the start of the universe, but constant evolution & improvement has worked much better for us since then. Frequent small batches also help decrease scrap & rework due to long-running patches that become unmaintainable, and after all, delivering the features your customers want is what we're actually all about.
Removing the manual steps in a process can have many benefits: if you have to do something manually, it is invariably slower, and automation reduces the time taken to complete a given set of steps & the potential for user error. Of course it's not all rosy, we now have to codify our processes & remove all of the tiny judgement calls we might make when working on a live system. However, the payoff is consistency and hopefully a faster cycle time, especially if we've just removed bottlenecks in a manual process, as manual processes are often tied to people or functional silos.
Why Continuous Integration & delivery – so we can test everything!
With a well built pipeline, we can test early and often, testing at each layer of the components that go to make up your deployment system, unit testing for the puppet modules, chef cookbooks or salt formula that are your building blocks. We can then test that the particular versions of each of these components work together as we expect, and that the system they build also works as expected.
Now that we have a consistent baseline for how to build & deploy, we can layer in performance tests to check for expected or unexpected changes in performance.
We can validate that our deployment strategy is complete: that it can do clean deploys & upgrade deploys, and that we haven't forgotten something required for a new environment, as well as for upgrading existing environments.
Configuration Tests – here's a challenge: much of a configuration management strategy is about making sure we supply the right data for a given environment to the configuration management tooling. How much of an environment can we emulate using compute & network virtualisation? Can we use neutron to build networks that match our target physical environment & validate that the IP ranges, default gateways & netmasks in our production environment work together correctly, in a completely virtual & isolated environment?
Now that we've outlined why you would want a pipeline for a configuration management system, let's move on to how our pipeline works.
To recap, our private cloud implementation is built on Ubuntu Precise using Ubuntu's OpenStack packaging, and we're using SaltStack for configuration management & orchestration.
Our development environments make extensive use of vagrant, either directly or via test-kitchen to give developers & admins production like environments.
Our pipeline is built on gerrit, jenkins, test-kitchen & some custom “whole environment” automation scripts.
We build volatile local development environments using vagrant & virtualbox, and remote test environments in the cloud using contractor.
With our layered approach, all work begins with the configuration management module or data that controls a behaviour. For our salt formulas, most of this can be developed & tested in isolation using test-kitchen, which makes it lightweight & quick to validate, and the use of in-module testing & validation means that this testing propagates up to the larger CI system (the same per-module environment settings & tests get re-used for module testing when jenkins is triggered by an incoming review).
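To make that concrete, a per-formula test-kitchen setup looks something like the sketch below. This is a minimal, illustrative `.kitchen.yml`, assuming the kitchen-salt `salt_solo` provisioner; the formula name and exact key names are placeholders, not our real configuration.

```yaml
# Illustrative .kitchen.yml for a single salt formula under test.
# Formula name & keys are examples, assuming the kitchen-salt provisioner.
driver:
  name: vagrant            # locally we drive vagrant/virtualbox

provisioner:
  name: salt_solo          # the salt provisioner added to test-kitchen
  formula: rabbitmq        # the formula this repo contains
  state_top:               # a tiny top file applying just this formula
    base:
      '*':
        - rabbitmq

platforms:
  - name: ubuntu-12.04     # matches our production base OS

suites:
  - name: default
```

Running `kitchen test` then converges a throwaway VM with just this formula & its tests, which is what makes the per-module loop so quick.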
Engineers can then use our personal dev environment, built using some wrappers around vagrant, to build a multi-node, multi-network dev environment that has the same MySQL/Galera cluster, RabbitMQ cluster & OpenStack packages as production, to validate that their changes behave as expected. Once they are happy that they have a working solution, it's time to go public.
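The multi-node environment is driven by a Vagrantfile along these lines. This is a hedged sketch, not our actual wrapper: the hostnames, addresses & bootstrap script name are invented for the example.

```ruby
# Illustrative Vagrantfile sketch for a multi-node, multi-network dev
# environment; node names, IPs & bootstrap script are made up.
nodes = {
  "control1" => "192.168.50.11",
  "control2" => "192.168.50.12",
  "control3" => "192.168.50.13",
  "compute1" => "192.168.50.21",
  "compute2" => "192.168.50.22",
}

Vagrant.configure("2") do |config|
  config.vm.box = "precise64"            # Ubuntu 12.04 base box
  nodes.each do |name, ip|
    config.vm.define name do |node|
      node.vm.hostname = name
      # a private network shared by all nodes, alongside the NAT interface
      node.vm.network "private_network", ip: ip
      # hand over to salt via a bootstrap script (hypothetical name)
      node.vm.provision "shell", path: "bootstrap-salt.sh"
    end
  end
end
```

One `vagrant up` later you have the 3-node control plane & 2 compute nodes described later in the talk.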
When a patch-set hits gerrit, jenkins fires off the relevant tests, including per-module testing & an integration test that helps ensure the module still works with all of the other modules used to build the system. At this stage we're looking to validate that a complete system has been built & that all OpenStack components are still working – we hope to add a more complete tempest-style validation soon.
Our pipeline has several brake points of our choosing: we manually pick what goes forward for full system integration testing & subsequent production deploys, and we pick specific versions of configuration management modules for the deploy-kit. This is us being exceptionally cautious; the more adventurous could just have their deploy-kit track master for all modules, and gitshelf fully supports that.
A patch-set on the deploy-kit repo, bumping the version of one or more configuration management modules, triggers validation of the SHA1s for each module: does that module version actually exist? After peer review by colleagues & the subsequent merge, a fuller test starts, building a complete system using compute instances & neutron networking in our public cloud to validate that we've built an OpenStack control plane that successfully talks to itself, and that we have nova-compute nodes that we can create instances on.
At this stage, jenkins builds the deploy-kit artifact for each of our environments (stage & multiple production environments); in our case this is just a tarball, ready for the next stage: deploying to our physical staging environment.
Deploying from an artifact is a matter of untarring & running the contained deploy script, which uses salt orchestration to activate the changes in a controlled fashion, depending on the content of the deploy.
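The orchestration the deploy script hands to salt looks roughly like the sketch below – a hedged example of an orchestrate SLS, assuming salt's `state.orchestrate` runner; the state names, targets & formula names are illustrative, not our real ones.

```yaml
# Illustrative salt orchestration state, e.g. run via:
#   salt-run state.orchestrate orch.deploy
# Target globs & sls names are invented for the example.
build_database_cluster:
  salt.state:
    - tgt: 'galera*'
    - sls: percona

build_rabbitmq_cluster:
  salt.state:
    - tgt: 'rabbitmq*'
    - sls: rabbitmq
    - require:
      - salt: build_database_cluster   # only after the DB tier is up

install_openstack_control:
  salt.state:
    - tgt: 'control*'
    - sls: openstack.control
    - require:
      - salt: build_rabbitmq_cluster   # control plane needs both clusters
```

The `require` chain is what gives us the controlled ordering: clusters first, OpenStack after.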
Once the staging deploy completes successfully, we move on to production.
So this is what our local vagrant/virtualbox dev environment looks like: a 3-node control plane with 2 compute nodes. We also have a multi-AZ version of this, which gets kind of heavy to run on a workstation, as it needs a few more nodes to be useful.
There are a few differences between this and real production, the use of qemu for one, but we're more interested in making sure that we have a working configuration management system at this stage – the actual virtualisation system being used by nova-compute is of lesser interest to us.
The fact that this is fully scripted means that it takes about 30 minutes to create 5 empty VMs & build a fully working mini-cloud, on your laptop. That's a huge step forward from when we started this project, when people committed hypothetical changes & worked it out in production; we now have somewhere sane & sensible & disposable to validate changes.
So let's just run through those validation steps & the tools we use:
We're using gerrit for all source code management; it's our main git repo host & our code review system.
We have jenkins hooked up to gerrit to validate incoming reviews & build artifacts on successful merges; all of our jobs are built using Jenkins Job Builder, because XML is horrible.
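A Jenkins Job Builder job wired to gerrit looks something like this sketch – the job name, project name & build steps are invented for the example, and the exact gerrit-trigger keys may vary with your JJB version.

```yaml
# Illustrative JJB definition for a per-formula review job;
# names are placeholders, not our real repos.
- job:
    name: salt-formula-rabbitmq-review
    triggers:
      - gerrit:
          trigger-on:
            - patchset-created-event
          projects:
            - project-compare-type: PLAIN
              project-pattern: salt-formula-rabbitmq
              branches:
                - branch-compare-type: PLAIN
                  branch-pattern: master
    builders:
      - shell: |
          bundle install
          bundle exec kitchen test     # run the formula's own tests
```

Because it's YAML templated into Jenkins config, adding a job for a new formula is a small review rather than hand-editing XML.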
We make pretty extensive use of test-kitchen for configuration management module testing. test-kitchen was born in the Chef community, but it's pretty generic & extendable; we added a salt provisioner for our needs, and a puppet provisioner is also available. With test-kitchen you have a number of test & validation frameworks available. When we run test-kitchen locally, it's with vagrant & virtualbox; when test-kitchen is being run by jenkins, it uses LXC as its provider.
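Swapping drivers between laptop & CI can be done without touching the formula repo, via a local override file. A hedged sketch, assuming an LXC driver for test-kitchen; the option names are illustrative.

```yaml
# Illustrative .kitchen.local.yml on the jenkins slaves: overrides the
# vagrant driver from the repo's .kitchen.yml with LXC for faster CI runs.
driver:
  name: lxc
```

The same suites & tests run in both places; only the virtualisation underneath changes.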
Still to come: tempest testing in the pipeline.
Gitshelf is what drives our deploy-kit building; it's a simple tool we wrote that sits somewhere between berkshelf, librarian-puppet and the Android Open Source Project's repo tool.
We use it to make sure that we have a specific version of a given git repo laid out on disk as we want; it supports token expansion & creating symlinks too, just some bits & pieces we needed to build our deploy artifacts the way we wanted.
In this diagram, percona, rabbitmq, openstack etc. are all individual git repos, testable & usable in their own right; the gitshelf.yml in deploy-kit is used to maintain a pointer to the version of each repo we want.
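A deploy-kit gitshelf.yml pinning module versions looks roughly like this. The repo URLs & SHA1s below are placeholders, and the exact key names are a sketch of the format rather than a reference.

```yaml
# Illustrative gitshelf.yml from a deploy-kit; URLs & SHA1s are fake pins.
books:
  percona:
    git: "ssh://gerrit.example.com/salt-formula-percona"
    branch: "4f2a9c1d0b7e"     # pin to a specific, reviewed SHA1
  rabbitmq:
    git: "ssh://gerrit.example.com/salt-formula-rabbitmq"
    branch: "7be03a2e91cc"
  openstack:
    git: "ssh://gerrit.example.com/salt-formula-openstack"
    branch: "master"           # the adventurous option: track master
```

Bumping a pin is just a one-line gerrit review on deploy-kit, which is what triggers the SHA1 validation & full integration test described above.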
When jenkins is doing a full system integration test, it builds a set of openstack resources (nova compute instances, multiple neutron networks & a neutron router), and then we let our salt-based deployment system take over, building the database & rabbitmq clusters before starting the openstack install.
The openstack resources are built using contractor, a small utility for building nova & neutron objects from a JSON definition; you could call it a super-lightweight Heat.
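The shape of such a definition is something like the sketch below. To be clear, this is an invented structure to illustrate the idea, not contractor's real schema; every name, flavor & CIDR here is a placeholder.

```json
{
  "_note": "illustrative shape only, not contractor's real schema",
  "networks": [
    { "name": "ci-mgmt", "cidr": "10.1.0.0/24" },
    { "name": "ci-data", "cidr": "10.2.0.0/24" }
  ],
  "routers": [
    { "name": "ci-router", "networks": ["ci-mgmt", "ci-data"] }
  ],
  "instances": [
    { "name": "control1", "flavor": "standard.medium",
      "image": "ubuntu-12.04", "networks": ["ci-mgmt", "ci-data"] },
    { "name": "compute1", "flavor": "standard.large",
      "image": "ubuntu-12.04", "networks": ["ci-mgmt", "ci-data"] }
  ]
}
```

The point is that the CI environment's networks & topology are declared alongside the code, so the virtual test environment can mirror the production one.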
Once the instances are built and we've installed openstack and its supporting dependencies, we validate that we can build instances & inspect other specific behaviour.
Having this all virtualised means that we’re not blocked on physical hardware being in use for another build or other issues, after all, we’re a cloud company, we should take advantage of that!
So now to talk about how we actually deploy to production: we have some custom tooling that applies salt states in a coordinated fashion, ensuring that we move slowly through the nodes in a cluster so that we don't break a database or messaging cluster in a single hit. We also update control plane & compute nodes on a per-AZ basis, to minimise risk.
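The node-at-a-time, AZ-at-a-time behaviour can be expressed with salt's batching. A hedged sketch of the kind of orchestration step our tooling performs; targets & sls names are illustrative.

```yaml
# Illustrative orchestration step: roll one AZ's compute nodes,
# one node at a time. Glob targets & sls names are invented.
upgrade_az1_compute:
  salt.state:
    - tgt: 'compute-az1-*'
    - sls: openstack.compute
    - batch: 1          # touch a single node at a time, never the whole AZ
```

Only when az1 has rolled cleanly (and recovered in monitoring) does an equivalent step for the next AZ run.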
Part of the per-host deploy is waiting for the host to fully recover in our monitoring system before proceeding to the next one. This slow roll can feel very frustrating for simple changes, but any update that involves nova service restarts is inherently risky, as compute & network failing to recover has a big knock-on impact on our customers.
We're pretty close to making deploys fully hands-off; once we have that sorted, I want to put deploy behind some middleware so that we can trigger deploys from rundeck or hubot or some other communal deploy mechanism.