The Firefox build and release pipeline is crucial to delivering our products to customers. Over the past year we have transformed our pipeline to a more robust and scalable system using Taskcluster, Docker and in-tree scheduling. We have also implemented release promotion, which takes existing continuous integration (CI) binaries and transforms them for release, significantly reducing wall clock time.
Attend this session to hear exciting stories about how to replace components of a large running distributed system using the ominously named strangler application approach. I’ll discuss some metrics regarding the end-to-end time for our release process. I’ll also cover how developers can implement changes to transform builds and tests themselves in-tree.
13. Release process using release promotion
(Pipeline diagram: release promotion reuses existing build artifacts from CI (decision graph, L10n, unit tests, performance tests), then signs builds, repackages builds and moves artifacts, generates updates, refreshes update db rules, and updates websites with the release.)
14. About:Taskcluster
● Taskcluster is a task execution framework that supports Mozilla’s continuous
integration farm + release pipeline
● It is a set of components that manages task queuing, scheduling, execution and
provisioning of resources.
16. Why: In-tree and Decision Graph
● Build and test configs are all in tree
○ Good news: Developer autonomy
○ Bad news: Developer autonomy
● Decision graph upon push identifies failures more quickly
● Changes can be tested locally and on try
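The resource savings from a dependency graph can be sketched in a few lines of Python. This is an illustrative toy, not Taskcluster's actual scheduler: tasks run once their dependencies succeed, and anything downstream of a failure is skipped rather than executed.

```python
from collections import deque

def run_graph(tasks, execute):
    """tasks: {label: [dependency labels]}; execute(label) -> True on success.
    Returns {label: 'success' | 'failed' | 'skipped'}."""
    pending = {label: set(deps) for label, deps in tasks.items()}
    dependents = {}
    for label, deps in tasks.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(label)

    status = {}

    def skip(label):
        # Transitively mark everything downstream of a failure as skipped.
        if label not in status:
            status[label] = "skipped"
            for child in dependents.get(label, []):
                skip(child)

    ready = deque(label for label, deps in pending.items() if not deps)
    while ready:
        label = ready.popleft()
        if label in status:
            continue
        if execute(label):
            status[label] = "success"
            for child in dependents.get(label, []):
                pending[child].discard(label)
                if not pending[child]:
                    ready.append(child)
        else:
            status[label] = "failed"
            for child in dependents.get(label, []):
                skip(child)
    return status
```

For example, if a build task fails, its unit and performance tests never run, which is the cost saving the decision graph gives us on every push.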
17. Testing the graph locally
● Generates the full taskgraph.
○ ./mach taskgraph full > full.txt
● Generates an optimized taskgraph
○ ./mach taskgraph optimized > optimized.txt
● Generates a target taskgraph
○ ./mach taskgraph target -p parameters.yml > target.txt
● Generates a target taskgraph as JSON to inspect the content of the graph
○ ./mach taskgraph target --json -p parameters.yml > target.json
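Once you have the JSON dump, a few lines of Python can summarize it. This sketch assumes the output maps each task label to an object with a "dependencies" mapping, as taskgraph output has in recent trees; adjust the keys if your revision's format differs.

```python
import json

def summarize(graph_json):
    """Count the tasks and dependency edges in a taskgraph JSON dump."""
    graph = json.loads(graph_json)
    tasks = len(graph)
    edges = sum(len(task.get("dependencies", {})) for task in graph.values())
    return tasks, edges

# A tiny stand-in for the real target.json output.
sample = json.dumps({
    "build-linux64/opt": {"dependencies": {}},
    "test-linux64/opt-mochitest": {"dependencies": {"build": "build-linux64/opt"}},
})
```

Running `summarize` over a real full graph is how you get numbers like the task and dependency counts quoted later in the talk.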
20. ● Taskcluster config files are under taskcluster/ in tree
○ Example: taskcluster/ci/build/macosx.yml defines mac builds (which
actually run on Linux)
21. Changing tests
● YAML files in taskcluster/ci/test/ define test groups by suite name, e.g.
mochitest, reftest, talos, etc.
24. Why: Docker Containers
● Docker containers for test and build images (not all platforms)
○ Consistent environment to debug build and test failures via one click loaners
○ More self-serve developer loaners
29. Why: More autoscaling
● Moved more platforms to AWS to enable autoscaling in response to bursty load
○ Moved macOS builds to Linux cross-compile on AWS
○ Moved many Windows builds/tests to AWS
30. Why: More security
● Better security - Chain of Trust (CoT) between artifacts as they are built,
signed and moved to AWS S3/CDNs for download on releases/nightlies
● CoT is the security model for releases
● Task execution is restricted by taskcluster scopes, but that is only one type of
authentication
● CoT allows us to trace requests back to the tree and verify each previous task
in the chain.
● If CoT fails, the task is marked as invalid
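The chain-walking idea can be illustrated with a toy verifier. This is only a sketch of the concept; scriptworker's real verification covers much more (gpg-signed JSON blobs, task definitions, worker identities). Here each link in the chain records the sha256 of the artifact it produced, and a downstream verifier recomputes every hash before trusting the result.

```python
import hashlib

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_chain(chain):
    """chain: list of {'artifact': bytes, 'claimed_sha': str}, ordered from the
    source task to the final task. Returns True only if every recorded hash
    matches the artifact it claims to describe."""
    return all(sha256(link["artifact"]) == link["claimed_sha"] for link in chain)
```

Any tampering with an intermediate artifact changes its hash, the check fails, and (in the real system) the task is marked invalid before signing or shipping proceeds.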
31. Why+?
● Team learned new things - Docker, transforms, migration strategies,
microservices, monitoring
● Future efficiencies - allow us to continue to scale
● Migrate off technologies that did not scale to our needs
● Re-evaluate existing jobs: Are they still needed? Could they be improved?
32. Timeline for migration
● Jan 20 - Linux Desktop and Android Firefox nightly builds from Taskcluster
● Mar 13 - Mobile beta in Taskcluster
● July 2 - Mac Nightlies in Taskcluster
● Aug 30 - Windows nightlies in Taskcluster
● Nov 14 - Shipped Firefox Quantum in Taskcluster
33. Approach to migration
● Incremental portions of pool
● Communication
● Checklist
● Monitor capacity and wait times
● Monitor state after migration
● Rollback plan
● Decommission old
● Migrate more
36. 56 was a rough release
● We had many automation changes
○ New compression format for updates
○ Watersheds for win32->win64 migration for people on 64 bit hardware
○ Win32/Win64 on taskcluster
38. Operation: Don’t F*ck up 57
● Implement missing release automation
● Fix our staging environment
● Smooth our merge day process
● Train team members on merges and staging releases
● Run staging releases and merges to iron out any issues
before the 57 release
● Write tests to validate update rules for 57
● Spreadsheet to coordinate update rules with relman
39. What have we learned?
● Incrementalism - change one thing, evaluate, then change
another
● Expectations change. The faster we build, the faster other
groups expect to be able to ship
● Staging environment is important to test new automation
● Communication
● Organizational changes
● Consider the operational side, not just landing code
40. Upcoming work
● In tree release promotion for beta and release builds
● Release process optimizations: measure our release end-to-end times and
common failure points, with the aim of providing more predictable and
stable releases
● Staging releases on try
● More incremental fixes to make things faster
43. Additional Reading
● Justin Wood’s (Callek’s) talks on transforms
https://gitpitch.com/Callek/slideshows/transforms_2017
● All your nightlies are belong to Taskcluster
https://atlee.ca/blog/posts/migration-status.html
● Nightly builds from Taskcluster https://atlee.ca/blog/posts/nightly-builds-from-taskcluster.html
● 2016 retrospective https://atlee.ca/blog/posts/2016-releng-retrospective.html
● What's So Special About "In-Tree?"
http://code.v.igoro.us/posts/2016/08/whats-so-special-about-in-tree.html
44. Additional Reading
● Chris Cooper Nightlies in Taskcluster
http://coopcoopbware.tumblr.com/post/156133487075/nightlies-in-taskcluster-go-team
● Chris Cooper Mobile Betas in TC
http://coopcoopbware.tumblr.com/post/158362146735/shameless-self-release-promotion-firefox-530b1
● So you want to rewrite that - Camille Fournier, GOTO conference, Chicago,
2014 https://www.youtube.com/watch?v=PhYUvtifJXk
Speaker notes
Hi, my name is Kim Moir and I work in Mozilla Release Engineering. I’m also one of the unapologetic Canadians here in Austin this week.
Today I’m going to tell a story. Last month, we shipped Firefox Quantum. We released a beautiful new and much faster browser. So far the reviews have been stellar and we are all looking forward to seeing the impact that it has in the marketplace.
But there is another story. While the platform teams were transforming the browser, engineering ops teams were transforming the pipelines that deliver our products to the world. This work was ongoing while we continued to deliver betas every week, and releases on schedule. How did we do this? Why did we do this? How does this help you? This is the story I’m going to tell today.
As a side note, this picture was taken near Stanley Park in Vancouver. I took it during a work week almost two years ago. At this point we were starting a lot of the work to transform our build and release pipeline. Today the bulk of that work is now done.
I am also notorious for talking a lot about release engineering. I have a promise for you that this talk will be interesting, informative and relevant, no matter what your role at Mozilla. So let’s get started!
Faster pipelines -> feedback -> shipping
How to try it yourself! (loaners, mach commands, overview of tasks, transforms)
Lessons learned and what’s next
I’ll publish the slides online after the talk
Photo by Taylor Leopold on Unsplash
Before I start talking about the work we did the past year, I’m going to ask you why are you here. Not in the why are we here in the universe sense, but why are you here at Mozilla? Would anyone like to share why they are here at Mozilla?
I’m here because:
I care about the open web
I like release engineering at scale
I enjoy working with an amazing team who like to constantly improve things
I like to ship!
I’d also like to introduce the cast of characters that did a lot of the work I’m going to be talking about. This is the Mozilla release engineering team.
We also didn’t do all the work ourselves - there was a lot of work from the Taskcluster Platform team, Release Engineering Operations, Developer Productivity, Developer Services, Release Management, Sheriffs, Buildduty, QA and more
As I go through this presentation, I’m going to have a series of trivia questions. If you get the right answer, I’ll have stickers for you.
Trivia time - how many countries do we live in? The answer is 7
One other thing to note is that we are a very distributed team as you can see from the map. We are in New Zealand, Canada, the US, UK, France, Germany and Romania.
Picture Photo by Quinten de Graaf on Unsplash
How many of you have?
"pushed a patch to Try?"
"landed an uplift to mozilla-beta or mozilla-release?"
"received a notification that there is a new update available?"
So if you have done any of these before, you’ve used some of the systems that release engineering builds and maintains
What does releng do?
Transform code to shippable product
Develop and maintain a build and release pipeline
Build: compile, package, sign, run tests, create updates, verify various update scenarios work
Optimize! Make things faster!
From Wikipedia: “Release engineering is a sub-discipline in software engineering concerned with the compilation, assembly, and delivery of source code into finished products or other software components. Associated with the software release life cycle, it was said by Boris Debic of Google Inc.[1][2] that release engineering is to software engineering as manufacturing is to an industrial process.”
Photo by Garett Mizunaka on Unsplash
At a previous job, I used to work with someone who said that everything in life is a constraint optimization problem. Building a build and release pipeline is the same.
We are bounded by constraints: Money, time, machines, people. How can we optimize these constraints most effectively so we have happy developers and are able to deliver product?
Photo by Uroš Jovičić on Unsplash
What are end-to-end times? The time from when a developer lands a commit until we are able to ship the finished product
Why are end-to-end times important?
Developers love to ship. In order to ship, they need feedback on their patches. Can I ship this? Or is there a regression that needs to be backed out? It improves happiness if they can see the results of their work more quickly
Landing small incremental patches reduces risk. Years ago, many software teams only ran nightly builds, once a day, then bisected to figure out what broke everything. We don’t do that anymore; it’s too difficult to figure out what went wrong on a high-velocity team with a huge number of commits.
0-days - we need to be able to get security patches to our users quickly
Trivia time - how long does a release take today?
References
2013 - https://oduinn.com/2013/12/11/on-leaving-mozilla/
What changed?
Release promotion
More parallelization of tasks
Faster machines
Moved more platforms to AWS so we can scale for bursty load (mac builds now run on Linux in AWS, windows on AWS machines)
Fastci work
Very simplified diagram
We sign with a signing key specific to CI builds. It’s important that CI, nightly and release builds have different signing keys.
Trivia time - How much does it cost to run all the jobs associated with a push to m-c? $134. This doesn’t account for the costs of machines we have in data centers, like mac test machines or machines for performance tests on Windows and Linux.
For nightlies, the signing key is different than releases. Also, we generate language packs for different locales. And generate updates for all of that.
Very simplified diagram again
We use a process called release promotion to take the existing artifacts from a CI build and repackage them for release builds.
With CI builds on release and beta builds, we use the release key for nightlies because these builds are promoted.
In the future, we plan to sign with the release key only when the builds are promoted
Trivia time - how much did we pay AWS for a push to m-r that we used for 57? It’s about $56
Photo by ARTHUR YAO on Unsplash
A lot of the work I’m going to talk about today is regarding our migration of buildbot to taskcluster. So I’m going to talk a little about what that is.
Release Engineering + other eng ops teams recently finished migrating builds, tests and much of release automation to Taskcluster
What does taskcluster provide? It has a lot of features as you can see from this page.
The most important features for the releng team were scheduling flexibility and platform support.
Before this migration, we had several repos full of Python code that were used to define how builds and tests ran in CI. Releng + a small group of developers knew how to make changes, and this was a bottleneck to enabling new or disabling old builds and tests
Now these configs are managed in tree and any developer can make changes
The drawback to that is that any developer can make changes. Sometimes mistakes are made. E.g. I had to write patches to back out changes that enabled tests on branches where they didn’t need to run and just added to costs. Also, we need to find ways of testing that developers are not touching configs for releases riding the trains.
With every push to a repo, such as autoland, a decision graph is generated automatically. Basically it contains a list of tasks, and all their dependencies, that need to run for that push. If the decision task fails, the builds aren’t run, which saves resources
Developers can also test these changes locally or on try
Photo by ARTHUR YAO on Unsplash
Photo by ARTHUR YAO on Unsplash
Trivia time - How many tasks are generated with a full graph? 7945 tasks and 16874 dependencies
We don’t really use the full task graph very often
It’s filtered to select the tasks we actually need
So an optimized task graph, after the filter_target_tasks filter, is 2748 tasks
You can also specify a target. A regular push to autoland or m-c has a target of default
There are other targets that we use; for instance, when we are running releases, the promote_firefox target filters the tasks needed to promote existing CI builds to beta or release builds.
When you have your changes working locally, you can push to try.
You can also export the task in json format
Every decision task generates a parameters.yml so you can download that to use in the target task
Photo by michael podger on Unsplash
I wrote some code to generate a graph of our dependencies, but it was too large, so this is a replica.
Trivia time: 7945 tasks and 16874 dependencies
Defines the various flavours of mac builds: the toolchains, scripts and config to run the build
Anyone with commit rights can change these, but don’t unless you know what you’re doing!
This is the start of the talos.yml file. You can see that for talos-chrome, we can specify that they only run on linux-qr on m-c and try. By default they only run on selected branches
You can also specify built-projects, and the tests will only be scheduled for projects where their upstream dependencies are built.
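That per-project scheduling can be sketched in Python. This is illustrative only: the attribute name "run-on-projects" mirrors the in-tree convention, but treat the details here as a hypothetical simplification of what the decision task's target filtering does.

```python
def target_tasks(full_graph, project):
    """Keep only the tasks that declare they should run on this project.
    full_graph: {label: {'run-on-projects': [project names or 'all']}}."""
    return {
        label: task
        for label, task in full_graph.items()
        if "all" in task.get("run-on-projects", [])
        or project in task.get("run-on-projects", [])
    }
```

So a suite restricted to m-c and try simply drops out of the target graph for an autoland push, while suites marked for all projects stay in.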
taskcluster/taskgraph/transforms/ transforms the taskgraph
This is code to transform the graph for different purposes, to reduce code duplication
See Callek’s talk - it’s an entire talk of its own and explains taskcluster transforms very clearly https://gitpitch.com/Callek/slideshows/transforms_2017
Photo by ARTHUR YAO on Unsplash
The taskcluster team implemented one-click loaners which is a super easy way to get a short term loan of a machine that has a docker environment setup configured with the job that you want to debug.
https://docs.taskcluster.net/tutorial/debug-task#content
Reproducing errors on the same environment that runs in CI
One click, get an interactive terminal with that job running in it
List platforms available
Demo?
You’ll have an interactive task created for you. (Note you have to be logged into Taskcluster to create one.)
You’ll be redirected to a page with several options. I chose option 2 to set up the tasks, but not run them
The end result is a page like this, with a Docker environment with the tests set up. Some caveats from https://docs.taskcluster.net/tutorial/debug-task#content
The original task command executes anyway. You can, of course, kill it manually.
The shell stays open until there are no active connections, but only until the task's maxRunTime expires, at which time it will be forcibly terminated.
Tasks generally run on EC2 spot instances which can be killed at any time.
Photo by ARTHUR YAO on Unsplash
Mozilla has a lot of bursty load on their CI farm. When Europe and North America are online, there is a lot of load, overnight it decreases. It’s expensive to have all these machines in data centers to accommodate bursty load.
So you can’t autoscale macs in AWS; it’s not an offering. We wanted to cross-compile Mac on Linux. It took a lot of work from Ted, mshal, and wcosta to get the toolchain correctly configured. When we did get the builds working, a performance issue was identified: the performance of the browser built on Linux was not as good as the one built on native mac hardware. So we couldn’t ship it.
It turns out that on mac we were building in a different directory than on Linux, and this was the root cause
https://bugzilla.mozilla.org/show_bug.cgi?id=1338651
“Taskcluster OS X builds are cross-compiled from a Linux Docker image. The build is done in a path under '/home/', which shows up in the symbol table of the resulting binary as STAB entries referencing the object files. Some system libraries on OS X will attempt to stat the files in those entries, which can cause noticeable performance issues. This has shown up as a result of the sandboxing system causing this behavior while reporting violations, which caused Talos regressions, as well as timeouts in GTest death tests due to the system crash reporting system causing this behavior.”
Trivia time: How many comments were on the bug to address this performance issue? 200
How many people were cc’ed? 48
It is a small novel. Add it to your reading list.
Photo by ARTHUR YAO on Unsplash
Aki’s talk on CoT https://vreplay.mozilla.com/replay/showRecordingExternal.html?key=mHlTiJ4RZZSVRPc
Blog post on CoT https://escapewindow.dreamwidth.org/249409.html
We have been generating Chain of Trust artifacts for a while now. These are gpg-signed json blobs with the task definition, artifact shas, and other information needed to verify the task and follow the chain back to the tree. However, nothing has been verifying these artifacts until now.
With the latest scriptworker changes, scriptworker follows and verifies the chain of trust before proceeding with its task. If there is any discrepancy in the verification step, it marks the task invalid before proceeding further. This is effectively a second factor to verify task request authenticity.
Photo by ARTHUR YAO on Unsplash
This is a timeline for some of the work we did this year. There was a lot of work done in the previous year to migrate the ci builds to tc.
Tested in project branches
Mention phoenix project
Who loves to delete code? I do, it’s one of my favourite things.
From Jez Humble’s Continuous delivery page
https://continuousdelivery.com/implementing/architecture/
“One pattern that is particularly valuable in this context is the strangler application. In this pattern, we iteratively replace a monolithic architecture with a more componentized one by ensuring that new work is done following the principles of a service-oriented architecture, while accepting that the new architecture may well delegate to the system it is replacing. Over time, more and more functionality will be performed in the new architecture, and the old system being replaced is “strangled”.”
One of the things that really helped us achieve this in our transition was an application called buildbot bridge. This allowed us to schedule jobs on taskcluster, but continue to run them on buildbot. This is similar to the dispatcher function shown in the diagram above.
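The bridge's role can be sketched as a dispatcher in the strangler-application spirit: everything is scheduled through the new system, but execution is delegated to the legacy system for task kinds that haven't been migrated yet. The names below are hypothetical, not the actual buildbot bridge API.

```python
class Dispatcher:
    """Route tasks to the new backend when their kind has been migrated,
    and fall back to the legacy backend otherwise."""

    def __init__(self, migrated_kinds, new_backend, legacy_backend):
        self.migrated = set(migrated_kinds)
        self.new = new_backend          # callable: task -> result
        self.legacy = legacy_backend    # callable: task -> result

    def run(self, task):
        backend = self.new if task["kind"] in self.migrated else self.legacy
        return backend(task)
```

Migration then becomes a matter of adding kinds to the migrated set one at a time; the old system is "strangled" as its share of the work shrinks to zero.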
Trivia time: 22 beta builds, 12 Betas, 6 RCs toward 56.0
Watersheds are rules that define an upgrade path to a newer release through a previous release
Photo by Lance Anderson on Unsplash
This is an accurate portrayal of the releng team after the 56 release cycle. We were very tired. Not just us. Other teams too. And we knew that it wouldn’t be good for the team to run through 57 hitting some of the same problems given the importance of that release.
Not sure who coined this term, but I have heard it circulate
There were many facets to this approach from other groups
For releng it was to
Running staging releases had long been a magical incantation that only a few releng folks knew how to do. So we set about documenting the process. Where were the pain points? How could more things be automated? How could we share the knowledge so everyone involved with releases could fire up a staging release and test that their changes worked as expected?
Trivia question:
How many beta builds were there before the 57 release?
11 betas - 16 builds, 4 RCs before final release
This is an excellent talk on code rewrites as well
So you want to rewrite that - Camille Fournier
https://www.youtube.com/watch?v=PhYUvtifJXk
What is in-tree release promotion? Currently the code that promotes/ships our builds for beta and release doesn’t all reside in the tree (i.e. in mozilla-central); there are several github repos for that code. So there is a lot of ongoing work to migrate that functionality so it resides in tree. This will allow us to run staging releases on try, among other things.
This was a really huge rewrite. We learned a lot. We made mistakes. We learned from them and will take those lessons forward in future work.
I learned a lot from this process. In the end, our build and release pipeline is more resilient, more scalable, and more self-serve for developers.