Continuous delivery is often considered the holy grail of a software company. This practice lets you ship a product to millions of users in a matter of minutes. Tooling is an important part of the process, but when the company is growing, there's more to the story.
This talk discusses the engineering practices, values, and engineering culture that enable the company to ship code at a high pace.
Hey,
Let’s talk today about how we ship our code at a pace of one deploy every 15 minutes to the 50 million users that we currently have.
And here’s what you might be thinking. How many of you are doing CD? With 100 developers?
There are many different talks about continuous delivery from companies like Netflix, Etsy, Flickr. And they are great talks.
They all talk about tooling, but actually none of them vividly demonstrates the magic, that itchy feeling of excitement when you upload code to production.
I want to share that feeling with you today, and reveal the feature that I wrote this morning.
Because, you know, live coding always goes very smoothly.
Do we have to write features that way? Of course not. We write features for 4-5 months, then there is about 1 month of QA, and then the release. Sounds familiar? And then pray. Pray that everything will work. And after that we have about 1 month of fixing bugs.
So we can change it with CD, by treating deployment as part of our daily routine and deploying small changes to production.
But why the rush?
Because we want our product to succeed. We want to be the first one. To deliver great features first.
But that also means iterating and failing fast. For example, if you have a theory about something that might influence conversion, you don’t want to wait 6 months until the next release.
We make the change, run an A/B test, and see the results within 24 hours. That’s the pace at which we want to work.
Continuous delivery is an engineering practice that advocates pushing code into production often, in small changes.
There are many talks about CD from different companies that share their experience.
And most of those talks are about tooling. However, it’s only one piece of the puzzle. There’s a huge variety of tools, including modern solutions like “codeship” that offer CD as a service.
When you are a startup with a team of 5-6 developers, that might be good enough. But when you start growing, there’s more to the story.
How do you continue shipping when the R&D department is 150 people instead of 5?
So what is the answer? People. People together form engineering practices.
The heart of the CD process is the values that we set and the culture that we form.
This is another form of engineering - human engineering. It enables us to deliver.
And those are the two things that I wanna focus on today.
We start with values.
It’s hard to work together when you don’t trust each other.
We trust our people. We trust our developers. It's very hard to describe this feeling.
Trust doesn't come alone. It’s a two-way street.
With trust, developers receive responsibility and certain obligations.
We are responsible for the code we write. No doubt here.
But we are also responsible for testing. We, the developers, think about how to cover the whole app with all sorts of tests, including unit tests, component tests, e2e tests, contract tests and others.
The role of QA becomes checking how the application works overall. If QA finds bugs, it means that we, as developers, didn’t do a good job. So the first thing we do is add a test that reveals the bug (to guard against regressions in the future), and then write a fix.
There are no special people for deployment. Developers write code, and they deploy it into production. And if something goes south, we roll back, or fix and push another version.
The role of devops, like the role of QA, is shifting. Devops provide the infrastructure and tools so that developers can deploy code with one click of a button.
After we deploy code to production, we are done, right? No. It’s our responsibility to monitor that services are working properly. Many devs have separate monitors in their rooms that display application graphs.
Besides writing the code, we are in charge of fixing it as well. It might sound obvious, but it’s an important detail. We don’t have some developers or juniors who fix bugs while others write features. If there is a bug in production, the first person woken up by the support call is the team leader, and then the developer responsible.
If only a few developers have the qualities we’ve just discussed, it doesn’t help much. We need every developer to have them and care about them. This is the engineering culture.
We want to make deployment part of our daily routine. And we start early. Every new employee pushes code into production within their first two weeks, so they won’t fear it. They can feel that trust and responsibility, and have a feeling of ownership.
However, things go wrong. Always. Almost every day. But blaming somebody and pointing fingers doesn’t help much. We deal with problems, fix them and move forward.
In the end, we send a postmortem to the whole R&D with a detailed explanation of what happened, how we solved it, and, most importantly, what conclusions we drew to prevent the same issue in the future.
Moreover, everyone can and should report to urgent (that’s our email) when they notice that something is broken. It can be support receiving a ticket from a customer. But it can also be a BI analyst who saw a drop in a certain graph, or a developer who stumbled upon a bug in production.
As we grow, sharing knowledge becomes crucial. There’s no way we can track every project and coordinate technologies. So, we have guilds for that.
Every developer belongs to a guild. Together, as a guild, we decide which technologies and tools to use, and we develop common components and libraries.
We invest 20% of dev time in the guild: sharing knowledge and improving our infrastructure and engineering practices.
Having set the values and culture, we can look at some of the engineering practices that we use in CD.
When you start a new project, where do you start? With the server side? With the client side?
Since deployment is part of our routine, we deploy a “hello world” project into production first and then start writing features.
Writing a feature starts with failure: a failing e2e test. TDD is not a “nice to have” practice when delivering at a high pace, it’s a must. We need to make sure that our code works without relying on QA checks. The server guild doesn’t have a single QA person. The client side uses the help of QA for sanity checks and styling bugs.
We even think of TDD as test-driven design. Imagine what happens when hundreds of developers write code. It easily becomes complex (sometimes unnecessarily). Writing code with TDD drives your design, squeezing out unnecessary complexity because you write the minimal amount of code required.
A good suite of tests allows future refactoring without any fear.
Most of the features that we write are protected by a feature toggle.
You’ve probably all heard about feature toggles before. In case you haven’t: a feature toggle is a technique for pushing unfinished code to production, guarded by a toggle or flag. As long as the feature toggle is off, the code won’t be executed.
What does it look like in code? Basically, it’s an if statement.
The main point of confusion is that pushing code is not the same as releasing the feature. When your feature is ready, you open the flag in the backoffice, and the feature becomes available to users.
A feature toggle is a form of A/B test: a stateless experiment. An A/B test is an experiment where groups are defined according to certain criteria.
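Here is a minimal sketch of that if statement. It assumes a hypothetical in-memory toggle store; the names `FeatureToggles`, `render_checkout` and `new_checkout` are made up for illustration and are not our real backoffice API.

```python
class FeatureToggles:
    """Hypothetical in-memory toggle store; a real one would be backed by a backoffice."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled):
        self._flags[name] = enabled

    def is_on(self, name):
        # Unknown toggles count as off, so unfinished code stays dark by default.
        return self._flags.get(name, False)


def render_checkout(toggles):
    if toggles.is_on("new_checkout"):
        return "new checkout"  # deployed to production, but not released yet
    return "old checkout"      # what users see while the toggle is off
```

Both branches ship with every deploy. Releasing the feature is just `toggles.set("new_checkout", True)`, with no new deployment.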
This practice allows us to experiment and iterate. We release features gradually. For example, a common scenario for opening a feature: Wix users, then 20% of new users in Canada, then 50% of new users in Canada and the US, then 100% of new users, then 100% of registered users. We first open new features to new users.
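One way to sketch such a gradual rollout is to describe each step as an audience filter plus a percentage, and to place every user in a stable bucket from 0 to 99. This is only an illustration under assumptions: `Stage`, `bucket` and `is_open` are hypothetical names, and hashing the user id is one common bucketing technique, not necessarily the one we run in production.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of a hypothetical rollout plan."""
    percent: int                        # share of the matching audience that gets the feature
    new_users_only: bool
    countries: frozenset = frozenset()  # empty set means "any country"


def bucket(user_id, experiment):
    """Map a user to a stable number in 0..99 for this experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_open(stage, experiment, user_id, is_new_user, country):
    """Decide whether the feature is open for this user at this stage."""
    if stage.new_users_only and not is_new_user:
        return False
    if stage.countries and country not in stage.countries:
        return False
    return bucket(user_id, experiment) < stage.percent
```

Because the bucket depends only on the experiment name and the user id, moving from `Stage(percent=20, ...)` to `Stage(percent=50, ...)` keeps the feature open for everyone who already had it and only adds new people.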
A/B tests at scale become a pretty hairy thing that involves many situations to consider. For example:
how to provide consistency to anonymous users
how to manage almost 500 experiments running simultaneously
how to pause experiments
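To make the first and the last of these questions concrete, here is a small sketch of one possible approach. It assumes that an anonymous visitor is recognized by a cookie and that "paused" means nobody new joins while existing participants keep their group. Both are assumptions for illustration, and `Experiment` and `group_for` are hypothetical names.

```python
import random


class Experiment:
    """Hypothetical A/B experiment with a pause switch."""

    def __init__(self, name, groups=("A", "B"), rng=None):
        self.name = name
        self.groups = groups
        self.paused = False
        self._rng = rng or random.Random()

    def group_for(self, cookies):
        """Return the visitor's group, or None if the visitor is not in the experiment.

        `cookies` is a plain dict standing in for the browser's cookie jar.
        """
        key = f"exp_{self.name}"
        if key in cookies:
            return cookies[key]  # returning visitor: always the same group
        if self.paused:
            return None          # paused: nobody new joins the experiment
        cookies[key] = self._rng.choice(self.groups)
        return cookies[key]
```

The group is stored on the visitor's side, so an anonymous user sees the same variant on every visit without any login.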
(Demonstrate guinepig production)
When we run A/B tests, we must know how they perform.
On the way here, I read an article in which the 26-year-old CEO of Mixpanel pitched his startup to Andreessen Horowitz.
“Most of the world will make decisions by either guessing or using their gut. They will be either lucky or fail”.
We don’t wanna be like that. Thus, BI is an important part of a project. It gives us an understanding of how users react to our changes and how those changes influence our KPIs.
Each application sends a set of predefined events on clicks, navigation, etc. We analyze and monitor them.
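As a rough illustration of what "predefined" can mean, the sketch below keeps the allowed events in one catalogue and refuses anything else. The event names, the payload fields and the `build_event` helper are all hypothetical, not our actual BI schema.

```python
import json
import time
from enum import Enum


class Event(Enum):
    """Hypothetical catalogue of events the application is allowed to send."""
    BUTTON_CLICK = "button_click"
    PAGE_NAVIGATION = "page_navigation"


def build_event(event, user_id, **details):
    """Serialize one BI event as JSON; only predefined events are accepted."""
    if not isinstance(event, Event):
        raise ValueError(f"unknown BI event: {event!r}")
    return json.dumps({
        "event": event.value,
        "user_id": user_id,
        "timestamp": time.time(),
        "details": details,
    })
```

For example, `build_event(Event.BUTTON_CLICK, "u1", button="publish")` produces a JSON string ready to be sent to whatever collects the events, while `build_event("made_up", "u1")` raises a `ValueError`.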
BI is mostly a tool for a product manager and less for a developer. However, applications also need monitoring. New Relic does a fantastic job in real time. We know transaction times, error rates, loading times, session traces and much more.
We have monitors with graphs in many rooms, and they are irreplaceable during the deployment process.
(Demonstrate New Relic graphs of APM and Browser).