Minimizing customer impact is key to successfully rolling out frequent code updates. Learn how to leverage the AWS cloud to minimize the impact of bugs, test your services in isolation with canary data, and easily roll back changes. Learn to love deployments, not fear them, with a blue/green architecture model. This talk walks you through the reasons it works for us and how we set up our AWS infrastructure, including package repositories, Elastic Load Balancing load balancers, Auto Scaling groups, internal tools, and more to help orchestrate the process. Learn to view thousands of servers as resources at your command, so you can improve your engineering environment, take bigger risks, and not spend weekends firefighting bad deployments.
8. Data ingestion
[Diagram: sensors connect through an external-service Elastic Load Balancing load balancer to termination servers, which feed Content Routers; events flow through Kafka to Processors 1 and 2 (each running several instances of Service A), with a UI and API on top. The data plane: Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3.]
9. High scale, big data
•Fortune 500 companies, think tanks, non-profits
•100K+ events per second
–Expected to hit 500K EPS by end of 2015
•Each enterprise customer can generate 2–4 TB of data per day
•Microservice architecture
•Polyglot environment
13. Solving for the problems
•OMG, all servers need to be patched??
•I’m afraid to restart that service; it’s been running for 2 years
•Large rolling restarts
•Deployment fear
–Friday night deploys
•Blue/green for event processing?
14. Our primary objectives for deployments
•Minimize customer impact
–Customers should have no indication that anything has changed
•Maximize engineers' weekends
–Avoid burnout
•Reduce dependencies of rollouts
–Everything goes out together, 50+ services, 1,000+ VMs
15. Leveraging AWS
•Programmable data centers
•Nodes are ephemeral
•It should be easier to re-create an environment than to fix it
–Think like the cloud
16. What is blue-green?
[Diagram: a router sends traffic to Application v1 (web server + app server) while Application v2 (web server + app server) stands by; both share a database. Cutting over means pointing the router at v2 and severing the routes to v1.]
17. What is blue-green?
•Full cluster BG
–Everything goes out together
–Indiana Jones: “idol switch”
•App-based BG
–Each app or team controls their own blue-green deployments
18. Data plane
The data plane: can't blue-green all the things
[Diagram: the blue and green clusters both sit on top of a single shared data plane: Kafka, DynamoDB, Redis, Amazon RDS (pgsql), Amazon Redshift, Amazon Glacier, Amazon S3.]
19. When do we deploy?
•Teams deploy end of sprint releases together
•Hot fixes and upgrades are performed frequently via rolling restart deployments
•Early on, deployments took an entire day
–Lack of automation
•Deploys today generally take 45 minutes
–Everyone has run a deployment
20. Sustaining engineer
•Every team member including QA has run deployments
•Builds confidence, understanding, and redundancy
•Ensures documentation is up to date and everything that can be automated is
Sustaining engineers earn a badge-of-honor shirt after their tour of duty
21. Deployment day
•Apt repo synchronized and locked down
•Data plane migrations applied
•“Green” cluster is launched (1000s of machines)
•IT tests run
•Canary customers
•Logging and error checks
•Active-active
•“Blue” marked as inactive, decommissioned
22. Keys to success
Pro tip: It’s not just flipping load balancers
23. Keys to success
Automate all the things
•Junior devs should be able to run your deploy system
24. Keys to success
Instrumentation & Metrics
https://github.com/codahale/metrics
https://github.com/rcrowley/go-metrics
25. Keys to success
Use a provisioning system
•Chef
•Puppet
•Salt
•baked AMIs
26. Keys to success
Live integration / regression test suites
[Diagram: a test system sends deterministic input values into the running cluster and verifies the processed state.]
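The "send deterministic input, verify processed state" pattern above can be sketched as follows. This is a minimal illustration, not the deck's actual suite: `process_events` is a hypothetical stand-in for the real pipeline, and the event shapes are invented.

```python
def process_events(events):
    """Toy stand-in for the processing pipeline: count events per sensor."""
    state = {}
    for event in events:
        state[event["sensor"]] = state.get(event["sensor"], 0) + 1
    return state

def run_live_check():
    # Deterministic input: we know exactly what state it must produce.
    fixture = [
        {"sensor": "s1", "payload": "a"},
        {"sensor": "s1", "payload": "b"},
        {"sensor": "s2", "payload": "c"},
    ]
    state = process_events(fixture)
    expected = {"s1": 2, "s2": 1}
    assert state == expected, "processed state diverged: %r" % state
    return state
```

Because the input is deterministic, the same fixture can be replayed against the live blue and green clusters and the resulting state compared byte for byte.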
34. Elevator pitch on Kafka
•Distributed commit log
•Similar to a message queue
•Allows for replaying messages from earlier in the stream in case of failure
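The replay property can be illustrated with a toy in-memory commit log (this is a conceptual sketch of the idea, not Kafka's API; `ToyCommitLog` is invented for illustration): consuming never deletes messages, it only advances an offset, so recovery is just re-reading from an earlier offset.

```python
class ToyCommitLog:
    """Minimal append-only log: messages are never removed on read."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1  # offset of the new message

    def read_from(self, offset):
        # Replaying after a failure is just re-reading from an old offset.
        return self.messages[offset:]

log = ToyCommitLog()
for m in ("event-1", "event-2", "event-3"):
    log.append(m)

assert log.read_from(0) == ["event-1", "event-2", "event-3"]  # full replay
assert log.read_from(2) == ["event-3"]                        # resume mid-stream
```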
35. It all starts with a running cluster
[Diagram: sensors → external-service ELB load balancer → termination servers → Content Routers → "active" Kafka topics → blue Processors 1-4 (each running several instances of Service A); the shared data plane holds Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3.]
•Blue is running; normal operation
•Content Routers are writing to the "active" topics in Kafka
•Blue processors read from the "active" topics
37. Launching new cluster
[Diagram: the green cluster comes up alongside blue, with its own termination servers, Content Routers, and Processors 1-4; blue's Content Routers keep writing the "active" topics while green's processors point at the "inactive" topics.]
•Green cluster is launched
•Termination servers are kept out of the ELB load balancer by failing health checks
•Content Routers write to the "active" topics
•Processors in green read from the "inactive" topics
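The health-check trick above might look like the following sketch: the new cluster's termination servers answer the ELB health check with a failure code until the cluster is activated, so the ELB never routes sensor traffic to them. The function name and response shape are hypothetical.

```python
def health_response(cluster_active):
    """Return (status_code, body) for the ELB health-check endpoint.

    While the cluster is inactive we deliberately answer 503 so the
    ELB marks the instance unhealthy and keeps it out of rotation;
    flipping the flag lets the instance pass checks and take traffic.
    """
    if cluster_active:
        return 200, "OK"
    return 503, "inactive"

assert health_response(False) == (503, "inactive")
assert health_response(True) == (200, "OK")
```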
39. Getting the size right
•Sizing of our autoscale groups is determined programmatically
–Admin page allows for setting min / max
–Script determines appropriate desired-capacity based on running cluster
•Launching is then as simple as updating the autoscale groups to the new sizes
from subprocess import Popen, PIPE

def current_counts(region='us-east-1'):
    proc = Popen(
        "as-describe-auto-scaling-groups "
        "--region {} "
        "--max-records=600".format(region),
        shell=True, stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate()
    if err:
        raise Exception(err)
    counts = {}
    for line in out.decode().splitlines():
        if "AUTO-SCALING-GROUP" not in line:
            continue
        parts = line.split()
        group = parts[1]
        current = parts[-2]
        counts[group] = int(current)
    return counts
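With the running counts in hand, deriving the new cluster's desired capacities is a simple clamp against the admin-configured min/max. A minimal sketch of that sizing step (the function names and the `limits` shape are assumptions, not the actual tooling):

```python
def desired_capacity(current, minimum, maximum):
    """Match the running (blue) group's size, clamped to the
    admin-configured min/max for the new (green) group."""
    return max(minimum, min(current, maximum))

def green_sizes(blue_counts, limits):
    # blue_counts: {group: running count}; limits: {group: (min, max)}
    return {group: desired_capacity(count, *limits[group])
            for group, count in blue_counts.items()}

sizes = green_sizes({"processor-1": 40, "processor-2": 3},
                    {"processor-1": (2, 30), "processor-2": (2, 30)})
assert sizes == {"processor-1": 30, "processor-2": 3}
```

The resulting numbers are then fed to the Auto Scaling groups as their desired-capacity values.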
42. User data and Chef get things rolling
•Inside out Chef bootstrapping
–Didn’t feel comfortable running `wget … | bash`
•Custom version of Chef installer
–Version of Chef
–Where to find the Chef servers
–Which role to run
–Which environment (dev, integ, blue, green)
43. Testing the new stuff
[Diagram: green's termination servers, Content Routers, and Processors 1-4 run beside blue's; the integration tests connect to a green termination server directly, and green's processors read from the "inactive" topic.]
•Test customer(s) are canaried
•Integration test suite is run by connecting to a termination server directly
•Tests pass; then we canary real customers
45. Canary customers
•Canary information is stored in ZooKeeper
•Fortunately we dogfood our own tech
•This affords us the ability to use ourselves as canaries for new code
•The inactive processing cluster is set to read from the .inactive topics
•These are the standard Kafka topics with .inactive appended
•The ingestion layer has a watcher on that znode and routes any canaried customer to the .inactive topics
•Ex. regular traffic goes to foo.bar, canary traffic goes to foo.bar.inactive
•When we are ready to test real traffic we mark several customers as canaries and start the monitoring process to determine any issues
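The routing rule described above reduces to a small lookup at the ingestion layer. A minimal sketch, assuming the canary set has already been read from ZooKeeper (the function name and customer IDs are hypothetical):

```python
def route_topic(base_topic, customer_id, canaried):
    """Send a canaried customer's events to the '.inactive' variant
    of the topic; everyone else stays on the standard topic."""
    if customer_id in canaried:
        return base_topic + ".inactive"
    return base_topic

canaried = {"customer-456"}
assert route_topic("foo.bar", "customer-123", canaried) == "foo.bar"
assert route_topic("foo.bar", "customer-456", canaried) == "foo.bar.inactive"
```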
46. Canary customers
[Diagram: sensors → external-service ELB load balancer → event ingestor → Kafka. Regular traffic (e.g., customer 123) goes to the "active" topics and is read by the blue processors; canary traffic (e.g., customer 456) goes to the "inactive" topics and is read by the green processors.]
50. IT tests run
•Integration tests are run
–~3000 tests in total
–Test customer must be “canaried”
•If any tests fail, we triage and determine if it is still possible to move forward
•Testing is only considered done when we are passing 100%—no exceptions!
53. Trust, but verify!
[Diagram: both clusters run against the shared data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3); blue's Processors 1-4 read the "active" topics while green's read the "inactive" topic.]
•Monitor green services
•Verify health of the cluster by inspecting graphical data and log outputs
•Rerun tests with load
55. Logging and error checking
•Every server forwards its relevant logs to Splunk
•Several dashboards have been set up with common things to watch for
•Raw logs are streamed in near real-time and we watch specifically for log-level ERROR
•This is one of our most important steps, as it gives us the most insight into the health of the system as a whole
57. Moving customers over
[Diagram: termination servers and Content Routers from both blue and green now sit behind the external-service ELB load balancer, and Processors 1-4 in both clusters consume the "active" topics.]
•Flip all customers back away from canary
•Activate green cluster
•Event processors and consuming services in blue and green now write to and consume the "active" topics
•We are in a state of active-active for a few minutes
58. Each node in the data processing layer has a watcher on a particular znode which tells the environment whether it is active (use standard Kafka topics) or inactive (append .inactive to the topics)
[Diagram: ingestion feeds Kafka; in the active-active state, Processors 1-4 in both clusters consume the "active" topic while the "inactive" topic drains.]
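The znode-driven switch can be sketched as a small state holder per node. In production a ZooKeeper client watch (e.g. kazoo's DataWatch) would invoke `on_znode_change` whenever the cluster-state znode is updated; that wiring, the class name, and the znode payload format are all assumptions made for illustration.

```python
class TopicSelector:
    """Holds a node's active/inactive state and derives topic names.

    A ZooKeeper watch would call on_znode_change each time the
    cluster-state znode changes value.
    """

    def __init__(self):
        self.state = "inactive"

    def on_znode_change(self, data):
        # In this sketch the znode payload is simply b"active" or b"inactive".
        self.state = data.decode()

    def topic(self, base):
        if self.state == "active":
            return base            # standard Kafka topic
        return base + ".inactive"  # shadow topic for the idle cluster

sel = TopicSelector()
assert sel.topic("foo.bar") == "foo.bar.inactive"
sel.on_znode_change(b"active")   # green, switch to active!
assert sel.topic("foo.bar") == "foo.bar"
```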
59. When we are ready to make the switch, we start by making the new cluster active and enter into an active-active state where both processing clusters are doing work.
This is where it is paramount that code is forward compatible, since two different code bases will be doing work simultaneously.
[Diagram: ingestion feeds Kafka; green is told "switch to active!" and Processors 1-4 in both clusters now consume the "active" topics, leaving the "inactive" topics idle.]
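What forward compatibility can mean in practice: old code tolerates fields it does not know, and new code defaults fields that are not there yet, so both code bases can consume the same stream during active-active. A hedged sketch; the event and field names are invented, not the deck's actual schema.

```python
def parse_event_v1(event):
    """Old (blue) code: read only the fields it knows and ignore the
    rest, so events produced by the new code base still parse."""
    return {"sensor": event["sensor"], "payload": event["payload"]}

def parse_event_v2(event):
    """New (green) code: default the new field when consuming events
    produced by the old code base."""
    return {"sensor": event["sensor"],
            "payload": event["payload"],
            "severity": event.get("severity", "info")}

old_event = {"sensor": "s1", "payload": "a"}
new_event = {"sensor": "s1", "payload": "a", "severity": "high"}
assert parse_event_v1(new_event) == old_event           # v1 ignores the new field
assert parse_event_v2(old_event)["severity"] == "info"  # v2 defaults it
```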
60. However, blue and green are fully partitioned and there is no intercommunication between the clusters. This allows for things like changes in serialization for inter-service communication.
[Diagram: in active-active, each cluster's Processors 1-4 consume their own "active" topics; the clusters never talk to each other.]
61. Flipping the switch
[Diagram: sensors now reach only green's termination servers and Content Routers through the external-service ELB load balancer; green's Processors 1-4 consume the "active" topics while blue's drain the "inactive" topic.]
•We deactivate Blue, which forces termination servers in Blue to fail health checks, and all Blue sensors disconnect
•Blue processors switch to read from the "inactive" topic
•Once all consumers of the "inactive" topic have caught up to the head of the stream, Blue can be decommissioned
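The "caught up to the head" condition boils down to comparing each partition's consumer offset against the topic's end offset. A minimal sketch of that check with plain dicts (against a real Kafka deployment you would compare committed offsets with the broker's end offsets; the function name and numbers here are illustrative):

```python
def caught_up(committed, end_offsets):
    """Blue may be decommissioned only when every partition's consumer
    offset has reached the head (end offset) of the 'inactive' topic."""
    return all(committed.get(partition, 0) >= end
               for partition, end in end_offsets.items())

end = {0: 1500, 1: 1498}
assert not caught_up({0: 1500, 1: 1200}, end)  # partition 1 still draining
assert caught_up({0: 1500, 1: 1498}, end)      # safe to tear blue down
```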
62. Out with the old…
[Diagram: only the green cluster remains: sensors → external-service ELB load balancer → termination servers → Content Routers → "active" topics → Processors 1-4, backed by the shared data plane.]
•Green is now the active cluster
•If we need to roll back code, we have a snapshot of the repository in Amazon S3
•We haven't had to roll back code… yet
65. Half-baked AMIs
We use a process to create "half-baked" AMIs, which speeds up deployments
•JVM (for our Scala code base)
•Common tools and configurations
•Latest updates to make sure patches are up to date
•Build plan is run twice daily
[Diagram: a blue server is baked into a half-baked AMI, stored via Amazon S3; the Auto Scaling group then launches the green servers from that AMI.]
67. How code graduates - Development
Commit on main → development apt repo → auto-deploy of changed roles → development cluster
68. How code graduates - Production
Create release-X.X.X or hotfix-X.X.X branches → integration apt repo → integration cluster
Sync specified packages for integ: integration apt repo → production apt repo → new production cluster
The same exact binary flows through every stage.
75. Data plane migrations
•Migrations applied to the database are forward only
•We have past experience with two-way migrations, but the costs outweigh the benefits
•Code must be forward compatible in case rollbacks are necessary
•Database schemas are only modified via migrations, even in development and integration environments
•We use an in-house migration service (based on Flyway) to parallelize the process
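A forward-only runner in the spirit of Flyway can be sketched in a few lines: versioned migrations applied in order, each recorded so it never runs twice, with only additive changes so rolled-back code keeps working. This is an illustrative sketch using SQLite, not the in-house service; the table and column names are invented.

```python
import sqlite3

# Forward-only: migrations are ordered by version and only ever applied,
# never reverted; additive changes keep old code working after a rollback.
MIGRATIONS = [
    (1, "CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)"),
    (2, "ALTER TABLE events ADD COLUMN severity TEXT"),  # additive, nullable
]

def migrate(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    current = row[0] or 0
    for version, sql in MIGRATIONS:
        if version > current:  # skip migrations that have already run
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    return current  # schema version before this run

conn = sqlite3.connect(":memory:")
migrate(conn)   # applies both migrations
migrate(conn)   # re-running is a no-op
cols = [c[1] for c in conn.execute("PRAGMA table_info(events)")]
assert cols == ["id", "payload", "severity"]
```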
76. Final Thoughts
•Blue-green deployments can be done in many ways
•Our requirement of never losing customer data made this the best solution for us
•The automation and tooling around our deployment system were built over many months and were a lot of work (built by 2 people – Hi Dennis!)
•But it is completely worth it, knowing we have a very reliable, fault-tolerant system