© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
November 13, 2014 | Las Vegas 
APP307 - Leveraging the Cloud with a Blue-Green Deployment Architecture
Jim Plush, Sr. Director of Engineering, CrowdStrike - @jimplush
Sean Berry, Principal Software Engineer, CrowdStrike - @schleprachaun
About us
•Founded in September 2011 
•~150 employees 
•Detection/prevention 
–Advanced cyber threats 
–Real-time detection 
–Real-time analytics 
Cybersecurity startup
Published experts
Event Stream Processing 
Sensor 
Targeted Malicious 
Malware 
The “CLOUD” 
{"date":"11/14/2014 08:03", "path": "C:\\WINDOWS\\Programs\\Word.exe", "id": 49, "parentId": 48}
{"date":"11/14/2014 08:03", "path": "C:\\WINDOWS\\System32\\cmd.exe", "id": 50, "parentId": 49}
{"date":"11/14/2014 08:03", "path": "C:\\WINDOWS\\Programs\\Word.exe", "id": 51, "parentId": 50}
DNS Lookup
{"date":"11/14/2014 08:03", "dns": "badapple.cc", "id": 52, "parentId": 51}
TCP Connect
{"date":"11/14/2014 08:03", "tcp_connect": "10.10.10.10", "id": 53, "parentId": 51}
FTP Download
{"date":"11/14/2014 08:03", "download": "10.10.10.10/badstuff.exe", "id": 54, "parentId": 51}
Document Exfiltration
{"date":"11/14/2014 08:03", "scp": "C:\\Documents\\TradeSecrets.doc", "id": 55, "parentId": 54}
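The id/parentId fields are what let a single event be traced back to the process that started the chain. A minimal sketch of that walk, assuming the events have already been parsed into dictionaries; the `what` field and the `ancestry` helper are hypothetical names used only for illustration, not part of the sensor format shown on the slide.

```python
from typing import Dict, List

# Events shaped like the examples above, reduced to the fields the walk needs.
# "what" is an illustrative summary field, not a real sensor attribute.
EVENTS = [
    {"id": 49, "parentId": 48, "what": "Word.exe"},
    {"id": 50, "parentId": 49, "what": "cmd.exe"},
    {"id": 51, "parentId": 50, "what": "Word.exe"},
    {"id": 52, "parentId": 51, "what": "dns badapple.cc"},
    {"id": 54, "parentId": 51, "what": "download 10.10.10.10/badstuff.exe"},
    {"id": 55, "parentId": 54, "what": "scp TradeSecrets.doc"},
]


def ancestry(events: List[dict], event_id: int) -> List[dict]:
    """Follow parentId links from one event back to the earliest known ancestor."""
    by_id: Dict[int, dict] = {event["id"]: event for event in events}
    chain = []
    seen = set()  # guards against a malformed cycle in the input
    current = by_id.get(event_id)
    while current is not None and current["id"] not in seen:
        seen.add(current["id"])
        chain.append(current)
        current = by_id.get(current["parentId"])
    return list(reversed(chain))  # oldest ancestor first


if __name__ == "__main__":
    for depth, event in enumerate(ancestry(EVENTS, 55)):
        print("  " * depth + "{}: {}".format(event["id"], event["what"]))
```

Starting from the exfiltration event (id 55), the walk stops at id 49 because its parent (48) is not in the sample.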
Tactical UI
[Architecture diagram. Components shown: sensors, external service Elastic Load Balancing load balancer, termination servers, content routers, Processor 1 and Processor 2 (each running several services), UI and API services, data ingestion. Data plane: Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3.]
•Fortune 500, Think Tanks, Non-Profits 
•100K+ events per second 
–Expected to hit 500K EPS by end of 2015 
•Each enterprise customer can generate 2-4 TBs of data per day 
•Microservice architecture 
•Polyglot environment 
High scale, big data
Our tech stack is complicated
…but possible because of AWS
Motivation
Solving for the problems 
•OMG, all servers need to be patched?? 
•I’m afraid to restart that service; it’s been running for 2 years 
•Large rolling restarts 
•Deployment fear 
–Friday night deploys 
•B/G for event processing?
Our primary objectives for deployments 
•Minimize customer impact 
–Customers should have no indication that anything has changed 
•Maximize engineers’ weekends
–Avoid burnout 
•Reduce dependencies of rollouts 
–Everything goes out together, 50+ services, 1000+ VMs
Leveraging AWS 
•Programmable data centers 
•Nodes are ephemeral 
•It should be easier to re-create an environment than to fix it 
—Think like the cloud
What is blue-green? 
[Diagram: a router in front of two stacks, Application v1 and Application v2, each with its own web server and app server, both backed by a shared database.]
What is blue-green? 
•Full cluster BG 
–Everything goes out together 
–Indiana Jones: “idol switch” 
•App-based BG 
–Each app or team controls their own blue-green deployments
The data plane can’t blue-green all the things
[Diagram: the blue cluster and the green cluster over a single data plane: Kafka, DynamoDB, Redis, Amazon RDS pgsql, Amazon Redshift, Amazon Glacier, Amazon S3.]
When do we deploy? 
•Teams deploy end-of-sprint releases together
•Hotfixes and upgrades are performed frequently via rolling-restart deployments
•Early on, deployments took an entire day
–Lack of automation 
•Deploys today generally take 45 minutes 
–Everyone has run a deployment
Sustaining engineer 
•Every team member including QA has run deployments 
•Builds confidence, understanding, and redundancy 
•Ensures documentation is up to date and everything that can be automated is automated
Sustaining engineer badge of honor: a shirt after their tour of duty
Deployment day 
•Apt repo synchronized and locked down 
•Data plane migrations applied 
•“Green” cluster is launched (1000s of machines) 
•IT tests run 
•Canary customers 
•Logging and error checks 
•Active-active 
•“Blue” marked as inactive, decommissioned
Keys to success 
Pro tip: It’s not just flipping load balancers
Keys to success 
Automate all the things 
•Junior developers should be able to run your deploy system
Keys to success 
Instrumentation & Metrics 
https://github.com/codahale/metrics 
https://github.com/rcrowley/go-metrics
Keys to success 
Use a provisioning system 
•Chef 
•Puppet 
•Salt 
•baked AMIs
Keys to success 
Live integration / regression test suites 
Test 
System 
Send deterministic input values 
Verify processed state
Keys to success 
Canary Customers 
V1 App 
V2 App
Keys to success 
Feature Flags
Keys to success 
Unified app requirements
Keys to success 
Deployment History
“Thank God we have blue-green”
–every team member
Implementation
How we blue-green
Elevator pitch on Kafka 
•Distributed commit log 
•Similar to a message queue 
•Allows for replaying messages from earlier in the stream in case of failure
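The replay property is the part the rest of the deployment process leans on. As a rough illustration only (an in-memory toy, not Kafka's client API), a commit log keeps every record at a fixed offset, and a consumer that fails can resume from whatever offset it last recorded. The `CommitLog` class and its method names are made up for this sketch.

```python
class CommitLog:
    """Toy append-only log: records keep their offsets and are never removed on read."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return every record at or after `offset`."""
        return list(self._records[offset:])


log = CommitLog()
for event in ("e0", "e1", "e2", "e3"):
    log.append(event)

# A consumer whose last recorded position was offset 2 resumes from there
# after a failure; the earlier records are still in the log if it needs them.
replayed = log.read_from(2)
```

Unlike a traditional queue, reading does not consume the records, which is why two independent clusters can read the same stream.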
It all starts with a running cluster
[Diagram: sensors, external service ELB load balancer, termination servers, content routers, active topics, and blue Processors 1-4, above the data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3).]
•Blue is running; normal operation
•Content Routers are writing to the “active” topics in Kafka
•Blue processors read from the “active” topics
Main management page for blue-green
Launching new cluster
[Diagram: sensors and the external service ELB load balancer in front of the blue cluster (termination servers, content routers, Processors 1-4) on the active topics, and the newly launched green cluster (termination servers, content routers, Processors 1-4) with an inactive topic, above the data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3).]
•Green cluster is launched
•Termination servers are kept out of the ELB load balancer by failing health checks
•Content Routers write to the “active” topics
•Processors in green read from the “inactive” topics
Sizing the new cluster
Getting the size right 
•Sizing of our autoscale groups is determined programmatically 
–Admin page allows for setting min / max
–Script determines appropriate desired-capacity based on running cluster 
•Launching is then as simple as updating the autoscale groups to the new sizes 
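from subprocess import Popen, PIPE


def current_counts(region='us-east-1'):
    """Return {Auto Scaling group name: current instance count}."""
    # Arguments are passed as a list: with shell=False a single command
    # string would be treated as the name of the program to run.
    proc = Popen(
        ["as-describe-auto-scaling-groups",
         "--region", region,
         "--max-records=600"],
        shell=False, stdout=PIPE, stderr=PIPE,
        universal_newlines=True)  # decode output to text
    out, err = proc.communicate()
    if err:
        raise Exception(err)
    counts = {}
    for line in out.splitlines():
        # Only the group summary rows carry the capacity columns.
        if "AUTO-SCALING-GROUP" not in line:
            continue
        parts = line.split()
        group = parts[1]
        current = parts[-2]
        counts[group] = int(current)
    return counts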
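Once the running counts are known, picking the size of the new cluster is mostly a clamping exercise. A minimal sketch, assuming the admin-page limits are available as a dictionary of (min, max) pairs; the `desired_capacities` function, its parameters, and the group names are hypothetical, not the actual sizing script.

```python
def desired_capacities(running_counts, limits):
    """Map each group to a desired capacity clamped to its (min, max) limits.

    running_counts: {group_name: instances currently running}
    limits: {group_name: (min_size, max_size)}; groups without limits keep
            their running count unchanged.
    """
    desired = {}
    for group, count in running_counts.items():
        low, high = limits.get(group, (count, count))
        desired[group] = max(low, min(count, high))
    return desired


# Illustrative group names and numbers only.
running = {"content-router": 5, "processor-1": 50, "processor-2": 1}
limits = {"content-router": (2, 10), "processor-1": (2, 20), "processor-2": (3, 10)}
sizes = desired_capacities(running, limits)
```

The result could then be applied to the new cluster's groups with whatever Auto Scaling tooling is in use.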
Tuning size before we launch
Bootstrapping
User data and Chef get things rolling 
•Inside out Chef bootstrapping 
–Didn’t feel comfortable running `wget … | bash` 
•Custom version of Chef installer 
–Version of Chef 
–Where to find the Chef servers 
–Which role to run 
–Which environment (dev, integ, blue, green)
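The four settings listed above are small enough to pass to an instance as a single user-data document. A sketch of how such a document might be assembled, assuming JSON as the format; the key names, the `build_user_data` helper, and the sample values are all placeholders, since the slides do not show the actual installer's input.

```python
import json

# The environments named on the slide.
ALLOWED_ENVIRONMENTS = ("dev", "integ", "blue", "green")


def build_user_data(chef_version, chef_servers, role, environment):
    """Serialize the four bootstrap settings as a JSON user-data document."""
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError("unknown environment: {}".format(environment))
    return json.dumps(
        {
            "chef_version": chef_version,
            "chef_servers": list(chef_servers),
            "role": role,
            "environment": environment,
        },
        sort_keys=True,
    )


# Placeholder values for illustration.
user_data = build_user_data(
    "11.16.4", ["chef.example.internal"], "processor", "green")
```

Rejecting unknown environments up front keeps a typo from launching nodes that belong to neither cluster.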
Testing the new stuff
[Diagram: sensors reach the blue termination servers through the external service ELB load balancer, while integration tests connect to the green termination servers. Blue and green each show content routers and Processors 1-4, with active and inactive topics above the data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3).]
•Test customer(s) are canaried
•Integration test suite is run by connecting to a termination server directly
•Tests pass; then we canary real customers
Canary customers
•Canary information is stored in ZooKeeper
•Fortunately we dogfood our own tech
•This affords us the ability to use ourselves as canaries for new code
•The inactive processing cluster is set to read from the .inactive topics
–The standard Kafka topics with .inactive appended
•The ingestion layer has a watcher on that znode and routes any canaried customer to the .inactive topics
–Ex. regular traffic goes to foo.bar, canary traffic goes to foo.bar.inactive
•When we are ready to test real traffic, we mark several customers as canaries and start the monitoring process to determine any issues
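The routing rule itself is small: a canaried customer's events go to the same topic name with .inactive appended. A minimal sketch, assuming the canary list has already been read from ZooKeeper into a set; the `topic_for` function and the customer ids are illustrative, not the ingestion layer's real code.

```python
INACTIVE_SUFFIX = ".inactive"


def topic_for(customer_id, base_topic, canaries):
    """Return the Kafka topic an event should be written to.

    canaries: set of customer ids currently marked as canaries.
    """
    if customer_id in canaries:
        return base_topic + INACTIVE_SUFFIX
    return base_topic


canaries = {"123", "456"}  # illustrative customer ids
canary_topic = topic_for("123", "foo.bar", canaries)
regular_topic = topic_for("789", "foo.bar", canaries)
```

Because the decision is made per event at ingestion, moving a customer in or out of the canary set needs no restart, only an update to the watched list.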
Canary customers
Canary customers
[Diagram: sensors connect through the external service ELB load balancer to the event ingestor. Regular traffic goes to the active topics in Kafka, read by the blue processors; canary traffic (Customer 123, Customer 456) goes to the inactive topics, read by the green processors.]
Let’s canary some customers
That was easy
Testing
IT tests run 
•Integration tests are run 
–~3000 tests in total 
–Test customer must be “canaried” 
•If any tests fail, we triage and determine if it is still possible to move forward 
•Testing is only done when we are passing 100%—no exceptions!
Sean is mad - we have work to do
Sean is happy - so we all are happy
Trust, but verify!
[Diagram: sensors and the external service ELB load balancer in front of the termination servers; blue and green each show content routers and Processors 1-4, with active and inactive topics above the data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3).]
•Monitor green services
•Verify health of the cluster by inspecting graphical data and log outputs
•Rerun tests with load
Monitoring
Logging and error checking
•Every server forwards its relevant logs to Splunk 
•Several dashboards have been set up with common things to watch for 
•Raw logs are streamed in near real-time and we watch specifically for log-level ERROR 
•This is one of our most important steps, as it gives us the most insight into the health of the system as a whole
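Watching for log-level ERROR amounts to a filter over the incoming stream. The sketch below shows the idea on plain text lines; it does not use Splunk, and the log line layout, the `error_lines` function, and the sample messages are assumptions made for illustration.

```python
import re

# Matches the first log-level token on a line (assumed layout: timestamp, level, message).
LEVEL_PATTERN = re.compile(r"\b(DEBUG|INFO|WARN|ERROR)\b")


def error_lines(lines):
    """Yield only the lines whose first log-level token is ERROR."""
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match and match.group(1) == "ERROR":
            yield line


SAMPLE = [
    "2014-11-14 08:03:01 INFO processor-1 started",
    "2014-11-14 08:03:02 ERROR processor-2 failed to decode event",
    "2014-11-14 08:03:03 WARN slow consumer; last ERROR count was 0",
]
errors = list(error_lines(SAMPLE))
```

Keying on the first level token avoids counting lines that merely mention the word ERROR in their message.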
Logging / error checking
Moving customers over
[Diagram: sensors and the external service ELB load balancer in front of blue and green termination servers, content routers, and Processors 1-4; every topic shown is an active topic. Data plane: Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3.]
•Flip all customers back away from canary
•Activate green cluster
•Event processors and consuming services in blue and green now write to and consume the “active” topics
•We are in a state of active-active for a few minutes
Each node in the data processing layer has a watcher on a particular znode which tells the environment whether it is active (use standard Kafka topics) or inactive (append .inactive to the topics)
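The effect of that watcher can be pictured as a small piece of state that decides which topic names a node uses. A sketch under stated assumptions: the znode is assumed to hold the bytes `active` or `inactive`, and `TopicSelector` with its `on_state_change` callback is a hypothetical stand-in for whatever a real ZooKeeper client watch would call.

```python
class TopicSelector:
    """Tracks whether this environment is active and names topics accordingly."""

    def __init__(self, base_topics, active=False):
        self.base_topics = list(base_topics)
        self.active = active

    def on_state_change(self, new_value):
        """Callback a znode watcher could invoke with the node's new contents (bytes)."""
        self.active = new_value.strip() == b"active"

    def topics(self):
        """Standard topic names when active, .inactive-suffixed names otherwise."""
        suffix = "" if self.active else ".inactive"
        return [name + suffix for name in self.base_topics]


selector = TopicSelector(["foo.bar"])          # a freshly launched, inactive environment
before_switch = selector.topics()
selector.on_state_change(b"active")            # the environment is marked active
after_switch = selector.topics()
```

Flipping one znode therefore changes what every node in the environment reads and writes, without redeploying anything.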
[Diagram: Active - active. Ingestion writes to active and inactive topics in Kafka; two sets of Processors 1-4 read from them.]
Active - active
When we are ready to make the switch, we start by making the new cluster active and enter an active-active state where both processing clusters are doing work.
This is where it is paramount that code is forward compatible, since two different code bases will be doing work simultaneously.
[Diagram: ingestion, the Kafka active topic, and both sets of Processors 1-4; callout: “Green, switch to active!”]
However, blue and green are fully partitioned and there is no intercommunication between the clusters. This allows for things like changes in serialization for inter-service communication.
[Diagram: Active - active. Ingestion, the Kafka active topics, and both sets of Processors 1-4.]
Flipping the switch
[Diagram: sensors and the external service ELB load balancer in front of blue and green termination servers, content routers, and Processors 1-4; one cluster is on the inactive topic, the other on the active topics. Data plane: Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3.]
•We deactivate Blue, which forces Termination Servers in Blue to fail health checks, and all Blue sensors disconnect
•Blue processors switch to read from the “inactive” topic
•Once all consumers of the “inactive” topic have caught up to the head of the stream, Blue can be decommissioned
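The decommission condition described above, every consumer caught up to the head of the stream, can be expressed as an offset comparison per partition. A minimal sketch on plain dictionaries, assuming the committed and log-end offsets have already been fetched by some other means; the function names and consumer group names are hypothetical.

```python
def caught_up(committed, head):
    """True when every partition's committed offset has reached the log end.

    committed: {partition: next offset the consumer group will read}
    head: {partition: offset one past the last record written}
    """
    return all(committed.get(partition, 0) >= end for partition, end in head.items())


def safe_to_decommission(groups, head):
    """True only when every consumer group has caught up on every partition."""
    return all(caught_up(committed, head) for committed in groups.values())


head = {0: 100, 1: 250}
groups = {
    "processor-1": {0: 100, 1: 250},
    "processor-2": {0: 100, 1: 249},  # one record behind on partition 1
}
still_draining = safe_to_decommission(groups, head)
groups["processor-2"][1] = 250
drained = safe_to_decommission(groups, head)
```

Waiting for this condition is what keeps events that were already in flight on the old cluster from being dropped when it is torn down.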
Out with the old…
[Diagram: sensors and the external service ELB load balancer in front of a single cluster: termination servers, content routers, active topics, and Processors 1-4, above the data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3).]
•Green is now the active cluster
•If we need to roll back code, we have a snapshot of the repository in Amazon S3
•We haven’t had to roll back code… yet
Easing the pain
Bootstrapping faster
Half-baked AMIs 
We use a process to create “half-baked” AMIs, which speed up deployments 
•JVM (for our Scala code base) 
•Common tools and configurations 
•Latest updates to make sure patches are up to date 
•Build plan is run twice daily 
[Diagram: an Auto Scaling group launching green servers from the half-baked AMI; also shown are a blue server, an AMI, and Amazon S3.]
Getting code ready
How code graduates - Development
Commit on main 
Development apt repo 
Auto deploy changed 
roles 
Development cluster
How code graduates - Production
Create release-X.X.X or 
hotfix-X.X.X branches 
Integration apt repo 
Production apt repo 
Same exact 
Binary 
Integration cluster 
Integration apt repo 
Sync specified 
Packages for integ 
New production cluster
Choosing what goes out
Viewing Debian details
Integration is synced
Integration is synced
Production is synced from Integ
Updating the data plane
Data plane migrations 
•Migrations applied to the database are forward only 
•We have past experience with two-way migrations, but the costs outweigh the benefits
•Code must be forward compatible in case rollbacks are necessary 
•Database schemas are only modified via migrations even in development and integration environments 
•We use an in-house migration service (based on Flyway) to parallelize the process
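Forward-only means the tooling only ever applies versions newer than the one recorded, and has no "down" step at all. A rough sketch of that rule using SQLite from the standard library as a stand-in database; this is neither Flyway nor the in-house service, and the `schema_version` table, the `migrate` function, and the sample statements are illustrative assumptions.

```python
import sqlite3

# Ordered, versioned, forward-only changes (illustrative statements).
MIGRATIONS = [
    (1, "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customers ADD COLUMN region TEXT"),
]


def migrate(conn, migrations):
    """Apply every migration newer than the recorded version, in order."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    current = row[0] or 0  # MAX() is NULL on an empty table
    applied = []
    for version, statement in sorted(migrations):
        if version <= current:
            continue  # already applied; there is no way to go backwards
        conn.execute(statement)
        conn.execute(
            "INSERT INTO schema_version (version) VALUES (?)", (version,))
        applied.append(version)
    conn.commit()
    return applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn, MIGRATIONS)
second_run = migrate(conn, MIGRATIONS)  # nothing left to apply
```

Since the schema never moves backwards, a code rollback has to work against the newer schema, which is why the forward-compatibility requirement sits next to this rule.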
Final Thoughts 
•Blue-green deployments can be done in many ways
•Our requirement of never losing customer data made this the best solution for us
•The automation and tooling around our deployment system were built over many months and were a lot of work (built by 2 people – Hi Dennis!)
•But it is completely worth it, knowing we have a very reliable, fault-tolerant system
Thank you
http://bit.ly/awsevals
Jim: @jimplush
Sean: @schleprachaun

Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

(APP307) Leverage the Cloud with a Blue/Green Deployment Architecture | AWS re:Invent 2014

  • 1. © 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. November 13, 2014 | Las Vegas APP307 - Leveraging the Cloud with a Blue-Green Deployment Architecture Jim Plush, Sr. Director of Engineering, CrowdStrike - @jimplush Sean Berry, Principal Software Engineer, CrowdStrike - @schleprachaun
  • 3. •Founded in September 2011 •~150 employees •Detection/prevention –Advanced cyber threats –Real-time detection –Real-time analytics Cybersecurity startup
  • 5. Event Stream Processing Sensor Targeted Malicious Malware The “CLOUD”
       {"date": "11/14/2014 08:03", "path": "C:\\WINDOWS\\Programs\\Word.exe", "id": 49, "parentId": 48}
       {"date": "11/14/2014 08:03", "path": "C:\\WINDOWS\\System32\\cmd.exe", "id": 50, "parentId": 49}
       {"date": "11/14/2014 08:03", "path": "C:\\WINDOWS\\Programs\\Word.exe", "id": 51, "parentId": 50}
       DNS Lookup: {"date": "11/14/2014 08:03", "dns": "badapple.cc", "id": 52, "parentId": 51}
       TCP Connect: {"date": "11/14/2014 08:03", "tcp_connect": "10.10.10.10", "id": 53, "parentId": 51}
       FTP Download: {"date": "11/14/2014 08:03", "download": "10.10.10.10/badstuff.exe", "id": 54, "parentId": 51}
       Document Exfiltration: {"date": "11/14/2014 08:03", "scp": "C:\\Documents\\TradeSecrets.doc", "id": 55, "parentId": 54}
  • 6.
  • 8. Data ingestion Service A Service A UI Service A Service A API Sensors Termination server Termination server Termination server Termination server Kafka DynamoDB Redis Amazon RDS Amazon Redshift Amazon Glacier Amazon S3 Data plane Sensors Sensors External service Elastic Load Balancing load balancer Content Router Content Router Service A Service A Processor 1 Service A Service A Processor 2
  • 9. •Fortune 500, Think Tanks, Non-Profits •100K+ events per second –Expected to hit 500K EPS by end of 2015 •Each enterprise customer can generate 2-4 TBs of data per day •Microservice architecture •Polyglot environment High scale, big data
  • 10. Our tech stack is complicated
  • 13. Solving for the problems •OMG, all servers need to be patched?? •I’m afraid to restart that service; it’s been running for 2 years •Large rolling restarts •Deployment fear –Friday night deploys •B/G for event processing?
  • 14. Our primary objectives for deployments •Minimize customer impact –Customers should have no indication that anything has changed •Maximize engineers’ weekends –Avoid burnout •Reduce dependencies of rollouts –Everything goes out together, 50+ services, 1000+ VMs
  • 15. Leveraging AWS •Programmable data centers •Nodes are ephemeral •It should be easier to re-create an environment than to fix it —Think like the cloud
  • 16. What is blue-green? Router Web server App server Application v1 Shared database Web server App server Application v2 x x
  • 17. What is blue-green? •Full cluster BG –Everything goes out together –Indiana Jones: “idol switch” •App-based BG –Each app or team controls their own blue-green deployments
  • 18. Data plane The data plane can’t blue-green all the things Blue cluster Green cluster Kafka DynamoDB Redis Amazon RDS pgsql Amazon Redshift Amazon Glacier Amazon S3
  • 19. When do we deploy? •Teams deploy end of sprint releases together •Hot-fix/Upgrades are performed via rolling restart deployments frequently •Early on deployments took an entire day –Lack of automation •Deploys today generally take 45 minutes –Everyone has run a deployment
  • 20. Sustaining engineer •Every team member including QA has run deployments •Builds confidence, understanding, and redundancy •Ensures documentation is up to date and all things are automated that can be. Sustaining engineer badge of honor shirt after their tour of duty
  • 21. Deployment day •Apt repo synchronized and locked down •Data plane migrations applied •“Green” cluster is launched (1000s of machines) •IT tests run •Canary customers •Logging and error checks •Active-active •“Blue” marked as inactive, decommissioned
  • 22. Keys to success Pro tip: It’s not just flipping load balancers
  • 23. Keys to success Automate all the things •jr devs should be able to run your deploy system
  • 24. Keys to success Instrumentation & Metrics https://github.com/codahale/metrics https://github.com/rcrowley/go-metrics
  • 25. Keys to success Use a provisioning system •Chef •Puppet •Salt •baked AMIs
  • 26. Keys to success Live integration / regression test suites Test System Send deterministic input values Verify processed state
  • 27. Keys to success Canary Customers V1 App V2 App
  • 28. Keys to success Feature Flags
  • 29. Keys to success Unified app requirements
  • 30. Keys to success Deployment History
  • 31. “Thank God we have blue-green” - every team member
  • 34. Elevator pitch on Kafka •Distributed commit log •Similar to a message queue •Allows for replaying messages from earlier in the stream in case of failure
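The replay property in the elevator pitch can be illustrated with a toy in-memory log. This is a sketch only and not Kafka's API: `CommitLog`, `append`, and `read_from` are hypothetical names chosen for the example.

```python
class CommitLog:
    """Toy append-only log; each record gets a monotonically increasing offset."""

    def __init__(self):
        self._records = []

    def append(self, record):
        """Store a record and return the offset it was written at."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset):
        """Return (offset, record) pairs starting at the given offset."""
        return list(enumerate(self._records[offset:], start=offset))


log = CommitLog()
for event in ("process_start", "dns_lookup", "tcp_connect"):
    log.append(event)

# A consumer that failed after committing offset 0 resumes from offset 1
# and sees the remaining records again, in their original order.
replayed = log.read_from(1)
```

Because records stay in the log after being read, a restarted or newly launched consumer can pick up from any earlier offset, which is the behaviour the later slides rely on when blue drains the inactive topics.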
  • 35. Kafka DynamoDB Redis Amazon RDS Amazon Redshift Amazon Glacier Amazon S3 Data plane Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Sensors Termination server Termination server Termination server Termination server Content Router Content Router Sensors •Blue is running; normal operation •Content Routers are writing to the “active” topics in Kafka •Blue processors read from the “active” topics Sensors Active topic Active topic External service ELB load balancer It all starts with a running cluster
  • 36. Main management page for blue-green
  • 37. Kafka DynamoDB Redis Amazon RDS Amazon Redshift Amazon Glacier Amazon S3 Data plane External service ELB load balancer Sensors Termination server Termination server Termination server Termination server Termination server Termination server Termination server Termination server Content Router Content Router Sensors Sensors Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Active topic Launching new cluster Active topic Active topic Inactive Topic Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Content Router Content Router •Green cluster is launched •Termination servers are kept out of the ELB load balancer by failing health checks •Content Routers write to the “active” topics •Processors in green read from the “inactive” topics
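Keeping green termination servers out of the load balancer by failing health checks might look roughly like the following. This is a minimal sketch under the assumption that the load balancer treats only an HTTP 200 response as healthy; `health_status` and its parameters are hypothetical names, not the speakers' code.

```python
from http import HTTPStatus


def health_status(cluster_active, service_ok=True):
    """Return the HTTP status a termination server reports to the load balancer.

    Assumption: the load balancer only routes to instances answering 200.
    """
    if cluster_active and service_ok:
        return HTTPStatus.OK
    # Deliberately fail the check so no sensor traffic is routed here.
    return HTTPStatus.SERVICE_UNAVAILABLE


green_before_switch = health_status(cluster_active=False)
green_after_switch = health_status(cluster_active=True)
```

Tying the health check to the cluster's active flag means the same mechanism later removes blue from rotation without touching the load balancer configuration.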
  • 38. Sizing the new cluster
  • 39. Getting the size right •Sizing of our autoscale groups is determined programmatically –Admin page allows for setting min / max –Script determines appropriate desired-capacity based on running cluster •Launching is then as simple as updating the autoscale groups to the new sizes

        from subprocess import PIPE, Popen

        def current_counts(region='us-east-1'):
            # List every Auto Scaling group via the legacy Auto Scaling CLI.
            # Arguments are passed as a list because shell=False.
            proc = Popen(
                ["as-describe-auto-scaling-groups",
                 "--region", region,
                 "--max-records=600"],
                shell=False, stdout=PIPE, stderr=PIPE,
                universal_newlines=True)
            out, err = proc.communicate()
            if err:
                raise Exception(err)
            counts = {}
            for line in out.splitlines():
                # Only rows that describe a group carry the counts.
                if "AUTO-SCALING-GROUP" not in line:
                    continue
                parts = line.split()
                group = parts[1]
                current = parts[-2]  # second-to-last column, as on the slide
                counts[group] = int(current)
            return counts
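One way the "script determines appropriate desired-capacity" step could work is to take the running cluster's counts and clamp each to the min / max set on the admin page. The sketch below is an assumption about that logic, not the speakers' script; `desired_capacities`, the group names, and the default limits are hypothetical.

```python
def desired_capacities(current_counts, limits, default_limits=(1, 1000)):
    """Map each running group's instance count to a desired capacity.

    current_counts: {group_name: running_instances}
    limits: {group_name: (min_size, max_size)} as set on an admin page
    """
    desired = {}
    for group, running in current_counts.items():
        low, high = limits.get(group, default_limits)
        # Clamp so a bad reading can never scale a group outside its bounds.
        desired[group] = max(low, min(running, high))
    return desired


sizes = desired_capacities(
    {"processor-1": 40, "content-router": 3, "termination": 250},
    {"processor-1": (10, 100), "content-router": (4, 20), "termination": (50, 200)},
)
```

Sizing green from what blue is actually running, rather than from a static number, lets the new cluster start at a capacity that matches current load.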
  • 40. Tuning size before we launch
  • 42. User data and Chef get things rolling •Inside out Chef bootstrapping –Didn’t feel comfortable running `wget … | bash` •Custom version of Chef installer –Version of Chef –Where to find the Chef servers –Which role to run –Which environment (dev, integ, blue, green)
  • 43. Testing the new stuff External service ELB load balancer Sensors Termination server Termination server Termination server Termination server Sensors Active topic Active topic Kafka DynamoDB Redis Amazon RDS Amazon Redshift Amazon Glacier Amazon S3 Data plane Termination server Termination server Termination server Termination server Integration tests Active topic Inactive Topic Content Router Content Router Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Content Router Content Router •Test customer(s) are canaried •Integration test suite is run by connecting to a termination server directly •Tests pass; then we canary real customers
  • 45. Canary customers •Canary information is stored in ZooKeeper •Fortunately we dogfood our own tech •This affords us the ability to use ourselves as canaries for new code •The inactive processing cluster is set to read from the .inactive topics (the standard Kafka topics with .inactive appended) •The ingestion layer has a watcher on that znode and routes any canaried customer to the .inactive topics •Ex. regular traffic goes to foo.bar, canary traffic goes to foo.bar.inactive •When we are ready to test real traffic we mark several customers as canaries and start the monitoring process to determine any issues
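The ingestion-side routing described on this slide can be sketched in a few lines. The `foo.bar` / `foo.bar.inactive` naming comes from the slide; the function name, the customer IDs, and holding the canary set in a plain Python set (rather than reading it from a ZooKeeper znode) are assumptions for illustration.

```python
INACTIVE_SUFFIX = ".inactive"


def topic_for(customer_id, base_topic, canaries):
    """Pick the Kafka topic an event should be written to."""
    if customer_id in canaries:
        # Canaried customers are processed by the not-yet-active cluster.
        return base_topic + INACTIVE_SUFFIX
    return base_topic


# Hypothetical IDs; in the talk this set lives in ZooKeeper.
canaries = {"customer-123"}
regular = topic_for("customer-456", "foo.bar", canaries)
canary = topic_for("customer-123", "foo.bar", canaries)
```

Since the decision is made per event at write time, adding or removing a customer from the canary set takes effect on the next event without restarting anything.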
  • 46. Canary customers Sensors External service ELB load balancer Event ingestor Kafka Green Processors Inactive Topic Regular Traffic Active topic Blue Processors Active topic Inactive Topic Canary Traffic Customer 123 Customer 456
  • 47. Let’s canary some customers
  • 50. IT tests run •Integration tests are run –~3000 tests in total –Test customer must be “canaried” •If any tests fail, we triage and determine if it is still possible to move forward •Testing is only done when we are passing 100%—no exceptions!
  • 51. Sean is mad - we have work to do
  • 52. Sean is happy - so we all are happy
  • 53. Kafka DynamoDB Redis Amazon RDS Amazon Redshift Amazon Glacier Amazon S3 Data plane Trust, but verify! Sensors Termination server Termination server Termination server Termination server Sensors Active Topic Active Topic Inactive Topic Sensors External service ELB load balancer Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Content Router Content Router Inactive Topic Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 •Monitor green services •Verify health of the cluster by inspecting graphical data and log outputs •Rerun tests with load
  • 55. Logging and error checking •Every server forwards its relevant logs to Splunk •Several dashboards have been set up with common things to watch for •Raw logs are streamed in near real-time and we watch specifically for log-level ERROR •This is one of our most important steps, as it gives us the most insight into the health of the system as a whole
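The talk does this watching in Splunk. As a stand-in for illustration only, the sketch below swaps in a plain Python filter that counts ERROR-level lines per host. The `host LEVEL message` line format, the host names, and `error_counts` are all assumptions.

```python
from collections import Counter


def error_counts(lines):
    """Count ERROR-level lines per host, assuming 'host LEVEL message' lines."""
    counts = Counter()
    for line in lines:
        parts = line.split(None, 2)
        if len(parts) >= 2 and parts[1] == "ERROR":
            counts[parts[0]] += 1
    return counts


sample = [
    "green-proc-1 INFO started consumer",
    "green-proc-2 ERROR failed to deserialize event",
    "green-proc-2 ERROR failed to deserialize event",
    "green-router-1 WARN slow write",
]
errors = error_counts(sample)
```

Grouping by host makes it easier to tell a single bad node from a problem that affects the whole green cluster.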
  • 57. Moving customers over Termination server Termination server Termination server Termination server Termination server Termination server Termination server Termination server Sensors Sensors Sensors External service ELB load balancer Kafka DynamoDB Redis Amazon RDS Amazon Redshift Amazon Glacier Amazon S3 Data plane Active topic Active topic Content Router Content Router Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Content Router Content Router Active topic Active topic •Flip all customers back away from canary •Activate green cluster •Event processors and consuming services in blue and green now write to and consume the “active” topics •We are in a state of active-active for a few minutes
  • 58. Each node in the data processing layer has a watcher on a particular znode which tells the environment whether it is active (use standard Kafka topics) or inactive (append .inactive to the topics) Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Active Topic Kafka Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Active-active Inactive Topic Ingestion
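On the processor side, the znode watcher described on this slide could be modelled as a small object that flips between topic names when the znode changes. This is a sketch without a real ZooKeeper client: `TopicResolver`, `on_znode_change`, and the `b"active"` / `b"inactive"` payloads are hypothetical.

```python
class TopicResolver:
    """Tracks whether this cluster is active and resolves topic names."""

    def __init__(self, active=False):
        self.active = active

    def on_znode_change(self, data):
        """Callback a ZooKeeper watcher would invoke with the znode's bytes.

        Assumption: the znode holds b"active" or b"inactive".
        """
        self.active = data.strip() == b"active"

    def resolve(self, base_topic):
        """Return the topic this node should consume right now."""
        return base_topic if self.active else base_topic + ".inactive"


green = TopicResolver()
before = green.resolve("foo.bar")
green.on_znode_change(b"active")  # the cluster is switched to active
after = green.resolve("foo.bar")
```

Because every node resolves topic names through the same flag, switching a whole cluster between active and inactive is a single znode write.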
  • 59. Inactive Topic Active topic When we are ready to make the switch, we start by making the new cluster active and enter into an active-active state where both processing clusters are doing work. Kafka Green, switch to active! Active Topic This is where it is paramount that code is forward compatible, since two different code bases will be doing work simultaneously. Active-active Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Ingestion
  • 60. However, blue and green are fully partitioned and there is no intercommunication between the clusters. This allows for things like changes in serialization for inter-service communication. Active Topic Kafka Active Topic Active-active Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Ingestion
  • 61. Kafka DynamoDB Redis Amazon RDS Amazon Redshift Amazon Glacier Amazon S3 Data plane Flipping the switch Termination server Termination server Termination server Termination server Content Router Content Router Sensors Sensors Sensors External service ELB load balancer Termination server Termination server Termination server Termination server Content Router Content Router Active topic Active topic Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 Inactive topic Active topic •We deactivate Blue, which forces Termination Servers in Blue to fail health checks and all Blue sensors disconnect •Blue processors switch to read from the “inactive” topic •Once all consumers of the “inactive” topic have caught up to the head of the stream, Blue can be decommissioned
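The "caught up to the head of the stream" condition can be expressed as a comparison of committed consumer offsets against the log-end offset of each partition. The sketch below assumes both are already available as dictionaries; how they are fetched from Kafka is left out, and `is_drained` is a hypothetical name.

```python
def is_drained(committed, log_end):
    """True when every partition's committed offset has reached the log end.

    committed / log_end: {partition: offset} for the ".inactive" topic.
    A partition with no committed offset is treated as offset 0.
    """
    return all(committed.get(p, 0) >= end for p, end in log_end.items())


log_end = {0: 1200, 1: 980}
still_behind = is_drained({0: 1200, 1: 975}, log_end)
caught_up = is_drained({0: 1200, 1: 980}, log_end)
```

Gating decommissioning on this check is what lets blue be shut down without dropping events that were already written to its topics.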
  • 62. Out with the old… Termination server Termination server Termination server Termination server Content Router Content Router Kafka DynamoDB Redis Amazon RDS Amazon Redshift Amazon Glacier Amazon S3 Data plane Active topic Active topic Sensors Sensors Sensors External service ELB load balancer Service A Service A Processor 1 Service A Service A Processor 2 Service A Service A Processor 3 Service A Service A Processor 4 •Green is now the active cluster •If we need to roll back code, we have a snapshot of the repository in Amazon S3 •We haven’t had to roll back code… yet
  • 65. Half-baked AMIs We use a process to create “half-baked” AMIs, which speed up deployments •JVM (for our Scala code base) •Common tools and configurations •Latest updates to make sure patches are up to date •Build plan is run twice daily Green Server Green Server Green Server Green Server Green Server Green server Green Server Green Server Green Server Green Server Green Server Blue server Half-baked-AMI Auto Scaling group 1 AMI Auto Scale Group Amazon S3
  • 67. How code graduates -Development Commit on main Development apt repo Auto deploy changed roles Development cluster
  • 68. How code graduates -Production Create release-X.X.X or hotfix-X.X.X branches Integration apt repo Production apt repo Same exact Binary Integration cluster Integration apt repo Sync specified Packages for integ New production cluster
  • 73. Production is synced from Integ
  • 75. Data plane migrations •Migrations applied to the database are forward only •We have past experience with two-way migrations, but the costs outweigh the benefits •Code must be forward compatible in case rollbacks are necessary •Database schemas are only modified via migrations, even in development and integration environments •We use an in-house migration service (based on Flyway) to parallelize the process
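A forward-only migration runner can be sketched as follows. This is not Flyway's API and not the in-house service from the talk; it uses SQLite only to stay self-contained, and the table names, the `MIGRATIONS` list, and `migrate` are hypothetical.

```python
import sqlite3

# Hypothetical versioned migrations; forward only, so there are no "down" scripts.
MIGRATIONS = [
    (1, "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)"),
    (2, "ALTER TABLE customers ADD COLUMN tier TEXT"),
]


def migrate(conn, migrations):
    """Apply every migration newer than the recorded schema version, in order."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    current = row[0] or 0
    applied = []
    for version, statement in sorted(migrations):
        if version <= current:
            continue  # already applied; never re-run or reverse
        conn.execute(statement)
        conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
        applied.append(version)
    conn.commit()
    return applied


conn = sqlite3.connect(":memory:")
first_run = migrate(conn, MIGRATIONS)
second_run = migrate(conn, MIGRATIONS)
```

With no reverse scripts, a rollback means redeploying older code against the newer schema, which is why the slide insists that code stay forward compatible.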
  • 76. Final Thoughts •Blue-green deployments can be done in many ways •Our requirement of never losing customer data made this the best solution for us •The automation and tooling around our deployment system were built over many months and were a lot of work (built by 2 people. Hi Dennis!) •But it is completely worth it, knowing we have a very reliable, fault-tolerant system