Minimizing customer impact is key to successfully rolling out frequent code updates. Learn how to leverage the AWS cloud to minimize the impact of bugs, test your services in isolation with canary data, and easily roll back changes. Learn to love deployments, not fear them, with a blue/green architecture model. This talk walks you through the reasons it works for us and how we set up our AWS infrastructure, including package repositories, Elastic Load Balancing load balancers, Auto Scaling groups, internal tools, and more to help orchestrate the process. Learn to view thousands of servers as resources at your command, so you can improve your engineering environment, take bigger risks, and not spend weekends firefighting bad deployments.
8. Data ingestion
[Diagram: sensors connect through an external-service Elastic Load Balancing load balancer to termination servers, which feed Content Routers; events flow through Kafka to Processors 1 and 2 (each running several instances of Service A), with a UI and API on top. The data plane: Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3.]
9. High scale, big data
•Fortune 500 companies, think tanks, non-profits
•100K+ events per second
–Expected to hit 500K EPS by end of 2015
•Each enterprise customer can generate 2–4 TB of data per day
•Microservice architecture
•Polyglot environment
13. Solving for the problems
•OMG, all servers need to be patched??
•I’m afraid to restart that service; it’s been running for 2 years
•Large rolling restarts
•Deployment fear
–Friday night deploys
•Blue/green for event processing?
14. Our primary objectives for deployments
•Minimize customer impact
–Customers should have no indication that anything has changed
•Maximize engineers' weekends
–Avoid burnout
•Reduce dependencies of rollouts
–Everything goes out together, 50+ services, 1,000+ VMs
15. Leveraging AWS
•Programmable data centers
•Nodes are ephemeral
•It should be easier to re-create an environment than to fix it
–Think like the cloud
16. What is blue-green?
[Diagram: a router sends traffic to Application v1 (web server + app server) while Application v2 (web server + app server) stands by; both share a database. Cutting over means pointing the router at v2 and severing the routes to v1.]
17. What is blue-green?
•Full cluster BG
–Everything goes out together
–Indiana Jones: “idol switch”
•App-based BG
–Each app or team controls their own blue-green deployments
18. Data plane
The data plane: can't blue-green all the things
[Diagram: the blue and green clusters both sit on top of a single shared data plane: Kafka, DynamoDB, Redis, Amazon RDS (pgsql), Amazon Redshift, Amazon Glacier, Amazon S3.]
19. When do we deploy?
•Teams deploy end of sprint releases together
•Hot fixes and upgrades are performed frequently via rolling restart deployments
•Early on, deployments took an entire day
–Lack of automation
•Deploys today generally take 45 minutes
–Everyone has run a deployment
20. Sustaining engineer
•Every team member including QA has run deployments
•Builds confidence, understanding, and redundancy
•Ensures documentation is up to date and everything that can be automated is
Sustaining engineers earn a badge-of-honor shirt after their tour of duty
21. Deployment day
•Apt repo synchronized and locked down
•Data plane migrations applied
•“Green” cluster is launched (1000s of machines)
•IT tests run
•Canary customers
•Logging and error checks
•Active-active
•“Blue” marked as inactive, decommissioned
22. Keys to success
Pro tip: It’s not just flipping load balancers
23. Keys to success
Automate all the things
•Junior devs should be able to run your deploy system
24. Keys to success
Instrumentation & Metrics
https://github.com/codahale/metrics
https://github.com/rcrowley/go-metrics
25. Keys to success
Use a provisioning system
•Chef
•Puppet
•Salt
•baked AMIs
26. Keys to success
Live integration / regression test suites
[Diagram: a test system sends deterministic input values into the running cluster and verifies the processed state.]
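The "send deterministic input, verify processed state" pattern above can be sketched as follows. This is a minimal illustration, not the deck's actual suite: `process_events` is a hypothetical stand-in for the real pipeline, and the event shapes are invented.

```python
def process_events(events):
    """Toy stand-in for the processing pipeline: count events per sensor."""
    state = {}
    for event in events:
        state[event["sensor"]] = state.get(event["sensor"], 0) + 1
    return state

def run_live_check():
    # Deterministic input: we know exactly what state it must produce.
    fixture = [
        {"sensor": "s1", "payload": "a"},
        {"sensor": "s1", "payload": "b"},
        {"sensor": "s2", "payload": "c"},
    ]
    state = process_events(fixture)
    expected = {"s1": 2, "s2": 1}
    assert state == expected, "processed state diverged: %r" % state
    return state
```

Because the input is deterministic, the same fixture can be replayed against the live blue and green clusters and the resulting state compared byte for byte.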
34. Elevator pitch on Kafka
•Distributed commit log
•Similar to a message queue
•Allows for replaying messages from earlier in the stream in case of failure
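The replay property can be illustrated with a toy in-memory commit log (this is a conceptual sketch of the idea, not Kafka's API; `ToyCommitLog` is invented for illustration): consuming never deletes messages, it only advances an offset, so recovery is just re-reading from an earlier offset.

```python
class ToyCommitLog:
    """Minimal append-only log: messages are never removed on read."""

    def __init__(self):
        self.messages = []

    def append(self, msg):
        self.messages.append(msg)
        return len(self.messages) - 1  # offset of the new message

    def read_from(self, offset):
        # Replaying after a failure is just re-reading from an old offset.
        return self.messages[offset:]

log = ToyCommitLog()
for m in ("event-1", "event-2", "event-3"):
    log.append(m)

assert log.read_from(0) == ["event-1", "event-2", "event-3"]  # full replay
assert log.read_from(2) == ["event-3"]                        # resume mid-stream
```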
35. It all starts with a running cluster
[Diagram: sensors → external-service ELB load balancer → termination servers → Content Routers → "active" Kafka topics → blue Processors 1-4 (each running several instances of Service A); the shared data plane holds Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3.]
•Blue is running; normal operation
•Content Routers are writing to the "active" topics in Kafka
•Blue processors read from the "active" topics
37. Launching new cluster
[Diagram: the green cluster comes up alongside blue, with its own termination servers, Content Routers, and Processors 1-4; blue's Content Routers keep writing the "active" topics while green's processors point at the "inactive" topics.]
•Green cluster is launched
•Termination servers are kept out of the ELB load balancer by failing health checks
•Content Routers write to the "active" topics
•Processors in green read from the "inactive" topics
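The health-check trick above might look like the following sketch: the new cluster's termination servers answer the ELB health check with a failure code until the cluster is activated, so the ELB never routes sensor traffic to them. The function name and response shape are hypothetical.

```python
def health_response(cluster_active):
    """Return (status_code, body) for the ELB health-check endpoint.

    While the cluster is inactive we deliberately answer 503 so the
    ELB marks the instance unhealthy and keeps it out of rotation;
    flipping the flag lets the instance pass checks and take traffic.
    """
    if cluster_active:
        return 200, "OK"
    return 503, "inactive"

assert health_response(False) == (503, "inactive")
assert health_response(True) == (200, "OK")
```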
39. Getting the size right
•Sizing of our autoscale groups is determined programmatically
–Admin page allows for setting min / max
–Script determines appropriate desired-capacity based on running cluster
•Launching is then as simple as updating the autoscale groups to the new sizes
from subprocess import Popen, PIPE

def current_counts(region='us-east-1'):
    proc = Popen(
        "as-describe-auto-scaling-groups "
        "--region {} "
        "--max-records=600".format(region),
        shell=True, stdout=PIPE, stderr=PIPE)
    out, err = proc.communicate()
    if err:
        raise Exception(err)
    counts = {}
    for line in out.decode().splitlines():
        if "AUTO-SCALING-GROUP" not in line:
            continue
        parts = line.split()
        group = parts[1]
        current = parts[-2]
        counts[group] = int(current)
    return counts
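With the running counts in hand, deriving the new cluster's desired capacities is a simple clamp against the admin-configured min/max. A minimal sketch of that sizing step (the function names and the `limits` shape are assumptions, not the actual tooling):

```python
def desired_capacity(current, minimum, maximum):
    """Match the running (blue) group's size, clamped to the
    admin-configured min/max for the new (green) group."""
    return max(minimum, min(current, maximum))

def green_sizes(blue_counts, limits):
    # blue_counts: {group: running count}; limits: {group: (min, max)}
    return {group: desired_capacity(count, *limits[group])
            for group, count in blue_counts.items()}

sizes = green_sizes({"processor-1": 40, "processor-2": 3},
                    {"processor-1": (2, 30), "processor-2": (2, 30)})
assert sizes == {"processor-1": 30, "processor-2": 3}
```

The resulting numbers are then fed to the Auto Scaling groups as their desired-capacity values.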
42. User data and Chef get things rolling
•Inside out Chef bootstrapping
–Didn’t feel comfortable running `wget … | bash`
•Custom version of Chef installer
–Version of Chef
–Where to find the Chef servers
–Which role to run
–Which environment (dev, integ, blue, green)
43. Testing the new stuff
[Diagram: green's termination servers, Content Routers, and Processors 1-4 run beside blue's; the integration tests connect to a green termination server directly, and green's processors read from the "inactive" topic.]
•Test customer(s) are canaried
•Integration test suite is run by connecting to a termination server directly
•Tests pass; then we canary real customers
45. Canary customers
•Canary information is stored in ZooKeeper
•Fortunately we dogfood our own tech
•This affords us the ability to use ourselves as canaries for new code
•The inactive processing cluster is set to read from the .inactive topics
•These are the standard Kafka topics with .inactive appended
•The ingestion layer has a watcher on that znode and routes any canaried customer to the .inactive topics
•Ex. regular traffic goes to foo.bar, canary traffic goes to foo.bar.inactive
•When we are ready to test real traffic we mark several customers as canaries and start the monitoring process to determine any issues
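The routing rule described above reduces to a small lookup at the ingestion layer. A minimal sketch, assuming the canary set has already been read from ZooKeeper (the function name and customer IDs are hypothetical):

```python
def route_topic(base_topic, customer_id, canaried):
    """Send a canaried customer's events to the '.inactive' variant
    of the topic; everyone else stays on the standard topic."""
    if customer_id in canaried:
        return base_topic + ".inactive"
    return base_topic

canaried = {"customer-456"}
assert route_topic("foo.bar", "customer-123", canaried) == "foo.bar"
assert route_topic("foo.bar", "customer-456", canaried) == "foo.bar.inactive"
```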
46. Canary customers
[Diagram: sensors → external-service ELB load balancer → event ingestor → Kafka. Regular traffic (e.g., customer 123) goes to the "active" topics and is read by the blue processors; canary traffic (e.g., customer 456) goes to the "inactive" topics and is read by the green processors.]
50. IT tests run
•Integration tests are run
–~3000 tests in total
–Test customer must be “canaried”
•If any tests fail, we triage and determine if it is still possible to move forward
•Testing is only considered done when we are passing 100%—no exceptions!
53. Trust, but verify!
[Diagram: both clusters run against the shared data plane (Kafka, DynamoDB, Redis, Amazon RDS, Amazon Redshift, Amazon Glacier, Amazon S3); blue's Processors 1-4 read the "active" topics while green's read the "inactive" topic.]
•Monitor green services
•Verify health of the cluster by inspecting graphical data and log outputs
•Rerun tests with load
55. Logging and error checking
•Every server forwards its relevant logs to Splunk
•Several dashboards have been set up with common things to watch for
•Raw logs are streamed in near real-time and we watch specifically for log-level ERROR
•This is one of our most important steps, as it gives us the most insight into the health of the system as a whole
57. Moving customers over
[Diagram: termination servers and Content Routers from both blue and green now sit behind the external-service ELB load balancer, and Processors 1-4 in both clusters consume the "active" topics.]
•Flip all customers back away from canary
•Activate green cluster
•Event processors and consuming services in blue and green now write to and consume the "active" topics
•We are in a state of active-active for a few minutes
58. Each node in the data processing layer has a watcher on a particular znode which tells the environment whether it is active (use standard Kafka topics) or inactive (append .inactive to the topics)
[Diagram: ingestion feeds Kafka; in the active-active state, Processors 1-4 in both clusters consume the "active" topic while the "inactive" topic drains.]
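The znode-driven switch can be sketched as a small state holder per node. In production a ZooKeeper client watch (e.g. kazoo's DataWatch) would invoke `on_znode_change` whenever the cluster-state znode is updated; that wiring, the class name, and the znode payload format are all assumptions made for illustration.

```python
class TopicSelector:
    """Holds a node's active/inactive state and derives topic names.

    A ZooKeeper watch would call on_znode_change each time the
    cluster-state znode changes value.
    """

    def __init__(self):
        self.state = "inactive"

    def on_znode_change(self, data):
        # In this sketch the znode payload is simply b"active" or b"inactive".
        self.state = data.decode()

    def topic(self, base):
        if self.state == "active":
            return base            # standard Kafka topic
        return base + ".inactive"  # shadow topic for the idle cluster

sel = TopicSelector()
assert sel.topic("foo.bar") == "foo.bar.inactive"
sel.on_znode_change(b"active")   # green, switch to active!
assert sel.topic("foo.bar") == "foo.bar"
```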
59. When we are ready to make the switch, we start by making the new cluster active and enter into an active-active state where both processing clusters are doing work.
This is where it is paramount that code is forward compatible, since two different code bases will be doing work simultaneously.
[Diagram: ingestion feeds Kafka; green is told "switch to active!" and Processors 1-4 in both clusters now consume the "active" topics, leaving the "inactive" topics idle.]
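What forward compatibility can mean in practice: old code tolerates fields it does not know, and new code defaults fields that are not there yet, so both code bases can consume the same stream during active-active. A hedged sketch; the event and field names are invented, not the deck's actual schema.

```python
def parse_event_v1(event):
    """Old (blue) code: read only the fields it knows and ignore the
    rest, so events produced by the new code base still parse."""
    return {"sensor": event["sensor"], "payload": event["payload"]}

def parse_event_v2(event):
    """New (green) code: default the new field when consuming events
    produced by the old code base."""
    return {"sensor": event["sensor"],
            "payload": event["payload"],
            "severity": event.get("severity", "info")}

old_event = {"sensor": "s1", "payload": "a"}
new_event = {"sensor": "s1", "payload": "a", "severity": "high"}
assert parse_event_v1(new_event) == old_event           # v1 ignores the new field
assert parse_event_v2(old_event)["severity"] == "info"  # v2 defaults it
```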
60. However, blue and green are fully partitioned and there is no intercommunication between the clusters. This allows for things like changes in serialization for inter-service communication.
[Diagram: in active-active, each cluster's Processors 1-4 consume their own "active" topics; the clusters never talk to each other.]
61. Flipping the switch
[Diagram: sensors now reach only green's termination servers and Content Routers through the external-service ELB load balancer; green's Processors 1-4 consume the "active" topics while blue's drain the "inactive" topic.]
•We deactivate Blue, which forces termination servers in Blue to fail health checks, and all Blue sensors disconnect
•Blue processors switch to read from the "inactive" topic
•Once all consumers of the "inactive" topic have caught up to the head of the stream, Blue can be decommissioned
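The "caught up to the head" condition boils down to comparing each partition's consumer offset against the topic's end offset. A minimal sketch of that check with plain dicts (against a real Kafka deployment you would compare committed offsets with the broker's end offsets; the function name and numbers here are illustrative):

```python
def caught_up(committed, end_offsets):
    """Blue may be decommissioned only when every partition's consumer
    offset has reached the head (end offset) of the 'inactive' topic."""
    return all(committed.get(partition, 0) >= end
               for partition, end in end_offsets.items())

end = {0: 1500, 1: 1498}
assert not caught_up({0: 1500, 1: 1200}, end)  # partition 1 still draining
assert caught_up({0: 1500, 1: 1498}, end)      # safe to tear blue down
```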
62. Out with the old…
[Diagram: only the green cluster remains: sensors → external-service ELB load balancer → termination servers → Content Routers → "active" topics → Processors 1-4, backed by the shared data plane.]
•Green is now the active cluster
•If we need to roll back code, we have a snapshot of the repository in Amazon S3
•We haven't had to roll back code… yet
65. Half-baked AMIs
We use a process to create "half-baked" AMIs, which speeds up deployments
•JVM (for our Scala code base)
•Common tools and configurations
•Latest updates to make sure patches are up to date
•Build plan is run twice daily
[Diagram: a blue server is baked into a half-baked AMI, stored via Amazon S3; the Auto Scaling group then launches the green servers from that AMI.]
67. How code graduates - Development
Commit on main → development apt repo → auto-deploy of changed roles → development cluster
68. How code graduates - Production
Create release-X.X.X or hotfix-X.X.X branches → integration apt repo → integration cluster
Sync specified packages for integ: integration apt repo → production apt repo → new production cluster
The same exact binary flows through every stage.
75. Data plane migrations
•Migrations applied to the database are forward only
•We have past experience with two-way migrations, but the costs outweigh the benefits
•Code must be forward compatible in case rollbacks are necessary
•Database schemas are only modified via migrations, even in development and integration environments
•We use an in-house migration service (based on Flyway) to parallelize the process
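A forward-only runner in the spirit of Flyway can be sketched in a few lines: versioned migrations applied in order, each recorded so it never runs twice, with only additive changes so rolled-back code keeps working. This is an illustrative sketch using SQLite, not the in-house service; the table and column names are invented.

```python
import sqlite3

# Forward-only: migrations are ordered by version and only ever applied,
# never reverted; additive changes keep old code working after a rollback.
MIGRATIONS = [
    (1, "CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)"),
    (2, "ALTER TABLE events ADD COLUMN severity TEXT"),  # additive, nullable
]

def migrate(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    current = row[0] or 0
    for version, sql in MIGRATIONS:
        if version > current:  # skip migrations that have already run
            conn.execute(sql)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
    return current  # schema version before this run

conn = sqlite3.connect(":memory:")
migrate(conn)   # applies both migrations
migrate(conn)   # re-running is a no-op
cols = [c[1] for c in conn.execute("PRAGMA table_info(events)")]
assert cols == ["id", "payload", "severity"]
```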
76. Final Thoughts
•Blue-green deployments can be done in many ways
•Our requirement of never losing customer data made this the best solution for us
•The automation and tooling around our deployment system were built over many months and were a lot of work (built by 2 people – Hi Dennis!)
•But it is completely worth it, knowing we have a very reliable, fault-tolerant system