Blue/Green deployments have been an important, if rarely implemented, technique in the Continuous Delivery playbook for years. Their aim is simple: provision, deploy, test, and optionally roll back your application before it's served to the public. Betterment's deployment architecture, dubbed 'Cyan' (a mixture of Blue and Green), takes a similar but more straightforward approach that still accomplishes the goals Blue/Green practitioners seek. Betterment uses Ansible to provision new instances, push the latest artifacts to them, and ensure that they're healthy before marking them ready for production. All this ensures a fast, stable, zero-downtime rollout with minimal human interaction. We'll discuss Betterment's philosophical approach to shipping new code and then dive into the nitty-gritty Ansible that powers the whole thing.
17. Wait. Two Databases?
“There's still the issue of dealing with missed transactions
while the green environment was live, but depending on
your design you may be able to...
● feed transactions to both environments in such a way
as to keep the blue environment as a backup when the
green is live. Or you may be able to...
● put the application in read-only mode before cut-over,
run it for a while in read-only mode, and then switch it to
read-write mode.”
http://martinfowler.com/bliki/BlueGreenDeployment.html
22. Jenkins’ Job
1. Build
2. Test
3. Package
4. Publish
5. Run Migrations
6. Invoke Ansible
7. Cull Zombies
23. Ansible’s Job
1. Check for S3 deliverables
2. Spin up new EC2 Instance(s)
3. Apply role(s) to instance(s)
4. Find instance(s) in ELB
5. Add new instance(s) to ELB & tag
o status: in-use
6. Remove old instance(s) from ELB & tag
o status: zombie
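Steps 4 to 6 above (find, add and tag, remove and tag) can be sketched in plain Python. This is only an illustration of the ordering, not Betterment's playbook: the `Instance` and `LoadBalancer` classes are hypothetical in-memory stand-ins for EC2 instances and an ELB, and the instance ids are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """Hypothetical stand-in for an EC2 instance: an id plus its tags."""
    instance_id: str
    tags: dict = field(default_factory=dict)


@dataclass
class LoadBalancer:
    """Hypothetical stand-in for an ELB: the set of registered instance ids."""
    registered: set = field(default_factory=set)


def cutover(elb, fleet, new_ids):
    """Swap new instances into the load balancer and retire the old ones.

    New instances are registered before old ones are removed, so the
    load balancer never has an empty pool during the swap.
    """
    # Step 4: find the instances currently serving traffic.
    old_ids = set(elb.registered) - set(new_ids)

    # Step 5: add the new instances and tag them status: in-use.
    for iid in new_ids:
        elb.registered.add(iid)
        fleet[iid].tags["status"] = "in-use"

    # Step 6: remove the old instances and tag them status: zombie.
    for iid in old_ids:
        elb.registered.discard(iid)
        fleet[iid].tags["status"] = "zombie"

    return sorted(old_ids)


# Example run with made-up instance ids.
fleet = {i: Instance(i) for i in ("i-old1", "i-old2", "i-new1", "i-new2")}
elb = LoadBalancer(registered={"i-old1", "i-old2"})
retired = cutover(elb, fleet, ["i-new1", "i-new2"])
```

After the run, `elb.registered` holds only the new ids, and the old instances carry the zombie tag so a later cleanup pass can find them.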
55. ● Predictable
● Repeatable
● Minimal Human Interaction
● Zero User Interruption
● Contained Failure
Dream Delivery Achieved
56. The Future
● Long Running Instances + Docker
o Huge speed improvement
● Post Monolith, Abandon Jenkins?
o Travis CI for Build/Test
o Tower for Deployment Orchestration
● Ansible Galaxy?
58. careers@betterment.com
All code snippets & diagrams contained in this presentation are property of Betterment, but please learn from them.
All photographs / GIFs used in this presentation are someone else’s.
Street Fighter, Back To The Future, Indiana Jones, Futurama, and Arrested Development are someone else’s property too.
Editor's Notes
Background on who we are:
Betterment is an online investing service, helping people better manage and grow their wealth through smarter technology
disrupt the investing financial industry
Largest and fastest growing automated investing service
More than 90,000 customers and quickly growing
Invest in a diversified portfolio
Automate everything for you - from rebalancing and dividend reinvestment to automatic deposits
Tax efficient
very, very confusing at times and
have the potential to waste a lot of your time before you figure out the right approach
the dream of devops is to create an environment where the path of least resistance also yields the most efficient, sane result.
betterment is totally onboard with that mission from the ground up.
it’s gonna take all night
it’s friday night. i have no date, a 2-liter bottle of shasta, and my all-rush mixtape. let’s rock.
Deploy Once a Month
At 2AM
Unpredictable pre-prod
Rented Iron
By definition, long-running
LoadBalancer with “status.txt” file
Manual Package Installation
“Python2.4 must be default”
AWS
VPC
AZ
Subnets
EC2
ELB
Multi-AZ
RDS
predictable deployments
DNS Update
Unpredictable TTLs
Never for Consumer Traffic
Okay for Internal Traffic
Elastic Load Balancer
Re-use “pre-warmed” long-running ELBs
Multiple AZs
Health Checks
Connection Draining
Sticky Sessions
Goes without saying you need HTTPS
long-running machine columns are still a pain to provision.
give me machines on tap.
Ansible means i can stand up stacks of
Stateless Sessions / Servers
Stateful Authentication Cookies
HTTPS Everywhere
Column Health Awareness
Health Check Routes
Two Full-stack “Columns”
Rent 2x N-machine columns
Elastic Cloud + Ansible
2x Web / App / Database Nodes
New EC2 Cluster hits production DB
Migration Constraints
Old Code always works on New Schema
Win: Simplicity
Lose: “Instant” Standby Stack
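The constraint "Old Code always works on New Schema" can be illustrated with a small, hedged example. It uses Python's built-in `sqlite3` as a stand-in for the production database, and the `users` table, its columns, and the helper names are hypothetical. The point is only that an additive migration leaves the previous release's queries working while old instances are still serving.

```python
import sqlite3


def old_code_create_user(conn, email):
    """Previous release: knows only the email column and names it explicitly."""
    conn.execute("INSERT INTO users (email) VALUES (?)", (email,))


def old_code_list_emails(conn):
    """Previous release: reads only the columns it already knew about."""
    return [row[0] for row in conn.execute("SELECT email FROM users ORDER BY id")]


def migrate_to_next_schema(conn):
    """Next release's migration: additive only (a new nullable column)."""
    conn.execute("ALTER TABLE users ADD COLUMN nickname TEXT")


# Hypothetical v1 schema, with one row written by the old code.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")
old_code_create_user(conn, "a@example.com")

# The migration runs while old instances are still serving traffic...
migrate_to_next_schema(conn)

# ...so the old code must keep working against the new schema.
old_code_create_user(conn, "b@example.com")
emails = old_code_list_emails(conn)
```

A destructive change, such as dropping or renaming `email`, would break the old code in the window before the new instances take over, which is why it would have to be split across releases under this constraint.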
Don’t optimize for rollbacks
Fast Rollforward
Speedy Fix&Ship
Fast Failure
curl healthcheck route before publicizing
fail tolerance percent: 0%
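A minimal sketch of the health gate described above, assuming each new instance exposes a health check route over HTTP. The `/health` path, the plain `http://` scheme, and the function names are assumptions for illustration; the source only says the route is curled before the instance is publicized and that the fail tolerance is 0%.

```python
import http.client
import urllib.error
import urllib.request


def http_probe(url, timeout=5.0):
    """Return True if a GET on url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, non-2xx status, or a garbled response.
        return False


def healthy_enough(hosts, probe=http_probe, path="/health",
                   fail_tolerance_percent=0.0):
    """Probe every host; pass only if the failure rate is within tolerance.

    With the default tolerance of 0%, a single unhealthy host fails the
    whole batch, so nothing is added to the load balancer.
    """
    if not hosts:
        return False  # nothing to promote is treated as a failed deploy
    failures = [h for h in hosts if not probe(f"http://{h}{path}")]
    failure_percent = 100.0 * len(failures) / len(hosts)
    return failure_percent <= fail_tolerance_percent
```

The `probe` parameter is injectable so the gate can be exercised without a network, for example `healthy_enough(["a", "b"], probe=lambda url: True)`.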
Emergency Rollback
Ship previous git-hash.
ansible parallelization
super ugly, but awesome.
ensure that hash matches master
scalability!
Monolith ⇒ SOA
2 ⇒ ~10 applications
One repo to rule them all
Version Ansible w/ Apps
.tgz,.war deliverables
./exec
./roles/{deploy,provision}
./playbooks/shared
TODO:
Y’all ready for this?
CYAN!
Sometimes, things do go wrong.
database still at v1, old code continues to run….
If the deployment fails halfway through, zombie killer will take care of the “ready” or “pending” instances that are not in ELBs.
If the deployment succeeds, instances tagged “zombie” will also be reaped.
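The two reaping rules above can be written as one selection function. This is a sketch under assumptions: instances are modeled as a plain mapping from a made-up instance id to its status tag, and ELB membership as a set of ids. How the real zombie killer queries AWS is not shown in the source.

```python
def instances_to_reap(instances, elb_members):
    """Return the ids the zombie killer should terminate, sorted.

    instances:   mapping of instance id -> value of its status tag
    elb_members: set of instance ids currently registered in any ELB
    """
    reap = []
    for iid, status in instances.items():
        if status == "zombie":
            # Retired by a successful deployment.
            reap.append(iid)
        elif status in ("ready", "pending") and iid not in elb_members:
            # Stranded by a deployment that failed halfway through.
            reap.append(iid)
    return sorted(reap)


# Example with made-up ids: i-5 is "ready" but already in an ELB, so it stays.
instances = {
    "i-1": "in-use",
    "i-2": "zombie",
    "i-3": "pending",
    "i-4": "ready",
    "i-5": "ready",
}
elb_members = {"i-1", "i-5"}
doomed = instances_to_reap(instances, elb_members)
```

Keeping the selection separate from the termination call makes the rule easy to check on its own before anything is actually destroyed.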