Like what you've read? We're frequently hiring for a variety of engineering roles at Etsy. If you're interested, drop me a line or send me your resume: mike@etsy.com.
http://www.etsy.com/careers
1. Advanced Topics in
Continuous Deployment
Mike Brittain
Engineering Director, Etsy
@mikebrittain mikebrittain.com/talks
5. - Config flags - this one goes to eleven.
- Automated deploys - never settle for anything less
- Release management - who needs it?!?
- Deploying schema changes - ‘cause everybody asks
Today’s Topics
credit: photobookgirl (flickr)
27. Small incremental changes to the application
New classes, methods, controllers
Graphics, stylesheets, templates
Copy/content changes
App deploys
Turning flags on, off, or % ramp up
Config deploys
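The on, off, or % ramp-up behavior of a config flag can be sketched roughly as follows. This is a minimal illustration, not Etsy's implementation: the function name `flag_enabled`, the `'5%'` string format, and the md5-based bucketing are all assumptions.

```php
<?php
// Hypothetical helper: decide whether a flag is active for a given user.
// Assumed value format: 'on', 'off', or a percentage such as '5%'.
function flag_enabled(array $cfg, string $flag, int $userId): bool
{
    $value = $cfg[$flag] ?? 'off';   // unknown flags default to off
    if ($value === 'on') {
        return true;
    }
    if ($value === 'off') {
        return false;
    }
    if (preg_match('/^(\d{1,3})%$/', $value, $m) === 1) {
        // Hash flag name + user id so each user lands in a stable bucket
        // (0-99) per flag. Raising the percentage only ever adds users.
        $bucket = hexdec(substr(md5($flag . ':' . $userId), 0, 6)) % 100;
        return $bucket < (int) $m[1];
    }
    return false;                    // malformed value: fail closed
}

$cfg = ['new_search' => '5%'];
$path = flag_enabled($cfg, 'new_search', 42) ? 'new search' : 'old search';
```

Because the bucket depends only on the flag name and user id, a user who sees the new code path at 1% keeps seeing it at 5% and beyond, which makes a ramp-up easier to reason about.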
30. Latent bugs and security holes
Traffic management, load shedding
Adding and removing infrastructure
Tweaking config flags or releasing patches.
“Operating” the site
41. Interpreted language, text files.
Opcode cache (Opcache or APC)
~100 servers (web, gearman, api)
Rsync (push, not pull)
Avoid restarts
PHP & Apache
42. Lots of remote orchestration (ssh and dsh)
Push code from a git clone to production network.
Splay to a few boxes, each splays to more.
Stage files in a temp location on prod boxes.
Local rsync (using dsh) into live docroot.
Keeping things fast
43. 100+ files opened per request.
Flushing opcode cache (or graceful restart).
Mostly harmless.
What can go wrong with this?
45. Two document roots (“yin” and “yang”)
Symbolic link to the right one
Opcache has to use path name, not inode
Atomic deploys
http://codeascraft.com/2013/07/01/atomic-deploys-at-etsy/
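The two-docroot idea can be sketched as a symlink flip. This is only an illustration under stated assumptions: the `current` link name, the `flip_docroot` function, and the directory layout are hypothetical, and the opcode-cache caveat from the slide is not handled here.

```php
<?php
// Sketch: repoint a "current" symlink between two docroots atomically.
// Directory names and the function name are illustrative assumptions.
function flip_docroot(string $base): string
{
    $link = $base . '/current';
    // Whichever root is live now, the other one receives the new code.
    $live = is_link($link) ? basename(readlink($link)) : 'yang';
    $next = ($live === 'yin') ? 'yang' : 'yin';

    // ... new code would be synced into "$base/$next" here ...

    // Create a temporary symlink, then rename() it over the old one.
    // On POSIX filesystems rename() replaces the target atomically, so
    // a request sees either the old docroot or the new one.
    $tmp = $link . '.tmp';
    if (is_link($tmp)) {
        unlink($tmp);
    }
    symlink($base . '/' . $next, $tmp);
    rename($tmp, $link);
    return $next;
}
```

Rolling back in this scheme is just another flip, since the previous release is still sitting in the other docroot.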
46. Binaries, not text files
Requires restarts
Requires search index and cache warming
Rsync (push, not pull)
Solr and JVM
47. Take boxes out of rotation, deploy, bring back up
Beware capacity management
Multiple versions running for extended period
Rollbacks are a pain (esp. when in mixed-state)
Rolling restarts
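The capacity concern in a rolling restart can be made concrete with a small batching sketch. The function name, host names, and the "maximum fraction out of rotation" parameter are assumptions for illustration.

```php
<?php
// Sketch: split hosts into batches so that only a bounded share of the
// cluster is out of rotation at any time. Names are hypothetical.
function rolling_batches(array $hosts, float $maxOutFraction): array
{
    // At least one host per batch, otherwise the deploy never progresses.
    $size = max(1, (int) floor(count($hosts) * $maxOutFraction));
    return array_chunk($hosts, $size);
}

$hosts = ['search01', 'search02', 'search03', 'search04', 'search05'];
foreach (rolling_batches($hosts, 0.4) as $batch) {
    // For each batch: remove from rotation, deploy, restart, warm, re-add.
    echo 'deploying to ' . implode(', ', $batch) . PHP_EOL;
}
```

Smaller batches protect capacity but lengthen the window in which multiple versions run side by side, which is the trade-off the slide warns about.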
49. One live cluster, one dark cluster
Deploy to dark cluster (indexes, pre-warm, restarts)
Define search clusters in app config
Switch cluster traffic via config deploy
“Flip” and “Flop”
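Defining both search clusters in app config and switching traffic with a config deploy might look something like this. The config keys, host names, and helper functions are hypothetical, not Etsy's actual config layout.

```php
<?php
// Sketch: both search clusters are defined in app config; one key selects
// which cluster takes live traffic. Host names and keys are hypothetical.
$cfg['search'] = [
    'clusters' => [
        'flip' => ['host' => 'search-flip.example.internal', 'port' => 8983],
        'flop' => ['host' => 'search-flop.example.internal', 'port' => 8983],
    ],
    'live' => 'flip',   // changing this value in a config deploy moves traffic
];

function live_search_cluster(array $cfg): array
{
    $name = $cfg['search']['live'];
    return $cfg['search']['clusters'][$name];
}

function dark_search_cluster_name(array $cfg): string
{
    // The dark cluster is the one that receives the next deploy.
    return $cfg['search']['live'] === 'flip' ? 'flop' : 'flip';
}
```

With this shape, the slow work (index builds, warming, restarts) happens on the dark cluster, and the only production-facing change is a one-line config edit.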
50. Start with a shell script.
Yours will be a unique snowflake.
89. Entire app deploy took 15 minutes.
4 people running the deployment
8 committers
Config deploy and Chef change deployed in parallel.
90. Optimal queue size
Normalized communication
Improved visibility
Historical record is ideal for post-mortems
Organic evolution
91. Hold up the queue (.hold)
Work the issue with the people available in #push
Additional help always available in #sysops
Buddy-system for off-hours deploys
Ops-on-call, dev-on-call
When something goes wrong?
98. Etsy.com, Support & Back-office tools,
Developer API, Gearman (async work)
PHP, Apache, Memcache
Our web application is largely monolithic.
100. e.g. Databases, Search, Photo storage, Payments
Databases: MySQL (schema changes)
Search: Solr, JVM (rolling restarts)
Photo storage: proxy cache, filers, Amazon S3 (specialized infra.)
Payments: PCI (controlled access)
External “services” are not deployed with
the main application.
102. For every config flag, there are two states
we can support: present and future
... or past and present.
103. $cfg['new_search'] = array('enabled' => 'off');

// Meanwhile...

if ($cfg['new_search']['enabled'] === 'on') {
    // New and fancy search
    $results = do_solr();
} else {
    // Old and boring search
    $results = do_grep();
}
108. 0. Add new version to schema
1. Write to both versions
2. Backfill historical data
3. Read from new version
4. Cut-off writes to old version
109. 0. Add new version to schema
Schema change to add prefs columns to “users” table.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “off”
“read_prefs_from_users_table” => “off”
110. 1. Write to both versions
Write code for writing prefs to the “users” table.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “off”
111. 2. Backfill historical data
Offline process to sync existing data from “user_prefs”
to new columns in “users”
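A backfill of this kind, together with the consistency check that the later read ramp-up depends on, could be sketched as below. Arrays keyed by user id stand in for the “user_prefs” and “users” tables, and both function names are hypothetical.

```php
<?php
// Sketch of an offline backfill: copy rows that exist in the old table but
// are missing from the new one. Arrays keyed by user id stand in for tables.
function backfill_prefs(array $userPrefs, array $users): array
{
    foreach ($userPrefs as $userId => $prefs) {
        // Skip rows the dual-write path has already written, so fresher
        // data is never overwritten with an older copy.
        if (!array_key_exists($userId, $users)) {
            $users[$userId] = $prefs;
        }
    }
    return $users;
}

// Returns the user ids whose data differs between the two tables.
function find_mismatches(array $userPrefs, array $users): array
{
    $bad = [];
    foreach ($userPrefs as $userId => $prefs) {
        if (!array_key_exists($userId, $users) || $users[$userId] != $prefs) {
            $bad[] = $userId;
        }
    }
    return $bad;
}
```

Running the mismatch check after the backfill, and again periodically while both tables are being written, gives a simple signal for whether it is safe to start reading from the new location.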
112. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “staff”
113. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “1%”
114. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “5%”
115. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “11%”
“This one goes to eleven.”
116. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “on” // same as 100%
117. 4. Cut-off writes to old version
After running on the new table for a significant amount
of time, we can cut off writes to the old table.
“write_prefs_to_user_prefs_table” => “off”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “on”
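The three flags used throughout this migration can be tied together in a single model class. The sketch below is an assumption-laden illustration: the class name `UserPrefsModel` and its methods are hypothetical, and two in-memory arrays stand in for the “user_prefs” and “users” tables.

```php
<?php
// Sketch of a model that hides the migration behind config flags.
// Two in-memory arrays stand in for the "user_prefs" and "users" tables;
// the class and method names are hypothetical.
class UserPrefsModel
{
    public $userPrefsTable = [];  // old location
    public $usersTable = [];      // new location
    private $cfg;

    public function __construct(array $cfg)
    {
        $this->cfg = $cfg;
    }

    public function save(int $userId, array $prefs): void
    {
        if ($this->cfg['write_prefs_to_user_prefs_table'] === 'on') {
            $this->userPrefsTable[$userId] = $prefs;
        }
        if ($this->cfg['write_prefs_to_users_table'] === 'on') {
            $this->usersTable[$userId] = $prefs;
        }
    }

    public function load(int $userId): ?array
    {
        // Only 'on' and 'off' are handled here; 'staff' and percentage
        // values would need a per-user check in place of this comparison.
        $table = $this->cfg['read_prefs_from_users_table'] === 'on'
            ? $this->usersTable
            : $this->userPrefsTable;
        return $table[$userId] ?? null;
    }
}
```

Callers never see which table is in use, so each step of the migration is a config deploy rather than a code change, and any step can be reversed by flipping the flag back.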
118. “Branch by Abstraction”
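[Diagram: in the old schema, the Controller talks to the “users” (old) table directly. In the new schema, the Controller goes through the Users Model (the abstraction), which fronts the “user_prefs” and “users” tables.]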
http://paulhammant.com/blog/branch_by_abstraction.html
http://continuousdelivery.com/2011/05/make-large-scale-changes-incrementally-with-branch-by-abstraction/
119. Avoid temptation of putting logic into DB
Async worker queue (Gearman)
Get good at alerting on data inconsistencies
Easier to scale out app servers than DBs
Shards limit complexity
About our database design…
120. No longer valid for the business
No longer stable, valid, or trusted code
Impacting performance or readability
We can afford to spend time
Clean up old config flags?
122. Start small. (We did.)
Automated tests and production monitoring.
Have a story around maintaining quality.
“We can always go back to the old way.”
Demonstrate value to leadership.