Presented at ALM Summit 3 in Redmond, WA. January 2013.
Like what you've read? We're frequently hiring for a variety of engineering roles at Etsy. If you're interested, drop me a line or send me your resume: mike@etsy.com.
http://www.etsy.com/careers
8. DECEMBER 2012
1.5 Billion page views
$117 Million of goods sold
6 Million items sold
Items by anjaysdesigns, betwixxt, OneStarLeatherGoods, mediumcontrol, TheDesignPallet http://www.etsy.com/blog/news/2013/etsy-statistics-december-2012-weather-report/
11. Continuous delivery is a pattern language in growing use
in software development to improve the process of
software delivery.
Techniques such as automated testing, continuous integration,
and continuous deployment allow software to be developed to a high
standard and easily packaged and deployed to test environments,
resulting in the ability to rapidly, reliably and repeatedly push out
enhancements and bug fixes to customers at low risk and with
minimal manual overhead.
~wikipedia
credit: Stewart, redgen (flickr)
14. Then: 2009 (just before we started using CD)
Now: 2010-today
15. Then: 6-14 hours; “Deployment Army”; special event and highly orchestrated
Now: 15 mins; 1 person; part of everyday workflow
16. Then: Blocked for 6-14 hours. 6+ hours to redeploy.
Now: Blocked for 15 minutes. 15 minutes to redeploy.
17. Then: Release branch, database schemas, data transforms, packaging, rolling restarts, cache purging, scheduled downtime
Now: Mainline, minimal linking and building, rsync, site up
33. What’s in a deploy?
Small incremental changes to the application
New classes, methods, controllers
Graphics, stylesheets, templates
Copy/content changes
Turning flags on, off, or % ramp up
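A minimal sketch of how a flag with "on", "off", or percentage ramp-up states might be evaluated. The slides do not show the real mechanism; the function name `flag_enabled`, the CRC32 bucketing, and the use of Python (the stack named in the talk is PHP) are all assumptions made for illustration. The "staff" state is taken from a later slide.

```python
import zlib


def flag_enabled(value: str, user_id: int, is_staff: bool = False) -> bool:
    """Evaluate a flag value of "on", "off", "staff", or "N%" (hypothetical helper)."""
    if value == "on":
        return True
    if value == "off":
        return False
    if value == "staff":
        return is_staff
    if value.endswith("%"):
        percent = float(value[:-1])
        # Deterministic bucket in 0..99, so a given user stays in (or out)
        # across requests, and raising the percentage only ever adds users.
        bucket = zlib.crc32(str(user_id).encode("utf-8")) % 100
        return bucket < percent
    raise ValueError(f"unknown flag value: {value!r}")
```

Because the bucket depends only on the user id, ramping from "1%" to "5%" keeps every user who already had the feature and adds new ones.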
34. Low MTTR (response times)
Latent bugs and security holes
Traffic management, load shedding
Adding and removing infrastructure
Tweaking config flags or releasing patches.
44. Our web application is largely monolithic.
Etsy.com, Support & Back-office tools,
Developer API, Gearman (async work)
PHP, Apache, Memcache
46. External “services” are not deployed with
the main application.
e.g. Databases, Search, Photo storage, Payments
Databases: MySQL (schema changes)
Search: Solr, JVM (rolling restarts)
Photo storage: proxy cache, filers, Amazon S3 (specialized infra.)
Payments: PCI (controlled access)
48. For every config flag, there are two states
we can support — present and future.
... or past and present.
51. RULE OF THUMB:
Prefer ADDs over ALTERs (non-breaking expansion)
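One way to read this rule: add new columns or tables rather than changing or dropping existing ones, so code written against the old schema keeps working. The sketch below uses Python's built-in sqlite3 as a stand-in for MySQL; the column name `pref_language` is hypothetical. Note that in SQL the addition itself is spelled `ALTER TABLE ... ADD COLUMN`; the point is that nothing existing is modified.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'alice')")


def old_code_read(conn):
    # A query written before the schema change; it names its columns explicitly.
    return conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()


before = old_code_read(conn)

# Non-breaking expansion: add a nullable column, leave existing columns alone.
conn.execute("ALTER TABLE users ADD COLUMN pref_language TEXT")

# The old query returns the same rows after the expansion.
after = old_code_read(conn)
```

Old inserts that name only `id` and `name` also keep working, since the new column simply defaults to NULL.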
52. 1. Write to both versions
2. Backfill historical data
3. Read from new version
4. Cut-off writes to old version
53. 0. Add new version to schema
1. Write to both versions
2. Backfill historical data
3. Read from new version
4. Cut-off writes to old version
54. 0. Add new version to schema
Schema change to add prefs columns to “users” table.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “off”
“read_prefs_from_users_table” => “off”
55. 1. Write to both versions
Write code for writing prefs to the “users” table.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “off”
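A rough sketch of what the dual-write step could look like, with the two write flags from the slide gating each destination. The dictionaries stand in for the "user_prefs" table and the new prefs columns on "users"; `save_prefs` is a hypothetical name, and Python is used here only for illustration.

```python
# Flag states as shown for step 1 on the slide.
FLAGS = {
    "write_prefs_to_user_prefs_table": "on",
    "write_prefs_to_users_table": "on",
    "read_prefs_from_users_table": "off",
}

user_prefs_table = {}  # stand-in for the old "user_prefs" table
users_table = {}       # stand-in for the new prefs columns on "users"


def save_prefs(user_id, prefs, flags=FLAGS):
    """Write prefs to every destination whose write flag is on."""
    if flags["write_prefs_to_user_prefs_table"] == "on":
        user_prefs_table[user_id] = dict(prefs)
    if flags["write_prefs_to_users_table"] == "on":
        users_table[user_id] = dict(prefs)
```

Turning "write_prefs_to_users_table" back to "off" returns the system to the step 0 behavior without a code deploy.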
56. 2. Backfill historical data
Offline process to sync existing data from “user_prefs”
to new columns in “users”
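The offline backfill is not shown in the slides. A minimal sketch, assuming a `user_prefs(user_id, language)` table and a hypothetical `pref_language` column on `users`, and using sqlite3 in place of MySQL: copy in small batches, and only fill rows the dual-write path has not already populated.

```python
import sqlite3


def backfill_prefs(conn, batch_size=1000):
    """Copy prefs from user_prefs into users in batches; return rows updated."""
    updated = 0
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT user_id, language FROM user_prefs "
            "WHERE user_id > ? ORDER BY user_id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return updated
        for user_id, language in rows:
            # Skip rows already written by the live dual-write code path.
            cur = conn.execute(
                "UPDATE users SET pref_language = ? "
                "WHERE id = ? AND pref_language IS NULL",
                (language, user_id),
            )
            updated += cur.rowcount
        conn.commit()  # commit per batch to keep transactions short
        last_id = rows[-1][0]
```

Because already-filled rows are skipped, the job can be stopped and rerun safely.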
57. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “staff”
58. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “1%”
59. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “5%”
60. 3. Read from new version
Data validation tests. Ensure consistency both internally
and in production.
“write_prefs_to_user_prefs_table” => “on”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “on”
(“on” == “100%”)
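The slides do not say what the data validation tests look like. One plausible form, sketched here under that assumption, is a comparison of the old and new stores that reports every user whose prefs differ or exist on only one side. `find_mismatches` is a hypothetical name, and plain dictionaries stand in for the two tables.

```python
def find_mismatches(old_store, new_store):
    """Return {user_id: (old_value, new_value)} for every disagreement.

    A row missing from one side is reported with None on that side.
    """
    mismatches = {}
    for user_id in old_store.keys() | new_store.keys():
        old_value = old_store.get(user_id)
        new_value = new_store.get(user_id)
        if old_value != new_value:
            mismatches[user_id] = (old_value, new_value)
    return mismatches
```

An empty result at "staff" and again at each percentage would be a reasonable signal before moving the read flag to the next stage.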
61. 4. Cut-off writes to old version
After running on the new table for a significant amount
of time, we can cut off writes to the old table.
“write_prefs_to_user_prefs_table” => “off”
“write_prefs_to_users_table” => “on”
“read_prefs_from_users_table” => “on”
62. “Branch by Abstraction”
Controller Controller
Users Model (Abstraction)
“users” (old) “user_prefs” “users”
old schema new schema
http://paulhammant.com/blog/branch_by_abstraction.html
http://continuousdelivery.com/2011/05/make-large-scale-changes-incrementally-with-branch-by-abstraction/
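A small sketch of the idea as it applies to this migration: the controller talks only to a Users model, and the model decides which schema backs the read. The class and function names are hypothetical, the stores are plain dictionaries, and `read_from_new` is an assumed callable that would wrap the "read_prefs_from_users_table" flag.

```python
class UsersModel:
    """Abstraction layer: hides which schema currently backs user prefs."""

    def __init__(self, old_store, new_store, read_from_new):
        self.old_store = old_store          # stand-in for "user_prefs"
        self.new_store = new_store          # stand-in for prefs columns on "users"
        self.read_from_new = read_from_new  # callable: user_id -> bool

    def get_prefs(self, user_id):
        store = self.new_store if self.read_from_new(user_id) else self.old_store
        return store.get(user_id, {})


def prefs_controller(model, user_id):
    # The controller never learns which schema served the request.
    prefs = model.get_prefs(user_id)
    return f"language={prefs.get('language', 'unknown')}"
```

Once the old schema is retired, only `UsersModel` changes; the controller code stays as it is.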
63. “The Migration 4-Step”
1. Write to both versions
2. Backfill historical data
3. Read from new version
4. Cut-off writes to old version
65. We might remove config flags for the old version when...
It is no longer valid for the business.
It is no longer stable, maintained, or trusted.
It has poor performance characteristics.
The code is a mess, or difficult to read.
We can afford to spend time on it.
78. “Where a new system concept or new technology is used, one has to build a
system to throw away, for even the best planning is not so omniscient as to
get it right the first time. Hence plan to throw one away; you will, anyhow.”
~ Fred Brooks, The Mythical Man-Month
94. More database servers in prod.
Bigger database hardware in prod.
More web servers.
Various replication schemes.
Different versions of server and OS software.
Schema changes applied at different times.
Physical hardware in prod.
More data in prod.
Legacy data (7 years of odd user states).
More traffic in prod.
Wait, I mean MUCH more traffic in prod.
Fewer elves.
Faster disks (SSDs) in prod.
95. Using a MySQL database in dev for an application that will be running
on Oracle in production: Priceless
106. SERVER METRICS
Apache requests/sec, Busy processes,
CPU utilization, Script exec time (med. & 95th)
APPLICATION METRICS
Logins, Registrations, Checkouts,
Listings created, Forum posts
Time and event correlated.
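As an illustration of correlating metrics with events in time, the sketch below compares the average of a metric just before and just after an event such as a deploy. The function name, the `(timestamp, value)` sample format, and the fixed window are assumptions; the slides do not describe how the correlation is actually done.

```python
from statistics import mean


def change_around_event(samples, event_time, window):
    """Mean of the metric after the event minus the mean before it.

    samples: iterable of (timestamp, value) pairs.
    Returns None if either side of the window has no samples.
    """
    before = [v for t, v in samples if event_time - window <= t < event_time]
    after = [v for t, v in samples if event_time <= t < event_time + window]
    if not before or not after:
        return None
    return mean(after) - mean(before)
```

A large negative value for a metric like checkouts right after a deploy timestamp would be a prompt to look at that deploy first.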
113. Tighten your feedback cycles
Integrate with production and validate early in cycle.
Use tools that allow you to detect issues early.
Optimize for quick response times.
Applied to both feature development and operability.
114. Thank you
... and questions?
These slides will be available later today at http://mikebrittain.com/talks
Mike Brittain
ENGINEERING DIRECTOR @mikebrittain
mike@etsy.com