The document discusses Etsy's experience integrating multiple content delivery network (CDN) providers. Etsy began using a single CDN in 2008 and, at the start of 2012, started investigating multiple CDNs to improve resilience, flexibility, and cost control. The team established evaluation criteria and a testing process, configuring and testing each CDN with non-critical traffic before routing production traffic to it. Etsy then balanced traffic across CDNs using DNS and monitored the performance of both the CDNs and the origin infrastructure.
7. Background
▪ First started using a single CDN in 2008
▪ Exponential Growth
▪ Began investigating running multiple CDNs at the start of 2012
@lozzd • @ickymettle
8. Why use a CDN?
▪ Goal: Consistently fast user experience globally
▪ Improve last mile performance by caching content
close to the user
▪ Offload content delivery from origin infrastructure
to the CDN provider
9. Why use more than one CDN?
▪ Resilience
- Eliminate single point of failure
▪ Flexibility
- Balance traffic based on business requirements
▪ Cost
- Manage provider costs
11. The Plan
1. Establish evaluation criteria
2. Initial configuration and testing
3. Test with production traffic
4. Operationalising
14. Performance
▪ Baseline Response Times
- Should be within ±5% of our existing CDN provider’s response times
▪ Hit Ratios and Origin Offload
- Provider should achieve equivalent or better origin offload performance and hit ratios
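The two performance criteria above can be written as simple checks. This is a minimal sketch, not Etsy's tooling: the function names, the millisecond units, and the request-count inputs are assumptions made for illustration.

```python
def within_baseline(candidate_ms: float, baseline_ms: float,
                    tolerance: float = 0.05) -> bool:
    """Return True if the candidate's response time is within
    +/- tolerance (5% by default) of the existing provider's baseline."""
    if baseline_ms <= 0:
        raise ValueError("baseline_ms must be positive")
    return abs(candidate_ms - baseline_ms) <= tolerance * baseline_ms


def origin_offload(edge_requests: int, origin_requests: int) -> float:
    """Fraction of requests answered by the CDN without reaching the origin."""
    if edge_requests <= 0:
        raise ValueError("edge_requests must be positive")
    return 1.0 - origin_requests / edge_requests
```

A candidate would pass if `within_baseline(...)` holds and its `origin_offload(...)` is at least that of the existing provider over the same period.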
15. Configuration
▪ Complexity
- How complex is the provider's configuration system?
▪ Self service
- Can you make changes directly, or do they require professional services or other intervention?
▪ Latency for changes
- How long do changes take to propagate?
20. Clean the house
▪ Managing caching TTLs from origin
- CDNs honour the origin Cache-Control headers!
<LocationMatch "\.(gif|jpg|jpeg|png|css|js)$">
Header set Cache-Control "max-age=94670800"
</LocationMatch>
21. Clean the house
▪ Manage gzip compression from origin
- Honoured by CDNs
- Compression from origin to CDN
## mod_deflate compression - see OPS-1537 ##
AddOutputFilterByType DEFLATE text/html text/plain
text/css application/x-javascript [..]
22. Clean the house
If you can do it at origin,
do it at origin
23. Mean Time To Curl
http://www.flickr.com/photos/wwarby/3297205226
24. curl -i -H 'Host: img0.etsystatic.com' global-ssl.fastly.net/someimage.jpg
HTTP/1.1 200 OK
Server: Apache
Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT
Cache-Control: max-age=94670800
[...]
X-Served-By: cache-lo82-LHR
X-Cache: MISS
X-Cache-Hits: 0
25. curl -i -H 'Host: img0.etsystatic.com' global-ssl.fastly.net/someimage.jpg
HTTP/1.1 200 OK
Server: Apache
Last-Modified: Sat, 09 Nov 2013 23:43:38 GMT
Cache-Control: max-age=94670800
[...]
X-Served-By: cache-lo82-LHR
X-Cache: HIT
X-Cache-Hits: 1
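The MISS-then-HIT check shown in the two curl responses can also be scripted. The sketch below only parses response text that has already been captured (for example from `curl -i`), so it makes no network calls. The function names are hypothetical, and treating an `X-Cache` value of exactly `HIT` as a cache hit is an assumption based on the output above; other providers may report cache status differently.

```python
def parse_headers(raw_response: str) -> dict:
    """Parse the header lines of a raw HTTP response into a dict
    keyed by lower-cased header name. The status line is skipped."""
    headers = {}
    for line in raw_response.splitlines()[1:]:
        if ":" not in line:
            continue  # skip blank lines and elisions such as "[...]"
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def is_cache_hit(raw_response: str) -> bool:
    """True if the response reports X-Cache: HIT."""
    return parse_headers(raw_response).get("x-cache", "").upper() == "HIT"


# Sample responses abbreviated from the curl output above.
FIRST_RESPONSE = """HTTP/1.1 200 OK
Server: Apache
Cache-Control: max-age=94670800
X-Served-By: cache-lo82-LHR
X-Cache: MISS
X-Cache-Hits: 0
"""

SECOND_RESPONSE = """HTTP/1.1 200 OK
Server: Apache
Cache-Control: max-age=94670800
X-Served-By: cache-lo82-LHR
X-Cache: HIT
X-Cache-Hits: 1
"""
```

A smoke test passes when the first request is a miss and the repeat request is a hit with the origin's `Cache-Control` header intact.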
26. Mean Time To Curl = Done
https://www.etsy.com/listing/99871278
27. Mean Time To Curl
▪ No need to touch existing infrastructure
▪ Smoke test of functionality
▪ 10 minutes from configuration to curl
▪ New providers should be plug and play
29. Testing with Production Traffic
▪ Images only at first
▪ Good test of caching performance
▪ Easy to test by swapping hostnames
▪ Made even easier with our A/B testing framework
30. A/B Test Framework
▪ Fine grained control
▪ Enable test for specific users or groups
▪ Percentage of users
▪ All controlled via configuration in code
▪ Rapid and complete rollback
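A percentage-based hostname swap of the kind described above might look like the following. This is a sketch under stated assumptions, not Etsy's A/B framework: the config keys, the test name, the user IDs, and the test hostname `img0.cdn-test.example.com` are all invented for illustration. Only `img0.etsystatic.com` appears in the slides.

```python
import hashlib


def bucket_for(user_id: str, test_name: str) -> int:
    """Map a user to a stable bucket in 0..99 by hashing test name and user ID."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest[:8], 16) % 100


def choose_image_host(user_id: str, config: dict) -> str:
    """Pick the image hostname for this user based on the test configuration."""
    if not config["enabled"]:
        return config["default_host"]  # rollback: everyone on the default
    if user_id in config["always_on_users"]:
        return config["test_host"]  # specific users or groups
    if bucket_for(user_id, config["name"]) < config["percentage"]:
        return config["test_host"]  # percentage of users
    return config["default_host"]


# Hypothetical configuration, kept in code alongside the application.
CDN_TEST = {
    "name": "new_cdn_images",
    "enabled": True,
    "percentage": 5,
    "always_on_users": {"ops_engineer_1"},
    "default_host": "img0.etsystatic.com",
    "test_host": "img0.cdn-test.example.com",
}
```

Because the bucket is derived from a hash rather than a random draw, a given user sees the same hostname on every request, and setting `enabled` to `False` moves all users back at once.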
36. Metrics and Monitoring
▪ Get more detail by pulling metrics in house
▪ Write script to pull data from API
▪ Create dashboards with data
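One way to get provider data onto in-house dashboards is to reshape each API result into Graphite's plaintext format (`metric.path value timestamp`). The sketch below covers only that formatting step. The `cdn.<provider>.<metric>` naming scheme and the stat names are assumptions, and fetching the numbers from a provider's API is left out because each vendor's API differs.

```python
def to_graphite_lines(provider: str, stats: dict, timestamp: int) -> list:
    """Format one provider's stats as Graphite plaintext lines.

    `stats` maps a metric name to a numeric value, for example
    {"requests": 1200, "hits": 1080}.
    """
    lines = []
    for name in sorted(stats):
        lines.append(f"cdn.{provider}.{name} {stats[name]} {timestamp}")
    return lines
```

For example, `to_graphite_lines("provider_a", {"requests": 1200, "hits": 1080}, 1384041600)` yields one line per metric, ready to be sent to a Graphite listener.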
40. Testing Plan
1. for c in $cdns; do rampup $c; done;
2. Deliberately slow and steady
3. Watch traffic increase
4. Watch origin offload increase
5. Watch performance
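The slow-and-steady ramp can be sketched as a fixed schedule plus a health gate. The step sizes, the function names, and the choice to drop straight back to 0% when the graphs look unhealthy are assumptions for illustration. The slides do not state the actual percentages used.

```python
from typing import List


def ramp_schedule(target_percent: int, step_percent: int) -> List[int]:
    """Return the traffic percentages to step through, ending at the target."""
    if step_percent <= 0:
        raise ValueError("step_percent must be positive")
    steps = []
    current = 0
    while current < target_percent:
        current = min(current + step_percent, target_percent)
        steps.append(current)
    return steps


def next_percentage(schedule: List[int], current: int, healthy: bool) -> int:
    """Advance one step if traffic, offload, and performance look healthy;
    otherwise roll back to 0 (assumed rollback policy)."""
    if not healthy:
        return 0
    for step in schedule:
        if step > current:
            return step
    return current  # already at the target
```

Here `ramp_schedule(50, 20)` gives `[20, 40, 50]`, and an operator (or a script) would call `next_percentage` only after watching the graphs at the current level.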
41. Downsides of this approach
▪ A/B testing can’t be used for the main site
▪ Exposing your test CNAMEs
▪ Especially if hotlinking is a concern
43. How do you know it’s broke?
▪ Check the graphs!
▪ Check with your community
▪ Keep support in the loop
51. Balancing Traffic Using DNS
▪ Traffic Manager
▪ Extends DNS to dynamically return records based
on rules
▪ Weighted round robin
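Weighted balancing of this kind can be illustrated with a weighted random pick per DNS response. This is a simulation of the idea only, not the DNS provider's implementation: the provider names, CNAME targets, and weights below are hypothetical.

```python
import random


def pick_cdn(weights: dict, rng: random.Random) -> str:
    """Choose one CNAME target with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# Hypothetical rule: 60/40 split, with a third provider weighted out entirely.
CDN_WEIGHTS = {
    "cdn-a.example.net": 60,
    "cdn-b.example.net": 40,
    "cdn-c.example.net": 0,
}
```

Setting a provider's weight to 0 removes it from rotation without deleting its record, which is how traffic can be shifted away from a provider that is misbehaving.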
54. Balancing Traffic Using DNS
▪ Rule updates typically made via web UI
▪ Can be slow and error prone
▪ Changes need to be applied to all three domains
▪ API available to make changes programmatically
66. DNS balancing downsides
▪ Low TTLs for fast convergence
▪ Mo QPS == Mo Money
▪ More DNS lookups for users
▪ Not 100% instant or deterministic
73. Failure Beacons
1. 1x1 tracking pixel embedded in page
2. Request creates an access log line
3. Scrape them out minutely using logster
self.reg = re.compile(
    r'^\S+(\s:)? (?P<remote_addr>[0-9.]+),? [0-9.,\- ]+ \[[^\]]+\] '
    r'"GET /status/images/beacon.gif\?(beacon_)?source=(?P<source>\S+) '
    r'HTTP/1.\d" \d+ [\d-]+ "(?P<referrer>[^"]+)" "(?P<user_agent>[^"]+)" .*$')
77. Failure Beacons
1. 1x1 tracking pixel embedded in page
2. Request creates an access log line
3. Scrape them out minutely using logster
4. Logster posts event counts to Graphite
5. Alert on Graphite graph in Nagios
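Steps 2 to 4 amount to counting beacon hits per `source` in each minute of access logs. The sketch below shows that counting step in plain Python rather than as a real logster parser class. It uses a simplified pattern that only looks at the request path, and the sample log format, source names, and threshold check are assumptions for illustration.

```python
import re
from collections import Counter

# Simplified pattern: match the beacon request and capture its source value.
BEACON_RE = re.compile(
    r'"GET /status/images/beacon\.gif\?(?:beacon_)?source=(?P<source>[^\s&"]+)'
)


def count_beacons(lines) -> Counter:
    """Count beacon requests per source across a batch of access log lines."""
    counts = Counter()
    for line in lines:
        match = BEACON_RE.search(line)
        if match:
            counts[match.group("source")] += 1
    return counts


def over_threshold(counts: Counter, threshold: int) -> list:
    """Return the sources whose beacon count exceeds the alert threshold."""
    return sorted(source for source, n in counts.items() if n > threshold)


# Hypothetical log lines in a common Apache-style format.
SAMPLE_LINES = [
    '198.51.100.7 - - [09/Nov/2013:23:43:38 +0000] "GET /status/images/beacon.gif?source=cdn_a HTTP/1.1" 200 43 "https://www.etsy.com/" "Mozilla/5.0"',
    '198.51.100.8 - - [09/Nov/2013:23:43:39 +0000] "GET /status/images/beacon.gif?beacon_source=cdn_a HTTP/1.1" 200 43 "https://www.etsy.com/" "Mozilla/5.0"',
    '198.51.100.9 - - [09/Nov/2013:23:43:40 +0000] "GET /status/images/beacon.gif?source=cdn_b HTTP/1.1" 200 43 "https://www.etsy.com/" "Mozilla/5.0"',
    '198.51.100.9 - - [09/Nov/2013:23:43:41 +0000] "GET /index.html HTTP/1.1" 200 1024 "-" "Mozilla/5.0"',
]
```

The per-source counts are what would be posted to Graphite each minute, and the threshold check stands in for the Nagios alert on the resulting graph.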
85. Backend Monitoring
▪ Vendor APIs to bring data in house
▪ Benefits of having the data in-house include
- Integration with our anomaly detection systems
- Consistent and unified view of all CDN metrics
- We control data retention period
86. Awareness
▪ Over 100 engineers
▪ Deploying 60 times a day
▪ Correlating external and internal services
90. Frontend Monitoring
▪ Performance is important to us
▪ Monitoring overall site performance
▪ Monitoring performance by CDN provider
▪ SOASTA mPulse on key pages to track real user
page performance
92. Debugging: What broke?
▪ MTTD/MTTR can be extremely low with this
system
▪ But not always
94. Debugging: What broke?
▪ Non-technical member base
▪ Confusing and time consuming
▪ Amazing support team
▪ Log as much information as possible
96. Great success
▪ 12 months in, the benefits have far outweighed the
few downsides
▪ We’re continuing to evolve the system
▪ We’ll be sure to share our experience with the
community along the way