This document discusses canary analysis, which is a deployment process where a new change is gradually rolled out to production with checkpoints to examine the new systems versus the old systems and make go/no-go decisions. It proposes using canary analysis to test software releases by routing a small percentage of traffic to new servers and comparing metrics like error rates and requests per second between the new and old servers before fully deploying the new release. The document provides advice on automating canary analysis, focusing on relative metrics, ignoring outliers, balancing fidelity with customer impact, and letting application owners choose when differences are acceptable.
Canary Analyze All the Things
1. Canary Analyze All the Things
Roy Rapoport
@royrapoport
June 12, 2014
Significant contributions by Chris Sanden, @chris_sanden
2. Oh, the Places We’ll Go!
• Introductions
• Proposed Use Case and Definition
• Continuous Improvement / MVP Model
• Issues, Solutions
• Cloud Considerations
• The Road at Netflix
12. A Word About Me …
• About 20 years in technology
• Systems engineering, networking, software development, QA, release management
• Time at Netflix: 1809 days (4y:11m:14d)
• At Netflix:
  • Systems Engineering, Service Delivery in IT/Ops
  • Troubleshooter and Builder of Python Things[tm] in Product Engineering
  • Current role: Insight Engineering in Product Engineering
  • Real-Time Operational Insight
23. A Word About Netflix…
Freedom and Responsibility Culture
• Optimize speed of innovation
  Constrain availability
  Cost will be what cost will be
• Hire smart (experienced) people
  Get out of their way
• Anti-process bias
32. A Word About Netflix…
Technology and Operations
• Service Oriented Architecture
• Decentralized Operations. You:
  • Build
  • Test
  • Deploy
  • Set up alerting and monitoring
  • Wake up at 2AM
46. So You’ve Just Done a Release
> curl http://WhatDoesTheFooSay.prod.netflix.net/api/v1/fox
{"response": "wa-pa-pa-pa-pa-pa-pow"}
The correct answer to “what does the fox say?” is left as an exercise for the reader.
49. You Need Better Testing!
“I’m going to push to production, though I’m pretty sure it’s going to kill the system”
- Said no one, ever*
* Hopefully
55. Detour: Rate of Change vs Availability
[Chart: x-axis "Rate of Change" with ticks at 1, 10, 100, 1000; y-axis "Availability (nines)" from 0 to 6; the final build adds the labels "Operations" and "Engineering".]
56. You Need Better Testing! Deployments! Canary Analysis!
• A deployment process where
• a new change (in behavior, code, or both)
• is rolled out into production gradually,
• with checkpoints along the way to examine the new (canary) systems
• (optionally versus the old (baseline) systems)
• and make go/no-go decisions.
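The definition above can be sketched as a loop over rollout stages with a checkpoint after each one. This is only an illustration under stated assumptions, not the process used at Netflix: the stage percentages, the `check` callback, and the name `run_canary_rollout` are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


def run_canary_rollout(
    stages: Sequence[int],
    check: Callable[[int], bool],
) -> Tuple[bool, int]:
    """Walk through rollout stages, stopping at the first failed checkpoint.

    stages: percentages of production traffic sent to the canary (hypothetical).
    check:  hypothetical checkpoint; returns True for "go", False for "no-go".
    Returns (fully_deployed, last_percentage_that_passed).
    """
    last_good = 0
    for percent in stages:
        # A real system would shift traffic here, wait, then gather metrics.
        if not check(percent):
            return False, last_good  # no-go: stop the rollout at this checkpoint
        last_good = percent
    return True, last_good


# Example: the canary looks healthy up to 25% of traffic, then degrades.
result = run_canary_rollout([1, 5, 25, 50, 100], lambda pct: pct <= 25)
print(result)
```

Keeping the checkpoint as a callback leaves the go/no-go logic separate from the rollout mechanics, so either can change without touching the other.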
107. A Quick Recap
• Observe
• Segregate metrics
• Partial deploy
• Compare to Baseline
• Absolutes are never right
• Automate decision
• Automate execution
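One way to read "Compare to Baseline" and "Absolutes are never right" together is to judge the canary by its relative difference from the baseline instead of a fixed threshold, and then let code make the call. The sketch below assumes a single metric and a 10% tolerance; the function names and the tolerance value are illustrative assumptions, not taken from the talk.

```python
def relative_change(canary: float, baseline: float) -> float:
    """Fractional difference of the canary from the baseline (0.10 == +10%)."""
    if baseline == 0:
        # Assumption: a zero baseline counts as "no change" only if the canary is zero too.
        return 0.0 if canary == 0 else float("inf")
    return (canary - baseline) / baseline


def decide(canary: float, baseline: float, tolerance: float = 0.10) -> str:
    """Automated go/no-go: fail when the canary deviates beyond the tolerance."""
    return "go" if abs(relative_change(canary, baseline)) <= tolerance else "no-go"


# Errors per second measured on canary and baseline servers over the same window.
print(decide(canary=2.1, baseline=2.0))  # canary is about 5% higher
print(decide(canary=3.0, baseline=2.0))  # canary is 50% higher
```

Because both groups are measured over the same window, a traffic spike that doubles errors everywhere leaves the relative difference unchanged, which a fixed absolute threshold would not.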
111. To Save You Some Time …
Not all metrics are created equal
• Focus on System and Application Metrics
• Weight by category (system, latency, etc.)
114. To Save You Some Time …
Outliers are out, lying
• Use a group of servers
• Balance fidelity with customer impact
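The slide does not say how outliers are handled. As one possible approach, the sketch below summarizes a metric across a group of servers with the median, which a single misbehaving host cannot drag far. The sample values and the name `group_value` are made up for illustration.

```python
from statistics import median
from typing import Sequence


def group_value(per_server: Sequence[float]) -> float:
    """Summarize one metric across a server group; the median resists outliers."""
    return median(per_server)


# Latency (ms) per server; one canary host is misbehaving for unrelated reasons.
canary = [101.0, 99.0, 100.0, 950.0, 102.0]
baseline = [100.0, 98.0, 101.0, 99.0, 100.0]

print(group_value(canary), group_value(baseline))
print(sum(canary) / len(canary))  # the mean is dragged up by the single outlier
```

A larger group gives a steadier summary but exposes more customers to the canary, which is the fidelity versus customer impact trade-off named on the slide.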
117. To Save You Some Time …
Exercise without warmup can result in injury
• Repeat canary analysis frequently
• Both traffic and startup time are factors
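One simple way to account for startup time is to discard samples taken before an instance has warmed up, so early readings do not count against the canary. The sketch below assumes each sample carries the instance age in seconds; the 300-second warmup and the name `drop_warmup` are hypothetical.

```python
from typing import List, Sequence, Tuple


def drop_warmup(samples: Sequence[Tuple[float, float]],
                warmup_seconds: float) -> List[float]:
    """Keep metric values recorded after the warmup period.

    samples: (seconds_since_instance_start, value) pairs.
    """
    return [value for age, value in samples if age >= warmup_seconds]


# Latency is often high right after startup, then settles.
samples = [(10, 900.0), (60, 400.0), (300, 105.0), (600, 100.0), (900, 98.0)]
print(drop_warmup(samples, warmup_seconds=300))
```

Running the analysis repeatedly over the remaining window, instead of once, gives the canary several chances to show a problem that only appears under sustained traffic.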
120. To Save You Some Time …
vive la différence!
• Hot-OK, Cold-OK
• Let Application Owners Choose
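The sketch below shows one way an application owner might declare which differences are acceptable per metric. It assumes "hot" means the canary runs higher than the baseline and "cold" means lower; that reading, the `Tolerance` enum, and the 10% threshold are assumptions, not definitions from the slides.

```python
from enum import Enum


class Tolerance(Enum):
    STRICT = "strict"    # any significant difference fails
    HOT_OK = "hot-ok"    # canary may run higher than baseline
    COLD_OK = "cold-ok"  # canary may run lower than baseline


def metric_passes(canary: float, baseline: float,
                  tolerance: Tolerance, threshold: float = 0.10) -> bool:
    """Apply an owner-chosen tolerance to the relative canary/baseline difference."""
    if baseline == 0:
        return canary == 0
    change = (canary - baseline) / baseline
    if abs(change) <= threshold:
        return True  # within the noise band, regardless of direction
    if change > 0:
        return tolerance is Tolerance.HOT_OK
    return tolerance is Tolerance.COLD_OK


# An owner might accept lower CPU use (cold) but never higher error rates (hot).
print(metric_passes(40.0, 60.0, Tolerance.COLD_OK))  # CPU dropped by a third
print(metric_passes(3.0, 2.0, Tolerance.COLD_OK))    # errors rose by half
```

Putting the tolerance in per-metric configuration keeps the decision with the people who know whether, say, a drop in request rate is an improvement or a symptom.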
122. To Save You Some Time …
Signal is better than no1$#[NO CARRIER]
• Ignore weak signals
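The slides do not define a weak signal. One plausible reading is a metric with too few meaningful data points to compare, and the sketch below filters on that basis. The 30-sample minimum, the metric names, and the function name `metrics_with_signal` are assumptions chosen for illustration.

```python
from typing import Dict, List, Sequence


def metrics_with_signal(metrics: Dict[str, Sequence[float]],
                        min_points: int = 30) -> List[str]:
    """Return names of metrics with enough non-zero samples to be worth comparing."""
    return sorted(
        name for name, values in metrics.items()
        if sum(1 for v in values if v != 0) >= min_points
    )


metrics = {
    "requests_per_second": [120.0] * 60,
    "rare_error_count": [0.0] * 58 + [1.0, 2.0],
}
print(metrics_with_signal(metrics))
```

Dropping sparse metrics before scoring avoids a go/no-go decision that hinges on two or three data points.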
172. For Our Next Trick …
• Configuration GUI
• Deployment System Integration
• ACA All The Things
  • OpenConnect firmware updates
  • Client software changes
  • Configuration changes in production