Jason Yee - Chaos! - Codemotion Rome 2019

•

1 gefällt mir•318 views

As applications become more distributed and complex, so do our failure modes. In this presentation, I’ll share why you shouldn’t just embrace failure, but why you should induce it to intentionally cause and learn from failure. Together with the audience, I'll run a Chaos experiment to show how they can start their own Chaos engineering and make their systems more resilient.

Technologie

CHAOS!
Breaking your systems to make them unbreakable

–Barack Obama
“You can’t let your failures define you.
You have to let your failures teach you.”

– Corey Bertram
“Hope is not a strategy.”

J A S O N Y E E
Te c h n i c a l E v a n g e l i s t
Tr a v e l H a c k e r
W h i s k e y H u n t e r
P o k e m o n Tr a i n e r
Tw : @ g i t b i s e c t
E m : j y e e @ d a t a d o g h q . c o m

D ATA D O G
S a a S - b a s e d o b s e r v a b i l i t y
p l a t f o r m :
• M e t r i c s
• Tr a c e s ( A P M )
• L o g s
• S y n t h e t i c s
Tw : @ d a t a d o g h q
We ’ re h i r i n g :
j o b s . d a t a d o g h q . c o m

–Netflix Technology Blog, 2011
http://bit.ly/netflix-chaos
“By running Chaos Monkey in the middle of a
business day, in a carefully monitored environment
with engineers standing by to address any problems,
we can still learn the lessons about the weaknesses
of our system, and build automatic recovery
mechanisms to deal with them.
So next time an instance fails at 3 am on a Sunday,
we won’t even notice.”

“By running Chaos Monkey in the middle of a
business day, in a carefully monitored environment
with engineers standing by to address any problems,
we can still learn the lessons about the weaknesses
of our system, and build automatic recovery
mechanisms to deal with them.
So next time an instance fails at 3 am on a Sunday,
we won’t even notice.”
–Netflix Technology Blog, 2011
http://bit.ly/netflix-chaos

90 Minutes
30 minutes planning.
50 minutes playing.

90 Minutes
30 minutes planning.
50 minutes playing.
10 minutes reporting.

The Team
1 subject matter expert
1 SRE
Sometimes, 1 junior
engineer/developer

Before
• Schedule it.
• Pick tests. Start easy.

Before
• Schedule it.
• Pick tests. Start easy.
• Write down what you expect to happen.

Before
• Schedule it.
• Pick tests. Start easy.
• Write down what you expect to happen.
• Write down your plan if things go wrong.

Before
• Schedule it.
• Pick tests. Start easy.
• Write down what you expect to happen.
• Write down your plan if things go wrong.
• Share your document!

During
• Staging -> Production (off-peak) -> Production (primetime)

During
• Staging -> Production (off-peak) -> Production (primetime)
• Announce start in group chat.

During
• Staging -> Production (off-peak) -> Production (primetime)
• Announce start in group chat.
• Maintain discussion in group chat.

During
• Staging -> Production (off-peak) -> Production (primetime)
• Announce start in group chat.
• Maintain discussion in group chat.
• Monitor for outages.

During
• Staging -> Production (off-peak) -> Production (primetime)
• Announce start in group chat.
• Maintain discussion in group chat.
• Monitor for outages.
• Run your test and take notes!

After
• Create cards/tickets to track issues that need work.

After
• Create cards/tickets to track issues that need work.
• Write a summary & key lessons.

After
• Create cards/tickets to track issues that need work.
• Write a summary & key lessons.
• Email to all of engineering.

After
• Create cards/tickets to track issues that need work.
• Write a summary & key lessons.
• Email to all of engineering.
• Celebrate!

Game Levels
0. Terminate the service. Block access to 1 dependency.

Game Levels
0. Terminate the service. Block access to 1 dependency.
1. Block access to all dependencies.

Game Levels
0. Terminate the service. Block access to 1 dependency.
1. Block access to all dependencies.
2. Terminate the host.

Game Levels
0. Terminate the service. Block access to 1 dependency.
1. Block access to all dependencies.
2. Terminate the host.
3. Degrade the environment.
Monitor & alert on Latency, Errors, Traffic, & Saturation

Additional Resources
• systemctl, kill, iptables

Additional Resources
• systemctl, kill, iptables
• Comcast - https://github.com/tylertreat/comcast

Additional Resources
• systemctl, kill, iptables
• Comcast - https://github.com/tylertreat/comcast
• Vegeta - https://github.com/tsenart/vegeta

Additional Resources
• systemctl, kill, iptables
• Comcast - https://github.com/tylertreat/comcast
• Vegeta - https://github.com/tsenart/vegeta
• Kubernetes & Istio -
https://istio.io/docs/tasks/traffic-management/fault-injection/

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
name: wordpress-virtualservice
spec:
hosts:
- "*"
gateways:
- wordpress-gateway
http:
- route:
- destination:
host: wordpress
fault:
abort:
percentage:
value: 50.0
httpStatus: 500
delay:
percentage:
value: 50.0
fixedDelay: 30s

Additional Resources
• Gremlin - https://www.gremlin.com/
• ChaosToolkit - https://chaostoolkit.org/

Questions?
jason.yee@datadoghq.com
@gitbisect

Slides: bit.ly/cmr-chaos 
Documents: bit.ly/cmr-chaos-docs
jason.yee@datadoghq.com
@gitbisect

Empfohlen

Patrick Laverty - #InfoGov17 - I Hack People For Money: A Day In The Life Of ...ARMA International

Red Team Diary: Meu Recon falhou e agora?Arthur Paixão

EduSparkz Thunder Thursday Debugging CodeSatish AG

The Future of Testing WebinarAlan Richardson

Failing @ Scaling: Don't Panic, and Carry a Towel - Agile2016Em Campbell-Pretty

Dev / non dev communicationaaronjorbin

Mobile CRO - The psychology behind selling to thumbsOnline Dialogue

Test Motherfucker...TestMario García

Empfohlen

Patrick Laverty - #InfoGov17 - I Hack People For Money: A Day In The Life Of ...ARMA International

Red Team Diary: Meu Recon falhou e agora?Arthur Paixão

EduSparkz Thunder Thursday Debugging CodeSatish AG

The Future of Testing WebinarAlan Richardson

Failing @ Scaling: Don't Panic, and Carry a Towel - Agile2016Em Campbell-Pretty

Dev / non dev communicationaaronjorbin

Mobile CRO - The psychology behind selling to thumbsOnline Dialogue

Test Motherfucker...TestMario García

The hunt of the unicorn, to capture productivityBrainhub

Larry Maccherone: "Probabilistic Decision Making"RedHatAgileDay

Five Cliches of Online Game Developmentiandundore

How I failed to build a runbook automation systemTimothyBonci

Hushcon 2016 Keynote: Test for EchoDeja vu Security

Continuous Automated Testing - Cast conference workshop august 2014Noah Sussman

Software testing (2) trainingin-mumbaivibrantuser

Scrum Plus Extreme Programming (XP) for Hyper ProductivityRon Quartel

Secrets and Mysteries of Automated Execution Keynote slidesAlan Richardson

Angus Fletcher - Error Handling in Concurrent SystemsMaritime DevCon

Let's Sharpen Your Agile Ax, It's Story Splitting TimeExcella

Sharpen your Agile Axe by Brian SjorberWashington DC Scrum User Group

Getting Schooled DerbyCon 3.0TonikJDK

Coaching teams in creative problem solvingFlowa Oy

Software testing (2) trainingin-mumbai...vibrantuser

Chaos Engineering Talk at DevOps Days Austinmatthewbrahms

DevOps Days Vancouver 2014 SlidesAlex Cruise

How To Run a 5 Whys (With Humans, Not Robots)Dan Milstein

Spaghetti gateJon Bachelor

Put Some SRE in Your Shipped SoftwareTheo Schlossnagle

Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...Codemotion

Pompili - From hero to_zero: The FatalNoise neverending storyCodemotion

Weitere ähnliche Inhalte

Ähnlich wie Jason Yee - Chaos! - Codemotion Rome 2019

The hunt of the unicorn, to capture productivityBrainhub

Larry Maccherone: "Probabilistic Decision Making"RedHatAgileDay

Five Cliches of Online Game Developmentiandundore

How I failed to build a runbook automation systemTimothyBonci

Hushcon 2016 Keynote: Test for EchoDeja vu Security

Continuous Automated Testing - Cast conference workshop august 2014Noah Sussman

Software testing (2) trainingin-mumbaivibrantuser

Scrum Plus Extreme Programming (XP) for Hyper ProductivityRon Quartel

Secrets and Mysteries of Automated Execution Keynote slidesAlan Richardson

Angus Fletcher - Error Handling in Concurrent SystemsMaritime DevCon

Let's Sharpen Your Agile Ax, It's Story Splitting TimeExcella

Sharpen your Agile Axe by Brian SjorberWashington DC Scrum User Group

Getting Schooled DerbyCon 3.0TonikJDK

Coaching teams in creative problem solvingFlowa Oy

Software testing (2) trainingin-mumbai...vibrantuser

Chaos Engineering Talk at DevOps Days Austinmatthewbrahms

DevOps Days Vancouver 2014 SlidesAlex Cruise

How To Run a 5 Whys (With Humans, Not Robots)Dan Milstein

Spaghetti gateJon Bachelor

Put Some SRE in Your Shipped SoftwareTheo Schlossnagle

Ähnlich wie Jason Yee - Chaos! - Codemotion Rome 2019 (20)

The hunt of the unicorn, to capture productivity

Larry Maccherone: "Probabilistic Decision Making"

Five Cliches of Online Game Development

How I failed to build a runbook automation system

Hushcon 2016 Keynote: Test for Echo

Continuous Automated Testing - Cast conference workshop august 2014

Software testing (2) trainingin-mumbai

Scrum Plus Extreme Programming (XP) for Hyper Productivity

Secrets and Mysteries of Automated Execution Keynote slides

Angus Fletcher - Error Handling in Concurrent Systems

Let's Sharpen Your Agile Ax, It's Story Splitting Time

Sharpen your Agile Axe by Brian Sjorber

Getting Schooled DerbyCon 3.0

Coaching teams in creative problem solving

Software testing (2) trainingin-mumbai...

Chaos Engineering Talk at DevOps Days Austin

DevOps Days Vancouver 2014 Slides

How To Run a 5 Whys (With Humans, Not Robots)

Spaghetti gate

Put Some SRE in Your Shipped Software

Mehr von Codemotion

Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...Codemotion

Pompili - From hero to_zero: The FatalNoise neverending storyCodemotion

Pastore - Commodore 65 - La storiaCodemotion

Pennisi - Essere Richard AltwasserCodemotion

Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...Codemotion

Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019Codemotion

Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019Codemotion

Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 - Codemotion

Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...Codemotion

Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...Codemotion

Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...Codemotion

Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...Codemotion

Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019Codemotion

Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019Codemotion

Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019Codemotion

James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...Codemotion

Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...Codemotion

Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019Codemotion

Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019Codemotion

Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019Codemotion

Mehr von Codemotion (20)

Fuzz-testing: A hacker's approach to making your code more secure | Pascal Ze...

Pompili - From hero to_zero: The FatalNoise neverending story

Pastore - Commodore 65 - La storia

Pennisi - Essere Richard Altwasser

Michel Schudel - Let's build a blockchain... in 40 minutes! - Codemotion Amst...

Richard Süselbeck - Building your own ride share app - Codemotion Amsterdam 2019

Eward Driehuis - What we learned from 20.000 attacks - Codemotion Amsterdam 2019

Francesco Baldassarri - Deliver Data at Scale - Codemotion Amsterdam 2019 -

Martin Förtsch, Thomas Endres - Stereoscopic Style Transfer AI - Codemotion A...

Melanie Rieback, Klaus Kursawe - Blockchain Security: Melting the "Silver Bul...

Angelo van der Sijpt - How well do you know your network stack? - Codemotion ...

Lars Wolff - Performance Testing for DevOps in the Cloud - Codemotion Amsterd...

Sascha Wolter - Conversational AI Demystified - Codemotion Amsterdam 2019

Michele Tonutti - Scaling is caring - Codemotion Amsterdam 2019

Pat Hermens - From 100 to 1,000+ deployments a day - Codemotion Amsterdam 2019

James Birnie - Using Many Worlds of Compute Power with Quantum - Codemotion A...

Don Goodman-Wilson - Chinese food, motor scooters, and open source developmen...

Pieter Omvlee - The story behind Sketch - Codemotion Amsterdam 2019

Dave Farley - Taking Back “Software Engineering” - Codemotion Amsterdam 2019

Joshua Hoffman - Should the CTO be Coding? - Codemotion Amsterdam 2019

Kürzlich hochgeladen

DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays

Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services

CNIC Information System with Pakdata Cf In Pakistandanishmna97

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz

presentation ICT roal in 21st century educationjfdjdjcjdnsjd

2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong

Spring Boot vs Quarkus the ultimate battle - DevoxxUKJago de Vreede

ICT role in 21st century education and its challengesrafiqahmad00786416

Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays

Manulife - Insurer Transformation Award 2024The Digital Insurer

Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays

MS Copilot expands with MS Graph connectorsNanddeep Nachan

TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc

Why Teams call analytics are critical to your entire businesspanagenda

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software

Kürzlich hochgeladen (20)

DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam

Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...

Strategies for Landing an Oracle DBA Job as a Fresher

CNIC Information System with Pakdata Cf In Pakistan

EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER

"I see eyes in my soup": How Delivery Hero implemented the safety system for ...

Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...

presentation ICT roal in 21st century education

2024: Domino Containers - The Next Step. News from the Domino Container commu...

Spring Boot vs Quarkus the ultimate battle - DevoxxUK

ICT role in 21st century education and its challenges

Finding Java's Hidden Performance Traps @ DevoxxUK 2024

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe

Manulife - Insurer Transformation Award 2024

Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...

MS Copilot expands with MS Graph connectors

TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery

Why Teams call analytics are critical to your entire business

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME

Jason Yee - Chaos! - Codemotion Rome 2019

1. CHAOS! Breaking your systems to make them unbreakable

2. –Barack Obama “You can’t let your failures define you. You have to let your failures teach you.”

3. – Corey Bertram “Hope is not a strategy.”

4. http://bit.ly/netflix-5-things

5. Chaos Monkey

6. J A S O N Y E E Te c h n i c a l E v a n g e l i s t Tr a v e l H a c k e r W h i s k e y H u n t e r P o k e m o n Tr a i n e r Tw : @ g i t b i s e c t E m : j y e e @ d a t a d o g h q . c o m

7. D ATA D O G S a a S - b a s e d o b s e r v a b i l i t y p l a t f o r m : • M e t r i c s • Tr a c e s ( A P M ) • L o g s • S y n t h e t i c s Tw : @ d a t a d o g h q We ’ re h i r i n g : j o b s . d a t a d o g h q . c o m

8. –Netflix Technology Blog, 2011 http://bit.ly/netflix-chaos “By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.”

9. “By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.” –Netflix Technology Blog, 2011 http://bit.ly/netflix-chaos

10. “By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.” –Netflix Technology Blog, 2011 http://bit.ly/netflix-chaos

11. “By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.” –Netflix Technology Blog, 2011 http://bit.ly/netflix-chaos

12. “By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them. So next time an instance fails at 3 am on a Sunday, we won’t even notice.” –Netflix Technology Blog, 2011 http://bit.ly/netflix-chaos

13. Don’t be a jerk!

14. Game Days

15. 90 Minutes

16. 90 Minutes 30 minutes planning.

17. 90 Minutes 30 minutes planning. 50 minutes playing.

18. 90 Minutes 30 minutes planning. 50 minutes playing. 10 minutes reporting.

19. The Team

20. The Team 1 subject matter expert

21. The Team 1 subject matter expert 1 SRE

22. The Team 1 subject matter expert 1 SRE Sometimes, 1 junior engineer/developer

23. Before • Schedule it.

24. Before • Schedule it. • Pick tests. Start easy.

25. Before • Schedule it. • Pick tests. Start easy. • Write down what you expect to happen.

26. Before • Schedule it. • Pick tests. Start easy. • Write down what you expect to happen. • Write down your plan if things go wrong.

27. Before • Schedule it. • Pick tests. Start easy. • Write down what you expect to happen. • Write down your plan if things go wrong. • Share your document!

28. During • Staging -> Production (off-peak) -> Production (primetime)

29. During • Staging -> Production (off-peak) -> Production (primetime) • Announce start in group chat.

30. During • Staging -> Production (off-peak) -> Production (primetime) • Announce start in group chat. • Maintain discussion in group chat.

31. During • Staging -> Production (off-peak) -> Production (primetime) • Announce start in group chat. • Maintain discussion in group chat. • Monitor for outages.

32. During • Staging -> Production (off-peak) -> Production (primetime) • Announce start in group chat. • Maintain discussion in group chat. • Monitor for outages. • Run your test and take notes!

33. After • Create cards/tickets to track issues that need work.

34. After • Create cards/tickets to track issues that need work. • Write a summary & key lessons.

35. After • Create cards/tickets to track issues that need work. • Write a summary & key lessons. • Email to all of engineering.

36. After • Create cards/tickets to track issues that need work. • Write a summary & key lessons. • Email to all of engineering. • Celebrate!

37. 🍾🎉🐶

38. Game Levels 0. Terminate the service. Block access to 1 dependency.

39. Game Levels 0. Terminate the service. Block access to 1 dependency. 1. Block access to all dependencies.

40. Game Levels 0. Terminate the service. Block access to 1 dependency. 1. Block access to all dependencies. 2. Terminate the host.

41. Game Levels 0. Terminate the service. Block access to 1 dependency. 1. Block access to all dependencies. 2. Terminate the host. 3. Degrade the environment. Monitor & alert on Latency, Errors, Traffic, & Saturation

42. Game Levels 0. Terminate the service. Block access to 1 dependency. 1. Block access to all dependencies. 2. Terminate the host. 3. Degrade the environment. 4. Spike traffic.

43. Game Levels 0. Terminate the service. Block access to 1 dependency. 1. Block access to all dependencies. 2. Terminate the host. 3. Degrade the environment. 4. Spike traffic. 5. Terminate the region/cloud.

44. Experiment time!

45. Review Share! Plan! Have fun!

46. Additional Resources • systemctl, kill, iptables

47. Additional Resources • systemctl, kill, iptables • Comcast - https://github.com/tylertreat/comcast

48. Additional Resources • systemctl, kill, iptables • Comcast - https://github.com/tylertreat/comcast • Vegeta - https://github.com/tsenart/vegeta

49. Additional Resources • systemctl, kill, iptables • Comcast - https://github.com/tylertreat/comcast • Vegeta - https://github.com/tsenart/vegeta • Kubernetes & Istio - https://istio.io/docs/tasks/traffic-management/fault-injection/

50.

51.

52.

53. apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: wordpress-virtualservice spec: hosts: - "*" gateways: - wordpress-gateway http: - route: - destination: host: wordpress fault: abort: percentage: value: 50.0 httpStatus: 500 delay: percentage: value: 50.0 fixedDelay: 30s

54. Additional Resources • Gremlin - https://www.gremlin.com/ • ChaosToolkit - https://chaostoolkit.org/

55. Questions? jason.yee@datadoghq.com @gitbisect

56. Slides: bit.ly/cmr-chaos  Documents: bit.ly/cmr-chaos-docs jason.yee@datadoghq.com @gitbisect