SlideShare a Scribd company logo
1 of 50
Download to read offline
Section Break
Own Your Own Impact: Incident Response at Airbnb
Cameron Tuckerman-Lee, Site Reliability, Airbnb
Own Your Own Impact:
Incident Response at
Airbnb
CAMERON TUCKERMAN-LEE / 2016 NOVEMBER 17 / FUTURESTACK
@TUCKERMA
N
Who am I?
DevOps &
Sysops
Convincing 50
people go on-call
for all of Airbnb
voluntarily.
People First

On-Call
The future of what
it means to take
the pager at
Airbnb.
Tooling
The tools and
services we use
that make this all
possible.
Incidents

Walk-Thru
Real past incidents
at Airbnb.
Airbnb Open &
New Relic
Feeling so
comfortable about
our biggest launch
ever that I’m here
instead of in a war
room.
Owning Your Own Impact
Our next 45 Minutes Together
Devops & Sysops
Development Operations
DevOpsDevelopment Operations
DevOps
Usually the Best of Both Worlds
• Velocity: Teams are able to move quickly as they own both the development and
release of their project.
• Ownership: Smaller teams owning a product area or feature. In depth knowledge
about its architecture and how to operate it.
• Automation: Incentivized to automate as much of the operations as possible.
DevOps
Sometimes the Worst of Both Worlds
• Upstream Outages: Cloud provider outages
• Not So Isolated: Features often have to come together in the view layer, where a
problem with one can cause a problem with another.
• Incident Management: Managing the lifecycle of an incident, gauging customer
impact, communicating to stake holders, prioritizing remediations are hard work and
specialized knowledge.
• Coordination: The same incident is happening to multiple teams and they are all
investigating from square one.
Sysops
History
SysopsVolunteersSRE
Airbnb hires the first SRE tasked
with ensuring the site doesn’t go
down. He is on-call 24/7
responding to all incidents.
Airbnb starts to get a bit bigger.
Incidents are more frequent and
the stakes are higher. Engineers
volunteer to help out the SRE
one day at the weekly meeting.
The volunteer group grows, is
formalized, and officially takes
over owning incident response.
Hats were made.
Sysops
History
SysopsVolunteersSRE
Airbnb hires the first SRE tasked
with ensuring the site doesn’t go
down. He is on-call 24/7
responding to all incidents.
Airbnb starts to get a bit bigger.
Incidents are more frequent and
the stakes are higher. Engineers
volunteer to help out the SRE
one day at the weekly meeting.
The volunteer group grows, is
formalized, and officially takes
over owning incident response.
Hats are made.
Sysops
History
SysopsVolunteersSRE
Airbnb hires the first SRE tasked
with ensuring the site doesn’t go
down. He is on-call 24/7
responding to all incidents.
Airbnb starts to get a bit bigger.
Incidents are more frequent and
the stakes are higher. Engineers
volunteer to help out the SRE
one day at the weekly meeting.
The volunteer group grows, is
formalized, and officially takes
over owning incident response.
Hats are made. Sysops is born.
Sysops
History
SysopsVolunteersSRE
Airbnb hires the first SRE tasked
with ensuring the site doesn’t go
down. He is on-call 24/7
responding to all incidents.
Airbnb starts to get a bit bigger.
Incidents are more frequent and
the stakes are higher. Engineers
volunteer to help out the SRE
one day at the weekly meeting.
The volunteer group grows, is
formalized, and officially takes
over owning incident response.
Hats are made.
Sysops at Airbnb
Today
• Volunteer Group: Frontend and backend. Product and infrastructure. Individual
contributor and manager.
• Specialized Training: Biennial training covering infrastructure, incident response,
communication, and pizza.
• Ownership: Escalation point of last resort. Primary on-call has the authority to make
important decisions in the moment.
25 50 33%
Engineers Currently

On Sysops
Engineers in

Primary Rotation
Engineers Attended
Sysops Training
Sysops at Airbnb
Today
How does this work?
• Reliability Matters: The uptime of our site isn’t just important for conversions, it’s
important for the entire end-to-end experience.
• Making Mistakes: Blameless postmortem. Blameless remediation.
• Learning Opportunity: Training, collaboration, truly “fullstack".
• Tooling: Tooling is built with Sysops in mind.
• SRE: SRE builds tools that make on-calls more productive.
• Culture: Sysops grew out of our core values; cereal entrepreneurship.
People First On-Call
People First On-Call
The Future of On-Call at Airbnb
• Pager-Life Balance: Ensure that more involved, tenured engineers aren’t always the
ones waking up at 3 AM to put out fires.
• Learning/Growth Focused: Continuing education and learning opportunities for on-
call engineers.
• Evaluation Metrics: Engineers should know where they can improve and should be
recognized for excellent work.
• Intelligent Scheduling: In DevOps when every team has at least two on-call
rotations, how can we schedule around lives outside of work (and responsibilities
inside of work)?
Tooling
STATSD Protocol
Measure Everything
https://codeascraft.com/2011/02/15/measure-anything-measure-everything/
• Metrics Live in Code: When adding new features in the codebase, engineers can instrument them at the
same time in line with the feature.
• System/Business Metrics: Allows single dashboards where you can correlate low and high level metrics
together, e.g. packets-in and number of nights booked.
• Simple: Engineers don’t need to ask for permission, submit PRs to the SRE team, or use other tools.
Adding metrics is as easy as thinking of a metric name.
Alerts
Interferon
https://github.com/airbnb/interferon
• Configuration as Code: Exposes best practices and allows re-syncing metrics in the case of
emergencies.
• Autogeneration of Alerts: System-CPU alerts get created for every host, and alerts are automatically
routed to the owner based on “Host Sources”.
• Extensible: Only used for our STATSD metrics, but are working on adding New Relic. Open Source so
you can add your own!
Example Alert
High CPU
name “#{@hostinfo[:role]}: load spiking”
message <<<EOM
You can put the body of your alert here if the
destination supports messages.
Context. Context. Context!
EOM
applies !@hostinfo[:role].end_with?(‘-test’)
notify.groups [
‘sre’,
@hostinfo[:owner]
]
metric.query <<<EOQ
min(last_5m):avg:
system.load.norm.1{role:#{@hostinfo[:role]}}
> 0.95
EOQ
Postmortems
Incidents Reporter
• Blameless Postmortems:
Goal is to learn how to prevent/
mitigate it in the future.
• Tracking Impact: Allows
resource allocation to take into
account the needs of bettering
reliability.
• Follow Up: Integration with
JIRA for bug-fix/improvement
tracking.
New Relic at Airbnb
Browser
Not every engineer is a frontend engineer…
• Our Product is in the Browser: Backend changes by backend engineers can have
profound (i.e. terrible) implications for front-end code.
• Alerts: Before Browser, errors/metrics would have to propagate to the backend
before engineers would be alerted. Now we can be alerted about regressions.
• Browsers Your Engineers Don’t Us: Engineers at Airbnb only use Chrome — that
isn’t true about our users.
Browser Error
Reporting
APM
Every application, every deploy
• Out of the Box: Chef auto-{installs, configures} the New Relic agent for all services
without the engineer having to know about it.
• Consistent Monitoring Across Platforms: At Airbnb, we have java, node, and ruby
services — APM gives all engineers a familiar monitoring tool without having to be
familiar with the underlying framework.
• Deploying: Engineers know that they can monitor the health of a code deployment
for any service (and downstream services) in APM. We also embed relevant APM
information in our deploy tool, Deployboard.
Pale Yellow
Uh oh…
Synthetics
Last Line of Defense
Can sleep comfortably at night
knowing that, if other safeguards
fail, any user-facing downtime
will get caught.
Real World Walk-Throughs
Frontend Response Time Degradation
Questions?
• Which database is having the problem? Use the “Databases” tab to dig into which query is adding the
most to the response time.
• What is the end user impact? The query will show which transactions/controllers it is part of.
• How can I recreate the regression? New Relic automatically samples slow queries — allowing you to dig
into them. Includes information such as which index is being used for the query.
Remember that one time that toasters and webcams
tried to take down the internet?
First Attack Second Attack
Questions?
• Where are the errors coming from? Both database queries timing and out from clients being unable to
resolve DNS names.
• Which database is having the problem? All of them.
• Which external providers are having the problem? Some of them.
Airbnb Open
Airbnb Open
Using New Relic for a Live Event Launch
• Development: All applications developed at Airbnb come with New Relic “out of the
box” without additional configuration. Developers are using APM from day one.
• Deployment: As you make changes to dark code, you can ensure that your deploy
doesn’t impact live code paths.
• Load Testing: Understand how increased load affects your service and its external
service dependencies.
• Launch: The day of the big keynote presentation, you are able to monitor for
performance and regressions and triangulate their cause and understand their impact
on the end user.
• Allspaw, John. Blameless PostMortems and a Just Culture. Code as Craft, 2012.

< https://codeascraft.com/2012/05/22/blameless-postmortems/ >
• Beyer, Betsy, Chris Jones, and Jennifer Petoff. Site Reliability Engineering: How Google Runs
Production Systems. O'Reilly Media, 2016.
• Fowler, Susan. Who’s On Call? 2016.

< http://www.susanjfowler.com/blog/2016/9/6/whos-on-call >
• Malpass, Ian. Measure Anything, Measure Everything. Code as Craft, 2011.

< https://codeascraft.com/2011/02/15/measure-anything-measure-everything/ >
• Underwood, Todd. PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability. Lisa,
2013.

< https://www.usenix.org/conference/lisa13/technical-sessions/plenary/underwood >
Inspiration and Citations
Thanks!

More Related Content

What's hot

Henrik Kniberg - Essence of Agile
Henrik Kniberg - Essence of AgileHenrik Kniberg - Essence of Agile
Henrik Kniberg - Essence of Agile
AgileSparks
 
iSQI Certification Days DASA – DevOps & ISTQB Frank Frambach
iSQI Certification Days DASA – DevOps & ISTQB Frank FrambachiSQI Certification Days DASA – DevOps & ISTQB Frank Frambach
iSQI Certification Days DASA – DevOps & ISTQB Frank Frambach
Ievgenii Katsan
 

What's hot (20)

Pragmatic Microservices
Pragmatic MicroservicesPragmatic Microservices
Pragmatic Microservices
 
The History of DevOps (and what you need to do about it)
The History of DevOps (and what you need to do about it)The History of DevOps (and what you need to do about it)
The History of DevOps (and what you need to do about it)
 
How HipChat Ships and Recovers Fast with DevOps Practices
How HipChat Ships and Recovers Fast with DevOps PracticesHow HipChat Ships and Recovers Fast with DevOps Practices
How HipChat Ships and Recovers Fast with DevOps Practices
 
Top Lessons Learned While Researching and Writing The DevOps Handbook
Top Lessons Learned While Researching and Writing The DevOps HandbookTop Lessons Learned While Researching and Writing The DevOps Handbook
Top Lessons Learned While Researching and Writing The DevOps Handbook
 
DevOps Kaizen: Find and Fix What is Really Behind Your Problems
DevOps Kaizen: Find and Fix What is Really Behind Your ProblemsDevOps Kaizen: Find and Fix What is Really Behind Your Problems
DevOps Kaizen: Find and Fix What is Really Behind Your Problems
 
DOES SFO 2016 - Greg Padak - Default to Open
DOES SFO 2016 - Greg Padak - Default to OpenDOES SFO 2016 - Greg Padak - Default to Open
DOES SFO 2016 - Greg Padak - Default to Open
 
Adobe Presents Internal Service Delivery Platform at Velocity 13 Santa Clara
Adobe Presents Internal Service Delivery Platform at Velocity 13 Santa ClaraAdobe Presents Internal Service Delivery Platform at Velocity 13 Santa Clara
Adobe Presents Internal Service Delivery Platform at Velocity 13 Santa Clara
 
DOES16 London - Better Faster Cheaper .. How?
DOES16 London - Better Faster Cheaper .. How? DOES16 London - Better Faster Cheaper .. How?
DOES16 London - Better Faster Cheaper .. How?
 
DDD In Agile
DDD In Agile   DDD In Agile
DDD In Agile
 
Agile Everywhere! - Henrik Kniberg
Agile Everywhere! - Henrik KnibergAgile Everywhere! - Henrik Kniberg
Agile Everywhere! - Henrik Kniberg
 
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...
Without Self-Service Operations, the Cloud is Just Expensive Hosting 2.0 - (a...
 
Henrik Kniberg - Essence of Agile
Henrik Kniberg - Essence of AgileHenrik Kniberg - Essence of Agile
Henrik Kniberg - Essence of Agile
 
iSQI Certification Days DASA – DevOps & ISTQB Frank Frambach
iSQI Certification Days DASA – DevOps & ISTQB Frank FrambachiSQI Certification Days DASA – DevOps & ISTQB Frank Frambach
iSQI Certification Days DASA – DevOps & ISTQB Frank Frambach
 
All daydevops 2016 - Turning Human Capital into High Performance Organizati...
All daydevops   2016 - Turning Human Capital into High Performance Organizati...All daydevops   2016 - Turning Human Capital into High Performance Organizati...
All daydevops 2016 - Turning Human Capital into High Performance Organizati...
 
All Change! How the new economics of Cloud will make you think differently ab...
All Change! How the new economics of Cloud will make you think differently ab...All Change! How the new economics of Cloud will make you think differently ab...
All Change! How the new economics of Cloud will make you think differently ab...
 
The Importance of Culture: Building and Sustaining Effective Engineering Org...
The Importance of Culture:  Building and Sustaining Effective Engineering Org...The Importance of Culture:  Building and Sustaining Effective Engineering Org...
The Importance of Culture: Building and Sustaining Effective Engineering Org...
 
Turning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational CapitalTurning Human Capital into High Performance Organizational Capital
Turning Human Capital into High Performance Organizational Capital
 
CampDevOps keynote - DevOps: Using 'Lean' to eliminate Bottlenecks
CampDevOps keynote - DevOps: Using 'Lean' to eliminate BottlenecksCampDevOps keynote - DevOps: Using 'Lean' to eliminate Bottlenecks
CampDevOps keynote - DevOps: Using 'Lean' to eliminate Bottlenecks
 
Beyond Agile and DevOps: From Concepts to Products in Weeks, Not Months
Beyond Agile and DevOps: From Concepts to Products in Weeks, Not MonthsBeyond Agile and DevOps: From Concepts to Products in Weeks, Not Months
Beyond Agile and DevOps: From Concepts to Products in Weeks, Not Months
 
DOES SFO 2016 - Topo Pal - DevOps at Capital One
DOES SFO 2016 - Topo Pal - DevOps at Capital OneDOES SFO 2016 - Topo Pal - DevOps at Capital One
DOES SFO 2016 - Topo Pal - DevOps at Capital One
 

Viewers also liked

Airbnb Marketing Plan _ Sample
Airbnb Marketing Plan _ SampleAirbnb Marketing Plan _ Sample
Airbnb Marketing Plan _ Sample
LeeAnn Hardwell
 

Viewers also liked (20)

Airbnb and the Hotel Industry
Airbnb and the Hotel Industry Airbnb and the Hotel Industry
Airbnb and the Hotel Industry
 
Airbnb business model analysis
Airbnb business model analysisAirbnb business model analysis
Airbnb business model analysis
 
Airbnb Marketing Plan _ Sample
Airbnb Marketing Plan _ SampleAirbnb Marketing Plan _ Sample
Airbnb Marketing Plan _ Sample
 
AirBnB Pitch Deck
AirBnB Pitch Deck AirBnB Pitch Deck
AirBnB Pitch Deck
 
Human Factors and PostMortems
Human Factors and PostMortemsHuman Factors and PostMortems
Human Factors and PostMortems
 
March Marketers: Research Trends Presentation
March Marketers: Research Trends PresentationMarch Marketers: Research Trends Presentation
March Marketers: Research Trends Presentation
 
It's Not Your Fault - Blameless Post-mortems
It's Not Your Fault - Blameless Post-mortemsIt's Not Your Fault - Blameless Post-mortems
It's Not Your Fault - Blameless Post-mortems
 
Customer experience programs in b2b - Analysegruppen whitepaper 2014
Customer experience programs in b2b - Analysegruppen whitepaper 2014Customer experience programs in b2b - Analysegruppen whitepaper 2014
Customer experience programs in b2b - Analysegruppen whitepaper 2014
 
Startup AddVenture 2013 - Marvin Liao - Marketing Lessons from Silicon Valley...
Startup AddVenture 2013 - Marvin Liao - Marketing Lessons from Silicon Valley...Startup AddVenture 2013 - Marvin Liao - Marketing Lessons from Silicon Valley...
Startup AddVenture 2013 - Marvin Liao - Marketing Lessons from Silicon Valley...
 
Growth Hacking Marketing for Mobile Video App Angl - Startup
Growth Hacking Marketing for Mobile Video App Angl - StartupGrowth Hacking Marketing for Mobile Video App Angl - Startup
Growth Hacking Marketing for Mobile Video App Angl - Startup
 
Growth Hacking: How to grow a bootstrapped startup - by StartupCVs.com
Growth Hacking: How to grow a bootstrapped startup - by StartupCVs.comGrowth Hacking: How to grow a bootstrapped startup - by StartupCVs.com
Growth Hacking: How to grow a bootstrapped startup - by StartupCVs.com
 
Growth Hacking @ Sup de Pub International Track — part 1
Growth Hacking @ Sup de Pub International Track — part 1Growth Hacking @ Sup de Pub International Track — part 1
Growth Hacking @ Sup de Pub International Track — part 1
 
Ryerson Startup School 2015 - Growth Hacking 2015
Ryerson Startup School 2015 - Growth Hacking 2015Ryerson Startup School 2015 - Growth Hacking 2015
Ryerson Startup School 2015 - Growth Hacking 2015
 
Growth Hacking: Ways & tricks to get traction for your startup
Growth Hacking: Ways & tricks to get traction for your startupGrowth Hacking: Ways & tricks to get traction for your startup
Growth Hacking: Ways & tricks to get traction for your startup
 
What’s Driving Brand Loyalty In The B2B Automotive Market?
What’s Driving Brand Loyalty In The B2B Automotive Market?What’s Driving Brand Loyalty In The B2B Automotive Market?
What’s Driving Brand Loyalty In The B2B Automotive Market?
 
Art pitch startupweekend grenoble
Art pitch startupweekend grenobleArt pitch startupweekend grenoble
Art pitch startupweekend grenoble
 
Building B2B Client Loyalty and Increasing Wallet Share
Building B2B Client Loyalty and Increasing Wallet ShareBuilding B2B Client Loyalty and Increasing Wallet Share
Building B2B Client Loyalty and Increasing Wallet Share
 
Introduction to Startup Marketing * Launch Startup *Growth Hacking
Introduction to Startup Marketing * Launch Startup *Growth Hacking Introduction to Startup Marketing * Launch Startup *Growth Hacking
Introduction to Startup Marketing * Launch Startup *Growth Hacking
 
Airbnb presentation
Airbnb presentationAirbnb presentation
Airbnb presentation
 
Airbnb
AirbnbAirbnb
Airbnb
 

Similar to Own Your Own Impact: Incident Response at Airbnb [FutureStack16]

DevOps Days Toronto: From 6 Months Waterfall to 1 hour Code Deploys
DevOps Days Toronto: From 6 Months Waterfall to 1 hour Code DeploysDevOps Days Toronto: From 6 Months Waterfall to 1 hour Code Deploys
DevOps Days Toronto: From 6 Months Waterfall to 1 hour Code Deploys
Andreas Grabner
 

Similar to Own Your Own Impact: Incident Response at Airbnb [FutureStack16] (20)

Dev Ops without the Ops
Dev Ops without the OpsDev Ops without the Ops
Dev Ops without the Ops
 
AppSec Pipelines and Event based Security
AppSec Pipelines and Event based SecurityAppSec Pipelines and Event based Security
AppSec Pipelines and Event based Security
 
Subverting the monolith!
Subverting the monolith!Subverting the monolith!
Subverting the monolith!
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
 
DevOps Days Toronto: From 6 Months Waterfall to 1 hour Code Deploys
DevOps Days Toronto: From 6 Months Waterfall to 1 hour Code DeploysDevOps Days Toronto: From 6 Months Waterfall to 1 hour Code Deploys
DevOps Days Toronto: From 6 Months Waterfall to 1 hour Code Deploys
 
Chaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient SystemsChaos Engineering: Why the World Needs More Resilient Systems
Chaos Engineering: Why the World Needs More Resilient Systems
 
Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE Incident Management in the Age of DevOps and SRE
Incident Management in the Age of DevOps and SRE
 
The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management The Last Mile Continued: Incident Management
The Last Mile Continued: Incident Management
 
Confoo-Montreal-2016: Controlling Your Environments using Infrastructure as Code
Confoo-Montreal-2016: Controlling Your Environments using Infrastructure as CodeConfoo-Montreal-2016: Controlling Your Environments using Infrastructure as Code
Confoo-Montreal-2016: Controlling Your Environments using Infrastructure as Code
 
WSO2Con EU 2015: Keynote - Cloud Native Apps… from a user point of view
WSO2Con EU 2015: Keynote - Cloud Native Apps… from a user point of viewWSO2Con EU 2015: Keynote - Cloud Native Apps… from a user point of view
WSO2Con EU 2015: Keynote - Cloud Native Apps… from a user point of view
 
Continuous Deployment
Continuous DeploymentContinuous Deployment
Continuous Deployment
 
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...
Flink Forward Berlin 2018: Wei-Che (Tony) Wei - "Lessons learned from Migrati...
 
Bringing Open-Source Practices to Your Day Job
Bringing Open-Source Practices to Your Day JobBringing Open-Source Practices to Your Day Job
Bringing Open-Source Practices to Your Day Job
 
2019 Top Lessons Learned Since the Phoenix Project Was Released
2019 Top Lessons Learned Since the Phoenix Project Was Released2019 Top Lessons Learned Since the Phoenix Project Was Released
2019 Top Lessons Learned Since the Phoenix Project Was Released
 
Building an Open Source AppSec Pipeline
Building an Open Source AppSec PipelineBuilding an Open Source AppSec Pipeline
Building an Open Source AppSec Pipeline
 
The Anatomy of Continuous Deployment at Scale - 100 deploys a week at Envato ...
The Anatomy of Continuous Deployment at Scale - 100 deploys a week at Envato ...The Anatomy of Continuous Deployment at Scale - 100 deploys a week at Envato ...
The Anatomy of Continuous Deployment at Scale - 100 deploys a week at Envato ...
 
How do we drive tech changes
How do we drive tech changesHow do we drive tech changes
How do we drive tech changes
 
Continuous Integration Best Practices for Software Development Teams - AWS On...
Continuous Integration Best Practices for Software Development Teams - AWS On...Continuous Integration Best Practices for Software Development Teams - AWS On...
Continuous Integration Best Practices for Software Development Teams - AWS On...
 
Automic Support Tips and Tricks
Automic Support Tips and TricksAutomic Support Tips and Tricks
Automic Support Tips and Tricks
 
DOES16 San Francisco - David Blank-Edelman - Lessons Learned from a Parallel ...
DOES16 San Francisco - David Blank-Edelman - Lessons Learned from a Parallel ...DOES16 San Francisco - David Blank-Edelman - Lessons Learned from a Parallel ...
DOES16 San Francisco - David Blank-Edelman - Lessons Learned from a Parallel ...
 

More from New Relic

More from New Relic (20)

7 Tips & Tricks to Having Happy Customers at Scale
7 Tips & Tricks to Having Happy Customers at Scale7 Tips & Tricks to Having Happy Customers at Scale
7 Tips & Tricks to Having Happy Customers at Scale
 
7 Tips & Tricks to Having Happy Customers at Scale
7 Tips & Tricks to Having Happy Customers at Scale7 Tips & Tricks to Having Happy Customers at Scale
7 Tips & Tricks to Having Happy Customers at Scale
 
New Relic University at Future Stack Tokyo 2019
New Relic University at Future Stack Tokyo 2019New Relic University at Future Stack Tokyo 2019
New Relic University at Future Stack Tokyo 2019
 
FutureStack Tokyo 19 -[事例講演]株式会社リクルートライフスタイル:年間9300万件以上のサロン予約を支えるホットペッパービューティ...
FutureStack Tokyo 19 -[事例講演]株式会社リクルートライフスタイル:年間9300万件以上のサロン予約を支えるホットペッパービューティ...FutureStack Tokyo 19 -[事例講演]株式会社リクルートライフスタイル:年間9300万件以上のサロン予約を支えるホットペッパービューティ...
FutureStack Tokyo 19 -[事例講演]株式会社リクルートライフスタイル:年間9300万件以上のサロン予約を支えるホットペッパービューティ...
 
FutureStack Tokyo 19 -[New Relic テクニカル講演]モニタリングと可視化がデジタルトランスフォーメーションを救う! - サ...
FutureStack  Tokyo 19 -[New Relic テクニカル講演]モニタリングと可視化がデジタルトランスフォーメーションを救う! - サ...FutureStack  Tokyo 19 -[New Relic テクニカル講演]モニタリングと可視化がデジタルトランスフォーメーションを救う! - サ...
FutureStack Tokyo 19 -[New Relic テクニカル講演]モニタリングと可視化がデジタルトランスフォーメーションを救う! - サ...
 
FutureStack Tokyo 19 -[特別講演]システム開発によろこびと驚きの連鎖を
FutureStack Tokyo 19 -[特別講演]システム開発によろこびと驚きの連鎖をFutureStack Tokyo 19 -[特別講演]システム開発によろこびと驚きの連鎖を
FutureStack Tokyo 19 -[特別講演]システム開発によろこびと驚きの連鎖を
 
FutureStack Tokyo 19 -[パートナー講演]アマゾン ウェブ サービス ジャパン株式会社: New Relicを活用したAWSへのアプリ...
FutureStack Tokyo 19 -[パートナー講演]アマゾン ウェブ サービス ジャパン株式会社: New Relicを活用したAWSへのアプリ...FutureStack Tokyo 19 -[パートナー講演]アマゾン ウェブ サービス ジャパン株式会社: New Relicを活用したAWSへのアプリ...
FutureStack Tokyo 19 -[パートナー講演]アマゾン ウェブ サービス ジャパン株式会社: New Relicを活用したAWSへのアプリ...
 
FutureStack Tokyo 19_インサイトとデータを組織の力にする_株式会社ドワンゴ 池田 明啓 氏
FutureStack Tokyo 19_インサイトとデータを組織の力にする_株式会社ドワンゴ 池田 明啓 氏FutureStack Tokyo 19_インサイトとデータを組織の力にする_株式会社ドワンゴ 池田 明啓 氏
FutureStack Tokyo 19_インサイトとデータを組織の力にする_株式会社ドワンゴ 池田 明啓 氏
 
Three Monitoring Mistakes and How to Avoid Them
Three Monitoring Mistakes and How to Avoid ThemThree Monitoring Mistakes and How to Avoid Them
Three Monitoring Mistakes and How to Avoid Them
 
Intro to Multidimensional Kubernetes Monitoring
Intro to Multidimensional Kubernetes MonitoringIntro to Multidimensional Kubernetes Monitoring
Intro to Multidimensional Kubernetes Monitoring
 
FS18 Chicago Keynote
FS18 Chicago Keynote FS18 Chicago Keynote
FS18 Chicago Keynote
 
SRE-iously
SRE-iouslySRE-iously
SRE-iously
 
10 Things You Can Do With New Relic - Number 9 Will Shock You
10 Things You Can Do With New Relic - Number 9 Will Shock You10 Things You Can Do With New Relic - Number 9 Will Shock You
10 Things You Can Do With New Relic - Number 9 Will Shock You
 
Ground Rules for Code Reviews
Ground Rules for Code ReviewsGround Rules for Code Reviews
Ground Rules for Code Reviews
 
Understanding Microservice Latency for DevOps Teams: An Introduction to New R...
Understanding Microservice Latency for DevOps Teams: An Introduction to New R...Understanding Microservice Latency for DevOps Teams: An Introduction to New R...
Understanding Microservice Latency for DevOps Teams: An Introduction to New R...
 
Monitor all your Kubernetes and EKS stack with New Relic
Monitor all your Kubernetes and EKS stack with New Relic	Monitor all your Kubernetes and EKS stack with New Relic
Monitor all your Kubernetes and EKS stack with New Relic
 
Host for the Most: Cloud Cost Optimization
Host for the Most: Cloud Cost OptimizationHost for the Most: Cloud Cost Optimization
Host for the Most: Cloud Cost Optimization
 
New Relic Infrastructure in the Real World: AWS
New Relic Infrastructure in the Real World: AWSNew Relic Infrastructure in the Real World: AWS
New Relic Infrastructure in the Real World: AWS
 
Best Practices for Measuring your Code Pipeline
Best Practices for Measuring your Code PipelineBest Practices for Measuring your Code Pipeline
Best Practices for Measuring your Code Pipeline
 
Top Three Mistakes People Make with Monitoring
Top Three Mistakes People Make with MonitoringTop Three Mistakes People Make with Monitoring
Top Three Mistakes People Make with Monitoring
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
Enterprise Knowledge
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Own Your Own Impact: Incident Response at Airbnb [FutureStack16]

  • 1. Section Break Own Your Own Impact: Incident Response at Airbnb Cameron Tuckerman-Lee, Site Reliability, Airbnb
  • 2. Own Your Own Impact: Incident Response at Airbnb CAMERON TUCKERMAN-LEE / 2016 NOVEMBER 17 / FUTURESTACK @TUCKERMA N
  • 4. DevOps & Sysops Convincing 50 people go on-call for all of Airbnb voluntarily. People First
 On-Call The future of what it means to take the pager at Airbnb. Tooling The tools and services we use that make this all possible. Incidents
 Walk-Thru Real past incidents at Airbnb. Airbnb Open & New Relic Feeling so comfortable about our biggest launch ever that I’m here instead of in a war room. Owning Your Own Impact Our next 45 Minutes Together
  • 8. DevOps Usually the Best of Both Worlds • Velocity: Teams are able to move quickly as they own both the development and release of their project. • Ownership: Smaller teams owning a product area or feature. In depth knowledge about its architecture and how to operate it. • Automation: Incentivized to automate as much of the operations as possible.
  • 9. DevOps Sometimes the Worst of Both Worlds • Upstream Outages: Cloud provider outages • Not So Isolated: Features often have to come together in the view layer, where a problem with one can cause a problem with another. • Incident Management: Managing the lifecycle of an incident, gauging customer impact, communicating to stake holders, prioritizing remediations are hard work and specialized knowledge. • Coordination: The same incident is happening to multiple teams and they are all investigating from square one.
  • 10. Sysops History SysopsVolunteersSRE Airbnb hires the first SRE tasked with ensuring the site doesn’t go down. He is on-call 24/7 responding to all incidents. Airbnb starts to get a bit bigger. Incidents are more frequent and the stakes are higher. Engineers volunteer to help out the SRE one day at the weekly meeting. The volunteer group grows, is formalized, and officially takes over owning incident response. Hats were made.
  • 11. Sysops History SysopsVolunteersSRE Airbnb hires the first SRE tasked with ensuring the site doesn’t go down. He is on-call 24/7 responding to all incidents. Airbnb starts to get a bit bigger. Incidents are more frequent and the stakes are higher. Engineers volunteer to help out the SRE one day at the weekly meeting. The volunteer group grows, is formalized, and officially takes over owning incident response. Hats are made.
  • 12. Sysops History SysopsVolunteersSRE Airbnb hires the first SRE tasked with ensuring the site doesn’t go down. He is on-call 24/7 responding to all incidents. Airbnb starts to get a bit bigger. Incidents are more frequent and the stakes are higher. Engineers volunteer to help out the SRE one day at the weekly meeting. The volunteer group grows, is formalized, and officially takes over owning incident response. Hats are made. Sysops is born.
  • 13. Sysops History SysopsVolunteersSRE Airbnb hires the first SRE tasked with ensuring the site doesn’t go down. He is on-call 24/7 responding to all incidents. Airbnb starts to get a bit bigger. Incidents are more frequent and the stakes are higher. Engineers volunteer to help out the SRE one day at the weekly meeting. The volunteer group grows, is formalized, and officially takes over owning incident response. Hats are made.
  • 14. Sysops at Airbnb Today • Volunteer Group: Frontend and backend. Product and infrastructure. Individual contributor and manager. • Specialized Training: Biennial training covering infrastructure, incident response, communication, and pizza. • Ownership: Escalation point of last resort. Primary on-call has the authority to make important decisions in the moment.
  • 15. 25 50 33% Engineers Currently
 On Sysops Engineers in
 Primary Rotation Engineers Attended Sysops Training Sysops at Airbnb Today
  • 16. How does this work? • Reliability Matters: The uptime of our site isn’t just important for conversions, it’s important for the entire end-to-end experience. • Making Mistakes: Blameless postmortem. Blameless remediation. • Learning Opportunity: Training, collaboration, truly “fullstack". • Tooling: Tooling is built with Sysops in mind. • SRE: SRE builds tools that make on-calls more productive. • Culture: Sysops grew out of our core values; cereal entrepreneurship.
  • 17.
  • 18.
  • 19.
  • 21. People First On-Call The Future of On-Call at Airbnb • Pager-Life Balance: Ensure that more involved, tenured engineers aren’t always the ones waking up at 3 AM to put out fires. • Learning/Growth Focused: Continuing education and learning opportunities for on- call engineers. • Evaluation Metrics: Engineers should know where they can improve and should be recognized for excellent work. • Intelligent Scheduling: In DevOps when every team has at least two on-call rotations, how can we schedule around lives outside of work (and responsibilities inside of work)?
  • 23. STATSD Protocol Measure Everything https://codeascraft.com/2011/02/15/measure-anything-measure-everything/ • Metrics Live in Code: When adding new features in the codebase, engineers can instrument them at the same time in line with the feature. • System/Business Metrics: Allows single dashboards where you can correlate low and high level metrics together, e.g. packets-in and number of nights booked. • Simple: Engineers don’t need to ask for permission, submit PRs to the SRE team, or use other tools. Adding metrics is as easy as thinking of a metric name.
  • 24. Alerts Interferon https://github.com/airbnb/interferon • Configuration as Code: Exposes best practices and allows re-syncing metrics in the case of emergencies. • Autogeneration of Alerts: System-CPU alerts get created for every host, and alerts are automatically routed to the owner based on “Host Sources”. • Extensible: Only used for our STATSD metrics, but are working on adding New Relic. Open Source so you can add your own!
  • 25. Example Alert High CPU name “#{@hostinfo[:role]}: load spiking” message <<<EOM You can put the body of your alert here if the destination supports messages. Context. Context. Context! EOM applies !@hostinfo[:role].end_with?(‘-test’) notify.groups [ ‘sre’, @hostinfo[:owner] ] metric.query <<<EOQ min(last_5m):avg: system.load.norm.1{role:#{@hostinfo[:role]}} > 0.95 EOQ
  • 26. Postmortems Incidents Reporter • Blameless Postmortems: Goal is to learn how to prevent/ mitigate it in the future. • Tracking Impact: Allows resource allocation to take into account the needs of bettering reliability. • Follow Up: Integration with JIRA for bug-fix/improvement tracking.
  • 27. New Relic at Airbnb
  • 28. Browser Not every engineer is a frontend engineer… • Our Product is in the Browser: Backend changes by backend engineers can have profound (i.e. terrible) implications for front-end code. • Alerts: Before Browser, errors/metrics would have to propagate to the backend before engineers would be alerted. Now we can be alerted about regressions. • Browsers Your Engineers Don’t Us: Engineers at Airbnb only use Chrome — that isn’t true about our users.
  • 30. APM Every application, every deploy • Out of the Box: Chef auto-{installs, configures} the New Relic agent for all services without the engineer having to know about it. • Consistent Monitoring Across Platforms: At Airbnb, we have java, node, and ruby services — APM gives all engineers a familiar monitoring tool without having to be familiar with the underlying framework. • Deploying: Engineers know that they can monitor the health of a code deployment for any service (and downstream services) in APM. We also embed relevant APM information in our deploy tool, Deployboard.
  • 31.
  • 33. Synthetics Last Line of Defense Can sleep comfortably at night knowing that, if other safeguards fail, any user-facing downtime will get caught.
  • 34.
  • 36. Frontend Response Time Degradation
  • 37.
  • 38.
  • 39.
  • 40.
  • 41. Questions? • Which database is having the problem? Use the “Databases” tab to dig into which query is adding the most to the response time. • What is the end user impact? The query will show which transactions/controllers it is part of. • How can I recreate the regression? New Relic automatically samples slow queries — allowing you to dig into them. Includes information such as which index is being used for the query.
  • 42. Remember that one time that toasters and webcams tried to take down the internet?
  • 43.
  • 45.
  • 46. Questions? • Where are the errors coming from? Both database queries timing and out from clients being unable to resolve DNS names. • Which database is having the problem? All of them. • Which external providers are having the problem? Some of them.
  • 48. Airbnb Open Using New Relic for a Live Event Launch • Development: All applications developed at Airbnb come with New Relic “out of the box” without additional configuration. Developers are using APM from day one. • Deployment: As you make changes to dark code, you can ensure that your deploy doesn’t impact live code paths. • Load Testing: Understand how increased load affects your service and its external service dependencies. • Launch: The day of the big keynote presentation, you are able to monitor for performance and regressions and triangulate their cause and understand their impact on the end user.
  • 49. • Allspaw, John. Blameless PostMortems and a Just Culture. Code as Craft, 2012.
 < https://codeascraft.com/2012/05/22/blameless-postmortems/ > • Beyer, Betsy, Chris Jones, and Jennifer Petoff. Site Reliability Engineering: How Google Runs Production Systems. O'Reilly Media, 2016. • Fowler, Susan. Who’s On Call? 2016.
 < http://www.susanjfowler.com/blog/2016/9/6/whos-on-call > • Malpass, Ian. Measure Anything, Measure Everything. Code as Craft, 2011.
 < https://codeascraft.com/2011/02/15/measure-anything-measure-everything/ > • Underwood, Todd. PostOps: A Non-Surgical Tale of Software, Fragility, and Reliability. Lisa, 2013.
 < https://www.usenix.org/conference/lisa13/technical-sessions/plenary/underwood > Inspiration and Citations