SlideShare ist ein Scribd-Unternehmen logo
1 von 83
Avoiding Cascading
Failures
SRECON2016 - Craig Fender - Ravindra Punati
Craig Fender
Craig is presently a Senior Technical
Duty Officer at eBay and is responsible
for commanding all types of large scale
site incidents. In addition to an
undergraduate degree Craig holds
numerous professional certifications
related to the computer industry
(RHCE, SSCA, ITIL and et cetera).
Craig has held several roles at multiple
start-up and fortune 500 companies
such as Senior Systems Engineer,
Project Manager, Presenter and Major
Incident Commander.
Ravindra Punati
Ravindra Punati is leader of the Site
Reliability Engineering team. In other
roles at eBay Ravi has been
responsible for the infrastructure
automation initiatives and cloud
operations. Ravi brings extensive
expertise in the fields of database
engineering, application development
and software as a service product
lines. In addition to holding multiple
degrees in computer science Ravi has
held several roles as an engineer,
architect, manager and executive in
various silicon valley start-ups.
Avoiding Cascading
Failures
EBAY AT A GLANCE
.
OUR
BUSINESS
EBAY AT A GLANCE
$8.6B
Revenue in 2015
$82B
GMV in 2015
800M
Live Listings *
162M
Global Active Buyers *
190
Markets eBay apps
are available in*
57%
Percentage of eBay Inc.
revenue that is
international *
*Q4 2015 data
$33B
2015 MCV
across eBay’s
portfolio of apps
Percentage of GMV
closed on Mobile *
43%
eBay Application
downloads *
304M
EBAY
LEADS IN
MOBILE
9.2M
a tablet is bought via
mobile in the U.S. *
28 sec.
Every
a ladies handbag is
bought via mobile in
the U.S. *
10 sec.
Every
new listings are
added via mobile
every week *
*Q4 2015 data
EBAY OPERATIONS
.
Background on eBay Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 8
Background on eBay Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 9
Background on eBay Operations
• In any mission critical real-time environment where one is
highly incentivized toward maximum uptime one must have:
– redundancy ,
– resiliency and
– robustness.
•These needs give rise to complexity.
•Complexity can lead to fragility.
–Multilayered dependencies can cause cascading failure.
•To manage that complexity certain behaviors, culture,
principles and technology emerge as the most successful.
PREVENT CASCADING FAILURES 10
Background on eBay Operations
•eBay itself is like a
bridge between
buyers . . .
PREVENT CASCADING FAILURES 11
Background on eBay Operations
•. . .and sellers
PREVENT CASCADING FAILURES 12
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 13
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 14
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 15
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 16
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 17
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 18
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 19
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 20
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 21
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 22
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 23
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 24
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 25
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 26
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 27
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 28
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 29
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 30
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 31
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 32
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 33
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 34
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 35
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 36
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 37
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 38
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 39
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 40
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 41
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 42
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 43
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 44
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 45
Background on what SEC and Operations
• Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation
PREVENT CASCADING FAILURES 46
SO FULL OF FAIL
Creating resiliency through intelligently injected failure.
PREVENT CASCADING FAILURES 48
So Full of Fail – Failure is NOT an option.
It’s a requirement !
PREVENT CASCADING FAILURES 49
• The inability of a system or system component to perform a required
function within specified limits. A failure may be produced when an error is encountered.
PREVENT CASCADING FAILURES 50
So Full of Fail – Failure is NOT an option.
It’s a requirement !
So Full of Fail – Atomic failures – Overview
PREVENT CASCADING FAILURES 51
•Software
–Database failure
–Service failures
•Hardware
–LB failures
–Compute failures
–Network failures
So Full of Fail – Cascading failures
PREVENT CASCADING FAILURES 52
•A cascading failure is a failure in a system
of interconnected parts in which the failure of a part can
trigger the failure of successive parts or the whole system.
So Full of Fail – Cascading failures
PREVENT CASCADING FAILURES 53
So Full of Fail – Cascading failures
PREVENT CASCADING FAILURES 54
So Full of Fail – Cascading failures – Software
PREVENT CASCADING FAILURES 55
•Service to Service
–Service A talks to Service
B using using load
balancer virtual ip
•Service B to Database
–Service B talks to data
store using data access
layer
So Full of Fail – Cascading failures – Software
Preventing Cascading Failures 56
Preventing Cascading Failure 57
So Full of Fail – Cascading failures – Software
So Full of Fail – Cascading failures– Software
Preventing Cascading Failure 58
So Full of Fail – Cascading failures – Software -
Database
PREVENT CASCADING FAILURES 59
•Connectivity
failure
•Query timeouts
•SQL errors
So Full of Fail – Cascading failures – Software
PREVENT CASCADING FAILURES 60
•Data access layer identifies any of the above
issues and marks the connection down.
•When the destination is recovered the
path is marked active.
•Time to detect the path state is ~15 ms
•Request for connections to a struggling
DB can lead to to stacking in upstream
•The fail fast waiter queue logic helps the
loaded database
PREVENT CASCADING FAILURES 61
So Full of Fail – Cascading failures – Software
PREVENT CASCADING FAILURES 62
So Full of Fail – Cascading failures – Software
PREVENT CASCADING FAILURES 63
So Full of Fail – Cascading failures – Hardware
PREVENT CASCADING FAILURES 64
•LB failures
•Compute failures
•Network failures
So Full of Fail – Cascading failures – Hardware - Network
PREVENT CASCADING FAILURES
• Traffic organized
in layers.
• Redundant
interconnected
paths.
So Full of Fail – Cascading failures – Hardware
• n = Failure Domain (e.g. Network)
• Sn = Number of Service Instances in Failure Domain n
• Fn = Number of instances of Failure Domain n
PREVENT CASCADING FAILURES
PREVENTION
THROUGH PARADIGM
Prevent and remediate atomic failure in order to prevent
cascading failure scenarios.
PREVENTION THROUGH PARADIGM
•Site code is deployed on a common set of platforms which enforce the
aforementioned tools.
– Engineering and development architecture effort is “front-loaded” to ensure that
all production services can be restarted in exactly the same way no matter what
the underlying command.
– Uniformity of response
– Remediation at the node level can be automated.
•State based and event based monitoring of key operating metrics.
– The 9 or so OS-level metrics we all monitor
•Capacity monitoring for “DR’ compliance
– Two or more co-locations
– One feature per “pool”
•Provision on demand
•Regional code roll
PREVENTING CASCADING FAILURES 68
PARADIGM BEFORE PROCEDURES.
•For maximum rapid remediation your pool
or feature should comply with operational
paradigms.
•If the care and feeding of your particular
beautiful ‘unicorn’ requires special attention
or procedures:
–You’re asking someone to go against
their established paradigm and training.
–At a high rate of speed
–As the rarely seen or remembered
exception
PARADIGM BEFORE PROCEDURES 69
DEALING WITH DISASTER
DIRECTION AND DECISION
Preventing the cascade failure through redirection and
deciding which features to keep.
Preventing Cascading Failures 71
73
PREVENT CASCADING FAILURES
TOOLS NOT DOCUMENTS 74
TOOLS NOT DOCUMENTS 75
PREVENT CASCADING FAILURES 76
DEALING WITH DISASTER - DIRECTION
DEALING WITH DISASTER - DIRECTION
•Graphic about two lane bridge with one lane broken
PREVENTION THROUGH PARADIGM 77
TOOLS NOT DOCUMENTS 78
DEALING WITH DISASTER - DIRECTION
TOOLS NOT DOCUMENTS 79
DEALING WITH DISASTER - DIRECTION
DEALING WITH DISASTER - DIRECTION
•To prevent total failure of the bridge we have to get the kittens
to drive over only one lane
•Connection limits would be required
•Blocking bots
•Scale down to minimal serving modes
•Wiring off features like advertising.
PREVENTION THROUGH PARADIGM 80
TOOLS NOT DOCUMENTS 81
CONCLUSIONS
So Full of Fail– Total Failure Avoidance
•Bubble up anomalies for inspection
•Fail Fast
•Mark down the failed paths
•Preserve the user experiences
•Use system approaches to solve
PREVENT CASCADING FAILURES 83

Weitere ähnliche Inhalte

Andere mochten auch

20110406 comac plurio
20110406 comac plurio20110406 comac plurio
20110406 comac plurioFrank Thinnes
 
20130227 präsentation culture lu
20130227 präsentation culture lu20130227 präsentation culture lu
20130227 präsentation culture luFrank Thinnes
 
Will the Internet replace television?
Will the Internet replace television?Will the Internet replace television?
Will the Internet replace television?sanskover
 
20110526 eurodistrict saarmoselle
20110526 eurodistrict saarmoselle20110526 eurodistrict saarmoselle
20110526 eurodistrict saarmoselleFrank Thinnes
 
eCreative Project presentation
eCreative Project presentationeCreative Project presentation
eCreative Project presentationFrank Thinnes
 
Dlf Corporate Goverance Report
Dlf Corporate Goverance ReportDlf Corporate Goverance Report
Dlf Corporate Goverance ReportPrijit Debnath
 

Andere mochten auch (11)

20110406 comac plurio
20110406 comac plurio20110406 comac plurio
20110406 comac plurio
 
20110511 museumlu
20110511 museumlu20110511 museumlu
20110511 museumlu
 
20130227 präsentation culture lu
20130227 präsentation culture lu20130227 präsentation culture lu
20130227 präsentation culture lu
 
Will the Internet replace television?
Will the Internet replace television?Will the Internet replace television?
Will the Internet replace television?
 
Plurio Culturemondo
Plurio CulturemondoPlurio Culturemondo
Plurio Culturemondo
 
Umts services
Umts servicesUmts services
Umts services
 
20110526 eurodistrict saarmoselle
20110526 eurodistrict saarmoselle20110526 eurodistrict saarmoselle
20110526 eurodistrict saarmoselle
 
eCreative Project presentation
eCreative Project presentationeCreative Project presentation
eCreative Project presentation
 
Management Devlopment
Management DevlopmentManagement Devlopment
Management Devlopment
 
M&A Integration
M&A IntegrationM&A Integration
M&A Integration
 
Dlf Corporate Goverance Report
Dlf Corporate Goverance ReportDlf Corporate Goverance Report
Dlf Corporate Goverance Report
 

Ähnlich wie SRECON16PreventCascadeFailures

Congratulations, You’re a DBA... Now What?
Congratulations, You’re a DBA... Now What?Congratulations, You’re a DBA... Now What?
Congratulations, You’re a DBA... Now What?Embarcadero Technologies
 
Entrepreneurship 101 - Product Development
Entrepreneurship 101 - Product DevelopmentEntrepreneurship 101 - Product Development
Entrepreneurship 101 - Product DevelopmentMaRS Discovery District
 
Ent101 productdevelopment-101111094924-phpapp02
Ent101 productdevelopment-101111094924-phpapp02Ent101 productdevelopment-101111094924-phpapp02
Ent101 productdevelopment-101111094924-phpapp02lornwally
 
The ilities of software engineering.pptx
The ilities of software engineering.pptxThe ilities of software engineering.pptx
The ilities of software engineering.pptxMonica Beckwith
 
CAPINC What's New in SOLIDWORKS 2015
CAPINC What's New in SOLIDWORKS 2015 CAPINC What's New in SOLIDWORKS 2015
CAPINC What's New in SOLIDWORKS 2015 CAPINC
 
Metrics-driven Continuous Delivery
Metrics-driven Continuous DeliveryMetrics-driven Continuous Delivery
Metrics-driven Continuous DeliveryAndrew Phillips
 
CyberArk Engineering: Guilds - our journey to mastery
CyberArk Engineering: Guilds - our journey to masteryCyberArk Engineering: Guilds - our journey to mastery
CyberArk Engineering: Guilds - our journey to masteryDaniel Schwartzer
 
How Agile changed Software Development
How Agile changed Software DevelopmentHow Agile changed Software Development
How Agile changed Software DevelopmentSteve Maraspin
 
Agile Software Development at UPT DEGI | Nov, 2015
Agile Software Development at UPT DEGI | Nov, 2015Agile Software Development at UPT DEGI | Nov, 2015
Agile Software Development at UPT DEGI | Nov, 2015Eduardo Ribeiro
 
HIS 2015: Neil White - Advances in Practical Techniques for Critical Developm...
HIS 2015: Neil White - Advances in Practical Techniques for Critical Developm...HIS 2015: Neil White - Advances in Practical Techniques for Critical Developm...
HIS 2015: Neil White - Advances in Practical Techniques for Critical Developm...AdaCore
 
All about Agile, an Overview - Texavi Tech Bootcamp on How to be agile- Texav...
All about Agile, an Overview - Texavi Tech Bootcamp on How to be agile- Texav...All about Agile, an Overview - Texavi Tech Bootcamp on How to be agile- Texav...
All about Agile, an Overview - Texavi Tech Bootcamp on How to be agile- Texav...Texavi Innovative Solutions
 
Technical Excellence Doesn't Just Happen--Igniting a Craftsmanship Culture
Technical Excellence Doesn't Just Happen--Igniting a Craftsmanship CultureTechnical Excellence Doesn't Just Happen--Igniting a Craftsmanship Culture
Technical Excellence Doesn't Just Happen--Igniting a Craftsmanship CultureAllison Pollard
 
Software Development Life Cycle
Software Development Life CycleSoftware Development Life Cycle
Software Development Life Cyclenayanbanik
 
Никита Галкин "Technical backlog: инструкция к применению"
Никита Галкин "Technical backlog: инструкция к применению"Никита Галкин "Technical backlog: инструкция к применению"
Никита Галкин "Technical backlog: инструкция к применению"Fwdays
 
SUGCON: The Agile Nirvana of DevSecOps and Containerization
SUGCON: The Agile Nirvana of DevSecOps and ContainerizationSUGCON: The Agile Nirvana of DevSecOps and Containerization
SUGCON: The Agile Nirvana of DevSecOps and ContainerizationVasiliy Fomichev
 
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...Spark Summit
 

Ähnlich wie SRECON16PreventCascadeFailures (20)

Congratulations, You’re a DBA... Now What?
Congratulations, You’re a DBA... Now What?Congratulations, You’re a DBA... Now What?
Congratulations, You’re a DBA... Now What?
 
Entrepreneurship 101 - Product Development
Entrepreneurship 101 - Product DevelopmentEntrepreneurship 101 - Product Development
Entrepreneurship 101 - Product Development
 
Ent101 productdevelopment-101111094924-phpapp02
Ent101 productdevelopment-101111094924-phpapp02Ent101 productdevelopment-101111094924-phpapp02
Ent101 productdevelopment-101111094924-phpapp02
 
The ilities of software engineering.pptx
The ilities of software engineering.pptxThe ilities of software engineering.pptx
The ilities of software engineering.pptx
 
CAPINC What's New in SOLIDWORKS 2015
CAPINC What's New in SOLIDWORKS 2015 CAPINC What's New in SOLIDWORKS 2015
CAPINC What's New in SOLIDWORKS 2015
 
00 introduction
00 introduction00 introduction
00 introduction
 
Metrics-driven Continuous Delivery
Metrics-driven Continuous DeliveryMetrics-driven Continuous Delivery
Metrics-driven Continuous Delivery
 
Forget about Agile
Forget about AgileForget about Agile
Forget about Agile
 
CyberArk Engineering: Guilds - our journey to mastery
CyberArk Engineering: Guilds - our journey to masteryCyberArk Engineering: Guilds - our journey to mastery
CyberArk Engineering: Guilds - our journey to mastery
 
How Agile changed Software Development
How Agile changed Software DevelopmentHow Agile changed Software Development
How Agile changed Software Development
 
Agile Software Development at UPT DEGI | Nov, 2015
Agile Software Development at UPT DEGI | Nov, 2015Agile Software Development at UPT DEGI | Nov, 2015
Agile Software Development at UPT DEGI | Nov, 2015
 
Domain Driven Design
Domain Driven DesignDomain Driven Design
Domain Driven Design
 
HIS 2015: Neil White - Advances in Practical Techniques for Critical Developm...
HIS 2015: Neil White - Advances in Practical Techniques for Critical Developm...HIS 2015: Neil White - Advances in Practical Techniques for Critical Developm...
HIS 2015: Neil White - Advances in Practical Techniques for Critical Developm...
 
All about Agile, an Overview - Texavi Tech Bootcamp on How to be agile- Texav...
All about Agile, an Overview - Texavi Tech Bootcamp on How to be agile- Texav...All about Agile, an Overview - Texavi Tech Bootcamp on How to be agile- Texav...
All about Agile, an Overview - Texavi Tech Bootcamp on How to be agile- Texav...
 
Technical Excellence Doesn't Just Happen--Igniting a Craftsmanship Culture
Technical Excellence Doesn't Just Happen--Igniting a Craftsmanship CultureTechnical Excellence Doesn't Just Happen--Igniting a Craftsmanship Culture
Technical Excellence Doesn't Just Happen--Igniting a Craftsmanship Culture
 
Software Development Life Cycle
Software Development Life CycleSoftware Development Life Cycle
Software Development Life Cycle
 
Никита Галкин "Technical backlog: инструкция к применению"
Никита Галкин "Technical backlog: инструкция к применению"Никита Галкин "Technical backlog: инструкция к применению"
Никита Галкин "Technical backlog: инструкция к применению"
 
Viewpoints and perspectives
Viewpoints and perspectivesViewpoints and perspectives
Viewpoints and perspectives
 
SUGCON: The Agile Nirvana of DevSecOps and Containerization
SUGCON: The Agile Nirvana of DevSecOps and ContainerizationSUGCON: The Agile Nirvana of DevSecOps and Containerization
SUGCON: The Agile Nirvana of DevSecOps and Containerization
 
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
Optimizing Spark Deployments for Containers: Isolation, Safety, and Performan...
 

Kürzlich hochgeladen

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard37
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 

Kürzlich hochgeladen (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 

SRECON16PreventCascadeFailures

  • 1. Avoiding Cascading Failures SRECON2016 - Craig Fender - Ravindra Punati
  • 2. Craig Fender Craig is presently a Senior Technical Duty Officer at eBay and is responsible for commanding all types of large scale site incidents. In addition to an undergraduate degree Craig holds numerous professional certifications related to the computer industry (RHCE, SSCA, ITIL and et cetera). Craig has held several roles at multiple start-up and fortune 500 companies such as Senior Systems Engineer, Project Manager, Presenter and Major Incident Commander. Ravindra Punati Ravindra Punati is leader of the Site Reliability Engineering team. In other roles at eBay Ravi has been responsible for the infrastructure automation initiatives and cloud operations. Ravi brings extensive expertise in the fields of database engineering, application development and software as a service product lines. In addition to holding multiple degrees in computer science Ravi has held several roles as an engineer, architect, manager and executive in various silicon valley start-ups. Avoiding Cascading Failures
  • 3. EBAY AT A GLANCE .
  • 5. EBAY AT A GLANCE $8.6B Revenue in 2015 $82B GMV in 2015 800M Live Listings * 162M Global Active Buyers * 190 Markets eBay apps are available in* 57% Percentage of eBay Inc. revenue that is international * *Q4 2015 data
  • 6. $33B 2015 MCV across eBay’s portfolio of apps Percentage of GMV closed on Mobile * 43% eBay Application downloads * 304M EBAY LEADS IN MOBILE 9.2M a tablet is bought via mobile in the U.S. * 28 sec. Every a ladies handbag is bought via mobile in the U.S. * 10 sec. Every new listings are added via mobile every week * *Q4 2015 data
  • 8. Background on eBay Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 8
  • 9. Background on eBay Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 9
  • 10. Background on eBay Operations • In any mission critical real-time environment where one is highly incentivized toward maximum uptime one must have: – redundancy , – resiliency and – robustness. •These needs give rise to complexity. •Complexity can lead to fragility. –Multilayered dependencies can cause cascading failure. •To manage that complexity certain behaviors, culture, principles and technology emerge as the most successful. PREVENT CASCADING FAILURES 10
  • 11. Background on eBay Operations •eBay itself is like a bridge between buyers . . . PREVENT CASCADING FAILURES 11
  • 12. Background on eBay Operations •. . .and sellers PREVENT CASCADING FAILURES 12
  • 13. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 13
  • 14. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 14
  • 15. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 15
  • 16. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 16
  • 17. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 17
  • 18. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 18
  • 19. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 19
  • 20. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 20
  • 21. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 21
  • 22. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 22
  • 23. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 23
  • 24. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 24
  • 25. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 25
  • 26. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 26
  • 27. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 27
  • 28. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 28
  • 29. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 29
  • 30. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 30
  • 31. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 31
  • 32. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 32
  • 33. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 33
  • 34. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 34
  • 35. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 35
  • 36. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 36
  • 37. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 37
  • 38. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 38
  • 39. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 39
  • 40. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 40
  • 41. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 41
  • 42. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 42
  • 43. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 43
  • 44. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 44
  • 45. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 45
  • 46. Background on what SEC and Operations • Pictures and slides about what an sec is and who we are – see Excellence in engineering presentation PREVENT CASCADING FAILURES 46
  • 47. SO FULL OF FAIL Creating resiliency through intelligently injected failure.
  • 49. So Full of Fail – Failure is NOT an option. It’s a requirement ! PREVENT CASCADING FAILURES 49 • The inability of a system or system component to perform a required function within specified limits. A failure may be produced when an error is encountered.
  • 50. PREVENT CASCADING FAILURES 50 So Full of Fail – Failure is NOT an option. It’s a requirement !
  • 51. So Full of Fail – Atomic failures – Overview PREVENT CASCADING FAILURES 51 •Software –Database failure –Service failures •Hardware –LB failures –Compute failures –Network failures
  • 52. So Full of Fail – Cascading failures PREVENT CASCADING FAILURES 52 •A cascading failure is a failure in a system of interconnected parts in which the failure of a part can trigger the failure of successive parts or the whole system.
  • 53. So Full of Fail – Cascading failures PREVENT CASCADING FAILURES 53
  • 54. So Full of Fail – Cascading failures PREVENT CASCADING FAILURES 54
  • 55. So Full of Fail – Cascading failures – Software PREVENT CASCADING FAILURES 55 •Service to Service –Service A talks to Service B using using load balancer virtual ip •Service B to Database –Service B talks to data store using data access layer
  • 56. So Full of Fail – Cascading failures – Software Preventing Cascading Failures 56
  • 57. Preventing Cascading Failure 57 So Full of Fail – Cascading failures – Software
  • 58. So Full of Fail – Cascading failures– Software Preventing Cascading Failure 58
  • 59. So Full of Fail – Cascading failures – Software - Database PREVENT CASCADING FAILURES 59 •Connectivity failure •Query timeouts •SQL errors
  • 60. So Full of Fail – Cascading failures – Software PREVENT CASCADING FAILURES 60 •Data access layer identifies any of the above issues and marks the connection down. •When the destination is recovered the path is marked active. •Time to detect the path state is ~15 ms •Request for connections to a struggling DB can lead to to stacking in upstream •The fail fast waiter queue logic helps the loaded database
  • 61. PREVENT CASCADING FAILURES 61 So Full of Fail – Cascading failures – Software
  • 62. PREVENT CASCADING FAILURES 62 So Full of Fail – Cascading failures – Software
  • 64. So Full of Fail – Cascading failures – Hardware PREVENT CASCADING FAILURES 64 •LB failures •Compute failures •Network failures
  • 65. So Full of Fail – Cascading failures – Hardware - Network PREVENT CASCADING FAILURES • Traffic organized in layers. • Redundant interconnected paths.
  • 66. So Full of Fail – Cascading failures – Hardware • n = Failure Domain (e.g. Network) • Sn = Number of Service Instances in Failure Domain n • Fn = Number of instances of Failure Domain n PREVENT CASCADING FAILURES
  • 67. PREVENTION THROUGH PARADIGM Prevent and remediate atomic failure in order to prevent cascading failure scenarios.
  • 68. PREVENTION THROUGH PARADIGM •Site code is deployed on a common set of platforms which enforce the aforementioned tools. – Engineering and development architecture effort is “front-loaded” to ensure that all production services can be restarted in exactly the same way no matter what the underlying command. – Uniformity of response – Remediation at the node level can be automated. •State based and event based monitoring of key operating metrics. – The 9 or so OS-level metrics we all monitor •Capacity monitoring for “DR’ compliance – Two or more co-locations – One feature per “pool” •Provision on demand •Regional code roll PREVENTING CASCADING FAILURES 68
  • 69. PARADIGM BEFORE PROCEDURES. •For maximum rapid remediation your pool or feature should comply with operational paradigms. •If the care and feeding of your particular beautiful ‘unicorn’ requires special attention or procedures: –You’re asking someone to go against their established paradigm and training. –At a high rate of speed –As the rarely seen or remembered exception PARADIGM BEFORE PROCEDURES 69
  • 70. DEALING WITH DISASTER DIRECTION AND DECISION Preventing the cascade failure through redirection and deciding which features to keep.
  • 72.
  • 76. PREVENT CASCADING FAILURES 76 DEALING WITH DISASTER - DIRECTION
  • 77. DEALING WITH DISASTER - DIRECTION •Graphic about two lane bridge with one lane broken PREVENTION THROUGH PARADIGM 77
  • 78. TOOLS NOT DOCUMENTS 78 DEALING WITH DISASTER - DIRECTION
  • 79. TOOLS NOT DOCUMENTS 79 DEALING WITH DISASTER - DIRECTION
  • 80. DEALING WITH DISASTER - DIRECTION •To prevent total failure of the bridge we have to get the kittens to drive over only one lane •Connection limits would be required •Blocking bots •Scale down to minimal serving modes •Wiring off features like advertising. PREVENTION THROUGH PARADIGM 80
  • 83. So Full of Fail– Total Failure Avoidance •Bubble up anomalies for inspection •Fail Fast •Mark down the failed paths •Preserve the user experiences •Use system approaches to solve PREVENT CASCADING FAILURES 83