The two little bugs that almost
brought down Booking.com
Jean-François Gagné (System Engineer)
jeanfrancois DOT gagne AT booking.com
April 25, 2017 – Percona Live Santa Clara 2017
For a, b and c relatively small:
Consequence(a + b + c) is much bigger than
Consequence(a) + Consequence(b) + Consequence(c)
MySQL/MariaDB replication at Booking.com
● Typical Booking.com MySQL/MariaDB replication deployment:
                         +---+
                         | M |
                         +---+
                           |
  +------+-- ... --+-------+-------+-------- ...
  |      |         |               |
+---+  +---+     +---+           +---+
| S1|  | S2|     | Sn|           | M1|
+---+  +---+     +---+           +---+
                                   |
                              +-- ... --+
                              |         |
                            +---+     +---+
                            | T1|     | Tm|
                            +---+     +---+
Impacted setup (simplified)
                              +---+
                              | M |
                              +---+
                                |
           +------- .... -------+---------------------+
           |                    |                     |
         +---+                +---+                 +---+
         | M1|                | Mi|                 | Mj|
         +---+                +---+                 +---+
           |                    |                     |
  +-----+- .. -+-----+       +- .. -+       +-----+-- .. -+-----+
  |     |      |     |       |      |       |     |       |     |
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|   | T.|  | Tm|   | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+
Upgrade from 5.5 to new major version
                              +---+
                              | M |
                              +---+
                                |
           +------- .... -------+---------------------+
           |                    |                     |
         +---+                +---+                 +---+
         | M1|                | Mi|                 | Mj|
         +---+                +---+                 +---+
           |                    |                     |
  +-----+- .. -+-----+       +- .. -+       +-----+-- .. -+-----+
  |     |      |     |       |      |       |     |       |     |
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|   | T.|  | Tm|   | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+
Bad transaction on the master
                              +---+
                              | M | <<-- “bad transaction”
                              +---+
                                |
           +------- .... -------+---------------------+
           |                    |                     |
         +---+                +---+                 +---+
         | M1|                | Mi|                 | Mj|
         +---+                +---+                 +---+
           |                    |                     |
  +-----+- .. -+-----+       +- .. -+       +-----+-- .. -+-----+
  |     |      |     |       |      |       |     |       |     |
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|   | T.|  | Tm|   | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+
Oops! M2 runs out of memory (OOM) and is killed
                              +---+
                              | M | <<-- “bad transaction”
                              +---+
                                |
           +------- .... -------+---------------------+
           |                    |                     |
         +---+                +-/+                 +---+
         | M1|                | Mi|                 | Mj|
         +---+                +/-+                 +---+
           |                    |                     |
  +-----+- .. -+-----+       +- .. -+       +-----+-- .. -+-----+
  |     |      |     |       |      |       |     |       |     |
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|   | T.|  | Tm|   | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+
Oops again! All “blue” servers run out of memory (OOM) and are killed
                              +---+
                              | M | <<-- “bad transaction”
                              +---+
                                |
           +------- .... -------+---------------------+
           |                    |                     |
         +---+                +-/+                 +---+
         | M1|                | Mi|                 | Mj|
         +---+                +/-+                 +---+
           |                    |                     |
  +-----+- .. -+-----+       +- .. -+       +-----+-- .. -+-----+
  |     |      |     |       |      |       |     |       |     |
+-/+ +---+  +-/+ +---+   +---+  +---+   +---+ +-/+   +---+ +-/+
| S1| | S2|  | S.| | Sn|   | T.|  | Tm|   | U1| | U2|   | U.| | Uo|
+/-+ +---+  +/-+ +---+   +---+  +---+   +---+ +/-+   +---+ +/-+
What is the “bad” transaction?
● DELETE FROM TABLE WHERE …lot of rows…;
● A single transaction of ~2 GB in the binary logs (row-based replication, RBR)
● Obviously a bug in the application
(but it should not have triggered an OOM)
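
One common way for an application to avoid producing a single multi-gigabyte row-based transaction is to split a large DELETE into many small ones. The sketch below is not from the talk: it assumes a hypothetical table `my_table` with an integer primary key `id`, and a hypothetical filter `expired = 1` standing in for the application's real WHERE clause. It only generates the statements; each one would be run as its own transaction.

```python
from typing import Iterator, Tuple


def id_ranges(min_id: int, max_id: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (low, high) id ranges that cover min_id..max_id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    low = min_id
    while low <= max_id:
        high = min(low + batch_size - 1, max_id)
        yield low, high
        low = high + 1


def batched_deletes(table: str, condition: str, min_id: int, max_id: int,
                    batch_size: int) -> Iterator[str]:
    """Yield one small DELETE per id range instead of one huge DELETE.

    `table` and `condition` are hypothetical placeholders; they are
    interpolated as-is, so they must come from trusted code, not user input.
    """
    for low, high in id_ranges(min_id, max_id, batch_size):
        yield (f"DELETE FROM {table} "
               f"WHERE id BETWEEN {low} AND {high} AND ({condition});")


if __name__ == "__main__":
    for statement in batched_deletes("my_table", "expired = 1", 1, 25, 10):
        print(statement)
```

With each statement committed separately, every binary log transaction carries at most `batch_size` row images, so no replica has to process one giant transaction.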
What needs to be done next?
● Reminder: replication in 5.5 is not crash-safe
● Next version is crash-safe, but can’t…
● Crashed slaves either run out of memory again or are corrupted
● We need to re-clone all crashed slaves!
What saved us?
● An engaged team of skilled DBAs: all joined to help
● Data not too sensitive to replication delay
● Data not too sensitive to “skipping transactions”
● pt-slave-restart
● IDEMPOTENT mode
● A torrenting cloning tool
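
To illustrate why IDEMPOTENT mode helps here: a replica that is not crash-safe may re-apply row events it had already applied before the crash, and in the default STRICT mode a delete event for a row that is no longer there stops replication. The toy model below is only a sketch of that behaviour for delete events; the function and exception names are hypothetical and this is not MySQL code.

```python
from typing import Dict, Iterable


class KeyNotFound(Exception):
    """Raised by the toy applier where a STRICT replica would stop."""


def apply_delete_events(table: Dict[int, str], keys: Iterable[int],
                        mode: str = "STRICT") -> int:
    """Apply one delete row event per key; return how many were skipped.

    STRICT: a delete for a missing row raises KeyNotFound.
    IDEMPOTENT: the key-not-found error is suppressed and the event skipped.
    """
    if mode not in ("STRICT", "IDEMPOTENT"):
        raise ValueError(f"unknown mode: {mode}")
    skipped = 0
    for key in keys:
        if key in table:
            del table[key]
        elif mode == "STRICT":
            raise KeyNotFound(f"row {key} not found")
        else:
            skipped += 1
    return skipped


if __name__ == "__main__":
    # Rows 1 and 2 were already deleted before the crash; row 3 was not.
    replica = {3: "c"}
    skipped = apply_delete_events(replica, [1, 2, 3], mode="IDEMPOTENT")
    print(skipped, replica)
```

The price, as the slide notes, is that skipped events are silently ignored, which is only acceptable when the data tolerates it.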
What could have helped us?
● A “working” torrenting cloning tool…
● Not used often enough, so we did not know it was broken
(fixed in less than 2 hours)
● An AUTO-FIX/AUTO-REPAIR mode (RBR)
● Instead of skipping transactions (and making data diverge),
it should repair slave drift (and make data converge)
https://bugs.mysql.com/bug.php?id=54250
http://blog.wl0.org/2016/05/the-differences-between-idempotent-and-my-suggested-auto-repair-mode/
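
The difference between skipping and repairing can be sketched with a toy update event. This is an illustrative reading of the idea, not the behaviour specified in the linked feature request: the `on_missing` parameter and its values `"skip"` and `"repair"` are hypothetical names. When the row is missing on the replica, skipping leaves the replica different from the master, while repairing writes the after-image carried by the row event so the two converge.

```python
from typing import Dict

Row = Dict[str, int]


def apply_update(table: Dict[int, Row], key: int, after_image: Row,
                 on_missing: str) -> bool:
    """Apply an update row event; return True if the table was written.

    on_missing="skip":   a missing row is ignored (data stays diverged).
    on_missing="repair": a missing row is created from the after-image.
    """
    if on_missing not in ("skip", "repair"):
        raise ValueError(f"unknown on_missing policy: {on_missing}")
    if key in table or on_missing == "repair":
        table[key] = dict(after_image)
        return True
    return False


if __name__ == "__main__":
    master = {1: {"price": 12}}   # state on the master after the update
    skipping_replica: Dict[int, Row] = {}   # drifted: row 1 is missing
    repairing_replica: Dict[int, Row] = {}  # drifted in the same way

    apply_update(skipping_replica, 1, {"price": 12}, on_missing="skip")
    apply_update(repairing_replica, 1, {"price": 12}, on_missing="repair")

    print(skipping_replica == master)   # still diverged
    print(repairing_replica == master)  # converged
```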
Want to know more…
● We are hiring!
● MySQL Engineer / DBA
● System Administrator
● System Engineer
● Site Reliability Engineer
● Developer / Designer
● Technical Team Lead
● Product Owner
● Data Scientist
● And many more…
● https://workingatbooking.com/
Thanks
Jean-François Gagné
jeanfrancois DOT gagne AT booking.com
