SlideShare ist ein Scribd-Unternehmen logo
1 von 26
How Raft consensus algorithm will make
replication even better in MongoDB 3.2
Speaker: Henrik Ingo, @h_ingo
Author: Spencer T Brody, @stbrody
Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
Why use Replication?
• Data redundancy
• High availability
• Geographical distribution
• Latency
What is Consensus?
• Getting multiple processes/servers to agree
on state machine state
• Must handle a wide range of failure modes
• Disk failure
• Network partitions
• Machine freezes
• Clock skews
• Bugs?
Basic consensus
State
Machine
X 3
Y 2
Z 7
State
Machine
X 3
Y 2
Z 7
State
Machine
X 3
Y 2
Z 7
Leader Based Consensus
State
Machine
X 3
Y 2
Z 7
Replicated
Log
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
State
Machine
X 3
Y 2
Z 7
Replicated
Log
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
State
Machine
X 3
Y 2
Z 7
Replicated
Log
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
Elections
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
Elections
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
Elections
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
Elections
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
Elections
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
Key requirements for elections
• There must be exactly 1 leader at any time
• Majority election: there can only be 1 majority
• Note: 50% is not a majority!
• Old leader must know to step down, even if
disconnected
Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
Agenda
• Introduction to consensus
• Leader-based replicated state machine
• Elections and data replication in MongoDB
• Improvements coming in MongoDB 3.2
• Goals and inspiration from Raft Consensus
Algorithm
• Preventing double voting
• Monitoring node status
• Calling for elections
Goals for MongoDB 3.2
• Decrease failover time
• Speed up detection and resolution of false
primary situations
Finding Inspiration in Raft
• “In Search of an Understandable Consensus
Algorithm” by Diego Ongaro:
https://ramcloud.stanford.edu/raft.pdf
• Designed to address the shortcomings of Paxos
• Easier to understand
• Easier to implement in real applications
• Provably correct
• Remarkably similar to what we’re doing already
Raft Concepts
• Term (election) IDs
• Heartbeats using existing data replication
channel
• Asymmetric election timeouts
Preventing Double Voting
• Can’t vote for 2 nodes in the same election
• Pre-3.2: 30 second vote timeout
• Post-3.2: Term IDs
• Term:
• Monotonically increasing ID
• Incremented on every election *attempt*
• Lets voters distinguish elections so they can vote
twice quickly in different elections
• Logical clock
MongoDB 3.0 OpLog entry
> db.oplog.rs.find().sort( { $natural : -1 }
).limit(1).pretty()
{
"ts" : Timestamp(1444466011, 1),
"h" : NumberLong("-6240522391332325619"),
"v" : 2,
"op" : "i",
"ns" : "test.nulltest",
"o" : {
"_id" : 3,
"a" : 3
}
}
millisecond + counter
Random hash = gtid
MongoDB 3.2 OpLog entry
> db.oplog.rs.find().sort( { $natural : -1 }
).limit(1).pretty()
{
"ts" : Timestamp(1445632081, 861),
"t" : NumberLong(42),
"h" : NumberLong("5466055178864103715"),
"v" : 2,
"op" : "u",
"ns" : "ycsb.usertable",
"o2" : {
"_id" : "user645414720104877157"
},
"o" : {
"$set" : {
"field4" : BinData(0,"KT5sN1wrPl...Ny9wIg==")
}
}
millisecond + Raft counter
Random hash = gtid
Raft Term
Heartbeats
• MongoDB 3.0: Heartbeats
• Sent every two seconds from every node to every
other node
• Volume increases quadratically as nodes are
added to the replica set
• Odd behaviors with asymmetric network partitions,
corner cases
• MongoDB 3.2: Extra metadata sent
via existing data replication channel
• Utilizes chained replication
• Faster heartbeats = faster elections
and detection of false primaries
Daisy chaining
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
X 3
Y 2
Z 7
X ⬅️ 1
Y ⬅️ 2
X ⬅️ 3
Election timeouts
• Tradeoff between failover time and spurious
failovers
• Node calls for an election when it hasn’t heard
from the primary within the election timeout
• Starting in 3.2:
• Single election timeout
• Election timeout is configurable
• Election timeout is varied randomly
for each node
• Varying timeouts help reduce tied votes
• Fewer tied votes = faster failover
rs.config().
settings.
electionTimeoutMillis
Conclusion
• MongoDB 3.2 will have
• Faster failovers
• Faster error detection
• More control to prevent spurious failovers
• Robust also in corner cases
• This means your systems are
• More resilient to failure
• More uptime
• Easier to maintain
Questions?

Weitere ähnliche Inhalte

Ähnlich wie How Raft consensus algorithm will make replication even better in MongoDB 3.2 / Henrik Ingo (MongoDB)

Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Databricks
 
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
Lucidworks
 
Basics in algorithms and data structure
Basics in algorithms and data structure Basics in algorithms and data structure
Basics in algorithms and data structure
Eman magdy
 

Ähnlich wie How Raft consensus algorithm will make replication even better in MongoDB 3.2 / Henrik Ingo (MongoDB) (20)

Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
 
Optimal Strategies for Large-Scale Batch ETL Jobs
Optimal Strategies for Large-Scale Batch ETL JobsOptimal Strategies for Large-Scale Batch ETL Jobs
Optimal Strategies for Large-Scale Batch ETL Jobs
 
How to Make Norikra Perfect
How to Make Norikra PerfectHow to Make Norikra Perfect
How to Make Norikra Perfect
 
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma TangOptimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
Optimal Strategies for Large Scale Batch ETL Jobs with Emma Tang
 
Reaching reliable agreement in an unreliable world
Reaching reliable agreement in an unreliable worldReaching reliable agreement in an unreliable world
Reaching reliable agreement in an unreliable world
 
Unveiling etcd: Architecture and Source Code Deep Dive
Unveiling etcd: Architecture and Source Code Deep DiveUnveiling etcd: Architecture and Source Code Deep Dive
Unveiling etcd: Architecture and Source Code Deep Dive
 
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
Seqr - Protein Sequence Search: Presented by Lianyi Han, Medical Science & Co...
 
Fighting fraud: finding duplicates at scale (Highload+ 2019)
Fighting fraud: finding duplicates at scale (Highload+ 2019)Fighting fraud: finding duplicates at scale (Highload+ 2019)
Fighting fraud: finding duplicates at scale (Highload+ 2019)
 
Cassandra introduction apache con 2014 budapest
Cassandra introduction apache con 2014 budapestCassandra introduction apache con 2014 budapest
Cassandra introduction apache con 2014 budapest
 
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
Tutorial: The Role of Event-Time Analysis Order in Data StreamingTutorial: The Role of Event-Time Analysis Order in Data Streaming
Tutorial: The Role of Event-Time Analysis Order in Data Streaming
 
Basics in algorithms and data structure
Basics in algorithms and data structure Basics in algorithms and data structure
Basics in algorithms and data structure
 
What is a decentralised application? - Devoxx Morocco 2018
What is a decentralised application? - Devoxx Morocco 2018What is a decentralised application? - Devoxx Morocco 2018
What is a decentralised application? - Devoxx Morocco 2018
 
Blockchain, cryptography and tokens — NYC Bar presentation
Blockchain, cryptography and tokens — NYC Bar presentationBlockchain, cryptography and tokens — NYC Bar presentation
Blockchain, cryptography and tokens — NYC Bar presentation
 
What is a decentralised application ? - Les Jeudis du Libre
What is a decentralised application ? - Les Jeudis du LibreWhat is a decentralised application ? - Les Jeudis du Libre
What is a decentralised application ? - Les Jeudis du Libre
 
Measurements in Cryptocurrency Networks
Measurements in Cryptocurrency NetworksMeasurements in Cryptocurrency Networks
Measurements in Cryptocurrency Networks
 
Tom Capper Mozcon 2021 - Core Web Vitals - The Fast & The Spurious
Tom Capper Mozcon 2021 - Core Web Vitals - The Fast & The SpuriousTom Capper Mozcon 2021 - Core Web Vitals - The Fast & The Spurious
Tom Capper Mozcon 2021 - Core Web Vitals - The Fast & The Spurious
 
Lunch session: Quantum Computing
Lunch session: Quantum ComputingLunch session: Quantum Computing
Lunch session: Quantum Computing
 
Xbfs HPDC'2019
Xbfs HPDC'2019Xbfs HPDC'2019
Xbfs HPDC'2019
 
Ralf Herbrich - Introduction to Graphical models in Industry
Ralf Herbrich - Introduction to Graphical models in IndustryRalf Herbrich - Introduction to Graphical models in Industry
Ralf Herbrich - Introduction to Graphical models in Industry
 
Blockchain for Gambling
Blockchain for Gambling Blockchain for Gambling
Blockchain for Gambling
 

Mehr von Ontico

Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Ontico
 

Mehr von Ontico (20)

One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...
One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...
One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...
 
Масштабируя DNS / Артем Гавриченков (Qrator Labs)
Масштабируя DNS / Артем Гавриченков (Qrator Labs)Масштабируя DNS / Артем Гавриченков (Qrator Labs)
Масштабируя DNS / Артем Гавриченков (Qrator Labs)
 
Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)
Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)
Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)
 
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
 
Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...
Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...
Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...
 
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
 
Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...
Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...
Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...
 
Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...
Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...
Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...
 
ProxySQL Use Case Scenarios / Alkin Tezuysal (Percona)
ProxySQL Use Case Scenarios / Alkin Tezuysal (Percona)ProxySQL Use Case Scenarios / Alkin Tezuysal (Percona)
ProxySQL Use Case Scenarios / Alkin Tezuysal (Percona)
 
MySQL Replication — Advanced Features / Петр Зайцев (Percona)
MySQL Replication — Advanced Features / Петр Зайцев (Percona)MySQL Replication — Advanced Features / Петр Зайцев (Percona)
MySQL Replication — Advanced Features / Петр Зайцев (Percona)
 
Внутренний open-source. Как разрабатывать мобильное приложение большим количе...
Внутренний open-source. Как разрабатывать мобильное приложение большим количе...Внутренний open-source. Как разрабатывать мобильное приложение большим количе...
Внутренний open-source. Как разрабатывать мобильное приложение большим количе...
 
Подробно о том, как Causal Consistency реализовано в MongoDB / Михаил Тюленев...
Подробно о том, как Causal Consistency реализовано в MongoDB / Михаил Тюленев...Подробно о том, как Causal Consistency реализовано в MongoDB / Михаил Тюленев...
Подробно о том, как Causal Consistency реализовано в MongoDB / Михаил Тюленев...
 
Балансировка на скорости проводов. Без ASIC, без ограничений. Решения NFWare ...
Балансировка на скорости проводов. Без ASIC, без ограничений. Решения NFWare ...Балансировка на скорости проводов. Без ASIC, без ограничений. Решения NFWare ...
Балансировка на скорости проводов. Без ASIC, без ограничений. Решения NFWare ...
 
Перехват трафика — мифы и реальность / Евгений Усков (Qrator Labs)
Перехват трафика — мифы и реальность / Евгений Усков (Qrator Labs)Перехват трафика — мифы и реальность / Евгений Усков (Qrator Labs)
Перехват трафика — мифы и реальность / Евгений Усков (Qrator Labs)
 
И тогда наверняка вдруг запляшут облака! / Алексей Сушков (ПЕТЕР-СЕРВИС)
И тогда наверняка вдруг запляшут облака! / Алексей Сушков (ПЕТЕР-СЕРВИС)И тогда наверняка вдруг запляшут облака! / Алексей Сушков (ПЕТЕР-СЕРВИС)
И тогда наверняка вдруг запляшут облака! / Алексей Сушков (ПЕТЕР-СЕРВИС)
 
Как мы заставили Druid работать в Одноклассниках / Юрий Невиницин (OK.RU)
Как мы заставили Druid работать в Одноклассниках / Юрий Невиницин (OK.RU)Как мы заставили Druid работать в Одноклассниках / Юрий Невиницин (OK.RU)
Как мы заставили Druid работать в Одноклассниках / Юрий Невиницин (OK.RU)
 
Разгоняем ASP.NET Core / Илья Вербицкий (WebStoating s.r.o.)
Разгоняем ASP.NET Core / Илья Вербицкий (WebStoating s.r.o.)Разгоняем ASP.NET Core / Илья Вербицкий (WebStoating s.r.o.)
Разгоняем ASP.NET Core / Илья Вербицкий (WebStoating s.r.o.)
 
100500 способов кэширования в Oracle Database или как достичь максимальной ск...
100500 способов кэширования в Oracle Database или как достичь максимальной ск...100500 способов кэширования в Oracle Database или как достичь максимальной ск...
100500 способов кэширования в Oracle Database или как достичь максимальной ск...
 
Apache Ignite Persistence: зачем Persistence для In-Memory, и как он работает...
Apache Ignite Persistence: зачем Persistence для In-Memory, и как он работает...Apache Ignite Persistence: зачем Persistence для In-Memory, и как он работает...
Apache Ignite Persistence: зачем Persistence для In-Memory, и как он работает...
 
Механизмы мониторинга баз данных: взгляд изнутри / Дмитрий Еманов (Firebird P...
Механизмы мониторинга баз данных: взгляд изнутри / Дмитрий Еманов (Firebird P...Механизмы мониторинга баз данных: взгляд изнутри / Дмитрий Еманов (Firebird P...
Механизмы мониторинга баз данных: взгляд изнутри / Дмитрий Еманов (Firebird P...
 

Kürzlich hochgeladen

Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
9953056974 Low Rate Call Girls In Saket, Delhi NCR
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Kandungan 087776558899
 

Kürzlich hochgeladen (20)

Unit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdfUnit 2- Effective stress & Permeability.pdf
Unit 2- Effective stress & Permeability.pdf
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
Call Girls Pimpri Chinchwad Call Me 7737669865 Budget Friendly No Advance Boo...
 
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort ServiceCall Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
Call Girls in Ramesh Nagar Delhi 💯 Call Us 🔝9953056974 🔝 Escort Service
 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086
 
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...Bhosari ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready For ...
Bhosari ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready For ...
 
UNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its PerformanceUNIT - IV - Air Compressors and its Performance
UNIT - IV - Air Compressors and its Performance
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects2016EF22_0 solar project report rooftop projects
2016EF22_0 solar project report rooftop projects
 
University management System project report..pdf
University management System project report..pdfUniversity management System project report..pdf
University management System project report..pdf
 
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Palanpur 7001035870 Whatsapp Number, 24/07 Booking
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
(INDIRA) Call Girl Aurangabad Call Now 8617697112 Aurangabad Escorts 24x7
 
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak HamilCara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
Cara Menggugurkan Sperma Yang Masuk Rahim Biyar Tidak Hamil
 

How Raft consensus algorithm will make replication even better in MongoDB 3.2 / Henrik Ingo (MongoDB)

  • 1. How Raft consensus algorithm will make replication even better in MongoDB 3.2 Speaker: Henrik Ingo, @h_ingo Author: Spencer T Brody, @stbrody
  • 2. Agenda • Introduction to consensus • Leader-based replicated state machine • Elections and data replication in MongoDB • Improvements coming in MongoDB 3.2
  • 3. Why use Replication? • Data redundancy • High availability • Geographical distribution • Latency
  • 4. What is Consensus? • Getting multiple processes/servers to agree on state machine state • Must handle a wide range of failure modes • Disk failure • Network partitions • Machine freezes • Clock skews • Bugs?
  • 5. Basic consensus State Machine X 3 Y 2 Z 7 State Machine X 3 Y 2 Z 7 State Machine X 3 Y 2 Z 7
  • 6. Leader Based Consensus State Machine X 3 Y 2 Z 7 Replicated Log X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 State Machine X 3 Y 2 Z 7 Replicated Log X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 State Machine X 3 Y 2 Z 7 Replicated Log X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  • 7. Agenda • Introduction to consensus • Leader-based replicated state machine • Elections and data replication in MongoDB • Improvements coming in MongoDB 3.2
  • 8. Elections X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  • 9. Elections X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  • 10. Elections X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  • 11. Elections X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  • 12. Elections X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  • 13. Key requirements for elections • There must be exactly 1 leader at any time • Majority election: there can only be 1 majority • Note: 50% is not a majority! • Old leader must know to step down, even if disconnected
  • 14. Agenda • Introduction to consensus • Leader-based replicated state machine • Elections and data replication in MongoDB • Improvements coming in MongoDB 3.2
  • 15. Agenda • Introduction to consensus • Leader-based replicated state machine • Elections and data replication in MongoDB • Improvements coming in MongoDB 3.2 • Goals and inspiration from Raft Consensus Algorithm • Preventing double voting • Monitoring node status • Calling for elections
  • 16. Goals for MongoDB 3.2 • Decrease failover time • Speed up detection and resolution of false primary situations
  • 17. Finding Inspiration in Raft • “In Search of an Understandable Consensus Algorithm” by Diego Ongaro: https://ramcloud.stanford.edu/raft.pdf • Designed to address the shortcomings of Paxos • Easier to understand • Easier to implement in real applications • Provably correct • Remarkably similar to what we’re doing already
  • 18. Raft Concepts • Term (election) IDs • Heartbeats using existing data replication channel • Asymmetric election timeouts
  • 19. Preventing Double Voting • Can’t vote for 2 nodes in the same election • Pre-3.2: 30 second vote timeout • Post-3.2: Term IDs • Term: • Monotonically increasing ID • Incremented on every election *attempt* • Lets voters distinguish elections so they can vote twice quickly in different elections • Logical clock
  • 20. MongoDB 3.0 OpLog entry > db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty() { "ts" : Timestamp(1444466011, 1), "h" : NumberLong("-6240522391332325619"), "v" : 2, "op" : "i", "ns" : "test.nulltest", "o" : { "_id" : 3, "a" : 3 } } millisecond + counter Random hash = gtid
  • 21. MongoDB 3.2 OpLog entry > db.oplog.rs.find().sort( { $natural : -1 } ).limit(1).pretty() { "ts" : Timestamp(1445632081, 861), "t" : NumberLong(42), "h" : NumberLong("5466055178864103715"), "v" : 2, "op" : "u", "ns" : "ycsb.usertable", "o2" : { "_id" : "user645414720104877157" }, "o" : { "$set" : { "field4" : BinData(0,"KT5sN1wrPl...Ny9wIg==") } } millisecond + Raft counter Random hash = gtid Raft Term
  • 22. Heartbeats • MongoDB 3.0: Heartbeats • Sent every two seconds from every node to every other node • Volume increases quadratically as nodes are added to the replica set • Odd behaviors with asymmetric network partitions, corner cases • MongoDB 3.2: Extra metadata sent via existing data replication channel • Utilizes chained replication • Faster heartbeats = faster elections and detection of false primaries
  • 23. Daisy chaining X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3 X 3 Y 2 Z 7 X ⬅️ 1 Y ⬅️ 2 X ⬅️ 3
  • 24. Election timeouts • Tradeoff between failover time and spurious failovers • Node calls for an election when it hasn’t heard from the primary within the election timeout • Starting in 3.2: • Single election timeout • Election timeout is configurable • Election timeout is varied randomly for each node • Varying timeouts help reduce tied votes • Fewer tied votes = faster failover rs.config(). settings. electionTimeoutMillis
  • 25. Conclusion • MongoDB 3.2 will have • Faster failovers • Faster error detection • More control to prevent spurious failovers • Robust also in corner cases • This means your systems are • More resilient to failure • More uptime • Easier to maintain

Hinweis der Redaktion

  1. Every write requires participation from all nodes – like direct democracy A lot of network traffic, poor performance This is like pure Paxos
  2. Leader = Primary This is what most real-world deployments do In mongodb… Oplog = logical transformation of data
  3. One node nominates itself and contacts others Pre-election check – will you vote for me? Election check – do you vote for me? Election notifications – I am now primary
  4. Nodes might vote no: primary exists, repl lag
  5. Question break
  6. Define: Failover time. Failover time > election time False primaries Rollbacks
  7. Raft paper presented at Usenix 2014 by Stamford University PHD students Diego Ongaro and John Ousterhout Provably correct Paxos - 1989 Paxos hard to use – “Paxos made simple”, “Paxos made practical”
  8. Continue sending heartbeats Initial sync source, spanning tree maintenance Even less frequent now
  9. Define “election timeout” - *not* time limit of election duration One possibility: Base timeout <1second: ~500ms, plus a few milliseconds