Raft in Scylla
Konstantin Osipov, ScyllaDB

Konstantin Osipov
Team Lead, ScyllaDB
Kostja is one of the developers behind Scylla lightweight transactions. His current focus is Raft log replication and its applications to schema changes, topology changes, and tablets.
Recap: Scylla Summit 2019
▪ LWT: the first strongly consistent feature
▪ Available in 4.0
▪ Pay per use
2020-12-25: Tested with Jepsen!

UPDATE employees
SET join_date = '2018-05-19'
WHERE firstname = 'John' AND lastname = 'Doe'
IF join_date != null;

[applied]
False
LWT use of Paxos
▪ 3 network round trips per write
▪ Must read the old value before write

[Diagram: replicas R1, R2, and R3 exchange messages in four phases ("Can I propose a value?", check condition, accept new value, learn decision) before the decision is made.]
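The four phases in the diagram can be sketched as code. This is a minimal, illustrative model of a Paxos-backed conditional write, not Scylla's implementation: the `Replica` class and `lwt_update` function are invented names, and real LWT also handles ballot collisions, persistence, and failure recovery.

```python
# Illustrative model of LWT over Paxos: three round trips per write,
# and the old value must be read before the condition can be checked.

class Replica:
    def __init__(self):
        self.promised = 0        # highest ballot promised
        self.accepted = None     # (ballot, value) last accepted
        self.row = {"join_date": None}

    def prepare(self, ballot):
        """Round 1: 'Can I propose a value?' Promise to ignore older ballots."""
        if ballot > self.promised:
            self.promised = ballot
            # Piggyback the current row so the coordinator can check the condition.
            return True, self.row.copy()
        return False, None

    def accept(self, ballot, value):
        """Round 2: accept the new value unless a higher ballot was promised."""
        if ballot >= self.promised:
            self.accepted = (ballot, value)
            return True
        return False

    def learn(self, value):
        """Round 3: apply the decided value."""
        self.row.update(value)


def lwt_update(replicas, ballot, condition, new_value):
    # Round trip 1: prepare, and read the old row.
    votes = [r.prepare(ballot) for r in replicas]
    if sum(ok for ok, _ in votes) <= len(replicas) // 2:
        return False
    current = next(row for ok, row in votes if ok)
    if not condition(current):        # e.g. IF join_date != null
        return False                  # reported to the client as [applied] = False
    # Round trip 2: get the new value accepted by a majority.
    if sum(r.accept(ballot, new_value) for r in replicas) <= len(replicas) // 2:
        return False
    # Round trip 3: tell everyone to learn the decision.
    for r in replicas:
        r.learn(new_value)
    return True
```

Counting the message exchanges in `lwt_update` gives exactly the three round trips the slide mentions, one of them carrying the mandatory read.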
What is Raft anyway?
▪ Raft provides strong consistency efficiently
▪ Only the leader can accept writes

[Diagram: the leader sends "append entries" to two followers and applies the entry once the decision is made.]

… 1 network round trip per write on average
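The leader-driven flow above can be sketched as follows. This is a deliberately minimal model with invented `Node` and `Leader` classes: real Raft also tracks terms, rejects stale leaders, and resolves log conflicts.

```python
# Minimal sketch of Raft's single round trip: the leader appends an entry,
# replicates it, and commits once a majority (leader included) has stored it.

class Node:
    def __init__(self):
        self.log = []            # replicated log entries
        self.commit_index = -1   # index of the last committed entry

    def append_entries(self, entry):
        self.log.append(entry)
        return True              # ack back to the leader

class Leader(Node):
    def __init__(self, followers):
        super().__init__()
        self.followers = followers

    def replicate(self, entry):
        self.log.append(entry)                       # leader stores first
        acks = 1 + sum(f.append_entries(entry) for f in self.followers)
        if acks > (1 + len(self.followers)) // 2:    # majority, incl. leader
            self.commit_index = len(self.log) - 1    # decision made
            for f in self.followers:                 # followers apply too; in real
                f.commit_index = len(f.log) - 1      # Raft this rides the next message
            return True
        return False
```

One `replicate` call is one network round trip to the followers, which is why the average cost per write drops to a single round trip compared with Paxos's three.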
Raft log replication
▪ Each node has a copy of the Raft log
Scylla plans to use Raft for:
▪ Topology changes
▪ Schema changes
▪ Tablets
[Diagram: a table partitioned into three tablets: Tablet 1 (tokens 0-99) on Node A, Tablet 2 (100-199) on Node B, Tablet 3 (200-299) on Node C.]
Topology changes on Raft
Topology changes in Scylla
▪ Safe when one change is done at a time
▪ Rely on 30+ second timeouts for consistency
▪ Allowed on a significantly degraded cluster (split brain)
Topology changes using Raft
▪ Durable and linearizable
▪ Permit adding multiple nodes
▪ Permit background data rebalancing
▪ Require a majority of replicas alive to succeed
Schema changes using Raft
Schema changes in Scylla
▪ Each node owns a copy of the schema
▪ Schema change is first made locally
▪ Then eventually pushed through the cluster
▪ Last-timestamp-wins rule is used for reconciliation
Node A:
> CREATE TABLE e (a int);
OK (hash: a81e, ts: 1609420790)
Node B:
> CREATE TABLE e (a int, b int);
OK (hash: 2fa3, ts: 1609420792)
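The last-timestamp-wins rule from the example can be sketched as a tiny merge function. The dict layout is illustrative, not Scylla's actual schema representation.

```python
# Sketch of last-timestamp-wins schema reconciliation: when two nodes hold
# divergent definitions of the same table, the one with the newer timestamp
# wins cluster-wide. The (hash, ts) pairs mirror the example above.

def reconcile(local, remote):
    """Each argument is a schema version like
    {"hash": ..., "ts": ..., "columns": [...]}."""
    return remote if remote["ts"] > local["ts"] else local

node_a = {"hash": "a81e", "ts": 1609420790, "columns": ["a"]}
node_b = {"hash": "2fa3", "ts": 1609420792, "columns": ["a", "b"]}

winner = reconcile(node_a, node_b)
# Node B's definition carries the later timestamp, so both nodes
# converge on hash 2fa3, i.e. table e (a int, b int).
```

Note the rule is symmetric: whichever side initiates the merge, the newer definition survives, which is how eventual convergence is reached without coordination.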
INSERT/UPDATE when schemas differ
▪ Each data request carries a schema version
▪ Missing versions can be pulled from peers
Node A (a81e):
> INSERT INTO e (a) VALUES (1);
hash: 2fa3, row: (1, null)
Node B (2fa3):
> INSERT INTO e (a) VALUES (1);
hash: 2fa3, row: (1, null)
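The version pull described above can be sketched like this. `SchemaRegistry` and its `resolve` method are hypothetical names, not Scylla internals.

```python
# Sketch of schema-versioned data requests: each write is stamped with a
# schema version hash; a replica that doesn't know that version pulls the
# definition from the peer before applying the write.

class SchemaRegistry:
    def __init__(self, versions):
        self.versions = dict(versions)   # version hash -> column list

    def resolve(self, version_hash, peer):
        if version_hash not in self.versions:
            # Missing version: fetch the definition from the sending peer.
            self.versions[version_hash] = peer.versions[version_hash]
        return self.versions[version_hash]

node_a = SchemaRegistry({"a81e": ["a"]})
node_b = SchemaRegistry({"2fa3": ["a", "b"]})

# Node A receives a write stamped with version 2fa3 and pulls it from B,
# so it can store the row under the newer schema (hence row (1, null)).
columns = node_a.resolve("2fa3", peer=node_b)
```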
Schema changes using Raft
▪ Each node continues to store a copy of the schema
▪ A change is first persisted in a global Raft log
▪ On success, it’s applied on replicas
▪ Schema changes are now linearizable and consistent
▪ Nodes catch up with schema history during boot
Tablets
Token-based partitioning
▪ Partition key is hashed to an integer (token)
▪ Nodes own ranges of tokens
▪ Provides even distribution of data and traffic
▪ Hotspots if partitions have many clustering rows
[Diagram: partition keys (pk) a, b, c, …, u laid out by token along the ring, each with a different clustering-key (ck) footprint (1, 2, 3, …, 21, 3, 11, 1 rows); uneven footprints create hotspots.]
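As a rough sketch of the scheme above, using MD5 over a toy 300-token space purely for illustration (Scylla actually uses Murmur3 over a 64-bit token range):

```python
# Token-based partitioning: hash the partition key to a token, then map the
# token to the node owning that range. Hash choice and ranges are illustrative.

import hashlib

def token(partition_key, space=300):
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % space

def owner(tok, ranges):
    """ranges: list of (start, end_inclusive, node) tuples."""
    for start, end, node in ranges:
        if start <= tok <= end:
            return node
    raise ValueError("token outside all ranges")

ranges = [(0, 99, "Node A"), (100, 199, "Node B"), (200, 299, "Node C")]
node = owner(token("john_doe"), ranges)   # always the same node for this key
```

Because the hash spreads keys uniformly, data and traffic distribute evenly across nodes; but the scheme says nothing about how many clustering rows live under one partition key, which is where the hotspots come from.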
Tablet partitioning
▪ Tablet is a new kind of partition
▪ It stores a primary key range, not a single partition key
▪ Tablet ranges are subject to dynamic load balancing
▪ Size of each tablet is configurable (e.g. 64MB)
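The range-based lookup a tablet map needs can be sketched with a binary search over sorted range boundaries. `TabletMap` is a hypothetical structure, not Scylla's.

```python
# Tablet partitioning sketch: each tablet owns a contiguous token range, so a
# sorted boundary list makes routing a binary search. Splitting a tablet when
# it exceeds a size target (e.g. 64MB) is what enables dynamic rebalancing.

import bisect

class TabletMap:
    def __init__(self, boundaries, tablets):
        # boundaries[i] is the first token of tablets[i+1];
        # e.g. [100, 200] splits tokens into [0,100), [100,200), [200,...).
        self.boundaries = boundaries
        self.tablets = tablets

    def tablet_for(self, token):
        return self.tablets[bisect.bisect_right(self.boundaries, token)]

tmap = TabletMap([100, 200],
                 ["Tablet 1 @ Node A", "Tablet 2 @ Node B", "Tablet 3 @ Node C"])
```

Rebalancing then amounts to moving a boundary or reassigning one tablet's replicas, without rehashing every key the way a change to the token ring would.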
Raft for Tablets
▪ Manageable number of Raft groups (~100,000)
▪ No client-side timestamps
▪ Provides isolation for ALL queries
▪ Writes do not require a read
▪ No need to repair
▪ Strong consistency of materialized views
Strong consistency by default
Raft in Scylla: summary
▪ Raft extended to efficiently support many groups
▪ Raft and Tablet partitioning = fast strong consistency
▪ Linearizable, more powerful schema and topology changes
▪ High Availability and partition tolerance of Cassandra are
mostly unaffected
Thank You
Konstantin Osipov
@kostja_osipov
kostja@scylladb.com
Experience Scylla for Yourself
Download Scylla Open Source: scylladb.com/download
Talk to an expert: scylladb.com/consultation
Take a test drive: scylladb.com/test-drive
Eventually, Scylla Chooses Consistency
Editor's notes
  1. Hi, This talk is about Raft in Scylla - our effort to improve a lot of existing Cassandra functionality and add new strongly consistent features.
  2. I’m Konstantin Osipov, I live in Moscow and work on open source databases. In Scylla I’ve been involved with implementation of lightweight transactions.
  3. Before discussing Raft, let’s recap the items we delivered recently. Back at Scylla Summit 2019 we announced support for Cassandra lightweight transactions. Lightweight transactions allow all clients to agree on the state of the database before making a change to it. Prior to that, Scylla lacked any strongly consistent features. We made a considerable effort testing LWT, and just recently completed industry-standard Jepsen testing for it.
  4. In Scylla, LWT is based on the Paxos consensus algorithm. Paxos is a leaderless protocol in which each participant stores little state; this was an advantage because, to be compatible with Cassandra, Scylla needed each partition to be independently available. Paxos runs 3 rounds of network messages to commit each transaction. This is 1 round trip less than Cassandra, but still more than necessary in the optimal case. An important property of LWT is that it works over existing tables and alongside eventually consistent operations. If LWT is not used, the overhead on the rest of the operations is zero. This is the payoff for a fairly high implementation cost. We mentioned at the 2019 Summit that Scylla is committed to providing an optimized implementation of strongly consistent reads and writes based on Raft. In this talk I will discuss our progress with Raft and what else we’re going to improve using it.
  5. So what is Raft anyway? It is a leader-based log replication protocol. A very crude explanation of what Raft does: it elects a leader once, and then the leader is responsible for making all the decisions about the state of the database. This helps avoid extra communication between replicas during individual reads and writes. Each node keeps track of who the current leader is and forwards requests to the leader. Scylla clients are unaffected, except that the leader now does more work than the replicas, so the load distribution may be less even. This means Scylla will need to run multiple Raft instances side by side.
  6. Raft is built around the notion of a replicated log. When the leader receives a request, it first stores an entry for it in its own log. Then it pushes the entry to the replicas’ copies of the log. Once a majority of replicas store the entry, the leader applies it and instructs the replicas to do the same. In the event of leader failure, the replica with the most up-to-date log becomes the new leader.
  7. Raft defines not only how a group makes a decision, but also the protocol for adding new members to the group and removing existing ones. This lays a solid foundation for Scylla topology changes: assuming there is a Raft group spanning all of the nodes, they translate naturally to Raft configuration changes and no longer need a proprietary protocol.
  8. Schema changes translate to simply storing a command in the global Raft log and then applying the change on each node that has a copy of the log.
  9. Because of the additional state (the current leader) stored at each peer, it’s not as straightforward to apply Raft to Scylla data manipulation statements. Maintaining a separate leader for each partition would be too much overhead, considering that individual partition updates may be rare. This is why Scylla, alongside Raft, is working on a new partitioner that reduces the total number of partitions (while still keeping it high enough to guarantee even distribution of data and work) and allows balancing data between partitions more flexibly. For each such partition, called a tablet, Scylla will run its own instance of the Raft algorithm. In the rest of the talk I will discuss these 3 applications of Raft in more detail.
  10. Let’s begin with the subject of topology changes and discuss how Raft could be used to improve it.
  11. Presently, topology changes in Scylla are eventually consistent. Let’s use node addition as an example. A node wishing to join the cluster advertises itself to the rest of the members through Gossip. For those of you not familiar with the way Gossip works, it’s a great protocol for distributing infrequently changing information at low cost. It’s very commonly used for failure detection: healthy clusters enjoy the low network overhead induced by a failure detector, and the state of a faulty node spreads across the cluster reasonably quickly; a few to several seconds would be a typical interval. Knowing that Gossip is not too fast, the joining node waits (by default) 30s to let the news spread. Nodes begin forwarding relevant updates to the new node once they are aware of it. With updates coming in, the node can start data rebalancing. Node removal or decommission works similarly, except the node spreading the rumour (aka the change coordinator) is not necessarily the same node the rumour is about (just as we are used to in real life). This poses some challenges. The actions performed by the change coordinator are unilateral, and assume the operator avoids starting a conflicting change concurrently. The joining node will proceed after the 30s interval even if one of the nodes in the cluster is down and did not get the news about the new member. Such nodes, once back online, will continue serving queries using the old topology until Gossip messages reach them. A repair will then be necessary to restore the configured data replication factor. If a joining node dies mid-way, the data ranges it added will remain in the cluster topology, and the operator will need to clean them up manually before proceeding with the next change. Since the procedure relies on a fairly slow vehicle to spread the information, it’s hard to split into multiple steps.
When we at Scylla discuss how to add multiple nodes concurrently, we consider breaking a single topology change action into smaller, persistent and resumable steps, such as first adding an empty node, then assigning it some data ranges, then actually moving these ranges. Having to wait 30s for each step to settle in through Gossip is not very practical.
  12. Raft handles these challenges by making topology changes (called configuration changes there) part of the protocol core. This part of the Raft protocol is also widely adopted and has gone under extensive scrutiny, so it should naturally be preferred to Scylla’s proprietary solution inherited from Cassandra. Raft treats a topology change much like a standard strongly consistent read or write: the change is done by appending two records to the distributed Raft log. The first record introduces the new topology to the cluster. From the moment it is appended to the leader’s log, and until it is replicated to a majority of nodes, the cluster takes the new topology (e.g. a new node) into account for all reads and writes, but doesn’t abandon the old topology yet - it is consulted as well. Once a majority of replicas have received the new topology, the leader appends the second record, which tells replicas it is now safe to discard the old topology and fully switch to the new one. This two-step procedure ensures that no two parts of the cluster ever operate in two disjoint configurations: in the worst case, some nodes may still be using the joint topology and the old one, or the joint topology and the new one, both of which are safe - but never the old one alone against the new one alone. With Raft, a Scylla topology change can be split into multiple steps: first, add the new node to the global Raft group configuration, using the procedure just described; then, commit a record with the new node’s tokens to token_metadata, linearized with all other topology updates; then, stream ranges to the added node, updating the state of each range as it is streamed. Since all the steps are linearized through the Raft log, it becomes possible to permit concurrent topology changes, as long as they don’t conflict.
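The quorum rule behind the two-record (joint consensus) procedure can be shown in a few lines. This is a minimal sketch of the idea, not Scylla’s Raft library - the class and method names are made up for illustration:

```python
# Sketch of Raft joint consensus: a configuration change commits two log
# records - first the joint configuration (old + new voter sets), then the
# new configuration alone. During the joint phase a decision needs a
# majority in BOTH sets, so the old and new configs can never decide apart.

class RaftConfig:
    def __init__(self, voters):
        self.old = None              # previous voter set, during a joint phase
        self.voters = set(voters)

    def enter_joint(self, new_voters):
        """Record 1: start requiring quorum in both configurations."""
        self.old = set(self.voters)
        self.voters = set(new_voters)

    def leave_joint(self):
        """Record 2: the joint entry is committed, drop the old config."""
        self.old = None

    def has_quorum(self, alive):
        def majority(s):
            return len(s & alive) * 2 > len(s)
        if self.old is not None:
            return majority(self.old) and majority(self.voters)
        return majority(self.voters)

cfg = RaftConfig({"A", "B", "C"})
cfg.enter_joint({"A", "B", "C", "D"})   # add node D
assert cfg.has_quorum({"A", "B", "C"})  # majority of both {A,B,C} and {A,B,C,D}
assert not cfg.has_quorum({"A", "B"})   # enough for the old config alone, not joint
cfg.leave_joint()
assert cfg.has_quorum({"A", "C", "D"})  # 3 of 4 in the new configuration
```

The key property is visible in the asserts: while the change is in flight, no subset of nodes can reach a decision using only the old or only the new membership.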
The only conceivable downside is that if a majority of the cluster nodes are down, it may not be possible to perform topology changes at all. Scylla will need to provide an emergency-brake instrument to recover clusters that are this significantly degraded. One possible solution is directly editing topology information on the remaining nodes, to let them continue in the state that remains.
  13. Schema changes are operations such as creating and dropping keyspaces, tables, user-defined types or functions. If implemented on top of Raft, they also benefit from linearizability.
  14. Currently, schema changes in Scylla are eventually consistent. Each Scylla node has its own copy of the schema. A request to change the schema is validated against the local copy and then applied locally; a new schema object can be used on that node immediately, before any other cluster node knows about it. There is no coordination between changes at different nodes, and any node is free to propose a change. The change is eventually propagated to the rest of the cluster, and a last-timestamp-wins rule is used to resolve conflicts if two changes to the same object happened concurrently.
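The last-timestamp-wins reconciliation can be sketched as a per-object merge. This is a toy model of the rule, not Scylla’s schema-merge code - the schema map shape `{object: (timestamp, definition)}` is invented for the example:

```python
# Illustrative last-timestamp-wins merge of two schema copies.
# For each schema object, the definition with the higher timestamp wins,
# so any two nodes converge regardless of the order they exchange updates.

def merge_schema(local, remote):
    merged = dict(local)
    for obj, (ts, defn) in remote.items():
        if obj not in merged or ts > merged[obj][0]:
            merged[obj] = (ts, defn)
    return merged

node_a = {"e": (1609420790, "CREATE TABLE e (a int)")}
node_b = {"e": (1609420795, "CREATE TABLE e (a int, b int)")}

# Merge is symmetric: both nodes end up with the newer definition.
assert merge_schema(node_a, node_b) == merge_schema(node_b, node_a)
assert merge_schema(node_a, node_b)["e"][1] == "CREATE TABLE e (a int, b int)"
```

Note what the rule does *not* give you: it picks a winner, but it cannot tell whether the losing change was semantically compatible with the winner - which is exactly the gap the next slides address.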
  15. Data manipulation is aware of possible schema inconsistency: each request carries a schema version with it. Scylla is able to execute requests against a divergent schema history, fetching the particular schema version needed to execute a request. This guarantees that schema changes are fully available in the presence of network failures. It has some downsides as well: it is possible to submit changes that conflict, e.g. define a table based on a UDT while concurrently dropping that UDT, and new features, such as triggers, stored functions and UDFs, aggravate the consistency problem.
  16. After switching schema changes to Raft, any node will still be able to propose a change. However, the change will now be forwarded to the Raft leader, where it will be validated against the latest version of the schema. Then the leader will persist it in a global Raft log, replicated to all nodes of the cluster. Once a majority of replicas confirm persisting their copy of the log, the change will be applied on all replicas. With this approach, all schema changes will form a linear history and divergent or conflicting changes will be impossible. It should open the way to complex but safe dependencies between schema objects, i.e. triggers, constraints or functional indexes. A replica which was down while the cluster has been performing schema changes will catch up with them on boot, by streaming the missed history of changes from the leader. There is also a downside: it will no longer be possible to perform a schema change if the majority of the cluster is unreachable or down. It is still possible that a node gets a request for a schema version it did not see yet, and will need to fetch it. For older schemas we will maintain a version history; for newer schemas, we will need to make sure that the history can be fetched from any node, not just the leader. https://docs.google.com/presentation/d/1ZazssA802_bUHcJKy7yPUbiVby8acFxbebf-VbmXRDk/edit#slide=id.ga3bc8bcbea_0_131
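A sketch of why funneling changes through one leader makes conflicts impossible: validation always runs against the latest committed schema, so a change that contradicts the current state is rejected before it ever enters the log. The `SchemaLeader` class below is invented for illustration (the real log is replicated; here it is just a list):

```python
# Hedged sketch: schema changes validated by a single leader against the
# latest schema, then appended to a log. The log (here a plain list) stands
# in for the replicated Raft log; a record would only count as applied once
# a majority of replicas persisted it.

class SchemaLeader:
    def __init__(self):
        self.log = []        # ordered, linear history of schema changes
        self.tables = set()  # current schema state

    def propose(self, op, table):
        # Validation against the *latest* schema: divergent histories
        # (e.g. two concurrent CREATEs of the same table) cannot happen.
        if op == "create" and table in self.tables:
            raise ValueError(f"table {table} already exists")
        if op == "drop" and table not in self.tables:
            raise ValueError(f"no such table {table}")
        self.log.append((op, table))
        if op == "create":
            self.tables.add(table)
        else:
            self.tables.discard(table)

leader = SchemaLeader()
leader.propose("create", "e")
try:
    leader.propose("create", "e")   # conflicting change is rejected up front
except ValueError:
    pass
leader.propose("drop", "e")
assert leader.log == [("create", "e"), ("drop", "e")]
```

Contrast this with the last-timestamp-wins world, where both conflicting CREATEs would be accepted locally and reconciled only later.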
  17. Finally, the ultimate feature enabled by Raft is fast and efficient yet strongly consistent tables. “Tablet” is a term for a database unit of data distribution and load balancing, first introduced in Google’s Bigtable paper from 2006. Let’s see how they work.
  18. Today, Scylla’s partitioning strategy is not pluggable. Compare this with replication strategy: you can change how many replicas a keyspace has and where these replicas are located, and you can use QUORUM/LOCAL_QUORUM and SERIAL/LOCAL_SERIAL to work efficiently in a cross-DC setup. The partitioner is not like that: all you can choose is what makes up the partition key. The key is always hashed to a token, and the token is mapped to a replica set and shard. Thanks to hashing and the use of vnodes (token ranges), data is evenly distributed across the cluster: most write and read scenarios produce even load on all nodes, and hotspots, while possible, are unlikely. Unfortunately, one size still cannot fit all. Using the same partitioner for all tables can be rather a hindrance if there are a lot of small tables which are frequently scanned; frequent range scans also require an extra step of merging streams produced by multiple nodes; and certain partitions tend to get hot no matter how good the choice of the partition key is. https://docs.google.com/document/d/1flYRliD-VXNlrdPR2IT_rswXRW_55CySlXnEcw7qqtY/edit#heading=h.ly4c9p67vgne https://docs.google.com/presentation/d/1Pm1hIGza4RcSEzlV_bRSYv9AmUyAGRv6cuNmVuEmt9g/edit#slide=id.g51b14e1223_0_432
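The key-to-token-to-replica path can be modeled in a few lines. This is a simplified ring model, not Scylla’s actual Murmur3 partitioner - MD5 stands in as an arbitrary uniform hash, and the ring layout is invented for the demo:

```python
# Toy model of token-based partitioning: hash a partition key to a token,
# then find the owning vnode by walking the sorted token ring clockwise.

import bisect
import hashlib

def token(key):
    # Any uniform hash works for the illustration (Scylla uses Murmur3).
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % 2**64

class Ring:
    def __init__(self, vnodes):
        # vnodes: (token, node) pairs; each node typically owns many vnodes.
        self.vnodes = sorted(vnodes)
        self.tokens = [t for t, _ in self.vnodes]

    def replica(self, key):
        # A key belongs to the first vnode at or after its token, wrapping.
        i = bisect.bisect_left(self.tokens, token(key)) % len(self.vnodes)
        return self.vnodes[i][1]

ring = Ring([(2**62, "A"), (2**63, "B"), (3 * 2**62, "C")])
# Hashing spreads even sequential keys over all nodes of the cluster.
owners = {ring.replica(f"user-{i}") for i in range(1000)}
assert owners == {"A", "B", "C"}
```

This even spread is exactly what makes hash partitioning a poor fit for small, frequently range-scanned tables: a scan must touch every node and merge the streams.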
  19. So in Scylla, we would like to make the partitioning strategy a user choice, like the replication factor is today. If a user chooses tablet partitioning, Scylla will store small tables using just a few tablets. Large tablets will be automatically split, and small tablets coalesced, if necessary. Other databases that support range-based partitioners include MongoDB, Couchbase, Cockroach…
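A range-partitioned table with automatic splitting can be sketched as follows. This illustrates the general tablet idea under invented names and a toy split threshold, not Scylla’s planned implementation:

```python
# Hypothetical sketch of range-based tablets: each tablet owns a contiguous
# key range, a small table fits in one tablet, and a tablet that grows past
# a threshold is split at its median key.

class Tablet:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi    # half-open key range [lo, hi)
        self.rows = {}

class TabletTable:
    MAX_ROWS = 2                     # tiny split threshold for the demo

    def __init__(self):
        self.tablets = [Tablet(0, 10**9)]   # a small table: one tablet

    def find(self, key):
        for t in self.tablets:
            if t.lo <= key < t.hi:
                return t

    def insert(self, key, value):
        t = self.find(key)
        t.rows[key] = value
        if len(t.rows) > self.MAX_ROWS:
            self.split(t)

    def split(self, t):
        mid = sorted(t.rows)[len(t.rows) // 2]
        left, right = Tablet(t.lo, mid), Tablet(mid, t.hi)
        for k, v in t.rows.items():
            (left if k < mid else right).rows[k] = v
        i = self.tablets.index(t)
        self.tablets[i:i + 1] = [left, right]

table = TabletTable()
for k in (5, 100, 42, 7):
    table.insert(k, "v")
assert len(table.tablets) == 2       # the overgrown tablet was split once
```

Because a tablet is a contiguous range, a range scan over a small table touches one tablet on one node - no cross-node stream merging.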
  20. Tables partitioned using tablets will work efficiently with Raft. When Raft is used, a change is stored in the log before it’s applied to the table, so no repair in the Cassandra sense is needed - we may still want to “repair” (i.e. sync up) the logs between replicas, but the base tables will stay consistent at all times. This addresses the problem of consistency of derived data, which has been open in Cassandra for a long time (many of you who track Cassandra development are familiar with the materialized view consistency issues).
  21. Original Raft knows nothing about partitions, tokens or shards. It is an abstract algorithm describing replication of an abstract state machine. In Scylla, we have more than one state machine (schema information, topology information, and then each tablet and its replica set is an independent Raft instance), so we want to run many copies of the Raft algorithm simultaneously. This poses new challenges: how do we spawn new copies consistently? How much state will each instance take? Can we share the overhead of the algorithm, such as the cost of distributed failure detection, between Raft instances? Where do we store the Raft replication log, and can we avoid the overhead of double logging (Raft log plus commit log)? Can we make these decisions configurable, depending on the desired balance of performance and ease of use? We have already addressed many of these issues in Scylla Raft - a reusable library which supports joint-consensus configuration changes, and pluggable state machines, logging and failure detection. We’re working on rebuilding Scylla schema changes on top of it. The first user-visible impact of the effort is expected in the upcoming year. Stay tuned.
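One of the sharing opportunities mentioned above - a single failure detector serving many Raft groups - can be sketched briefly. All names here are hypothetical; the point is only that per-group heartbeats are replaced by one cluster-wide liveness source:

```python
# Sketch of "many Raft groups, one failure detector": each group consults a
# shared, cluster-wide detector instead of running its own heartbeat stream,
# so the heartbeat cost does not multiply with the number of tablets.

class FailureDetector:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def is_alive(self, node, now):
        return now - self.last_seen.get(node, float("-inf")) < self.timeout

class RaftGroup:
    def __init__(self, name, members, detector):
        self.name, self.members = name, set(members)
        self.detector = detector     # shared across all groups

    def alive_members(self, now):
        return {n for n in self.members if self.detector.is_alive(n, now)}

fd = FailureDetector(timeout=1.0)
for node in ("A", "B", "C"):
    fd.heartbeat(node, now=0.0)      # one heartbeat stream per node

schema = RaftGroup("schema", {"A", "B", "C"}, fd)
tablet1 = RaftGroup("tablet-1", {"A", "B"}, fd)
assert schema.alive_members(now=0.5) == {"A", "B", "C"}
assert tablet1.alive_members(now=2.0) == set()   # timed out for every group at once
```

A thousand tablet groups on the same three nodes would all read liveness from `fd` rather than exchanging a thousand sets of heartbeats.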