SlideShare ist ein Scribd-Unternehmen logo
1 von 65
1© 2018 All rights reserved.
Distributed PostgreSQL
with YugaByte DB
Karthik Ranganathan
PostgresConf Silicon Valley
Oct 16, 2018
2© 2018 All rights reserved.
CHECKOUT THIS REPO:
github.com/YugaByte/yb-sql-workshop
3© 2018 All rights reserved.
About Us
Kannan Muthukkaruppan, CEO
Nutanix ♦ Facebook ♦ Oracle
IIT-Madras, University of California-Berkeley
Karthik Ranganathan, CTO
Nutanix ♦ Facebook ♦ Microsoft
IIT-Madras, University of Texas-Austin
Mikhail Bautin, Software Architect
ClearStory Data ♦ Facebook ♦ D.E.Shaw
Nizhny Novgorod State University, Stony Brook
 Founded Feb 2016
 Apache HBase committers and early engineers on Apache Cassandra
 Built Facebook’s NoSQL platform powered by Apache HBase
 Scaled the platform to serve many mission-critical use cases
• Facebook Messages (Messenger)
• Operational Data Store (Time series Data)
 Reassembled the same Facebook team at YugaByte along with
engineers from Oracle, Google, Nutanix and LinkedIn
Founders
4© 2018 All rights reserved.
WORKSHOP AGENDA
• What is YugaByte DB? Why Another DB?
• Exercise 1: BI Tools on YugaByte PostgreSQL
• Exercise 2: Distributed PostgreSQL Architecture
• Exercise 3: Sharding and Scale Out in Action
• Exercise 4: Fault Tolerance in Action
5© 2018 All rights reserved.
WHAT IS
YUGABYTE DB?
6© 2018 All rights reserved.
A transactional, planet-scale database
for building high-performance cloud services.
7© 2018 All rights reserved.
NoSQL + SQL Cloud Native
8© 2018 All rights reserved.
WHY ANOTHER DB?
9© 2018 All rights reserved.
Typical Stack Today
Fragile infra with several moving parts
Datacenter 1
SQL Master SQL Slave
Application Tier (Stateless Microservices)
Datacenter 2
SQL for OLTP data
Manual sharding
Cost: dev team
Manual replication
Manual failover
Cost: ops team
NoSQL for other data
App aware of data silo
Cost: dev team
Cache for low latency
App does caching
Cost: dev team
Data inconsistency/loss
Fragile infra
Hours of debugging
Cost: dev + ops team
10© 2018 All rights reserved.
Does AWS change this?
Datacenter 1
SQL Master SQL Slave
Datacenter 2
Elasticache
Aurora
DynamoDB
Still Complex
it’s the same architecture
Application Tier (Stateless Microservices)
11© 2018 All rights reserved.
Not Portable
Not Portable
Open Source
Not Portable
Open Source
Open Source
High Performance, Transactional, Planet-Scale High Performance, Transactional, Planet-Scale
High Performance, Transactional, Planet-Scale High Performance, Transactional, Planet-Scale
System-of-Record DBs for Global Apps
12© 2018 All rights reserved.
TRANSACTIONAL PLANET-SCALEHIGH PERFORMANCE
Single Shard & Distributed ACID Txns
Document-Based, Strongly
Consistent Storage
Low Latency, Tunable Reads
High Throughput
OPEN SOURCE
Apache 2.0
Popular APIs Extended
Apache Cassandra, Redis and PostgreSQL (BETA)
Auto Sharding & Rebalancing
Global Data Distribution
Design Principles
CLOUD NATIVE
Built For The Container Era
Self-Healing, Fault-Tolerant
13© 2018 All rights reserved.
EXERCISE #1
BUSINESS INTELLIGENCE
14© 2018 All rights reserved.
EXERCISE #2
DISTRIBUTED POSTGRES:
ARCHITECTURE
15© 2018 All rights reserved.
ARCHITECTURE
Overview
16© 2018 All rights reserved.
YugaByte DB Process Overview
• Universe = cluster of nodes
• Two sets of processes: YB-Master & YB-TServer
• Example universe
4 nodes
rf=3
17© 2018 All rights reserved.
Sharding data
• User table split into tablets
18© 2018 All rights reserved.
One tablet for every key
19© 2018 All rights reserved.
Tablets and replication
• Tablet = set of tablet-peers in a RAFT group
• Num tablet-peers in tablet = replication factor (RF)
Tolerate 1 failure : RF=3
Tolerate 2 failures: RF=5
20© 2018 All rights reserved.
YB-TServer
• Process that does IO
• Hosts tablet for tables
• Hosts transaction manager
• Auto memory sizing
Block cache
Memstores
21© 2018 All rights reserved.
YB-Master
• Not in critical path
• System metadata store
Keyspaces, tables, tablets
Users/roles, permissions
• Admin operations
Create/alter/drop of tables
Backups
Load balancing (leader and data balancing)
Enforces data placement policy
22© 2018 All rights reserved.
HANDLING DDL STATEMENTS
23© 2018 All rights reserved.
DDL Statements in PostgreSQL
DDL Postman
(Authentication, authorization)
Rewriter
Planner/Optimizer
Executor
DISK
Create Table Data File
Update System Tables
24© 2018 All rights reserved.
DDL Statements in YugaByte DB PostgreSQL
DDL Postman
(Authentication, authorization)
Rewriter
Planner/Optimizer
Executor
Create sharded, replicated table as data source
Store Table Metadata in YB-Master (in works)
YugaByte
master3
…
YugaByte
master2
YugaByte
master1
25© 2018 All rights reserved.
YugaByte Query Layer (YQL)
• Stateless, runs in each YB-TServer process
GA Goal:
Distributed
Stateless
PostgreSQL Layer
Current Beta uses
a single Stateless
PostgreSQL Layer
26© 2018 All rights reserved.
HANDLING DML QUERIES
27© 2018 All rights reserved.
DDL Queries in PostgreSQL
QUERY Postman
(Authentication, authorization)
Rewriter
Planner/Optimizer
Executor
WAL Writer BG Writer…
DISK
FDW
Local Table Code Path
EXTERNAL
DATABASE
28© 2018 All rights reserved.
DML Queries in YugaByte DB PostgreSQL
DML Postman
(Authentication, authorization)
Rewriter
Planner/Optimizer
Executor
FDW
YugaByte DB Code Path
YB Gateway
EXTERNAL
DATABASE
YugaByte
node3
YugaByte
node4
…
YugaByte
node2
YugaByte
node1
Using FDW as a
Table Storage API
29© 2018 All rights reserved.
ARCHITECTURE
Data Persistence
30© 2018 All rights reserved.
Data Persistence in DocDB
• DocDB is YugaByte DB’s LSM storage engine
• Persistent key to document store
• Extends and enhances RocksDB
• Designed to support high data-densities per node
31© 2018 All rights reserved.
DocDB: Key-to-Document Store
• Document key
CQL/SQL/Redis primary key
• Document value
a CQL or SQL row
Redis data structure
• Fine-grained reads and writes
32© 2018 All rights reserved.
DocDB Data Format
Example Insert
Encoding
33© 2018 All rights reserved.
Some of the RocksDB enhancements
• WAL and MVCC enhancements
o Removed RocksDB WAL, re-uses Raft log
o MVCC at a higher layer
o Coordinate RocksDB memstore flushing and Raft log garbage collection
• File format changes
o Sharded (multi-level) indexes and Bloom filters
• Splitting data blocks & metadata into separate files for tiering support
• Separate queues for large and small compactions
34© 2018 All rights reserved.
More Enhancements to RocksDB
• Data model aware Bloom filters
• Per-SSTable key range metadata to optimize range queries
• Server-global block caches & memstore limits
• Scan-resistant block cache (single-touch and multi-touch)
35© 2018 All rights reserved.
ARCHITECTURE
Data Replication
36© 2018 All rights reserved.
Raft Replication for Consistency
37© 2018 All rights reserved.
How Raft Replication Works
38© 2018 All rights reserved.
How Raft Replication Works
39© 2018 All rights reserved.
How Raft Replication Works
40© 2018 All rights reserved.
How Raft Replication Works
41© 2018 All rights reserved.
Raft Related Enhancements
• Leader Leases
• Multiple Raft groups (1 per tablet)
• Leader Balancing
• Group Commits
• Observer Nodes / Read Replicas
42© 2018 All rights reserved.
ARCHITECTURE
Transactions
43© 2018 All rights reserved.
Single Shard Transactions
Raft Consensus Protocol
. . .
INSERT INTO table (k, v) VALUES (‘k1’, ‘v1’) Lock Manager
(in memory, on leader only)
Acquire a lock on x
DocDB / RocksDB
Read current value of x
Submit a Raft operation for replication:
Insert (k1, v1) at hybrid_time 100
Raft log
Tablet
follower
Tablet
follower
Replicate to
majority of
tablet peers
Apply to RocksDB and
release lock
k1,v1
@ht=100
1
2
5
3
4
44© 2018 All rights reserved.
MVCC for Lockless Reads
• Achieved through HybridTime (HT)
Monotonically increasing timestamp
• Allows reads at a particular HT without locking
• Multiple versions may exist temporarily
Reclaim older values during compactions
45© 2018 All rights reserved.
Single Shard Transactions
• Each tablet maintains a “safe time” for reads
o Highest timestamp such that the view as of that timestamp is fixed
o In the common case it is just before the hybrid time of the next
uncommitted record in the tablet
46© 2018 All rights reserved.
Distributed Transactions
• Fully decentralized architecture
• Every tablet server can act as a Transaction Manager
• A distributed Transaction Status table
Tracks state of active transactions
• Transactions can have 3 states:
pending, committed, aborted
47© 2018 All rights reserved.
Distributed Transactions – Write Path
48© 2018 All rights reserved.
Distributed Transactions – Write Path Step 1: Client request
49© 2018 All rights reserved.
Distributed Transactions – Write Path Step 2: Create status record
50© 2018 All rights reserved.
Distributed Transactions – Write Path Step 2: Create status record
51© 2018 All rights reserved.
Distributed Transactions – Write Path Step 3: Write provisional records
52© 2018 All rights reserved.
Distributed Transactions – Write Path Step 4: Atomic commit
53© 2018 All rights reserved.
Distributed Transactions – Write Path Step 5: Respond to client
54© 2018 All rights reserved.
Distributed Transactions – Write Path Step 6: Apply provisional records
55© 2018 All rights reserved.
Isolation Levels
• Currently Snapshot Isolation is supported
o Write-write conflicts detected when writing provisional records
• Serializable isolation (roadmap)
o Reads in RW txns also need provisional records
• Read-only transactions are always lock-free
56© 2018 All rights reserved.
Clock Skew and Read Restarts
• Need to ensure the read timestamp is high enough
o Committed records the client might have seen must be visible
• Optimistically use current Hybrid Time, re-read if necessary
o Reads are restarted if a record with a higher timestamp that the client
could have seen is encountered
o Read restart happens at most once per tablet
o Relying on bounded clock skew (NTP, AWS Time Sync)
• Only affects multi-row reads of frequently updated records
57© 2018 All rights reserved.
Distributed Transactions – Read Path
58© 2018 All rights reserved.
Distributed Transactions – Read Path Step 1: Client request; pick ht_read
59© 2018 All rights reserved.
Distributed Transactions – Read Path Step 2: Read from tablet servers
60© 2018 All rights reserved.
Distributed Transactions – Read Path Step 3: Resolve txn status
61© 2018 All rights reserved.
Distributed Transactions – Read Path Step 4: Respond to YQL Engine
62© 2018 All rights reserved.
Distributed Transactions – Read Path Step 5: Respond to client
63© 2018 All rights reserved.
Distributed Transactions – Conflicts & Retries
• Every transaction is assigned a random priority
• In a conflict, the higher-priority transaction wins
o The restarted transaction gets a new random priority
o Probability of success quickly increases with retries
• Restarting a transaction is the same as starting a new one
• A read-write transaction can be subject to read-restart
64© 2018 All rights reserved.
EXERCISE #3 and #4
SHARDING AND SCALE OUT
FAULT TOLERANCE
65© 2018 All rights reserved.
Questions?
Try it at
docs.yugabyte.com/latest/quick-start

Weitere ähnliche Inhalte

Was ist angesagt?

Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureScyllaDB
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Noritaka Sekiyama
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
PostgreSQL HA
PostgreSQL   HAPostgreSQL   HA
PostgreSQL HAharoonm
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introductionleanderlee2
 
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Altinity Ltd
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Distributed SQL Databases Deconstructed
Distributed SQL Databases DeconstructedDistributed SQL Databases Deconstructed
Distributed SQL Databases DeconstructedYugabyte
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
Understanding PostgreSQL LW Locks
Understanding PostgreSQL LW LocksUnderstanding PostgreSQL LW Locks
Understanding PostgreSQL LW LocksJignesh Shah
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBDistributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBYugabyteDB
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemDatabricks
 

Was ist angesagt? (20)

The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 
Under the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database ArchitectureUnder the Hood of a Shard-per-Core Database Architecture
Under the Hood of a Shard-per-Core Database Architecture
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
Hadoop/Spark で Amazon S3 を徹底的に使いこなすワザ (Hadoop / Spark Conference Japan 2019)
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
PostgreSQL HA
PostgreSQL   HAPostgreSQL   HA
PostgreSQL HA
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Apache doris (incubating) introduction
Apache doris (incubating) introductionApache doris (incubating) introduction
Apache doris (incubating) introduction
 
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
Introduction to the Mysteries of ClickHouse Replication, By Robert Hodges and...
 
Apache Kafka Best Practices
Apache Kafka Best PracticesApache Kafka Best Practices
Apache Kafka Best Practices
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Distributed SQL Databases Deconstructed
Distributed SQL Databases DeconstructedDistributed SQL Databases Deconstructed
Distributed SQL Databases Deconstructed
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Understanding PostgreSQL LW Locks
Understanding PostgreSQL LW LocksUnderstanding PostgreSQL LW Locks
Understanding PostgreSQL LW Locks
 
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DBDistributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
Distributed Databases Deconstructed: CockroachDB, TiDB and YugaByte DB
 
The Apache Spark File Format Ecosystem
The Apache Spark File Format EcosystemThe Apache Spark File Format Ecosystem
The Apache Spark File Format Ecosystem
 

Ähnlich wie How YugaByte DB Implements Distributed PostgreSQL

YugaByte + PKS CloudFoundry Meetup 10/15/2018
YugaByte + PKS CloudFoundry Meetup 10/15/2018YugaByte + PKS CloudFoundry Meetup 10/15/2018
YugaByte + PKS CloudFoundry Meetup 10/15/2018AlanCaldera
 
Scale Transactional Apps Across Multiple Regions with Low Latency
Scale Transactional Apps Across Multiple Regions with Low LatencyScale Transactional Apps Across Multiple Regions with Low Latency
Scale Transactional Apps Across Multiple Regions with Low LatencyYugabyte
 
A Planet-Scale Database for Low Latency Transactional Apps by Yugabyte
A Planet-Scale Database for Low Latency Transactional Apps by YugabyteA Planet-Scale Database for Low Latency Transactional Apps by Yugabyte
A Planet-Scale Database for Low Latency Transactional Apps by YugabyteCarlos Andrés García
 
A Planet-Scale Database for Low Latency Transactional Apps by Yugabyte
A Planet-Scale Database for Low Latency Transactional Apps by YugabyteA Planet-Scale Database for Low Latency Transactional Apps by Yugabyte
A Planet-Scale Database for Low Latency Transactional Apps by YugabyteVMware Tanzu
 
Running Stateful Apps on Kubernetes
Running Stateful Apps on KubernetesRunning Stateful Apps on Kubernetes
Running Stateful Apps on KubernetesYugabyte
 
times ten in-memory database for extreme performance
times ten in-memory database for extreme performancetimes ten in-memory database for extreme performance
times ten in-memory database for extreme performanceOracle Korea
 
Oracle GoldenGate Performance Tuning
Oracle GoldenGate Performance TuningOracle GoldenGate Performance Tuning
Oracle GoldenGate Performance TuningBobby Curtis
 
Times ten 18.1_overview_meetup
Times ten 18.1_overview_meetupTimes ten 18.1_overview_meetup
Times ten 18.1_overview_meetupByung Ho Lee
 
Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...
Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...
Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...Flink Forward
 
Tuning Flink For Robustness And Performance
Tuning Flink For Robustness And PerformanceTuning Flink For Robustness And Performance
Tuning Flink For Robustness And PerformanceStefan Richter
 
EDB Postgres Platform 11 Webinar
EDB Postgres Platform 11 WebinarEDB Postgres Platform 11 Webinar
EDB Postgres Platform 11 WebinarEDB
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSKimmo Kantojärvi
 
New enhancements for security and usability in EDB 13
New enhancements for security and usability in EDB 13New enhancements for security and usability in EDB 13
New enhancements for security and usability in EDB 13EDB
 
PostgreSQL as a Strategic Tool
PostgreSQL as a Strategic ToolPostgreSQL as a Strategic Tool
PostgreSQL as a Strategic ToolEDB
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processingconfluent
 
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...HostedbyConfluent
 
Leveraging Scala and Akka to build NSDb
Leveraging Scala and Akka to build NSDbLeveraging Scala and Akka to build NSDb
Leveraging Scala and Akka to build NSDbradicalbit
 
YugaByte DB on Kubernetes - An Introduction
YugaByte DB on Kubernetes - An IntroductionYugaByte DB on Kubernetes - An Introduction
YugaByte DB on Kubernetes - An IntroductionYugabyte
 

Ähnlich wie How YugaByte DB Implements Distributed PostgreSQL (20)

YugaByte + PKS CloudFoundry Meetup 10/15/2018
YugaByte + PKS CloudFoundry Meetup 10/15/2018YugaByte + PKS CloudFoundry Meetup 10/15/2018
YugaByte + PKS CloudFoundry Meetup 10/15/2018
 
Scale Transactional Apps Across Multiple Regions with Low Latency
Scale Transactional Apps Across Multiple Regions with Low LatencyScale Transactional Apps Across Multiple Regions with Low Latency
Scale Transactional Apps Across Multiple Regions with Low Latency
 
A Planet-Scale Database for Low Latency Transactional Apps by Yugabyte
A Planet-Scale Database for Low Latency Transactional Apps by YugabyteA Planet-Scale Database for Low Latency Transactional Apps by Yugabyte
A Planet-Scale Database for Low Latency Transactional Apps by Yugabyte
 
A Planet-Scale Database for Low Latency Transactional Apps by Yugabyte
A Planet-Scale Database for Low Latency Transactional Apps by YugabyteA Planet-Scale Database for Low Latency Transactional Apps by Yugabyte
A Planet-Scale Database for Low Latency Transactional Apps by Yugabyte
 
Running Stateful Apps on Kubernetes
Running Stateful Apps on KubernetesRunning Stateful Apps on Kubernetes
Running Stateful Apps on Kubernetes
 
times ten in-memory database for extreme performance
times ten in-memory database for extreme performancetimes ten in-memory database for extreme performance
times ten in-memory database for extreme performance
 
Oracle GoldenGate Performance Tuning
Oracle GoldenGate Performance TuningOracle GoldenGate Performance Tuning
Oracle GoldenGate Performance Tuning
 
Timesten Architecture
Timesten ArchitectureTimesten Architecture
Timesten Architecture
 
Times ten 18.1_overview_meetup
Times ten 18.1_overview_meetupTimes ten 18.1_overview_meetup
Times ten 18.1_overview_meetup
 
Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...
Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...
Flink Forward Berlin 2018: Stefan Richter - "Tuning Flink for Robustness and ...
 
Tuning Flink For Robustness And Performance
Tuning Flink For Robustness And PerformanceTuning Flink For Robustness And Performance
Tuning Flink For Robustness And Performance
 
EDB Postgres Platform 11 Webinar
EDB Postgres Platform 11 WebinarEDB Postgres Platform 11 Webinar
EDB Postgres Platform 11 Webinar
 
Make your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWSMake your data fly - Building data platform in AWS
Make your data fly - Building data platform in AWS
 
Novinky v Oracle Database 18c
Novinky v Oracle Database 18cNovinky v Oracle Database 18c
Novinky v Oracle Database 18c
 
New enhancements for security and usability in EDB 13
New enhancements for security and usability in EDB 13New enhancements for security and usability in EDB 13
New enhancements for security and usability in EDB 13
 
PostgreSQL as a Strategic Tool
PostgreSQL as a Strategic ToolPostgreSQL as a Strategic Tool
PostgreSQL as a Strategic Tool
 
Capital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream ProcessingCapital One Delivers Risk Insights in Real Time with Stream Processing
Capital One Delivers Risk Insights in Real Time with Stream Processing
 
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
Running Production CDC Ingestion Pipelines With Balaji Varadarajan and Pritam...
 
Leveraging Scala and Akka to build NSDb
Leveraging Scala and Akka to build NSDbLeveraging Scala and Akka to build NSDb
Leveraging Scala and Akka to build NSDb
 
YugaByte DB on Kubernetes - An Introduction
YugaByte DB on Kubernetes - An IntroductionYugaByte DB on Kubernetes - An Introduction
YugaByte DB on Kubernetes - An Introduction
 

Kürzlich hochgeladen

WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnAmarnathKambale
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsArshad QA
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfonteinmasabamasaba
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...Health
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in sowetomasabamasaba
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...SelfMade bd
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionOnePlan Solutions
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfonteinmasabamasaba
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech studentsHimanshiGarg82
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park masabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...masabamasaba
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...masabamasaba
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2
 

Kürzlich hochgeladen (20)

WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
WSO2CON 2024 - Cloud Native Middleware: Domain-Driven Design, Cell-Based Arch...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Software Quality Assurance Interview Questions
Software Quality Assurance Interview QuestionsSoftware Quality Assurance Interview Questions
Software Quality Assurance Interview Questions
 
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
Abortion Pill Prices Tembisa [(+27832195400*)] 🏥 Women's Abortion Clinic in T...
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
%in kaalfontein+277-882-255-28 abortion pills for sale in kaalfontein
 
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
+971565801893>>SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHAB...
 
%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto%in Soweto+277-882-255-28 abortion pills for sale in soweto
%in Soweto+277-882-255-28 abortion pills for sale in soweto
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
Crypto Cloud Review - How To Earn Up To $500 Per DAY Of Bitcoin 100% On AutoP...
 
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) SolutionIntroducing Microsoft’s new Enterprise Work Management (EWM) Solution
Introducing Microsoft’s new Enterprise Work Management (EWM) Solution
 
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students8257 interfacing 2 in microprocessor for btech students
8257 interfacing 2 in microprocessor for btech students
 
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park %in ivory park+277-882-255-28 abortion pills for sale in ivory park
%in ivory park+277-882-255-28 abortion pills for sale in ivory park
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
WSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go PlatformlessWSO2CON2024 - It's time to go Platformless
WSO2CON2024 - It's time to go Platformless
 

How YugaByte DB Implements Distributed PostgreSQL

  • 1. 1© 2018 All rights reserved. Distributed PostgreSQL with YugaByte DB Karthik Ranganathan PostgresConf Silicon Valley Oct 16, 2018
  • 2. 2© 2018 All rights reserved. CHECKOUT THIS REPO: github.com/YugaByte/yb-sql-workshop
  • 3. 3© 2018 All rights reserved. About Us Kannan Muthukkaruppan, CEO Nutanix ♦ Facebook ♦ Oracle IIT-Madras, University of California-Berkeley Karthik Ranganathan, CTO Nutanix ♦ Facebook ♦ Microsoft IIT-Madras, University of Texas-Austin Mikhail Bautin, Software Architect ClearStory Data ♦ Facebook ♦ D.E.Shaw Nizhny Novgorod State University, Stony Brook  Founded Feb 2016  Apache HBase committers and early engineers on Apache Cassandra  Built Facebook’s NoSQL platform powered by Apache HBase  Scaled the platform to serve many mission-critical use cases • Facebook Messages (Messenger) • Operational Data Store (Time series Data)  Reassembled the same Facebook team at YugaByte along with engineers from Oracle, Google, Nutanix and LinkedIn Founders
  • 4. 4© 2018 All rights reserved. WORKSHOP AGENDA • What is YugaByte DB? Why Another DB? • Exercise 1: BI Tools on YugaByte PostgreSQL • Exercise 2: Distributed PostgreSQL Architecture • Exercise 3: Sharding and Scale Out in Action • Exercise 4: Fault Tolerance in Action
  • 5. 5© 2018 All rights reserved. WHAT IS YUGABYTE DB?
  • 6. 6© 2018 All rights reserved. A transactional, planet-scale database for building high-performance cloud services.
  • 7. 7© 2018 All rights reserved. NoSQL + SQL Cloud Native
  • 8. 8© 2018 All rights reserved. WHY ANOTHER DB?
  • 9. 9© 2018 All rights reserved. Typical Stack Today Fragile infra with several moving parts Datacenter 1 SQL Master SQL Slave Application Tier (Stateless Microservices) Datacenter 2 SQL for OLTP data Manual sharding Cost: dev team Manual replication Manual failover Cost: ops team NoSQL for other data App aware of data silo Cost: dev team Cache for low latency App does caching Cost: dev team Data inconsistency/loss Fragile infra Hours of debugging Cost: dev + ops team
  • 10. 10© 2018 All rights reserved. Does AWS change this? Datacenter 1 SQL Master SQL Slave Datacenter 2 Elasticache Aurora DynamoDB Still Complex it’s the same architecture Application Tier (Stateless Microservices)
  • 11. 11© 2018 All rights reserved. Not Portable Not Portable Open Source Not Portable Open Source Open Source High Performance, Transactional, Planet-Scale High Performance, Transactional, Planet-Scale High Performance, Transactional, Planet-Scale High Performance, Transactional, Planet-Scale System-of-Record DBs for Global Apps
  • 12. 12© 2018 All rights reserved. TRANSACTIONAL PLANET-SCALEHIGH PERFORMANCE Single Shard & Distributed ACID Txns Document-Based, Strongly Consistent Storage Low Latency, Tunable Reads High Throughput OPEN SOURCE Apache 2.0 Popular APIs Extended Apache Cassandra, Redis and PostgreSQL (BETA) Auto Sharding & Rebalancing Global Data Distribution Design Principles CLOUD NATIVE Built For The Container Era Self-Healing, Fault-Tolerant
  • 13. 13© 2018 All rights reserved. EXERCISE #1 BUSINESS INTELLIGENCE
  • 14. 14© 2018 All rights reserved. EXERCISE #2 DISTRIBUTED POSTGRES: ARCHITECTURE
  • 15. 15© 2018 All rights reserved. ARCHITECTURE Overview
  • 16. 16© 2018 All rights reserved. YugaByte DB Process Overview • Universe = cluster of nodes • Two sets of processes: YB-Master & YB-TServer • Example universe 4 nodes rf=3
  • 17. 17© 2018 All rights reserved. Sharding data • User table split into tablets
  • 18. 18© 2018 All rights reserved. One tablet for every key
  • 19. 19© 2018 All rights reserved. Tablets and replication • Tablet = set of tablet-peers in a RAFT group • Num tablet-peers in tablet = replication factor (RF) Tolerate 1 failure : RF=3 Tolerate 2 failures: RF=5
  • 20. 20© 2018 All rights reserved. YB-TServer • Process that does IO • Hosts tablet for tables • Hosts transaction manager • Auto memory sizing Block cache Memstores
  • 21. 21© 2018 All rights reserved. YB-Master • Not in critical path • System metadata store Keyspaces, tables, tablets Users/roles, permissions • Admin operations Create/alter/drop of tables Backups Load balancing (leader and data balancing) Enforces data placement policy
  • 22. 22© 2018 All rights reserved. HANDLING DDL STATEMENTS
  • 23. 23© 2018 All rights reserved. DDL Statements in PostgreSQL DDL Postman (Authentication, authorization) Rewriter Planner/Optimizer Executor DISK Create Table Data File Update System Tables
  • 24. 24© 2018 All rights reserved. DDL Statements in YugaByte DB PostgreSQL DDL Postman (Authentication, authorization) Rewriter Planner/Optimizer Executor Create sharded, replicated table as data source Store Table Metadata in YB-Master (in works) YugaByte master3 … YugaByte master2 YugaByte master1
  • 25. 25© 2018 All rights reserved. YugaByte Query Layer (YQL) • Stateless, runs in each YB-TServer process GA Goal: Distributed Stateless PostgreSQL Layer Current Beta uses a single Stateless PostgreSQL Layer
  • 26. 26© 2018 All rights reserved. HANDLING DML QUERIES
  • 27. 27© 2018 All rights reserved. DDL Queries in PostgreSQL QUERY Postman (Authentication, authorization) Rewriter Planner/Optimizer Executor WAL Writer BG Writer… DISK FDW Local Table Code Path EXTERNAL DATABASE
  • 28. 28© 2018 All rights reserved. DML Queries in YugaByte DB PostgreSQL DML Postman (Authentication, authorization) Rewriter Planner/Optimizer Executor FDW YugaByte DB Code Path YB Gateway EXTERNAL DATABASE YugaByte node3 YugaByte node4 … YugaByte node2 YugaByte node1 Using FDW as a Table Storage API
  • 29. 29© 2018 All rights reserved. ARCHITECTURE Data Persistence
  • 30. 30© 2018 All rights reserved. Data Persistence in DocDB • DocDB is YugaByte DB’s LSM storage engine • Persistent key to document store • Extends and enhances RocksDB • Designed to support high data-densities per node
  • 31. 31© 2018 All rights reserved. DocDB: Key-to-Document Store • Document key CQL/SQL/Redis primary key • Document value a CQL or SQL row Redis data structure • Fine-grained reads and writes
  • 32. 32© 2018 All rights reserved. DocDB Data Format Example Insert Encoding
  • 33. 33© 2018 All rights reserved. Some of the RocksDB enhancements • WAL and MVCC enhancements o Removed RocksDB WAL, re-uses Raft log o MVCC at a higher layer o Coordinate RocksDB memstore flushing and Raft log garbage collection • File format changes o Sharded (multi-level) indexes and Bloom filters • Splitting data blocks & metadata into separate files for tiering support • Separate queues for large and small compactions
  • 34. 34© 2018 All rights reserved. More Enhancements to RocksDB • Data model aware Bloom filters • Per-SSTable key range metadata to optimize range queries • Server-global block caches & memstore limits • Scan-resistant block cache (single-touch and multi-touch)
  • 35. 35© 2018 All rights reserved. ARCHITECTURE Data Replication
  • 36. 36© 2018 All rights reserved. Raft Replication for Consistency
  • 37. 37© 2018 All rights reserved. How Raft Replication Works
  • 38. 38© 2018 All rights reserved. How Raft Replication Works
  • 39. 39© 2018 All rights reserved. How Raft Replication Works
  • 40. 40© 2018 All rights reserved. How Raft Replication Works
  • 41. 41© 2018 All rights reserved. Raft Related Enhancements • Leader Leases • Multiple Raft groups (1 per tablet) • Leader Balancing • Group Commits • Observer Nodes / Read Replicas
  • 42. 42© 2018 All rights reserved. ARCHITECTURE Transactions
  • 43. 43© 2018 All rights reserved. Single Shard Transactions Raft Consensus Protocol . . . INSERT INTO table (k, v) VALUES (‘k1’, ‘v1’) Lock Manager (in memory, on leader only) Acquire a lock on x DocDB / RocksDB Read current value of x Submit a Raft operation for replication: Insert (k1, v1) at hybrid_time 100 Raft log Tablet follower Tablet follower Replicate to majority of tablet peers Apply to RocksDB and release lock k1,v1 @ht=100 1 2 5 3 4
  • 44. 44© 2018 All rights reserved. MVCC for Lockless Reads • Achieved through HybridTime (HT) Monotonically increasing timestamp • Allows reads at a particular HT without locking • Multiple versions may exist temporarily Reclaim older values during compactions
  • 45. 45© 2018 All rights reserved. Single Shard Transactions • Each tablet maintains a “safe time” for reads o Highest timestamp such that the view as of that timestamp is fixed o In the common case it is just before the hybrid time of the next uncommitted record in the tablet
  • 46. 46© 2018 All rights reserved. Distributed Transactions • Fully decentralized architecture • Every tablet server can act as a Transaction Manager • A distributed Transaction Status table Tracks state of active transactions • Transactions can have 3 states: pending, committed, aborted
  • 47. 47© 2018 All rights reserved. Distributed Transactions – Write Path
  • 48. 48© 2018 All rights reserved. Distributed Transactions – Write Path Step 1: Client request
  • 49. 49© 2018 All rights reserved. Distributed Transactions – Write Path Step 2: Create status record
  • 50. 50© 2018 All rights reserved. Distributed Transactions – Write Path Step 2: Create status record
  • 51. 51© 2018 All rights reserved. Distributed Transactions – Write Path Step 3: Write provisional records
  • 52. 52© 2018 All rights reserved. Distributed Transactions – Write Path Step 4: Atomic commit
  • 53. 53© 2018 All rights reserved. Distributed Transactions – Write Path Step 5: Respond to client
  • 54. 54© 2018 All rights reserved. Distributed Transactions – Write Path Step 6: Apply provisional records
  • 55. 55© 2018 All rights reserved. Isolation Levels • Currently Snapshot Isolation is supported o Write-write conflicts detected when writing provisional records • Serializable isolation (roadmap) o Reads in RW txns also need provisional records • Read-only transactions are always lock-free
  • 56. 56© 2018 All rights reserved. Clock Skew and Read Restarts • Need to ensure the read timestamp is high enough o Committed records the client might have seen must be visible • Optimistically use current Hybrid Time, re-read if necessary o Reads are restarted if a record with a higher timestamp that the client could have seen is encountered o Read restart happens at most once per tablet o Relying on bounded clock skew (NTP, AWS Time Sync) • Only affects multi-row reads of frequently updated records
  • 57. 57© 2018 All rights reserved. Distributed Transactions – Read Path
  • 58. 58© 2018 All rights reserved. Distributed Transactions – Read Path Step 1: Client request; pick ht_read
  • 59. 59© 2018 All rights reserved. Distributed Transactions – Read Path Step 2: Read from tablet servers
  • 60. 60© 2018 All rights reserved. Distributed Transactions – Read Path Step 3: Resolve txn status
  • 61. 61© 2018 All rights reserved. Distributed Transactions – Read Path Step 4: Respond to YQL Engine
  • 62. 62© 2018 All rights reserved. Distributed Transactions – Read Path Step 5: Respond to client
  • 63. 63© 2018 All rights reserved. Distributed Transactions – Conflicts & Retries • Every transaction is assigned a random priority • In a conflict, the higher-priority transaction wins o The restarted transaction gets a new random priority o Probability of success quickly increases with retries • Restarting a transaction is the same as starting a new one • A read-write transaction can be subject to read-restart
  • 64. 64© 2018 All rights reserved. EXERCISE #3 and #4 SHARDING AND SCALE OUT FAULT TOLERANCE
  • 65. 65© 2018 All rights reserved. Questions? Try it at docs.yugabyte.com/latest/quick-start