HerdDB
A Distributed
JVM Embeddable Database
built upon Apache BookKeeper
Enrico Olivelli
Senior Software Engineer @MagNews.com @EmailSuccess.com
Pulsar Summit 2020
2
Agenda
• Why HerdDB?
• Embedded vs Replicated
• HerdDB: Data Model
• HerdDB: How it works
• Using Apache BookKeeper as WAL
• Q&A Time
3
Why HerdDB?
HerdDB is a Distributed Database System, designed from the ground up to run inside the
same process as the client application.
The project started inside EmailSuccess.com, a Mail Transfer Agent (like sendmail).
EmailSuccess is a Java application and it initially used MySQL to store message state.
In 2016 we started the development of EmailSuccess Cluster and we needed a better
solution for storing message state.
Requirements:
• Embedded: The database should run inside the same Java process (like SQLite)
• Replicated: The database should scale out and offer high-availability automatically
Solutions were already present on the market, but none could meet both
requirements.
4
Some HerdDB users
Currently HerdDB is used in production:
- As Primary SQL DB for EmailSuccess.com standalone
- As Replicated DB for EmailSuccess.com cluster
- As Metadata Service on BlobIt – Binary Large Object Storage built over
Apache BookKeeper (https://github.com/diennea/blobit)
- As Configuration and Certificate store for CarapaceProxy HTTP server
(https://github.com/diennea/carapaceproxy)
- As SQL Database for Apache Pulsar Manager
(https://github.com/apache/pulsar-manager)
- In HerdDB Collections Framework flavour at https://magnews.com
- As MySQL replacement in a few other non Open Source internal products at
https://diennea.com
5
Another Embedded Database for the JVM
Embedding the database with the application brings these advantages:
- No additional deployment and management tools for users.
- The database is hidden from the user; you only have to manage the application.
- The application is tested with only one version of the Database (the same as in
production).
- Easier to test for developers (run the DB in-memory for unit tests).
Challenges:
- Memory management: the DB cannot use all of the heap of the client application.
- Have sensible defaults.
- Enable automatic management routines by default.
- Handle upgrade procedures automatically, without manual intervention.
- Pay much attention to wire protocol changes and to disk formats.
6
Replication for an Embedded Database
Replication: high-availability and scalability.
Benefits:
- Work even in case of temporary or permanent loss of machines.
- Keep data closer to the clients.
- No shared disks or network file systems.
- Scale out by adding machines to the application.
Challenges:
- Gracefully handle temporary failures: long GC pauses, application restarts.
- Understand storage topology: WAL and cold Data have separate replication
mechanisms.
- Need some external source of truth: Apache ZooKeeper.
7
HerdDB - Data model: TableSpaces
The database is partitioned into TableSpaces; a TableSpace is a group of Tables.
You can run transactions and joins that span tables of the same TableSpace.
A TableSpace must reside entirely on every machine assigned to it.
TableSpace assets:
- Tables
- Indexes
- Open Transactions
- Write-ahead log
- Checkpoint data
Diagram: a TableSpace contains Table 1 and Table 2 (each with a PK Index and Secondary Indexes), the Write-ahead log and the Transactions.
8
HerdDB - Data Model: Tables
A Table is a simple key-value dictionary (byte[] -> byte[]).
Clients use JDBC API and SQL language.
On top of the key-value structure we have a SQL layer:
- Key = Primary Key
- Value = All of the other columns
We are using Apache Calcite for SQL work: Parser and Planner.
Core Data Structures (TableManager):
- Data Page: a bunch of records, indexed by PK (hashmap)
- Primary Key Index: a map that binds a PK value to the id of a data page
- Dirty Page Buffer: a Data Page that is open for writes
- Secondary indexes: they map a value to a set of PKs
HerdDB Collections Framework uses the key-value structure directly.
TableManager (diagram):
- Secondary index: VALUE1 -> KEY1, VALUE2 -> KEY1, VALUE3 -> KEY2
- Primary Key Index: KEY1 -> PAGE 1, KEY2 -> PAGE 2, KEY3 -> PAGE 1
- Page 1 (immutable): KEY1 -> RECORD1v1, KEY2 -> RECORD2v1, KEY3 -> RECORD3v1
- Page 2 (writable): KEY2 -> RECORD2v2
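The structures listed on this slide can be modelled with a few hash maps. The following is a minimal sketch, not HerdDB's actual code: the class `TableSketch` and its methods are hypothetical names, and keys and records are plain strings instead of byte arrays. It shows how a lookup goes from the Primary Key Index to a page and then to the record, and how a secondary index resolves a value to a set of PKs first.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, simplified model of the TableManager structures; not HerdDB's real classes.
class TableSketch {
    // Data pages: page id -> (primary key -> record). Each page is a hash map indexed by PK.
    final Map<Integer, Map<String, String>> pages = new HashMap<>();
    // Primary key index: primary key -> id of the page holding the latest version.
    final Map<String, Integer> pkIndex = new HashMap<>();
    // Secondary index: indexed value -> set of primary keys.
    final Map<String, Set<String>> secondaryIndex = new HashMap<>();

    // Stores a record version on a page and points the PK index at that page.
    void store(int pageId, String key, String record) {
        pages.computeIfAbsent(pageId, id -> new HashMap<>()).put(key, record);
        pkIndex.put(key, pageId);
    }

    void index(String value, String key) {
        secondaryIndex.computeIfAbsent(value, v -> new HashSet<>()).add(key);
    }

    // Lookup by primary key: PK index -> page id -> record.
    String get(String key) {
        Integer pageId = pkIndex.get(key);
        return pageId == null ? null : pages.get(pageId).get(key);
    }

    // Lookup by secondary index: value -> PKs -> records.
    Set<String> findByValue(String value) {
        Set<String> result = new HashSet<>();
        for (String key : secondaryIndex.getOrDefault(value, Set.of())) {
            result.add(get(key));
        }
        return result;
    }

    // Builds the state shown in the diagram on this slide.
    static TableSketch example() {
        TableSketch t = new TableSketch();
        t.store(1, "KEY1", "RECORD1v1");
        t.store(1, "KEY2", "RECORD2v1");
        t.store(1, "KEY3", "RECORD3v1");
        t.store(2, "KEY2", "RECORD2v2"); // newer version on the writable page
        t.index("VALUE1", "KEY1");
        t.index("VALUE2", "KEY1");
        t.index("VALUE3", "KEY2");
        return t;
    }
}
```

In this sketch `TableSketch.example().get("KEY2")` follows the index to Page 2 and returns the newer version, while the old version is still physically present on Page 1.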
9
HerdDB - How it works: the Write Path
Write Path flow for INSERT/UPDATE operations:
1) Validate the operation.
2) Write to the WAL.
3) If the record is a new record or it was on another page, write the new version
to the Dirty Page Buffer and update the PK index.
3b) If the record was already on the Dirty Page Buffer, simply replace it.
You can have multiple versions of the same record; old versions are discarded
during maintenance operations, usually during checkpoints.
When you are working inside a Transaction, all of the writes are applied to an
internal buffer local to the Transaction and they are not visible to other
Transactions.
In this case writes to the Dirty Page Buffer are deferred to the «commit»
operation.
TableManager (diagram, before updating KEY2):
- Secondary index: VALUE1 -> KEY1, VALUE2 -> KEY1, VALUE3 -> KEY2
- Primary Key Index: KEY1 -> PAGE 1, KEY2 -> PAGE 1, KEY3 -> PAGE 1
- Page 1 (immutable): KEY1 -> RECORD1v1, KEY2 -> RECORD2v1, KEY3 -> RECORD3v1
- Page 2 (writable): empty
TableManager (diagram, after updating KEY2):
- Secondary index: VALUE1 -> KEY1, VALUE2 -> KEY1, VALUE3 -> KEY2
- Primary Key Index: KEY1 -> PAGE 1, KEY2 -> PAGE 2, KEY3 -> PAGE 1
- Page 1 (immutable): KEY1 -> RECORD1v1, KEY2 -> RECORD2v1, KEY3 -> RECORD3v1
- Page 2 (writable): KEY2 -> RECORD2v2
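The transaction-local buffering described in the write path can be illustrated with a small model. This is only a sketch under simplifying assumptions: `TxBufferSketch` and `Transaction` are hypothetical names, a single map stands in for the Dirty Page Buffer, and the WAL is left out entirely.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of transaction-local write buffering; names are illustrative only.
class TxBufferSketch {
    // Shared state visible to every reader (stands in for the Dirty Page Buffer).
    private final Map<String, String> committed = new HashMap<>();

    class Transaction {
        // Writes made inside the transaction stay here until commit.
        private final Map<String, String> localWrites = new HashMap<>();

        void put(String key, String record) {
            localWrites.put(key, record);
        }

        // A transaction sees its own writes first, then the shared state.
        String get(String key) {
            return localWrites.containsKey(key) ? localWrites.get(key) : committed.get(key);
        }

        // Commit applies the buffered writes to the shared state.
        void commit() {
            committed.putAll(localWrites);
            localWrites.clear();
        }
    }

    Transaction begin() {
        return new Transaction();
    }

    // Non-transactional read of the shared state.
    String get(String key) {
        return committed.get(key);
    }
}
```

A write made through one `Transaction` is invisible to another transaction and to plain reads until `commit()` is called, which is the point where the deferred writes reach the shared buffer.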
11
HerdDB - How it works: the Write Path
TableManager (diagram):
- Primary Key Index: KEY1 -> PAGE 1, KEY2 -> PAGE 1, KEY3 -> PAGE 1
- Page 1 (writable): KEY1 -> RECORD1v1, KEY2 -> RECORD2v1, KEY3 -> RECORD3v1
If the latest version of the record is already on a writable page, then we can simply replace the content of the record.
No need to update the Primary Key index.
TableManager (diagram, after replacing KEY2 in place):
- Primary Key Index: KEY1 -> PAGE 1, KEY2 -> PAGE 1, KEY3 -> PAGE 1
- Page 1 (writable): KEY1 -> RECORD1v1, KEY2 -> RECORD2v2, KEY3 -> RECORD3v1
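The write path steps, including the in-place case, can be sketched as follows. This is an illustrative model rather than HerdDB's implementation: `WritePathSketch`, `upsert` and `rollDirtyPage` are hypothetical names, the WAL is an in-memory list, and validation is reduced to a null check.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the write path; it models the steps on the slides, not HerdDB's code.
class WritePathSketch {
    final List<String> wal = new ArrayList<>();                      // stand-in for the write-ahead log
    final Map<Integer, Map<String, String>> pages = new HashMap<>(); // page id -> (PK -> record)
    final Map<String, Integer> pkIndex = new HashMap<>();            // PK -> page id
    int dirtyPageId = 1;                                             // the page currently open for writes

    WritePathSketch() {
        pages.put(dirtyPageId, new HashMap<>());
    }

    // Seals the current dirty page (it is no longer written to) and opens a new one.
    void rollDirtyPage() {
        dirtyPageId++;
        pages.put(dirtyPageId, new HashMap<>());
    }

    // Returns true if the PK index had to be updated.
    boolean upsert(String key, String record) {
        // 1) Validate the operation.
        if (key == null || record == null) {
            throw new IllegalArgumentException("key and record are required");
        }
        // 2) Write to the WAL before touching the in-memory structures.
        wal.add(key + "=" + record);
        // 3) and 3b) Write to the dirty page; a put on the same page replaces the record in place.
        pages.get(dirtyPageId).put(key, record);
        Integer currentPage = pkIndex.get(key);
        if (currentPage != null && currentPage == dirtyPageId) {
            return false; // 3b) the record was already on the dirty page: index untouched
        }
        pkIndex.put(key, dirtyPageId); // 3) new record, or the old version lives on another page
        return true;
    }

    String get(String key) {
        Integer pageId = pkIndex.get(key);
        return pageId == null ? null : pages.get(pageId).get(key);
    }
}
```

When a record moves to a new dirty page, the old version stays on the previous page in this sketch; in the talk such stale versions are discarded during maintenance, usually at checkpoint time.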
13
HerdDB - How it works: Replication
A HerdDB Cluster is made of:
- HerdDB nodes
- BookKeeper servers (Bookies)
- ZooKeeper servers
It is common to run the Bookie inside the same process as the HerdDB node.
It is also common to run the HerdDB node inside the same process as the
client application.
Diagram (separate processes): App, App, App; HerdDB Server x3; Bookie x3; ZK.
Diagram (Bookie embedded in the HerdDB node): App, App, App; HerdDB + Bookie x3; ZK.
Diagram (HerdDB node and Bookie embedded in the client application): App + HerdDB + Bookie x3; ZK.
16
HerdDB - How it works: Replication
A TableSpace is a Replicated State Machine:
- The state of the machine is the set of tables and their contents.
- We have a list of replicas: one node is the leader node, the
others are the followers.
- Clients only talk to the leader node.
- The leader node writes state changes to BookKeeper.
- Followers read the tail of the log and apply every operation to
the local copy of the tablespace, in the same order.
- Nodes never talk to each other for write/read operations: only
for initial bootstrap of a new replica.
Apache BookKeeper guarantees the overall consistency:
- Fencing mechanism.
- Last Add Confirmed Protocol.
Apache ZooKeeper is the glue and the source of truth:
- Service Discovery.
- Metadata management.
- Coordination.
Diagram: Client -> Node1 (leader): Read/Write. Node1 -> Bookie1, Bookie2, Bookie3: Write Data Changes. Node2, Node3 (followers): Tail the log, read changes and apply locally. ZooKeeper.
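The replicated state machine idea can be illustrated with a toy model. In this sketch an in-memory list stands in for the BookKeeper-backed log, and `SharedLog`, `Node`, `write` and `tail` are hypothetical names chosen for the example; nothing here uses the real BookKeeper or HerdDB APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: an in-memory list stands in for the BookKeeper-backed log.
class ReplicatedTableSpaceSketch {
    // Shared, ordered log of state changes ("key=value" entries).
    static final class SharedLog {
        private final List<String> entries = new ArrayList<>();

        synchronized void append(String entry) {
            entries.add(entry);
        }

        synchronized List<String> readFrom(long position) {
            return new ArrayList<>(entries.subList((int) position, entries.size()));
        }
    }

    static final class Node {
        final Map<String, String> table = new TreeMap<>(); // local copy of the tablespace
        long nextPosition = 0;                             // next log entry to apply

        private void apply(String entry) {
            String[] kv = entry.split("=", 2);
            table.put(kv[0], kv[1]);
            nextPosition++;
        }

        // Leader: write the change to the log first, then apply it locally.
        void write(SharedLog log, String key, String value) {
            String entry = key + "=" + value;
            log.append(entry);
            apply(entry);
        }

        // Follower: read the tail of the log and apply every entry in log order.
        void tail(SharedLog log) {
            for (String entry : log.readFrom(nextPosition)) {
                apply(entry);
            }
        }
    }
}
```

Because every follower applies the same entries in the same order, its local table converges to the leader's state without the nodes ever exchanging data directly.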
17
HerdDB - How it works: Replication
Two levels of Replication:
- Write ahead log: Apache BookKeeper.
- TableSpace data structures: HerdDB runtime.
BookKeeper client on the leader node:
- Writes directly to each Bookie in the chosen ensemble.
- Spreads data among bookies.
- Deals with temporary and permanent failure of bookies.
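One way to picture how a client spreads entries among bookies is round-robin striping. The sketch below assumes that placement scheme for illustration; `StripingSketch`, `writeSet` and the parameter names `ensembleSize` and `writeQuorum` are hypothetical, and the code does not call the BookKeeper client.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of round-robin striping; class, method and parameter names are illustrative.
class StripingSketch {
    // Returns the indexes, within the ensemble, of the bookies that store the given entry.
    static List<Integer> writeSet(long entryId, int ensembleSize, int writeQuorum) {
        if (writeQuorum > ensembleSize) {
            throw new IllegalArgumentException("write quorum cannot exceed ensemble size");
        }
        List<Integer> bookies = new ArrayList<>();
        for (int i = 0; i < writeQuorum; i++) {
            // Consecutive entries start on consecutive bookies and wrap around.
            bookies.add((int) ((entryId + i) % ensembleSize));
        }
        return bookies;
    }
}
```

With an ensemble of three bookies and a write quorum of two, consecutive entries land on bookie pairs (0, 1), (1, 2), (2, 0), so each bookie holds only part of the log while every entry still has two copies.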
HerdDB writes to the local disk in these cases:
- Checkpoint: the server consolidates the local copy of the data at a
given point in time (log position).
- Low memory: swap out current dirty pages to temporary files.
WAL truncation happens with a time-based policy.
Followers only eventually store a durable local copy of the whole
tablespace.
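A checkpoint tied to a log position lets a node rebuild its state from the local snapshot plus the log entries written after it. The following is a generic sketch of that technique, not HerdDB's checkpoint format: `CheckpointSketch`, `Checkpoint`, `checkpoint` and `recover` are hypothetical names, and the log is an in-memory list of "key=value" strings.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of checkpoint and recovery over an in-memory log of "key=value" entries.
class CheckpointSketch {
    // Snapshot of the data, tagged with the position of the next log entry to replay.
    static final class Checkpoint {
        final Map<String, String> data;
        final int nextLogPosition;

        Checkpoint(Map<String, String> data, int nextLogPosition) {
            this.data = new HashMap<>(data);
            this.nextLogPosition = nextLogPosition;
        }
    }

    static void apply(Map<String, String> table, String entry) {
        String[] kv = entry.split("=", 2);
        table.put(kv[0], kv[1]);
    }

    // Consolidates the state reached after applying log entries [0, position).
    static Checkpoint checkpoint(List<String> log, int position) {
        Map<String, String> table = new HashMap<>();
        for (String entry : log.subList(0, position)) {
            apply(table, entry);
        }
        return new Checkpoint(table, position);
    }

    // Recovery: start from the checkpoint and replay only the entries written after it.
    static Map<String, String> recover(Checkpoint cp, List<String> log) {
        Map<String, String> table = new HashMap<>(cp.data);
        for (String entry : log.subList(cp.nextLogPosition, log.size())) {
            apply(table, entry);
        }
        return table;
    }
}
```

The sketch does not model WAL truncation, which the talk describes as time-based.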
18
HerdDB - How it works: Follower promotion and Fencing
Diagram: Client -> N1 (leader): Read/Write. N1 -> Bookie1, Bookie2, Bookie3: Write Data Changes. N2 (follower): Read changes and apply locally. ZooKeeper.
Apache BookKeeper guarantees the consistency of TableSpace data by
means of the fencing mechanism.
What happens when a follower node N2 is promoted to be leader
instead of N1?
- N1 is the current leader
- The client is sending all of the requests to N1
- But N1 has been partitioned from the cluster; some network paths
are no longer available:
  - The client is still connected to it
  - N1 is still able to talk to Bookies
  - N1 is not receiving notifications from ZK
- Someone updates TableSpace metadata on ZK: the leader is now
officially N2
- Please note that N1 does not receive any notification from ZK
about the change of role
19
HerdDB - How it works: Follower promotion and Fencing
- N2 starts recovery and opens all of the ledgers with the ‘recovery’
flag
- Bookies fence out N1
- N1 receives the «fenced» error response at the first write
operation and stops
- Any client still connected to N1 receives a «not leader» error from
N1 and discovers the new leader using ZK
- Now new writes go to N2
- When N1 is back to its normal operating state it recovers from the log
and starts being a follower.
Please note that operations on the hot paths, reads and writes, do not
access the metadata service, but the overall consistency is still
guaranteed.
Diagram: N2 (new leader) -> Bookie1, Bookie2, Bookie3: Open log with fencing. N1 (old leader): Send Write, Receive Failure response «Fenced». Client: Receive Failure response «No more leader», Send new Write to Node2.
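The fencing sequence above can be modelled in memory to make the ordering of events concrete. This is a hypothetical sketch: `FencingSketch`, `Ledger`, `Node` and `FencedException` are names invented for the example, and the code does not use the BookKeeper client API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of ledger fencing; it does not use the BookKeeper client API.
class FencingSketch {
    static final class FencedException extends RuntimeException {
        FencedException(String message) {
            super(message);
        }
    }

    // Stands in for a ledger used as the tablespace WAL.
    static final class Ledger {
        private final List<String> entries = new ArrayList<>();
        private boolean fenced = false;

        synchronized void append(String writer, String entry) {
            if (fenced) {
                throw new FencedException(writer + " has been fenced");
            }
            entries.add(entry);
        }

        // Opening for recovery fences the ledger: no further appends are accepted.
        synchronized List<String> openForRecovery() {
            fenced = true;
            return new ArrayList<>(entries);
        }
    }

    static final class Node {
        final String name;
        boolean leader;

        Node(String name, boolean leader) {
            this.name = name;
            this.leader = leader;
        }

        // Returns "OK", or "NOT_LEADER" once the node has discovered it was fenced.
        String write(Ledger ledger, String entry) {
            if (!leader) {
                return "NOT_LEADER";
            }
            try {
                ledger.append(name, entry);
                return "OK";
            } catch (FencedException e) {
                leader = false; // stop acting as leader at the first fenced write
                return "NOT_LEADER";
            }
        }
    }
}
```

In this model the old leader learns about its demotion from the log itself, at its first failed write, which mirrors the point that the hot path does not need to consult the metadata service.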
20
Wrap up
HerdDB is a good option for you:
- If you want an Open Source SQL database with built-in automatic replication
- If you want to embed the SQL database into your Java application
Benefits of embedding a Replicated Database:
- Run and manage only the application, hide distributed Database complexity
from system administrators
- Run and test the application with the same Database you use in production
HerdDB replication is based on Apache BookKeeper and Apache ZooKeeper:
- BookKeeper implements replication for the WAL and guarantees the
consistency of your data
- ZooKeeper is the glue and the base for coordination, metadata management
and service discovery
21
Q&A
https://herddb.com
https://bookkeeper.apache.org
Twitter: @eolivelli
LinkedIn: https://www.linkedin.com/in/enrico-olivelli-984b7874/
We can chat on https://bookkeeper.apache.org/community/slack/
Thank you.

Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
Top profile Call Girls In Chandrapur [ 7014168258 ] Call Me For Genuine Model...
 
Gartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptxGartner's Data Analytics Maturity Model.pptx
Gartner's Data Analytics Maturity Model.pptx
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 

Introducing HerdDB - a distributed JVM embeddable database built upon Apache BookKeeper, by Enrico Olivelli

  • 1. HerdDB: A Distributed JVM Embeddable Database built upon Apache BookKeeper. Enrico Olivelli, Senior Software Engineer @MagNews.com @EmailSuccess.com. Pulsar Summit 2020.
  • 2. Agenda: • Why HerdDB? • Embedded vs Replicated • HerdDB: Data Model • HerdDB: How it works • Using Apache BookKeeper as WAL • Q&A Time
  • 3. Why HerdDB? HerdDB is a Distributed Database System, designed from the ground up to run inside the same process as its client applications. The project started inside EmailSuccess.com, a Mail Transfer Agent (like sendmail). EmailSuccess is a Java application and it initially used MySQL to store message state. In 2016 we started the development of EmailSuccess Cluster and we needed a better solution for storing message state. Requirements: • Embedded: the database should run inside the same Java process (like SQLite) • Replicated: the database should scale out and offer high availability automatically. Solutions were already present on the market, but none met both requirements.
  • 4. Some HerdDB users. Currently HerdDB is used in production: - As Primary SQL DB for EmailSuccess.com standalone - As Replicated DB for EmailSuccess.com cluster - As Metadata Service on BlobIt, a Binary Large Object Storage built over Apache BookKeeper (https://github.com/diennea/blobit) - As Configuration and Certificate store for CarapaceProxy HTTP server (https://github.com/diennea/carapaceproxy) - As SQL Database for Apache Pulsar Manager (https://github.com/apache/pulsar-manager) - In HerdDB Collections Framework flavour at https://magnews.com - As MySQL replacement in a few other non Open Source internal products at https://diennea.com
  • 5. Another Embedded Database for the JVM. Embedding the database with the application brings these advantages: - No additional deployment and management tools for users. - The database is hidden from the user: you have to manage «only» the application. - The application is tested with only one version of the Database (the same as in production). - Easier to test for developers (run the DB in-memory for unit tests). Challenges: - Memory management: the DB cannot use all of the heap of the client application. - Have sensible defaults. - Enable automatic management routines by default. - Handle upgrade procedures automatically, without manual intervention. - Pay close attention to wire protocol changes and to disk formats.
  • 6. Replication for an Embedded Database. Replication: high availability and scalability. Benefits: - Work even in case of temporary or permanent loss of machines. - Keep data closer to the clients. - No shared disks or network file systems. - Scale out by adding machines to the application. Challenges: - Gracefully handle temporary failures: long GC pauses, application restarts. - Understand storage topology: WAL and cold Data have separate replication mechanisms. - Need some external source of truth: Apache ZooKeeper.
  • 7. HerdDB - Data model: TableSpaces. The database is partitioned into TableSpaces; a TableSpace is a group of Tables. You can run transactions and joins that span tables of the same TableSpace. A TableSpace must reside entirely on every machine assigned to it. TableSpace assets: - Tables - Indexes - Open Transactions - Write-ahead log - Checkpoint data. [Diagram: a TableSpace containing Table 1 and Table 2, each with a PK Index and Secondary Indexes, plus the Write-ahead log and Transactions.]
  • 8. HerdDB - Data Model: Tables. A Table is a simple key-value dictionary (byte[] -> byte[]). Clients use the JDBC API and the SQL language. On top of the key-value structure we have a SQL layer: - Key = Primary Key - Value = All of the other columns. We are using Apache Calcite for SQL work: Parser and Planner. Core Data Structures (TableManager): - Data Page: a bunch of records, indexed by PK (hashmap) - Primary Key Index: a map that binds a PK value to the id of a data page - Dirty Page Buffer: a Data Page that is open for writes - Secondary indexes: they map a value to a set of PKs. HerdDB Collections Framework uses the Key-Value structure directly. [Diagram: TableManager. Secondary index: VALUE1 -> KEY1, VALUE2 -> KEY1, VALUE3 -> KEY2. Primary Key Index: KEY1 -> PAGE 1, KEY2 -> PAGE 2, KEY3 -> PAGE 1. Page 1 (immutable): KEY1 -> RECORD1v1, KEY2 -> RECORD2v1, KEY3 -> RECORD3v1. Page 2 (writable): KEY2 -> RECORD2v2.]
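
The structures on this slide can be pictured with a small Java sketch. It is only an illustration under stated assumptions: `TableSketch` and its methods are hypothetical names, not HerdDB classes, and keys and records are plain `String` values instead of `byte[]` so that they can be used directly as map keys.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the structures listed on the slide; not HerdDB's real classes.
final class TableSketch {
    // Primary Key Index: PK -> id of the data page that holds the latest version.
    private final Map<String, Long> pkIndex = new HashMap<>();
    // Data pages: page id -> (PK -> record).
    private final Map<Long, Map<String, String>> pages = new HashMap<>();
    // Secondary index: indexed value -> set of PKs.
    private final Map<String, Set<String>> secondaryIndex = new HashMap<>();

    // Stores a record on the given page and points both indexes at it.
    // Assumption: removing stale secondary index entries is left out for brevity.
    void put(long pageId, String key, String record, String indexedValue) {
        pages.computeIfAbsent(pageId, id -> new HashMap<>()).put(key, record);
        pkIndex.put(key, pageId);
        secondaryIndex.computeIfAbsent(indexedValue, v -> new HashSet<>()).add(key);
    }

    // Lookup by primary key: PK index first, then the data page.
    String get(String key) {
        Long pageId = pkIndex.get(key);
        return pageId == null ? null : pages.get(pageId).get(key);
    }

    // Lookup by secondary index: returns the PKs bound to the value.
    Set<String> findKeys(String indexedValue) {
        return secondaryIndex.getOrDefault(indexedValue, Collections.emptySet());
    }
}
```

A read by primary key costs two map lookups, and an older version of a record can stay on its previous page because the PK index always points at the page that holds the latest one.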
  • 9. HerdDB - How it works: the Write Path. Write Path flow for INSERT/UPDATE operations: 1) Validate the operation. 2) Write to the WAL. 3) If the record is a new record or it was on another page, write the new version to the Dirty Page Buffer and update the PK index. 3b) If the record was already on the Dirty Page Buffer, simply replace it. You can have multiple versions of the same record; old versions are discarded during maintenance operations, usually during checkpoints. When you are working inside a Transaction, all of the writes are applied to an internal buffer local to the Transaction and they are not visible to other Transactions. In this case writes to the Dirty Page Buffer are deferred to the «commit» operation. [Diagram, before the update: TableManager. Secondary index: VALUE1 -> KEY1, VALUE2 -> KEY1, VALUE3 -> KEY2. Primary Key Index: KEY1 -> PAGE 1, KEY2 -> PAGE 1, KEY3 -> PAGE 1. Page 1 (immutable): KEY1 -> RECORD1v1, KEY2 -> RECORD2v1, KEY3 -> RECORD3v1. Page 2 (writable): empty.]
  • 10. HerdDB - How it works: the Write Path (same text as slide 9, with the diagram after the update). [Diagram, after the update of KEY2: Primary Key Index: KEY1 -> PAGE 1, KEY2 -> PAGE 2, KEY3 -> PAGE 1. Page 1 (immutable) still holds KEY1 -> RECORD1v1, KEY2 -> RECORD2v1, KEY3 -> RECORD3v1. Page 2 (writable) now holds KEY2 -> RECORD2v2.]
  • 11. HerdDB - How it works: the Write Path. If the latest version of the record is already on a writable page, then we can simply replace the content of the record. No need to update the Primary Key index. [Diagram, before the update: TableManager. Primary Key Index: KEY1 -> PAGE 1, KEY2 -> PAGE 1, KEY3 -> PAGE 1. Page 1 (writable): KEY1 -> RECORD1v1, KEY2 -> RECORD2v1, KEY3 -> RECORD3v1.]
  • 12. HerdDB - How it works: the Write Path (same text as slide 11, with the diagram after the update). [Diagram, after the update of KEY2: the Primary Key Index is unchanged (KEY1 -> PAGE 1, KEY2 -> PAGE 1, KEY3 -> PAGE 1). Page 1 (writable) now holds KEY1 -> RECORD1v1, KEY2 -> RECORD2v2, KEY3 -> RECORD3v1.]
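
The write path described on slides 9 to 12 can be sketched in a few lines of Java. This is a simplified illustration, not HerdDB code: `WritePathSketch`, `upsert` and `openNewDirtyPage` are hypothetical names, the WAL is an in-memory list, and transactions and checkpoints are left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the non-transactional write path; not HerdDB's real classes.
final class WritePathSketch {
    final List<String> wal = new ArrayList<>();               // stand-in for the write-ahead log
    final Map<String, Long> pkIndex = new HashMap<>();        // PK -> page id of the latest version
    final Map<Long, Map<String, String>> pages = new HashMap<>();
    long dirtyPageId;                                         // the only page open for writes

    WritePathSketch(long firstDirtyPageId) {
        openNewDirtyPage(firstDirtyPageId);
    }

    // The previous dirty page is treated as immutable from now on.
    void openNewDirtyPage(long pageId) {
        dirtyPageId = pageId;
        pages.put(pageId, new HashMap<>());
    }

    void upsert(String key, String record) {
        // 1) Validate the operation.
        if (key == null || record == null) {
            throw new IllegalArgumentException("key and record are required");
        }
        // 2) Write to the WAL before touching in-memory state.
        wal.add(key + "=" + record);
        // 3) / 3b) Write the new version to the dirty page (replaces in place if present).
        Long currentPage = pkIndex.get(key);
        pages.get(dirtyPageId).put(key, record);
        // 3) Only a new record, or one living on another page, needs a PK index update.
        if (currentPage == null || currentPage != dirtyPageId) {
            pkIndex.put(key, dirtyPageId);
        }
    }
}
```

After an update that moves a record to the dirty page, the old version is still present on the immutable page; in the talk's description it is discarded later, usually during a checkpoint.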
  • 13. HerdDB - How it works: Replication. A HerdDB Cluster is made of: - HerdDB nodes - BookKeeper servers (Bookies) - ZooKeeper servers. It is common to run the Bookie inside the same process as the HerdDB node. It is also common to run the HerdDB node inside the same process as the client application. [Diagram, layout 1: three Apps, three HerdDB Servers, three Bookies and ZK, all as separate processes.]
  • 14. HerdDB - How it works: Replication (same text as slide 13, with a second layout added). [Diagram, layout 2: three Apps, three combined «HerdDB + Bookie» processes, and ZK.]
  • 15. HerdDB - How it works: Replication (same text as slide 13, with a third layout added). [Diagram, layout 3: three combined «App + HerdDB + Bookie» processes, and ZK.]
  • 16. HerdDB - How it works: Replication. A TableSpace is a Replicated State Machine: - The state of the machine is the set of tables and their contents. - We have a list of replicas: one node is the leader node, the others are the followers. - Clients only talk to the leader node. - The leader node writes state changes to BookKeeper. - Followers read the tail of the log and apply every operation to the local copy of the tablespace, in the same order. - Nodes never talk to each other for write/read operations: only for the initial bootstrap of a new replica. Apache BookKeeper guarantees the overall consistency: - Fencing mechanism. - Last Add Confirmed Protocol. Apache ZooKeeper is the glue and the source of truth: - Service Discovery. - Metadata management. - Coordination. [Diagram: the Client reads from and writes to Node1 (leader); Node1 writes data changes to Bookie1, Bookie2 and Bookie3; Node2 and Node3 (followers) tail the log, read the changes and apply them locally; all nodes are connected to ZooKeeper.]
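
The replicated state machine idea can be illustrated with a minimal in-memory Java sketch. It makes several assumptions: `SharedLog` merely stands in for the log stored on BookKeeper, `Replica` is a hypothetical class, and the state is a single key-value map rather than a set of tables.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a log-based replicated state machine; not HerdDB's real classes.
final class ReplicationSketch {

    // In-memory stand-in for the shared write-ahead log.
    static final class SharedLog {
        private final List<String[]> entries = new ArrayList<>();

        synchronized void append(String key, String value) {
            entries.add(new String[] {key, value});
        }

        // Returns a copy of every entry starting at the given offset.
        synchronized List<String[]> readFrom(int offset) {
            return new ArrayList<>(entries.subList(offset, entries.size()));
        }
    }

    static final class Replica {
        final Map<String, String> state = new HashMap<>();
        int nextOffset = 0; // position of the next log entry to apply

        // Follower behaviour: tail the log and apply entries in log order.
        void catchUp(SharedLog log) {
            for (String[] entry : log.readFrom(nextOffset)) {
                state.put(entry[0], entry[1]);
                nextOffset++;
            }
        }

        // Leader behaviour: write the change to the log first, then apply it locally.
        void write(SharedLog log, String key, String value) {
            log.append(key, value);
            catchUp(log);
        }
    }
}
```

Because every replica applies the same entries in the same order, a follower that has reached the end of the log holds the same state as the leader, without the nodes ever exchanging messages directly.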
  • 17. HerdDB - How it works: Replication. Two levels of Replication: - Write-ahead log: Apache BookKeeper. - TableSpace data structures: HerdDB runtime. BookKeeper client on the leader node: - Writes directly to each Bookie in the chosen ensemble. - Spreads data among bookies. - Deals with temporary and permanent failure of bookies. HerdDB writes to the local disk in these cases: - Checkpoint: the server consolidates the local copy of the data at a given point in time (log position). - Low memory: swap out current dirty pages to temporary files. WAL truncation happens with a time-based policy. Followers will eventually store a durable local copy of the whole tablespace. [Diagram: three Apps, three HerdDB Servers, three Bookies and ZK.]
  • 18. HerdDB - How it works: Follower promotion and Fencing. Apache BookKeeper guarantees the consistency of TableSpace data by means of the fencing mechanism. What happens when a follower node N2 is promoted to be leader instead of N1? - N1 is the current leader. - The client is sending all of the requests to N1. - But N1 has been partitioned from the cluster; some network paths are not available any more: - the client is still connected to it - N1 is still able to talk to Bookies - N1 is not receiving notifications from ZK. - Someone updates the TableSpace metadata on ZK: the leader is now officially N2. - Please note that N1 does not receive any notification from ZK about the change of role. [Diagram: the Client reads from and writes to N1 (leader); N1 writes data changes to Bookie1, Bookie2 and Bookie3; N2 (follower) reads the changes and applies them locally; the link between N1 and ZooKeeper is marked as broken.]
  • 19. HerdDB - How it works: Follower promotion and Fencing. - N2 starts recovery and opens all of the ledgers with the ‘recovery’ flag. - Bookies fence out N1. - N1 receives the «fenced» error response at the first write operation and stops. - Any client still connected to N1 receives a «not leader» error from N1 and discovers the new leader using ZK. - Now new writes go to N2. - When N1 is back to a normal operation state it recovers from the log and starts being a follower. Please note that operations on the hot paths, reads and writes, do not access the metadata service, but the overall consistency is still guaranteed. [Diagram: the Client sends a write to N1 (old leader); N2 (new leader) opens the log with fencing on Bookie1, Bookie2 and Bookie3; N1 receives the failure response «Fenced»; the Client receives the failure response «No more leader» and sends a new write to Node2.]
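
The fencing sequence on slides 18 and 19 can be mimicked with a small in-memory Java sketch. It does not use the BookKeeper client API: `Ledger`, `Node`, `FencedException` and the returned strings are hypothetical stand-ins, and a single boolean flag replaces the real fencing protocol between client and bookies.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of fencing; not the BookKeeper API.
final class FencingSketch {

    static final class FencedException extends Exception {
        FencedException() {
            super("ledger is fenced");
        }
    }

    // Stand-in for a ledger stored on the bookies.
    static final class Ledger {
        private final List<String> entries = new ArrayList<>();
        private boolean fenced;

        synchronized void addEntry(String entry) throws FencedException {
            if (fenced) {
                throw new FencedException(); // the old writer is rejected
            }
            entries.add(entry);
        }

        // Opening for recovery fences the ledger, then returns what was written so far.
        synchronized List<String> openForRecovery() {
            fenced = true;
            return new ArrayList<>(entries);
        }
    }

    static final class Node {
        boolean believesToBeLeader;

        Node(boolean believesToBeLeader) {
            this.believesToBeLeader = believesToBeLeader;
        }

        // Returns the response that the client would see.
        String write(Ledger ledger, String entry) {
            if (!believesToBeLeader) {
                return "not leader";
            }
            try {
                ledger.addEntry(entry);
                return "ok";
            } catch (FencedException e) {
                believesToBeLeader = false; // step down at the first fenced write
                return "not leader";
            }
        }
    }
}
```

The partitioned node learns about the change of role from the write itself, so the hot path needs no call to the metadata service; a client that receives "not leader" would then look up the new leader (on ZooKeeper, in the talk's description).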
  • 20. Wrap up. HerdDB is a good option for you: - If you want an Open Source SQL database with built-in automatic replication - If you want to embed the SQL database into your Java application. Benefits of embedding a Replicated Database: - Run and manage only the application, hiding the complexity of the distributed Database from system administrators - Run and test the application with the same Database you use in production. HerdDB replication is based on Apache BookKeeper and Apache ZooKeeper: - BookKeeper implements replication for the WAL and guarantees the consistency of your data - ZooKeeper is the glue and the base for coordination, metadata management and service discovery