Introducing HerdDB - a distributed JVM embeddable database built upon Apache BookKeeper_Enrico Olivelli

HerdDB
A Distributed
JVM Embeddable Database
built upon Apache BookKeeper
Enrico Olivelli
Senior Software Engineer @MagNews.com @EmailSuccess.com
Pulsar Summit 2020

Agenda
• Why HerdDB?
• Embedded vs Replicated
• HerdDB: Data Model
• HerdDB: How it works
• Using Apache BookKeeper as WAL
• Q&A Time

Why HerdDB?
HerdDB is a Distributed Database System, designed from the ground up to run inside the
same process as its client applications.
The project started inside EmailSuccess.com, a Mail Transfer Agent (like sendmail).
EmailSuccess is a Java application and it initially used MySQL to store message state.
In 2016 we started the development of EmailSuccess Cluster and we needed a better
solution for storing message state.
Requirements:
• Embedded: The database should run inside the same Java process (like SQLite)
• Replicated: The database should scale out and offer high availability automatically
Solutions were already present on the market, but none of them met both
requirements.

Some HerdDB users
Currently HerdDB is used in production:
- As Primary SQL DB for EmailSuccess.com standalone
- As Replicated DB for EmailSuccess.com cluster
- As Metadata Service on BlobIt – Binary Large Object Storage built over
Apache BookKeeper (https://github.com/diennea/blobit)
- As Configuration and Certificate store for CarapaceProxy HTTP server
(https://github.com/diennea/carapaceproxy)
- As SQL Database for Apache Pulsar Manager
(https://github.com/apache/pulsar-manager)
- In its HerdDB Collections Framework flavour at https://magnews.com
- As a MySQL replacement in a few other non-open-source internal products at
https://diennea.com

Another Embedded Database for the JVM
Embedding the database with the application brings these advantages:
- No additional deployment and management tools for users.
- The database is hidden from the user: you have to manage «only» the application.
- The application is tested with only one version of the Database (the same as in
production).
- Easier to test for developers (run the DB in-memory for unit tests).
Challenges:
- Memory management: the DB cannot use all of the heap of the client application.
- Have sensible defaults.
- Enable automatic management routines by default.
- Handle upgrade procedures automatically, without manual intervention.
- Pay much attention to wire protocol changes and to disk formats.

Replication for an Embedded Database
Replication: high-availability and scalability.
Benefits:
- Work even in case of temporary or permanent loss of machines.
- Keep data closer to the clients.
- No shared disks or network file systems.
- Scale out by adding machines to the application.
Challenges:
- Gracefully handle temporary failures: long GC pauses, application restarts.
- Understand the storage topology: the WAL and cold data have separate replication
mechanisms.
- Need some external source of truth: Apache ZooKeeper.

HerdDB - Data model: TableSpaces
The database is partitioned into TableSpaces. A TableSpace is a group of Tables.
You can run transactions and joins that span tables of the same TableSpace.
A TableSpace must reside entirely on every machine assigned to it.
TableSpace assets:
- Tables
- Indexes
- Open Transactions
- Write-ahead log
- Checkpoint data
[Diagram: a TableSpace containing Table 1 and Table 2 (each with a PK Index and Secondary Indexes), the Write-ahead log, and the Transactions.]

HerdDB - Data Model: Tables
A Table is a simple key-value dictionary (byte[] -> byte[]).
Clients use the JDBC API and the SQL language.
On top of the key-value structure we have a SQL layer:
- Key = Primary Key
- Value = All of the other columns
We are using Apache Calcite for SQL work: Parser and Planner.
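As a rough illustration of the key/value split above, the sketch below encodes one row of a hypothetical table users(id, name, age): the primary key becomes the byte[] key and the other columns are packed into the byte[] value. The table, the class name, and the binary layout are assumptions made for this example; they are not HerdDB's actual record format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical encoding of a row of users(id PRIMARY KEY, name, age).
// The layout is an assumption for illustration, not HerdDB's on-disk format.
class RowEncodingSketch {

    // Key = Primary Key column.
    static byte[] encodeKey(String id) {
        return id.getBytes(StandardCharsets.UTF_8);
    }

    // Value = all of the other columns: [name length][name bytes][age].
    static byte[] encodeValue(String name, int age) {
        byte[] nameBytes = name.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(4 + nameBytes.length + 4)
                .putInt(nameBytes.length)
                .put(nameBytes)
                .putInt(age)
                .array();
    }

    // Reads the name column back from an encoded value.
    static String decodeName(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        byte[] nameBytes = new byte[buf.getInt()];
        buf.get(nameBytes);
        return new String(nameBytes, StandardCharsets.UTF_8);
    }

    // Reads the age column, which sits right after the length-prefixed name.
    static int decodeAge(byte[] value) {
        ByteBuffer buf = ByteBuffer.wrap(value);
        int nameLength = buf.getInt();
        return buf.getInt(4 + nameLength);
    }

    public static void main(String[] args) {
        byte[] key = encodeKey("u1");
        byte[] value = encodeValue("Alice", 42);
        System.out.println(key.length + " key bytes, " + value.length + " value bytes");
        System.out.println(decodeName(value) + " " + decodeAge(value));
    }
}
```

With this split, a lookup by primary key needs only the key bytes, while the SQL layer decodes the value to access the remaining columns.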
Core Data Structures (TableManager):
- Data Page: a bunch of records, indexed by PK (hashmap)
- Primary Key Index: a map that binds a PK value to the id of a data page
- Dirty Page Buffer: a Data Page that is open for writes
- Secondary indexes: they map a value to a set of PKs
The HerdDB Collections Framework uses the key-value structure directly.
TableManager
  Secondary index:    VALUE1 -> KEY1, VALUE2 -> KEY1, VALUE3 -> KEY2
  Primary Key Index:  KEY1 -> PAGE 1, KEY2 -> PAGE 2, KEY3 -> PAGE 1
  Page 1 (immutable): KEY1 -> RECORD1v1, KEY2 -> RECORD2v1, KEY3 -> RECORD3v1
  Page 2 (writable):  KEY2 -> RECORD2v2
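The structures above can be modelled with a few maps. This is a minimal sketch, not HerdDB's TableManager: the class and method names are hypothetical, and Strings stand in for the byte[] keys and values of a real table. It rebuilds the state shown in the diagram and follows the read path from the indexes to the pages.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Simplified, hypothetical model of the core data structures (not HerdDB's classes).
class TableSketch {
    // Data pages: page id -> (primary key -> record).
    final Map<Long, Map<String, String>> pages = new HashMap<>();
    // Primary Key Index: PK -> id of the page that holds the latest version.
    final Map<String, Long> pkIndex = new HashMap<>();
    // Secondary index: indexed value -> set of PKs.
    final Map<String, Set<String>> secondaryIndex = new HashMap<>();

    // Stores a record version on a page and points the PK index at that page.
    void store(long pageId, String key, String record) {
        pages.computeIfAbsent(pageId, id -> new HashMap<>()).put(key, record);
        pkIndex.put(key, pageId);
    }

    void index(String value, String key) {
        secondaryIndex.computeIfAbsent(value, v -> new TreeSet<>()).add(key);
    }

    // Read by primary key: PK index -> page -> record.
    String get(String key) {
        Long pageId = pkIndex.get(key);
        return pageId == null ? null : pages.get(pageId).get(key);
    }

    // Read by secondary index: value -> PKs -> records.
    List<String> findByValue(String value) {
        List<String> result = new ArrayList<>();
        for (String key : secondaryIndex.getOrDefault(value, new TreeSet<>())) {
            result.add(get(key));
        }
        return result;
    }

    // Rebuilds the state drawn in the diagram.
    static TableSketch diagramState() {
        TableSketch t = new TableSketch();
        t.store(1, "KEY1", "RECORD1v1");
        t.store(1, "KEY2", "RECORD2v1");
        t.store(1, "KEY3", "RECORD3v1");
        t.store(2, "KEY2", "RECORD2v2"); // newer version of KEY2 on page 2
        t.index("VALUE1", "KEY1");
        t.index("VALUE2", "KEY1");
        t.index("VALUE3", "KEY2");
        return t;
    }

    public static void main(String[] args) {
        TableSketch t = diagramState();
        System.out.println(t.get("KEY2"));
        System.out.println(t.findByValue("VALUE3"));
    }
}
```

Note that the old version of KEY2 is still present on Page 1; readers never see it because the Primary Key Index points to Page 2.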

HerdDB - How it works: the Write Path
Write Path flow for INSERT/UPDATE operations:
1) Validate the operation.
2) Write to the WAL.
3) If the record is a new record or it was on another page, write the new version
to the Dirty Page Buffer and update the PK index.
3b) If the record was already on the Dirty Page Buffer, simply replace it.
Multiple versions of the same record can exist at the same time; old versions are discarded
during maintenance operations, usually during checkpoints.
When you are working inside a Transaction, all of the writes are applied to an
internal buffer local to the Transaction and they are not visible to other
Transactions.
In this case writes to the Dirty Page Buffer are deferred to the «commit»
operation.
TableManager (before updating KEY2)
  Secondary index:    VALUE1 -> KEY1, VALUE2 -> KEY1, VALUE3 -> KEY2
  Primary Key Index:  KEY1 -> PAGE 1, KEY2 -> PAGE 1, KEY3 -> PAGE 1
  Page 1 (immutable): KEY1 -> RECORD1v1, KEY2 -> RECORD2v1, KEY3 -> RECORD3v1
  Page 2 (writable):  (empty)
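The transaction behaviour described above (writes kept in a buffer local to the Transaction, applied only at commit) can be sketched as follows. The class names are hypothetical, the shared store is reduced to a single map, and letting a transaction read its own pending writes is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of transaction-local write buffering (not HerdDB's code).
class TransactionSketch {

    // Shared, committed state: what every other reader sees.
    static final class Store {
        final Map<String, String> committed = new HashMap<>();

        String read(String key) {
            return committed.get(key);
        }
    }

    static final class Transaction {
        private final Store store;
        // Writes stay here until commit, invisible to other transactions.
        private final Map<String, String> localWrites = new HashMap<>();

        Transaction(Store store) {
            this.store = store;
        }

        void write(String key, String value) {
            localWrites.put(key, value);
        }

        // Assumption: a transaction sees its own pending writes first.
        String read(String key) {
            return localWrites.containsKey(key) ? localWrites.get(key) : store.read(key);
        }

        // Commit: only now are the buffered writes applied to the shared state.
        void commit() {
            store.committed.putAll(localWrites);
            localWrites.clear();
        }

        // Rollback: the buffered writes are simply dropped.
        void rollback() {
            localWrites.clear();
        }
    }

    public static void main(String[] args) {
        Store store = new Store();
        Transaction tx = new Transaction(store);
        tx.write("KEY1", "RECORD1v1");
        System.out.println("before commit: " + store.read("KEY1"));
        tx.commit();
        System.out.println("after commit: " + store.read("KEY1"));
    }
}
```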

TableManager (after updating KEY2)
  Secondary index:    VALUE1 -> KEY1, VALUE2 -> KEY1, VALUE3 -> KEY2
  Primary Key Index:  KEY1 -> PAGE 1, KEY2 -> PAGE 2, KEY3 -> PAGE 1
  Page 1 (immutable): KEY1 -> RECORD1v1, KEY2 -> RECORD2v1, KEY3 -> RECORD3v1
  Page 2 (writable):  KEY2 -> RECORD2v2
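The write path steps can be sketched in a few lines. This is a single-threaded toy model under stated assumptions, not HerdDB's implementation: the class name, the sealDirtyPage helper, and the textual WAL entries are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, single-threaded model of the write path (not HerdDB's code).
class WritePathSketch {
    final List<String> wal = new ArrayList<>();                    // stand-in for the WAL
    final Map<Long, Map<String, String>> pages = new HashMap<>();  // page id -> records
    final Map<String, Long> pkIndex = new HashMap<>();             // PK -> page id
    long dirtyPageId = 1;                                          // the page open for writes

    WritePathSketch() {
        pages.put(dirtyPageId, new HashMap<>());
    }

    // Treats the current dirty page as immutable from now on and opens a new one.
    void sealDirtyPage() {
        dirtyPageId++;
        pages.put(dirtyPageId, new HashMap<>());
    }

    void upsert(String key, String record) {
        // 1) Validate the operation.
        if (key == null || record == null) {
            throw new IllegalArgumentException("key and record are required");
        }
        // 2) Write to the WAL.
        wal.add("PUT " + key + "=" + record);
        Long currentPage = pkIndex.get(key);
        if (currentPage != null && currentPage == dirtyPageId) {
            // 3b) Already on the dirty page: replace in place, PK index untouched.
            pages.get(dirtyPageId).put(key, record);
        } else {
            // 3) New record, or latest version on another page: write the new
            // version to the dirty page and point the PK index at it.
            pages.get(dirtyPageId).put(key, record);
            pkIndex.put(key, dirtyPageId);
        }
    }

    public static void main(String[] args) {
        WritePathSketch table = new WritePathSketch();
        table.upsert("KEY1", "RECORD1v1");
        table.upsert("KEY2", "RECORD2v1");
        table.upsert("KEY3", "RECORD3v1");
        table.sealDirtyPage();              // Page 1 becomes immutable
        table.upsert("KEY2", "RECORD2v2");  // new version lands on Page 2
        System.out.println(table.pkIndex);
        System.out.println(table.pages);
    }
}
```

The old version of KEY2 stays on Page 1 until a maintenance operation discards it; only the Primary Key Index decides which version is current.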

TableManager (before updating KEY2)
  Primary Key Index:  KEY1 -> PAGE 1, KEY2 -> PAGE 1, KEY3 -> PAGE 1
  Page 1 (writable):  KEY1 -> RECORD1v1, KEY2 -> RECORD2v1, KEY3 -> RECORD3v1
If the latest version of the record is already on a writable page, then we can simply replace the content of the record.
No need to update the Primary Key index.

TableManager (after updating KEY2)
  Primary Key Index:  KEY1 -> PAGE 1, KEY2 -> PAGE 1, KEY3 -> PAGE 1
  Page 1 (writable):  KEY1 -> RECORD1v1, KEY2 -> RECORD2v2, KEY3 -> RECORD3v1

HerdDB - How it works: Replication
A HerdDB Cluster is made of:
- HerdDB nodes
- BookKeeper servers (Bookies)
- ZooKeeper servers
It is common to run the Bookie inside the same process as the HerdDB node.
It is also common to run the HerdDB node inside the same process as the
client application.
[Diagram: three Apps, three HerdDB Servers, and three Bookies shown as separate components, plus ZK.]

[Diagram, second layout: three Apps, each HerdDB node sharing a process with a Bookie (HerdDB + Bookie), plus ZK.]

[Diagram, third layout: each App sharing a process with its HerdDB node and Bookie (App + HerdDB + Bookie), plus ZK.]

A TableSpace is a Replicated State Machine:
- The state of the machine is the set of tables and their contents.
- We have a list of replicas: one node is the leader node, the
others are the followers.
- Clients only talk to the leader node.
- The leader node writes state changes to BookKeeper.
- Followers read the tail of the log and apply every operation to
the local copy of the tablespace, in the same order.
- Nodes never talk to each other for write/read operations: only
for the initial bootstrap of a new replica.
Apache BookKeeper guarantees the overall consistency:
- Fencing mechanism.
- Last Add Confirmed Protocol.
Apache ZooKeeper is the glue and the source of truth:
- Service Discovery.
- Metadata management.
- Coordination.
[Diagram: the Client reads from and writes to Node1 (leader); Node1 writes data changes to Bookie1, Bookie2, and Bookie3; Node2 and Node3 (followers) tail the log, read the changes, and apply them locally; ZooKeeper sits alongside the nodes.]
Two levels of Replication:
- Write ahead log: Apache BookKeeper.
- TableSpace data structures: HerdDB runtime.
BookKeeper client on the leader node:
- Writes directly to each Bookie in the chosen ensemble.
- Spreads data among bookies.
- Deals with temporary and permanent failure of bookies.
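To make "spreads data among bookies" concrete, here is a sketch of round-robin striping: with an ensemble of E bookies and a write quorum of Qw, entry e goes to the bookies at positions e mod E, (e + 1) mod E, and so on. This reflects our understanding of BookKeeper's round-robin distribution; the class, the method name, and the sizes used are illustrative assumptions, not BookKeeper client code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative round-robin striping of entries over an ensemble of bookies.
class StripingSketch {

    // Returns the positions (inside the ensemble) of the bookies that store the entry.
    static List<Integer> writeSet(long entryId, int ensembleSize, int writeQuorum) {
        List<Integer> bookies = new ArrayList<>();
        for (int i = 0; i < writeQuorum; i++) {
            bookies.add((int) ((entryId + i) % ensembleSize));
        }
        return bookies;
    }

    public static void main(String[] args) {
        // Assumed sizes for the example: 3 bookies, each entry written to 2 of them.
        for (long entryId = 0; entryId < 4; entryId++) {
            System.out.println("entry " + entryId + " -> bookies " + writeSet(entryId, 3, 2));
        }
    }
}
```

When the write quorum equals the ensemble size, every bookie stores every entry; a smaller write quorum spreads the load across the ensemble.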
HerdDB writes to the local disk in these cases:
- Checkpoint: the server consolidates the local copy of the data at a
given point in time (log position).
- Low memory: swap out current dirty pages to temporary files.
WAL truncation happens with a time-based policy.
Followers store a durable local copy of the whole tablespace only
eventually.
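A checkpoint ties a consolidated copy of the data to a log position, so recovery only needs to replay the log entries that follow it. The sketch below shows that idea with hypothetical classes and an in-memory log; it does not model HerdDB's file formats or its time-based WAL truncation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of checkpoint plus log replay (not HerdDB's code).
class CheckpointSketch {

    // A consolidated copy of the data at a given log position.
    static final class Checkpoint {
        final Map<String, String> data;
        final long logPosition;

        Checkpoint(Map<String, String> data, long logPosition) {
            this.data = data;
            this.logPosition = logPosition;
        }
    }

    final List<String[]> wal = new ArrayList<>();          // log position = list index
    final Map<String, String> memory = new HashMap<>();    // current in-memory state

    void write(String key, String value) {
        wal.add(new String[] {key, value});
        memory.put(key, value);
    }

    // Consolidates the current state and remembers the last log position it covers.
    Checkpoint checkpoint() {
        return new Checkpoint(new HashMap<>(memory), wal.size() - 1);
    }

    // Recovery: start from the checkpoint, then replay only the later entries.
    static Map<String, String> recover(Checkpoint checkpoint, List<String[]> wal) {
        Map<String, String> state = new HashMap<>(checkpoint.data);
        for (int i = (int) checkpoint.logPosition + 1; i < wal.size(); i++) {
            state.put(wal.get(i)[0], wal.get(i)[1]);
        }
        return state;
    }

    public static void main(String[] args) {
        CheckpointSketch node = new CheckpointSketch();
        node.write("KEY1", "RECORD1v1");
        Checkpoint checkpoint = node.checkpoint();
        node.write("KEY1", "RECORD1v2");
        System.out.println(recover(checkpoint, node.wal));
    }
}
```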

HerdDB - How it works: Follower promotion and Fencing
[Diagram: the Client reads from and writes to N1 (leader); N1 writes data changes to Bookie1, Bookie2, and Bookie3; N2 (follower) reads the changes and applies them locally; ZooKeeper sits alongside the nodes.]
Apache BookKeeper guarantees the consistency of TableSpace data by
means of the fencing mechanism.
What happens when a follower node N2 is promoted to be leader
instead of N1?
- N1 is the current leader
- The client is sending all of the requests to N1
- But N1 has been partitioned from the cluster and some network paths
are no longer available:
  - The client is still connected to it
  - N1 is still able to talk to Bookies
  - N1 is not receiving notifications from ZK
- Someone updates TableSpace metadata on ZK: the leader is now
officially N2
- Please note that N1 does not receive any notification from ZK
about the change of role

HerdDB - How it works: Follower promotion and Fencing
- N2 starts recovery and opens all of the ledgers with the ‘recovery’
flag
- Bookies fence out N1
- N1 receives the «fenced» error response at the first write
operation and stops
- Any client still connected to N1 receives a «not leader» error from
N1 and discovers the new leader using ZK
- Now new writes go to N2
- When N1 is back to its normal operating state it recovers from the log
and starts being a follower.
Please note that operations on the hot paths, reads and writes, do not
access the metadata service, but the overall consistency is still
guaranteed.
[Diagram: the Client sends a write to N1 (old leader); N2 (new leader) opens the log on Bookie1, Bookie2, and Bookie3 with fencing; N1 receives the «Fenced» failure response; the Client receives the «No more leader» failure response and sends the new write to Node2.]
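The promotion sequence can be simulated with a toy ledger that rejects writes once it has been opened for recovery. This is a sketch under stated assumptions: the classes, the string result codes, and the single in-memory ledger are hypothetical and only mimic the behaviour described above, not the BookKeeper client API.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation of fencing during follower promotion (not BookKeeper's API).
class FencingSketch {

    static final class FencedException extends RuntimeException {
    }

    // Stand-in for a ledger stored on the bookies.
    static final class Ledger {
        private final List<String> entries = new ArrayList<>();
        private boolean fenced;

        void add(String entry) {
            if (fenced) {
                throw new FencedException(); // the old writer has been fenced out
            }
            entries.add(entry);
        }

        // Opening with recovery fences the ledger and returns what was written so far.
        List<String> openWithRecovery() {
            fenced = true;
            return new ArrayList<>(entries);
        }
    }

    static final class Node {
        final String name;
        Ledger writeLedger; // null while the node is not the leader

        Node(String name) {
            this.name = name;
        }

        // Returns "OK" or "NOT_LEADER"; a fenced write makes the node stop leading.
        String write(String entry) {
            if (writeLedger == null) {
                return "NOT_LEADER";
            }
            try {
                writeLedger.add(entry);
                return "OK";
            } catch (FencedException e) {
                writeLedger = null;
                return "NOT_LEADER";
            }
        }

        // Promotion: recover (and fence) the old ledger, then open a new one for writes.
        List<String> promote(Ledger oldLedger) {
            List<String> recovered = oldLedger.openWithRecovery();
            writeLedger = new Ledger();
            return recovered;
        }
    }

    public static void main(String[] args) {
        Ledger ledger = new Ledger();
        Node n1 = new Node("N1");
        n1.writeLedger = ledger;
        Node n2 = new Node("N2");
        System.out.println("N1 write: " + n1.write("op1"));
        n2.promote(ledger); // N1 is not notified about the change of role
        System.out.println("N1 write after promotion of N2: " + n1.write("op2"));
        System.out.println("N2 write: " + n2.write("op2"));
    }
}
```

The old leader learns about its demotion from the failed write itself, which is why no metadata lookup is needed on the hot path.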

Wrap up
HerdDB is a good option for you:
- If you want an Open Source SQL database with built-in automatic replication
- If you want to embed the SQL database into your Java application
Benefits of embedding a Replicated Database:
- Run and manage only the application, hiding the complexity of the distributed Database
from system administrators
- Run and test the application with the same Database you use in production
HerdDB replication is based on Apache BookKeeper and Apache ZooKeeper:
- BookKeeper implements replication for the WAL and guarantees the
consistency of your data
- ZooKeeper is the glue and the base for coordination, metadata management
and service discovery

https://herddb.com
https://bookkeeper.apache.org
Twitter: @eolivelli
LinkedIn: https://www.linkedin.com/in/enrico-olivelli-984b7874/
We can chat on https://bookkeeper.apache.org/community/slack/
Thank you.
