This talk compares CrateDB and PostgreSQL. PostgreSQL uses a multi-process architecture with B-tree indexes and ACID transactions, while CrateDB uses a distributed, multi-threaded architecture with Lucene indexes and eventual consistency. PostgreSQL excels at transactions and SQL features on a single node but needs third-party software for distribution. CrateDB is distributed by design and offers high ingest speeds, but it lacks some SQL features and ACID transactions. PostgreSQL suits transactional workloads and small datasets, while CrateDB suits analytics, machine learning, and fulltext search on large datasets.
2. About
~2yrs at Crate.io
DevRel/Field Engineering/Support/
Integrations/…
Speaking
Conferences, meetups, ...
Working with customers
Consulting, pre- and post-sales
@claus__m
3. Agenda
Failures
What, how, and when?
PostgreSQL
Concept overview
CrateDB
Concept overview
Discussion
NewSQL or not? Benefits and drawbacks.
Use Cases
Wrap up
6. Database Failures
Consequences
Data loss
Lost updates, dirty reads, ...
Service interruptions
Services can’t work without their database
Slow performance
Users may lose interest
Pressure
DBAs in the spotlight
7. What Makes Databases Fail?
Overloaded
Insufficient hardware (RAM, CPU, disk),
swapping, inefficient queries
Failure
Hardware may fail on many levels: e.g.
Network, disk, RAM
Platform
Configuration errors, updates, resource
sharing, bugs
People
Malicious intent, sloppiness, ...
9. Overview
Concepts and other things
Index and data
How the database creates indices, stores and
retrieves data
Search and scans
How the data is found
Replication and high availability
Distribution and achieving zero downtime
Assessment
11. Overview
Multi-process System
fork() to clone processes from postmaster to
postgres instances with shared memory
Technology
C based, natively compiled
Optimization
Cost-based optimizer
Transactional
ACID compliant
12. Index And Data
Tree-based
A B-tree by default, created with CREATE INDEX or
implicitly through PRIMARY KEY and UNIQUE constraints
in CREATE TABLE or ALTER TABLE
In Memory & On Disk
8K data pages in shared buffer cache and on
disk
Item Pointers
Only major changes are reflected in the index
(e.g. INSERTs/DELETEs)
14. Searches And Scans
Sequential
Go over every block and execute a predicate
Index-based
Find something using an index on that column,
or a full index scan
Bitmap-based
Mark matches in boolean queries for results
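The three access paths above can be illustrated with a toy model. This is a minimal sketch in Python, not PostgreSQL's implementation: the table is a plain list, each index is a sorted list of keys with matching row positions (a stand-in for a B-tree with item pointers), and all names and data are made up for illustration.

```python
from bisect import bisect_left, bisect_right

# Toy table: each row is (id, city, age). The list position plays the role
# of an item pointer. All data here is invented for illustration.
ROWS = [
    (1, "Berlin", 31),
    (2, "Vienna", 45),
    (3, "Berlin", 27),
    (4, "Zurich", 45),
    (5, "Vienna", 31),
]


def build_index(rows, column):
    """Sorted keys plus matching row positions; a stand-in for a B-tree."""
    pairs = sorted((row[column], pos) for pos, row in enumerate(rows))
    keys = [key for key, _ in pairs]
    positions = [pos for _, pos in pairs]
    return keys, positions


def _matching_positions(index, key):
    """Binary-search the index and return the row positions for key."""
    keys, positions = index
    lo, hi = bisect_left(keys, key), bisect_right(keys, key)
    return positions[lo:hi]


def seq_scan(rows, predicate):
    """Sequential scan: visit every row and evaluate the predicate."""
    return [row for row in rows if predicate(row)]


def index_scan(rows, index, key):
    """Index scan: find positions via the index, then fetch those rows."""
    return [rows[pos] for pos in _matching_positions(index, key)]


def bitmap_scan(rows, index_a, key_a, index_b, key_b):
    """Bitmap scan for `a = key_a AND b = key_b`: mark matches per index,
    combine the two bitmaps, then fetch the marked rows in table order."""
    def bitmap(index, key):
        bits = [False] * len(rows)
        for pos in _matching_positions(index, key):
            bits[pos] = True
        return bits

    combined = [x and y for x, y in zip(bitmap(index_a, key_a),
                                        bitmap(index_b, key_b))]
    return [row for row, hit in zip(rows, combined) if hit]
```

With `city_idx = build_index(ROWS, 1)` and `age_idx = build_index(ROWS, 2)`, a call such as `bitmap_scan(ROWS, city_idx, "Vienna", age_idx, 45)` combines both indexes before touching the table, which is the point of the bitmap approach for boolean conditions.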
15. Replication And High Availability
Disk based
By sharing a disk or continuously cloning a disk
Log-shipping
Send the write-ahead-log to the standby server,
which can answer reads
Master/Master
Sends rows to the other master, can answer
reads and writes, locks rows/tables
Client-sharding
Shard the data on a client/proxy and route
accordingly
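Client-side sharding can be sketched as a routing function that maps a key to one of several independent servers. The following is a minimal illustration in Python under stated assumptions: the server names and the `customer_id` column are hypothetical, and a simple hash-modulo scheme is assumed rather than any particular proxy's algorithm.

```python
import hashlib

# Hypothetical names of independent PostgreSQL servers.
SHARDS = ["pg-shard-0", "pg-shard-1", "pg-shard-2"]


def shard_for(key, shards=SHARDS):
    """Pick a shard from a stable hash of the key.

    hashlib is used instead of the built-in hash(), which is randomized
    per process for strings and would route inconsistently.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


def route(rows, key_column="customer_id", shards=SHARDS):
    """Group rows into one batch per shard, ready to be sent separately."""
    batches = {shard: [] for shard in shards}
    for row in rows:
        batches[shard_for(str(row[key_column]), shards)].append(row)
    return batches
```

Note that with hash-modulo routing, changing the number of shards remaps most keys, so resharding means moving data; that trade-off is part of why this approach pushes complexity to the client or proxy.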
17. Overview
Multi-threaded System
Thread-pools to read/write Lucene segments
Technology
Java/JVM based
Optimization
Naive optimization at the query level
Eventually Consistent
Atomic operations per row, optimistic
concurrency only
Distributed By Default
Transparent partitioning and sharding
18. Index And Data
Inverted index
Term dictionary where field values point to
rows (posting list)
Field cache
“Inverted inverted index”, column names point
to the possible values and their rows
On disk, cached in memory
Immutable segments on disk, binary search in
each segment, cached with mmap() into RAM
pages
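The two structures above can be sketched with plain dictionaries and lists. This is a simplified illustration in Python, not Lucene's on-disk format: the inverted index maps each field value to a posting list of row ids, and the field cache is modeled here as a per-column array of values indexed by row id, which is one reading of the "inverted inverted index" idea. The sample documents and function names are invented.

```python
from collections import Counter, defaultdict

# Invented sample rows; each row is a document, each column a field.
DOCS = [
    {"name": "alice", "city": "berlin"},
    {"name": "bob", "city": "vienna"},
    {"name": "carol", "city": "berlin"},
]


def build_inverted_index(docs):
    """field -> term -> posting list (row ids in ascending order)."""
    index = defaultdict(lambda: defaultdict(list))
    for row_id, doc in enumerate(docs):
        for field, value in doc.items():
            index[field][value].append(row_id)
    return index


def lookup(index, field, term):
    """Answer `field = term` straight from the posting list."""
    return list(index[field].get(term, []))


def build_field_cache(docs, field):
    """row id -> value for one column; the reverse direction of the
    posting lists, handy for sorting and aggregating."""
    return [doc.get(field) for doc in docs]


def count_by(docs, field):
    """A GROUP BY style count that only reads the field cache."""
    return Counter(build_field_cache(docs, field))
```

The inverted index answers "which rows contain this value?", while the field cache answers "which value does this row have?", which is why searches use the former and sorting or aggregation use the latter.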
20. Index And Data
Shards
Compounds of multiple immutable segments,
merged occasionally
Rows are documents, columns are fields
Vector space model to weight and score
searches (_score field)
Multi-threaded index access
Shards are multiple segments, each is read
with a thread
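The shard and segment model can be sketched as follows. This is a toy illustration in Python, not how Lucene or CrateDB is implemented: a segment is an immutable sorted tuple of terms searched by binary search, a shard is a list of segments searched with one task per segment, and a merge produces a new segment without modifying the old ones. All function names are hypothetical.

```python
from bisect import bisect_left
from concurrent.futures import ThreadPoolExecutor


def make_segment(terms):
    """An immutable segment: a sorted tuple of distinct terms."""
    return tuple(sorted(set(terms)))


def segment_contains(segment, term):
    """Binary search within a single segment."""
    i = bisect_left(segment, term)
    return i < len(segment) and segment[i] == term


def shard_contains(segments, term):
    """Search every segment of a shard, one task per segment.

    The thread pool only mirrors the structure described above; in CPython
    the GIL prevents real CPU parallelism for this kind of work.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(segments))) as pool:
        hits = pool.map(lambda seg: segment_contains(seg, term), segments)
        return any(hits)


def merge_segments(segments):
    """Merge segments into one new segment; the inputs stay untouched."""
    merged = set()
    for segment in segments:
        merged.update(segment)
    return make_segment(merged)
```

Because segments never change after they are written, readers need no locks; the cost is that many small segments accumulate, which is what the occasional merge addresses.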
21. Replication And High Availability
Shared nothing architecture
Every node handles every task
Shard-based
Replicas are copies of shards that are
distributed in the cluster evenly
Consistency
Elected leader maintains and distributes a
consistent cluster state
CAP
Tunable consistency with synchronous inserts
24. PostgreSQL: Weaknesses
Distribution
High availability or working with huge data sets
requires 3rd party software, partitioning
Ingest speed
ACID compliance slows down inserts
Operational Complexity/DevOps Readiness
Highly controllable features make it hard to
manage
Schema Flexibility
Schema evolution management required
25. CrateDB: Strengths
Distribution
Distributed by nature, with tunable consistency
Ingest speed
Solid insert speeds with bulk inserts
Operational Complexity/DevOps Readiness
High flexibility, containerization, sane defaults
Schema Flexibility
Schema evolution on the fly
Built-in Search
Fulltext capabilities
26. CrateDB: Weaknesses
Single-Node-Performance
Distribution overhead requires a certain cluster
size to be efficient
SQL Features
Many features are still missing or hard to do in
a distributed system
Transactions
No ACID compliance, eventual
consistency/optimistic concurrency requires
client-side handling
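What "client-side handling" of optimistic concurrency means can be sketched with a versioned in-memory store. This is an illustration in Python only: the `Store` class, `VersionConflict`, and `update_with_retry` are hypothetical and do not represent CrateDB's API. The assumed pattern is that a write names the version it read, the store rejects the write if the row has changed since, and the client re-reads and retries.

```python
class VersionConflict(Exception):
    """Raised when a row changed between the client's read and write."""


class Store:
    """Hypothetical in-memory store with one version counter per row."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing row has version 0."""
        return self._rows.get(key, (0, None))

    def write(self, key, value, expected_version):
        """Apply the write only if nobody else wrote in the meantime."""
        current_version, _ = self.read(key)
        if current_version != expected_version:
            raise VersionConflict(key)
        self._rows[key] = (current_version + 1, value)


def update_with_retry(store, key, change, max_attempts=5):
    """Client-side loop: read, compute, conditional write, retry on conflict."""
    for _ in range(max_attempts):
        version, value = store.read(key)
        try:
            store.write(key, change(value), expected_version=version)
            return version + 1
        except VersionConflict:
            continue  # someone else won the race; re-read and try again
    raise VersionConflict(key)
```

Nothing is locked while the client computes its change, which keeps writers from blocking each other; the price is that every client has to carry this retry logic itself.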
28. Use Cases: PostgreSQL
ORMs
Broad integration in various object-relational
mappers in frameworks (Hibernate, …)
Transaction-based workloads
Single, high-value transactions
Extensive SQL compliance
Required support for views, stored procedures,
…
Small data sets
Hundreds of MBs to several GB
29. Use Cases: CrateDB
DevOps
Flexible schemas, ad-hoc queries, easy
maintenance
Analytics, machine learning
Large scale inserts/queries, high concurrency,
SQL
Fulltext search
Built-in tools for text-mining/analysis, built on
the de-facto standard of search