1. BASLE BERN BRUGG DÜSSELDORF FRANKFURT A.M. FREIBURG I.BR. GENEVA
HAMBURG COPENHAGEN LAUSANNE MUNICH STUTTGART VIENNA ZURICH
Apache Cassandra
Under The Hood
Robert Bialek
2. Who Am I
Apache Cassandra Under The Hood2 15.09.2018
Senior Principal Consultant and Trainer at Trivadis GmbH in Munich.
– Master of Science in Computer Engineering.
– At Trivadis since 2004.
– Trivadis Partner since 2012.
Focus:
– Data and service high availability, disaster recovery.
– Architecture design, optimization, automation.
– Troubleshooting.
– Trainer: O-RAC, O-DG.
3. Agenda
1. Introduction
2. Key Components
3. Data Replication
4. Scalability
5. Read/Write Operations
6. Data Consistency
7. Summary
5. What is Apache Cassandra?
Distributed NoSQL (wide column) partitioned row store database, which runs within a
JVM.
Decentralized, highly fault tolerant database with no single point of failure.
Horizontally scalable system (computing resources/performance).
Initially developed at Facebook, released as an open source project in July 2008.
– Based on Amazon's Dynamo and Google's Bigtable.
6. Apache Cassandra & CAP Theorem?
According to CAP (Brewer’s) theorem “it is impossible for a distributed data store to
simultaneously provide more than two out of the following three guarantees”
– Consistency
– Availability
– Partition tolerance
Apache Cassandra is an AP system.
– Data result is eventually consistent (though, consistency is tunable).
– Does not adhere to all ACID properties.
7. Cassandra for Enterprise Applications
Support 24x7x365.
Enterprise features, e.g.: DSE Advanced Security, DSE Analytics, DSE Search, DSE
Graph, DSE Advanced Replication, DSE Tiered Storage, DSE NodeSync, ...
Administration and monitoring with DSE OpsCenter (real-time monitoring, tuning,
provisioning, backup, security management).
According to DataStax, 2x or more throughput compared to Apache Cassandra.
Documentation, client drivers and DSE for development are free to use.
8. Who is Using Cassandra Database?
Source http://cassandra.apache.org
– Apple: over 75,000 nodes storing over 10 PB of data.
– Netflix: 2,500 nodes, 420 TB, over 1 trillion requests per day.
– Chinese search engine Easou: 270 nodes, 300 TB, over 800 million requests per
day.
– eBay: over 100 nodes, 250 TB.
Source https://www.datastax.com/customers
– Microsoft, UBS, Sony, Sky, ING, NEC, Coursera, CISCO, Walmart, NVIDIA,
Samsung, …
10. Node – Basic Database Infrastructure
Commodity hardware, ideally with local storage (reduces dependencies).
Hosts software and configuration files:
– cassandra.yaml, cassandra-rackdc.properties, …
Hosts data and accompanying structures:
– SSTable files: Data.db, Index.db, Statistics.db, CompressionInfo.db, Digest.crc32, Filter.db, TOC.txt
[Figure: Cassandra Node (DSE: Transactional Node)]
11. Keyspaces & Tables
Table (Column Family)
– Stores data based on a primary key.
• Primary key: partition key plus, optionally, clustering columns.
– Physically split into partitions.
– Denormalization (data duplication) is necessary.
Keyspace
– Grouping of data, similar to a schema.
– Defines replication properties.
12. Partitioner – Data Distribution
Determines which node receives data, based on the token of the partition key.
Supplied partitioners (a custom one can be created):
– Murmur3Partitioner (default)
– RandomPartitioner
– ByteOrderedPartitioner
[Figure: data 'Cassandra' → PARTITIONER → token 356242581507269238]
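The key-to-token mapping can be sketched as follows. This is a minimal illustration, assuming a stand-in hash: `toy_token` is a hypothetical helper that uses MD5 from the Python standard library rather than Murmur3, so its output will not match the tokens Cassandra computes. It only shows the idea that the same partition key always yields the same signed 64-bit token.

```python
import hashlib


def toy_token(partition_key: str) -> int:
    """Map a partition key to a signed 64-bit token.

    Assumption: MD5 is used here only because it ships with Python.
    Murmur3Partitioner uses a different hash, so values will differ.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    # Take the first 8 bytes and interpret them as a signed 64-bit integer.
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


for key in ("Cassandra", "cassandra"):
    print(key, toy_token(key))
```

The token depends only on the partition key, which is why every node can work out where a row belongs without asking a central directory.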
13. Cassandra Ring – Single Token Architecture
Cassandra Ring (example partitioner, token range: 1 – 40):
– Node with initial_token: 1 owns token range 31 – 40, 1
– Node with initial_token: 10 owns token range 2 – 10
– Node with initial_token: 20 owns token range 11 – 20
– Node with initial_token: 30 owns token range 21 – 30
[Figure: data → token → owning node on the ring]
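The ownership rule in this example ring can be sketched with a sorted token list and a binary search. The function name `owner_token` is hypothetical; the tokens and ranges are the ones from the slide, where each node owns the tokens after its predecessor's token up to and including its own.

```python
from bisect import bisect_left

# initial_token values of the four nodes in the example ring (token space 1 - 40).
NODE_TOKENS = [1, 10, 20, 30]


def owner_token(token: int) -> int:
    """Return the initial_token of the node that owns the given token."""
    # First node whose token is >= the data token; past the last node,
    # the modulo wraps around to the first node on the ring.
    idx = bisect_left(NODE_TOKENS, token)
    return NODE_TOKENS[idx % len(NODE_TOKENS)]


for t in (1, 2, 10, 11, 30, 31, 40):
    print(t, "->", owner_token(t))
```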
15. Snitches – Ring Topology
Determine the physical location (data center and rack) of a Cassandra node.
Dynamic snitching (enabled by default):
– Monitors the read performance and ring health.
Snitches:
– SimpleSnitch/DseSimpleSnitch (default)
– GossipingPropertyFileSnitch
– PropertyFileSnitch
– Ec2Snitch/Ec2MultiRegionSnitch/GoogleCloudSnitch/CloudstackSnitch
– RackInferringSnitch
[Figure: two data centers (DC 1, DC 2), each with Rack 1 and Rack 2]
16. Gossip – Internode Communication
Peer-to-peer communication protocol to exchange
ring state information.
Gossip process runs every second and exchanges
messages with up to three other nodes in the ring.
Eventually, all nodes learn (indirectly) about all
other nodes.
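How ring state spreads indirectly can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not Cassandra's gossip implementation: every node may pick any other node as a peer, a "round" stands for one gossip interval, and the state exchanged is just the set of node ids each node has heard about. The function `gossip_rounds` and its parameters are hypothetical.

```python
import random


def gossip_rounds(node_count=10, fanout=3, seed=42, max_rounds=1000):
    """Return the number of rounds until every node knows about every node."""
    rng = random.Random(seed)
    nodes = list(range(node_count))
    # Initially each node only knows about itself.
    known = {n: {n} for n in nodes}
    for round_no in range(1, max_rounds + 1):
        for n in nodes:
            candidates = [p for p in nodes if p != n]
            peers = rng.sample(candidates, k=min(fanout, len(candidates)))
            for p in peers:
                # Both sides end up with the union of what they knew.
                merged = known[n] | known[p]
                known[n] = merged
                known[p] = set(merged)
        if all(len(view) == node_count for view in known.values()):
            return round_no
    return None  # did not converge within max_rounds


print("rounds until convergence:", gossip_rounds())
```

Because each exchange merges whole views, a node learns about peers it never contacted directly, which is the "indirectly" in the last point above.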
18. Cassandra Ring – Scale Out
Increases computing power and
throughput of a Cassandra ring.
Online and transparent to the
applications.
[Figure: scale-out flow. Labels: Software & Configuration Files, SEED Node, Ring Information, START Joining Ring, Generate Tokens, Bootstrap (Data Streaming), FINISH Joining Ring, Cassandra Ring]
19. Cassandra Ring – Scale In
Decreases computing power of a
Cassandra ring.
Online and transparent to the
applications.
[Figure: scale-in flow on the Cassandra Ring. Labels: DECOMMISSION, Data Streaming, Remove Tokens, DECOMMISSIONED]
21. Replication – Data High Availability
To ensure data and service high availability, Cassandra stores data on multiple
nodes in a cluster.
All replicas are equally important (no primary or
secondary data).
Replication strategy and replication factor (RF) are
defined at the keyspace (application) level.
– RF can be set differently in different data centers.
Two replication strategies are available:
– SimpleStrategy
– NetworkTopologyStrategy
[Figure: two data centers (DC 1, DC 2), each with Rack 1 and Rack 2]
22. Replication – SimpleStrategy (RF: 2)
[Figure: Data Center 1; all nodes in Rack 1]
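SimpleStrategy placement can be sketched as a clockwise walk around the ring that ignores racks and data centers. The ring below reuses the example tokens from the single-token slide; the node names and the function `simple_strategy_replicas` are hypothetical.

```python
from bisect import bisect_left

# (initial_token, node name); names are made up for the illustration.
RING = [(1, "node-a"), (10, "node-b"), (20, "node-c"), (30, "node-d")]


def simple_strategy_replicas(token: int, rf: int = 2) -> list:
    """Pick the owner of the token plus the next rf - 1 nodes clockwise."""
    tokens = [t for t, _ in RING]
    start = bisect_left(tokens, token) % len(RING)
    count = min(rf, len(RING))  # cannot place more replicas than nodes
    return [RING[(start + i) % len(RING)][1] for i in range(count)]


print(simple_strategy_replicas(15))
print(simple_strategy_replicas(35))
```

With RF: 2, a row whose token is 15 lands on the owner of range 11 – 20 and on its clockwise neighbor.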
23. Replication – NetworkTopologyStrategy (RF/DC: 2)
[Figure: Data Center 1 and Data Center 2, each with Rack 1 and Rack 2]
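A rack-aware placement in the spirit of NetworkTopologyStrategy can be sketched as below. This is a simplified model, not the actual algorithm: for each data center it walks the ring clockwise, prefers nodes in racks that do not hold a replica yet, and only falls back to already-used racks when there are not enough distinct racks. The ring layout, node names and the function `nts_replicas` are assumptions made for the example.

```python
from bisect import bisect_left

# (token, node, data center, rack); an invented eight-node, two-DC ring.
RING = [
    (5, "n1", "DC1", "R1"), (10, "n2", "DC2", "R1"),
    (15, "n3", "DC1", "R1"), (20, "n4", "DC2", "R1"),
    (25, "n5", "DC1", "R2"), (30, "n6", "DC2", "R2"),
    (35, "n7", "DC1", "R2"), (40, "n8", "DC2", "R2"),
]


def nts_replicas(token: int, rf_per_dc: dict) -> dict:
    """Return {data center: [replica nodes]} for the given token."""
    tokens = [entry[0] for entry in RING]
    start = bisect_left(tokens, token) % len(RING)
    walk = [RING[(start + i) % len(RING)] for i in range(len(RING))]
    result = {}
    for dc, rf in rf_per_dc.items():
        chosen, racks_used, skipped = [], set(), []
        for _, name, node_dc, rack in walk:
            if len(chosen) == rf:
                break
            if node_dc != dc:
                continue
            if rack in racks_used:
                skipped.append(name)  # same rack: keep only as a fallback
            else:
                chosen.append(name)
                racks_used.add(rack)
        # Not enough distinct racks: fill up from the skipped nodes.
        chosen.extend(skipped[: rf - len(chosen)])
        result[dc] = chosen
    return result


print(nts_replicas(2, {"DC1": 2, "DC2": 2}))
```

Spreading replicas over racks means that losing a single rack does not remove every copy of a partition in that data center.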
25. Read Request Flow on a Cassandra Node
[Figure: read path across memory and disk. Components: Memtable, Row Cache, Bloom Filter, Partition Key Cache, Partition Summary, Partition Index, Compression Offset Map, SSTables]
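The role of the Bloom filter in the read path can be sketched with a tiny implementation. It answers "definitely not in this SSTable" or "possibly in this SSTable", which lets a read skip SSTables that cannot contain the partition. The class below is an illustration only; its sizes, hash choice (SHA-256) and method names are assumptions and do not mirror Cassandra's internals.

```python
import hashlib


class BloomFilter:
    def __init__(self, size_bits: int = 1024, hash_count: int = 3):
        self.size_bits = size_bits
        self.hash_count = hash_count
        # One byte per bit keeps the sketch simple (not space efficient).
        self.bits = bytearray(size_bits)

    def _positions(self, key: str):
        # Derive hash_count positions by salting the key with an index.
        for i in range(self.hash_count):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means the key was never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("CASSANDRA")
print(sstable_filter.might_contain("CASSANDRA"))
```

A false positive only costs an unnecessary lookup in the SSTable; a false negative cannot occur, so no data is missed.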
26. Write Request Flow on a Cassandra Node
[Figure: write path across memory and disk. Memory: Memtable. Disk: Commit Log, SSTable files (Data.db, Index.db, Statistics.db, CompressionInfo.db, Digest.crc32, Filter.db, TOC.txt), Compaction Process]
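The write path shown in the figure can be sketched as: append to the commit log, update the memtable, and flush the memtable to a new immutable SSTable once it is full. The class `ToyNode`, its flush threshold and the use of plain Python lists and dicts are assumptions for illustration; a real node writes files and flushes based on memory use.

```python
class ToyNode:
    def __init__(self, flush_threshold: int = 2):
        self.commit_log = []   # stands in for the on-disk commit log
        self.memtable = {}     # key -> (value, timestamp), held in memory
        self.sstables = []     # flushed tables, never modified afterwards
        self.flush_threshold = flush_threshold

    def write(self, key, value, timestamp):
        # 1. Durability first: record the mutation in the commit log.
        self.commit_log.append((key, value, timestamp))
        # 2. Apply it to the memtable, keeping the newest timestamp.
        current = self.memtable.get(key)
        if current is None or timestamp >= current[1]:
            self.memtable[key] = (value, timestamp)
        # 3. Flush when the memtable is considered full.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # SSTables are written sorted by key and are immutable.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        # Flushed data no longer needs the commit log for recovery.
        self.commit_log.clear()


node = ToyNode(flush_threshold=2)
node.write("b", "TEST2", 50)
node.write("a", "TEST1", 100)
print(node.sstables)
```

Writes never modify an existing SSTable, which is why updates and deletes appear as new versions and why compaction is needed later.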
27. Upserts on a Cassandra Node
Table t: Partition Key: TAG, Primary Key: TAG, ID
Partition TAG: CASSANDRA
SSTables (on disk):
ID C1 C2 TSTAMP
1 2 TEST1 100
ID C1 C2 TSTAMP
2 3 TEST2 50
Memtable (new versions written by the statements below):
INSERT INTO t (TAG, ID, C1, C2)
VALUES ('CASSANDRA', 1, 5, 'TEST3');
ID C1 C2 TSTAMP
1 5 TEST3 150
UPDATE t SET C2 = 'PROD1' WHERE
TAG = 'CASSANDRA' AND ID = 1;
ID C2 TSTAMP
1 PROD1 200
DELETE FROM t
WHERE TAG = 'CASSANDRA' AND ID = 2;
ID Tombstone (marked_deleted) TSTAMP
2 250
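How a read reconciles these versions can be sketched as a last-write-wins merge per column, with a tombstone hiding every older cell. The data below is taken from this slide; the fragment layout and the function `merge_row` are hypothetical and only model the timestamps shown here.

```python
def merge_row(fragments):
    """Merge all fragments of one row; return live columns or None."""
    cells = {}          # column -> (value, timestamp)
    deleted_at = None   # newest row tombstone timestamp, if any
    for frag in fragments:
        if "tombstone" in frag:
            ts = frag["tombstone"]
            deleted_at = ts if deleted_at is None else max(deleted_at, ts)
        for column, (value, ts) in frag.get("cells", {}).items():
            # The cell with the highest timestamp wins.
            if column not in cells or ts > cells[column][1]:
                cells[column] = (value, ts)
    if deleted_at is not None:
        # Only cells written after the delete survive.
        cells = {c: v for c, v in cells.items() if v[1] > deleted_at}
    return {c: v[0] for c, v in cells.items()} or None


row_1 = [
    {"cells": {"C1": (2, 100), "C2": ("TEST1", 100)}},   # SSTable
    {"cells": {"C1": (5, 150), "C2": ("TEST3", 150)}},   # INSERT
    {"cells": {"C2": ("PROD1", 200)}},                   # UPDATE
]
row_2 = [
    {"cells": {"C1": (3, 50), "C2": ("TEST2", 50)}},     # SSTable
    {"tombstone": 250},                                  # DELETE
]
print(merge_row(row_1))
print(merge_row(row_2))
```

Nothing is changed in place: the INSERT and UPDATE are new versions of row 1, and the DELETE is a marker that outranks the older version of row 2.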
28. Compaction Process on a Cassandra Node
Input SSTables:
ID C1 C2 TSTAMP
1 2 TEST1 100
ID C1 C2 TSTAMP
2 3 TEST2 50
ID C1 C2 TSTAMP
1 5 TEST3 150
ID C2 TSTAMP
1 PROD1 200
ID Tombstone (marked_deleted) TSTAMP
2 250
ID C1 C2 TSTAMP
3 4 TEST3 120
New SSTable (gc_grace_seconds reached? No, so the tombstone is kept):
ID C1 C2 TSTAMP
1 5 PROD1 300
ID Tombstone (marked_deleted) TSTAMP
2 250
ID C1 C2 TSTAMP
3 4 TEST3 120
Compaction Strategies:
– SizeTieredCompactionStrategy (STCS)
– LeveledCompactionStrategy (LCS)
– TimeWindowCompactionStrategy (TWCS)
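The merge performed by compaction can be sketched as follows. It is a simplified model: the function `compact`, the fragment layout, and the use of the slide's TSTAMP values as if they were seconds are assumptions, and the sketch keeps per-cell timestamps instead of assigning a new one to the merged row. It shows the two decisions from the slide: newer cells replace older ones, and a tombstone is only dropped once gc_grace_seconds has passed.

```python
def compact(sstables, now, gc_grace):
    """Merge SSTables (dicts of row id -> fragment) into one new SSTable."""
    merged = {}
    for table in sstables:
        for row_id, frag in table.items():
            row = merged.setdefault(row_id, {"cells": {}, "tombstone": None})
            ts = frag.get("tombstone")
            if ts is not None and (row["tombstone"] is None or ts > row["tombstone"]):
                row["tombstone"] = ts
            for col, (value, cell_ts) in frag.get("cells", {}).items():
                if col not in row["cells"] or cell_ts > row["cells"][col][1]:
                    row["cells"][col] = (value, cell_ts)
    new_sstable = {}
    for row_id, row in sorted(merged.items()):
        deleted_at = row["tombstone"]
        if deleted_at is not None:
            # Shadowed cells are discarded in any case.
            row["cells"] = {c: v for c, v in row["cells"].items() if v[1] > deleted_at}
            # The tombstone itself is purged only after gc_grace has elapsed.
            if now - deleted_at >= gc_grace:
                row["tombstone"] = None
        if row["cells"] or row["tombstone"] is not None:
            new_sstable[row_id] = row
    return new_sstable


SSTABLES = [
    {1: {"cells": {"C1": (2, 100), "C2": ("TEST1", 100)}}},
    {2: {"cells": {"C1": (3, 50), "C2": ("TEST2", 50)}}},
    {1: {"cells": {"C1": (5, 150), "C2": ("TEST3", 150)}}},
    {1: {"cells": {"C2": ("PROD1", 200)}}},
    {2: {"tombstone": 250}},
    {3: {"cells": {"C1": (4, 120), "C2": ("TEST3", 120)}}},
]
print(compact(SSTABLES, now=300, gc_grace=864000))
```

Keeping the tombstone until gc_grace_seconds has passed gives replicas that missed the delete time to receive it before the marker disappears.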
30. Data Consistency – Overview
Cassandra offers tunable data consistency for read and write operations.
Two types of read requests:
– Direct read request.
– Digest read request.
Inconsistent data can be repaired automatically by:
– Background read repair request.
– NodeSync continuous background repair (only DSE 6).
Inconsistent data can be repaired manually by:
– Anti-Entropy Repair.
31. Tunable Consistency
A tradeoff between data consistency and availability
WRITE Consistency Level    READ Consistency Level
ALL                        ALL
EACH_QUORUM                Not supported.
QUORUM                     QUORUM
LOCAL_QUORUM               LOCAL_QUORUM
ONE, TWO, THREE            ONE, TWO, THREE
LOCAL_ONE                  LOCAL_ONE
ANY                        Not supported.
Not supported.             SERIAL
Not supported.             LOCAL_SERIAL
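The tradeoff can be made concrete with a little arithmetic for a single data center: a quorum is floor(RF / 2) + 1 replicas, and a read is guaranteed to overlap with the latest acknowledged write when the replicas read plus the replicas written exceed RF. The helper names below are hypothetical and only the single-DC levels are modeled.

```python
def quorum(rf: int) -> int:
    """Number of replicas that form a quorum for replication factor rf."""
    return rf // 2 + 1


def replicas_required(level: str, rf: int) -> int:
    """Replicas that must respond for a single-DC consistency level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        return fixed[level]
    if level == "QUORUM":
        return quorum(rf)
    if level == "ALL":
        return rf
    raise ValueError(f"level not modeled in this sketch: {level}")


def reads_see_latest_write(read_level: str, write_level: str, rf: int) -> bool:
    """True when read and write replica sets must overlap (R + W > RF)."""
    return replicas_required(read_level, rf) + replicas_required(write_level, rf) > rf


for r, w in (("QUORUM", "QUORUM"), ("ONE", "ONE"), ("ONE", "ALL")):
    print(r, w, reads_see_latest_write(r, w, rf=3))
```

With RF=3, QUORUM reads and writes tolerate one unavailable replica while still overlapping; ONE/ONE favors availability and latency at the cost of possibly stale reads.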
32. Read Requests & Tunable Consistency (1)
One DC, CONSISTENCY=QUORUM, RF=3
[Figure: Coordinator, Direct Read, Digest Read, speculative_retry]
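The difference between a direct read and a digest read can be sketched as follows: one replica returns the full row, the others return only a hash of their row, and the coordinator compares hashes to detect replicas that disagree. The function names, replica names and the digest format are assumptions for illustration; the sketch does not model speculative retry or timestamps.

```python
import hashlib


def digest(row: dict) -> str:
    """Hash of a row's content, standing in for a digest read response."""
    canonical = repr(sorted(row.items()))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def coordinator_read(direct_row: dict, digests: dict):
    """Return the directly read row and the replicas whose digest differs."""
    expected = digest(direct_row)
    mismatched = [name for name, d in digests.items() if d != expected]
    return direct_row, mismatched


fresh = {"ID": 1, "C1": 5, "C2": "PROD1"}
stale = {"ID": 1, "C1": 2, "C2": "TEST1"}

# QUORUM with RF=3: one direct read plus one digest read.
print(coordinator_read(fresh, {"replica-2": digest(fresh)}))
print(coordinator_read(fresh, {"replica-2": digest(stale)}))
```

Sending digests instead of full rows saves network traffic in the common case where the replicas agree; a mismatch tells the coordinator that the replicas need to be reconciled.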
33. Read Requests & Tunable Consistency (2)
One DC, CONSISTENCY=QUORUM, RF=3
[Figure: Coordinator, Direct Read, Digest Read, Background Read Repair (read_repair_chance=0.10)]
34. Read Requests & Tunable Consistency (3)
Two DC, CONSISTENCY=QUORUM, RF=3
[Figure: Coordinator; DC=1 and DC=2; one Direct Read and three Digest Reads]
35. Read Requests & Tunable Consistency (4)
Two DC, CONSISTENCY=LOCAL_QUORUM, RF=3
[Figure: Coordinator; DC=1 and DC=2; one Direct Read and one Digest Read]
36. Write Requests & Tunable Consistency (1)
One DC, CONSISTENCY=ONE, RF=3
Coordinator
37. Write Requests & Tunable Consistency (2)
One DC, CONSISTENCY=QUORUM, RF=3
[Figure: Coordinator; a DELETE that does not reach every replica (possible zombie data); Hinted Handoff]
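Hinted handoff can be sketched as a coordinator that remembers writes for a replica that is down and replays them when the replica returns. The class `Coordinator`, the replica names and the in-memory hint list are hypothetical; hint expiry and the interaction with tombstones are not modeled.

```python
class Coordinator:
    def __init__(self, replicas: dict):
        self.replicas = replicas   # replica name -> dict acting as its storage
        self.down = set()          # replicas currently unreachable
        self.hints = []            # (replica name, key, value) to replay later

    def write(self, key, value, required_acks: int) -> bool:
        acks = 0
        for name, store in self.replicas.items():
            if name in self.down:
                # Keep a hint instead of failing the whole write.
                self.hints.append((name, key, value))
            else:
                store[key] = value
                acks += 1
        return acks >= required_acks

    def replay_hints(self, name: str) -> None:
        remaining = []
        for target, key, value in self.hints:
            if target == name and name not in self.down:
                self.replicas[name][key] = value
            else:
                remaining.append((target, key, value))
        self.hints = remaining


coordinator = Coordinator({"r1": {}, "r2": {}, "r3": {}})
coordinator.down.add("r3")
print(coordinator.write("k", "v", required_acks=2))  # QUORUM with RF=3
coordinator.down.discard("r3")
coordinator.replay_hints("r3")
print(coordinator.replicas["r3"])
```

If a missed DELETE is never delivered to a replica, by a hint or by a repair, that replica still holds the old data, which is the zombie risk named in the figure.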
38. Data Consistency – Anti-Entropy Repair
Manual data repair:
– A Merkle tree is built for each replica.
– Merkle trees are compared between all
replicas.
Repair can be performed:
– Sequential.
– Parallel.
– Datacenter parallel.
Source: DSE 6.0 Architecture Guide
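The Merkle tree comparison can be sketched as follows. Each replica hashes its data per token range, combines the hashes pairwise up to a single root, and two replicas only need to exchange data when their roots differ. The function names and the flat list of per-range values are assumptions; for brevity the sketch compares leaf hashes directly instead of descending the tree.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(ranges: list) -> bytes:
    """Root hash over a list of per-range values (strings)."""
    level = [_h(value.encode("utf-8")) for value in ranges]
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def differing_ranges(a: list, b: list) -> list:
    """Indexes of ranges whose hashes differ between two replicas."""
    return [
        i for i, (x, y) in enumerate(zip(a, b))
        if _h(x.encode("utf-8")) != _h(y.encode("utf-8"))
    ]


replica_1 = ["rows 1-10 v1", "rows 11-20 v1", "rows 21-30 v2", "rows 31-40 v1"]
replica_2 = ["rows 1-10 v1", "rows 11-20 v1", "rows 21-30 v1", "rows 31-40 v1"]

if merkle_root(replica_1) != merkle_root(replica_2):
    print("ranges to stream:", differing_ranges(replica_1, replica_2))
```

Comparing hashes instead of rows keeps the network cost of a repair proportional to the amount of divergent data rather than to the total data size.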
40. Summary
Cassandra is a very powerful distributed and decentralized NoSQL database with no
single point of failure.
It guarantees service and data availability in case of a partitioned network, though
the data might be stale.
Designed for large data stores that require a performant and scalable system.
The application data model needs to be designed for Cassandra.
Many ways to interact with the database:
– CQLSH (Cassandra Query Language Shell).
– Drivers and tools provided by DataStax.
DataStax offers support for enterprise customers and good documentation.
41. 15.09.2018 Apache Cassandra Under The Hood41
Robert Bialek
Senior Principal Consultant
Tel. +49 89 99 27 59 38
robert.bialek@trivadis.com