5. What’s wrong with RDBMS?
• Pros
• Relational data modeling is well understood
• SQL is easy to use and ubiquitous
• ACID transactions - data integrity
• Cons
• Scaling is hard
• Sharding and replication have side-effects (performance, reliability, cost)
• Denormalize to get performance gains
For fun: Relational Scaling and the Temple of Gloom
8. Two lies and a truth
• Cassandra is column-oriented
• Cassandra is a schemaless database
• Cassandra is eventually consistent
They’re all lies, in a way
9. Problems Cassandra is Especially Good At
• Large scale storage
• >10s of TB
• Lots of writes
• Time-series data, IoT
• Statistics and analytics
• For example, as a Spark data source
• Geographic distribution
• Multiple data centers
Example use cases: Personalization, Customer 360, Recommendation, Fraud Detection, Inventory Management, Identity Management, Security, Supply Chain
25. Tuneable Consistency and Consistency Levels
• ONE, TWO, THREE
• Useful for speed
• ANY (Write only, use with care)
• ALL
• Number of nodes to respond = RF
• Overly restrictive?
• QUORUM
• (RF / 2) + 1, with RF / 2 rounded down
• Frequently very useful
• LOCAL_ONE, LOCAL_QUORUM
– Similar to above, but nodes must be in local data center
• EACH_QUORUM
– Quorum of nodes must respond in each data center
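The replica counts above can be sketched in a few lines of Python. This is a minimal illustration for a single data center, not driver code: the function names `quorum` and `required_responses` are hypothetical, and ANY and the data-center-aware levels are left out because they depend on hinted handoff and per-data-center replication factors.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for QUORUM: (RF / 2) + 1 using integer division."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def required_responses(level: str, replication_factor: int) -> int:
    """Replicas that must respond for a single-data-center consistency level."""
    fixed = {"ONE": 1, "TWO": 2, "THREE": 3}
    if level in fixed:
        needed = fixed[level]
    elif level == "QUORUM":
        needed = quorum(replication_factor)
    elif level == "ALL":
        needed = replication_factor
    else:
        raise ValueError(f"level not covered by this sketch: {level}")
    # A level that asks for more replicas than exist can never be satisfied.
    if needed > replication_factor:
        raise ValueError(
            f"{level} needs {needed} replicas but RF is {replication_factor}"
        )
    return needed


if __name__ == "__main__":
    for rf in (1, 3, 5, 6):
        print(f"RF={rf}: QUORUM={quorum(rf)}, ALL={required_responses('ALL', rf)}")
```

Note that an even RF does not buy extra fault tolerance at QUORUM: RF = 5 and RF = 6 both tolerate two unavailable replicas.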
26. Strong Consistency vs. Eventual Consistency
• Eventual consistency
• e.g. W ONE, R QUORUM
• Use cases
• Write heavy
• Data not read immediately
• Strong Consistency
• e.g. W QUORUM, R QUORUM
• Use cases
• Read after write
• Data loading with validation
• Strong Consistency Formula
• R + W > RF = strong consistency
• R: the read replica count required by consistency level
• W: the write replica count required by consistency level
• RF: replication factor
• Example: W QUORUM, R QUORUM, RF = 3
• 2 + 2 > 3
• Implication: all client reads will see the most recent write
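The formula is easy to check by hand, but a small sketch makes the combinations from this slide concrete. The helper names below are hypothetical and the code only does the arithmetic R + W > RF; it does not talk to a cluster.

```python
def quorum(replication_factor: int) -> int:
    """Replicas needed for QUORUM: (RF / 2) + 1 using integer division."""
    return replication_factor // 2 + 1


def is_strongly_consistent(read_replicas: int, write_replicas: int,
                           replication_factor: int) -> bool:
    """True when every read set must overlap every write set: R + W > RF."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    combos = {
        "W ONE, R QUORUM": (quorum(rf), 1),
        "W QUORUM, R QUORUM": (quorum(rf), quorum(rf)),
        "W ALL, R ONE": (1, rf),
    }
    for name, (r, w) in combos.items():
        verdict = "strong" if is_strongly_consistent(r, w, rf) else "eventual"
        print(f"{name}: R={r}, W={w}, RF={rf} -> {verdict}")
```

With RF = 3, W ONE and R QUORUM give 2 + 1 = 3, which is not greater than 3, so that pairing stays eventually consistent, while W QUORUM and R QUORUM give 2 + 2 = 4.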
32. Consistency Level and Deployment
Active-Active, Multi-Region
– EACH_QUORUM may limit availability if connection to one region/data center is down
– LOCAL_QUORUM waits only for the local data center, relying on Cassandra to complete writes to remote data centers in the background
– QUORUM is a reasonable middle-ground approach
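The trade-off between these three levels can be sketched with a simplified availability model. The assumptions are loud ones: a "down" data center means none of its replicas are reachable, every other replica is healthy, and QUORUM is computed over the sum of the per-data-center replication factors. The function `can_satisfy` and the data center names are hypothetical.

```python
def quorum(replica_count: int) -> int:
    """Majority of a replica set: (n / 2) + 1 using integer division."""
    return replica_count // 2 + 1


def can_satisfy(level: str, rf_by_dc: dict, local_dc: str,
                down_dcs: frozenset = frozenset()) -> bool:
    """Can the level be met when every replica in down_dcs is unreachable?"""
    # Assumption: replicas in a down data center contribute nothing,
    # replicas elsewhere are all reachable.
    live = {dc: (0 if dc in down_dcs else rf) for dc, rf in rf_by_dc.items()}
    if level == "LOCAL_QUORUM":
        return live[local_dc] >= quorum(rf_by_dc[local_dc])
    if level == "EACH_QUORUM":
        return all(live[dc] >= quorum(rf) for dc, rf in rf_by_dc.items())
    if level == "QUORUM":
        return sum(live.values()) >= quorum(sum(rf_by_dc.values()))
    raise ValueError(f"level not covered by this sketch: {level}")


if __name__ == "__main__":
    three_regions = {"us-east": 3, "us-west": 3, "eu-west": 3}
    outage = frozenset({"eu-west"})
    for level in ("LOCAL_QUORUM", "EACH_QUORUM", "QUORUM"):
        ok = can_satisfy(level, three_regions, "us-east", outage)
        print(f"{level}: {'available' if ok else 'unavailable'}")
```

Under this model, with three regions at RF = 3 each and one region unreachable, LOCAL_QUORUM and QUORUM can still be met from a surviving region while EACH_QUORUM cannot. With only two equally sized regions, QUORUM needs 4 of 6 replicas, so it would also be unavailable during a full-region outage.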
33. Consistency Level and Deployment
Separate data center for analytics
– Writes in “online DC” – LOCAL_QUORUM
• Writes to analytics DC in background
• Analytics DC availability/performance decoupled from online DC
• EACH_QUORUM and QUORUM are overkill unless “real-time” analytics is required
– Reads in “analytics DC” – LOCAL_QUORUM
• Or even LOCAL_ONE
34. Cassandra is most powerful when combined with complementary technologies to form a data platform
35. Spark + Cassandra
• Access Cassandra from Spark via DataStax connector
• Co-locate Spark and Cassandra
DataStax is a registered trademark of DataStax, Inc. and its subsidiaries in the United States and/or other countries.
Diagram (stack, top to bottom):
Spark SQL, Spark Streaming, MLlib, GraphX, SparkR
Spark Core Engine
DataStax Spark-Cassandra Connector
Cassandra
36. DSE Analytics
Diagram labels: Application, Real Time Operations, Cassandra, Real Time Replication, Analytics, Analytics Queries, Your Analytics, Single DSE Cluster
Streaming, ad-hoc, and batch
• High-performance
• Workload management
• SQL reporting
Compared to self-managed Spark cluster:
• No ETL
• True HA without Zookeeper
37. DSE Core - Certified Apache Cassandra
• The best distribution of Apache Cassandra™
• Production certified Cassandra
• Performance improvements
• Advanced Security
• Multi-tenancy through row-level access control
• Advanced Replication
• Great for retail and IoT use cases