Cassandra Data Modelling with CQL (OSCON 2015)

D ATA M O D E L I N G
C A S S A N D R A U S I N G C Q L 3
M I K E B I G L A N & E L I J A H H A M O V I T Z

A B O U T U S
Mike Biglan, M.S.
• Twenty Ideas, Inc – twentyideas.com
• Analytic Spot – spo.td
Elijah Hamovitz
• code.org
• Analytic Spot – spo.td

S I M P L I F I E D R E P R E S E N TAT I O N O F R E A L I T Y

A D ATA M O D E L M O D E L S D ATA
• Framework to store and organize data
• Models things, their differences, and
relationships between them
• Things can be real or virtual

I D E A L D ATA M O D E L P R O P E R T I E S
• Easy to create
• Easy to interface with
• Quick, flexible querying
• Writing is direct and simple
• Easily understandable
• Scalable: can read, write, and store huge amount of data safely

S C A L E A WAY, S C A L E A WAY, S C A L E A WAY
• Availability: fault tolerance, redundancy, supports
multiple data centers
• Consistency: strong or tunable
• Huge amounts of data (that won’t fit on single server)
• High-speed of incoming and/or accessed data

C O L U M N A R D ATA B A S E
• Document-Store (e.g. MongoDB) and Columnar (e.g. Cassandra, HBase,
Dynamo) are both “NoSQL”
• But modeling in Document-Store is quite different than Columnar
• Atomic unit of data storage
• Document-Store: document
• Relational Database: row
• Columnar: column
C A S S A N D R A C O L U M N :
- N A M E
- V A L U E ( O R T O M B S T O N E )
- T I M E S TA M P
- T T L

C A S S A N D R A
Highly Available Distributed Columnar Datastore that’s:
• Near-linearly scalable
• Fault tolerant, no master
• Tunable consistency
• Performant, especially for writes – don’t read before write

G E T T I N G O N T H E S C A L E
Phase 1 Install Cassandra
Phase 2
Phase 3 Scale!

S Q L / C Q L S TAT E M E N T F L O W
• SQL execution is complex
• CQL execution is relatively
simple, hence tiny subset
of syntax
• Much of CQL Query
complexity is which
node(s) to fetch/write/
confirm data from/to
• So denormalize!
Relational Database Cassandra
SQL Statement CQL Statement
Syntax/Semantic
Check
Query Plan &
Optimization
Result
Data Store Data Store
Syntax/Semantic
Check
Query Execution Query Execution
Result

C Q L ! = S Q L
SELECT <col1>, <col2>, …
FROM <table>
LEFT JOIN <table2>…
WHERE <where-clause>
GROUP BY <colx> HAVING …
ORDER BY <order-clause>
S E V E R E LY L I M I T E D
S E V E R E LY L I M I T E D
• CQL syntax is small subset of SQL
mechanics.flite.com/blog/2013/11/05/breaking-down-the-cql-where-clause/

S O W H Y T H E L I M I TAT I O N S ?
Thinking of Cassandra as a relational database, it’s
hard to understand:
• what is easy
• what is hard
• what is impossible

“Language serves not only to
express thought but to make
possible thoughts which could not
exist without it.”
— Bertrand Russell

T H E D I S T O R T I O N O F C Q L
• Broken mental model hinders optimal modeling
• CQL falsely implies a relational data model and
the design patterns that go with it
• To model Cassandra well, know the underlying
data structure

D ATA M O D E L I N G I N S Q L
( N O S H A R D I N G )
1. What are the Data?
2. What is the normalized data model?
… months pass …
3. How are the data going to be queried?
4. Optimize any slow areas and/or bottlenecks
• Add indexes, memcached/redis, sphinx/solr/elasticsearch, etc

D ATA M O D E L I N G I N C Q L
1. What are the data?
2. What read-queries are needed?
3. How to denormalize during writes?
• on initial write, or use external tools to make this sane
(Some) “premature” optimization is inherent and unavoidable

D ATA E C O S Y S T E M
To fully (and efficiently) enable everything SQL you are
used to, must rely on the big(ish) data ecosystem:
• ElasticSearch, Solr, Sphinx
• Redis, Memcached
• Spark
• Spark Streaming or Storm (and Kafka)

C O M P L E X I T Y O F I N I T I A L D ATA M O D E L
• Modeling with Relational DB
• Items & their relationships
• Modeling with Cassandra
• Items & their relationships
• How/Where they are stored
(sharding and hot spots)
• What data we want to read
• How (and how often) we write
data into those models
C A N O P T I M I Z E L AT E R

T O M O D E L , O P E N T H E B L A C K B O X
• Goal of a good black box is you can do a lot
without knowing much about what’s inside
• CQL DOES NOT allow you to ignore what’s
inside Cassandra

I N T H E B E G I N N I N G : T H R I F T
• Around Cassandra version 0.8, Thrift started getting
replaced with CQL
• Thrift too low-level, but the interface had a close
mapping to the underlying Cassandra data structure

T H R I F T & C Q L T E R M I N O L O G Y
T H R I F T C Q L
C O L U M N FA M I LY TA B L E
R O W PA R T I T I O N
C O L U M N C E L L
[ C E L L N A M E C O M P O N E N T O R VA L U E ] C O L U M N
[ G R O U P O F C E L L S W I T H S H A R E D
C O M P O N E N T P R E F I X E S ]
R O W
www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows

C Q L TA B L E
CREATE TABLE
employees (
company text,
name text,
age int,
role text,
PRIMARY KEY
((company), name)
);

V S H O W I T ’ S S T O R E D
employees = {
"Foo, inc" : {
"Fred:age" : 31,
"Fred:role" : "coder",
"Sara:age" : 39,
"Sara:role" : "boss"
},
"BarCo" : {
"Bill:age" : 50,
"Bill:role" : "SQL guru"
"Jane:age" : 20,
"Jane:role" : "hotshot",
}
}
CREATE TABLE
employees (
company text,
name text,
age int,
role text,
PRIMARY KEY
((company), name)
);

W I D E PA R T I T I O N S
( F O R M E R LY W I D E “ R O W S ” )
www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 51)
www.datastax.com/dev/blog/does-cql-support-dynamic-columns-wide-rows
• Based on Thrift “rows”, so
actually wide “partitions”
• Columns are the clustering
key values with the column-
name suffix
• Up to 2 billion (but don’t do this)

S E T S , M A P S , A N D L I S T S : O H M Y
• Sets/Maps/List still
column-level storage
• Enabling Schemaless,
but can result in long
column names
www.slideshare.net/DataStax/understanding-how-cql3-maps-to-cassandras-internal-data-structure (p 52-71)

PA R T I T I O N A N D C L U S T E R I N G K E Y
CQL Where-Clause variants
• None (kind of)
• key1 & key2
• key1 & key2 & key3
• key1 & key2 & key3 & key4

company | year | month | day | employee | reason
-------------------------------------------------
Foo, Inc | 2014 | 11 | 27 | Fred | Thanksgiving
Foo, Inc | 2014 | 12 | 25 | Fred | Christmas
Foo, Inc | 2014 | 12 | 25 | Sara | Christmas
Foo, Inc | 2014 | 12 | 26 | Sara | Boxing day
CREATE TABLE breaks (
company text,
year int,
month int,
day int,
employee text,
reason text,
PRIMARY KEY
((company, year),
month, day, employee)
);
C O M P O S I T E K E Y S

C O M P O S I T E K E Y S
CREATE TABLE breaks (
company text,
year int,
month int,
day int,
employee text,
reason text,
PRIMARY KEY
((company, year),
month, day, employee)
);
breaks = {
"Foo, Inc:2014" : {
"11" : {
"27" : {
"Fred:reason" : "Thanksgiving"
}
},
"12" : {
"25" : {
"Fred:reason" : "Christmas",
"Sara:reason" : "Christmas"
},
"26" : {
"Sara:reason" : "Boxing day"
}
}
}
}

T H E N W H AT I S E A S Y ?
With a dictionary of ordered
dictionaries:
• Grabbing the data (or subset)
from a partition-key
• Getting a slice of data (uses linear
search) based on a partition-key
breaks = {
"Foo, Inc:2014" : {
"11" : {
"27" : {
"Fred:reason" : "Thanksgiving"
}
},
"12" : {
"25" : {
"Fred:reason" : "Christmas",
"Sara:reason" : "Christmas"
},
"26" : {
"Sara:reason" : "Boxing day"
}
}
}
}

W H AT I S H A R D ?
( I . E . C O M M O N S Q L PAT T E R N S )
• Unique and Group by
• Ordered
• Inverted Index

G R O U P - B Y & C O U N T E R S
• Often group-by is used for
counting
• Use counter columns or other
tools (e.g. elasticsearch)
CREATE TABLE
employee_break_counts (
company text,
employee text,
break_counts counter,
PRIMARY KEY
((company), employee)
);

O R D E R I N G O R I N V E R T E D - I N D E X
• Redundant table, but ordered by new
“column”
• Depending on needs, this can store
the order-field and lookup key OR
some/all of the other data in that table
• If a read will generate more than a
few subsequent child reads then
some/all the other data should be
included
CREATE TABLE
employees_by_age (
company text,
id int,
age int,
name text,
role text,
PRIMARY KEY
((company), age, id)
);

C * M O D E L I N G A N T I - PAT T E R N S
C * G U I D E L I N E S S Q L G U I D E L I N E S
W R I T E S A R E C H E A P / FA S T M I N I M I Z E W R I T E S
S T O R A G E I S C H E A P M I N I M I Z E D U P L I C AT I O N O F D ATA
PA R T I T I O N S A R E I N H E R E N T S H A R D AT Y O U R O W N R I S K
S T R I C T C O M P O S I T E K E Y S F L E X I B L E S E C O N D A RY I N D E X E S
S I M P L E Q U E R I E S C O M P L E X Q U E R I E S

C * M O D E L I N G PAT T E R N S
C * G U I D E L I N E S C * PAT T E R N S
W R I T E S A R E C H E A P / FA S T
D U P L I C AT E Y O U R D ATA
S T O R A G E I S C H E A P
PA R T I T I O N S A R E I N H E R E N T AV O I D H O T S P O T S
S T R I C T C O M P O S I T E K E Y S
D E S I G N TA B L E S A R O U N D Q U E R I E S
S I M P L E Q U E R I E S

Questions?
Mike Biglan
mike@twentyideas.com
@twentyideas
Elijah Hamovitz
elijah@code.org

Cassandra Data Modelling with CQL (OSCON 2015)

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Ähnlich wie Cassandra Data Modelling with CQL (OSCON 2015)

Ähnlich wie Cassandra Data Modelling with CQL (OSCON 2015) (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Cassandra Data Modelling with CQL (OSCON 2015)