Data modeling for Cassandra presents a new set of challenges, especially for developers with a background in relational data modeling. Modeling for analytic applications, which need to support statistical functions over the data, adds further complexity. But a good data model that exploits Cassandra's strengths can make all the difference to a successful project. This tutorial examines a number of real-world customer data modeling examples and draws out hints and tips that will benefit not just the Cassandra newbie but also the more experienced data modeler.
3. #Cassandra13
What folk use C* for
Real-time analytics
•e.g. Click stream, telemetry, logs
•100x more writes than reads
•Almost all reads are to results
•Almost no writes are ‘updates’
•Really not going to fit in RAM
Session storage
•e.g. User profiles
•Create, Read, Update, Delete
•Probably mostly reads
•Probably wants atomicity
•Probably fits in RAM
[Background image: access-log lines such as "02:44:02 241.24.41 0.0.1 GET /index.html"]
4. #Cassandra13
What folk use C* for
Real-time analytics
Session storage
[Figure: each use case is labelled S, WP, HA, ACID]
6. #Cassandra13
Example use case
{
time: 13:50:11,
latitude: 12.5,
longitude: -43.4,
duration: 24,
device_type: ..
}
Call detail records
tens of thousands/sec
Real-time dashboards
7. #Cassandra13
C* Data Modeling 101
• Denormalise: writes (and disk) are cheap, reads are expensive: insert data in every arrangement that you need to read it
• Items you’ll access together, and want sorted: put in the same row
• Sets of items you’re likely to access separately: keep in separate rows
• Atomic counters are the building block of Cassandra real-time analytics apps
[Figure: one event → update → row1, row2, row3; one query → read]
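As a rough illustration of "one event updates several rows, one query reads one row", the sketch below uses a plain Python dictionary in place of a Cassandra counter column family. The `CounterStore` class, the row-key names and the `device_type` values are assumptions made for illustration; none of them are Cassandra or Acunu APIs.

```python
from collections import defaultdict


class CounterStore:
    """In-memory stand-in for a counter column family: row key -> column -> count."""

    def __init__(self):
        self.rows = defaultdict(lambda: defaultdict(int))

    def increment(self, row_key, column, amount=1):
        self.rows[row_key][column] += amount

    def read_row(self, row_key):
        # Return the columns sorted by name, mimicking sorted columns within a row.
        return dict(sorted(self.rows[row_key].items()))


def record_call(store, event):
    """Denormalise: one event increments a counter in every row we want to read later."""
    hour = event["time"][:2]
    store.increment("by_device_type", event["device_type"])
    store.increment("by_hour", hour)
    store.increment("by_device_type_and_hour:" + event["device_type"], hour)


store = CounterStore()
# Hypothetical call detail records, shaped like the example event.
record_call(store, {"time": "13:50:11", "device_type": "mobile", "duration": 24})
record_call(store, {"time": "13:58:40", "device_type": "desktop", "duration": 5})
record_call(store, {"time": "14:02:03", "device_type": "mobile", "duration": 61})

# Each dashboard query is a single row read.
calls_by_hour = store.read_row("by_hour")
calls_by_device = store.read_row("by_device_type")
```

Each write costs three increments, but every read the dashboard needs is answered from one pre-arranged row.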
8. #Cassandra13
#1: Hierarchies
Counting occurrences by day, hour, min, sec
One row for each value at each level in the hierarchy
Columns encode sub-components for each level
<day> ... :12→2930 :13→3520 :14→3034
13:00 ... :01→45 :02→62 :03→87
13:01 ... :10→3 :11→4 :12→2
13:02
14:00
......
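The hierarchy of counter rows can be sketched in a few lines. This is a minimal in-memory model, not Cassandra code: the `day/`, `hour/` and `minute/` row-key prefixes and the date 2013-06-11 are assumptions added here to keep row keys unambiguous, and the timestamps are invented sample data.

```python
from collections import defaultdict
from datetime import datetime

# row key -> column name -> count (in-memory stand-in for counter rows)
counters = defaultdict(lambda: defaultdict(int))


def count_event(ts):
    """Increment one counter per level: the day row, the hour row and the minute row."""
    day = ts.strftime("%Y-%m-%d")
    # Day row: one column per hour.
    counters["day/" + day][ts.strftime(":%H")] += 1
    # Hour row: one column per minute.
    counters["hour/" + day + " " + ts.strftime("%H")][ts.strftime(":%M")] += 1
    # Minute row: one column per second.
    counters["minute/" + day + " " + ts.strftime("%H:%M")][ts.strftime(":%S")] += 1


for stamp in ["13:01:10", "13:01:10", "13:01:12", "13:02:05", "14:00:00"]:
    count_event(datetime.strptime("2013-06-11 " + stamp, "%Y-%m-%d %H:%M:%S"))

per_hour = dict(counters["day/2013-06-11"])
per_minute = dict(counters["hour/2013-06-11 13"])
per_second = dict(counters["minute/2013-06-11 13:01"])
```

A query at any granularity reads exactly one row: the day row for an hourly chart, an hour row for a per-minute chart, and so on.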
12. #Cassandra13
#4: Drilldown
Going from counts to the constituent events
Use an identifier in the column key and store the event in a different ColumnFamily
<day> ... :12, e1→- :12, e2→- :13, e3→-
13:00 ... :01, e3→- :01, e4→- :02, e5→-
14:00
......
{
_id: e3,
time: 13:01:11,
device_type : xx,
}
e3 time → 13:01:11 device_type → xx ...
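A minimal sketch of the drilldown layout, again using in-memory dictionaries rather than Cassandra: index rows hold (time component, event id) column keys with empty values, and the events themselves live in a separate store keyed by id. Event e3 follows the example above; the times for e1, e4 and e5 and the function names are assumptions for illustration.

```python
from collections import defaultdict

# Index rows: row key -> set of (time component, event id) column keys.
# Column values are empty in this layout, so only the keys are kept.
index_rows = defaultdict(set)
# Separate store ("ColumnFamily") holding the full events, keyed by event id.
events = {}


def store_event(event):
    """Write the event once, and add its id to the day row and the hour row."""
    hh, mm, _ss = event["time"].split(":")
    events[event["_id"]] = event
    index_rows["<day>"].add((":" + hh, event["_id"]))
    index_rows[hh + ":00"].add((":" + mm, event["_id"]))


def drill_down(row_key, component):
    """Return the events behind one column prefix, e.g. minute :01 of row 13:00."""
    ids = sorted(eid for comp, eid in index_rows[row_key] if comp == component)
    return [events[eid] for eid in ids]


store_event({"_id": "e1", "time": "12:15:00", "device_type": "xx"})  # hypothetical time
store_event({"_id": "e3", "time": "13:01:11", "device_type": "xx"})
store_event({"_id": "e4", "time": "13:01:40", "device_type": "xx"})  # hypothetical time
store_event({"_id": "e5", "time": "13:02:07", "device_type": "xx"})  # hypothetical time

minute_01 = [e["_id"] for e in drill_down("13:00", ":01")]
hour_12 = [e["_id"] for e in drill_down("<day>", ":12")]
```

The count for a bucket is the number of matching column keys, and the ids in those keys lead straight to the constituent events.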
15. #Cassandra13
[Figure: architecture diagram with labels: API, event stream, event store, roll-up cubes, Ingest, Processing, dashboard, queries, programmatic interface]
Cassandra stores raw events, aggregates, data model definition
Acunu Analytics maps events and SQL-like queries into C* ops
Processing at ingest
JSON, CSV, log ingest via RESTful HTTP API, Flume, Storm, AMQP
[Figure labels: Storm, MQ, HTTP]
Acunu Dashboards provides rich, real-time,
embeddable visualizations
AQL
SELECT AVG(r)
FROM metrics
GROUP BY host;
Alerting
Cubes: millisecond queries
API for rich queries, threshold alerting
Backfill historic results for new cubes to enable agile schema changes
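To make the roll-up idea concrete, here is a hedged sketch of how a cube could answer a query like SELECT AVG(r) FROM metrics GROUP BY host without scanning raw events: a running sum and count per host are updated at ingest. This is not Acunu Analytics code; the `ingest`, `avg_r_by_host` and `backfill` names, the host names and the values of `r` are all assumptions for illustration.

```python
from collections import defaultdict

# Roll-up "cube" for AVG(r) GROUP BY host: a running sum and count per host.
cube = defaultdict(lambda: {"sum": 0.0, "count": 0})


def ingest(event):
    """Update the aggregate at ingest time, so queries never scan raw events."""
    cell = cube[event["host"]]
    cell["sum"] += event["r"]
    cell["count"] += 1


def avg_r_by_host():
    """Answer the AVG(r) ... GROUP BY host query from the cube alone."""
    return {host: cell["sum"] / cell["count"] for host, cell in cube.items()}


def backfill(raw_events):
    """Rebuild the cube from stored raw events, e.g. after defining a new cube."""
    cube.clear()
    for event in raw_events:
        ingest(event)


# Hypothetical raw events, as they might be kept in the event store.
raw_events = [
    {"host": "web1", "r": 10},
    {"host": "web1", "r": 20},
    {"host": "web2", "r": 4},
]
backfill(raw_events)
averages = avg_r_by_host()
```

Because the raw events are kept alongside the aggregates, a cube defined later can be populated by replaying history through the same ingest path.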
16. #Cassandra13
Apache, Apache Cassandra, Cassandra, Flume, and the eye logos are trademarks of the Apache Software Foundation.
@timmoreton
@acunu
Thanks!