2. MADRID · NOV 27-28 · 2015@calonso
• Introductions
• Cassandra Core concepts
• CQL
• Data modelling and examples
3. MADRID · NOV 27-28 · 2015@calonso
CREATETABLE users (
id INT PRIMARY KEY,
nameVARCHAR,
surnameVARCHAR,
birthdateTIMESTAMP
);
INSERT INTO users (id, name, surname, birthdate)
VALUES (1,‘Carlos’,‘Alonso’, ’1985-03-19’);
SELECT * FROM users WHERE id = 1;
ALTERTABLE users ADD cityVARCHAR;
UPDATE users SET city = ‘London’WHERE id = 1;
4. MADRID · NOV 27-28 · 2015@calonso
IMPATIENT LEVEL…
Aha!! That’s SQL, I already know all this stuff.
I’ll start using Cassandra as soon as I get home.
5. MADRID · NOV 27-28 · 2015@calonso
IMPATIENT LEVEL…
Why is this taking so long?
Have they gone killing the cow for my BigMac?
6. MADRID · NOV 27-28 · 2015@calonso
IMPATIENT LEVEL…
As soon as I see the loading spinner…
7. MADRID · NOV 27-28 · 2015@calonso
CARLOS ALONSO
•Spanish Londoner
•MSc Salamanca University, Spain
•Software Engineer @MyDrive Solutions
•Cassandra certified developer
•Cassandra MVP 2015
•@calonso / http://mrcalonso.com
8. MADRID · NOV 27-28 · 2015@calonso
MYDRIVE SOLUTIONS
• World leading driver profiling company
• Using technology and data to understand
how to improve driving behaviour
• Recently acquired by the Generali Group
• @_MyDrive / http://mydrivesolutions.com
• We are hiring!!
9. MADRID · NOV 27-28 · 2015@calonso
CASSANDRA
• Open Source distributed database
management system
• Initially developed at Facebook
• Inspired by Amazon’s Dynamo and Google
BigTable papers
• Became Apache top-level project in Feb, 2010
• Nowadays developed by DataStax
11. MADRID · NOV 27-28 · 2015@calonso
THE CAPTHEOREM
12. MADRID · NOV 27-28 · 2015@calonso
CASSANDRA
• Fast Distributed NoSQL Database
• High Availability
• Linear Scalability => Predictability
• No SPOF
• Multi-DC
• Horizontally scalable => $$$
• Not a drop in replacement for RDBMS
13. MADRID · NOV 27-28 · 2015@calonso
CLUSTER COMPONENTS
14. MADRID · NOV 27-28 · 2015@calonso
CLUSTER COMPONENTS
15. MADRID · NOV 27-28 · 2015@calonso
CONSISTENT HASHING
Hash function
“Carlos” 185664
1773456738847666528349
-894763734895827651234
16. MADRID · NOV 27-28 · 2015@calonso
REQUEST COORDINATION
• Coordinator:The node designated to coordinate a particular query.
• ANY node can coordinate ANY request.
• No SPOF: One of the main Cassandra’s principles.
• The driver chooses which node will coordinate
17. MADRID · NOV 27-28 · 2015@calonso
REPLICATION FACTOR
How many copies (replicas) for
your data
18. MADRID · NOV 27-28 · 2015@calonso
WHY REPLICATION?
• Disaster recovery
• Bring data closer to users (to reduce latencies)
• Workload segregation (analytical vs transactional)
19. MADRID · NOV 27-28 · 2015@calonso
REPLICATION
Defined at keyspace level
CREATE KEYSPACE <my_keyspace> WITH REPLICATION =
{ “class”:“SimpleStrategy”,“replication_factor”: 2 };
CREATE KEYSPACE <my_keyspace> WITH REPLICATION =
{ “class”:“NetworkTopologyStrategy”,
“dc-east”: 2,“dc-west”: 3 };
20. MADRID · NOV 27-28 · 2015@calonso
CONSISTENCY LEVEL
How many replicas of your data
must respond ok?
21. MADRID · NOV 27-28 · 2015@calonso
CONSISTENCY LEVEL
Availability /
Partition tolerance
Consistency
22. MADRID · NOV 27-28 · 2015@calonso
DriverClient
Partitioner
f81d4fae-…
834
• RF = 3
• CL = QUORUM
• SELECT * … WHERE id = f81d4fae-…
23.
24. MADRID · NOV 27-28 · 2015@calonso
COMPACTIONS
• Merges most recent partition keys and columns
• Evicts tombstoned columns
• Creates new SSTable
• Rebuilds partition indexes and summaries
• Deletes old SSTables
25. MADRID · NOV 27-28 · 2015@calonso
COMPACTIONS
• SizeTieredCompactionStrategy
• LeveledCompactionStrategy
• DateTieredCompactionStrategy
CREATETABLE user (
id UUID PRIMARY KEY,
emailTEXT,
nameTEXT,
preferences SET<TEXT>,
) WITH COMPACTION =
{ “class”: “<strategy>”, <params> };
28. MADRID · NOV 27-28 · 2015@calonso
CQL
CREATETABLE users (
id UUID,
nameVARCHAR,
surnameVARCHAR,
birthdateTIMESTAMP,
PRIMARY KEY(id)
);
Familiar row-column SQL-like approach.
INSERT INTO users (id, name, surname, birthdate)
VALUES (uuid(),‘Carlos’,‘Alonso’, ’1985-03-19’);
SELECT * FROM users WHERE
id = ‘f81d4fae-7dec-11d0-a765-00a0c91e6bf6’;
ALTERTABLE users ADD addressVARCHAR;
29. MADRID · NOV 27-28 · 2015@calonso
WHERE
• Equality matches:
• SELECT … WHERE album_title = ‘Revolver’;
• IN:
• Only applicable in the last WHERE clause
• SELECT … WHERE album_title = ‘Revolver’ AND number IN (2, 3, 4);
• Range search:
• Only on clustering columns.
• SELECT … WHERE album_title = ‘Revolver’ AND number >= 6 AND number < 2;
30. MADRID · NOV 27-28 · 2015@calonso
COLLECTIONS
• Set: Uniqueness
• List: Order
• Map: Key-Value pairs
• Tuple: Fixed length
email_addresses SET<VARCHAR>
email_addresses LIST<VARCHAR>
email_addresses MAP<VARCHAR,VARCHAR>
email_addresses TUPLE<VARCHAR,VARCHAR>
31. MADRID · NOV 27-28 · 2015@calonso
MORE FEATURES
• COUNTERS
• LWT
• TTL
• UDT
• BATCHES
• BATCH + LWT
• STATIC COLUMNS
33. MADRID · NOV 27-28 · 2015@calonso
PHYSICAL DATA STRUCTURE
34. MADRID · NOV 27-28 · 2015@calonso
DATA MODELLING
• Understand your data
• Decide how you’ll query the data
• Define column families to satisfy those queries
• Implement and optimize
35. MADRID · NOV 27-28 · 2015@calonso
DATA MODELLING
Conceptual
Model
Logical
Model
Physical
Model
Query-Driven
Methodology
Analysis &
Validation
36. MADRID · NOV 27-28 · 2015@calonso
PRIMARY KEY
PARTITION KEY +
CLUSTERING
COLUMN(S)
37. MADRID · NOV 27-28 · 2015@calonso
QUERY DRIVEN METHODOLOGY
• Spread data evenly around the cluster
• Minimise the number of partitions read
• Follow the mapping rules:
• Entities and relationships: map to tables
• Equality search attributes: must be at the beginning of the primary key
• Inequality search attributes: become clustering columns
• Ordering attributes: become clustering columns
• Key attributes: map to primary key columns
38. MADRID · NOV 27-28 · 2015@calonso
ANALYSIS &VALIDATION
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• How much data duplication? (batches)
• Client side joins or new table?
39. MADRID · NOV 27-28 · 2015@calonso
An E-Library project.
40. MADRID · NOV 27-28 · 2015@calonso
EXAMPLE 1
Books can be uniquely identified and accessed by ISBN,
we also need a title, genre, author and publisher.
QDM
Book
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBN
41. MADRID · NOV 27-28 · 2015@calonso
ANALYSIS &VALIDATION
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• 1 X (5 - 1 - 0) + 0 < 1M
• How much data duplication? (batches)
• Client side joins or new table?
Book
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBN
N/A
N/A
42. MADRID · NOV 27-28 · 2015@calonso
Book
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBNCREATETABLE books (
ISBNVARCHAR PRIMARY KEY,
titleVARCHAR,
authorVARCHAR,
genreVARCHAR,
publisherVARCHAR
);
SELECT * FROM books WHERE ISBN = ‘…’;
43. MADRID · NOV 27-28 · 2015@calonso
Users register into the system uniquely identified by an email
and a password. We also want their full name. They will be
accessed by email and password or internal unique ID.
QDM K
Users_by_ID
ID
full_name
Q1
Q1: Find users by ID
Users_by_login_info
email K
password
Q2
Q2: Find users by login info
K
full_name
ID
44. MADRID · NOV 27-28 · 2015@calonso
ANALYSIS &VALIDATION
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• 1 X (2 - 1 - 0) + 0 < 1M
• How much data duplication? (batches)
• Client side joins or new table?
N/A
N/A
K
Users_by_ID
ID
full_name
Q1
Q1: Find users by ID
45. MADRID · NOV 27-28 · 2015@calonso
CREATETABLE users_by_id (
IDTIMEUUID PRIMARY KEY,
full_nameVARCHAR
);
SELECT * FROM users_by_id WHERE ID = …;
K
Users_by_ID
ID
full_name
Q1
Q1: Find users by ID
46. MADRID · NOV 27-28 · 2015@calonso
ANALYSIS &VALIDATION
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• 1 X (4 - 2 - 0) + 0 < 1M
• How much data duplication? (batches)
• Client side joins or new table? N/A
Users_by_login_info
email K
password
Q2
Q2: Find users by login info
K
full_name
ID
47. MADRID · NOV 27-28 · 2015@calonso
CREATETABLE users_by_login_info (
emailVARCHAR,
passwordVARCHAR,
full_nameVARCHAR,
IDTIMEUUID,
PRIMARY_KEY ((email, password))
);
SELECT * FROM users_by_login_info
WHERE email = ‘…’AND password = ‘…’;
Users_by_login_info
email K
password
Q2
Q2: Find users by login info
K
full_name
ID
48. MADRID · NOV 27-28 · 2015@calonso
BEGIN BATCH
INSERT INTO users_by_id (ID, full_name)VALUES (…) IF NOT EXISTS;
INSERT INTO users_by_login_info (email, password, full_name, ID)VALUES (…);
APPLY BATCH;
49. MADRID · NOV 27-28 · 2015@calonso
Users read books.
We want to know which books has a user read.
QDM
K
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
Q1: Find all books a logged
user has read
50. MADRID · NOV 27-28 · 2015@calonso
ANALYSIS &VALIDATION
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• Books X (7 - 1 - 0) + 0 < 1M => 166.666 books per user
• How much data duplication? (batches)
• Client side joins or new table?
Q1: Find all books a logged
user has read
K
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
New table
N/A
51. MADRID · NOV 27-28 · 2015@calonso
CREATETABLE books_read_by_user (
user_IDTIMEUUID,
titleVARCHAR,
authorVARCHAR,
full_nameVARCHAR,
ISBNVARCHAR,
genreVARCHAR,
publisherVARCHAR
PRIMARY_KEY (user_id, title, author)
);
SELECT * FROM books_read_by_user
WHERE user_ID = …;
Q1: Find all books a logged
user has read
K
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
52. MADRID · NOV 27-28 · 2015@calonso
In order to improve our site’s usability we need to
understand how our users use it by tracking every interaction
they have with our site.
QDM
user_ID
2_weeks
time
element
type
K
Actions_by_user
Q1
Q1: Find all actions a user
does in a time range
K
C
53. MADRID · NOV 27-28 · 2015@calonso
ANALYSIS &VALIDATION
• 1 Partition per read?
• Are write conflicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• Actions X (5 - 2 - 0) + 0 < 1M => 333.333 per user every 2 weeks
• How much data duplication? (batches)
• Client side joins or new table?
N/A
K
Actions_by_user
user_ID
2_weeks
Q1
Q1: Find all actions a user
does in a time range
K
time
element
type
C
N/A
54. MADRID · NOV 27-28 · 2015@calonso
CREATETABLE actions_by_user (
user_IDTIMEUUID,
2_weeks INT,
timeTIMESTAMP,
elementVARCHAR,
typeVARCHAR
PRIMARY_KEY ((user_ID, 2_weeks), time)
);
SELECT * FROM actions_by_user
WHERE user_ID = … AND 2_weeks = … AND
time < … AND time > …;
K
Actions_by_user
user_ID
2_weeks
Q1
Q1: Find all actions a user
does in a time range
K
time
element
type
C