Cassandra for impatients

Cassandra for impatients
Carlos Alonso (@calonso)MADRID · NOV 27-28 · 2015

MADRID · NOV 27-28 · 2015@calonso
• Introductions
• Cassandra Core concepts
• CQL
• Data modelling and examples

CREATETABLE users (
id INT PRIMARY KEY,
nameVARCHAR,
surnameVARCHAR,
birthdateTIMESTAMP
);
INSERT INTO users (id, name, surname, birthdate)
VALUES (1,‘Carlos’,‘Alonso’, ’1985-03-19’);
SELECT * FROM users WHERE id = 1;
ALTERTABLE users ADD cityVARCHAR;
UPDATE users SET city = ‘London’WHERE id = 1;

IMPATIENT LEVEL…
Aha!! That’s SQL, I already know all this stuff.
I’ll start using Cassandra as soon as I get home.

IMPATIENT LEVEL…
Why is this taking so long?
Have they gone killing the cow for my BigMac?

IMPATIENT LEVEL…
As soon as I see the loading spinner…

CARLOS ALONSO
•Spanish Londoner
•MSc Salamanca University, Spain
•Software Engineer @MyDrive Solutions
•Cassandra certiﬁed developer
•Cassandra MVP 2015
•@calonso / http://mrcalonso.com

MYDRIVE SOLUTIONS
• World leading driver proﬁling company
• Using technology and data to understand
how to improve driving behaviour
• Recently acquired by the Generali Group
• @_MyDrive / http://mydrivesolutions.com
• We are hiring!!

CASSANDRA
• Open Source distributed database
management system
• Initially developed at Facebook
• Inspired by Amazon’s Dynamo and Google
BigTable papers
• Became Apache top-level project in Feb, 2010
• Nowadays developed by DataStax

Cassandra Core Concepts
Technical introduction to Apache CassandraMADRID · NOV 27-28 · 2015

THE CAPTHEOREM

CASSANDRA
• Fast Distributed NoSQL Database
• High Availability
• Linear Scalability => Predictability
• No SPOF
• Multi-DC
• Horizontally scalable => $$$
• Not a drop in replacement for RDBMS

CLUSTER COMPONENTS

CONSISTENT HASHING
Hash function
“Carlos” 185664
1773456738847666528349
-894763734895827651234

REQUEST COORDINATION
• Coordinator:The node designated to coordinate a particular query.
• ANY node can coordinate ANY request.
• No SPOF: One of the main Cassandra’s principles.
• The driver chooses which node will coordinate

REPLICATION FACTOR
How many copies (replicas) for
your data

WHY REPLICATION?
• Disaster recovery
• Bring data closer to users (to reduce latencies)
• Workload segregation (analytical vs transactional)

REPLICATION
Deﬁned at keyspace level
CREATE KEYSPACE <my_keyspace> WITH REPLICATION =
{ “class”:“SimpleStrategy”,“replication_factor”: 2 };
CREATE KEYSPACE <my_keyspace> WITH REPLICATION =
{ “class”:“NetworkTopologyStrategy”,
“dc-east”: 2,“dc-west”: 3 };

CONSISTENCY LEVEL
How many replicas of your data
must respond ok?

CONSISTENCY LEVEL
Availability / 
Partition tolerance
Consistency

DriverClient
Partitioner
f81d4fae-…
834
• RF = 3
• CL = QUORUM
• SELECT * … WHERE id = f81d4fae-…

COMPACTIONS
• Merges most recent partition keys and columns
• Evicts tombstoned columns
• Creates new SSTable
• Rebuilds partition indexes and summaries
• Deletes old SSTables

COMPACTIONS
• SizeTieredCompactionStrategy
• LeveledCompactionStrategy
• DateTieredCompactionStrategy
CREATETABLE user (
id UUID PRIMARY KEY,
emailTEXT,
nameTEXT,
preferences SET<TEXT>,
) WITH COMPACTION =
{ “class”: “<strategy>”, <params> };

CQL
Cassandra Query LanguageMADRID · NOV 27-28 · 2015

CQL
CREATETABLE users (
id UUID,
nameVARCHAR,
surnameVARCHAR,
birthdateTIMESTAMP,
PRIMARY KEY(id)
);
Familiar row-column SQL-like approach.
INSERT INTO users (id, name, surname, birthdate)
VALUES (uuid(),‘Carlos’,‘Alonso’, ’1985-03-19’);
SELECT * FROM users WHERE
id = ‘f81d4fae-7dec-11d0-a765-00a0c91e6bf6’;
ALTERTABLE users ADD addressVARCHAR;

WHERE
• Equality matches:
• SELECT … WHERE album_title = ‘Revolver’;
• IN:
• Only applicable in the last WHERE clause
• SELECT … WHERE album_title = ‘Revolver’ AND number IN (2, 3, 4);
• Range search:
• Only on clustering columns.
• SELECT … WHERE album_title = ‘Revolver’ AND number >= 6 AND number < 2;

COLLECTIONS
• Set: Uniqueness
• List: Order
• Map: Key-Value pairs
• Tuple: Fixed length
email_addresses SET<VARCHAR>
email_addresses LIST<VARCHAR>
email_addresses MAP<VARCHAR,VARCHAR>
email_addresses TUPLE<VARCHAR,VARCHAR>

MORE FEATURES
• COUNTERS
• LWT
• TTL
• UDT
• BATCHES
• BATCH + LWT
• STATIC COLUMNS

Data Modelling
Basic concepts for a Cassandra modelMADRID · NOV 27-28 · 2015

PHYSICAL DATA STRUCTURE

DATA MODELLING
• Understand your data
• Decide how you’ll query the data
• Deﬁne column families to satisfy those queries
• Implement and optimize

DATA MODELLING
Conceptual
Model
Logical
Model
Physical
Model
Query-Driven
Methodology
Analysis &
Validation

PRIMARY KEY
PARTITION KEY +
CLUSTERING
COLUMN(S)

QUERY DRIVEN METHODOLOGY
• Spread data evenly around the cluster
• Minimise the number of partitions read
• Follow the mapping rules:
• Entities and relationships: map to tables
• Equality search attributes: must be at the beginning of the primary key
• Inequality search attributes: become clustering columns
• Ordering attributes: become clustering columns
• Key attributes: map to primary key columns

ANALYSIS &VALIDATION
• 1 Partition per read?
• Are write conﬂicts (overwrites) possible?
• How large are partitions?
• Ncells = Nrow X ( Ncols – Npk – Nstatic ) + Nstatic < 1M
• How much data duplication? (batches)
• Client side joins or new table?

An E-Library project.

EXAMPLE 1
Books can be uniquely identiﬁed and accessed by ISBN,
we also need a title, genre, author and publisher.
QDM
Book
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBN

• 1 X (5 - 1 - 0) + 0 < 1M
Book
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBN
N/A
N/A

Book
ISBN K
Title
Author
Genre
Publisher
Q1
Q1: Find books by ISBNCREATETABLE books (
ISBNVARCHAR PRIMARY KEY,
titleVARCHAR,
authorVARCHAR,
genreVARCHAR,
publisherVARCHAR
);
SELECT * FROM books WHERE ISBN = ‘…’;

Users register into the system uniquely identiﬁed by an email
and a password. We also want their full name. They will be
accessed by email and password or internal unique ID.
QDM K
Users_by_ID
ID
full_name
Q1
Q1: Find users by ID
Users_by_login_info
email K
password
Q2
Q2: Find users by login info
K
full_name
ID

• 1 X (2 - 1 - 0) + 0 < 1M
N/A
N/A
K
Users_by_ID
ID
full_name
Q1

CREATETABLE users_by_id (
IDTIMEUUID PRIMARY KEY,
full_nameVARCHAR
);
SELECT * FROM users_by_id WHERE ID = …;
K
Users_by_ID
ID
full_name
Q1

• 1 X (4 - 2 - 0) + 0 < 1M
• Client side joins or new table? N/A
Users_by_login_info
email K
password
Q2
K
full_name
ID

CREATETABLE users_by_login_info (
emailVARCHAR,
passwordVARCHAR,
full_nameVARCHAR,
IDTIMEUUID,
PRIMARY_KEY ((email, password))
);
SELECT * FROM users_by_login_info
WHERE email = ‘…’AND password = ‘…’;
Users_by_login_info
email K
password
Q2
K
full_name
ID

BEGIN BATCH
INSERT INTO users_by_id (ID, full_name)VALUES (…) IF NOT EXISTS;
INSERT INTO users_by_login_info (email, password, full_name, ID)VALUES (…);
APPLY BATCH;

Users read books.
We want to know which books has a user read.
QDM
K
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
Q1: Find all books a logged
user has read

• Books X (7 - 1 - 0) + 0 < 1M => 166.666 books per user
user has read
K
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C
New table
N/A

CREATETABLE books_read_by_user (
user_IDTIMEUUID,
titleVARCHAR,
authorVARCHAR,
full_nameVARCHAR,
ISBNVARCHAR,
genreVARCHAR,
publisherVARCHAR
PRIMARY_KEY (user_id, title, author)
);
SELECT * FROM books_read_by_user
WHERE user_ID = …;
user has read
K
Books_read_by_user
user_ID
title
Q1
full_name
ISBN
genre
publisher
author
C
C

In order to improve our site’s usability we need to
understand how our users use it by tracking every interaction
they have with our site.
QDM
user_ID
2_weeks
time
element
type
K
Actions_by_user
Q1
Q1: Find all actions a user
does in a time range
K
C

• Actions X (5 - 2 - 0) + 0 < 1M => 333.333 per user every 2 weeks
N/A
K
Actions_by_user
user_ID
2_weeks
Q1
K
time
element
type
C
N/A

CREATETABLE actions_by_user (
user_IDTIMEUUID,
2_weeks INT,
timeTIMESTAMP,
elementVARCHAR,
typeVARCHAR
PRIMARY_KEY ((user_ID, 2_weeks), time)
);
SELECT * FROM actions_by_user
WHERE user_ID = … AND 2_weeks = … AND
time < … AND time > …;
K
Actions_by_user
user_ID
2_weeks
Q1
K
time
element
type
C

THANKS!

Cassandra for impatients

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (12)

Andere mochten auch

Andere mochten auch (17)

Ähnlich wie Cassandra for impatients

Ähnlich wie Cassandra for impatients (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Cassandra for impatients