Cassandra is a distributed, massively scalable, fault-tolerant, column-family data store. If you need fast writes, the only thing faster than Cassandra is /dev/null! In this fast-paced presentation, we'll briefly describe big data and the area of big data that Cassandra is designed to fill. We will cover Cassandra's unique, every-node-the-same architecture. We will reveal Cassandra's internal data structure and explain just why Cassandra is so darned fast. Finally, we'll wrap up with a discussion of data modeling using the new standard protocol: CQL (Cassandra Query Language).
4. What is Big Data?
• The three V’s (and a C)
o Velocity
o Volume
o Variety
o Complexity
OpenSource Connections
5. What is Big Data?
• Brewer’s CAP theorem
o Consistency - all nodes have same world view
o Availability - requests can be serviced
o Partition tolerance - survives network/machine failure
o Can’t have all 3 -- Pick 2!
• Examples
o MySQL – Consistent, Available
o HBase – Consistent, Partition Tolerant
o Cassandra – Available, Partition Tolerant
– and “Tunably Consistent”!
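"Tunably consistent" can be made concrete with the usual replica-overlap rule: with N replicas, a read that waits for R of them is guaranteed to see a write acknowledged by W of them whenever R + W > N. The sketch below only illustrates that arithmetic; the function names are made up for this example and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set overlaps every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    n = 3  # assumed replication factor for the example
    # QUORUM writes + QUORUM reads: the two sets must share a replica.
    print(is_strongly_consistent(n, quorum(n), quorum(n)))
    # ONE + ONE: fastest and most available, but a read can miss a write.
    print(is_strongly_consistent(n, 1, 1))
```

Lower R and W favor availability and latency; higher values favor consistency. That trade-off is the "tunable" part.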
6. What is Big Data?
• Common theme: Denormalize everything!
o What’s that?
• JOIN all the tables in the database...
• … well not all the tables
o Why?
• You can shard database at any point
• All related data is co-located
• What this means for you
o No joins
o No transactions - potential for inconsistency
o Vastly simplified querying
o No data-modeling -- Instead, query-modeling
o “Infinite and easy” scaling potential
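Why denormalizing makes sharding easy can be sketched in a few lines: if everything a query needs lives in one row, the row key alone decides which node holds it. The node names and the modulo placement below are assumptions made for illustration; Cassandra itself places rows on a token ring through its partitioner rather than by a simple modulo.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(row_key: str, nodes=NODES) -> str:
    """Pick a node from a hash of the row key (modulo placement, illustration only)."""
    digest = hashlib.md5(row_key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest, "big") % len(nodes)]


def nodes_touched(row_keys) -> set:
    """Nodes a query must contact if it needs all of the given rows."""
    return {node_for(key) for key in row_keys}


if __name__ == "__main__":
    # Denormalized: the whole answer sits in one row, so one node is contacted.
    print(nodes_touched(["followers:Patricia"]))
    # Normalized: the answer is spread over several rows, possibly several nodes.
    print(nodes_touched(["user:Nate", "user:Eric", "user:Scott"]))
```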
7. How Does Cassandra Fit?
• No single point of failure
• Optimized for writes, still good with reads
• Can decide between Consistency and Availability concerns
20. C* Data Model
[Diagram: a row identified by its Row Key; each Column holds a Column Name, a Column Value (or Tombstone), a Timestamp, and a Time-to-live]
21. C* Data Model
● Row Key, Column Name, Column Value have types
● Column Name has comparator
● RowKey has partitioner
● Rows can have any number of columns - even in same column family
● Rows can have many columns
● Column Values can be omitted
● Time-to-live is useful!
● Tombstones
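The column structure above can be sketched as a small record type. This is a simplified model, not Cassandra's internal representation: timestamps are plain seconds, a value of None stands in for a tombstone, and ties between equal timestamps simply go to the first argument.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Column:
    name: str
    value: Optional[str]        # None marks a tombstone (a recorded delete)
    timestamp: int              # supplied by the writer; seconds in this sketch
    ttl: Optional[int] = None   # seconds to live; None means the column never expires

    def is_live(self, now: int) -> bool:
        """Readable only if it is not a tombstone and its TTL has not run out."""
        if self.value is None:
            return False
        if self.ttl is None:
            return True
        return now < self.timestamp + self.ttl


def reconcile(a: Column, b: Column) -> Column:
    """Last write wins: keep whichever version carries the newer timestamp."""
    return a if a.timestamp >= b.timestamp else b
```

A delete is just a newer column version with no value, which is why tombstones have to be kept around for a while: they must outlive any older copies of the data they shadow.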
22. C* Data Model: Writes
[Diagram: MemTable, CommitLog, Row Cache, Bloom Filter, four Key Caches, and four SSTables]
● Insert into MemTable
● Dump to CommitLog
● No read
● Very Fast!
● Blocks on CPU before I/O!
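A toy version of this write path helps show why no read is needed: a write is an append to a log plus an in-memory update, and disk tables are only produced in bulk at flush time. The class below is an illustrative sketch; the flush threshold, the in-memory "commit log" list, and the order of the two steps inside write() are assumptions of the example, not Cassandra's actual implementation.

```python
class WritePath:
    """Toy single-node write path: commit log + memtable + immutable SSTables."""

    def __init__(self, flush_threshold: int = 3):
        self.commit_log = []   # append-only list standing in for the on-disk log
        self.memtable = {}     # row key -> {column name: value}
        self.sstables = []     # each flush adds one immutable, sorted snapshot
        self.flush_threshold = flush_threshold  # rows held before flushing (assumed)

    def write(self, row_key: str, column: str, value: str) -> None:
        # Sequential append for durability; existing data is never read first.
        self.commit_log.append((row_key, column, value))
        self.memtable.setdefault(row_key, {})[column] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        """Write the memtable out as a sorted, immutable table and start fresh."""
        snapshot = tuple(
            (key, tuple(sorted(cols.items())))
            for key, cols in sorted(self.memtable.items())
        )
        self.sstables.append(snapshot)
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs replaying
```

Because flushed tables are never modified in place, an update or delete is simply another write; older versions are reconciled later at read time or during compaction.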
25. C* Data Model: Reads
[Diagram: MemTable, CommitLog, Row Cache, Bloom Filter, four Key Caches, and four SSTables]
● Get values from Memtable
● Get values from row cache if present
● Otherwise check bloom filter to find appropriate SSTables
● Check Key Cache for fast SSTable Search
● Get values from SSTables
● Repopulate Row Cache
● Super fast column retrieval
● Fast row slicing
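The read steps above can be sketched as follows. This is a simplified model under several assumptions: the key cache is left out, SSTables are plain dictionaries, newer tables win by list order rather than by column timestamps, and the Bloom filter is a minimal hand-rolled one whose size and hash count are arbitrary.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may wrongly answer "maybe present", never wrongly "absent"."""

    def __init__(self, size: int = 256, hashes: int = 3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class SSTable:
    """Immutable table of rows with a Bloom filter over its row keys."""

    def __init__(self, rows: dict):
        self.rows = dict(rows)   # row key -> {column name: value}
        self.bloom = BloomFilter()
        for key in self.rows:
            self.bloom.add(key)


def read_row(row_key, memtable, row_cache, sstables):
    """Merge one row from the row cache or SSTables, then overlay the memtable."""
    if row_key in row_cache:
        merged = dict(row_cache[row_key])
    else:
        merged = {}
        for table in sstables:                       # oldest first, newer overwrite
            if table.bloom.might_contain(row_key):   # skip tables that cannot hold the key
                merged.update(table.rows.get(row_key, {}))
        row_cache[row_key] = dict(merged)            # repopulate the row cache
    merged.update(memtable.get(row_key, {}))         # memtable holds the newest writes
    return merged
```

The Bloom filter is what keeps reads cheap as SSTables accumulate: most tables that do not contain the row are ruled out without touching disk.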
31. Internals: Twitter Example
• 4 ColumnFamilies
o followers
o following
o tweets
o timeline
32. Internals: Twitter Example
• Nate follows Patricia
o SET followers[Patricia][Nate] = '';
o SET following[Nate][Patricia] = '';
o Storing data in column names (not values)
o Denormalized, redundant!
• Get all Patricia’s followers
o GET followers[Patricia]
o => Nate,Eric,Scott,Matt,Doug,Kate
o No JOIN!
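The followers/following pattern can be mimicked with nested dictionaries: column family, then row key, then column name. The helper names below are invented for this sketch. The point it shows is that each follow is written twice so that either question ("who follows Patricia?", "whom does Nate follow?") is a single row lookup.

```python
from collections import defaultdict

# Column family -> row key -> {column name: column value}
followers = defaultdict(dict)
following = defaultdict(dict)


def follow(user: str, target: str) -> None:
    """Record one follow in both column families (denormalized on purpose)."""
    followers[target][user] = ""   # the column *name* carries the data; the value is empty
    following[user][target] = ""


def get_followers(user: str) -> list:
    """One row lookup, no JOIN: the row's column names are the answer."""
    # Sorted to mimic column names being kept in comparator order.
    return sorted(followers.get(user, {}))


follow("Nate", "Patricia")
follow("Eric", "Patricia")

if __name__ == "__main__":
    print(get_followers("Patricia"))
```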
33. Internals: Twitter Example
• Nate tweets
o SET tweets[Nate][2013-07-19T09:20] = “Wonderful morning. This coffee is great.”
o SET tweets[Nate][2013-07-19T09:21] = “Oops, smoke is coming out of the SQL server!”
o SET tweets[Nate][2013-07-19T09:51] = “Now my coffee is cold :-(”
• Get Nate’s tweets
o GET tweets[Nate]
…(what you’d expect)...
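Because the column names here are timestamps and columns are kept in sorted order, fetching a time range is a contiguous slice of one row. The class below sketches that idea with a sorted list and binary search; it is an illustration only, and TimelineRow is a name made up for this example.

```python
import bisect


class TimelineRow:
    """One row whose column names are timestamps kept in sorted order."""

    def __init__(self):
        self._names = []    # sorted column names (ISO-8601 strings sort chronologically)
        self._values = {}

    def set(self, name: str, value: str) -> None:
        if name not in self._values:
            bisect.insort(self._names, name)
        self._values[name] = value

    def slice(self, start: str, end: str) -> list:
        """Return (name, value) pairs with start <= name <= end, in column order."""
        lo = bisect.bisect_left(self._names, start)
        hi = bisect.bisect_right(self._names, end)
        return [(n, self._values[n]) for n in self._names[lo:hi]]


tweets_nate = TimelineRow()
# Inserted out of order on purpose; the row keeps them sorted by column name.
tweets_nate.set("2013-07-19T09:51", "Now my coffee is cold :-(")
tweets_nate.set("2013-07-19T09:20", "Wonderful morning. This coffee is great.")
tweets_nate.set("2013-07-19T09:21", "Oops, smoke is coming out of the SQL server!")
```

For example, `tweets_nate.slice("2013-07-19T09:20", "2013-07-19T09:30")` returns the first two tweets in chronological order without scanning the rest of the row.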
36. CQL (Cassandra Query Language)
CREATE TABLE users (
  id timeuuid PRIMARY KEY,
  lastname varchar,
  firstname varchar,
  dateOfBirth timestamp );
INSERT INTO users (id, lastname, firstname, dateofbirth)
  VALUES (now(), 'Berryman', 'John', '1975-09-15');
UPDATE users SET firstname = 'John'
  WHERE id = f74c0b20-0862-11e3-8cf6-b74c10b01fc6;
37. CQL (Cassandra Query Language)
SELECT dateofbirth, firstname, lastname FROM users;

 dateofbirth              | firstname | lastname
--------------------------+-----------+----------
 1975-09-15 00:00:00-0400 | John      | Berryman
38. The CQL/Cassandra Mapping
CREATE TABLE employees (
company text,
name text,
age int,
role text,
PRIMARY KEY (company,name)
);
39. The CQL/Cassandra Mapping
 company | name | age | role
---------+------+-----+------
 OSC     | eric |  38 | ceo
 OSC     | john |  37 | dev
 RKG     | anya |  29 | lead
 RKG     | ben  |  27 | dev
 RKG     | chad |  35 | ops
40. The CQL/Cassandra Mapping
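 row key | eric:age | eric:role | john:age | john:role
---------+----------+-----------+----------+-----------
 OSC     | 38       | ceo       | 37       | dev

 row key | anya:age | anya:role | ben:age | ben:role | chad:age | chad:role
---------+----------+-----------+---------+----------+----------+-----------
 RKG     | 29       | lead      | 27      | dev      | 35       | ops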
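The mapping from the CQL table to the underlying wide rows can be sketched as a small transformation: the first primary-key column (company) becomes the row key, and the second (name) is prefixed onto each remaining column name. The function below is an illustration of that idea only; the real storage format carries more detail than a "name:column" string.

```python
def to_storage_rows(cql_rows, partition_key, clustering_key):
    """Group CQL rows by partition key and flatten the rest into composite column names."""
    storage = {}
    for row in cql_rows:
        wide_row = storage.setdefault(row[partition_key], {})
        for column, value in row.items():
            if column in (partition_key, clustering_key):
                continue  # key columns are encoded in the row key / column name instead
            wide_row[f"{row[clustering_key]}:{column}"] = value
    return storage


employees = [
    {"company": "OSC", "name": "eric", "age": 38, "role": "ceo"},
    {"company": "OSC", "name": "john", "age": 37, "role": "dev"},
    {"company": "RKG", "name": "anya", "age": 29, "role": "lead"},
    {"company": "RKG", "name": "ben", "age": 27, "role": "dev"},
    {"company": "RKG", "name": "chad", "age": 35, "role": "ops"},
]
layout = to_storage_rows(employees, "company", "name")

if __name__ == "__main__":
    for row_key, columns in layout.items():
        print(row_key, columns)
```

Five CQL rows collapse into two storage rows, one per company, which is why all employees of a company can be fetched as a single row read.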
41. Modeling Strategy
• Don’t think about the data structure
• Do think of the questions you’ll ask
• Consider efficient operations for Cassandra
o Writing (4K writes per second per core)
o Retrieving a row
o Retrieving a row slice
o Retrieving in natural order (which you control)
• Write the data in the way you will query it
• Disk space is cheap
• Separate read-heavy and write-heavy tasks
o Make wise use of caches
42. Modeling Strategy: Anti-Patterns
• Read-then-write
• Heavy deletes
o Scatters dead columns throughout SSTables
o Won’t be corrected until first compaction after gc_grace_seconds (default 10 days)
• Distributed queue
• JOIN-like behavior
• Super wide-row sneak attack (>2B columns)
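The heavy-deletes anti-pattern comes down to when a tombstone is allowed to disappear. A minimal sketch, assuming each column is stored as a (value, written_at) pair with None marking a tombstone: compaction may drop a tombstone only after gc_grace_seconds has elapsed, so until then every delete still occupies space and is still scanned on reads. The function names are hypothetical.

```python
GC_GRACE_SECONDS = 10 * 24 * 60 * 60  # 864000 seconds, the 10-day figure from the slide


def purgeable(deleted_at: int, now: int,
              gc_grace_seconds: int = GC_GRACE_SECONDS) -> bool:
    """A tombstone may be dropped by compaction only once the grace period has passed."""
    return now - deleted_at > gc_grace_seconds


def compact(columns: dict, now: int) -> dict:
    """Keep live columns and any tombstones still inside the grace period.

    `columns` maps a column name to (value, written_at); a value of None is a tombstone.
    """
    return {
        name: (value, written_at)
        for name, (value, written_at) in columns.items()
        if value is not None or not purgeable(written_at, now)
    }
```

A workload that deletes constantly, such as a queue, keeps producing tombstones faster than compaction can retire them, which is one reason the distributed queue also appears on this list.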