Definition
Apache Cassandra is an open source, distributed,
decentralized, elastically scalable, highly available,
fault-tolerant, tuneably consistent, column-oriented
database that bases its distribution design on Amazon’s
Dynamo and its data model on Google’s Bigtable.
Created at Facebook, it is now used at some of the most
popular sites on the Web. [Cassandra: The Definitive Guide,
Eben Hewitt, 2010]
13/01/2014
Cassandra Introduction & Key Features by Philipp Potisk
Key Features
• Distributed and Decentralized
• Elastic Scalability
• High Availability and Fault Tolerance
• Tunable Consistency
• Column-oriented Key-Value store
• CQL – An SQL-like query interface
• High Performance
Distributed and Decentralized
• Distributed: Capable of running on multiple machines
• Decentralized: No single point of failure
  • No master-slave issues due to the peer-to-peer architecture ("gossip" protocol)
  • A single Cassandra cluster may run across geographically dispersed data centers
• Read and write requests can be sent to any node
[Figure: nodes 1 to 12 spread across Datacenter 1 and Datacenter 2]
Elastic Scalability
• Cassandra scales horizontally by adding more machines that hold all or some of the data
• Adding nodes increases performance throughput linearly
• Increasing and decreasing the node count happens seamlessly
• Scales linearly to terabytes and petabytes of data
[Figure: ring of 4 nodes with performance throughput = N; ring of 8 nodes with performance throughput = N x 2]
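Seamless changes in node count can be illustrated with consistent hashing, the general technique behind token-ring partitioning. The sketch below is a simplified illustration and not Cassandra's actual partitioner: the `Ring` class, the node names, the use of SHA-256 and the single token per node are all assumptions made for the example. It shows that when a node joins, the only keys that change owner are the ones taken over by the new node.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (SHA-256 is an illustrative choice)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Hypothetical token ring with one token per node."""

    def __init__(self, nodes):
        # Sorted (token, node) pairs; a node owns the keys up to its own token.
        self._tokens = sorted((token(n), n) for n in nodes)

    def owner(self, row_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around at the end.
        idx = bisect.bisect_left(self._tokens, (token(row_key), ""))
        return self._tokens[idx % len(self._tokens)][1]


keys = [f"user{i}" for i in range(1000)]
before = Ring(["node1", "node2", "node3", "node4"])
after = Ring(["node1", "node2", "node3", "node4", "node5"])

# Keys whose owner changed when node5 joined the ring.
moved = [k for k in keys if before.owner(k) != after.owner(k)]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node5")
```

Keys that stay with their old owner need no data transfer, which is why a ring like this can grow or shrink while it keeps serving requests.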
Scaling Benchmark by Netflix*
• 48, 96, 144 and 288 instances, with 10, 20, 30 and 60 clients respectively
• Each client generated ~20,000 writes/s of 400 bytes each
Cassandra scales linearly far
beyond our current capacity
requirements, and very
rapid deployment
automation makes it easy to
manage. In particular,
benchmarking in the cloud
is fast, cheap and scalable,
* http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
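The benchmark setup can be checked with a little arithmetic. The sketch below uses only the figures quoted on this slide and computes the offered write load, not the throughput Netflix measured. The variable names are illustrative.

```python
# Figures from the slide: instance counts, client counts, writes per client.
instances = [48, 96, 144, 288]
clients = [10, 20, 30, 60]
writes_per_client = 20_000  # approximate value quoted on the slide

# Total writes/s sent to the cluster in each configuration.
offered = [c * writes_per_client for c in clients]

# Writes/s each instance has to absorb if the load is spread evenly.
per_instance = [o / n for o, n in zip(offered, instances)]

for n, o, p in zip(instances, offered, per_instance):
    print(f"{n:>3} instances: {o:>9,} writes/s offered, about {p:,.0f} per instance")
```

The load per instance is the same in all four configurations, so the benchmark asks whether the cluster keeps up when load and node count grow together.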
High Availability and Fault Tolerance
• High Availability?
  • Multiple networked computers operating in a cluster
  • A facility for recognizing node failures
  • Failover: forwarding requests to another part of the system
• Cassandra has High Availability
  • No single point of failure due to the peer-to-peer architecture
[Figure: ring of nodes 1 to 6]
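The failover idea can be sketched in a few lines. This is a generic illustration, not Cassandra's implementation: `Node`, its `alive` flag (standing in for real failure detection) and `read_with_failover` are hypothetical names.

```python
class Node:
    """Hypothetical cluster node; `alive` stands in for real failure detection."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.data = {}


def read_with_failover(replicas, key):
    """Return (node name, value) from the first replica that is still alive."""
    for node in replicas:
        if node.alive:
            return node.name, node.data.get(key)
    raise RuntimeError("no replica available")


replicas = [Node("node1"), Node("node2"), Node("node3")]
for node in replicas:
    node.data["jsmith"] = "profile"  # every replica holds a copy of the row

replicas[0].alive = False  # simulate a node failure
print(read_with_failover(replicas, "jsmith"))  # ('node2', 'profile')
```

Because every node can serve the request, losing one node only means the request is answered by another.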
Tunable Consistency
• Choose between strong and eventual consistency
• Adjustable for read and write operations separately
• Conflicts are resolved during reads, as the focus lies on write performance
• The level of consistency depends on the use case
[Figure labels: TUNABLE, Available, Consistency]
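Resolving conflicts at read time can be pictured as a last-write-wins merge over the answers returned by the replicas. The sketch is illustrative only: `Versioned`, `resolve` and the sample values are hypothetical, and real reconciliation works per column and has tie-breaking rules that are left out here.

```python
from typing import NamedTuple


class Versioned(NamedTuple):
    value: str
    timestamp: int  # write time attached to the value by the writer


def resolve(responses):
    """Return the response with the highest timestamp (last write wins)."""
    return max(responses, key=lambda r: r.timestamp)


# Three replicas answer a read; one of them has already seen the newer write.
responses = [
    Versioned("old@example.com", 1),
    Versioned("new@example.com", 2),
    Versioned("old@example.com", 1),
]
latest = resolve(responses)
print(latest.value)  # new@example.com
```

Writes stay cheap because they never compare versions; the comparison is paid for by the read.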
When do we have strong consistency?
• Simple formula: (nodes_written + nodes_read) > replication_factor
• Ensures that a read always reflects the most recent write
• If not: weak consistency (eventually consistent)
[Figure: replicas of row "jsmith" holding versions t1 and t2, with NW: 2, NR: 2, RF: 3]
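The formula works because any set of written replicas and any set of read replicas must then share at least one node. The sketch below checks this by brute force for small clusters. The function names are illustrative, and replicas are modelled as plain integers.

```python
from itertools import combinations


def is_strongly_consistent(nodes_written: int, nodes_read: int, replication_factor: int) -> bool:
    """The rule from the slide: (nodes_written + nodes_read) > replication_factor."""
    return nodes_written + nodes_read > replication_factor


def always_overlap(nodes_written: int, nodes_read: int, replication_factor: int) -> bool:
    """Brute force: does every possible write set share a replica with every read set?"""
    replicas = range(replication_factor)
    return all(
        set(w) & set(r)
        for w in combinations(replicas, nodes_written)
        for r in combinations(replicas, nodes_read)
    )


print(is_strongly_consistent(2, 2, 3))  # True  (NW: 2, NR: 2, RF: 3)
print(always_overlap(2, 2, 3))          # True  (every read touches a written replica)
print(always_overlap(1, 1, 3))          # False (a read can miss the only written replica)
```

With NW = 1 and NR = 1 the read may land on a replica that has not yet received the write, which is the eventually consistent case.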
Column-oriented Key-Value Store

Row Key1 | Column Key1   | Column Key2   | Column Key3   | …
         | Column Value1 | Column Value2 | Column Value3 | …

• Data is stored in sparse multidimensional hash tables
• A row can have multiple columns; rows need not have the same number of columns
• Each row has a unique key, which also determines partitioning
• No relations!

Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
• Rows: stored sorted by row key *
• Columns: stored sorted by column key/value

* Row keys (partition keys) should be hashed, in order to distribute data across the cluster evenly
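The nested-map view of the data model can be mimicked with a toy class. This is only an in-memory illustration of Map<RowKey, SortedMap<ColumnKey, ColumnValue>>: the `ColumnFamily` class, its methods and the sample rows are hypothetical and not part of any Cassandra API.

```python
class ColumnFamily:
    """Toy model of Map<RowKey, SortedMap<ColumnKey, ColumnValue>>."""

    def __init__(self):
        self._rows = {}  # row key -> {column key: column value}

    def insert(self, row_key, column_key, value):
        # Rows are created on first use and may hold any set of columns.
        self._rows.setdefault(row_key, {})[column_key] = value

    def get_row(self, row_key):
        """Return the row's columns as (key, value) pairs sorted by column key."""
        return sorted(self._rows.get(row_key, {}).items())


users = ColumnFamily()
users.insert("jsmith", "email", "jsmith@example.com")
users.insert("jsmith", "city", "Vienna")
users.insert("mmiller", "email", "mmiller@example.com")  # sparse: no "city" column

print(users.get_row("jsmith"))   # [('city', 'Vienna'), ('email', 'jsmith@example.com')]
print(users.get_row("mmiller"))  # [('email', 'mmiller@example.com')]
```

The two rows have different columns, and no schema had to be declared for either of them.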
CQL – An SQL-like query interface
• “CQL 3 is the default and primary interface into the Cassandra DBMS” *
• Familiar SQL-like syntax that maps to Cassandra's storage engine and
simplifies data modelling

CREATE TABLE songs (
  id uuid PRIMARY KEY,
  title text,
  album text,
  artist text,
  data blob,
  tags set<text>
);

INSERT INTO songs (id, title, artist, album, tags)
VALUES ('a3e64f8f...', 'La Grange', 'ZZ Top', 'Tres Hombres', {'cool', 'hot'});

SELECT *
FROM songs
WHERE id = 'a3e64f8f...';

“SQL-like” but NOT relational SQL
* http://www.datastax.com/documentation/cql/3.0/pdf/cql30.pdf
High Performance
• Optimized from the ground up for high throughput
• All disk writes are sequential, append-only operations
• No reading before writing
• Cassandra's threading concept is optimized for running on multiprocessor/multicore machines
• Optimized for writing, but fast reads are possible as well
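The write path described above can be pictured with a minimal append-only store. This is a conceptual sketch, not Cassandra's storage engine: `AppendOnlyStore` is a hypothetical class and a Python list stands in for the sequential on-disk log.

```python
class AppendOnlyStore:
    """Toy store whose writes only ever append to a log."""

    def __init__(self):
        self._log = []  # stands in for a sequential, append-only file

    def write(self, key, value):
        # No lookup of the old value and no in-place update: just append.
        self._log.append((key, value))

    def read(self, key):
        # The newest entry for a key wins, so scan from the end of the log.
        for k, v in reversed(self._log):
            if k == key:
                return v
        return None


store = AppendOnlyStore()
store.write("jsmith", "v1")
store.write("jsmith", "v2")  # overwrites logically, but only appends physically
print(store.read("jsmith"))  # v2
```

The write cost does not depend on how much data already exists, while the read has to search for the newest version. That is the trade-off behind "optimized for writing".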
Benchmark from 2011 (Cassandra 0.7.4)*
[Chart: throughput in ops]
Cassandra showed outstanding throughput in “INSERT-only” with 20,000 ops
Insert: Enter 50 million 1K-sized records
Read: Search key for a one hour period + optional update
Hardware: Nehalem 6 Core x 2 CPU, 16GB Memory
* NoSQL Benchmarking by CUBRID, http://www.cubrid.org/blog/dev-platform/nosql-benchmarking/
Benchmark from 2013 (Cassandra 1.1.6)*
* Benchmarking Top NoSQL Databases by End Point Corporation,
http://www.datastax.com/wp-content/uploads/2013/02/WP-Benchmarking-Top-NoSQL-Databases.pdf
Yahoo! Cloud Serving Benchmark: https://github.com/brianfrankcooper/YCSB
When do we need these features?
• Lots of Writes, Statistics, and Analysis
• Geographical Distribution
• Large Deployments
• Evolving Applications
Who is using Cassandra?
eBay Data Infrastructure*

• Thousands of nodes
• > 2K sharded logical hosts
• > 16K tables
• > 27K indexes
• > 140 billion SQLs/day
• > 5 PB provisioned

• 10+ clusters
• 100+ nodes
• > 250 TB provisioned (local HDD + shared SSD)
• > 9 billion writes/day
• > 5 billion reads/day

• Hundreds of nodes
• Persistent & in-memory
• > 40 billion SQLs/day

• Hundreds of nodes
• > 50 TB
• > 2 billion ops/day

• Thousands of nodes
• The world's largest cluster with 2K+ nodes

Not replacing RDBMS but complementing!

* by Jay Patel, Cassandra Summit June 2013 San Francisco
Cassandra Use Case at eBay

Application/Use Case
• Time-series data and real-time insights
• Fraud detection & prevention
• Quality Click Pricing for affiliates
• Order & Shipment Tracking
• Server metrics collection
• Taste graph-based next-gen recommendation system
• Social Signals on eBay Product & Item pages
• …

Why Cassandra?
• Multi-Datacenter (active-active)
• No SPOF
• Easy to scale
• Write performance
• Distributed Counters
Summary
• History
• Key features of Cassandra
  • Distributed and Decentralized
  • Elastic Scalability
  • High Availability and Fault Tolerance
  • Tunable Consistency
  • Column-oriented key-value store
  • CQL interface
  • High Performance
• eBay Use Case

Apache project: http://cassandra.apache.org
Community portal: http://planetcassandra.org
Documentation: http://www.datastax.com/docs