3. Why NoSQL?
Growth in data volume led to the use of clusters of small machines to handle it
(scaling out), but RDBMSs are not designed to run on clusters
Bigtable from Google and Dynamo from Amazon were the
alternatives for data storage in the early 2000s
Common characteristics of NoSQL DBs are
◦ Not using relational model
◦ Running well on clusters
◦ Schemaless, open source, and built for 21st-century web estates
July 3, 2014 3
4. Types of NoSQL DBs
NoSQL Types
Aggregate Oriented DBs
◦ Key Value Data Model: Amazon DynamoDB
◦ Document Model: MongoDB, CouchDB
◦ Column Family Model: Cassandra, HBase
Graph DBs
◦ Neo4J
◦ Infinite Graph
5. Cassandra Data Model
The table below shows how Cassandra terms map to the relational model
A Cassandra column family can be thought of as a map of maps
◦ Map<RowKey, SortedMap<ColumnKey, ColumnValue>>
Relational Model    Cassandra Model
Database            Keyspace
Table               Column Family
Primary Key         Row Key
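The map-of-maps view can be sketched in a few lines of Python. This is only a toy model, not how Cassandra stores data internally: the class name ColumnFamily and its methods are hypothetical, and the sample rows are borrowed from the partitioning example in this deck.

```python
from bisect import insort


class ColumnFamily:
    """Toy model of Map<RowKey, SortedMap<ColumnKey, ColumnValue>>."""

    def __init__(self):
        # row key -> (sorted list of column keys, dict of column values)
        self._rows = {}

    def put(self, row_key, column_key, value):
        keys, values = self._rows.setdefault(row_key, ([], {}))
        if column_key not in values:
            insort(keys, column_key)  # keep column keys in sorted order
        values[column_key] = value

    def get_row(self, row_key):
        """Return the row's columns as (key, value) pairs in column-key order."""
        keys, values = self._rows.get(row_key, ([], {}))
        return [(k, values[k]) for k in keys]


users = ColumnFamily()
users.put("jim", "gender", "M")
users.put("jim", "age", 36)
users.put("jim", "car", "camaro")
users.put("johnny", "gender", "M")
users.put("johnny", "age", 12)
```

Rows do not need to share the same columns: here "johnny" has no "car" column, which is what the schemaless property looks like in this model.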
6. Cassandra Key Components
Gossip
◦ Peer-to-peer communication protocol between nodes of cluster
Partitioner
◦ Determines how to distribute data across nodes of cluster
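Replication Strategy
◦ Determines which nodes store the replicas of each row
Snitch
◦ Describes the network topology (which rack and datacenter each node belongs to)
cassandra.yaml
◦ Main configuration file: timeout settings, tuning properties, etc.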
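As an illustration of what a replication strategy decides, the sketch below places replicas by walking the ring clockwise from the node that owns the row. It is a simplified model in the spirit of Cassandra's SimpleStrategy, not its actual code: the function name replicas_for is hypothetical, and rack or datacenter information from the snitch is ignored.

```python
def replicas_for(ring, primary_index, replication_factor):
    """Return the nodes that hold copies of a row owned by ring[primary_index].

    `ring` lists the nodes in token order; replicas are the owner plus the
    next nodes clockwise, never more than the number of nodes in the ring.
    """
    count = min(replication_factor, len(ring))
    return [ring[(primary_index + i) % len(ring)] for i in range(count)]


ring = ["A", "B", "C", "D"]
copies = replicas_for(ring, primary_index=2, replication_factor=3)
```

With a replication factor of 3, a row owned by node C is also copied to D and, wrapping around the ring, to A.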
7. Cassandra Storage
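A write is first appended to the commit log on disk and stored in an
in-memory memtable. The memtable data is later flushed to SSTables on disk.
Data in the commit log is purged after its corresponding data in the
memtable has been flushed to an SSTable.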
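The flush-and-purge behaviour can be sketched as follows. This is a minimal in-memory model under stated assumptions: the class name StorageEngine, the flush_threshold parameter, and the use of lists and dicts in place of files are all illustrative, and real Cassandra purges commit log segments rather than the whole log.

```python
class StorageEngine:
    """Toy write path: commit log + memtable, flushed to immutable SSTables."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only log (a list stands in for a file)
        self.memtable = {}     # recent writes held in memory
        self.sstables = []     # flushed tables, oldest first
        self.flush_threshold = flush_threshold  # hypothetical tuning knob

    def write(self, key, value):
        self.commit_log.append((key, value))  # record the write for recovery
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # An SSTable is sorted by key and never modified after it is written.
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []  # purge: the logged data now lives in an SSTable

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):  # newest SSTable first
            if key in table:
                return table[key]
        return None


engine = StorageEngine(flush_threshold=2)
engine.write("jim", 36)
engine.write("carol", 37)   # second write reaches the threshold and triggers a flush
engine.write("suzy", 10)    # stays in the memtable and the commit log
```

After these three writes, the first two rows sit in one SSTable and only the third remains in the commit log, which is the state the slide describes.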
8. Cassandra Data Partitioning
Let's say we have the following data:
jim     age: 36   car: camaro   gender: M
carol   age: 37   car: bmw      gender: F
johnny  age: 12                 gender: M
suzy    age: 10                 gender: F
Each row is placed on a node based on the hash of its partition key and the
token range the node is responsible for:
Node  Start Range      End Range        Partition Key  Hash Value
A     -9223372036854   -4611686018427   johnny         -6723372854875
B     -4611686018427   -1               jim            -2245462676723
C     0                4611686018427    suzy           1168604627387
D     4611686018427    9223372036854    carol          7723358927203
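The lookup from hash value to node can be sketched with a binary search over the start of each range. The numbers are copied from the table above as printed; the function name node_for is hypothetical, and treating a token that equals a shared boundary as belonging to the later node is an assumption of this sketch.

```python
from bisect import bisect_right

# (start of range, node), in ascending order, as printed in the table
RANGES = [
    (-9223372036854, "A"),
    (-4611686018427, "B"),
    (0, "C"),
    (4611686018427, "D"),
]
STARTS = [start for start, _ in RANGES]

# Hash values of the partition keys, as printed in the table
HASHES = {
    "johnny": -6723372854875,
    "jim": -2245462676723,
    "suzy": 1168604627387,
    "carol": 7723358927203,
}


def node_for(token):
    """Return the node whose range contains `token`."""
    index = bisect_right(STARTS, token) - 1  # last range starting at or before token
    if index < 0:
        raise ValueError("token is below the first range")
    return RANGES[index][1]


placement = {key: node_for(token) for key, token in HASHES.items()}
```

The resulting placement matches the table: johnny on A, jim on B, suzy on C, carol on D.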
9. Cassandra Data Distribution using Vnodes
Vnodes allow each node to own a large number of small partition
ranges distributed throughout the cluster
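A rough sketch of the idea: instead of one contiguous range per node, each node is given many tokens scattered around the ring, and a key belongs to the node holding the first token at or after the key's token. The function names, the random token assignment, and the value 256 are illustrative assumptions, not Cassandra's actual allocation code.

```python
import random
from bisect import bisect_left
from collections import Counter

TOKEN_MIN, TOKEN_MAX = -2**63, 2**63  # 64-bit token space (upper bound exclusive)


def build_ring(nodes, tokens_per_node, seed=0):
    """Assign each node `tokens_per_node` random tokens; return a sorted ring."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    ring = []
    for node in nodes:
        for _ in range(tokens_per_node):
            ring.append((rng.randrange(TOKEN_MIN, TOKEN_MAX), node))
    ring.sort()
    return ring


def owner(ring, token):
    """Return the node that owns the first vnode token >= `token`, wrapping around."""
    tokens = [t for t, _ in ring]
    index = bisect_left(tokens, token) % len(ring)
    return ring[index][1]


ring = build_ring(["A", "B", "C", "D"], tokens_per_node=256)
counts = Counter(node for _, node in ring)
```

Because every node owns many small slices of the ring, adding or removing a node shifts a little data to or from many peers rather than a large range to or from a single neighbour.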