Apache Cassandra is a non-relational (NoSQL) database maintained by the Apache Software Foundation. Cassandra was originally open sourced by Facebook in 2008 and is now developed as an Apache project.
Traditional relational databases store data as rows, whereas Cassandra stores data in a column-oriented format as key-value pairs. Because of this column-based storage, it can deliver higher performance than relational databases for many workloads.
Cassandra can handle many terabytes of data if need be and can easily handle millions of rows, even on a smaller cluster. It can sustain around 20K inserts per second.
Cassandra's performance is high, and keeping read performance up depends mostly on the hardware, the configuration, and the number of nodes in your cluster. With those in place, this can be achieved in Cassandra without much trouble.
1. CASSANDRA – An Open Source Data Storage System
Presented By :
Vipul Kumar
Cr No. - 11/269
UNIVERSITY COLLEGE OF ENGINEERING, KOTA
RAJASTHAN TECHNICAL UNIVERSITY
Presented To :
Mr R K Banyal Sir
CSE Department
COMPUTER SCIENCE AND ENGINEERING DEPARTMENT
2. Contents
• What is Cassandra?
• History
• Data Model
• System architecture
• Key features and benefits
• Who is using Cassandra?
• Conclusion and future scope
3. Definition of Cassandra
Apache Cassandra™ is a free, open source NoSQL database that is:
• Distributed
• High performance
• Extremely scalable
• Fault tolerant (i.e. no single point of failure)
5. Data Model
• A table is a multidimensional map indexed by a key (the row key).
• Columns are grouped into Column Families.
• 2 types of Column Families
– Simple
– Super (nested Column Families)
• Each Column has
– Name
– Value
– Timestamp
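The data model above can be pictured as nested maps. The following is a minimal Python sketch, not Cassandra's actual storage code: the names `Column`, `ColumnFamily`, `insert` and `get` are hypothetical, and it assumes that when the same column is written twice, the value with the newest timestamp is kept.

```python
import time
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    value: str
    timestamp: int  # e.g. microseconds since the epoch


class ColumnFamily:
    """A simple column family: row key -> {column name -> Column}."""

    def __init__(self, name):
        self.name = name
        self.rows = {}

    def insert(self, row_key, name, value, timestamp=None):
        if timestamp is None:
            timestamp = time.time_ns() // 1000
        columns = self.rows.setdefault(row_key, {})
        current = columns.get(name)
        # Assumption: the write with the newest timestamp wins.
        if current is None or timestamp >= current.timestamp:
            columns[name] = Column(name, value, timestamp)

    def get(self, row_key, name):
        column = self.rows.get(row_key, {}).get(name)
        return None if column is None else column.value


users = ColumnFamily("users")
users.insert("user1", "email", "old@example.com", timestamp=1)
users.insert("user1", "email", "new@example.com", timestamp=2)
users.insert("user1", "email", "stale@example.com", timestamp=1)  # ignored
users.insert("user2", "city", "Kota", timestamp=1)
```

Note that `user1` and `user2` hold different columns; rows in the same column family do not need to share a fixed set of columns.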
7. System Architecture
• Partitioning: how data is partitioned across nodes
• Replication: how data is duplicated across nodes
8. Partitioning
• Nodes are logically structured in a ring topology.
• The hashed value of the key associated with a data partition is used to assign it to a node in the ring.
• The hash space wraps around after a certain maximum value to support the ring structure.
• Lightly loaded nodes move position on the ring to alleviate heavily loaded nodes.
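The ring lookup described above can be sketched in a few lines of Python. This is an illustration under assumptions, not Cassandra's partitioner: the `Ring` class, the 32-bit hash space, the use of MD5, and the node tokens are all made up for the example. Each key is owned by the first node whose token is at or after the key's token, wrapping around to the first node at the end of the ring.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 32  # assumed size of the hash space


def token(key):
    """Hash a key to a position on the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE


class Ring:
    def __init__(self, nodes):
        # nodes: dict mapping node name -> token (its position on the ring)
        self.positions = sorted((tok, name) for name, tok in nodes.items())
        self.tokens = [tok for tok, _ in self.positions]

    def owner_of_token(self, tok):
        # First node at or after tok; the modulo wraps past the last node.
        index = bisect.bisect_left(self.tokens, tok)
        return self.positions[index % len(self.positions)][1]

    def owner(self, key):
        return self.owner_of_token(token(key))


ring = Ring({"A": 100, "B": 200, "C": 300})
```

With these hypothetical tokens, a key hashing to 150 belongs to node B, and a key hashing to 301 wraps around to node A.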
9. Replication
• Each data item is replicated at N (replication factor) nodes.
• Different replication policies
– Rack Unaware – replicate data at the N-1 successive nodes after its coordinator
– Rack Aware – uses ‘Zookeeper’ to choose a leader, which tells nodes the range they are replicas for
– Datacenter Aware – similar to Rack Aware, but the leader is chosen at datacenter level instead of rack level
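The Rack Unaware policy is simple enough to sketch directly. In the Python snippet below, `rack_unaware_replicas` is a hypothetical helper and the ring is assumed to be given as a list of node names in ring order; it returns the coordinator followed by the N-1 nodes that come after it, wrapping around the ring.

```python
def rack_unaware_replicas(ring_nodes, coordinator, n):
    """Return the coordinator plus the N-1 nodes that follow it on the ring."""
    if n > len(ring_nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    start = ring_nodes.index(coordinator)
    # Walk clockwise from the coordinator, wrapping at the end of the list.
    return [ring_nodes[(start + i) % len(ring_nodes)] for i in range(n)]


replicas = rack_unaware_replicas(["A", "B", "C", "D"], "C", 3)
```

Here the data coordinated by node C with N = 3 is stored on C, D and A.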
11. Gossip Protocol
• Network communication protocol inspired by real-life rumor spreading.
• Periodic, pairwise, inter-node communication.
• Low-frequency communication ensures low cost.
• Random selection of peers.
• Example – Node A wishes to search for a pattern in data
– Round 1 – Node A searches locally and then gossips with node B.
– Round 2 – Nodes A and B gossip with C and D.
– Round 3 – Nodes A, B, C and D gossip with 4 other nodes ……
• Round-by-round doubling makes the protocol very robust.
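The round-by-round doubling in the example can be checked with a small Python sketch. It models the idealized case from the slide, assuming every informed node reaches one new node per round; real gossip picks peers at random, so it may need more rounds. The function name `gossip_rounds` is made up for this illustration.

```python
def gossip_rounds(total_nodes):
    """Rounds needed to reach every node if the informed set doubles each round."""
    informed = 1  # only the starting node knows at first
    rounds = 0
    while informed < total_nodes:
        # Idealized assumption: each informed node informs one new node.
        informed = min(informed * 2, total_nodes)
        rounds += 1
    return rounds


rounds_for_8 = gossip_rounds(8)
rounds_for_1000 = gossip_rounds(1000)
```

Under this assumption 8 nodes are reached in 3 rounds, as in the example above, and 1000 nodes in 10 rounds, so the number of rounds grows only logarithmically with cluster size.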
12. Key features & benefits
• Gigabyte to Petabyte scalability
• Big data scalability
• No single point of failure
• Easy Replication / Data distribution
• No need for caching software
• Flexible Schema
13. Big Data Scalability
• Capable of comfortably scaling to petabytes
• New nodes = linear performance increases
• Add new nodes online
[Figure: growing a cluster from 2 nodes to 4 nodes doubles throughput capacity]
14. No single point of failure
• All nodes are the same
• Read/write from any node
• Can replicate data among different physical data center racks
15. Easy Replication
• Transparently handled by Cassandra
• Multi data center capable
• Exploits all the benefits of cloud computing
16. No need for caching layer
• The peer-to-peer layer removes the need for a special caching layer and the programming that goes with it.
• The database uses the memory of all participating nodes to cache the assigned data.
17. Flexible Schema
• Dynamic schema design allows for more flexible data storage than a rigid RDBMS
• Handles structured, semi-structured and unstructured data
• No offline period or downtime for schema changes
19. Conclusion & future scope
• Cassandra is an open source storage system providing scalability, high performance, and wide applicability.
• Cassandra can support a very high update throughput while delivering low latency.
• Future work involves adding compression, the ability to support atomicity across keys, and secondary index support.