2. Session Plan
• Need & Introduction to NoSQL DB
• Cassandra Introduction
• Data model creation
• Pycassa in action
3. Heard of NoSQL?
• Stands for Not Only SQL
• Class of non-relational data storage systems
• No fixed table schema
• No Joins!
• Relax one or more of the ACID properties in favour of
BASE, a trade-off described by the CAP theorem!
4. Do we “REALLY” need them?
• RDBMS …So strong
• so crisp
• so vast
• And WE know it well!
5. Trends shrends!
– Gartner's 10 key IT trends for 2012
• Unstructured data will grow some 80% over the
course of the next five years
6. What made some apps go No-SQLized?
• Explosion of social media sites with large data needs
• Open-source community
• Upsurge of cloud-based solutions
• Migration to dynamically-typed languages
7. RDBMS..hmmm
• Normalization => Joins => Slow queries / complications
• Consistency => Locks / transactions => Performance issues in
distributed environments
• Scalability becomes a mess as our apps grow in size and
demand
8. Current Approach to Scalability
• Add hardware
• Upgrade hardware
• More machines
• Turn off unwanted services
• Caching
• De-normalize…
10. But Why..
• ACID
– Transactions slow down under heavy load
– In a distributed / replicated environment: two-phase
commit => a node or the coordinator can end up waiting indefinitely
11. But RDBMS is still holding up!!
• Yes..it is
• Will continue to co-exist with NoSQL
• What if data were no longer a problem for me?
• What new problems would I like to have?
12. Seeds of NoSQL
• Three major papers
– BigTable (Google)
– Dynamo (Amazon)
• Gossip protocol (discovery and error detection)
• Distributed key-value data store
• Eventual consistency
– CAP Theorem
13. Brewer’s CAP Theorem
• Properties of a system:
– Consistency
– Availability
– Partition tolerance
14. Brewer’s CAP Theorem
• You can have it good, you can have it fast, you can have
it cheap: pick two
15. BASE Vs ACID - Eventual Consistency
• No updates for a long duration => eventually all updates
will propagate through the system => all the nodes will
be consistent
• For any accepted update and any given node, eventually
either the update reaches the node or the node is
removed from service
• Known as BASE (Basically Available, Soft state,
Eventual consistency)
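The convergence described above can be sketched in a few lines of Python. This is a toy simulation, not Cassandra's actual gossip or repair code: the Replica class, the gossip_round helper and the last-write-wins rule on a single value are all assumptions made for illustration.

```python
import random

class Replica(object):
    """One node holding a single value plus the timestamp of its last write."""
    def __init__(self, name):
        self.name = name
        self.value = None
        self.timestamp = 0

    def write(self, value, timestamp):
        # Keep only the newest write (last-write-wins).
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp

def gossip_round(replicas, rng):
    """Each replica exchanges state with one randomly chosen peer."""
    for replica in replicas:
        peer = rng.choice([r for r in replicas if r is not replica])
        # Exchange in both directions; the newer timestamp wins on each side.
        peer.write(replica.value, replica.timestamp)
        replica.write(peer.value, peer.timestamp)

def rounds_until_consistent(replicas, rng, max_rounds=100):
    """Return how many gossip rounds pass before all replicas agree."""
    for round_no in range(max_rounds + 1):
        if len(set((r.value, r.timestamp) for r in replicas)) == 1:
            return round_no
        gossip_round(replicas, rng)
    return None

rng = random.Random(42)
nodes = [Replica("node%d" % i) for i in range(5)]
nodes[0].write("hello", timestamp=1)   # the update is accepted by one node only
rounds = rounds_until_consistent(nodes, rng)
print("consistent after %s gossip rounds" % rounds)
```

With no further writes, every replica ends up holding the same value, which is the "eventual" part of BASE.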
16. What kinds of NoSQL
• 2 Major areas:
– Key/Value or 'the big hash table'.
• Dynamo
• Voldemort
• Scalaris
– Schema-less
• column-based, document-based or graph-based.
– Cassandra (column-based)
– CouchDB (document-based)
– Neo4J (graph-based)
– HBase (column-based)
18. Cassandra to the Rescue!
• Open source
• Distributed, decentralized
• Elastically scalable
• Highly available / fault-tolerant
• Tunably consistent
• Column-oriented database
• Automatic sharding
• Gossip architecture
19. Distributed and Decentralized
• Distributed
– Can be running on multiple machines
– Appears to users as a single instance
• Decentralized
– There is no single point of failure
– All the nodes in the cluster function exactly the same
[server symmetry]
20. Elastic Scalability
• Vertical scaling:
– More hardware capacity / memory
• Horizontal scaling:
– More machines that have all or some of the data
– So that no machine is bearing the complete load
21. Elastic Scalability, No Single Point of Failure
• Elastic scalability :
– Cluster will be able to scale up & down
• Master Slave issue
22. Scale UP & Scale down
• Scale up: add nodes and they can start serving clients!
– NO server restart / NO query change / NO manual
balancing
– JUST add another machine.
• Scale down: just unplug the system.
– Since Cassandra keeps multiple copies of the same
data on more than one node [configurable], there
won't be any loss of data.
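Why unplugging one machine does not lose data can be sketched with a toy hash ring. This is a simplified illustration, not Cassandra's partitioner or replication strategy: the Ring class, the node names and the use of MD5 for tokens are assumptions, and a real cluster also streams data to restore the replica count afterwards.

```python
import bisect
import hashlib

def token(key):
    """Map a string to a position on the ring (MD5 used only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class Ring(object):
    def __init__(self, nodes, replication_factor=2):
        self.replication_factor = replication_factor
        self.tokens = sorted((token(n), n) for n in nodes)

    def add_node(self, node):
        # A new machine simply takes its place on the ring.
        bisect.insort(self.tokens, (token(node), node))

    def remove_node(self, node):
        self.tokens.remove((token(node), node))

    def replicas_for(self, key):
        """Walk clockwise from the key's token and take the next nodes."""
        start = bisect.bisect(self.tokens, (token(key), ""))
        count = min(self.replication_factor, len(self.tokens))
        return [self.tokens[(start + i) % len(self.tokens)][1]
                for i in range(count)]

ring = Ring(["node-a", "node-b", "node-c"], replication_factor=2)
before = ring.replicas_for("010053")
ring.remove_node(before[0])          # "just unplug" the first replica
after = ring.replicas_for("010053")
print(before, after)
```

The second replica of the row is still on the ring after the first one is removed, so reads for that key can still be served.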
23. High Availability and Fault Tolerance
• High availability + central server based system = problem
– Internal hardware redundancy
– Sounds cool but extremely costly
24. High Availability and Fault Tolerance
– Cassandra allows you to:
• replace failed nodes with no downtime
• replicate data to multiple data centers to prevent
downtime [automatic]
25. Tunable Consistency
• Consistency: all reads return the most recently written
value
– Cassandra follows the “eventually consistent” model by
default.
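"Tunable" here means choosing how many replicas must acknowledge each read and write. A minimal sketch of the usual overlap rule (reads see the latest write when R + W > N) follows; the function names are hypothetical, and the three cases loosely correspond to reading and writing at consistency levels such as ONE, QUORUM and ALL.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1

def reads_see_latest_write(replication_factor, write_replicas, read_replicas):
    """True when every read set must overlap every write set (R + W > N)."""
    return read_replicas + write_replicas > replication_factor

N = 3  # number of copies kept of each row
# Write to one replica, read from one replica: fast, but only eventually consistent.
print(reads_see_latest_write(N, write_replicas=1, read_replicas=1))
# Write to a majority, read from a majority: the sets always overlap.
print(reads_see_latest_write(N, write_replicas=quorum(N), read_replicas=quorum(N)))
# Write to all replicas, read from any one of them.
print(reads_see_latest_write(N, write_replicas=N, read_replicas=1))
```

Lower settings favour latency and availability; higher settings favour consistency.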
26. But then!
• Amazon, Facebook, Google and Twitter all use this
model.
– DATA is their main sales item
– High performance!
27. Setting up Apache Cassandra
• From the DataStax Community project:
– www.datastax.com/download
• From the Apache Cassandra project:
– http://cassandra.apache.org/
Believe it.. It's easy to
install & set up!
28. Keyspace & Column Family creation
Column family 1
Key1    ColumnName1    ColumnName2
        Value          Value
Key2    ColumnName1    ColumnName2
        Value          Value
Key3    ColumnName1    ColumnName2    ColumnName3
        Value          Value          Value
Column family 2
Key1    ColumnName1    ColumnName2    ColumnName3
        Value          Value          Value
29. Data makes sense..
Column family: Close Friends
010051    Mail id          tweets
          Ramesh_Rajini    Hello
010052    Mail id          tweets
          Vinz_Raj         I'm logged in!
010053    Mail id          tweet1            tweet2
          Ragh_Rao         Hey, how r u ?    Movie..
Column family: Colleagues
020061    Mail id          City              Likes
          Puru_lal         Bangalore         Ladoos!
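A keyspace and column families like these could be created from Python roughly as follows. This is a hedged sketch using Pycassa's SystemManager, not a tested recipe: it assumes a single local node listening on the Thrift port 9160, a replication factor of 1, and the names "FriendsInfo", "closefriends" and "colleagues" (the last one is made up here for the Colleagues column family).

```python
def create_friends_schema(server="localhost:9160"):
    """Create the example keyspace and column families on a running node."""
    # Imported here so the function can be defined without pycassa installed.
    from pycassa.system_manager import SystemManager, SIMPLE_STRATEGY, UTF8_TYPE

    sys_mgr = SystemManager(server)
    try:
        # One copy of each row is enough for a single-node test setup.
        sys_mgr.create_keyspace("FriendsInfo", SIMPLE_STRATEGY,
                                {"replication_factor": "1"})
        for name in ("closefriends", "colleagues"):
            # Column names are compared and sorted as UTF-8 strings.
            sys_mgr.create_column_family("FriendsInfo", name,
                                         comparator_type=UTF8_TYPE)
    finally:
        sys_mgr.close()

def open_column_family(name="closefriends", server="localhost:9160"):
    """Return a ColumnFamily handle, like the col_fam object used with Pycassa."""
    from pycassa.pool import ConnectionPool
    from pycassa.columnfamily import ColumnFamily

    pool = ConnectionPool("FriendsInfo", [server])
    return ColumnFamily(pool, name)
```

Calling create_friends_schema() once and then col_fam = open_column_family() would give a handle to work with, provided a Cassandra node is actually reachable at the assumed address.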
30. Cassandra Data Structure
[Diagram: nested containers]
• Keyspace (Ex: Colony)
– Column family (Ex: UserIDs, EmpIDs)
• Column (Ex: Name, Address, Tweets, Likes, Skill Set)
– Each column holds a name, a value and a timestamp
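The column triple (name, value, timestamp) can be modelled in plain Python. This is an illustrative sketch only: the Column tuple and the reconcile helper are hypothetical names, the sample values are made up, and ties are resolved here simply by keeping the first argument.

```python
from collections import namedtuple

# A column is a (name, value, timestamp) triple.
Column = namedtuple("Column", ["name", "value", "timestamp"])

def reconcile(a, b):
    """Pick the winning version of the same column: the newer timestamp wins."""
    if a.name != b.name:
        raise ValueError("can only reconcile versions of the same column")
    return a if a.timestamp >= b.timestamp else b

old = Column("City", "Bangalore", timestamp=1000)
new = Column("City", "Mysore", timestamp=2000)   # a later write to the same column
print(reconcile(old, new).value)
```

The timestamp is what lets replicas that saw writes in a different order agree on the same final value.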
33. Multi-level Dictionary
{"FriendsInfo":                              # Keyspace
    {"closefriends":                         # Column family
        {"010053": OrderedDict([             # Row key
            ("MailId", "Ragh_Rao"),          # Columns: (column key, column value)
            ("tweet1", "Hey, how r u ?"),
            ("tweet2", "Movie..")]),
         ..                                  # further row keys, each an OrderedDict
        }}}
34. Can I insert in bulk?
• Yes, luckily, as a dictionary of rows..
col_fam.batch_insert(
    {'010054': {'Name': 'Vinayak', 'Id': '9308'},
     '010057': {'Name': 'Poorvi'}
    })
__________________________________
for i in range(1000, 1010):
    col_fam.insert('EmpIDs', {str(i): 'Hello'})
35. Is the data stored?
• With key, get all details:
col_fam.get('010052')
OrderedDict([('MailId', 'Vinz_Raj'), ('tweets', "I'm logged in!")])
• With key, get specific details:
col_fam.get('010053', columns=['MailId', 'tweet2'])
OrderedDict([('MailId', 'Ragh_Rao'), ('tweet2', 'Movie..')])
• Specifying start & end columns:
col_fam.get('EmpIDs', column_start='1002', column_finish='1006')
OrderedDict([('1002', 'Hello'), ('1003', 'Hello'), ('1004', 'Hello'),
('1005', 'Hello'), ('1006', 'Hello')])
36. Can the columns be sliced?
• Slicing in reverse order
col_fam.get('EmpIDs', column_reversed=True, column_count=3)
OrderedDict([('1009', 'Hello'), ('1008', 'Hello'), ('1007', 'Hello')])
• Fetching multiple rows
col_fam.multiget(['010053', '010051'])
OrderedDict(
[('010053',
OrderedDict([('MailId', 'Ragh_Rao'), ('tweet1', 'Hey, how r u ?'),
('tweet2', 'Movie..')])),
('010051',
OrderedDict([('MailId', 'Ramesh_Rajini'), ('tweets', 'Hello')]))])
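To see what the slicing parameters do without a running cluster, the behaviour can be emulated on a plain dict. This is an in-memory sketch, not Pycassa's implementation: slice_columns is a hypothetical helper, it assumes column names sort as plain strings, and the default count of 100 is an assumption modelled on get().

```python
from collections import OrderedDict

def slice_columns(row, column_start="", column_finish="",
                  column_reversed=False, column_count=100):
    """Emulate a column slice over a plain dict of column name -> value."""
    names = sorted(row, reverse=column_reversed)
    selected = []
    for name in names:
        if column_reversed:
            # In reversed order the slice runs from the high bound down to the low one.
            after_start = not column_start or name <= column_start
            before_finish = not column_finish or name >= column_finish
        else:
            after_start = not column_start or name >= column_start
            before_finish = not column_finish or name <= column_finish
        if after_start and before_finish:
            selected.append((name, row[name]))
    # Bounds are inclusive; the count limit is applied last.
    return OrderedDict(selected[:column_count])

emp_ids = dict((str(i), "Hello") for i in range(1000, 1010))
print(list(slice_columns(emp_ids, column_start="1002", column_finish="1006")))
print(list(slice_columns(emp_ids, column_reversed=True, column_count=3)))
```

The two calls select the same columns as the 'EmpIDs' examples on the slides: 1002 through 1006, and the last three columns in descending order.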
37. Counting..
• get_count()
Counts the number of columns in the row with the given key.
• multiget_count()
Performs a column count in parallel on a set of rows.
Takes the same parameters as get_count(), except that a list
of keys is passed instead of a single key.
A dictionary of the form {key: int} is returned.
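A small wrapper shows how the two calls fit together. This is a sketch: column_counts is a hypothetical helper, and FakeColumnFamily is a stand-in written only so the example runs without a cluster; with a real Pycassa ColumnFamily the same two method names would be called.

```python
def column_counts(col_fam, keys):
    """Return {key: number of columns} for the given row keys."""
    if len(keys) == 1:
        # Single row: get_count() returns a plain integer.
        return {keys[0]: col_fam.get_count(keys[0])}
    # Several rows: multiget_count() already returns {key: int}.
    return dict(col_fam.multiget_count(keys))

class FakeColumnFamily(object):
    """Stand-in with the same two methods, backed by a plain dict of rows."""
    def __init__(self, rows):
        self.rows = rows

    def get_count(self, key):
        return len(self.rows[key])

    def multiget_count(self, keys):
        return dict((key, len(self.rows[key])) for key in keys)

fake = FakeColumnFamily({
    "010051": {"MailId": "Ramesh_Rajini", "tweets": "Hello"},
    "010053": {"MailId": "Ragh_Rao", "tweet1": "Hey, how r u ?",
               "tweet2": "Movie.."},
})
print(column_counts(fake, ["010051"]))
print(column_counts(fake, ["010051", "010053"]))
```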
38. What Next?
• Explore more on Pycassa modules..
– http://pycassa.github.com/pycassa/api/index.html
• Start using it.. I'm sure you'll enjoy it because it is simply
superb!
39. Recap
• Need & Introduction to NoSQL DB
• Cassandra Introduction
• Data model creation
• Pycassa in action