The Hong Kong Big Data community had a guest speaker at our Tuesday, 18 February meeting. Chris Yuen from Demyst Data discussed his experience with three NoSQL solutions: Cassandra, MongoDB, and HBase. For more information see http://www.infoincog.com/hong-kong-big-data-meeting-tuesday-18-february/.
3. Overview
Introduction
Motivation for NoSQL
The NoSQL landscape
Experience sharing
HBase
MongoDB
Cassandra
Tying it up – how much does it really matter?
8. Motivation
Too much data – the need to “scale out”
CAP theorem
Performance
RDBMS joins are slow
Denormalization
Key value data store
Alternative data representation
Schemaless “No SQL”
Document data store
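As a rough illustration of the denormalization and document-store bullets above, the following Python sketch contrasts a join across two normalized "tables" with a single key lookup on a denormalized document. The names `users`, `orders` and `user_docs` and all of the data are made up for illustration; they do not come from the talk.

```python
# Normalized, relational-style layout: two "tables" linked by user_id.
# (Hypothetical data, for illustration only.)
users = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
orders = [
    {"order_id": 100, "user_id": 1, "item": "keyboard"},
    {"order_id": 101, "user_id": 1, "item": "mouse"},
    {"order_id": 102, "user_id": 2, "item": "monitor"},
]


def orders_for_user_joined(user_id):
    """Emulate a join: scan the orders table and match on user_id."""
    name = users[user_id]["name"]
    return [(name, o["item"]) for o in orders if o["user_id"] == user_id]


def denormalize(users, orders):
    """Build one self-contained document per user, keyed by user_id."""
    docs = {uid: {"name": u["name"], "orders": []} for uid, u in users.items()}
    for o in orders:
        docs[o["user_id"]]["orders"].append(o["item"])
    return docs


# Document-style layout: everything needed for a read sits under one key.
user_docs = denormalize(users, orders)


def orders_for_user_doc(user_id):
    """Single key lookup: no join is needed at read time."""
    doc = user_docs[user_id]
    return [(doc["name"], item) for item in doc["orders"]]
```

Both functions return the same rows; the difference is that the document version pays the cost of combining the data once, at write time, rather than on every read.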
9. HBase
Builds on top of HDFS
Consistent “big-data” database
Automatically scales out
13. HBase
A nightmare to set up and maintain
Depends on Hadoop, HDFS, Zookeeper
No secondary index
“Table” alteration requires downtime
Not spectacular latency for OLTP usage
19. MongoDB
… but it got ugly pretty fast
Devil’s in the details
Replica set management fiasco
Sharding is difficult to set up and poorly implemented
https://github.com/kizzx2/mongolab
24. Cassandra
Column Family data store
More “NoSQL” than MongoDB, with fewer features
Column data store – strictly key/value query
25. Cassandra
Auto-sharding just works
Replica set requires 0 configuration
Append only, LSM-tree based storage format
Good for SSD
High insert throughput
For storing analytic data
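To make the "append only, LSM-tree based" bullet more concrete, here is a toy sketch of the idea in Python: writes go to an in-memory table, which is flushed as an immutable sorted run once it fills up, and reads check the newest data first. The class `TinyLSM` and its `memtable_limit` parameter are hypothetical names for illustration. This is not how Cassandra is implemented; a real engine also has a commit log, compaction, and tombstones for deletes.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: an in-memory memtable plus immutable sorted runs."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # flush threshold (assumption)
        self.memtable = {}
        self.sstables = []  # list of (sorted_keys, values) runs, oldest first

    def put(self, key, value):
        """Writes only touch memory; existing runs are never modified."""
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        """Append the memtable as a new sorted run and start a fresh one."""
        items = sorted(self.memtable.items())
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        self.sstables.append((keys, values))
        self.memtable = {}

    def get(self, key):
        """Check the memtable, then runs from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for keys, values in reversed(self.sstables):
            i = bisect.bisect_left(keys, key)  # binary search in a sorted run
            if i < len(keys) and keys[i] == key:
                return values[i]
        return None
```

Because a flush is a single sequential write and nothing is updated in place, inserts stay cheap, which is consistent with the high insert throughput noted above. The trade-off is that a read may have to consult several runs.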
26. Cassandra
Has rudimentary support for secondary index
Difficult to do table scan or range scan
Requires a substantial application / paradigm shift
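One likely reason range scans are awkward is that hash-based partitioning, which Cassandra uses by default, places rows by a hash of the key rather than by key order, so neighbouring keys can land on different nodes. The Python sketch below illustrates the effect under simplifying assumptions: the node names are invented, MD5 stands in for the real partitioner's hash function, and a plain modulo replaces the token ring.

```python
import hashlib

# Hypothetical three-node cluster, for illustration only.
NODES = ["node-0", "node-1", "node-2"]


def node_for(key):
    """Pick a node from a hash of the key (stand-in for a real partitioner)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


def place(keys):
    """Group keys by the node that would own them."""
    placement = {node: [] for node in NODES}
    for key in keys:
        placement[node_for(key)].append(key)
    return placement


def nodes_touched_by_range(keys, start, end):
    """A range query must contact every node that holds a matching key."""
    return {node_for(k) for k in keys if start <= k <= end}


if __name__ == "__main__":
    keys = ["user:%03d" % i for i in range(12)]
    for node, owned in place(keys).items():
        print(node, owned)
    print(nodes_touched_by_range(keys, "user:000", "user:005"))
```

A lookup by exact key always goes to exactly one node, which is why key/value access scales well. A range over adjacent keys, by contrast, may fan out to many or all nodes, so applications are usually redesigned around known keys rather than scans.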
27. Real World Implications
Why does NoSQL matter to Big Data?
Schemaless storage model
Performance
Scalability
Rapidly incorporate unstructured new data sources without extensive planning
28. How to Choose
Maintenance / Scalability
Supported operations
OLAP vs. OLTP