The document provides an overview of NoSQL databases and cloud computing, discussing various NoSQL databases like MongoDB, CouchDB, HBase, Cassandra, and Redis. It covers concepts like CAP theorem, data structures, use cases, and common architectures. The document also discusses hybrid solutions used by companies like Twitter, Google, and Facebook that combine SQL and NoSQL databases.
1. A Walk down NOSQL Lane in the Cloud
New York City Cloud Computing Group
February 2011
Alexander Sicular
@siculars
2. Who is this blowhard?
Columbia University pays my mortgage
For the better part of a decade in Medical Informatics
Am not shilling for any of these companies
Am not a computer scientist
Am a computer science enthusiast
particularly in the area of Informatics
3. When I put my data in the “cloud”, to me it just means that it’s virtualized in someone else’s server room
4. ...the Silver Lining
Many, many providers and only growing
Amazon, Rackspace, Joyent, CouchOne,
Cloudant, Azure, GAE, Heroku, no.de
Outsourced management
Zero capex
Controlled costs
5. ...With a Chance of Rain?
Vendor lock in
Unreliable performance
i/o
cpu, memory
Bare metal > software virtualization
6. NoSQL or NOSQL?
Not Only SQL
Non/post relational
Big tent policy
Umbrella term
Fragmented
http://www.flickr.com/photos/morgennebel/2933723145/
7. Your Usage Patterns
Read vs. Write
Mutable vs. Immutable
Product Considerations:
In place updates
Write Only Logs
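To make the "in place updates" vs. "write only logs" distinction concrete, here is a minimal sketch of the log side. It is a toy, not how any product named in this deck is implemented; the class name `AppendOnlyStore`, the JSON-lines file format, and the `demo` helper are all assumptions made for illustration.

```python
import json
import os
import tempfile


class AppendOnlyStore:
    """Toy key-value store: writes are only ever appended to a log file."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> latest value, rebuilt from the log on open
        if os.path.exists(path):
            self._replay()

    def _replay(self):
        # Later records win, so replaying in order yields the current state.
        with open(self.path, "r", encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.index[record["key"]] = record["value"]

    def put(self, key, value):
        # Never seek or overwrite: a new version goes at the end of the file.
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.index[key] = value

    def get(self, key):
        return self.index.get(key)


def demo():
    path = os.path.join(tempfile.mkdtemp(), "store.log")
    store = AppendOnlyStore(path)
    store.put("user:1", {"name": "Ada"})
    store.put("user:1", {"name": "Ada L."})  # an "update" is another append
    reopened = AppendOnlyStore(path)  # state is recovered by replaying the log
    with open(path, encoding="utf-8") as log:
        line_count = sum(1 for _ in log)
    return reopened.get("user:1"), line_count
```

After `demo()` the log holds two lines for one key: the old version is still on disk, which is why log-structured stores need some form of compaction, and why they tend to suit write-heavy or immutable data better than data that is rewritten constantly.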
8. This vs. That
Riak wiki comparisons page
http://wiki.basho.com/Riak-Comparisons.html
Popular one page comparison of a number of
NOSQL players by Kristof Kovacs:
http://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis
9. NOSQL concepts are Not Brand New
Memcached since 2003 http://memcached.org
Google papers 2004-2006
Amazon Dynamo 2007
Consistent Hashing 2007 http://www.last.fm/user/RJ/journal/2007/04/10/rz_libketama_-_a_consistent_hashing_algo_for_memcache_clients
Using relational systems as a key-value blob store
2009 FriendFeed (not the first) http://bret.appspot.com/entry/how-friendfeed-uses-mysql
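A rough sketch of the "relational system as a key-value blob store" idea, using SQLite from the Python standard library so it runs anywhere. This is not FriendFeed's schema; the table name `entities`, the column names, and the JSON encoding are assumptions chosen for illustration.

```python
import json
import sqlite3


def open_store():
    # In-memory database for the demo; one table, one opaque blob per key.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE entities (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    return conn


def put(conn, key, doc):
    # The database never looks inside the document; it only stores the blob.
    conn.execute(
        "INSERT OR REPLACE INTO entities (id, body) VALUES (?, ?)",
        (key, json.dumps(doc)),
    )
    conn.commit()


def get(conn, key):
    row = conn.execute("SELECT body FROM entities WHERE id = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None
```

Two documents with different fields can sit in the same table without a schema change, which is the attraction. The cost is that the database can no longer filter or index on anything inside `body`.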
10. Why NOSQL
Support for “Very Large” data sets
Schemaless
Denormalized
Green field
New applications
http://www.flickr.com/photos/gailtang/1243984297/
11. Academia
Google:
Bigtable http://labs.google.com/papers/bigtable.html
GFS http://labs.google.com/papers/gfs.html
M/R http://labs.google.com/papers/mapreduce.html
Amazon:
Dynamo http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
NOSQL Summer http://nosqlsummer.org/papers
12. Under the Hood
Terminology
Write Only Log http://en.wikipedia.org/wiki/Log-structured_file_system
Merkle Trees http://en.wikipedia.org/wiki/Hash_tree
B-trees http://en.wikipedia.org/wiki/B-tree
Vector clock http://en.wikipedia.org/wiki/Vector_clock
Bloom filters http://en.wikipedia.org/wiki/Bloom_filters
Big O Notation http://en.wikipedia.org/wiki/Big_o_notation
Consistent Hashing http://en.wikipedia.org/wiki/Consistent_hashing
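Of the terms above, consistent hashing is the easiest to show in a few lines. The sketch below is a generic hash ring with virtual nodes, not the libketama code linked earlier or any specific database's implementation; the class name `HashRing`, the use of MD5, and the default of 100 virtual nodes per server are assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas  # virtual nodes per real node
        self._keys = []           # sorted positions on the ring
        self._owners = {}         # position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            if pos in self._owners:  # ignore the (very unlikely) collision
                continue
            self._owners[pos] = node
            bisect.insort(self._keys, pos)

    def remove(self, node):
        for i in range(self.replicas):
            pos = self._hash(f"{node}#{i}")
            if self._owners.get(pos) == node:
                del self._owners[pos]
                self._keys.remove(pos)

    def lookup(self, key):
        if not self._keys:
            raise LookupError("ring is empty")
        # First position clockwise from the key's hash, wrapping at the end.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._owners[self._keys[idx]]
```

The property worth noticing: when a node is added, the only keys that change owner are the ones that move to the new node. With a plain `hash(key) % n` scheme, changing `n` reshuffles almost everything.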
16. MongoDB
10Gen, MongoHQ, MongoLab
C++
huMONGOus
Sharded scaling, replicated master/slave
Located in NYC (go visit them)
Soft landing for those coming from mysql (relational databases)
Native javascript
Secondary indexes
20. Hadoop
Cloudera, Apache Foundation
Java
High latency
Batch oriented
HDFS is GFS based
Open source Google stack via the Google papers
Huge ecosystem
Yahoo, FB, Twitter, Fortune 500
Pig, Hive, Flume
21. HBase
Java
Low latency store
sits on top of Hadoop
Modeled after Google Bigtable
Column oriented
Thrift, protobuf
Backend for new Facebook Messaging service
22. Cassandra
Apache
Java
Column oriented
Like Bigtable and Dynamo
Originated at Facebook
At Twitter, Distributed counting
http://www.infoq.com/presentations/NoSQL-at-Twitter-by-Ryan-King
http://www.slideshare.net/kevinweil/rainbird-realtime-analytics-at-twitter-strata-2011
23. Redis
OpenRedis
C
REmote DIctionary Server
Specific data structures
incredibly fast
memcached on steroids
replicated master/slave
24. Commonalities
Open Source
Adherence to common or standard:
data formats
json, bson, utf8, binary
data transport mechanisms
http, thrift, protobuf,
simple wire protocols
25. Ok. So Now What?
Analyze your requirements
Mailing lists
IRC, twitter
Project pages, wiki
Github/Google Code/Bitbucket:
project page
specific language clients
26. Variety Pack
Hybrid architectures will become the norm
Twitter - mysql, cassandra, hadoop
Google - mysql, GAE (BT)
Facebook - mysql,
cassandra, hbase,
memcached
Yahoo - mysql, hadoop
LinkedIn - voldemort
http://www.flickr.com/photos/uncleweed/82245324/