This document provides an overview and introduction to non-relational (NoSQL) databases. It discusses some of the limitations of relational databases and why NoSQL databases were developed as an alternative. It describes different types of NoSQL databases, including key-value, document, columnar, and graph databases. Specific NoSQL database examples like HBase, Cassandra, Riak, MongoDB, and Neo4j are also mentioned.
2. I’M ERIC
I work at Airisdata and we are hiring!
http://airisdata.com
3. WHAT’S WRONG WITH
RELATIONAL DATABASES?
Nothing :)
Google & Amazon (followed by web tech)
Higher Performance
Larger Scale
Lower Cost
New Capabilities
4. BUT FIRST A POOR
METAPHOR
Cars!
What leads to better performance?
• Bigger engine, remove excess weight/features
• Better controls/steering/braking
9. SO WHAT IS NOSQL
‘UM, NON-RELATIONAL’
No good definitions to be found
For me:
Scales horizontally
Foregoes the ‘old school’ SQL relations, concurrency, etc.
“exactly like SQL (except where it’s not)”
Trades-in or reimagines most SQL features for ‘something else’
Developer friendly/developer driven
Schema loose / semi-structured
Usually Open Source and usually associated with web infrastructure
Ignoring older non-relational databases of the past
Scales Horizontally (usually) – did I mention that?
Can be ‘glued’ to other data stores
Don’t like mine; create your own definition :)
10. SIDEBAR OBSERVATION
ON SOFTWARE TEAMS
Software teams tied to a large central relational database (think 1990s/2000s)
The large relational database ‘glues’ teams and apps together
This leads to complex databases and dbadmins
Vs.
Software teams using NoSQL
Independent except at the edges (input/logs & output/reports)
11. FOWLER’S IMPEDANCE
MISMATCH
Java objects
vs.
rows in tables
What I have called Fowler’s Impedance is mentioned in his and Sadalage’s book NoSQL Distilled
Most of the NoSQL beasties can store data in more interesting ways
12. CAP
Here because management loves to chat endlessly about it.
C is for Consistency
“This is equivalent to requiring requests of the distributed shared memory to act as if they were executing on a single node, responding to operations one at a time.”
Most systems are not (exactly)
A is for Availability
“For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response. …even when severe network failures occur, every request must terminate.”
I think everyone here understands this one ;)
P is for Partition Tolerance
“In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another.”
Quotes from “Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services”
13. YOU CAN HAVE TWO
Consistency
The system may shut down or take a day to answer but you will have the correct answer.
Availability
The system will always answer; you might get your checking balance from last year instead of today’s balance but you will get an answer.
Like asking a research group or asking folks in the pub.
Can’t have both :(
One can accept the write not knowing if all the servers are up OR you can refuse until you know all the servers are up.
Partition Tolerance is mandatory in distributed systems!
https://codahale.com/you-cant-sacrifice-partition-tolerance/
14. ONLY TWO, THE FINE
PRINT
Only two at any moment in time :)
For some systems you can choose different pairs for each operation (Cassandra, Riak).
15. WHY WOULD ANYONE
BE INCONSISTENT?
Speed while highly concurrent
“good now is better than perfect later”
i.e. don’t block
Handling “partition cases” i.e. part of the system/network is down!
16. DB CHEMISTRY – MORE
BUZZ
Is it ACID or BASE?
Atomicity, Consistency, Isolation, Durability
Basically Available, Soft-state, Eventually consistent
See “The Transaction Concept: Virtues and Limitations” by Jim Gray
21. KEY-VALUE STORES
Designed for
Speed (even memory-only)
High load
Global data model of key-values (surprise!)
Ring partition and replication
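A minimal sketch of what "ring partition and replication" can look like, assuming a toy consistent-hashing ring. The class name HashRing, the node names, the use of MD5, and the replica count of 3 are illustrative assumptions, not taken from any particular product.

```python
import bisect
import hashlib


def ring_position(value: str) -> int:
    """Map a string to a position on the ring (a large integer)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy ring: nodes sit at hashed positions; keys go to the next nodes clockwise."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas
        # Sorted (position, node) pairs make up the ring.
        self.ring = sorted((ring_position(n), n) for n in nodes)

    def preference_list(self, key: str):
        """Return the nodes that should hold copies of this key."""
        positions = [pos for pos, _ in self.ring]
        start = bisect.bisect_right(positions, ring_position(key))
        count = min(self.replicas, len(self.ring))
        # Walk clockwise from the key's position, wrapping around the ring.
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(count)]


ring = HashRing(["node-a", "node-b", "node-c", "node-d"], replicas=3)
owners = ring.preference_list("user:42")
print(owners)  # three distinct nodes; the first is the primary owner
```

Adding or removing a node only moves the keys on the neighbouring arcs, which is why this layout suits clusters that grow and shrink.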
23. DOCUMENT STORES
Similar to key-value but the value is a document!
Document is stored in JSON (or similar)
Flexible schema
Some support keys/references/indices
{
"date": [2016, 4, 1],
"booktitle": "Hitchhiker's Guide to the Galaxy",
"author": "Douglas Adams"
}
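To make the "flexible schema" point concrete, here is a small sketch of an in-memory document store. The DocumentStore class, its method names, and the keys "book:1" and "book:2" are hypothetical; real document databases answer such queries through indexes rather than the full scan used here.

```python
import json


class DocumentStore:
    """Toy in-memory document store: key -> JSON text."""

    def __init__(self):
        self.docs = {}

    def put(self, key, document):
        # Documents are stored as JSON; no fixed schema is enforced.
        self.docs[key] = json.dumps(document)

    def get(self, key):
        return json.loads(self.docs[key])

    def find(self, field, value):
        """Return every document whose field equals value (full scan)."""
        matches = []
        for key in sorted(self.docs):
            doc = self.get(key)
            if doc.get(field) == value:
                matches.append(doc)
        return matches


store = DocumentStore()
store.put("book:1", {
    "date": [2016, 4, 1],
    "booktitle": "Hitchhiker's Guide to the Galaxy",
    "author": "Douglas Adams",
})
# A second document with a different set of fields is accepted as-is.
store.put("book:2", {
    "booktitle": "Dirk Gently's Holistic Detective Agency",
    "author": "Douglas Adams",
    "series": "Dirk Gently",
})

by_author = store.find("author", "Douglas Adams")
by_series = store.find("series", "Dirk Gently")
print(len(by_author), len(by_series))
```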
25. GRAPH DATABASES
Remember your data structures class in college?
Edges and vertices – both can hold data
Reduces tough SQL queries to simple graph queries
Easier to model – ‘matches the whiteboard’
Relationships between vertices are first class
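A rough sketch of the idea that vertices and edges both hold data and that relationships are first class. The Graph class, the "KNOWS" label, and the people in it are made up for illustration; this is not the Neo4j API or the Cypher language.

```python
from collections import defaultdict


class Graph:
    """Toy property graph: vertices and edges both carry properties."""

    def __init__(self):
        self.vertices = {}              # vertex id -> properties
        self.edges = defaultdict(list)  # source id -> [(label, target id, properties)]

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, src, label, dst, **props):
        self.edges[src].append((label, dst, props))

    def neighbors(self, vid, label):
        """Follow outgoing edges with the given label."""
        return [dst for (lbl, dst, _) in self.edges[vid] if lbl == label]

    def friends_of_friends(self, vid):
        """Two hops along KNOWS edges, minus direct friends and the start vertex."""
        direct = set(self.neighbors(vid, "KNOWS"))
        result = set()
        for friend in direct:
            result.update(self.neighbors(friend, "KNOWS"))
        return result - direct - {vid}


g = Graph()
for name in ("alice", "bob", "carol"):
    g.add_vertex(name, kind="person")
g.add_edge("alice", "KNOWS", "bob", since=2010)
g.add_edge("bob", "KNOWS", "alice", since=2010)
g.add_edge("bob", "KNOWS", "carol", since=2015)

fof = g.friends_of_friends("alice")
print(fof)
```

In a relational schema the same question usually needs a self-join on a friendship table for every extra hop; here each hop is one more walk along the edges.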
28. SITS ON TOP OF HDFS
Name nodes
Data nodes
Replication
And the rest of that whole megillah
29. Column-oriented
Handles ‘wide’ ‘sparse’ tables well
Fault tolerant
Supports Java, REST, Avro and Thrift
All operations are atomic at the row level (via write-ahead logs)
30. KIND OF SQL
Key – values
Keys are arbitrary strings
Values are an entire row of data
No joins
Apache Phoenix
JDBC interface
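The row model described on the last two slides (arbitrary string keys, each value an entire row, wide and sparse columns) can be pictured with a toy structure like the one below. SparseTable, the row keys, and the "family:qualifier" column names are illustrative assumptions; this is not the HBase client API.

```python
from collections import defaultdict


class SparseTable:
    """Toy wide, sparse table: row key -> only the columns that row actually has."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, columns):
        # Only the supplied columns are stored; absent columns cost nothing.
        self.rows[row_key].update(columns)

    def get(self, row_key):
        """Fetch a whole row by key; there are no joins, just key lookups."""
        return dict(self.rows.get(row_key, {}))


table = SparseTable()
table.put("user#1", {"info:name": "Ada"})
table.put("user#2", {"info:name": "Bob", "stats:logins": 7})

print(table.get("user#1"))
print(table.get("user#2"))
```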
34. CAP WITH QUORUMS
KNOB TWEAKING
Symmetric / peer to peer
Linearly scalable
Replication
Eventual consistency
Partitioning
35. CAP WITH QUORUMS
KNOB TWEAKING
Some systems choose per event!
Three knobs:
replication amount,
how many successful writes == ‘your write to the database is done!’,
how many successful reads out of a full set == ‘here is your data’
The higher the values, the longer the wait...
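The three knobs are commonly written N (replicas), W (write acknowledgements), and R (read responses) in Dynamo-style systems. A small sketch, assuming those names, of the usual rule of thumb: when R + W > N, every read set overlaps every write set, so a read sees at least one up-to-date copy. The function name and the example settings are hypothetical.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True when any R replicas must share at least one node with any W replicas."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n


settings = {
    "fast and loose": (3, 1, 1),
    "quorum reads and writes": (3, 2, 2),
    "write all, read one": (3, 3, 1),
}
for name, (n, w, r) in settings.items():
    verdict = "overlap guaranteed" if quorum_overlaps(n, w, r) else "stale reads possible"
    print(f"{name}: N={n} W={w} R={r} -> {verdict}")
```

Lower values answer faster and stay available through more failures; higher values wait on more nodes before answering.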
39. RIAK
simple interface, high write-availability, linear scaling
REST API via HTTP – PUT, GET, DELETE, POST, etc.
Or Protobufs for quicker serialized data
‘hundreds of nodes’
40. DISTRIBUTED
Consistent hashing, vector clocks, sloppy quorums, virtual nodes (not machines but lightweight processes; more like having eggs in many baskets, which makes it easier to hand the eggs to other folks during a failure), hinted handoff (“please pass along”), replication.
Request -> riak
|
<- ask other nodes ->
| |
virt node -> virt node ->
| |
data store data store
And then return answers back up the stack
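Of the techniques listed on this slide, vector clocks are the least self-explanatory, so here is a minimal sketch of how they tell "newer" from "conflicting" versions. The function names and the node ids "x", "y", "z" are made up; this is not Riak's internal representation.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter bumped."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if version a has seen every update that version b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a: dict, b: dict) -> str:
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"  # concurrent writes: neither has seen the other


def merge(a: dict, b: dict) -> dict:
    """Clock for a value that reconciles both versions."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in set(a) | set(b)}


v1 = increment({}, "x")    # first write, coordinated by node x
va = increment(v1, "y")    # update handled by node y
vb = increment(v1, "z")    # concurrent update handled by node z

print(compare(va, v1))
print(compare(va, vb))
print(merge(va, vb))
```

When the comparison says "conflict", the store can hand both versions back to the application to reconcile, then write the result with the merged clock.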
42. KEYS AND BUCKETS
Riak can create keys automatically (and return the key to you)
http://SERVER:PORT/riak/BUCKET/KEY
http://SERVER:PORT/riak/BUCKET?keys=true
^ gets all the keys
http://SERVER:PORT/riak/BUCKET?keys=stream
^ better for huge sets of data
You can store your code in a bucket!
43. LINKS
curl blah -H 'Link: </riak/BUCKET/KEY>; riaktag="tagname"'
Link walking
^ can create other structures
44. HOMEWORK AND OTHER
READINGS
GENERAL
Brewer’s conjecture
https://www.comp.nus.edu.sg/~gilbert/pubs/BrewersConjecture-SigAct.pdf
Vogels’ thoughts on eventual consistency: “Eventually Consistent”
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
Old school techniques for “almost perfect” systems: “The Transaction Concept: Virtues and Limitations” by Jim Gray
http://research.microsoft.com/en-us/um/people/gray/papers/theTransactionConcept.pdf
ACID defined: Haerder and Reuter “Principles of Transaction-Oriented Database Recovery”
http://www.minet.uni-jena.de/dbis/lehre/ws2005/dbs1/HaerderReuter83.p
All your base: Dan Pritchett “Base: An Acid Alternative”
http://queue.acm.org/detail.cfm?id=1394128
NoSQL Distilled by Sadalage and Fowler
Seven Databases in Seven Weeks by Redmond and Wilson
45. HOMEWORK AND OTHER
READINGS CONT’D
Google’s Bigtable
http://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
HBase: The Definitive Guide by Lars George
HBase in Action by Dimiduk and Khurana
Hadoop: The Definitive Guide by Tom White
46. HOMEWORK AND OTHER
READINGS CONT’D
• A Little Riak Book by Eric Redmond
– http://www.littleriakbook.com/
• Nice video on system details on Safari by Justin Sheehy
– https://www.safaribooksonline.com/library/view/riak-core/9781449306144/part00.html?autoStart=True
• Riak Handbook
– http://www.riakhandbook.com/
47. READINGS FOR GRAPHS
Graph Databases by Robinson, Webber and Eifrem
Mostly about Neo4j, uses Cypher throughout
What improves auto performance? Bigger engine; less weight
Which leads to better brakes, steering, etc. <- better tools to manage
And better safety systems.
Which leads to a vehicle that requires more support
A Formula 1 car uses 18000 liters of air per minute (you use 25 liters of air per minute to move a bicycle)