There is a lot of confusion out there about the various kinds of NoSQL and NewSQL technologies: document stores, graph databases, columnar databases, and the list goes on. This confusion has led to a good deal of less-than-optimal deployments, pain, and, ultimately, antipathy.
In this talk, Dan will walk us through a high-level explanation of the various NoSQL technologies available to us, how they work, and provide some dos and don'ts for their implementation.
27. NoSQL solves specific problems
Horizontal scalability
Availability
Schema updates
Performance
Data gets very large
Data gets very wide
High volatility
CPU-bound operations
28. “NoSQL” is not a monolith
Key-value
Document
Graph
Inverted Index
Object
RDF (triplestore/quadstore)
Columnar
29. Key-value
Description: Data is modeled as key-value pairs.
Pros:
• Simplicity
• In-memory
• Flexibility
• Easy to partition
Cons:
• Limited query capabilities
• Limited ability to build out complex data relationships
Examples:
• Redis
• Couchbase
• Membase
• DynamoDB
• Riak
30. Document
Description: Data is represented as a “document,” and is serialized in a hierarchical data format.
Pros:
• Flexibility
Cons:
• Limited ability to build out complex data relationships
Examples:
• ArangoDB
• Couchbase
• CouchDB
• DynamoDB
• ElasticSearch
• MongoDB
• RethinkDB
• Riak
31. Graph
Description: Based on graph theory. Data is represented as edges and vertices.
Pros:
• Data relationships are stored with the data itself at the logical level
Cons:
• Complexity
Examples:
• Apache Giraph
• ArangoDB
• BlazeGraph
• InfiniteGraph
• Neo4j
32. Inverted Index
Description: An index that builds relationships between the contents of documents. Full-text indexes, and probability searches.
Pros:
• Flexibility
• Robust querying
Cons:
• Mutations are slow
• Requires a lot of storage
• Probability searches may not be what you want.
Examples:
• ElasticSearch
• Solr
33. Object
Description: Models data as you would model it in an object-oriented language.
Pros:
• Automatic schema updates
• Simplicity
• Lightweight
Cons:
• Lack of standards
• Lack of tooling
Examples:
• db4o
• JADE
• ObjectDB
• Perst
• Zope
34. RDF (triplestore/quadstore)
Description: A type of graph database where vertices and edges are represented as semantic expressions.
Pros:
• Extremely scalable
• Engines are incredibly fast
Cons:
• Complexity
Examples:
• AllegroGraph
• BlazeGraph
• MarkLogic
• Oracle NoSQL
35. Columnar
Description: Tabular data is stored by column instead of by row. A single table typically consists of many files.
Pros:
• Supports wide tables (100s of columns)
• Aggregates on peta-scale data are fast
Cons:
• Write performance
• Complexity
• Mutations are a no-no
• Cannot query by row
Examples:
• Accumulo
• Cassandra
• Druid
• HBase
• Vertica
The goal of this talk is to provide a definition for the term “NoSQL,” and give a high-level overview of the various kinds of NoSQL databases.
What NoSQL is and is not has caused confusion.
There are a lot of options.
NoSQL is not a monolith.
The name implies that these are databases that do not use SQL to query data.
Well, then how the heck are we getting data? There are a number of alternatives.
Queries are objects built using a domain specific language.
Different keys and structures result in different filter criteria and logical conjunctions/disjunctions.
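To make "queries are objects" a little more concrete, here is a minimal sketch in Python. It evaluates a small, made-up query DSL against in-memory records. The operator names (`$and`, `$or`, `$gt`) are loosely modeled on the style document stores use, but the `matches` function and the sample data are hypothetical and are not any particular database's API.

```python
from typing import Any


def matches(record: dict[str, Any], query: dict[str, Any]) -> bool:
    """Return True if the record satisfies every clause in the query object."""
    for key, condition in query.items():
        if key == "$and":
            # Logical conjunction: every sub-query must match.
            if not all(matches(record, sub) for sub in condition):
                return False
        elif key == "$or":
            # Logical disjunction: at least one sub-query must match.
            if not any(matches(record, sub) for sub in condition):
                return False
        elif isinstance(condition, dict) and "$gt" in condition:
            # Filter criterion: field must exist and be greater than the bound.
            if key not in record or not record[key] > condition["$gt"]:
                return False
        elif record.get(key) != condition:
            # Plain key/value pair: simple equality filter.
            return False
    return True


people = [
    {"name": "Bob", "age": 35, "city": "Denver"},
    {"name": "Fred", "age": 28, "city": "Austin"},
    {"name": "Sarah", "age": 41, "city": "Denver"},
]

query = {
    "$or": [
        {"city": "Austin"},
        {"$and": [{"city": "Denver"}, {"age": {"$gt": 40}}]},
    ]
}
result = [person["name"] for person in people if matches(person, query)]
```

The point is only that the shape of the object (which keys are present and how they nest) determines the filters and the way they are combined.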
Gigantic hashtables.
Simple hash lookup.
Completely different languages for expressing queries.
And of course there’s…SQL.
But it’s called “NoSQL”!
Are we just being trolled here?
As it turns out, SQL is a pretty good tool.
It makes it easy to build highly-complex queries and database mutations.
Collectively, there is a lot of knowledge about SQL within the development community.
The term “NoSQL” has morphed from “there is no SQL” to “not only SQL.”
We have this broad classification of databases.
That are NoSQL technologies.
But also use SQL.
It’s a bit confusing.
What’s the difference between these and something like MySQL or PostgreSQL?
What we start to find is that many of these technologies approach the modeling of data in dramatically different ways.
There is often no construct of a “relationship” as a first-class citizen in the database.
Not exactly.
The primitives for defining objects in a graph database are called vertices and edges.
Edges are the relationship between vertices.
Really, NoSQL is what it is not.
That is, a database that is not an RDBMS.
Perhaps at this point you may be saying, “yo, Dan, imma let you finish, but PostgreSQL is the greatest database of all time!”
If you find that an RDBMS solves your problems, then use it.
NoSQL is not an “alternative” to using an RDBMS.
Nor is NoSQL a panacea for data storage.
The rise of these NoSQL database technologies has been about solving specific problems.
Generally, scaling a traditional RDBMS means adding more disks, CPUs, and memory.
This can get costly, and has limitations.
Being able to simply add new nodes to a cluster to scale out a single database instance is compelling.
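One simple way to picture scaling out is to hash each key to decide which node owns it. The sketch below assumes plain modulo partitioning, with made-up node names and keys. It is an illustration only: real systems tend to use schemes such as consistent hashing, because with plain modulo most keys change owner whenever a node is added or removed.

```python
import hashlib


def node_for_key(key: str, nodes: list[str]) -> str:
    """Pick the owning node by hashing the key (simple modulo partitioning)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as a stable integer.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


nodes = ["node-a", "node-b", "node-c"]  # hypothetical cluster members
keys = ["user:1", "user:2", "user:3", "user:4"]
placement = {key: node_for_key(key, nodes) for key in keys}
```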
With RDBMS, if our database goes down, we have to fail over to a replica.
Some NoSQL databases approach this problem by running hot-hot replicas. Traffic is simply routed away from an unhealthy node.
If you have to change a database table, this can be a complicated exercise.
Especially if you need to automate this as part of your continuous deployment pipeline.
Some NoSQL technologies approach this by going “schemaless.”
Generally, RDBMS databases are pretty fast.
There are situations where they start to break down.
Indexes begin to get difficult to manage and query.
Many RDBMSs have a number of optimizations built into them that help them store data efficiently.
Often the trade-off is a limit on the number of columns in a table.
Lots of data mutations means lots of recomputing of indexes, and index fragmentation.
CPUs tend to be the most limited resource on a server.
Lots of CPU-intensive operations can result in the CPU getting pegged, which means your database is no longer available.
Again, NoSQL databases solve specific problems.
Consequently, there are a number of types of databases that often get associated with the “NoSQL” moniker.
The different types are defined by the various ways a database will internally model data.
They all have their pros and cons.
Let’s take a deeper look.
Giant hashtable.
Often in-memory.
Typically used for caching.
Couchbase has additional methods of querying.
Redis has data structures.
DynamoDB has range keys.
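As a rough sketch of the "giant hashtable plus range keys" idea: a hash key selects a partition, and a range key keeps the items inside that partition sorted so they can be scanned in order. The `RangeKeyStore` class and its method names below are hypothetical and written only for illustration; this is not DynamoDB's API.

```python
import bisect
from collections import defaultdict


class RangeKeyStore:
    """Toy store: a hash key selects a partition, a range key orders items in it."""

    def __init__(self) -> None:
        self._partitions: dict[str, list[tuple[str, str]]] = defaultdict(list)

    def put(self, hash_key: str, range_key: str, value: str) -> None:
        items = self._partitions[hash_key]
        # (range_key,) sorts just before any (range_key, value) pair.
        index = bisect.bisect_left(items, (range_key,))
        if index < len(items) and items[index][0] == range_key:
            items[index] = (range_key, value)  # overwrite an existing item
        else:
            items.insert(index, (range_key, value))

    def query(self, hash_key: str, start: str, end: str) -> list[str]:
        """Return values whose range key falls within [start, end]."""
        items = self._partitions.get(hash_key, [])
        first = bisect.bisect_left(items, (start,))
        return [value for key, value in items[first:] if key <= end]


store = RangeKeyStore()
store.put("sensor-1", "2024-01-03", "c")
store.put("sensor-1", "2024-01-01", "a")
store.put("sensor-1", "2024-01-02", "b")
store.put("sensor-2", "2024-01-01", "x")
first_two_days = store.query("sensor-1", "2024-01-01", "2024-01-02")
```

Lookups by hash key stay a simple hash lookup; the range key only adds ordering within one partition.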
Typically requires a fair amount of denormalization.
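Here is a small sketch of what that denormalization can look like, assuming a made-up orders-and-customers data set. Instead of an order row pointing at a customer row, the customer data is copied into the order document, which is then serialized in a hierarchical format (JSON here).

```python
import json

# Normalized, relational-style data (hypothetical sample values).
customers = {1: {"name": "Bob", "city": "Denver"}}
orders = [{"order_id": 100, "customer_id": 1, "total": 42.5}]


def to_document(order: dict, customers: dict) -> dict:
    """Embed a copy of the customer inside the order document."""
    customer = customers[order["customer_id"]]
    return {
        "order_id": order["order_id"],
        "total": order["total"],
        "customer": dict(customer),  # duplicated data instead of a foreign key
    }


document = to_document(orders[0], customers)
serialized = json.dumps(document)
```

The trade-off is that if Bob moves, every document that embeds his city has to be updated.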
The advantage of a graph database over an RDBMS is the ability to represent and query deeply-nested hierarchies.
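To illustrate the hierarchy point, the sketch below stores a made-up org chart as vertices and edges and walks it breadth-first to find everyone below a given vertex, however deep the hierarchy goes. The data and function names are hypothetical; a real graph database would express this in its own query language.

```python
from collections import deque

# Each edge is (source vertex, target vertex); the relationship is "manages".
edges = [
    ("ceo", "vp_eng"),
    ("ceo", "vp_sales"),
    ("vp_eng", "eng_manager"),
    ("eng_manager", "engineer"),
]


def build_adjacency(edges: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group outgoing edges by their source vertex."""
    adjacency: dict[str, list[str]] = {}
    for source, target in edges:
        adjacency.setdefault(source, []).append(target)
    return adjacency


def descendants(adjacency: dict[str, list[str]], start: str) -> list[str]:
    """Breadth-first traversal returning every vertex reachable from start."""
    seen = {start}
    order: list[str] = []
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        for neighbor in adjacency.get(vertex, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


adjacency = build_adjacency(edges)
reports = descendants(adjacency, "ceo")
```

In an RDBMS the same question usually needs a recursive query or one self-join per level.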
Typically associated with full-text indexes.
Can be used as a document database.
ElasticSearch offers additional, non-probabilistic queries.
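A minimal sketch of the core data structure: an inverted index maps each term to the set of documents that contain it, so a search becomes a set intersection instead of a scan. The sample documents and function names are made up, and relevance scoring (the "probability" part) is deliberately left out.

```python
from collections import defaultdict

documents = {
    1: "NoSQL is not a monolith",
    2: "SQL is a pretty good tool",
    3: "NoSQL solves specific problems",
}


def build_index(documents: dict[int, str]) -> dict[str, set[int]]:
    """Map each lowercased term to the ids of the documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


index = build_index(documents)
hits = search(index, "nosql is")
```

It also shows why mutations are slow: changing one document means touching the entry for every term it contains.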
This is a type of database that does not get a lot of publicity.
The original “NoSQL” database (Strozzi).
“Humans are animals” / “Bob is human” / “Bob has hair” / “Bob is 35” / “Bob knows Fred” / “Sarah has hair”
Engines are mind-bogglingly fast.
BlazeGraph running on GPU servers has been clocked at 50 billion edge traversals per second.
Query language is SPARQL.
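The statements above can be sketched as subject/predicate/object triples. The `match` helper below is a hypothetical stand-in for a single triple pattern, where `None` acts as a wildcard. It is plain Python for illustration, not SPARQL and not any triplestore's API.

```python
from typing import Optional

Triple = tuple[str, str, str]

triples: list[Triple] = [
    ("Humans", "are", "animals"),
    ("Bob", "is", "human"),
    ("Bob", "has", "hair"),
    ("Bob", "is", "35"),
    ("Bob", "knows", "Fred"),
    ("Sarah", "has", "hair"),
]


def match(
    triples: list[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> list[Triple]:
    """Return the triples matching the pattern; None matches anything."""
    return [
        triple
        for triple in triples
        if (subject is None or triple[0] == subject)
        and (predicate is None or triple[1] == predicate)
        and (obj is None or triple[2] == obj)
    ]


who_has_hair = [s for s, _, _ in match(triples, predicate="has", obj="hair")]
bob_facts = match(triples, subject="Bob")
```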
NO MUTATIONS!
Often used for time series.
Typically stored in a tabular format.
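To show the difference in layout, the sketch below pivots a few made-up time series rows into one list per column. An aggregate then only has to read the single column it needs, which is roughly why aggregates are fast on this kind of store. Real columnar databases add compression and on-disk files per column, which this toy version ignores.

```python
# Row-oriented layout: each record holds every column (hypothetical readings).
rows = [
    {"timestamp": 1, "sensor": "a", "temperature": 20.5},
    {"timestamp": 2, "sensor": "b", "temperature": 21.0},
    {"timestamp": 3, "sensor": "a", "temperature": 19.5},
]


def to_columns(rows: list[dict]) -> dict[str, list]:
    """Pivot rows into a column-oriented layout: one list per column name."""
    columns: dict[str, list] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)
# The aggregate touches only the "temperature" column.
temperatures = columns["temperature"]
average_temperature = sum(temperatures) / len(temperatures)
```

Fetching one complete row, on the other hand, means reading from every column, which matches the "cannot query by row" drawback.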
This is a lot of information, so here’s some more!
You may have noticed some databases appear on multiple NoSQL type lists.
These are called multi-model databases.
Models data internally in multiple forms to overcome limitations of individual models.
Consistency, Availability, Partition tolerance.
For example…
CP gives you immediate consistency by running hot-cold replicas. This impacts availability because of warmup time for replicas.
AP is the opposite.
Some in-memory databases do not support larger-than-memory datasets with persistence (Redis).
Eventually durable.
There is another category worth mentioning here called NewSQL.