2. What is NoSQL?
Stands for Not Only SQL
implying that when designing a software solution or product there is more than one
storage mechanism that could be used, based on the needs
Class of non-relational data storage systems
Usually do not require a fixed table schema (that is, they are schema-less), nor do they use the
concept of joins
Run well on clusters
Mostly open-source, distributed, & built for 21st-century web estates
Designed to cope with the scale & agility challenges that face modern
applications
Built to take advantage of the cheap storage & processing power available today
3. Why NoSQL Databases?
Allows developers to develop
without having to convert in-memory structures to relational structures
4. Why NoSQL Databases?
Moving away from using databases as
integration points, in favor of
encapsulating databases with
applications & integrating using services
The rise of the web as a platform also
created a vital change in data
storage
the need to support large volumes of data by
running on clusters
Relational databases were not
designed to run on clusters
for example, the data storage needs of an ERP
application are very different from the
data storage needs of a Facebook or an
Etsy
5. Data Models of NoSQL
A data model is a set of constructs for representing the information
Relational model: tables, columns & rows
Storage model: how the DBMS stores & manipulates the data internally
A data model is usually independent of the storage model
Data models for NoSQL systems
Aggregate Data Models
key-value
document
column-family
Distribution Models
6. Aggregate Data Models
Data as units that have a complex structure
more structure than just a set of tuples
example:
complex record with: simple fields, arrays, records nested inside
Aggregate in Domain-Driven Design
a collection of related objects that we treat as unit
a unit for data manipulation and management of consistency
Advantages of aggregates:
easier for application programmers to work with
easier for database systems to handle operating on cluster
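A minimal sketch of an aggregate, assuming a hypothetical order record (the field names and values are illustrative only). It shows simple fields, an array and nested records travelling together as one unit:

```python
import json

# Hypothetical order aggregate: simple fields, an array, and nested records.
order = {
    "order_id": 1001,
    "customer": {"id": 42, "name": "Ann"},
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.50},
        {"sku": "B-7", "qty": 1, "price": 20.00},
    ],
}


def order_total(aggregate):
    """Compute the total using only data held inside the aggregate."""
    return sum(item["qty"] * item["price"] for item in aggregate["items"])


# The whole aggregate is written and read back as a single unit.
stored = json.dumps(order)
loaded = json.loads(stored)
```

Because everything needed for the calculation lives inside the aggregate, no second lookup is required after loading it.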
7. Distribution Models
Aggregate oriented databases make distribution of data easier
the distribution mechanism has to move only the aggregate, since all the related
data is contained in the aggregate
There are two styles of distributing data
Sharding
distributes different data across multiple servers
each server acts as the single source for a subset of data
Replication
copies data to multiple servers, so each bit of data can be found in multiple places
comes in two forms
Master-slave replication makes one node the authoritative copy that handles writes, while slaves
synchronize with the master and may handle reads
reduces the chance of update conflicts
Peer-to-peer replication allows writes to any node; the nodes coordinate to synchronize their copies of
the data
avoids loading all writes onto a single server, which would create a single point of failure
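The two styles can be sketched roughly as follows. This is an in-memory illustration only: the server names, the hash-based placement rule and the MasterSlaveStore class are assumptions made for the example, not how any particular product works.

```python
import hashlib

SHARDS = ["server-0", "server-1", "server-2"]  # hypothetical server names


def shard_for(key: str) -> str:
    """Sharding: map each key to exactly one server using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


class MasterSlaveStore:
    """Master-slave replication: the master takes writes, slaves serve reads."""

    def __init__(self, slave_count: int = 2):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]

    def write(self, key, value):
        # Only the master accepts writes.
        self.master[key] = value

    def synchronize(self):
        # Slaves copy the master's current state.
        for slave in self.slaves:
            slave.clear()
            slave.update(self.master)

    def read(self, key, slave_index: int = 0):
        # Reads go to a slave and may be stale until synchronize() runs.
        return self.slaves[slave_index].get(key)
```

A read issued between write() and synchronize() returns stale data, which is the price paid for spreading reads across slaves.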
8. CAP Theorem
Proposed by Eric Brewer (talk on Principles of Distributed Computing, July 2000)
Three properties of a system:
consistency, availability and partition tolerance
Can have at most two of these three properties for any shared-data system
To scale out, you have to partition
That leaves either consistency or availability to choose from
In almost all cases, choose availability over consistency
[Diagram: Consistency, Availability, Partition tolerance]
9. CAP Theorem
Consistency: once a writer has written, all
readers will see that write
Two kinds of consistency:
strong consistency – ACID (Atomicity,
Consistency, Isolation, Durability)
weak consistency – BASE (Basically
Available, Soft state, Eventual consistency)
10. CAP Theorem
Availability: the system is available
during software & hardware upgrades
& node failures
Traditionally thought of as the
server/process being available five 9’s
(99.999%) of the time
However, for a system with a large number of nodes,
at almost any point in time there’s
a good chance that a node is either
down or there is a network
disruption among the nodes
Want a system that is resilient in the
face of network disruption
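A rough worked calculation of the two claims above. The 1,000-node cluster and the 99.9% per-node availability are assumptions chosen for illustration, and node failures are treated as independent, which real clusters only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year for a given availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def prob_any_node_down(node_availability: float, nodes: int) -> float:
    """Chance that at least one node is down, assuming independent failures."""
    return 1.0 - node_availability ** nodes


five_nines = downtime_minutes_per_year(0.99999)   # a little over 5 minutes a year
cluster_risk = prob_any_node_down(0.999, 1000)    # roughly 0.63
```

Under these assumptions a single five-9's server is down for only a few minutes a year, yet in a large cluster it is more likely than not that some node is down at any given moment.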
11. CAP Theorem
Partition tolerance: the system can continue to operate
in the presence of network
partitions
12. CAP Theorem
Theorem: Can have at most two of
these properties for any shared-data
system
13. Types of NoSQL Databases
NoSQL
Key-Value or ‘the
big hash table’
Schema-less
Column-based
Document-based
Graph-based
14. Key-Value databases
Simplest NoSQL data stores to use from
an API perspective
The client can
either get the value for a key,
put a value for a key,
or delete a key from the data store
The data store just stores the value as a blob,
without caring what is inside
Can store whatever you like in the aggregate
Can only access an aggregate by lookup
based on its key
Examples: Riak, Redis, Memcached,
Berkeley DB, HamsterDB, Amazon
DynamoDB (not open-source), Project
Voldemort & Couchbase
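The get/put/delete interface described above can be sketched as a thin wrapper over a dictionary. The KeyValueStore class and the key shown are hypothetical and do not correspond to the API of any product listed.

```python
class KeyValueStore:
    """Toy key-value store: values are opaque blobs, looked up only by key."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        # The store does not inspect the value; it is just bytes.
        self._data[key] = value

    def get(self, key: str):
        # Returns None when the key is absent.
        return self._data.get(key)

    def delete(self, key: str) -> None:
        # Deleting a missing key is a no-op.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", b'{"user": "ann", "cart": ["A-1"]}')
blob = store.get("session:42")
```

Interpreting the blob (here, JSON text) is left entirely to the application; the store cannot query by what is inside it.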
15. Document databases
Main concept is the ‘document’
Database stores & retrieves documents,
which can be
XML, JSON, BSON and so on
Documents are
Self-describing
Hierarchical tree data structures that can
consist of maps, collections & scalar values
Stored documents are similar to each other
but do not have to be exactly the same
Documents are stored in the ‘value’ part of a key-value store,
i.e. a key-value store where the values are
examinable
Examples: MongoDB, CouchDB, Terrastore,
OrientDB, RavenDB
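A minimal sketch of what "examinable values" means in practice. The DocumentStore class and its find method are hypothetical, written for this illustration rather than taken from any of the products above.

```python
class DocumentStore:
    """Toy document store: values are documents the store can look inside."""

    def __init__(self):
        self._docs = {}

    def insert(self, doc_id, document: dict) -> None:
        self._docs[doc_id] = document

    def find(self, field, value):
        """Return ids of documents whose top-level `field` equals `value`."""
        return [doc_id for doc_id, doc in self._docs.items()
                if doc.get(field) == value]


people = DocumentStore()
# Documents are similar but not identical: fields vary per document.
people.insert("p1", {"name": "Ann", "city": "Oslo", "likes": ["chess", "tea"]})
people.insert("p2", {"name": "Bob", "city": "Oslo"})
people.insert("p3", {"name": "Cy", "city": "Lima", "age": 31})

in_oslo = people.find("city", "Oslo")
```

Unlike a plain key-value store, the query here works on a field inside the stored value, and documents that lack the field are simply skipped.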
16. Column family stores
Store data in column families as
rows
that have many columns associated
with a row key
Column families are groups of
related data
that is often accessed together
Different rows do not have to have the
same columns
Columns can be added
to any row at any time without having
to add them to other rows
Examples: Cassandra, HBase,
Hypertable, Amazon DynamoDB
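The row key / column family / column layout can be sketched with nested dictionaries. The ColumnFamilyStore class, the row keys and the "profile" family are hypothetical names used only for illustration.

```python
from collections import defaultdict


class ColumnFamilyStore:
    """Toy column-family store: row key -> column family -> column -> value."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, column, value) -> None:
        # A column can be added to one row without touching any other row.
        self._rows[row_key][family][column] = value

    def get_family(self, row_key, family) -> dict:
        # Fetch all columns of one family for one row (empty if absent).
        return dict(self._rows.get(row_key, {}).get(family, {}))


users = ColumnFamilyStore()
users.put("user:1", "profile", "name", "Ann")
users.put("user:1", "profile", "email", "ann@example.com")
users.put("user:2", "profile", "name", "Bob")
users.put("user:2", "profile", "phone", "555-0100")  # column only this row has
```

The two rows share a column family but not the same set of columns, which is the flexibility the slide describes.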
17. Graph stores
Allow you to store entities & relationships
between these entities
Entities are also known as nodes
a node can be an instance of an object in the
application
Relations are known as edges
Nodes are organized by relationships
allows you to find interesting patterns
between the nodes
In an RDBMS, complex relationships require complex
joins
When storing a graph-like structure in
an RDBMS, the graph is modeled
beforehand, based on the traversal
needed
If the traversal changes, the data
has to change
18. Graph stores
Traversing the joins or relationships
in the database is very fast
Nodes can have
different types of relationships
Value of graph databases is
derived from the relationships
Relationships don’t only have a type
but also
a start node &
an end node
Adding new relationship types is easy
Changing existing nodes &
relationships is similar to
data migration
Examples: Neo4j, InfiniteGraph,
OrientDB or FlockDB
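A small in-memory sketch of typed relationships with a start node, an end node and optional properties. The Graph class, node ids and relationship types are hypothetical; real graph databases expose their own query languages and APIs.

```python
from collections import defaultdict


class Graph:
    """Toy graph store: relationships are persisted, not computed per query."""

    def __init__(self):
        self.nodes = {}                    # node id -> properties
        self.outgoing = defaultdict(list)  # start node -> [(type, end node, properties)]

    def add_node(self, node_id, **props) -> None:
        self.nodes[node_id] = props

    def add_relationship(self, start, rel_type, end, **props) -> None:
        # Each relationship has a type, a start node, an end node and properties.
        self.outgoing[start].append((rel_type, end, props))

    def neighbors(self, start, rel_type):
        # Traversal follows stored relationships directly.
        return [end for t, end, _ in self.outgoing.get(start, []) if t == rel_type]


g = Graph()
for name in ("ann", "bob", "cy", "acme"):
    g.add_node(name)
g.add_relationship("ann", "FRIEND", "bob", since=2015)
g.add_relationship("bob", "FRIEND", "cy")
g.add_relationship("ann", "WORKS_AT", "acme")  # a new relationship type, no schema change

friends_of_friends = [fof for f in g.neighbors("ann", "FRIEND")
                      for fof in g.neighbors(f, "FRIEND")]
```

Adding the WORKS_AT relationship type required no change to existing nodes or relationships, which is the "adding new relationship types is easy" point above.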
19. Key/Value Vs. Schema-less
Key/Value
Pros:
very fast
very scalable
simple model
able to distribute horizontally
Cons:
many data structures (objects) can’t be
easily modeled as key value pairs
Schema-less
Pros:
Schema-less data model is richer than
key/value pairs
eventual consistency
many are distributed
still provide excellent performance and
scalability
Cons:
typically no ACID transactions or joins
20. SQL Vs. NoSQL
Types
SQL: One type: SQL database (with minor variations)
NoSQL: Many different types: key-value stores, document databases, column stores, graph databases
Development History
SQL: Developed in the 1970s
NoSQL: Developed in the 2000s
Deals with
SQL: First wave of data storage applications
NoSQL: Limitations of SQL databases, particularly concerning scale, replication & unstructured data storage
Examples
SQL: MySQL, Postgres, Oracle
NoSQL: MongoDB, Cassandra, HBase, Neo4j
Data Storage Model
SQL: Individual records are stored as rows in tables with columns, much like a spreadsheet. Separate data is stored in separate tables & join operations are used for querying the data
NoSQL: Varies based on database type. For example, key-value stores function similarly to SQL databases but have only two columns, ‘key’ & ‘value’, with more information sometimes stored in ‘value’. Document databases do away with the table & row model, storing all relevant data in a single document in a format such as JSON or XML
21. SQL Vs. NoSQL
Schemas
SQL: Predefined, i.e. structure & datatypes are fixed
NoSQL: Dynamic. Unlike SQL, can store dissimilar data if necessary
Scaling
SQL: Vertically, i.e. a single server must be made increasingly powerful. Spreading a SQL database over many servers requires additional engineering
NoSQL: Horizontally, i.e. to add capacity, a database administrator can simply add more commodity servers & cloud instances
Sharding
SQL: Manual sharding
NoSQL: Auto sharding
Development Model
SQL: Mix of open-source (e.g. Postgres, MySQL) and closed-source (e.g. Oracle)
NoSQL: Open-source
Supports Transactions
SQL: Updates can be configured to complete entirely or not at all
NoSQL: In certain circumstances and at certain levels (e.g. document level vs. database level)
Data Manipulation
SQL: Specific language using select, insert & update statements, e.g. SELECT fields FROM table WHERE
NoSQL: Object-oriented APIs
Consistency
SQL: Strong consistency
NoSQL: Depends on product. Some provide strong consistency (e.g. MongoDB) whereas others offer eventual consistency (e.g. Cassandra)
22. Handling Relational Data
Most NoSQL databases lack the ability to do joins in queries
Three main techniques for handling relational data
Multiple queries
instead of retrieving all data with one query, it’s acceptable to do several queries
Caching/replication/non-normalized data
instead of storing only foreign keys, it’s common to store actual foreign values with the model’s data
Nesting data
put more data in a smaller number of collections so that a single document can contain all the
data needed for a specific task
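The three techniques can be compared side by side using a hypothetical blogging application. The collections are plain dictionaries and every id and field name is an assumption made for the example.

```python
# Hypothetical collections, keyed by id.
users = {"u1": {"username": "ann"}}
comments = {"c1": {"post_id": "p1", "user_id": "u1", "text": "Nice post"}}


# 1. Multiple queries: follow the stored user id with a second lookup.
def comment_with_username(comment_id):
    comment = comments[comment_id]        # first query
    user = users[comment["user_id"]]      # second query
    return {"text": comment["text"], "username": user["username"]}


# 2. Non-normalized data: store the foreign value (username) next to the key.
#    A username change must then be applied to every copy.
denormalized_comment = {
    "post_id": "p1", "user_id": "u1", "username": "ann", "text": "Nice post",
}

# 3. Nesting: keep the comments inside the post document itself,
#    so one retrieval returns everything needed to render the post.
post = {
    "_id": "p1",
    "title": "Hello",
    "comments": [{"user_id": "u1", "username": "ann", "text": "Nice post"}],
}
```

Which technique fits depends on how often the duplicated values change and on which data is usually read together.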
23. Benefits of NoSQL
Cheap, easy to implement (open source)
Data are replicated to multiple nodes (therefore identical & fault tolerant) and can
be partitioned
Down nodes easily replaced
No single point of failure
Easy to distribute
Don’t require a schema
Can scale up and down
Relax the data consistency requirement (CAP)
24. Conclusion
NoSQL databases don’t mean
the demise of RDBMSs
Reasons to choose NoSQL:
improve programmer productivity
improve data access performance via some combination of
handling larger data volumes
reducing latency
improving throughput
Entering an era of ‘Polyglot Persistence’
a technique that uses different data storage technologies to handle varying data storage
needs
can apply across an enterprise or within a single application
- NoSQL encompasses a wide variety of different database technologies that were developed in response to a rise in the volume of data stored about users, objects & products, the frequency with which this data is accessed, and performance and processing needs
- NoSQL was a hashtag (#nosql) chosen for a meetup to discuss these new databases
- Unlike RDBMSs, NoSQL databases are designed to cope with the scale & agility challenges that face modern applications & are built to take advantage of the cheap storage & processing power available today
Application developers have been frustrated with the impedance mismatch between the relational data structures and the in-memory data structures of the application.
Using NoSQL databases allows developers to develop without having to convert in-memory structures to relational structures.
- Currently there is a movement from using databases as integration points to encapsulating databases with applications and integrating using services
- Web-based data is growing day by day, which is a major change in data storage: large volumes of data now have to be managed on clusters
- Relational databases were not designed to run on clusters
- RDBMS modeling is vastly different from the types of data structures that application developers use.
- Using the data structures as modeled by the developers to solve different problem domains has given rise to a movement away from relational modeling and towards aggregate models; most of this is driven by Domain-Driven Design
-> Proposed by Eric Brewer (talk on Principles of Distributed Computing, July 2000).
-> Partitionability: divide nodes into small groups that can see other groups, but they can't see everyone.
-> Consistency: when you write a value and then read it, you get the same value back. In a partitioned system there are windows where that's not true.
-> Availability: you may not always be able to write or read. The system will say you can't write because it wants to keep the system consistent.
-> To scale you have to partition, so you are left with choosing either high consistency or high availability for a particular system. You must find the right overlap of availability and consistency.
-> Choose a specific approach based on the needs of the service.
-> For the checkout process you always want to honor requests to add items to a shopping cart because it's revenue producing. In this case you choose high availability. Errors are hidden from the customer and sorted out later.
-> When a customer submits an order you favor consistency because several services (credit card processing, shipping and handling, reporting) are simultaneously accessing the data.
Consistency: when you write a value and then read it, you get the same value back. In a partitioned system there are windows where that's not true.
A consistency model determines the rules for visibility and apparent order of updates.
For example:
Row X is replicated on nodes M and N
Client A writes row X to node N
Some period of time t elapses.
Client B reads row X from node M
Does client B see the write from client A?
Consistency is a continuum with tradeoffs
For NoSQL, the answer would be: maybe
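The "maybe" can be made concrete with a toy simulation of the scenario above. The Replica class, the pending queue and the propagate function are hypothetical stand-ins for asynchronous replication between nodes M and N.

```python
class Replica:
    """One node holding its own copy of the rows."""

    def __init__(self, name):
        self.name = name
        self.rows = {}


node_m, node_n = Replica("M"), Replica("N")
pending = []  # writes accepted by one node but not yet copied to the other


def write(node, other, key, value):
    # The write is applied locally and queued for the other replica.
    node.rows[key] = value
    pending.append((other, key, value))


def propagate():
    # Replication catches up: queued writes reach their target replica.
    while pending:
        target, key, value = pending.pop(0)
        target.rows[key] = value


write(node_n, node_m, "X", "v1")      # client A writes row X to node N
read_before = node_m.rows.get("X")    # client B reads from M before replication
propagate()                           # some period of time t elapses
read_after = node_m.rows.get("X")     # client B reads from M again
```

Whether client B sees the write depends only on whether replication has run before the read, which is why the answer under eventual consistency is "maybe".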
The CAP theorem states that strict consistency can't be achieved at the same time as availability and partition tolerance.
- The CAP theorem states that in any distributed system we can choose only two of consistency, availability and partition tolerance.
- Many NoSQL databases try to provide options so that developers can tune the database to their needs
Usually when we store a graph-like structure in an RDBMS, it’s for a single type of relationship, & adding another relationship to the mix usually means a lot of schema changes and data movement, which is not the case when we are using graph databases.
Similarly, in relational databases we model the graph beforehand based on the traversal we want; if the traversal changes, the data will have to change.
Traversing the joins or relationships in the database is very fast, because the relationship between nodes is not calculated at query time but is actually persisted as a relationship, & traversing persisted relationships is faster than calculating them for every query.
Relationships don’t only have a type, a start node & an end node; they can also have properties of their own. Using these properties we can add intelligence to the relationships & use them to query the graph.
Adding new relationship types is easy; changing existing nodes & their relationships is similar to data migration because these changes will have to be done on each node and each relationship in the existing data.
Data storage model
SQL: Individual records are stored as rows in tables, with each column storing a specific piece of data about the record, much like a spreadsheet. Separate data is stored in separate tables & join operations are used for querying the data
Schemas
SQL: Predefined, i.e. structure & datatypes are fixed. The entire database must be altered to store new kinds of data & to do this the database must be taken offline
NoSQL: Dynamic. Unlike SQL, can store dissimilar data if necessary. But for some databases, like wide-column stores, it is challenging to add new fields dynamically
Scaling
SQL: Vertically, i.e. a single server must be made increasingly powerful. It is possible to spread a SQL database over many servers, but additional engineering is required
NoSQL: Horizontally, i.e. to add capacity, a database administrator can simply add more commodity servers & cloud instances
Sharding
SQL: Manual sharding. In relational databases, application code is developed to distribute the data, distribute the queries, & aggregate the results across all of the database instances. Additional code must be developed to handle resource failures, to perform joins across the different databases, and for data rebalancing, replication & other requirements. Furthermore, many benefits of the relational database, such as transactional integrity, are compromised or eliminated when employing manual sharding
NoSQL: Auto sharding, meaning that they natively and automatically spread data across an arbitrary number of servers
Most NoSQL databases lack the ability to do joins in queries, so the database schema generally needs to be designed differently. There are three main techniques for handling relational data in a NoSQL database
Multiple queries: instead of retrieving all data with one query, it’s common to do several queries to get the desired data. NoSQL queries are often faster than traditional SQL queries, so the cost of doing additional queries may be acceptable
Caching/replication/non-normalized data: each blog comment might include the username in addition to a user id, thus providing easy access to the username without requiring another lookup. When a username changes, however, it will need to be changed in many places in the database
Nesting data: for example, in a blogging application one might choose to store comments within the blog post document, so that a single retrieval gets all the comments. In this approach a single document contains all the data you need for a specific task
- To improve programmers’ productivity by using a database that better matches an application’s needs