The presentation begins with an overview of the growth of unstructured data and the benefits NoSQL products provide. It then evaluates the more popular NoSQL products on the market, including MongoDB, Cassandra, Neo4j, and Redis. With NoSQL architectures becoming an increasingly appealing database management option for many organizations, this presentation will help you evaluate the most popular NoSQL offerings and determine which one best meets your business needs.
NoSQL Architecture Overview
1. NoSQL Architecture Overview
OVER 400 CUSTOMERS TRUST THEIR DATABASES TO RDX
RDX Insights Series Presentation – Introduction to NoSQL Architectures
Chris Foot
VP DB Technologies
RDX
March 23, 2017
Video recording of this presentation can be found on RDX’s YouTube Channel: https://lnkd.in/g96cbUV
3. NoSQL Competitors
Persistent DB
Document
• Pairs a key with a complex data structure called a document
• Records not required to have uniform structure
• MongoDB, CouchDB, DynamoDB, Couchbase, MarkLogic
Wide-Column
• Record can have billions of columns
• Tables are collections of columns, rather than rows
• Column names and record keys are not fixed
• Cassandra, Bigtable, HBase, Accumulo
Key-Value
• All items are stored as indexed key-value pairs
• Redis, Riak, Memcached, Oracle NoSQL, DynamoDB
Graph
• Stores nodes (data elements) with relationships
• Interconnected, strong relationships
• Neo4j, DataStax Cassandra, Titan, ArangoDB
In-Memory DB
• Operations performed in memory
• Lightning-fast read/write
• Often use Key-Value or Wide-Column as data store
• Redis, Memcached, Oracle TimesTen, SAP HANA
4. RDBMS and NoSQL Will Merge
NoSQL vendors’ desire to increase market share will drive them to compete directly with relational product manufacturers
Vendors will add RDBMS-like functionality that allows their products to be more widely adopted. Those that don’t will quickly lose market share to those that do
The larger relational vendors will attempt to co-opt any NoSQL technology that challenges their dominant role in the industry
As they identify offerings as tangible threats, their strategy will be to ensure that the technologies used by those vendors become a component of, not a replacement for, their traditional database products
Diagram: Relational DBMS and NoSQL DBMS converge into a General Purpose DBMS
6. NoSQL Adoption Drivers - Modern Applications
Single View
Sensor Data
Biometrics
Radiology
Videos, Images
Weather Data
Catalogs
Content Management
Geospatial
Social Data
• IDC: Unstructured data is growing at the rate of 62% per year
• IDC: By 2022, 93% of all data in the digital universe will be unstructured
• Gartner: Data volume is set to grow 800% over the next 5 years and 80% of it will reside as unstructured data
7. NoSQL Adoption Drivers – Horizontal Scaling
NoSQL architectures leverage horizontal scalability to cost-effectively handle large volumes of data and/or users
Diagram: Horizontal vs Vertical scaling
8. Relational and NoSQL Parallel Adoption Drivers
Hierarchical and Network Databases – IMS and CODASYL/Network
Logical and physical layers entirely dependent upon each other. Both data storage and data navigation were rigidly
defined. Programs were required to follow the prebuilt paths to navigate through the stored data
Early Releases of DB2
• Flexibility
• Separate logical and physical layers - schema
• Set vs row processing
• Ease of use
• SQL language was intuitive
• Poor performance
• Crude locking, transaction management and
limited features
Early Releases of Oracle
• Flexibility
• Easy to use
• Lower Total Cost of Ownership (support, product costs)
• Low cost commodity hardware (as in it didn’t need a
mainframe)
• Crude locking, transaction management and limited
features
Early Releases of NoSQL
• Flexibility
• Easy to use
• Lower Total Cost of Ownership (support, product
costs)
• Faster application development
• Architected to scale horizontally for availability
and performance
• Crude locking, transaction management and
limited features
“Niche implementations, crude technically, will never become popular, no features - no future”.
Pretty much…. “Your career is going to be toast.”
9. ACID vs BASE
ACID (Relational)
• Atomicity - All operations in a single transaction succeed or fail as a group. No partial operations
• Consistency - The database is never in an inconsistent state
• Isolation - Transactions do not interfere with one another. Contentious data access is handled by the database to make the transactions appear to run sequentially
• Durability - Transactions are permanent in the presence of failures
BASE (NoSQL Distributed Tradeoff)
• Basic Availability - The system is able to tolerate a partial failure (loss of a single node, for example)
• Soft State - The state of the system is in flux and may change over time because of eventual consistency
• Eventual Consistency - As data is added to the system, changes are gradually replicated across all nodes. Data may be inconsistent in the short term but will eventually become consistent
The application is given a greater responsibility for data management in systems that don’t follow ACID
Leads to complex application code when strong consistency is needed across replicated nodes
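The soft state and eventual consistency described on this slide can be illustrated with a toy model. This is a sketch in plain Python, not any vendor's replication protocol: the `Replica` class, the `anti_entropy` function and the last-write-wins rule are all assumptions chosen for illustration.

```python
# Toy model of eventual consistency: a write lands on one replica and
# reaches the others only when a synchronization pass runs.

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (timestamp, value)

    def write(self, key, value, ts):
        current = self.data.get(key)
        # Last-write-wins (an assumption): keep the newest timestamp.
        if current is None or ts > current[0]:
            self.data[key] = (ts, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def anti_entropy(replicas):
    """Push every replica's versions to every other replica."""
    for source in replicas:
        for target in replicas:
            if source is not target:
                for key, (ts, value) in list(source.data.items()):
                    target.write(key, value, ts)


nodes = [Replica("n1"), Replica("n2"), Replica("n3")]
nodes[0].write("cart:42", ["book"], ts=1)

stale = [n.read("cart:42") for n in nodes]      # replicas disagree (soft state)
anti_entropy(nodes)
converged = [n.read("cart:42") for n in nodes]  # replicas agree again
```

Between the write and the synchronization pass, a client reading from `n2` or `n3` sees no cart at all, which is the window the application has to tolerate or work around.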
10. CAP Theorem Distributed Systems – Pick C or A
• Consistency - All clients see the same data
• Availability - All clients can read and write
• Partition Tolerance - System continues to work during network partitions
• CA: Oracle, SQL Server, MySQL…
• CP: MongoDB, Redis, Bigtable, HBase, MemcacheDB
• AP: Cassandra, Riak, CouchDB, DynamoDB
11. CAP Theorem
AVAILABILITY (during a partition both sides allow updates: INCONSISTENT)
System is available, but data is inconsistent due to lack of synchronization
CONSISTENCY (during a partition one side allows updates and the other prevents updates: UNAVAILABLE)
Data is in sync because only one node allows updates. The system is unavailable to one group of users
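The two partition scenarios can be sketched as a two-node store that follows either an AP or a CP policy. This is a simplified illustration, not how any listed product behaves internally. The names `TwoNodeStore`, `Unavailable` and `run` are hypothetical, and fencing node B for both reads and writes is an assumption that goes slightly beyond the slide, which only shows updates being prevented.

```python
class Unavailable(Exception):
    """Raised when a node refuses to serve a request."""


class TwoNodeStore:
    def __init__(self, policy):
        self.policy = policy              # "AP" or "CP"
        self.values = {"A": "v0", "B": "v0"}
        self.partitioned = False

    def _check(self, node):
        # Assumption: under CP, node B is fenced off during a partition.
        if self.partitioned and self.policy == "CP" and node == "B":
            raise Unavailable(node)

    def write(self, node, value):
        self._check(node)
        self.values[node] = value
        if not self.partitioned:
            # Healthy network: the update reaches both nodes.
            for other in self.values:
                self.values[other] = value

    def read(self, node):
        self._check(node)
        return self.values[node]


def run(policy):
    """Partition the store, write on both sides, report what clients see."""
    store = TwoNodeStore(policy)
    store.partitioned = True
    for node, value in (("A", "x"), ("B", "y")):
        try:
            store.write(node, value)
        except Unavailable:
            pass
    views = {}
    for node in ("A", "B"):
        try:
            views[node] = store.read(node)
        except Unavailable:
            views[node] = "UNAVAILABLE"
    return views


ap_views = run("AP")  # both sides answer, but with different data
cp_views = run("CP")  # one side answers, the other refuses
```

Under AP every client gets an answer and the two sides diverge; under CP no client ever sees conflicting data, at the cost of turning away the users connected to node B.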
12. Why Did RDX Choose MongoDB?
Business Drivers
• Industry analyst evaluations
• Customer use cases and recommendations
• Largest commercial investment in any database
vendor
• Popularity
• 10 million+ software downloads
• 1,000 partners
• 2,000 customers
• 1/3 of the Fortune 100
• Robust training available
• Strong open source community
• Excellent partnership support
Technical Drivers
• Wide scope of potential application
• Low TCO
• Combines capabilities of relational databases with next
generation NoSQL technologies
• Schemaless, flexible data model
• Nonstructured data support
• Easily accommodates large data volumes
• Rich query capability
• Strong, tunable consistency model
• Elastic, horizontal scalability
• Easily configurable system resiliency
• Vendor provided database support
Craigslist, New York Times, Verizon, Viacom, AstraZeneca, MTV, Google, Genentech, Adobe, GAP, Cisco, MetLife, Facebook, Expedia, eBay, Edmunds, Washington Post, AOL, ADP, Forbes, Intuit, The Weather Channel, Carfax…
13. MongoDB Features
• Multiple storage engines
WiredTiger
InMemory
Encrypted
Third-Party
MMAPv1
• Indexing
Unique – Enforces uniqueness on user-defined and Object ID fields
Partial – Documents are indexed only if they meet a filter expression
Sparse – Documents are indexed only if the field is populated
Compound – Index on multiple fields
Multikey – Indexes on arrays
TTL – Allows documents to be purged based on time
Text Search
Hashed – Indexes the hash of a field's value
• Easily ingests large, unstructured data elements
Decomposes large video files and images into smaller components and rebuilds them using pointers during retrieval
• Document validation rules enforce data validity
Enforce checks on document structure, data types, data ranges and the presence of mandatory fields
DBAs can apply data governance standards, while developers maintain the benefits of a flexible schema
• Automatic failover with no application redirects to new primary required
• Driver support for all common programming languages
• Data compression
• Tunable consistency model
• BI Connector allows MongoDB to act as a data source for SQL-based BI analytics platforms
• LDAP, Kerberos, Windows AD, X.509 authentication
• DML, DCL, DDL audit logging
• FIPS-compliant data encryption
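The difference between the sparse and partial index types listed above can be shown with plain Python dictionaries. This is only a sketch of which documents end up in each index, not the MongoDB driver API; the sample documents and the helper names `sparse_index` and `partial_index` are hypothetical.

```python
# Three sample documents; document 2 has no "email" field.
docs = [
    {"_id": 1, "email": "a@example.com", "status": "active"},
    {"_id": 2, "status": "closed"},
    {"_id": 3, "email": "c@example.com", "status": "closed"},
]


def sparse_index(documents, field):
    """Index only the documents in which the field is populated."""
    return {doc[field]: doc["_id"] for doc in documents if field in doc}


def partial_index(documents, field, filter_expression):
    """Index only the documents that satisfy the filter expression."""
    return {
        doc.get(field): doc["_id"]
        for doc in documents
        if filter_expression(doc)
    }


sparse = sparse_index(docs, "email")
partial = partial_index(docs, "email", lambda d: d["status"] == "active")
```

Here the sparse index skips document 2 because it has no email, while the partial index keeps only document 1 because it is the only one whose status is active.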
14. Rigid vs Dynamic Schemas
Relational Tables and Rows (Predescribed)
• Schema design performed before application is developed
• Schema must be built before inserting data
• Enforces data structure – rows cannot deviate from the predefined schema
• Schema design based on storage
• Schema alterations require database and application changes to be coordinated
• Normalization process is critical
MongoDB Collections and Documents (Self-Describing)
• No schema required before inserting data
• Schema is created as each document is inserted
• Documents in a collection can have different schemas (sets of fields)
• Schema design based on application usage
• Schemas can evolve iteratively during application life-cycle
• Higher dependency on application layer for data integrity
• Normalization not as important
15. Flexible Schemas
Insurance Policy Document Collection: AUTO, LIFE, HOME, EQUIPMENT, CYBER
Collections do *not* enforce document structure.
You do not predefine document schemas. The schema is defined during initial document insertion. Data types are selected by MongoDB based on the data being inserted
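A flexible collection can be sketched with a plain Python list standing in for the insurance policy collection. This is an illustration only and does not use the MongoDB driver: the `insert` helper, the field names and the `required` check (a stand-in for the optional document validation rules mentioned earlier in the deck) are all assumptions.

```python
policies = []  # stands in for the insurance policy collection


def insert(collection, document, required=("type", "policy_id")):
    """Accept any document shape, rejecting only missing mandatory fields."""
    missing = [field for field in required if field not in document]
    if missing:
        raise ValueError(f"missing mandatory fields: {missing}")
    collection.append(document)


# Two documents with different sets of fields in the same collection.
insert(policies, {"type": "AUTO", "policy_id": 1,
                  "vin": "HYPOTHETICAL-VIN", "drivers": ["Ann", "Raj"]})
insert(policies, {"type": "LIFE", "policy_id": 2,
                  "beneficiary": "Li", "face_value": 250000.0})

# Each document describes itself: field names and types come from the data.
schema_per_doc = [
    {field: type(value).__name__ for field, value in doc.items()}
    for doc in policies
]
```

`schema_per_doc` shows that the AUTO and LIFE documents carry different fields and types, with no table definition written in advance.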
16. Agile Development Features
FASTER, BETTER, LEANER
• Schemaless architecture
• Flexible data model = easy schema changes
• Drivers for all major programming languages
• Ability to store all types of data
• Flexible JSON document format
• Rich content using GridFS
• Simple system provisioning
• Scale vertically and horizontally
• Pluggable storage engines
• Easy replication setup
17. Automatic Sharding
Automatic Data Distribution - Sharded Cluster (Horizontally Scalable)
Diagram: Shard 1, Shard 2 … Shard N. Each logical shard is a set of replicas: one primary and two secondaries, each on its own physical server
Cluster metadata includes data location, shards, # of chunks….
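How cluster metadata can direct a request to the right shard is sketched below. This is a simplified range lookup, assuming each chunk is identified by the lowest shard-key value it covers; the boundaries, shard names and the `route` function are hypothetical and are not MongoDB's actual metadata format.

```python
import bisect

# Hypothetical cluster metadata: chunk i covers shard-key values from
# chunk_lower_bounds[i] (inclusive) up to the next lower bound (exclusive).
chunk_lower_bounds = [0, 1000, 2000, 3000]
chunk_owner = ["shard1", "shard2", "shard1", "shardN"]


def route(shard_key):
    """Return the shard that owns the chunk containing shard_key."""
    index = bisect.bisect_right(chunk_lower_bounds, shard_key) - 1
    if index < 0:
        raise ValueError(f"no chunk covers shard key {shard_key}")
    return chunk_owner[index]


owners = [route(key) for key in (0, 999, 1000, 2500, 99999)]
```

Moving a chunk to another shard only requires updating `chunk_owner`, which is why the metadata has to record data location and chunk counts.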
18. Replica Sets
MULTI DATACENTER CLUSTER
Diagram: a collection spread over three replica sets, with a BI Connector attached
Primaries: Primary 1 (Display), Primary 2 (Display), Primary 3 (Display); Priority 1, Votes 1; Config Server
Site 2 – Display and Batch: Sec 1.1 (Display), Sec 2.1 (Batch), Sec 3.1 (Batch); Priority 1, Votes 1; Config Server
Site 3 – Batch and DR: Sec 1.2 (Batch), Sec 2.2 (Batch), Sec 3.2 (Delayed); Priority 0, Votes 1; Config Server
19. Global Data Distribution
Read Global/Write Local
Primary
Secondary
Secondary
20. Videos and Images – Unstructured Data
• Store files larger than 16MB, e.g. video, images
• Load chunks without reading entire file into memory
• Atomically sync files with their metadata
• Shard and distribute around the cluster
Diagram: the driver's GridFS API stores doc.jpg as metadata in fs.files and as numbered chunks in fs.chunks
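The split between fs.files and fs.chunks can be sketched in plain Python. This mimics the idea of GridFS rather than calling a real driver: the `put`, `iter_chunks` and `get` helpers are hypothetical, and the 4-byte chunk size is chosen only to keep the example small.

```python
def put(files, chunks, filename, data, chunk_size):
    """Store metadata in `files` and the content as numbered chunks."""
    file_id = len(files) + 1
    for n, offset in enumerate(range(0, len(data), chunk_size)):
        chunks.append({
            "files_id": file_id,   # pointer back to the metadata document
            "n": n,                # position of this chunk within the file
            "data": data[offset:offset + chunk_size],
        })
    files[file_id] = {
        "filename": filename,
        "length": len(data),
        "chunkSize": chunk_size,
    }
    return file_id


def iter_chunks(chunks, file_id):
    """Yield one chunk at a time, so the whole file is never held at once."""
    ordered = sorted(
        (c for c in chunks if c["files_id"] == file_id),
        key=lambda c: c["n"],
    )
    for chunk in ordered:
        yield chunk["data"]


def get(chunks, file_id):
    """Rebuild the complete file from its chunks."""
    return b"".join(iter_chunks(chunks, file_id))


fs_files, fs_chunks = {}, []
doc_id = put(fs_files, fs_chunks, "doc.jpg", b"0123456789", chunk_size=4)
restored = get(fs_chunks, doc_id)
```

Because each chunk is an ordinary small record, the chunks can be distributed around the cluster like any other data.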
21. Cassandra
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the log-structured storage engine from Google's Bigtable.
Apple, Sony, Walmart, Comcast, eBay, GitHub, GoDaddy, Hulu, Instagram, Intuit, Netflix, Reddit, Weather Channel, CERN, Constant Contact, Macy’s, Expedia
• Fault Tolerant
• Data Durability
• Data Center Aware
• High Performance
• Decentralized
• Horizontal Scalability
• Elastic Architecture
• Apple - 75,000 nodes storing over 10 PB of data
• Netflix - 2,500 nodes, 420 TB, over 1 trillion requests per day
• Chinese search engine Easou - 270 nodes, 300 TB, over 800 million requests per day
• eBay - 100 nodes, 250 TB
BIG Data, High # Concurrent Users
22. DataStax/Cassandra Features
• Multi-model storage
• Key Value NoSQL
• Tabular NoSQL
• JSON/Document NoSQL
• Graph
• Very high “linear” scalability
• Automatic data distribution amongst nodes
• Multi-data center replication
• CQL Access Language
• SQL “like” language
• Tunable consistency model
• Strong node fault detection and recovery
• Writes to Memtables in RAM
• Materialized views
• Advanced replication allows multiple clusters to
be synchronized
• OpsCenter – browser based administration and
monitoring toolset
• Driver support for all common programming
languages
• In-Memory option allows parts (or all) of
database to reside in RAM
• Tiered storage
• Interface to Spark (in-memory)
• Data stream processing
• Access to Spark SQL (more robust than CQL)
• Security
• End to end encryption
• AD, LDAP, Kerberos support
24. Cassandra/DataStax
• Keyspace - A keyspace is a logical container for data tables and indexes. It can be compared to an Oracle schema or a SQL Server database. Keyspaces define how the data is replicated amongst the nodes
• Table - A collection of columns fetched by row. Columns are ordered by name
• Column - Supports different data types and consists of a name, value and timestamp
• Primary Key - Uniquely identifies a row occurrence in a Cassandra table
• Partition Key - The partition key identifies which node in the cluster will store the row. It is responsible for data distribution across the nodes
• Clustering Key - Orders rows based on the column’s value
• Data Center - A collection of related nodes in a Cassandra cluster
• Snitch - Determines which datacenters and racks nodes belong to. Snitches inform Cassandra about the network topology so that requests are routed efficiently, and allow Cassandra to distribute replicas by grouping machines into datacenters and racks
• Partitioner - A hashing algorithm that generates a hash value token from the partition key. The token is the value used to distribute the data across the various nodes in the cluster. The partitioner’s goal is to assign equal portions of data to each node. Each node in a Cassandra cluster becomes responsible for storing a range of hash values
• Gossip - A peer-to-peer communications mechanism that identifies and shares node information (state and location) with all nodes in the Cassandra cluster
25. Cassandra/DataStax Decentralized Storage
Partitioners are hashing algorithms that generate tokens from partition keys
Each node in a Cassandra cluster is responsible for a range of tokens (hash keys)
First column of primary key becomes partition key
Can use multiple columns as primary key, partition key
Also able to cluster columns to order data
PRIMARY KEY (emp_id)
PRIMARY KEY (emp_id, dept_id) WITH
CLUSTERING ORDER BY (dept_loc))
PRIMARY KEY (emp_id, dept_id)
Partitioner
TOKEN RANGE
0 0-25
26 26-50
51 51-75
76 76-100
All nodes can
accept reads
and writes
Distributes data
amongst nodes
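The token lookup on this slide can be sketched as follows. The four token ranges come from the slide; the node names, the use of MD5 and the modulo step are assumptions made to fit the slide's simplified 0-100 token space and are not Cassandra's real partitioner.

```python
import hashlib

# Token ranges from the slide, assigned to four hypothetical nodes.
TOKEN_RANGES = {
    "node1": (0, 25),
    "node2": (26, 50),
    "node3": (51, 75),
    "node4": (76, 100),
}


def token_for(partition_key):
    """Hash the partition key into the 0-100 token space."""
    digest = hashlib.md5(str(partition_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 101


def node_for(partition_key):
    """Find the node whose token range contains the key's token."""
    token = token_for(partition_key)
    for node, (low, high) in TOKEN_RANGES.items():
        if low <= token <= high:
            return node
    raise ValueError(f"token {token} is not covered by any range")


placement = {emp_id: node_for(emp_id) for emp_id in (1001, 1002, 1003, 1004)}
```

Any node can run the same calculation, which is why every node can accept a read or write and forward it to the owner without consulting a central directory.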
26. Cassandra/DataStax Tunable Consistency
Read and write consistency levels are different from row replication settings. Replication factor will affect how many copies are eventually written vs tunable consistency for fast client response
Tradeoff: Consistency vs Latency
Read Consistency (Level - Description)
• ALL - Returns the record after all replicas have responded. The read operation will fail if a replica does not respond.
• QUORUM - Returns the record after a quorum of replicas from all datacenters has responded.
• LOCAL_QUORUM - Returns the record after a quorum of replicas in the same datacenter as the coordinator has responded. Avoids latency of inter-datacenter communication.
• ONE - Returns a response from the closest replica, as determined by the snitch. By default, a read repair runs in the background to make the other replicas consistent.
• TWO - Returns the most recent data from two of the closest replicas.
• THREE - Returns the most recent data from three of the closest replicas.
• LOCAL_ONE - Returns a response from the closest replica in the local datacenter.
• SERIAL - Allows reading the current (and possibly uncommitted) state of data without proposing a new addition or update. If a SERIAL read finds an uncommitted transaction in progress, it will commit the transaction as part of the read.
• LOCAL_SERIAL - Same as SERIAL, but confined to the datacenter. Similar to LOCAL_QUORUM.
Write Consistency (Level - Description)
• ALL - A write must be written to the commit log and memtable on all replica nodes in the cluster for that partition.
• EACH_QUORUM - Strong consistency. A write must be written to the commit log and memtable on a quorum of replica nodes in each datacenter.
• QUORUM - A write must be written to the commit log and memtable on a quorum of replica nodes across all datacenters.
• LOCAL_QUORUM - Strong consistency. A write must be written to the commit log and memtable on a quorum of replica nodes in the same datacenter as the coordinator. Avoids latency of inter-datacenter communication.
• ONE - A write must be written to the commit log and memtable of at least one replica node.
• TWO - A write must be written to the commit log and memtable of at least two replica nodes.
• THREE - A write must be written to the commit log and memtable of at least three replica nodes.
• LOCAL_ONE - A write must be sent to, and successfully acknowledged by, at least one replica node in the local datacenter.
• ANY - A write must be written to at least one node. If all replica nodes for the given partition key are down, the write can still succeed after a hinted handoff has been written. If all replica nodes are down at write time, an ANY write is not readable until the replica nodes for that partition have recovered.
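The relationship between replication factor and consistency level can be worked through with a little arithmetic. The sketch below assumes a single datacenter and covers only the count-based levels; the function names are hypothetical. It relies on the usual rule that a read is guaranteed to overlap the latest write when the replicas read plus the replicas written exceed the replication factor.

```python
def quorum(replication_factor):
    """A quorum is a majority of the replicas."""
    return replication_factor // 2 + 1


# Replicas that must respond for each level (single datacenter assumed).
LEVELS = {
    "ONE": lambda rf: 1,
    "TWO": lambda rf: 2,
    "THREE": lambda rf: 3,
    "QUORUM": quorum,
    "ALL": lambda rf: rf,
}


def replicas_required(level, replication_factor):
    needed = LEVELS[level](replication_factor)
    if needed > replication_factor:
        raise ValueError(f"{level} needs {needed} replicas, "
                         f"only {replication_factor} exist")
    return needed


def read_sees_latest_write(write_level, read_level, replication_factor):
    """True when the read set and write set must share at least one replica."""
    written = replicas_required(write_level, replication_factor)
    read = replicas_required(read_level, replication_factor)
    return written + read > replication_factor


strong = read_sees_latest_write("QUORUM", "QUORUM", 3)   # 2 + 2 > 3
fast = read_sees_latest_write("ONE", "ONE", 3)           # 1 + 1 > 3 is false
```

With a replication factor of 3, QUORUM writes and QUORUM reads always overlap on at least one replica, while ONE and ONE respond faster but can return stale data until the remaining replicas catch up.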
27. Relational vs Cassandra NoSQL – Data Modeling
In relational systems, administrators model the data
In Cassandra, administrators design schemas that are
based on query patterns
28. Cassandra/DataStax Modeling
Cassandra – YOU DESIGN SCHEMAS BASED ON QUERY PATTERNS, THEN DATA RELATIONSHIPS
Maximization of Denormalization
Cassandra/DataStax recommendation = 1 table per query
You are prebuilding answers to unique requests for data!
Overcome data duplication by leveraging extremely fast write performance
• Determine queries accessing data FIRST, then design the data models
• No concept of foreign keys
• No concept of join operations
• Prepare data for fast reads by writing pre-built result sets
• Attempt to minimize reads from multiple partitions
• Cassandra prefers INSERTs over UPDATEs and DELETEs
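The "1 table per query" recommendation can be illustrated with two Python dictionaries standing in for two Cassandra tables. This is a sketch of the modeling idea only; the table names, fields and `add_employee` helper are hypothetical.

```python
from collections import defaultdict

# One "table" per query, each keyed by what that query filters on.
employees_by_id = {}                   # answers "get employee 7"
employees_by_dept = defaultdict(list)  # answers "list employees in dept 10"


def add_employee(emp_id, name, dept_id):
    """Write the same row to every table that serves a query."""
    row = {"emp_id": emp_id, "name": name, "dept_id": dept_id}
    employees_by_id[emp_id] = dict(row)           # duplicated on purpose
    employees_by_dept[dept_id].append(dict(row))  # duplicated on purpose


add_employee(7, "Ann", 10)
add_employee(8, "Raj", 10)
add_employee(9, "Li", 20)

by_id = employees_by_id[7]
in_dept_10 = [row["name"] for row in employees_by_dept[10]]
```

Each read is a single key lookup with no join; the price is an extra write per table, which is the trade the slide describes.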
29. Redis
• In-Memory, Key-Value Database
• Dumping to disk is configurable
• Database handles swapping
• All data can live in memory but key caching is required
• 1 Million Keys = 160 MB
• 10 Million Keys = 1.6 GB
• ATOMIC Operations
• Master-slave replication
• Scalability
• Redundancy
• Slaves
• Can’t respond to queries during initial sync
• Automatically reconnect and resync after outage
• Journal file
• Every write is logged
• Commands replayed when server is started
• Configurable – Can choose between 2 settings
• Eventually consistent - “Speed”
• Immediately consistent - “Safety”
Tumblr, Uber, Coinbase, Flickr, Hulu, Craigslist, Alibaba, Digg
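The journal behaviour listed above (every write is logged, commands are replayed at startup) can be sketched with a toy in-memory store. This is not Redis code: the `TinyStore` class is hypothetical, it supports only SET and DEL, and a Python list stands in for the journal file on disk.

```python
class TinyStore:
    def __init__(self, journal=None):
        self.data = {}
        self.journal = [] if journal is None else journal
        # On startup, replay the logged commands to rebuild memory state.
        for command in list(self.journal):
            self._apply(command)

    def _apply(self, command):
        op, key, *args = command
        if op == "SET":
            self.data[key] = args[0]
        elif op == "DEL":
            self.data.pop(key, None)
        else:
            raise ValueError(f"unsupported command: {op}")

    def execute(self, *command):
        self._apply(command)          # change the in-memory data
        self.journal.append(command)  # then log the write


store = TinyStore()
store.execute("SET", "a", "1")
store.execute("SET", "b", "2")
store.execute("DEL", "a")

# Simulate a restart: a new process starts with only the journal.
restarted = TinyStore(journal=store.journal)
```

After the simulated restart, `restarted.data` matches the state of the original store, because the three logged commands were replayed in order.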
30. Redis Features
• Not a replacement for relational databases but can be used as their “front end”
• Lightning-fast read and write access
• Single-threaded architecture – does not exploit multiple CPUs/cores
• Does not support unit-of-work rollback
• Optimistic locking – data contention (race) will cause transaction failure
• Redis Clusters
• Not able to guarantee strong consistency amongst nodes
• Able to add/remove nodes in a Redis cluster
• Partitioning allows data to be split and stored in multiple Redis instances. Each instance contains a subset of keys
• Range partitioning
• Hash partitioning
• Can be used as a data store or a pure cache
• When used as a cache, can be configured as an LRU (gets rid of old data to make way for new)
• Sensor data
• Redis RDB persistence and backups
• Redis snapshots at specified time intervals = a full database backup
• Move RDB files to other storage
• Write operations in memory can be logged to Append Only Files (AOF)
• appendfsync parameter allows administrator to configure how often log writes are synced to disk
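The LRU cache behaviour mentioned above ("gets rid of old data to make way for new") can be sketched with Python's `OrderedDict`. This illustrates the eviction idea only and is not how Redis implements it; Redis controls eviction through its own memory settings. The `LRUCache` class and its key limit are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, max_keys):
        self.max_keys = max_keys
        self.items = OrderedDict()  # least recently used key comes first

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.max_keys:
            self.items.popitem(last=False)  # evict the least recently used


cache = LRUCache(max_keys=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" is now more recently used than "b"
cache.set("c", 3)  # cache is full, so "b" is evicted
remaining = list(cache.items)
```

Reading "a" before inserting "c" is what saves it: the key that has gone longest without being touched is the one removed.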
31. Neo4j
Walmart, eBay, Cisco, Adobe, CrunchBase, Pitney Bowes, CareerBuilder, TomTom, ConocoPhillips, National Geographic, CenturyLink, Glassdoor, Zephyr Health, Gamesys, Telenor
• Highly scalable, native graph database
• Enterprise and community editions
• Store, manage, analyze, and use data within the context of connections, like the circles and lines drawn on whiteboards
• More than 1 million downloads
• Understanding data relationships is also key to understanding dependencies, uncovering cascading impacts, and predicting behavior
• Access language allows you to traverse relationships in a much simpler, easier-to-understand way than relational SQL
SQL – Dozens of lines
Cypher – Couple of lines
32. Neo4j Features
• Provides graphical browser utility to
better visualize relationships
• Import data from different sources
using rules
• Cypher is another SQL “like” language
• Properties are key-value pairs
• Nodes with properties (node is data, not server)
• Named relationships with properties
• Key – string
• Value – individual data types or array
• Path – connecting relationships, which you traverse using an API
• Schemaless
• Easily able to store unstructured data
• Easily able to store large volumes of data
• Full support for ACID Transactions
• Full indexing capabilities
• Constraint capabilities
• Unique
• Exists (like a Foreign Key with no parent
delete rules)
Find Sushi Restaurants in New York that my friends like
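The sushi question can be answered by walking named relationships between nodes. The sketch below uses plain Python rather than Cypher or a Neo4j driver; the people, restaurants, relationship names and the `restaurants_friends_like` function are all made up for illustration.

```python
from collections import defaultdict

# (start node, relationship name, end node)
edges = [
    ("me", "FRIEND", "Ann"),
    ("me", "FRIEND", "Raj"),
    ("Ann", "LIKES", "Sushi Zen"),
    ("Ann", "LIKES", "Tokyo Roll"),
    ("Raj", "LIKES", "Sushi Zen"),
    ("Raj", "LIKES", "Pasta Bar"),
    ("Sushi Zen", "SERVES", "Sushi"),
    ("Sushi Zen", "LOCATED_IN", "New York"),
    ("Tokyo Roll", "SERVES", "Sushi"),
    ("Tokyo Roll", "LOCATED_IN", "Boston"),
    ("Pasta Bar", "SERVES", "Italian"),
    ("Pasta Bar", "LOCATED_IN", "New York"),
]

# Adjacency lookup: (node, relationship) -> set of connected nodes.
outgoing = defaultdict(set)
for start, relationship, end in edges:
    outgoing[(start, relationship)].add(end)


def restaurants_friends_like(person, cuisine, city):
    """Follow FRIEND, then LIKES, and keep matching restaurants."""
    matches = set()
    for friend in outgoing[(person, "FRIEND")]:
        for restaurant in outgoing[(friend, "LIKES")]:
            serves = cuisine in outgoing[(restaurant, "SERVES")]
            located = city in outgoing[(restaurant, "LOCATED_IN")]
            if serves and located:
                matches.add(restaurant)
    return sorted(matches)


result = restaurants_friends_like("me", "Sushi", "New York")
```

The query follows relationships directly from node to node, which is the step a relational design would express through several joins.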
33. Neo4j Graph Examples
Master Data Management
Graph Based Search
Recommendations
34. NoSQL vs Relational
Relational DBMS
Strengths
• ACID
• Transaction management
• Sophisticated locking and latching
• Power of the SQL Language – Two-phase commits, foreign key constraints, joins, subqueries, integrated aggregations, complex business rule enforcement
• Product maturity
• Robust utilities
• Vendor support
• Most vendors have robust cloud strategies
• Strong third-party software provider adoption (applications, tools and utilities)
Weaknesses
• Product purchase/support costs
• Scalability can be complex and expensive
• Data normalization can impact performance
• Schemas are not flexible
• Not all data fits neatly into rows and columns
• Geographic distribution can be complex
35. NoSQL vs Relational
NoSQL DBMS
Strengths
• Dynamic schema flexibility
• Faster development times
• Total cost of ownership
• Easily stores semi, non and fully structured data
• Horizontal and vertical scalability
• Geographic replication and data distribution
• Easier to achieve high performance accessing large volumes of data
• Custom tailor environment to data storage and processing needs
• Cost effective clustering
Weaknesses
• Crude transaction management and locking mechanisms (BASE vs ACID)
• Limited cloud offerings
• Vendor support (or lack thereof)
• Data is often denormalized leading to duplicate updates
• Weak access languages
• No inherent data integrity enforcement mechanisms
36. NoSQL vs Relational
Relational DBMS
• Transactions – COMPLEX
• Data – STRUCTURED AND STATIC
• Data Velocity – MODERATE TO HIGH
• Data Locations – FEWER THE BETTER
• Data Volumes – MAINTAIN BY PURGING
• Data Availability – CLUSTER, LOG SHIPPING
• Data Performance – FOCUS ON READS
NoSQL DBMS
• Transactions – SIMPLE
• Data – FULL/SEMI/NON STRUCTURED DYNAMIC
• Data Velocity – HIGH TO ASTRONOMICAL
• Data Locations – MANY LOCATIONS
• Data Volumes – RETAIN FOREVER
• Data Availability – INHERENT ARCHITECTURE
• Data Performance – FOCUS ON READS/WRITES
37. Questions and Additional Information
cfoot@rdx.com
Next Month’s Presentation – Evaluating and Selecting Cloud
Database Management Systems
The RDX Report
Is NoSQL the Natural Progression of DB Technology?, Cloud’s Hidden Impact on
IT Support, SQL Server 2016 Licensing Best Practices, The Rise of Corporate
Ransomware
LinkedIn
Selecting Cloud DBMS, NoSQL Architectures, Database Security Series,
Improving Customer Service
20 YEARS OF SERVICE DELIVERY EXPERIENCE