Evolution of Distributed
Operational Database
Technologies in the Digital Era
An architectural and digital fitment analysis
Vishal Puri
Executive Architect
Data Platforms Services
IBM GBS Cognitive Business Decisions Support
Email - vpuri@us.ibm.com
Mobile - 4692197510
Table of contents
• Overview
• Characteristics of Distributed Databases
• Distributed Database Models
• The Incumbents
• The Challengers
Executive Brief
The use of distributed databases (also called NoSQL DBs) for supporting operational processes, including
operational intelligence, is increasing as companies adopt digital process engagement with their
consumers and digitize internal and partner-facing operations. With digitization come challenges of
scale, flexibility and agility that distributed databases are uniquely positioned to address.
Distributed database marketplace
• Distributed databases have been in the market for a long time, with mature solutions that have seen
significant enterprise adoption; examples include MongoDB, Cassandra, Redis and HBase.
Incumbent leaders continue to innovate on their feature sets, some more successfully than
others, such as MongoDB (with MongoRocks).
• The market leaders are saddled with legacy architectural debt, which has opened the door to new
challengers that provide attractive propositions and make seamless replacement of existing
solutions a key architectural goal. These solutions include ScyllaDB, Aerospike, Accumulo and Yugabyte,
among others.
• While this paper does not touch upon public-cloud-only offerings (Amazon DynamoDB, Microsoft
Cosmos or Google Spanner), there is significant interest in these offerings in the marketplace. There
is also significant interest in adopting solutions that are cloud native and multi-cloud enabled, with
as-a-service offerings from multiple vendors.
Database practitioner recommendations
• Data Architects must keep on top of the changing landscape and innovations. As this paper
articulates, the best way forward is to understand distributed database architecture patterns. This
paper also postulates a framework to break down the characteristics of any distributed database for
comparison purposes and to gauge its suitability for specific architectures.
• Most distributed databases require upfront data modeling, and data modeling as a discipline for
implementing such databases is a skill that must be nurtured in technology organizations.
• With changing landscapes and use-case-specific implementations, it is important for database
DevOps SMEs to be multiskilled, especially with respect to database operations in a cloud environment.
Core Audience and Goals
Who is the core audience?
• The core audience for this document is a practicing architect providing solutions for digital business
functions.
What are the key goals?
• Explore core architectural patterns for distributed databases.
• Explore a class of distributed databases that are used to enable digital engagement use cases.
• Establish an architectural framework to determine suitability of specific Distributed DBs in addressing
various digital use cases.
• Apply the above architectural framework in assessing market leading DBs as well as emerging
challengers in the marketplace.
• Enable deep architectural thinking as opposed to providing convenient but shallow answers.
What is not a goal?
• Providing a simple decision tree for choosing a database.
• Direct and convenient comparison of all DBs.
Which classes of NoSQL DBs are not covered?
• Any Relational or hybrid relational + NoSQL DB – Postgres, MySQL, DashDB.
• Any solution that only scales vertically or via replication (MySQL, Postgres, Aurora etc).
• Graph Databases or purpose-built time-series DBs – Neo4j, OrientDB, Titan etc.
• Analytics oriented NoSQL solutions – Hive, Impala, Drill, Spark SQL etc.
• Search oriented solutions – Solr, Elasticsearch etc.
What problems do Distributed Databases solve in
the digital world?
Internet Scale
• By making compromises along the consistency and availability spectrum, NoSQL databases
can scale horizontally and can be distributed geographically. This was not possible with
traditional relational DBs with ACID semantics and enforced referential integrity (although
some distributed NewSQL architectures such as Spanner challenge this).
Insights Driven Digital Engagement
• Enable alternate data models that more naturally represent problem domains in the dynamic
digital world (as opposed to force-fitting everything into a predefined relational DB), thus
enabling more efficient queries for operational insights and interactions:
• Dynamic customer attributes and activities
• Time series data – sensors, stock prices
• Session caches
• Searchable logs
• Large collections – queues, lists, maps, counters etc.
• Network and semantic graphs
• Real-time counters
• Global customer transaction databases
• Overview
• Characteristics of Distributed Databases
• Distributed Database Models
• The Incumbents
• The Challengers
Common characteristics of distributed databases
1. Horizontally scalable distributed database, for internet-scale use cases
2. High availability and resiliency built in
3. Data is partitioned and replicated across the cluster
4. Data model is some form of multi-dimensional map
5. Primarily optimized for operational use cases
How to dissect and analyze fitment to purpose of a Distributed DB
Consistency, Availability, and Latency
• Eventual vs strong consistency
• What happens on network partitions or server failures
• Write optimized, read optimized or throughput optimized

Data Model
• Wide Column / Key Value / Document
• Static vs dynamic typing

Horizontal Scaling Strategy
• Hash-based partitioning / ordered partitioning / replication
• Load-balancing reads and writes

Operational Analytics and Search
• Support for secondary indexes – transactionally consistent indexes / external indexes / distributed indexes
• Support for range scans
• Support for joins
• Support for counters and aggregation

Replication
• Master-slave / distributed
• Cross-datacenter replication support

Management Operation Behavior – Add/Remove Node
• Data redistribution, master election

Storage Support
• Tiered vs memory-centric vs disk-centric vs flash-centric

Developer Friendliness
• Client APIs – REST API / CQL / Java library / JSON / other
• Popularity score in DB-Engines

Handling Updates
• Locking vs MVCC
• Partial vs full update

Technology Used
• Open source vs closed source
• Java/JVM based vs C/C++ based
• Cross-platform vs Linux optimized

Security
• DB and schema level security – Read/Write/Delete/DDL
• Data-level security – row/column/cell level security
• LDAP/Kerberos integration

Ease of Setup and Scale
• Minimum infrastructure required to start
• Ease of scaling
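One way to apply this framework is to capture each dimension for a candidate database and compare the resulting profiles side by side. The sketch below is illustrative only; the field names and the example values shown for Cassandra are assumptions for illustration, not a scoring from this paper.

```python
# Illustrative sketch: encoding the fitment framework as a simple data structure so
# candidate databases can be captured and compared side by side.
from dataclasses import dataclass, field

@dataclass
class DbFitmentProfile:
    name: str
    consistency: str          # e.g. "eventual", "tunable", "strong"
    data_model: str           # e.g. "wide column", "document", "key-value"
    scaling: str              # e.g. "consistent hashing", "range sharding"
    secondary_indexes: bool
    range_scans: bool
    replication: str          # e.g. "masterless", "master-slave", "cross-DC"
    storage: str              # e.g. "disk", "memory", "tiered"
    notes: list = field(default_factory=list)

# Example profile (values are illustrative, not an official assessment):
cassandra = DbFitmentProfile(
    name="Cassandra",
    consistency="tunable (eventual by default)",
    data_model="wide column",
    scaling="consistent hashing",
    secondary_indexes=True,
    range_scans=False,
    replication="masterless, cross-datacenter",
    storage="disk",
    notes=["secondary indexes are local to each node"],
)
```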
The Consistency Availability Spectrum for Distributed Databases – CAP
Theorem
CAP Theorem
Only 2 of the 3 properties - Consistency, Availability, and Partition-
tolerance can be satisfied by a distributed database.
Consistency: A read operation is guaranteed to return the most recent
write
Availability: Any operation is guaranteed to receive a response saying
whether it has succeeded or failed
Partition tolerance: The system continues to operate when a network
partition occurs
While the CAP theorem is not considered to be sufficient* to articulate
the behavior of a distributed database it does make for a useful
classification in making high level decisions about a distributed database
choice.
* https://arxiv.org/pdf/1302.0309.pdf
An alternative to understanding distributed systems is the PACELC
Theorem, described in the original paper as -
“if there is a partition (P), how does the system trade off availability and
consistency (A and C); else (E), when the system is running normally in
the absence of partitions, how does the system trade off latency (L) and
consistency (C)?”
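The latency/consistency dial that PACELC describes shows up concretely in quorum-based, Dynamo-style systems (Cassandra, Riak, etc.) through tunable read and write consistency levels. The sketch below is not from the paper; it simply illustrates the classic quorum rule that when R + W > N, read and write quorums must overlap, so reads see the latest acknowledged write.

```python
# Illustrative sketch of the Dynamo-style quorum rule.
# N = replication factor, W = write acknowledgements required, R = replicas consulted on read.

def is_strongly_consistent(n_replicas: int, write_quorum: int, read_quorum: int) -> bool:
    """True if every read quorum overlaps every write quorum (reads see latest write)."""
    return read_quorum + write_quorum > n_replicas

# Typical settings for a replication factor of 3:
print(is_strongly_consistent(3, 1, 1))  # False: lowest latency, but reads may be stale (eventual)
print(is_strongly_consistent(3, 2, 2))  # True: QUORUM writes + QUORUM reads overlap
print(is_strongly_consistent(3, 3, 1))  # True: write to ALL, read from ONE
```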
The Distributed Operational DB Architecture Spectrum
Architecture categories and example databases:
• Wide Column Store – BigTable model: Google BigTable, HBase, Accumulo, Cassandra, Scylla
• Wide Column Store – Dynamo model: DynamoDB, Cassandra, Scylla
• Key-Value Store: Redis, Memcache, Aerospike
• In-Memory Data Grid: Hazelcast, Ignite, Coherence
• Clustered / Sharded / Distributed SQL DB: MySQL NDB Cluster, Citus DB (Postgres), MemSQL, Google Spanner, CockroachDB, YugaByte DB, Azure Cosmos DB
• Document Store: MongoDB, Couchbase, CouchDB, MarkLogic
• Graph / RDF Store: OrientDB, Titan
• Search: Elasticsearch, Solr
• There is no perfect all-purpose distributed database.
• Each distributed database has made specific architectural choices and compromises so as to be best suited for a narrow range of use cases.
• Fortunately, each architectural category has several DBs to choose from, providing us with options no matter what our use case is.
• Architectures continue to evolve, with newer distributed databases addressing the shortcomings and technical debt of existing leaders by making better ground-up architectural decisions and technology choices.
Data Modeling in a Distributed Database
Along with making choices on availability, consistency and latency, every
database offering makes fundamental architectural choices regarding the
data models it would like to enable and the associated restrictions it needs
to put in those data-models to honor the laws of computational physics.
Most of the distributed database data model characteristics have evolved
based on popular and expanded usage scenarios along with new
developments in Hardware architectures. Key characteristics of a Data
Model to look for are –
1. What is the fundamental data structure ? – Wide Column,
Document, Key Value, Object Collections
2. How is the data distributed/sharded in the cluster? – Random
partitioning Or Ordered partitioning, Range partitions or Hash
Partitions
3. Does the data model support clustering or ordered data storage
within a server to support scans? – Clustering Keys, Composite
primary Keys, Ordered Storage
4. Does the data model support secondary indexes? If so, are the
secondary indexes transactionally consistent? Are the secondary
indexes efficient in supporting a search without a primary key
reference?
5. Does the data model support CRDTs (Conflict-free replicated data
types) – These are especially useful in eventually consistent DBs like
Cassandra, Riak and Aerospike to enable reliable distributed
updates. For example Counters, Sets etc
6. Does the data processing engine support server side operations
such as stored procedures, map-reduce functions or triggers
Needless to say, data modelling in a Distributed NoSQL or newSQL
database is very different from modeling a relational schema. One
requires a shift in mindset -
• View the data as a composite dataset driven by narrow access
patterns.
• Since multi-row or multi-document transactions are at a premium,
having denormalized or nested data is preferred
• Since joins of data are either not supported or prohibitively expensive,
plan for manual joins in the application
• Design of the primary key is the most important part of modeling and
getting it right will dictate
• Ease of access to the data for most access patterns
• Scalability of reads and writes across the cluster
• Ability to have shared multi-tenant database
• Referential integrity is not maintained by the database in most
solutions (barring distributed SQL DBs like Spanner), and the database
should not be relied on for maintaining it
• Know your CRDTs
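To make the primary-key-driven, access-pattern-first modeling mindset concrete, here is a minimal sketch using the DataStax Python driver. The keyspace, table and column names are assumptions for illustration; the point is that the partition key drives data distribution and the clustering column drives ordering within a partition, so one narrow query becomes efficient.

```python
# Minimal sketch (assumed names; keyspace "activity" assumed to already exist).
# Access pattern modeled: "fetch the most recent activities for a user on a given day".
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("activity")

session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_user_day (
        user_id        text,
        day            text,
        activity_time  timestamp,
        action         text,
        details        text,              -- denormalized payload; no joins at read time
        PRIMARY KEY ((user_id, day), activity_time)
    ) WITH CLUSTERING ORDER BY (activity_time DESC)
""")

# The query hits exactly one partition and reads rows already stored in time order.
rows = session.execute(
    "SELECT activity_time, action FROM events_by_user_day "
    "WHERE user_id = %s AND day = %s LIMIT 20",
    ("user42", "2018-06-01"),
)
```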
Distributed DB Engine Popularity*
• Popularity in the marketplace and developer
mind-space is often a consideration in selecting
a Distributed DB of choice
• A very high popularity score often reflects the
ease of setup and developer friendliness of the
solution, rather than merits of the solution at
scale
• MongoDB, Redis and Cassandra are the incumbent
leaders in this space
• However, cloud centric offerings (Dynamo,
Cosmos) along with newer entrants such as
Scylla, Aerospike and Yugabyte are on the rise
• While the popularity may not reflect the fit of
the DB for the use case, it does reflect on
availability of skills in the marketplace
• Stagnation in the popularity of a DB also indicates
the emergence of alternatives that are addressing
shortcomings of that solution. It may also reflect
the stagnation of the ecosystem itself; e.g.
Hadoop adoption stagnation impacts adoption
of Accumulo and HBase
* https://db-engines.com/en/ranking_definition
* https://db-engines.com/en/ranking_trend
Distributed DB architecture limitations and consequences
Category: Inefficient Processing
• Issue: Use of Java/JVM technologies. Consequence: garbage collection stops, inefficient memory usage, complexity in operations and expensive at scale. Alternative solutions: use C/C++ frameworks; use and manage off-heap memory where feasible.
• Issue: Not optimized for OS/hardware. Consequence: uses generic OS caches and default kernels that do not optimize for target use cases and access patterns, leading to large installations at scale. Alternative solutions: optimize for multi-core, NUMA architectures, L1-L2 caches, vector processing etc.
• Issue: Not optimized for storage. Consequence: uses a single storage system that does not provide adequate flexibility between speed and cost. Alternative solutions: intelligent use of a multi-tiered storage strategy, dynamically moving data between flash, RAM and disk in order to optimize based on access pattern.

Category: Flexibility
• Issue: Fixed access patterns based on primary or shard keys. Consequence: multiple access patterns require changes to data model design and/or duplication of data. Alternative solutions: support for a scalable secondary index implementation; support for range scans.
• Issue: Fixed schemas. Consequence: adding schema elements is an operationally complex process. Alternative solutions: support for easy schema evolution through flexible schemas with dynamic attributes.

Category: Operational complexity
• Issue: Eventual consistency. Consequence: results in stale data across the distributed nodes that needs to be reconciled and repaired. Alternative solutions: strongly consistent solutions; incremental and fast repair tools that do not impact availability/performance; support for distributed transactions.
• Issue: Data redistribution. Consequence: adding or removing nodes causes redistribution of large volumes of data, which impacts performance and takes a long time to complete. Alternative solutions: architectures that minimize data movement for scaling and load balancing.
References
• Jepsen – A tool to understand Distributed database Availability and Consistency characteristics –
• https://aphyr.com/tags/jepsen
• https://aphyr.com/posts/343-scala-days-2017-jepsen-keynote
• Conflict-Free Replicated Data Types (CRDT) - https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type
• Overview
• Characteristics of Distributed Databases
• Distributed Database Models
• The Incumbents
• The Challengers
Wide-Column - Big Table
Architecture
• Provides horizontal scalability, immediate consistency and network partition tolerance, at the cost of loss of availability in some scenarios
• Requires Master Server for metadata
• Write optimized as no disk seek is needed – Write to log and Memtable, flushed periodically to SSTable on disk. Many SSTables on disk – 1
per generation/version
• Merged reads – read from Memtable and SSTable.
• Compaction - Many SSTables increase read time, therefore compaction occurs to reduce disk seek time. Compaction = multiple SSTables
combined into single SSTable at O(logN) rate
• Relies on distributed file system for durability
Data model
• Column Family oriented
• Supports sparse data – lots of null values
• File per column family. Data sorted per column family by row id, column name and timestamp. Sorted data is compressed
• Cell is smallest addressable unit – row, column. Data in each cell has configurable number of versions
Best used for
• high throughput writes, where read to write ratio is balanced
Example
• HBase, Cassandra, Scylla
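The BigTable write path described above (write to log and memtable, flush to immutable SSTables, merged reads, compaction) is essentially a log-structured merge tree. The sketch below is illustrative only and not any real engine; it shows how writes avoid disk seeks while reads merge the memtable with SSTables.

```python
# Minimal, illustrative LSM-style store: memtable + immutable sorted "SSTables".
class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}            # in-memory map (a real engine keeps it sorted)
        self.sstables = []            # immutable flushed snapshots, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # A real engine would first append to a commit log for durability.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Flush the memtable to a new sorted, immutable SSTable (one per generation).
        self.sstables.append(dict(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        # Merged read: memtable first, then SSTables from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            if key in sstable:
                return sstable[key]
        return None

    def compact(self):
        # Compaction merges many SSTables into one, keeping the newest value per key,
        # which reduces the number of files a read has to consult.
        merged = {}
        for sstable in self.sstables:
            merged.update(sstable)
        self.sstables = [dict(sorted(merged.items()))]
```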
Wide-Column - Dynamo
Architecture
• Enables high availability and partition tolerance with horizonal scalability, at the cost of immediate consistency, by enabling eventual
consistency models
• Dynamo allows read and write operations to continue even during network partitions and resolves update conflicts using different conflict
resolution mechanisms, some client driven.
• Dynamo's Gossip based membership algorithm helps every node maintain information about every other node.
• Dynamo can be defined as a structured overlay with at most one-hop request routing.
• Dynamo detects update conflicts using a vector clock scheme, but prefers a client side conflict resolution mechanism. A write operation in
Dynamo also requires a read to be performed for managing the vector timestamps. This can be very limiting in environments where
systems need to handle a very high write throughput.
Data model
• Item model – {Key – Attribute Maps}
Example
• DynamoDB, Riak, Cassandra, Scylla
Document Stores
Architecture
• Wide variety of architectures –
• from strongly consistent (MongoDB, Marklogic, YugaByte) to eventually consistent (CouchDB, Cloudant)
• Automated sharding with fixed shards (MongoDB) to truly distributed and dynamic sharding (YugaByte, CouchDB)
• Read Scaling and fault tolerance with dedicated Replica Sets (MongoDB) to sharded replication with distributed read and writes
• Enables dynamic schemas, leaving data validation and integrity to be largely driven by applications
• Data model is usually JSON in text or Binary form (BSON) or XML
• Most distributed document databases only support local secondary indexes forcing one to search all nodes in a cluster when primary key is
not part of the query
Data model
• Item model – {Key – Binary JSON}
Example
• MongoDB, CouchDB, Cloudant, MarkLogic, YugaByte
In Memory Key-Value Store
Architecture
• Most solutions (Memcache, Redis) in this category have roots in single server data storage with a focus on speed and throughput.
Application driven sharding, as well as lack of availability and data consistency guarantees has been the rule rather than the exception
• Some solutions such as Redis Cluster evolved to provide horizontal scalability in addition to performance, without providing true
availability or consistency guarantees under network partitions.
• Some solutions such as Aerospike support tunable consistency with varying degrees of availability, i.e., always available with simple conflict
resolution for eventual consistency or primary key consistency with lower availability
• Master slave architectures
• Automated or manual sharding
• In Memory storage of keys, Values stored in memory or on other storage media (Disk/Flash)
Data model
• Key – Value Pairs, support for complex data structures such as Lists and Nested Maps
Example
• Memcache, Redis, Redis Cluster, Aerospike, Riak KV
In Memory Data Grid
Architecture
• In – Memory cache
• Configurable Replication
• Limited ACID Compliance through transactions
• Configurable Consistency and availability semantics with limited Network partition tolerance
Data model
• Distributed Collections – Maps, Sets, Lists
Example
• Hazelcast, Ignite (GridGain), Coherence
Distributed ACID compliant DBs - Spanner
Architecture
• The architecture was mostly developed to take care of three problems at massive scale (Trillions of rows):
• Global Consistency – if a record is updated in one region (say, Asia), someone reading in another region (say, Africa) will see the same updated
record on reading. This is supported through synchronous globally distributed transactions using the TrueTime API and atomic clocks
supported by GPS, in order to prevent any time inconsistency
• Table-Like Schema – Support for table like schema via rows and columns, backed by a key-value store
• SQL Query Language – A SQL query parser to support SQL clients
• In order to support the above characteristics this architecture sacrifices support for raw performance required by low latency use cases and
incurs possible additional latency with every new node added
• The TrueTime API along with the atomic clock infrastructure ensures that data is consistent within a very small time interval (~1-7 ms)
• Spanner has three main types of operations: read-write transaction, read-only transaction and snapshot read at a timestamp, which are all
supported on a globally distributed database
• Spanner provides first class support for unified cross data center and cross availability zone data clusters. It enables this via concepts of
global indexes, zone masters, span servers (which are similar to HBase region servers) and use of paxos for fine grained cluster consensus
Data model
• Item model – Table like schemas implemented using timestamped key value data structure
Use case
• Data Consistency in a global deployment, across many data centers. E.g. Global digital banking or any global ecommerce application
• Very large data sets with relaxed read write latencies (>20 ms)
• Blockchain and crypto currency implementations
Example
• Spanner, CockroachDB, Azure Cosmos DB, YugaByte, NuoDB
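A minimal sketch of two of the three Spanner operation types described above, using the google-cloud-spanner Python client. The project, instance, database and table names are assumptions for illustration.

```python
# Minimal sketch (assumed instance/database/table names): run_in_transaction gives a
# globally consistent read-write transaction; snapshot() gives a lock-free read at a
# consistent timestamp.
from google.cloud import spanner

client = spanner.Client()
database = client.instance("global-banking").database("ledger")

def transfer(transaction):
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance - 100 WHERE AccountId = 'A'")
    transaction.execute_update(
        "UPDATE Accounts SET Balance = Balance + 100 WHERE AccountId = 'B'")

database.run_in_transaction(transfer)        # read-write transaction

with database.snapshot() as snapshot:        # snapshot read at a timestamp
    for row in snapshot.execute_sql("SELECT AccountId, Balance FROM Accounts"):
        print(row)
```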
References
• Google BigTable Paper - https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
• Amazon Dynamo Paper - https://www.dynamodbguide.com/the-dynamo-paper/
• Spanner Architecture - https://kunigami.blog/2017/04/27/paper-reading-spanner-googles-globally-distributed-database/
• Comparing scaleout sql - https://blog.yugabyte.com/practical-tradeoffs-in-google-cloud-spanner-azure-cosmos-db-and-yugabyte-db-
ce720e07c0fd
• Overview
• Characteristics of Distributed Databases
• Distributed Database Models
• The Incumbents
• The Challengers
Cassandra - Overview
Cassandra is a distributed, eventually consistent database which follows the
BigTable wide column format and the Dynamo masterless “ring” distributed
storage architecture
Data Objects in Cassandra
• Keyspace – a container for data tables and indexes; analogous to a database in
many relational databases. It is also the level at which replication is defined.
• Replication factor − Number of machines in the cluster that will receive
copies of the same data.
• Replica placement strategy − Strategy to place replicas in the ring.
• Column families − Keyspace is a container for a list of one or more column
families. A column family, in turn, is a container of a collection of rows.
Each row contains ordered columns. Column families represent the
structure of your data. Each Keyspace has at least one and often many
column families.
• Row key – used to identify a row uniquely in a Column Family and also distribute
a table’s rows across multiple nodes in a cluster.
• Index – similar to a relational index in that it speeds some read operations; also
different from relational indices in important ways.
Query Language – Cassandra Query Language (CQL)
Architecture highlights – masterless ring architecture that is multi-datacenter and cloud deployable; a Keyspace acts as a container for Column Families; a Column Family consists of a Row Key and Columns
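A minimal sketch (assumed node addresses, datacenter names and schema; DataStax Python driver) showing how the keyspace-level replication factor and replica placement strategy described above are declared, and how a table (column family) lives inside the keyspace.

```python
# Minimal sketch (assumed names): replication is defined at the keyspace level,
# here as 3 copies in each of two datacenters using NetworkTopologyStrategy.
from cassandra.cluster import Cluster

session = Cluster(["10.0.0.1", "10.0.0.2"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS customer360
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc_east': 3, 'dc_west': 3}
""")

session.execute("""
    CREATE TABLE IF NOT EXISTS customer360.profile (
        customer_id text PRIMARY KEY,      -- row key: identifies the row and places it on the ring
        name        text,
        segment     text
    )
""")

session.execute(
    "INSERT INTO customer360.profile (customer_id, name, segment) VALUES (%s, %s, %s)",
    ("c-1001", "Ada", "premium"),
)
```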
Cassandra – Features and limitations
Consistency, Availability, and Latency
•Provides very high availability and eventual
consistency based on Dynamo architecture
•Remains available in the face of network partitions
• Optimized for very low latency writes.
• However, reads incur cost through quorum checks, if
read consistency levels are set to >1
Data Model
•Adopts the BigTable Wide Column data model
•Enables Clustering columns to provide order within a
partition
•Supports secondary indexes as well
•Schema needs to be modeled upfront, including
column families and columns
Horizontal Scaling Strategy
•Supports scalability via consistent hashing and random
spread of data across the cluster.
• Large number of nodes in a cluster (>100) causes
excessive inter node communication. Similarly large
number of tables in a cluster also causes issues
Operational Analytics and Search
• Since data is randomly distributed, range scans are not
supported on primary keys or on secondary indexes,
• All clustering columns must be provided in a query
• Secondary index lookups require querying all partitions
• Materialized views are supported but are very limited
in their function
• Good support for distributed Counters
•Most common implementation is to enable secondary
indexes using Elastic Search connectors. These are not
strongly consistent indexes
Replication
• Very good support for cross datacenter replication out
of the box
Management Operation Behavior –
Add/Remove Node
• Adding a node can take a long time (~days) because data must replicate. Using more, smaller servers alleviates this
•Node repair is a costly operation that can impact
resources used to serve data requests. Incremental
node repair can minimize this impact but is only
available in the enterprise version
•Similarly data compactions can also impact online
performance
Storage Support
•Primarily uses hard disk for persistent storage. Node
density of no more than 2 TBs, although 1 TB is
recommended. This can lead to large cluster sizes
quickly
Developer Friendliness
• Most popular wide column DB and 3rd most popular NoSQL DB after MongoDB and Redis
• Supports a SQL-like language called CQL, used for queries and DDL
Handling Updates
• Supports lightweight transactions through compare-and-set features.
• There is no support for multi-row transactions. Must
eliminate need for multi-row and multi-table
transactions through appropriate data modeling
techniques
Technology Used
•Java based implementation
• Need to monitor and tune JVM garbage collection and
heap usage constantly
• Offheap memory can also be allocated for memtables,
bloom filters etc
Security
• Row-level access control is supported, but requires exact string matches and the DataStax distribution
• Column-level security is not yet implemented
Ease of Setup and Scale
•Requires minimal setup to get started.
•Scales from small to large datasets seamlessly.
•Unlike HBase, all nodes are equal
Cassandra - Practices and Use Cases
Best Practices
• One must commit to upfront design based on known use cases and
data access paths. If the use case evolves and there are multiple
optimal access paths to the data required then it will almost certainly
have to be supported through duplication of at least a subset of data
• It is common to pair Cassandra with an implementation of Solr or
Elastic search in order to enable secondary indexes
• It is recommended to keep Cassandra Cluster sizes to a moderate
number of servers (<150 ) with at most 2 TB disk size
• Cassandra requires an operations team that will work towards
monitoring and optimizing of the settings and infrastructure on a
regular basis, apart from monitoring complex operations such as
splitting, compaction and node repair
When to consider
• For application features or microservices that serve a narrow use case
and access pattern, for large data sets and users
• For maintaining global or fine grained counts through counters
• Suited well for applications and feature sets that need to start small
but grow with increased usage
• Ingesting large volumes of data with high throughput and low latency
• Data services across geographies and data centers
Use cases
• Consumer activity and extended profile DB for recommendation and
personalization
• Web analytics data such as clickstreams and counters
• Graph DB for social network analysis, fraud analysis
• Storage for IOT data
• Storage for firehose datasets
• E-commerce carts and checkout
• Product catalogs and playlists
When not to consider
• For applications that require flexible access patterns through sql like
queries, range scans etc
• For applications that require a flexible and dynamic schema
• As an analytics DB. Cassandra is often paired with Spark to perform
analytical operations, however Cassandra does not provide any
benefits to the spark engine such as query execution or data set
filtering/pruning
HBase - Overview
HBase is a distributed strongly consistent database which follows the BigTable
architecture and wide column format, and uses HDFS for storage, while providing low
latency read and write access
Data Objects in HBase
• HBase Tables – Logical collection of rows stored in individual partitions
known as Regions.
• HBase Row – Instance of data in a table.
• RowKey -Every entry in an HBase table is identified and indexed by a
RowKey.
• Columns - For every RowKey an unlimited number of attributes can be
stored.
• Column Family – Data in rows is grouped together as column families
and all columns are stored together in a low level storage file known as
HFile.
Query Language – SQL via Phoenix or Java API
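A minimal sketch of HBase access from Python via its Thrift gateway, using the happybase library. The Thrift host, table name and column family are assumptions; the table with a column family "d" is assumed to exist. It shows a write, a point read by RowKey, and the ordered range scan that HBase's sorted-by-row-key storage makes efficient.

```python
# Minimal sketch (assumed host/table/column-family names): happybase talks to HBase
# through its Thrift gateway. Row keys are byte strings stored in sorted order.
import happybase

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("sensor_readings")

# Write: one column family ("d"); column qualifiers are free-form per row.
table.put(b"sensor42#2018-06-01T10:00:00", {b"d:temp": b"21.4", b"d:humidity": b"53"})

# Point read by RowKey.
row = table.row(b"sensor42#2018-06-01T10:00:00")

# Range scan: all readings for sensor42 on 2018-06-01, returned in key order.
for key, data in table.scan(row_start=b"sensor42#2018-06-01",
                            row_stop=b"sensor42#2018-06-02"):
    print(key, data)
```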
HBase – Features and limitations
Consistency, Availability, and Latency
•Provides strong Consistency at the expense of
Availability, based on the BigTable architecture
• Availability can be increased by adding read replicas
on failure of the primary replica
• Reads and writes are optimized for throughput
rather than latency, since the storage is dependent
on HDFS.
•Can provide MySQL like latencies at scale
Data Model
•Adopts the BigTable Wide Column data model
•Data is stored and ordered by row key
•Fixed column families – No more than 2 column
families advised. Within a column family flexible
number of columns are allowed
Horizontal Scaling Strategy
•Supports scalability via auto sharding. Basic unit of
sharding is region. Regions are created and merged
depending upon a configurable policy
•Writes are bound to a primary region and unlike
reads, are not load-balanced to the replicas
• Metadata does not split and scale
• Can easily handle billions of rows X millions of
columns
Operational Analytics and Search
• Stores keys in ordered fashion. Good at range scans
and sorts
•Query predicate push down via server side scan and
get filters
• Most common implementation is to enable
secondary indexes using Phoenix or Solr connectors.
These are not strongly consistent indexes
Replication
•Support for synchronous replication, as of HBase 2.0,
as well as eventually consistent replication to read
only replica
Management Operation Behavior –
Add/Remove Node
•Rolling restart for configuration changes and minor
upgrades
•Region Splits caused by data growth results in loss of
availability as Region of data is forced offline to
enable movement of data
Storage Support
•HBase can use tiered storage across RAM, SSD and
disk
•Uses HDFS for storage – HDFS is optimized for
throughput, not latency
•Uses Region Server as a middleware for
implementing Cache and mediating access – Adds
network latency
Developer Friendliness
• 2nd most popular wide column DB, second only to
Cassandra
•Supports Thrift, REST and Java Client APIs
•HTTP supports XML, Protobuf, and binary
Handling Updates
•Supports single row transactions for safe atomic
updates, including safe server side row locks
• No support for cross-table or multi-row transactions.
However, alternatives exist via external libraries such
as OMID
Technology Used
•Java based implementation with offheap read and
write capability (as of HBase 2.0)
Security
•HBase now supports cell level security via Co-
Processors. Supports Kerberos integration
Ease of Setup and Scale
•Requires considerable starter infrastructure with a
minimum of 3 data nodes and 3 master nodes
•Works well with large data sets. Accessing Small to
medium sparsely distributed data is not performant
HBase - Practices and Use Cases
Best Practices
• One must commit to upfront design based on known use cases and
data access paths. If the use case evolves and there are multiple
optimal access paths to the data required then it will almost certainly
have to be supported through duplication of at least a subset of data
• It is common to pair HBase with an implementation of Solr or Elastic
search in order to enable secondary indexes
• HBase requires an operations team that will work towards monitoring
and optimizing of the settings and infrastructure on a regular basis,
apart from monitoring complex operations such as splitting and
merging of data, backups and multi-cluster replication
When to consider
• When the Hadoop/HDFS stack is well adopted
• Data sets that are naturally ordered (time series data) such as sensor,
stock prices, IOT etc
• Large Data Processing using MapReduce like algorithms
• Hybrid Operations and Analytics along with Hadoop
• Real time high performance scans of large join-less dataset
• Large master data sets with frequent updates
Use cases
• Consumer activity and extended profile DB for recommendation and
personalization
• Web analytics data such as clickstreams and counters
• Log processing
• Time Series data storage and analysis for network and sensor data
• Graph DB for social network analysis, fraud analysis
• Product Price by day by Store, location and competitor
• https://blogs.apache.org/HBase/entry/HBase-application-archetypes-
redux
When not to consider
• General purpose database with support for multiple access patterns and
an evolving schema.
• Small data sets – Does not perform comparatively well for small data
sets as it has significant overhead. Consider a minimum of 5 data nodes
of storage before using HBase
• Need guarantees of very low latencies – Due to overheads and possible
blocking operations (region splits, garbage collection), low latency
guarantees cannot be given
• Need Complex 2 phase commits across database tables or across
resources
• No existing or planned implementations of Hadoop/HDFS
Redis Cluster - Overview
Redis is an in-memory Key-Value data store built for low latency access. Redis Cluster
is a distributed database that automatically shards Redis key-value data structures
across a cluster of nodes, in a master-slave HA architecture
• Redis Cluster was designed as a general solution for high availability and
horizontal scalability while keeping the core focus of Redis in mind - low latency
and a strong data model. Because of this, Redis Cluster implements neither true
availability nor consistency of the CAP theorem
Data Objects in Redis
• Redis supports any Key Value data structure, where the value data structure may
be of type String, HashMap, List, Sorted Set, Set, Bitmap or HyperLogLog
• Redis Cluster places restrictions on multi-key operations and the ability to
support multiple databases in a single cluster
Query Language – Via client side libraries in multiple languages
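A minimal sketch of the data structures called out above (counters, sorted sets, expiring keys) using redis-py's cluster client, available in recent redis-py releases. The node addresses are assumptions for illustration.

```python
# Minimal sketch (assumed node addresses): the cluster client hashes each key to a
# hash slot and routes the command to the node owning that slot.
from redis.cluster import RedisCluster   # redis-py 4.x; older setups used the redis-py-cluster package

r = RedisCluster(host="redis-node-1", port=6379)

r.incr("pageviews:home")                          # distributed counter
r.zincrby("leaderboard:daily", 10, "player:42")   # sorted set as a leaderboard
top = r.zrevrange("leaderboard:daily", 0, 9, withscores=True)

r.setex("session:abc123", 1800, "serialized-session-state")  # session key expires after 30 minutes
```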
Redis Cluster – Features and limitations
Consistency, Availability, and Latency
• Provides raw performance and scalability at the cost
of true high availability and consistency
• It is possible to lose data if a failure occurs after a
master has acknowledged a write but before
replication has completed.
• Redis Clusters are unavailable on network partitions
Data Model
•Key value data store with support for complex data
structures such as lists, sets, sorted sets and Maps
•Values can be set to expire (as in a cache)
•Supports Lua scripting for data processing
Horizontal Scaling Strategy
•Supports scalability via sharding based on evenly
distributing keys across the cluster using a hashing or
range partitioning algorithm
• The number of shards is fixed across all data
collections
• Supports Client assisted query routing for data shard
Operational Analytics and Search
• First class support for Counters, top N queries
• No support for secondary indexes, which must be
manually created as top level collections
Replication
• Manual configuration of master-slave. A node is only a
master or a slave, requiring more machines to be
managed.
•All replication is performed asynchronously. It is
possible to lose data if a failure occurs after a master
has acknowledged a write but before replication has
completed.
• No support for cross datacenter replication
• Redis enterprise now supports geo distributed active
active replication using CRDTs
Management Operation Behavior –
Add/Remove Node
• Adding a node can take a long time (~days) because data must replicate. Using more, smaller servers alleviates this
• Supports only manual resharding of keys while staying online. However, this procedure is not guaranteed to survive all kinds of failure and may result in loss of data
Storage Support
• Disk-backed, in-memory DB, highly tuned for RAM usage. This can make infrastructure costs go up quickly with large data sets
• Redis Enterprise now supports Flash
Developer Friendliness
• Most popular key value store and 2nd most popular
NoSQL DB after MongoDB
•Memcache API, client libraries in most languages
•Supports Lua Scripting
• Supports messaging semantics out of the box
Handling Updates
•Supports node local transactions in theory, however
most Redis cluster clients do not support this
Technology Used
•Implemented in C, but cross platform. Single threaded
and tuned for RAM
Security
•No fine grained data level security
• Supports LDAP authentication and RBAC
Ease of Setup and Scale
•Requires minimal setup to get started.
•Scales from small to large datasets easily
Redis Cluster - Practices and Use Cases
Best Practices
• One must commit to upfront design based on known use cases and
data access paths. If the use case evolves and there are multiple
optimal access paths to the data required then it will almost
certainly have to be supported through duplication of at least a
subset of data through multiple key value collections
• Redis cluster is fairly new and it is best used with the Redis
Enterprise distribution in order to enable easier data operations
and administrative functions
When to consider
• As a data cache with a narrow access pattern
• For small data sets that need to be accessed many times
• Storing temporary state for fast access
• Message queues and pub sub scenarios
Use cases
• Analytics Counters and leaderboards
• Mobile Event notifications and subscriptions
• Spam filtering
• Item expiration by time
• User session information
• Cache for serving backend analytics
When not to consider
• For applications that require consistency or data integrity guarantees
• For applications that require flexible sql like queries
• Storing wide data, such as thousands of attributes for a key
• Storing data that requires queries with high time complexity
• Storing data that requires secondary access paths
MongoDB - Overview
MongoDB is a distributed document (binary JSON) database that provides a tunable
consistency and availability model, in a master-slave architecture. MongoDB supports a
flexible, MySQL-like storage layer architecture with pluggable storage engines – such
as WiredTiger, MMAP, RocksDB etc.
Data Objects in MongoDB
• Namespace – A logical grouping of collections
• Collections– Analogous to a table in a relational database, represents a
set of documents.
• Document – A record in a binary JSON (BSON) format .
• ObjectID – Unique identity of a document
• Index – A named index on a collection. Index is maintained local to a
shard
Query Language – Mongo DB custom query language and API
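A minimal sketch of the objects above using PyMongo: a collection of schemaless BSON documents, the auto-generated ObjectID, and a secondary index that is maintained locally on each shard. The connection URI, database and collection names are assumptions for illustration.

```python
# Minimal sketch (assumed URI and names): documents in a collection need not share a
# schema; _id (an ObjectID) is generated automatically if not supplied.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://mongos-host:27017")   # connect through a mongos query router
products = client["catalog"]["products"]

result = products.insert_one({
    "sku": "A-100",
    "name": "Espresso machine",
    "attributes": {"color": "red", "warranty_months": 24},   # nested/denormalized document
})
print(result.inserted_id)                                    # the generated ObjectID

products.create_index([("sku", ASCENDING)])                  # secondary index, local to each shard
doc = products.find_one({"sku": "A-100"})
```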
MongoDB– Features and limitations
Consistency, Availability, and Latency
•Provides tunable consistency semantics, including strong consistency
at the highest read and write concerns using the V1 protocol
•Also provides tunable availability at the (considerable) cost of
consistency
• Expect read latency if you want linearizable reads, as a quorum is
enforced at read time
• Lack of centralized resource management across the cluster can lead
to unmanaged performance issues, such as when all nodes
independently decide to do compaction or garbage collection
Data Model
•Provides a document model with support for both embedded and
normalized documents
• Does not impose a schema on the documents and this is largely
managed by applications using the schema, allowing for maximum
flexibility, albeit with considerable scope for application induced
integrity and quality issues
•Cannot guarantee uniqueness of index across shards. This must be
managed by the application, if required
Horizontal Scaling Strategy
•Supports scalability via consistent hashing as per a shard id and
splitting data into a fixed number of predetermined shards
• A replicated set of Config servers store metadata and configuration
settings
• Each shard has one write replica set for handling writes and multiple
read replica sets for handling reads. This keeps read replicas idle for
any write activity, leading to lower resource utilization, but provides
better separation of workloads
•Requires additional mongos instances for query routing to the correct
shard
Operational Analytics and Search
• MongoDB supports secondary indexes.
• Secondary indexes are local to the shard, and any query without a
shard key will result in querying every shard
• MongoDB does not support a group-by operation in a sharded cluster;
instead, map-reduce and aggregation operations must be used. It also
does not support covered indexes
• Supports multikey indexes for searching on array structures embedded
inside a document
Replication
• Replication is enabled using a master slave architecture via Replica
Sets
• Primary replication sets are used for writes and secondary replication
sets for reads
•Replication is asynchronous from master to slave, consistent reads can
only be achieved by incurring the cost of a majority based quorum at
the time of a read
• Good replication characteristics for cross data center replication
Management Operation Behavior – Add/Remove Node
• Easy to add a shard. The balancer process automatically balances the
cluster by migrating chunks
• MongoDB also manages splitting and rebalancing data chunks once
they reach a threshold
• However, these operations come at a cost and impact read and write
performance, leading operations teams to often shut down the
auto-balancing and splitting processes
• Need to manually run compact processes in order to release unused
memory/space to the OS
• Supports background rolling builds that keep DB available during index
rebuilds (although performance is impacted)
Storage Support
•WiredTiger storage engine - Primarily uses hard disk for persistent
storage and RAM for indexes and working datasets. There is also an in-
memory storage engine that stores data entirely in memory
•The wiredtiger storage engine supports TTL indexes and tiered storage
via MongoDB Zones
•Typical Disk space used for Mongo production deployment /per
physical instance is 512 GB- 1 TB, largely limited by backup/restore
processes
Developer Friendliness
• Most popular NoSQL DB
•Supports a proprietary json based query syntax.
• No imposition of document structure and ease of indexing is a big plus
in terms of getting started quickly
• Support for Java, Python and other language APIs
Handling Updates
• There is no support for multi-document or multi-collection
transactions. Must eliminate the need for multi-row and multi-table
transactions through appropriate data modeling techniques. Although
the MongoDB documentation suggests the use of a two-phase-commit-like
pattern, it is not recommended for production apps that need to
ensure consistency and integrity of data
•WiredTiger uses MVCC for non locking algorithms for concurrent
updates which improves efficiency
•Supports change streams out of the box, to support near real time
event notification and synchronization scenarios
Technology Used
• Written in C++
• Does not optimize for the linux kernel, especially for disk or SSD based
IO
•The wiredtiger storage engine does provide storage efficiencies via
enhanced compression and granular concurrency control. It also
supports intra cluster network compression
• While MongoDB is multi-threaded, it does not utilize OS-native techniques
that optimize for modern multi-core and NUMA-aware hardware
architectures
Security
•Comes with loose default configurations that have been recently
exploited in malicious attacks
• Provides LDAP based authentication, role based ACLs and Encryption
at rest as well as in transit.
• Provides only collection-level access control policies, and not row- or
field-level security
Ease of Setup and Scale
•Considerable initial setup is required with 3 replica sets per shard
equating to 9 nodes with 1 mongod instance each or 3 nodes with 3
mongod instances each
•Requires additional mongos instances for query routing to the correct
shard
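The tunable consistency discussed above is exposed per collection handle or per operation. A minimal sketch (assumed names; PyMongo) requesting majority write acknowledgement and majority read concern, trading latency for stronger guarantees.

```python
# Minimal sketch (assumed names): w="majority" waits for a majority of the replica set
# to acknowledge the write; read concern "majority" returns only majority-committed data.
from pymongo import MongoClient
from pymongo.write_concern import WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://mongos-host:27017")
orders = client["shop"].get_collection(
    "orders",
    write_concern=WriteConcern(w="majority", wtimeout=5000),
    read_concern=ReadConcern("majority"),
)

orders.insert_one({"order_id": "o-9001", "status": "PLACED"})   # slower, but durable on a majority
latest = orders.find_one({"order_id": "o-9001"})                # will not see rolled-back writes
```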
MongoDB- Practices and Use Cases
Best Practices
• It is important to give significant attention to data modeling upfront
– including deciding on embedding documents or to have references,
sharding keys and indexing strategy. Avoid scatter-gather queries and
choose the right level of write guarantees and read concerns
• Attention needs to be given to hardware sizing and configuration
parameters that accounts for growth in volume and usage, thus
avoiding cumbersome migrations. Ensure working sets fit in RAM
and avoid large documents. Dedicate each server to a single role, as
each Mongo DB server role has significantly different workload
characteristics. Ensure proper configuration of compression and data
tiering
When to consider
• When schema flexibility is required
• When MySQL like latencies are desired but the data does not fit a
single server
• When eventual consistency can be tolerated
Use cases
• A datastore for customer data
• Ecommerce product catalog
• For near real time event notifications and collaboration
• Real time analytics
• Mobile and social networking applications
• Storing semi-structured data such as blogs, content and logs
• For any application that has evolving data requirements.
When not to consider
• When analytical or general search queries are required
• When ACID transactions need to be guaranteed across
documents or collections
• When very low latency reads or writes need to be guaranteed
• When very high throughput writes are required
• For large batch data processing jobs
• For very large datasets (>100 TB) – While MongoDB can handle
these large datasets, the lack of cluster wide resource
optimization, and replica sets based architecture can make such
large clusters expensive to provision and maintain
References
• Cassandra - https://academy.datastax.com/resources/brief-introduction-apache-cassandra
• Cassandra Consistency - https://aphyr.com/posts/294-jepsen-cassandra
• HBase - https://mapr.com/blog/in-depth-look-HBase-architecture/
• HBase Splitting and merging - https://hortonworks.com/blog/apache-HBase-region-splitting-and-merging/
• HBase Filters - https://intellipaat.com/tutorial/HBase-tutorial/client-api-advanced-features/
• HBase vs Cassandra - http://bigdatanoob.blogspot.in/2012/11/HBase-vs-cassandra.html
• Redis scale-out - https://www.credera.com/blog/technology-insights/open-source-technology-insights/an-introduction-to-redis-cluster/
• MongoDB performance guide - https://neotan.github.io/images/media/MongoDB-Performance-Best-Practices.pdf
• MongoDb at Baidu - https://www.slideshare.net/matkeep/mongodb-at-baidu/7
• Overview
• Characteristics of Distributed Databases
• Distributed Database Models
• The Incumbents
• The Challengers
Scylla - Overview
Scylla is a distributed database designed from the ground up, using the Seastar framework, to be a significantly more efficient and scalable
drop-in replacement of Cassandra. In using the Seastar framework, Scylla has optimized heavily across the utilization of CPU, memory, network
and IO resources, significantly reducing costs compared to a Cassandra deployment with similar workloads
As a distributed database, Scylla has the same architectural foundation as Cassandra, in that it uses the wide column data model and masterless
ring architecture. There are, however, a few system architecture choices that allow it to offer enhanced operational capabilities, such as
guaranteed low latencies (no JVM stops) and availability during repair processes (due to parallel repair)
Scylla – Innovations
Scylla system architecture innovations and implications
• Faster packet processing by bypassing the kernel space (in ~80 CPU cycles)
• No garbage collection pauses, expensive locking or low CPU utilization
• No thread context switches; asynchronous lockless inter-core communication which is highly scalable
• Reconciling data in cache with incoming writes – reduces IO and data model complexity
• Row cache format is the same as the serialized format – reducing serialization and deserialization overhead
• Direct storage access, with explicit cache management, leading to better control and reduced serialization and deserialization overhead
Scylla – as an evolution of Cassandra
The implications of the system architecture improvements Scylla has made result in:
• 5-10 times better throughput for combined read/write workloads when compared to Cassandra, making Scylla a more cost-effective alternative to Cassandra
• Ability to scale effectively with additional cores in a node
• Guaranteed low latency, which cannot be offered by Cassandra because of garbage collection stops and thread locking behavior
• Better compression rates, compaction rates and IO efficiency, leading to deployment of higher density storage per node (2-5 TB/node), thus reducing the total cost of infrastructure
Scylla offers additional operations benefits
• Running repair and compaction processes in parallel with query workloads
• Tuning – Self tuning capability removes a lot of the manual overhead and guessing
• Isolation and scheduling of background and foreground jobs
• Provisioning – Ease of adding nodes to a cluster – multiple nodes can be added at once and standing up nodes is a faster process compared
to adding a Cassandra node to a Cassandra cluster
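Because Scylla speaks the CQL wire protocol, the drop-in-replacement claim can be tested with existing Cassandra driver code pointed at Scylla nodes. The sketch below is illustrative; host names and the keyspace/table (reusing the schema from the Cassandra modeling example earlier) are assumptions.

```python
# Minimal sketch (assumed hosts and schema): the standard Cassandra driver works
# against Scylla unchanged; ScyllaDB also ships a shard-aware fork ("scylla-driver").
from cassandra.cluster import Cluster

cluster = Cluster(["scylla-node1", "scylla-node2", "scylla-node3"])
session = cluster.connect("activity")   # same keyspace/table as the earlier Cassandra sketch

row = session.execute(
    "SELECT activity_time, action FROM events_by_user_day "
    "WHERE user_id=%s AND day=%s LIMIT 10",
    ("user42", "2018-06-01"),
).one()
```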
Accumulo – Overview
Apache Accumulo is a highly scalable structured store based on Google’s BigTable. Accumulo is written in Java and operates over the Hadoop
Distributed File System (HDFS). Accumulo supports efficient storage and retrieval of structured data, including queries for ranges, and provides
support for using Accumulo tables as input and output for MapReduce jobs. Accumulo provides strong consistency models and is CP on the
CAP spectrum
Accumulo is the 4th most popular wide column data store after Cassandra, HBase and Microsoft Cosmos. It stands at 60th overall in database
popularity as per DB-Engines ranking
Accumulo has a lot in common with HBase and can be considered for similar use cases. However, Accumulo has implemented several
enhanced capabilities that give it an edge when compared to HBase
Accumulo – Innovations
Accumulo architecture innovations
Security
• Accumulo data model adds the
concept of column visibility to the
original BigTable model. This
enabled implementing very fine
grained cell level security for big
data
Data Model Flexibility
• Ability to add and change column
families “after the fact”
• Flexible locality groups, which
allow application designers to
control how columns are grouped
on disk, a trick that conventional
column-oriented databases rely on
for performance of ad-hoc queries.
• Configurable conditions under
which writes to a table will be
rejected. Constraints are written in
Java and configurable on a per
table basis.
Secondary Index support
• While Accumulo does not support
secondary indexes out of the box,
it provides several architectural
features that make it easy to
implement secondary indexes
• support for very large rows and
partial scans of rows, which
allows applications to build and
maintain their own secondary
index tables without hitting
memory limits
• batch scanners, which can
facilitate fetching many small
reads in a random access fashion
that allows applications to
quickly return full rows
corresponding to matches found
via index tables
• Through the use of specialized
iterators, Accumulo can be a
parallel sharded document store.
For example Wikipedia could be
stored and searched for
documents containing certain
words.
Volume support
• Supports a collection of HDFS URIs
(host and path) which allows
Accumulo to operate over multiple
disjoint HDFS instances. This
allows Accumulo to scale beyond
the limits of a single namenode.
• Allows splitting of metadata files,
thus allowing scaling to very large
volumes in trillions of rows of data
Accumulo – A better BigTable implementation
The implications of the architecture improvements for Accumulo enable the following additional use cases to be implemented with Accumulo
• When flexible data models are required in big data scenarios
• When data ingest rates are high and the data sets can grow up to trillions of rows
• For flexible querying of data using search terms or for Graph traversal in very large graphs
• When data security requirements are complex and fine grained at the data attribute level
However, Accumulo does have a few downsides compared to HBase
• The Accumulo server instances tend to be large to accommodate large rows. Therefore, recovery from a failed node can take significant
time
• Accumulo is not as well integrated into the Hadoop ecosystem as HBase (eg. Atlas integration, Oozie integration etc)
Aerospike - Overview
Aerospike is a distributed, scalable NoSQL database for storing key-Value based structures. Although
Aerospike architecture is fundamentally geared towards maximizing availability, it has made
accommodations for strong consistency models since Aerospike 3.0, albeit at the cost of latency.
Aerospike is unlike other popular key-value stores – Redis Server, Redis Cluster or Memcached –
and has made drastically different architectural choices while making significant improvements to
the system architecture
Aerospike Architecture
• The Aerospike architecture comprises three layers:
• Client Layer: This cluster-aware layer includes open source client libraries, which implement Aerospike
APIs, track nodes, and know where data resides in the cluster.
• Clustering and Data Distribution Layer: This layer manages cluster communications and automates fail-
over, replication, cross data center synchronization, and intelligent re-balancing and data migration.
• Data Storage Layer: This layer reliably stores data in DRAM and Flash for fast retrieval.
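A minimal sketch of the key-value model above using Aerospike's Python client: a record addressed by (namespace, set, key), with bins that can mix datatypes, stored through the hybrid DRAM/flash data layer. The host, namespace and set names are assumptions for illustration.

```python
# Minimal sketch (assumed host and namespace): the cluster-aware client layer routes
# each (namespace, set, key) directly to the node that owns its partition.
import aerospike

client = aerospike.client({"hosts": [("aerospike-node-1", 3000)]}).connect()

key = ("profiles", "users", "user42")        # (namespace, set, user key)
client.put(key, {                            # bins can mix datatypes
    "name": "Ada",
    "segments": ["electronics", "books"],
    "visits": 17,
})

(_, meta, bins) = client.get(key)            # single-key reads/writes are atomic
print(bins["segments"])
client.close()
```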
Aerospike – Innovation
Distributed Architecture innovation
• Unlike other key-value databases, which started out as single server high performance caches, the architecture of Aerospike has three key
objectives:
• Create a flexible, scalable platform for web-scale applications – Multi cluster masterless setup without complex master slave configurations. It also provides
flexibility in its data model in that it allows multiple datatypes to be mixed into bins
• Provide the robustness and reliability (as in ACID) expected from traditional databases – Provides single key atomic transactions (Does not support multi-key
acid transactions)
• Provide operational efficiency with minimal manual involvement – Automated balancing, indexing, sharding, cross datacenter replication and recovery
System Architecture Innovation
• Aerospike is implemented in C and is optimized for SSDs and processing speed
• Supports hybrid storage across SSD, HDD and RAM – This enables scalability without compromising on speed
Limitations
• Indexes are all stored in memory, which increases the cost of storage for very large data sets. Also indexes are not global, and need to be
queried using scatter gather queries
• Some data structures have Maps associated with them
Aerospike – The best key-value clustered DB
The applications of Aerospike will tend to be in the following areas
• Storing massive amounts of profile data in online advertising or retail Web sites.
• For real time low latency streaming analytics applications, such as real time fraud detection, financial front office applications etc
What is not a good use
• A data store with a large number of indexes
• Where transactional integrity across datasets or keys is required – e.g. inventory management, financial transaction processing etc
• Very large volumes - ~PBs
Yugabyte – Overview
Yugabyte is an implementation of the Google Spanner architecture (as laid
out in the Google Spanner paper). Like Google Spanner, it is meant to be a
system-of-record/authoritative database that geo-distributed applications
can rely on for correctness and availability.
It is written in C++ and is Apache 2.0-licensed open source software.
Yugabyte's API is wire compatible with CQL, Redis and PostgreSQL.
Unlike Google Spanner, Yugabyte maintains the goal of low latency
in spite of respecting ACID semantics.
This opens up a host of application possibilities for Yugabyte, including
blockchain implementations
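Because Yugabyte's YSQL API is PostgreSQL wire compatible, a standard PostgreSQL driver can talk to it directly. The sketch below is illustrative; the host, port, credentials and table name are assumptions (5433 is commonly the default YSQL port).

```python
# Minimal sketch (assumed host/port/credentials): rows inserted here are auto-sharded
# and replicated across the cluster, and the commit is a distributed ACID transaction.
import psycopg2

conn = psycopg2.connect(host="yb-tserver-1", port=5433,
                        dbname="yugabyte", user="yugabyte")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS quotes (
        symbol TEXT,
        ts     TIMESTAMP,
        price  NUMERIC,
        PRIMARY KEY (symbol, ts)
    )
""")
cur.execute("INSERT INTO quotes VALUES (%s, now(), %s)", ("IBM", 145.20))
conn.commit()
cur.execute("SELECT price FROM quotes WHERE symbol = %s ORDER BY ts DESC LIMIT 1", ("IBM",))
print(cur.fetchone())
```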
Yugabyte - Innovation
Yugabyte is built with the following very ambitious goals in mind
1. Transactional
• Distributed acid transactions that allow multi-row updates across any number of shards at any scale.
• Strongly consistent secondary indexes
• Transactional key-document storage engine that’s backed by self-healing, strongly consistent replication.
2. High Performance
• Low latency for geo-distributed applications with multiple read consistency levels and read-only replicas.
• High throughput for ingesting and serving ever-growing datasets.
3. Planet-Scale
• Global data distribution that brings consistent data close to users through multi-region and multi-cloud deployments.
• Auto-sharding and auto-rebalancing to ensure uniform load balancing across all nodes even for very large clusters.
4. Cloud Native
• Built for the container era with highly elastic scaling and infrastructure portability, including Kubernetes-driven orchestration.
• Self-healing database that automatically tolerates any failures common in the inherently unreliable modern cloud infrastructure.
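To make the distributed transaction goal concrete, the following is a minimal sketch of a multi-row ACID transaction through Yugabyte's PostgreSQL-compatible API, using the standard psycopg2 driver. It assumes a local YugabyteDB node with YSQL on its default port 5433 and the default yugabyte/yugabyte credentials; the accounts table is illustrative. Because rows are auto-sharded, the two updated rows may live on different shards, yet still commit atomically.

# Minimal sketch: a multi-row ACID transaction via Yugabyte's PostgreSQL-compatible API.
# Assumptions: local YugabyteDB node with YSQL on the default port 5433,
# default yugabyte/yugabyte credentials, and psycopg2 installed.
import psycopg2

conn = psycopg2.connect(host='127.0.0.1', port=5433,
                        dbname='yugabyte', user='yugabyte', password='yugabyte')
conn.autocommit = False

with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS accounts (
            id      int PRIMARY KEY,
            balance numeric NOT NULL
        )
    """)
    cur.execute("""
        INSERT INTO accounts (id, balance) VALUES (1, 500), (2, 500)
        ON CONFLICT (id) DO NOTHING
    """)
    conn.commit()

    # Both updates are part of one transaction and commit atomically,
    # even if the two rows are stored on different shards/nodes.
    cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
    cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (2,))
    conn.commit()

conn.close()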
Key Weaknesses
• Relatively new and not yet widely adopted in the enterprise, although there is promising adoption among startups
• No independently published benchmarks, although there are many positive benchmarks comparing it to MongoDB, Cassandra and Google Spanner, highlighting
performance and throughput strengths when the other DBs are tuned to strong consistency levels
Yugabyte – Promising cloud native, globally consistent DB
Yugabyte's unique capabilities make it usable for many demanding use cases:
• Elastic DB service for IoT data, especially where sensor data may be geographically distributed
• Geographically distributed, consumer-facing digital operations
• Financial data services – real-time, strongly consistent updates – e.g. stock quote services, financial portfolio management
• Lambda architectures for serving real-time analytics – especially for time-ordered datasets – e.g. personalization based on user activity
References
• Scylla
• https://github.com/scylladb/scylla/wiki/Repair---Scylla-vs-Cassandra
• https://www.youtube.com/watch?v=YBsbXYvyZnA
• https://www.scylladb.com/product/technology/
• Accumulo
• https://www.slideshare.net/DonaldMiner/survey-of-accumulo-techniques-for-indexing-data
• http://accumulosummit.com/program/talks/comparing-accumulo-cassandra-HBase/
• Yugabyte vs MongoDB – https://blog.yugabyte.com/overcoming-mongodb-sharding-and-replication-limitations-with-yugabyte-db-ec4eefa5bbd5
• Aerospike availability and consistency – https://aphyr.com/posts/324-jepsen-aerospike
• Aerospike consistency – https://www.aerospike.com/docs/architecture/acid.html
Databases to be added
• Google Spanner
• MongoRocks (MongoDB with the RocksDB engine)
• Microsoft Cosmos
• CouchDB (maybe)
Evolution of Distributed Database Technologies in the Digital era

  • 1. Evolution of Distributed Operational Database Technologies in the Digital Era An architectural and digital fitment analysis Vishal Puri Executive Architect Data Platforms Services IBM GBS Cognitive Business Decisions Support Email - vpuri@us.ibm.com Mobile - 4692197510
  • 2. Table of contents • Overview • Characteristics of Distributed Databases • Distributed Database Models • The Incumbents • The Challengers
  • 3. • Overview • Characteristics of Distributed Databases • Distributed Database Models • The Incumbents • The Challengers
  • 4. Executive Brief The use of distributed databases (also called NoSQL DBs) for supporting operational processes including operational intelligence is increasing as companies increase the adoption of digital process engagement with their consumers as well as digitizing internal and partner facing operations. With digitization, comes the challenge of scale, flexibility and agility that distributed databases are uniquely positioned to address. Distributed database marketplace • Distributed databases have been in the market for a long time with mature solutions with significant enterprise adoption. The examples of such databases is MongoDB, Cassandra, Redis and HBase. Incumbent leaders continue to innovate their feature sets, with some doing so more successfully than others, such as Mongo DB (with MongoRocks). • The market leaders are saddled with legacy architectural debt, which has opened the door to new challengers that provide attractive propositions and are making seamless replacement of existing solutions a key architectural goal. These solutions are ScyllaDB, Aerospike, Accumulo and Yugabyte among others. • While this paper does not touch upon public cloud only offerings (Amazon DynamoDB, Microsoft Cosmos or Google Spanner) there is a significant interest in these offerings in the marketplace. There is also significant interest in adopting solutions that are cloud native and multi-cloud enabled, with as- a service offerings from multiple vendors. Database practitioner recommendations • Data Architects must keep on top of the changing landscape and innovations. As this paper articulates, the best way forward is to understand distributed database architecture patterns. This paper also postulates a framework to breakdown the characteristics of any distributed database for comparison purposes and to gauge its suitability for specific architectures. • Most distributed databases require upfront data modeling and data modeling as a discipline for implementing such databases is a skill that must be nurtured in technology organizations • With changing landscapes and use case specific implementations, it is important to have database devops SME’s to be multiskilled, especially w.r.t database operations in a cloud environment
  • 5. Core Audience and Goals Who is the core audience? • The core audience for this document is a practicing architect providing solutions for digital business functions. What are the key goals ? • Explore core architectural patterns for distributed databases. • Explore a class of distributed databases that are used to enable digital engagement use cases. • Establish an architectural framework to determine suitability of specific Distributed DBs in addressing various digital use cases. • Apply the above architectural framework in assessing market leading DBs as well as emerging challengers in the marketplace. • Enable deep architectural thinking as opposed to providing convenient but shallow answers. What is not a goal? • Providing a simple decision tree for choosing a database. • Direct and convenient comparison of all DBs. What are the class of NoSQL DBs not covered ? • Any Relational or hybrid relational + NoSQL DB – Postgres, MySQL, DashDB. • Any solution that only scales vertically or via replication (MySQL, Postgres, Aurora etc). • Graph Databases or purpose built time-series DBs – Neo4J,OrientDB, Titan etc. • Analytics oriented NoSQL solutions – Hive, Impala, Drill, Spark SQL etc. • Search oriented solutions –Solr, ElasticSearch etc.
  • 6. What problems do Distributed Databases solve in the digital world? Internet Scale • By making compromises along the consistency and availability spectrum, NoSQL databases enable distributed databases that can scale horizontally and can possibly be distributed geographically. This was not possible with traditional relational DB’s with ACID semantics and enforced referential integrity (However, some distributed NewSQL architectures such as Spanner challenge this) Insights Driven Digital Engagement • Enable alternate data models that more naturally represent problem domains in the dynamic digital world (As opposed to force fitting everything into a predefined relational DB), thus enabling more efficient queries for operational insights and interactions • Dynamic Customer attributes and activities • Time series data – Sensors, Stock price • Session caches • Searchable logs, • Large Collections – Queues, Lists, Maps, Counters etc. • Network and semantic Graphs • Real time counters • Global custom • er transaction databases
  • 7. • Overview • Characteristics of Distributed Databases • Distributed Database Models • The Incumbents • The Challengers
  • 8. Common characteristics of distributed databases Horizontally scalable distributed database, for internet scale use cases 1 High availability and resiliency built in 2 Data is partitioned and replicated across the cluster 3 Data model is some form of Multi- Dimensional Map 4 Primarily optimized for operational use cases 5
  • 9. How to dissect and analyze fitment to purpose of a Distributed DB Consistency, Availability, and Latency •Eventual vs strong consistency •What happens on network partitions or server failures • Write optimized, read optimized or throughput optimized Data Model •Wide Column/Key Value/Document •Static vs Dynamic Typing Horizontal Scaling Strategy •Hash based partitioning/ Ordered Partitioning/ Replication •Load-balancing reads and writes Operational Analytics and Search •Support for secondary indexes – transitionally consistent indexes/external indexes/distributed indexes •Support for range scans •Support for joins •Support for counters and aggregation Replication •Master Slave / Distributed •Cross datacenter replication support Management Operation Behavior – Add/Remove Node •Data redistribution, Master election Storage Support •Tiered vs Memory centric vs Disk Centric vs Flash Centric Developer Friendliness •Client APIs - REST API/CQL/Java Library/JSON/Other •Popularity score in DbEngines Handling Updates •Locking vs MVCC •Partial vs full update Technology Used •Open source vs Closed source •Java/JVM based vs C/C++ based •Cross platform vs Linux optimized Security •DB and schema level security – Read/Write/Delete/DDL •Data Level security – Row/Column/Cell level security •LDAP/Kerberos integration Ease of Setup and Scale •Minimum infrastructure required to start •Ease of scaling
  • 10. The Consistency Availability Spectrum for Distributed Databases – CAP Theorem CAP Theorem Only 2 of the 3 properties - Consistency, Availability, and Partition- tolerance can be satisfied by a distributed database. Consistency: A read operation is guaranteed to return the most recent write Availability: Any operation is guaranteed to receive a response saying whether it has succeeded or failed Partition tolerance: The system continues to operate when a network partition occurs While the CAP theorem is not considered to be sufficient* to articulate the behavior of a distributed database it does make for a useful classification in making high level decisions about a distributed database choice. • https://arxiv.org/pdf/1302.0309.pdf An alternative to understanding distributed systems is the PACELC Theorem, described in the original paper as - “if there is a partition (P), how does the system trade off availability and consistency (A and C); else (E), when the system is running normally in the absence of partitions, how does the system trade off latency (L) and consistency (C)?”
  • 11. The Distributed Operational DB Architecture Spectrum Big Table Dynamo In Mem Data Grid KV Store Cassandra Scylla Google BigTable Dynamo DB HBase Redis Aerospike Memcache Hazelcast Ignite Clustered / Sharded / Distributed SQL DB Mysql NDB Cluster Citus DB (Postgres) MemSQL Document Store Wide Column Store Graph / RDF Store Orient DB Titan CoucHBase MongoDB Couch DB Marklogic Marklogic Elastic Search Search Solr Accumulo • There is no perfect all purpose distributed database. • Each distributed database has made specific architectural choices and compromises so as to be best suited for a narrow range of use cases • Fortunately each architectural category has several choices of DB’s, providing us with choices no what our use case • Architectures continue to evolve with newer distributed databases addressing shortcomings and technical debt of existing leaders by making better ground up architectural decisions and technology choices Google Spanner Cockroach DB YugaByte DB Coherence Azure Cosmos DB
  • 12. Data Modeling in a Distributed Database Along with making choices on availability, consistency and latency, every database offering makes fundamental architectural choices regarding the data models it would like to enable and the associated restrictions it needs to put in those data-models to honor the laws of computational physics. Most of the distributed database data model characteristics have evolved based on popular and expanded usage scenarios along with new developments in Hardware architectures. Key characteristics of a Data Model to look for are – 1. What is the fundamental data structure ? – Wide Column, Document, Key Value, Object Collections 2. How is the data distributed/sharded in the cluster? – Random partitioning Or Ordered partitioning, Range partitions or Hash Partitions 3. Does the data model support clustering or ordered data storage within a server to support scans? – Clustering Keys, Composite primary Keys, Ordered Storage 4. Does the the data model support secondary indexes? If so, are the secondary indexes transitionally consistent. Are the secondary indexes efficient in supporting a search without a primary key reference? 5. Does the data model support CRDTs (Conflict-free replicated data types) – These are especially useful in eventually consistent DBs like Cassandra, Riak and Aerospike to enable reliable distributed updates. For example Counters, Sets etc 6. Does the data processing engine support server side operations such as stored procedures, map-reduce functions or triggers Needless to say, data modelling in a Distributed NoSQL or newSQL database is very different from modeling a relational schema. One requires a shift in mindset - • View the data as a composite dataset driven by narrow access patterns. • Since multi-row or multi-document transactions are at a premium, having denormalized or nested data is preferred • Since joins of data are either not supported or prohibitively expensive, plan for manual joins in the application • Design of the primary key is the most important part of modeling and getting it right will dictate • Ease of access to the data for most access patterns • Scalability of reads and writes across the cluster • Ability to have shared multi-tenant database • Referential integrity is not maintained by the database in most solutions (barring distributed sql DBs like Spanner) and should not be relied on for maintaining this • Know your CRDTs
  • 13. Distributed DB Engine Popularity* • Popularity in the marketplace and developer mind-space is often a consideration in selecting a Distributed DB of choice • A very high popularity score often reflects the ease of setup and developer friendliness of the solution, rather than merits of the solution at scale • MongoDB, Redis and Cassandra and incumbent leaders in this space • However, cloud centric offerings (Dynamo, Cosmos) along with newer entrants such as Scylla, Aerospike and Yugabyte are on the rise • While the popularity may not reflect the fit of the DB for the use case, it does reflect on availability of skills in the marketplace • Stagnation in popularity of a DB also indicates emergence of alternatives that are addressing shortcomings of a solution. It may also reflect the stagnation of the ecosystem itself, for e.g. Hadoop adoption stagnation impacts adoption of Accumulo and HBase * https://db-engines.com/en/ranking_definition * https://db-engines.com/en/ranking_trend
  • 14. Distributed DB architecture limitations and consequences Category Issue Consequence Alternative Solutions Inefficient Processing Use of Java/JVM technologies Garbage Collection Stops, Inefficient memory usage, Complexity in operations and expensive at scale • Use C/C++ frameworks • Use and manage offheap memory where feasible Not optimized for OS/Hardware, Uses generic OS Caches and default kernels that do not optimize for target use cases and access patterns leading to large installations at scale • Optimized for multi core, NUMA architectures, L1-L2 Caches, vector processing etc Not optimized for storage Uses a single storage system that does not provide adequate flexibility between speed and cost • Intelligently use of multi-tiered storage strategy, dynamically moving data between flash, RAM and disk in order to optimize based on access pattern Flexibility Fixed access patterns based on primary or shard keys Multiple access patterns require changes to data model design and/or duplication of data • Support for scalable secondary index implementation • Support for range scans Fixed schemas Addition of schema elements is operationally complex process • Support for easy schema evolution through flexible schemas with dynamic attributes Operational complexity Eventual consistency Results in stale data across the distributed nodes, that needs to be reconciled and repaired • Strongly consistent solutions • Incremental and fast repair tools that do not impact availability/performance • Support for distributed transactions Data redistribution Adding or removing nodes causes redistribution of large volumes of data which impacts performance and takes long time to complete • Architectures that minimize data movement for scaling and load balancing
  • 15. References • Jepsen – A tool to understand Distributed database Availability and Consistency characteristics – • https://aphyr.com/tags/jepsen • https://aphyr.com/posts/343-scala-days-2017-jepsen-keynote • Conflict-Free Replicated Data Types (CRDT) - https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type
  • 16. • Overview • Characteristics of Distributed Databases • Distributed Database Models • The Incumbents • The Challengers
  • 17. Wide-Column - Big Table Architecture • Provides horizontal scalability, immediate consistency and network partition tolerance, at the cost of loss of availability in some scenarios • Requires Master Server for metadata • Write optimized as no disk seek is needed – Write to log and Memtable, flushed periodically to SSTable on disk. Many SSTables on disk – 1 per generation/version • Merged reads – read from Memtable and SSTable. • Compaction - Many SSTables increase read time, therefore compaction occurs to reduce disk seek time. Compaction = multiple SSTables combined into single SSTable at O(logN) rate • Relies on distributed file system for durability Data model • Column Family oriented • Supports sparse data – lots of null values • File per column family. Data sorted per column family by row id, column name and timestamp. Sorted data is compressed • Cell is smallest addressable unit – row, column. Data in each cell has configurable number of versions Best used for • high throughput writes, where read to write ratio is balanced Example • HBase, Cassandra, Scylla
  • 18. Wide-Column - Dynamo Architecture • Enables high availability and partition tolerance with horizonal scalability, at the cost of immediate consistency, by enabling eventual consistency models • Dynamo allows read and write operations to continue even during network partitions and resolves update conflicts using different conflict resolution mechanisms, some client driven. • Dynamo's Gossip based membership algorithm helps every node maintain information about every other node. • Dynamo can be defined as a structured overlay with at most one-hop request routing. • Dynamo detects updated conflicts using a vector clock scheme, but prefers a client side conflict resolution mechanism. A write operation in Dynamo also requires a read to be performed for managing the vector timestamps. This is can be very limiting in environments where systems need to handle a very high write throughput. Data model • Item model – {Key – Attribute Maps} Example • DynamoDB, Riak, Cassandra, Scylla
  • 19. Document Stores Architecture • Wide variety of architectures – • from strongly consistent (MongoDB, Marklogic, YugaByte) to eventually consistent (CouchDB, Cloudant) • Automated sharding with fixed shards (MongoDB) to truly distributed and dynamic sharding (YugaByte, CouchDB) • Read Scaling and fault tolerance with dedicated Replica Sets (MongoDB) to sharded replication with distributed read and writes • Enables dynamic schemas, leaving data validation and integrity to be largely driven by applications • Data model is usually JSON in text or Binary form (BSON) or XML • Most distributed document databases only support local secondary indexes forcing one to search all nodes in a cluster when primary key is not part of the query Data model • Item model – {Key – Binary JSON} Example • MongoDB, CouchDB, Cloudant, MarkLogic, YugaByte
  • 20. In Memory Key-Value Store Architecture • Most solutions (Memcache, Redis) in this category have roots in single server data storage with a focus on speed and throughput. Application driven sharding, as well as lack of availability and data consistency guarantees has been the rule rather than the exception • Some solutions such as Redis Cluster evolved to provide horizontal scalability in addition to performance, without providing true availability or consistency guarantees under network partitions. • Some solutions such as Aerospike support tunable consistency with varying degrees of availability, i.e., always available with simple conflict resolution for eventual consistency or primary key consistency with lower availability • Master slave architectures • Automated or manual sharding • In Memory storage of keys, Values stored in memory or on other storage media (Disk/Flash) Data model • Key – Value Pairs, support for complex data structures such as Lists and Nested Maps Example • Memcache, Redis, Redis Cluster, Aerospike, Riak KV
  • 21. In Memory Data Grid Architecture • In – Memory cache • Configurable Replication • Limited ACID Compliance through transactions • Configurable Consistency and availability semantics with limited Network partition tolerance Data model • Distributed Collections – Maps, Sets Lists Example • Hazelcast, Ignite (GridGain), Coherence
  • 22. Distributed ACID compliant DBs - Spanner Architecture • The architecture was mostly developed to take care of three problems at massive scale (Trillions of rows): • Global Consistency - if a record is updated in one region (say, Asia) someone reading in another region (Say Africa) will have the same record updated on reading. This is supported through synchronous globally distributed transactions using the TrueTime API and atomic clocks supported by GPS, in order to prevent any time inconsistency • Table-Like Schema – Support for table like schema via rows and columns, backed by a key-value store • SQL Query Language – A SQL query parser to support SQL clients • In order to support the above characteristics this architecture sacrifices support for raw performance required by low latency use cases and incurs possible additional latency with every new node added • The TrueTime API along with the atomic clock infrastructure ensures that data is consistent within a very small time interval (~1-7 ms) • Spanner has three main types of operations: read-write transaction, read-only transaction and snapshot read at a timestamp, which are all supported on a globally distributed database • Spanner provides first class support for unified cross data center and cross availability zone data clusters. It enables this via concepts of global indexes, zone masters, span servers (which are similar to HBase region servers) and use of paxos for fine grained cluster consensus Data model • Item model – Table like schemas implemented using timestamped key value data structure Use case • Data Consistency in a global deployment, across many data centers. E.g. Global digital banking or any global ecommerce application • Very large data sets with relaxed read write latencies (>20 ms) • Blockchain and crypto currency implementations Example • Spanner, CockroachDB, Azure Cosmos DB, YugaByte, NuoDB
  • 23. References • Google BigTable Paper - https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf • Amazon Dynamo Paper - https://www.dynamodbguide.com/the-dynamo-paper/ • Spanner Architecture - https://kunigami.blog/2017/04/27/paper-reading-spanner-googles-globally-distributed-database/ • Comparing scaleout sql - https://blog.yugabyte.com/practical-tradeoffs-in-google-cloud-spanner-azure-cosmos-db-and-yugabyte-db- ce720e07c0fd
  • 24. • Overview • Characteristics of Distributed Databases • Distributed Database Models • The Incumbents • The Challengers
  • 25. Cassandra - Overview Cassandra is a distributed eventually consistent database which follows the BigTable wide column format and the Dynamo DB masterless “ring” distributed storage architecture Data Objects in Cassandra • Keyspace – a container for data tables and indexes; analogous to a database in many relational databases. It is also the level at which replication is defined. • Replication factor − Number of machines in the cluster that will receive copies of the same data. • Replica placement strategy − Strategy to place replicas in the ring. • Column families − Keyspace is a container for a list of one or more column families. A column family, in turn, is a container of a collection of rows. Each row contains ordered columns. Column families represent the structure of your data. Each Keyspace has at least one and often many column families. • Row key – used to identity a row uniquely in a Column Family and also distribute a table’s rows across multiple nodes in a cluster. • Index – similar to a relational index in that it speeds some read operations; also different from relational indices in important ways. Query Language – Cassandra Query Language (CQL) Masterless Ring architecture – Multi-datacenter and cloud deployable Keyspace as a Column Family container Column Family consists of Row Key and Columns
  • 26. Cassandra – Features and limitations Consistency, Availability, and Latency •Provides very high availability and eventual consistency based on Dynamo architecture •Remains available in the face of network partitions • Optimized for very low latency writes. • However, reads incur cost through quorum checks, if read consistency levels are set to >1 Data Model •Adopts the BigTable Wide Column data model •Enables Clustering columns to provide order within a partition •Supports secondary indexes as well •Schema needs to be modeled upfront, including column families and columns Horizontal Scaling Strategy •Supports scalability via consistent hashing and random spread of data across the cluster. • Large number of nodes in a cluster (>100) causes excessive inter node communication. Similarly large number of tables in a cluster also causes issues Operational Analytics and Search • Since data is randomly distributed, range scans are not supported on primary keys or on secondary indexes, • All clustering columns must be provided in a query • Secondary index lookups require querying all partitions • Materialized views are supported but are very limited in their function • Good support for distributed Counters •Most common implementation is to enable secondary indexes using Elastic Search connectors. These are not strongly consistent indexes Replication • Very good support for cross datacenter replication out of the box Management Operation Behavior – Add/Remove Node •Adding a node can take a long time (~days) because data must replicate. More smaller servers alleviates this •Node repair is a costly operation that can impact resources used to serve data requests. Incremental node repair can minimize this impact but is only available in the enterprise version •Similarly data compactions can also impact online performance Storage Support •Primarily uses hard disk for persistent storage. Node density of no more than 2 TBs, although 1 TB is recommended. This can lead to large cluster sizes quickly Developer Friendliness • Most popular wide column DB and 3nd most popular NoSQL DB after MongoDB and Redis •Supports SQl like language called CQL –used for query and DDL Handling Updates •Supports lightweight transactions through compare and Set features. • There is no support for multi-row transactions. Must eliminate need for multi-row and multi-table transactions through appropriate data modeling techniques Technology Used •Java based implementation • Need to monitor and tune JVM garbage collection and heap usage constantly • Offheap memory can also be allocated for memtables, bloom filters etc Security •Row level access control supported, but require exact string matches and Datastax •Columns level security not yet implemented Ease of Setup and Scale •Requires minimal setup to get started. •Scales from small to large datasets seamlessly. •Unlike HBase, all nodes are equal
  • 27. Cassandra - Practices and Use Cases Best Practices • One must commit to upfront design based on known use cases and data access paths. If the use case evolves and there are multiple optimal access paths to the data required then it will almost certainly have to be supported through duplication of atleast a subset of data • It is common to pair Cassandra with an implementation of Solr or Elastic search in order to enable secondary indexes • It is recommended to keep Cassandra Cluster sizes to a moderate number of servers (<150 ) with at most 2 TB disk size • Cassandra requires an operations team that will work towards monitoring and optimizing of the settings and infrastructure on a regular basis, apart from monitoring complex operations such as splitting, compaction and node repair When to consider • For application features or microservices that serve a narrow use case and access pattern, for large data sets and users • For maintaining global or fine grained counts through counters • Suited well for applications and feature sets that need to start small but grow with increased usage • Ingesting large volumes of data with high throughput and low latency • Data services across geographies and data centers Use cases • Consumer activity and extended profile DB for recommendation and personalization • Web analytics data such as clickstreams and counters • Graph DB for social network analysis, fraud analysis • Storage for IOT data • Storage for firehose datasets • Web analytics data such as clickstreams and counters • E-commerce carts and checkout • Product catalogs and playlists When not to consider • For applications that require flexible access patterns through sql like queries, range scans etc • For applications that require a flexible and dynamic schema • As an analytics DB. Cassandra is often paired with Spark to perform analytical operations, however Cassandra does not provide any benefits to the spark engine such as query execution or data set filtering/pruning
  • 28. HBase - Overview HBase is a distributed strongly consistent database which follows the BigTable architecture and wide column format, and uses HDFS for storage, while providing low latency read and write access Data Objects in HBase • HBase Tables – Logical collection of rows stored in individual partitions known as Regions. • HBase Row – Instance of data in a table. • RowKey -Every entry in an HBase table is identified and indexed by a RowKey. • Columns - For every RowKey an unlimited number of attributes can be stored. • Column Family – Data in rows is grouped together as column families and all columns are stored together in a low level storage file known as HFile. Query Language – SQL via Phoenix or Java API
  • 29. HBase – Features and limitations Consistency, Availability, and Latency •Provides strong Consistency at the expense of Availability, based on the BigTable architecture •Availability can be increased by adding read replica’s on failure of primary replica • Reads and writes are optimized for throughput rather than latency, since the storage is dependent on HDFS. •Can provide MySQL like latencies at scale Data Model •Adopts the BigTable Wide Column data model •Data is stored and ordered by row key •Fixed column families – No more than 2 column families advised. Within a column family flexible number of columns are allowed Horizontal Scaling Strategy •Supports scalability via auto sharding. Basic unit of sharding is region. Regions are created and merged depending upon a configurable policy •Writes are bound to a primary region and unlike reads, are not load-balanced to the replicas • Metadata does not split and scale • Can easily handle billions of rows X millions of columns Operational Analytics and Search • Stores keys in ordered fashion. Good at range scans and sorts •Query predicate push down via server side scan and get filters • Most common implementation is to enable secondary indexes using Phoenix or Solr connectors. These are not strongly consistent indexes Replication •Support for synchronous replication, as of HBase 2.0, as well as eventually consistent replication to read only replica Management Operation Behavior – Add/Remove Node •Rolling restart for configuration changes and minor upgrades •Region Splits caused by data growth results in loss of availability as Region of data is forced offline to enable movement of data Storage Support •HBase can use tiered storage across RAM, SSD and disk •Uses HDFS for storage – HDFS is optimized for throughput, not latency •Uses Region Server as a middleware for implementing Cache and mediating access – Adds network latency Developer Friendliness • 2nd most popular wide column DB, second only to Cassandra •Supports Thrift, REST and Java Client APIs •HTTP supports XML, Protobuf, and binary Handling Updates •Supports single row transactions for safe atomic updates, including safe server side row locks •No support for cross table or multi row transactions . However, Alternatives exist via external libraries such as OMID Technology Used •Java based implementation with offheap read and write capability (as of HBase 2.0) Security •HBase now supports cell level security via Co- Processors. Supports Kerberos integration Ease of Setup and Scale •Requires considerable starter infrastructure with a minimum of 3 data nodes and 3 master nodes •Works well with large data sets. Accessing Small to medium sparsely distributed data is not performant
  • 30. HBase - Practices and Use Cases Best Practices • One must commit to upfront design based on known use cases and data access paths. If the use case evolves and there are multiple optimal access paths to the data required then it will almost certainly have to be supported through duplication of atleast a subset of data • It is common to pair HBase with an implementation of Solr or Elastic search in order to enable secondary indexes • HBase requires an operations team that will work towards monitoring and optimizing of the settings and infrastructure on a regular basis, apart from monitoring complex operations such as splitting and merging of data, backups and multi-cluster replication When to consider • When the Hadoop/HDFS stack is well adopted • Data sets that are naturally ordered (time series data) such as sensor, stock prices, IOT etc • Large Data Processing using MapReduce like algorithms • Hybrid Operations and Analytics along with Hadoop • Real time high performance scans of large join-less dataset • Large master data sets with frequent updates Use cases • Consumer activity and extended profile DB for recommendation and personalization • Web analytics data such as clickstreams and counters • Log processing • Time Series data storage and analysis for network and sensor data • Graph DB for social network analysis, fraud analysis • Product Price by day by Store, location and competitor • https://blogs.apache.org/HBase/entry/HBase-application-archetypes- redux When not to consider • General purpose database with support for multiple access patterns and an evolving schema. • Small data sets - Does not perform comparatively well for small data sets as it has significant overhead. Consider a minimum of 5 data nodes storage before using HBase • Need guarantees of very low latencies – Due to overheads and possible blocking operations (Region Splits, Garbage collection), low latency guaranties cannot be given • Need Complex 2 phase commits across database tables or across resources • No existing or planned implementations of Hadoop/HDFS
• 31. Redis Cluster - Overview Redis is an in-memory key-value data store built for low-latency access. Redis Cluster is a distributed database that automatically shards Redis key-value data structures across a cluster of nodes in a master-slave HA architecture • Redis Cluster was designed as a general solution for high availability and horizontal scalability while keeping the core focus of Redis in mind - low latency and a strong data model. Because of this, Redis Cluster implements neither true availability nor consistency in the CAP sense Data Objects in Redis • Redis stores key-value pairs, where the value data structure may be of type String, HashMap, List, Sorted Set, Set, Bitmap or HyperLogLog • Redis Cluster places restrictions on multi-key operations and on the ability to support multiple databases in a single cluster Query Language – Via client-side libraries in multiple languages
• 32. Redis Cluster – Features and limitations Consistency, Availability, and Latency • Provides raw performance and scalability at the cost of true high availability and consistency • It is possible to lose data if a failure occurs after a master has acknowledged a write but before replication has completed • Redis Clusters become unavailable on network partitions Data Model • Key-value data store with support for complex data structures such as lists, sets, sorted sets and maps • Values can be set to expire (as in a cache) • Supports Lua scripting for data processing Horizontal Scaling Strategy • Supports scalability via sharding, evenly distributing keys across the cluster using a hashing or range partitioning algorithm • The number of shards is fixed across all data collections • Supports client-assisted query routing to the correct data shard Operational Analytics and Search • First-class support for counters and top-N queries • No support for secondary indexes, which must be manually created as top-level collections Replication • Manual configuration of master-slave. A node is only a master or a slave, requiring more machines to be managed • All replication is performed asynchronously. It is possible to lose data if a failure occurs after a master has acknowledged a write but before replication has completed • No support for cross-datacenter replication • Redis Enterprise now supports geo-distributed active-active replication using CRDTs Management Operation Behavior – Add/Remove Node • Adding a node can take a long time (~days) because data must replicate; using more, smaller servers alleviates this • Supports only manual resharding of keys while staying online; however, this procedure is not guaranteed to survive all kinds of failures and may result in loss of data Storage Support • Disk-backed, in-memory DB, highly tuned for RAM usage. This can make infrastructure costs go up quickly with large data sets • Redis Enterprise now supports Flash Developer Friendliness • Most popular key-value store and 2nd most popular NoSQL DB after MongoDB • Memcache API, client libraries in most languages • Supports Lua scripting • Supports messaging semantics out of the box Handling Updates • Supports node-local transactions in theory; however, most Redis Cluster clients do not support this Technology Used • Implemented in C, but cross-platform. Single-threaded and tuned for RAM Security • No fine-grained data-level security • Supports LDAP authentication and RBAC Ease of Setup and Scale • Requires minimal setup to get started • Scales from small to large datasets easily
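To make the counter, top-N and expiry features above concrete, here is a minimal Python sketch using the redis-py client against a single node; the key names are hypothetical, and the same commands work against Redis Cluster through a cluster-aware client.

import redis

r = redis.Redis(host="localhost", port=6379)  # hypothetical node

# Counter: atomic increments back analytics counters
r.incr("page:home:views")

# Leaderboard: sorted sets give top-N queries out of the box
r.zadd("leaderboard:daily", {"player:42": 1300, "player:7": 2100})
top10 = r.zrevrange("leaderboard:daily", 0, 9, withscores=True)

# Expiring value: session-style state with a TTL, as in a cache
r.setex("session:abc123", 3600, "serialized-session-blob")

print(top10)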
• 33. Redis Cluster - Practices and Use Cases Best Practices • One must commit to upfront design based on known use cases and data access paths. If the use case evolves and multiple optimal access paths to the data are required, then they will almost certainly have to be supported through duplication of at least a subset of the data across multiple key-value collections • Redis Cluster is fairly new and is best used with the Redis Enterprise distribution in order to enable easier data operations and administrative functions When to consider • As a data cache with a narrow access pattern • For small data sets that need to be accessed many times • Storing temporary state for fast access • Message queues and pub/sub scenarios Use cases • Analytics counters and leaderboards • Mobile event notifications and subscriptions • Spam filtering • Item expiration by time • User session information • Cache for serving backend analytics When not to consider • For applications that require consistency or data integrity guarantees • For applications that require flexible SQL-like queries • Storing wide data, such as thousands of attributes for a key • Storing data that requires queries with high time complexity • Storing data that requires secondary access paths
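For the "data cache with a narrow access pattern" scenario listed above, a minimal cache-aside sketch in Python is shown below; the key scheme, TTL and the stand-in loader function are hypothetical.

import json
import redis

r = redis.Redis()

def load_profile_from_primary_db(user_id: str) -> dict:
    # Stand-in for a query against the system of record
    return {"id": user_id, "name": "example"}

def get_user_profile(user_id: str, ttl_seconds: int = 300) -> dict:
    # Cache-aside: try Redis first, fall back to the primary store, then populate the cache
    cache_key = f"user:profile:{user_id}"
    cached = r.get(cache_key)
    if cached is not None:
        return json.loads(cached)
    profile = load_profile_from_primary_db(user_id)
    r.setex(cache_key, ttl_seconds, json.dumps(profile))
    return profile

print(get_user_profile("42"))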
• 34. MongoDB - Overview MongoDB is a distributed document (binary JSON) database that provides a tunable consistency and availability model, in a master-slave architecture. MongoDB supports a flexible, MySQL-like storage layer architecture with pluggable storage engines such as WiredTiger, MMAP and RocksDB Data Objects in MongoDB • Namespace – A logical grouping of collections • Collection – Analogous to a table in a relational database; represents a set of documents • Document – A record in a binary JSON (BSON) format • ObjectID – Unique identity of a document • Index – A named index on a collection. An index is maintained local to a shard Query Language – MongoDB custom query language and API
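A minimal Python sketch of the data objects listed above, using the pymongo driver; the connection string, database, collection and fields are hypothetical.

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # hypothetical mongos or standalone
db = client["shop"]            # database within a namespace
products = db["products"]      # collection, analogous to a table

# Documents are schemaless BSON; an ObjectID is generated for _id if omitted
result = products.insert_one({"sku": "A-100", "name": "kettle",
                              "attrs": {"colour": "red", "volume_l": 1.7}})
print(result.inserted_id)

# A secondary index, maintained locally on each shard in a sharded cluster
products.create_index([("sku", ASCENDING)])

print(products.find_one({"sku": "A-100"}))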
• 35. MongoDB – Features and limitations Consistency, Availability, and Latency • Provides tunable consistency semantics, including strong consistency at the highest read and write concerns using the V1 replication protocol • Also provides tunable availability at the (considerable) cost of consistency • Expect read latency if you want linearizable reads, as a quorum is enforced at read time • Lack of centralized resource management across the cluster can lead to unmanaged performance issues, such as when all nodes independently decide to do compaction or garbage collection Data Model • Provides a document model with support for both embedded and normalized documents • Does not impose a schema on the documents; the schema is largely managed by applications, allowing for maximum flexibility, albeit with considerable scope for application-induced integrity and quality issues • Cannot guarantee uniqueness of an index across shards. This must be managed by the application, if required Horizontal Scaling Strategy • Supports scalability via consistent hashing on a shard key, splitting data into a fixed number of predetermined shards • A replicated set of config servers stores metadata and configuration settings • Within each shard, a primary replica handles writes and secondary replicas handle reads. This keeps secondaries idle for write activity, leading to lower resource utilization, but provides better separation of workloads • Requires additional mongos instances for query routing to the correct shard Operational Analytics and Search • MongoDB supports secondary indexes • Secondary indexes are local to the shard, and any query without a shard key will result in querying every shard • MongoDB does not support a group-by operation in a sharded cluster; instead, map-reduce and aggregation operations must be run. It also does not support covered indexes • Supports multikey indexes for searching on array structures embedded inside a document Replication • Replication is enabled using a master-slave architecture via replica sets • Primary replicas are used for writes and secondary replicas for reads • Replication is asynchronous from master to slave; consistent reads can only be achieved by incurring the cost of a majority-based quorum at the time of a read • Good replication characteristics for cross-data-center replication Management Operation Behavior – Add/Remove Node • Easy to add a shard. The balancer process automatically balances the cluster by migrating chunks • MongoDB also manages splitting and rebalancing data chunks once they reach a threshold • However, these operations come at a cost and impact read and write performance, leading operations teams to often shut down the auto-balancing and splitting processes • Compaction processes need to be run manually in order to release unused memory/space to the OS • Supports background rolling index builds that keep the DB available during index rebuilds (although performance is impacted) Storage Support • WiredTiger storage engine – primarily uses hard disk for persistent storage and RAM for indexes and working datasets. There is also an in-memory storage engine that stores data entirely in memory • The WiredTiger storage engine supports TTL indexes and tiered storage via MongoDB zones • Typical disk space used for a MongoDB production deployment per physical instance is 512 GB - 1 TB, largely limited by backup/restore processes Developer Friendliness • Most popular NoSQL DB • Supports a proprietary JSON-based query syntax • No imposition of document structure and ease of indexing are a big plus in terms of getting started quickly • Support for Java, Python and other language APIs Handling Updates • There is no support for multi-document or multi-collection transactions. The need for multi-row and multi-table transactions must be eliminated through appropriate data modeling techniques. Although the MongoDB documentation suggests a two-phase-commit-like pattern, it is not recommended for production apps that need to ensure consistency and integrity of data • WiredTiger uses MVCC with non-locking algorithms for concurrent updates, which improves efficiency • Supports change streams out of the box, to support near real-time event notification and synchronization scenarios Technology Used • Written in C++ • Does not optimize for the Linux kernel, especially for disk- or SSD-based IO • The WiredTiger storage engine does provide storage efficiencies via enhanced compression and granular concurrency control. It also supports intra-cluster network compression • While MongoDB is multi-threaded, it does not utilize OS-native techniques that optimize for modern multi-core and NUMA-aware hardware architectures Security • Comes with loose default configurations that have been recently exploited in malicious attacks • Provides LDAP-based authentication, role-based ACLs, and encryption at rest as well as in transit • Provides only collection-level access control policies, not row- or field-level security Ease of Setup and Scale • Considerable initial setup is required, with 3 replica-set members per shard (e.g. 3 shards equating to 9 nodes with 1 mongod instance each, or 3 nodes running 3 mongod instances each) • Requires additional mongos instances for query routing to the correct shard
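A minimal Python sketch of the tunable read and write concerns discussed above, using pymongo; the collection and documents are hypothetical, and the right concerns to use depend on the workload.

from pymongo import MongoClient, WriteConcern
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Stricter handle: writes wait for a majority acknowledgement and reads return
# only majority-committed data, trading latency for consistency
orders_strict = db.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority", wtimeout=5000),
    read_concern=ReadConcern("majority"),
)
orders_strict.insert_one({"order_id": 1, "status": "placed"})

# Latency-optimized handle that accepts weaker guarantees (w=1)
orders_fast = db.get_collection("orders", write_concern=WriteConcern(w=1))
orders_fast.update_one({"order_id": 1}, {"$set": {"status": "shipped"}})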
• 36. MongoDB - Practices and Use Cases Best Practices • It is important to give significant attention to data modeling upfront, including deciding whether to embed documents or use references, sharding keys and the indexing strategy. Avoid scatter-gather queries and choose the right level of write guarantees and read concerns • Attention needs to be given to hardware sizing and configuration parameters that account for growth in volume and usage, thus avoiding cumbersome migrations. Ensure working sets fit in RAM and avoid large documents. Dedicate each server to a single role, as each MongoDB server role has significantly different workload characteristics. Ensure proper configuration of compression and data tiering When to consider • When schema flexibility is required • When MySQL-like latencies are desired but the data does not fit a single server • When eventual consistency can be tolerated Use cases • A datastore for customer data • Ecommerce product catalog • For near real-time event notifications and collaboration • Real-time analytics • Mobile and social networking applications • Storing semi-structured data such as blogs, content and logs • For any application that has evolving data requirements When not to consider • When analytical or general search queries are required • When ACID transactions need to be guaranteed across documents or collections • When very low latency reads or writes need to be guaranteed • When very high throughput writes are required • For large batch data processing jobs • For very large datasets (>100 TB) – while MongoDB can handle these large datasets, the lack of cluster-wide resource optimization and the replica-set-based architecture can make such large clusters expensive to provision and maintain
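To illustrate the embed-versus-reference decision called out in the best practices above, here are two hypothetical ways to model the same customer and order data (plain Python dictionaries, no driver calls).

# Embedded: the whole aggregate is read in one round trip; suits cases where
# orders are bounded in number and almost always read together with the customer
customer_embedded = {
    "_id": "cust-1",
    "name": "Ada",
    "orders": [
        {"order_id": 1, "total": 40.0},
        {"order_id": 2, "total": 15.5},
    ],
}

# Referenced: orders live in their own collection keyed by customer_id; suits cases
# where orders grow without bound or are queried independently of the customer
customer_ref = {"_id": "cust-1", "name": "Ada"}
order_ref = {"_id": 1, "customer_id": "cust-1", "total": 40.0}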
• 37. References • Cassandra - https://academy.datastax.com/resources/brief-introduction-apache-cassandra • Cassandra Consistency - https://aphyr.com/posts/294-jepsen-cassandra • HBase - https://mapr.com/blog/in-depth-look-HBase-architecture/ • HBase Splitting and merging - https://hortonworks.com/blog/apache-HBase-region-splitting-and-merging/ • HBase Filters - https://intellipaat.com/tutorial/HBase-tutorial/client-api-advanced-features/ • HBase vs Cassandra - http://bigdatanoob.blogspot.in/2012/11/HBase-vs-cassandra.html • Redis scale-out - https://www.credera.com/blog/technology-insights/open-source-technology-insights/an-introduction-to-redis-cluster/ • MongoDB performance guide - https://neotan.github.io/images/media/MongoDB-Performance-Best-Practices.pdf • MongoDB at Baidu - https://www.slideshare.net/matkeep/mongodb-at-baidu/7
  • 38. • Overview • Characteristics of Distributed Databases • Distributed Database Models • The Incumbents • The Challengers
• 39. Scylla - Overview Scylla is a distributed database designed from the ground up, using the Seastar framework, to be a significantly more efficient and scalable drop-in replacement for Cassandra. By using the Seastar framework, Scylla has optimized heavily across the utilization of CPU, memory, network and IO resources, significantly reducing costs compared to a Cassandra deployment with similar workloads As a distributed database, Scylla has the same architectural foundation as Cassandra, in that it uses the wide column data model and masterless ring architecture. There are, however, a few system architecture choices that allow it to offer enhanced operational capabilities, such as guaranteed low latencies (no JVM pauses) and availability during repair processes (due to parallel repair)
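Because Scylla positions itself as a drop-in replacement, the same CQL driver used for Cassandra can be pointed at a Scylla cluster. A minimal Python sketch with the cassandra-driver package follows; the contact points, keyspace and table are hypothetical.

from cassandra.cluster import Cluster

cluster = Cluster(["scylla-node1", "scylla-node2"])  # hypothetical contact points
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        device_id text, ts timestamp, temperature double,
        PRIMARY KEY (device_id, ts)
    )
""")
session.execute(
    "INSERT INTO demo.readings (device_id, ts, temperature) VALUES (%s, toTimestamp(now()), %s)",
    ("device42", 21.5),
)
for row in session.execute("SELECT * FROM demo.readings WHERE device_id = %s", ("device42",)):
    print(row.device_id, row.ts, row.temperature)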
  • 40. Scylla – Innovations Scylla system architecture innovations and their implications • Faster packet processing by bypassing the kernel space (in ~80 CPU cycles) • No garbage collection pauses, expensive locking or low CPU utilization • No thread context switches; asynchronous, lockless inter-core communication which is highly scalable • Data in cache is reconciled with incoming writes – reduces IO and data model complexity • Row cache format is the same as the serialized format – reducing serialization and deserialization overhead • Direct storage access, with explicit cache management, leading to better control
• 41. Scylla – as an evolution of Cassandra The implications of the system architecture improvements Scylla has made result in • 5-10 times better throughput for combined read/write workloads when compared to Cassandra, making Scylla a more cost-effective alternative to Cassandra • Ability to scale effectively with additional cores in a node • Guaranteed low latency, which cannot be offered by Cassandra because of garbage collection pauses and thread locking behavior • Better compression rates, compaction rates and IO efficiency lead to deployment of higher density storage per node (2-5 TB/node), thus reducing the total cost of infrastructure Scylla offers additional operations benefits • Running repair and compaction processes in parallel with query workloads • Tuning – self-tuning capability removes a lot of the manual overhead and guessing • Isolation and scheduling of background and foreground jobs • Provisioning – ease of adding nodes to a cluster; multiple nodes can be added at once, and standing up a node is faster than adding a Cassandra node to a Cassandra cluster
• 42. Accumulo – Overview Apache Accumulo is a highly scalable structured store based on Google’s BigTable. Accumulo is written in Java and operates over the Hadoop Distributed File System (HDFS). Accumulo supports efficient storage and retrieval of structured data, including queries for ranges, and provides support for using Accumulo tables as input and output for MapReduce jobs. Accumulo provides strong consistency models and is CP on the CAP spectrum Accumulo is the 4th most popular wide column data store after Cassandra, HBase and Microsoft Cosmos. It stands 60th overall in database popularity as per the DB-Engines ranking Accumulo has a lot in common with HBase and can be considered for similar use cases. However, Accumulo has implemented several enhanced capabilities that give it an edge when compared to HBase
• 43. Accumulo – Innovations Accumulo architecture innovations Security • The Accumulo data model adds the concept of column visibility to the original BigTable model. This enables very fine-grained, cell-level security for big data Data Model Flexibility • Ability to add and change column families “after the fact” • Flexible locality groups, which allow application designers to control how columns are grouped on disk, a trick that conventional column-oriented databases rely on for performance of ad-hoc queries • Configurable conditions under which writes to a table will be rejected. Constraints are written in Java and configurable on a per-table basis Secondary Index support • While Accumulo does not support secondary indexes out of the box, it provides several architectural features that make it easy to implement secondary indexes • Support for very large rows and partial scans of rows, which allows applications to build and maintain their own secondary index tables without hitting memory limits • Batch scanners, which can fetch many small reads in a random-access fashion, allowing applications to quickly return full rows corresponding to matches found via index tables • Through the use of specialized iterators, Accumulo can act as a parallel sharded document store. For example, Wikipedia could be stored and searched for documents containing certain words Volume support • Supports a collection of HDFS URIs (host and path), which allows Accumulo to operate over multiple disjoint HDFS instances. This allows Accumulo to scale beyond the limits of a single NameNode • Allows splitting of metadata files, thus allowing scaling to very large volumes of trillions of rows of data
• 44. Accumulo – A better BigTable implementation The implications of the architecture improvements for Accumulo enable the following additional use cases to be implemented with Accumulo • When flexible data models are required in big data scenarios • When data ingest rates are high and the data sets can grow up to trillions of rows • For flexible querying of data using search terms or for graph traversal in very large graphs • When data security requirements are complex and fine-grained at the data attribute level However, Accumulo does have a few downsides compared to HBase • The Accumulo server instances tend to be large to accommodate large rows. Therefore, recovery from a failed node can take significant time • Accumulo is not as well integrated into the Hadoop ecosystem as HBase (e.g. Atlas integration, Oozie integration)
• 45. Aerospike - Overview Aerospike is a distributed, scalable NoSQL database for storing key-value based structures. Although the Aerospike architecture is fundamentally geared towards maximizing availability, it has made accommodations for strong consistency models since Aerospike 3.0, albeit at the cost of latency. Aerospike is unlike other popular key-value stores – Redis Server, Redis Cluster or Memcached – and has made drastically different architectural choices while making significant improvements to the system architecture Aerospike Architecture • The Aerospike architecture comprises three layers: • Client Layer: This cluster-aware layer includes open source client libraries, which implement Aerospike APIs, track nodes, and know where data resides in the cluster. • Clustering and Data Distribution Layer: This layer manages cluster communications and automates fail-over, replication, cross-data-center synchronization, and intelligent re-balancing and data migration. • Data Storage Layer: This layer reliably stores data in DRAM and Flash for fast retrieval.
• 46. Aerospike – Innovation Distributed Architecture innovation • Unlike other key-value databases, which started out as single-server high performance caches, the architecture of Aerospike has three key objectives: • Create a flexible, scalable platform for web-scale applications – multi-cluster masterless setup without complex master-slave configurations. It also provides flexibility in its data model in that it allows multiple datatypes to be mixed into bins • Provide the robustness and reliability (as in ACID) expected from traditional databases – provides single-key atomic transactions (does not support multi-key ACID transactions) • Provide operational efficiency with minimal manual involvement – automated balancing, indexing, sharding, cross-datacenter replication and recovery System Architecture Innovation • Aerospike is implemented in C and is optimized for SSDs and processing speed • Supports hybrid storage across SSD, HDD and RAM – this enables scalability without compromising on speed Limitations • Indexes are all stored in memory, which increases the cost of storage for very large data sets. Also, indexes are not global and need to be queried using scatter-gather queries • Some data structures have Maps associated with them
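A minimal Python sketch of the key-value model and single-record atomicity described above, using the aerospike client library; the namespace, set and bin names are hypothetical.

import aerospike

config = {"hosts": [("127.0.0.1", 3000)]}   # hypothetical seed node; the client is cluster-aware
client = aerospike.client(config).connect()

# Keys are (namespace, set, user key); bins are the named values within a record
key = ("test", "profiles", "user:42")
client.put(key, {"name": "Ada", "visits": 1, "segments": ["sports", "tech"]})

# Operations on a single record are atomic on that record
client.increment(key, "visits", 1)

(_, meta, bins) = client.get(key)
print(meta, bins)

client.close()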
• 47. Aerospike – The best key-value clustered DB The applications of Aerospike will tend to be in the following areas • Storing massive amounts of profile data in online advertising or retail web sites • For real-time, low-latency streaming analytics applications, such as real-time fraud detection, financial front office applications, etc. What is not a good use • A data store with a large number of indexes • Where transactional integrity across datasets or keys is required – e.g. inventory management, financial transaction processing, etc. • Very large volumes (~PBs)
• 48. Yugabyte – Overview Yugabyte is an implementation of the Google Spanner architecture (as laid out in the Google Spanner paper). Like Google Spanner, it is meant to be a system-of-record/authoritative database that geo-distributed applications can rely on for correctness and availability. It is written in C++ and is Apache 2.0-licensed open source software. The Yugabyte APIs are wire compatible with CQL, Redis and PostgreSQL Unlike Google Spanner, Yugabyte maintains the goal of low latency while still respecting ACID semantics This opens up a host of application possibilities for Yugabyte, including blockchain implementations
• 49. Yugabyte - Innovation Yugabyte is built with the following very ambitious goals in mind 1. Transactional • Distributed ACID transactions that allow multi-row updates across any number of shards at any scale • Strongly consistent secondary indexes • Transactional key-document storage engine that is backed by self-healing, strongly consistent replication 2. High Performance • Low latency for geo-distributed applications with multiple read consistency levels and read-only replicas • High throughput for ingesting and serving ever-growing datasets 3. Planet-Scale • Global data distribution that brings consistent data close to users through multi-region and multi-cloud deployments • Auto-sharding and auto-rebalancing to ensure uniform load balancing across all nodes, even for very large clusters 4. Cloud Native • Built for the container era with highly elastic scaling and infrastructure portability, including Kubernetes-driven orchestration • Self-healing database that automatically tolerates the failures common in inherently unreliable modern cloud infrastructure Key Weaknesses • Relatively new and not yet adopted in the enterprise, although there is some promising adoption with startups • No independently published benchmarks, although there are many positive benchmarks in comparison to MongoDB, Cassandra and Google Spanner, highlighting performance and throughput strengths when the other DBs are tuned to strong consistency levels
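Since the paper notes that Yugabyte is wire compatible with PostgreSQL, a standard PostgreSQL driver can exercise its distributed ACID transactions. The following Python sketch uses psycopg2; the host, port, credentials and the accounts table are hypothetical.

import psycopg2

conn = psycopg2.connect(host="yugabyte-node1", port=5433,
                        dbname="yugabyte", user="yugabyte")
try:
    with conn:                       # commits on success, rolls back on exception
        with conn.cursor() as cur:
            # A multi-row update executed as a single ACID transaction
            cur.execute("UPDATE accounts SET balance = balance - 100 WHERE id = %s", (1,))
            cur.execute("UPDATE accounts SET balance = balance + 100 WHERE id = %s", (2,))
finally:
    conn.close()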
• 50. Yugabyte – Promising cloud native, globally consistent DB Yugabyte’s unique capabilities make it usable for many demanding use cases – • Elastic DB service for IoT data, especially where sensor data may be geographically distributed • Geographically distributed, consumer-facing digital operations • Financial data services requiring real-time, strongly consistent updates – stock quote services, finance portfolio management • Lambda architectures for serving real-time analytics, especially for time-ordered datasets – e.g. personalization based on user activity
  • 51. References • Scylla – • https://github.com/scylladb/scylla/wiki/Repair---Scylla-vs-Cassandra • https://www.youtube.com/watch?v=YBsbXYvyZnA • https://www.scylladb.com/product/technology/ • Accumulo • https://www.slideshare.net/DonaldMiner/survey-of-accumulo-techniques-for-indexing-data • http://accumulosummit.com/program/talks/comparing-accumulo-cassandra-HBase/ • Yugabyte vs MongoDB - https://blog.yugabyte.com/overcoming-mongodb-sharding-and-replication-limitations-with-yugabyte-db-ec4eefa5bbd5 • Aerospike Availability and Consistency - https://aphyr.com/posts/324-jepsen-aerospike • Aerospike Consistency - https://www.aerospike.com/docs/architecture/acid.html
  • 52. Databases to be added • Google Spanner • MongoRocks (MongoDB with the RocksDB engine) • Microsoft Cosmos • CouchDB (maybe)