Designing Data-Intensive
Applications
Storage and Retrieval
Summary of storage and retrieval strategies
| # | Strategy | Read Perf | Write Perf | Advantages | Limitations |
|---|----------|-----------|------------|------------|-------------|
| 1 | Append writes to data log | O(n) | O(1) | Simple to implement | Not great for random-access reads |
| 2 | #1 + hash indexes | O(1) | O(1) | Better read performance than #1 | Variance in read performance due to disk seeks, because data is spread out; can eventually run out of disk space |
| 3 | #2 + compaction in background | O(1) | O(1) | Merging old segments avoids data fragmentation; append-and-merge design ensures sequential I/O and simplifies concurrency and crash recovery compared to overwriting data | To offer O(1) read performance, the hash table must fit in memory; range queries are inefficient |
| 4 | Sorted String Tables (SSTables) and Log-Structured Merge Trees (LSM-Trees) | > O(1) and < O(n) | O(1) | Range queries can be performed since compaction involves a merge sort; can handle any data volume, unlike #3, which needed the hash table to fit in memory | Compaction has a direct impact on user-initiated read/write requests; looking up a key's existence still requires going through multiple segments, which can be slow |
| 5 | B-trees | O(log_m n) | O(1) | Consistent read and write performance | Writes require updating both the WAL and the tree, i.e., write locks are used for concurrency control; data can be fragmented since there is no compaction |
Hash indexes with compaction in background
Compaction is a means to avoid the pitfalls
of just appending to the data log.
Specifically, the data log is broken down into
segments, with a hash table for each
segment.
• Writes go to the latest segment.
• Reads look up keys in the hash tables from
latest to oldest.
• Compaction runs in the background,
compacting data logs from oldest to latest
by writing a new compacted copy. Once a
set of segments has been successfully
compacted, they are swapped out.
Implementation details and optimizations:
• Storing data in a binary format is more
efficient than a textual format.
• When a record is deleted, instead of
deleting the value from the hash table, you
write a tombstone record to the log so that
readers know the record was deleted and
stop looking for it in older segments.
• If the hash maps are stored only in memory,
they need to be rebuilt on crash recovery.
To avoid this, the hash maps can be stored
on disk and read back in during crash
recovery.
• Since crashes can happen while a record is
being written, storing a checksum with
each record appended to the log helps
detect corrupted parts of the log so they
can be ignored.
• Concurrency control is accomplished by
allowing a single writer to append to the
log while allowing multiple readers.
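
The sketch below illustrates the design just described: an append-only log on disk with an in-memory hash index pointing at the byte offset of each key's latest record, and tombstones for deletes. It is a minimal, single-segment Python illustration (no compaction, checksums, or binary encoding); the class and file names are made up for the example.

```python
TOMBSTONE = "__tombstone__"   # marker written on delete, as described above

class SimpleLog:
    """Append-only log with an in-memory hash index (one segment, illustrative)."""

    def __init__(self, path):
        self.path = path
        self.index = {}                  # key -> byte offset of the latest record
        open(path, "ab").close()         # create the file if it does not exist
        # A real implementation would rebuild self.index by scanning the log here
        # (or load a persisted copy of the hash map, as noted above).

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")   # naive encoding: no commas/newlines in values
        with open(self.path, "ab") as f:
            offset = f.tell()            # current end of the file
            f.write(record)              # append only; old values are never overwritten
        self.index[key] = offset         # hash index points at the newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)               # a single seek thanks to the in-memory index
            line = f.readline().decode("utf-8").rstrip("\n")
        _, value = line.split(",", 1)
        return None if value == TOMBSTONE else value

    def delete(self, key):
        self.set(key, TOMBSTONE)         # tombstone tells readers the key was deleted

# usage
log = SimpleLog("data.log")
log.set("user:1", "alice")
log.set("user:1", "alicia")              # supersedes the previous record in the index
print(log.get("user:1"))                 # -> "alicia"
log.delete("user:1")
print(log.get("user:1"))                 # -> None
```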
SSTables and LSM Trees
1. When writes come in, add the key-value
pairs to an in-memory balanced tree like
red-black tree, often referred to as
memtable.
2. When memtable becomes bigger than a
certain size (a few MB), it is written to
the disk as a Sorted String Table
(SSTable) segment. While the SST is
being written, writes can happen to a
new instance of memtable.
3. When reads come in, look it up in the
memtable, then in the most recent on-
disk segment, and so on.
4. To speed up lookups in the SSTable, you
maintain a sparse index in memory that
maps a subset of keys to offsets in the
SSTable. To look up a key, you do a binary
search on the sparse index to identify
the range to search for the key-value pair in
the SSTable (see the sketch after this list).
5. From time to time, run a merging and
compaction process (that is a merge
sort) in the background that retains the
sorted order of the merged segments.
6. To handle crashes, in addition to writing
to the memtable, the records are written
to an append-only log that is unsorted.
This is fine since the log is used only
during crash recovery. This log is
discarded once the memtable is backed
up by an SSTable.
7. Some performance optimizations:
i. With the LSM-tree algorithm, looking up keys
that are not present in the store can be
expensive, since it results in searching
through multiple SSTables. To optimize this
scenario, you can use Bloom filters.
ii. Size-tiered and leveled compaction are
used to improve the performance of
SSTables
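
A toy Python sketch of steps 1-4 above. A sorted flush stands in for the red-black-tree memtable, an in-memory list of sorted (key, value) pairs stands in for an on-disk SSTable segment, and a sparse index keeping every Nth key narrows the binary-searched range; all names and thresholds are illustrative, not taken from any real engine.

```python
import bisect

MEMTABLE_LIMIT = 4        # flush threshold; real systems use a few MB
SPARSE_EVERY = 2          # keep every 2nd key in the sparse index (illustrative)

memtable = {}             # stands in for the red-black tree (step 1)
sstables = []             # segments, newest last; each is (sorted_records, sparse_index)

def flush_memtable():
    """Step 2: write the memtable out as a sorted segment with a sparse index."""
    records = sorted(memtable.items())                      # SSTable is sorted by key
    sparse = [(k, i) for i, (k, _) in enumerate(records) if i % SPARSE_EVERY == 0]
    sstables.append((records, sparse))
    memtable.clear()

def put(key, value):
    memtable[key] = value
    if len(memtable) >= MEMTABLE_LIMIT:
        flush_memtable()

def get(key):
    """Steps 3-4: check the memtable, then segments from newest to oldest."""
    if key in memtable:
        return memtable[key]
    for records, sparse in reversed(sstables):
        keys = [k for k, _ in sparse]
        i = bisect.bisect_right(keys, key) - 1              # binary search on the sparse index
        if i < 0:
            continue
        start = sparse[i][1]
        for k, v in records[start:start + SPARSE_EVERY]:    # scan only the narrowed range
            if k == key:
                return v
    return None

# usage
for k, v in [("b", 1), ("a", 2), ("d", 3), ("c", 4), ("e", 5)]:
    put(k, v)
print(get("a"), get("e"), get("zzz"))   # -> 2 5 None
```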
SSTables and LSM Trees (figures): Step 4 - sparse index in memory; Step 5 - merging and compaction in the background
Size-tiered and Leveled compactions
• In the size-tiered compaction strategy, when
enough similar-sized SSTables are present
(four by default), they are merged. As new
SSTables are created, nothing happens at first.
Once there are four, they are compacted
together, and again when there are another
four. When four second-tier SSTables have
been combined to create a third-tier SSTable,
third-tier SSTables are combined to create a
fourth tier, and so on.
Problems with size-tiered compaction in
update-heavy workloads:
• Performance can be inconsistent because there
is no guarantee as to how many SSTables a row
may be spread across: in the worst case, we
could have columns from a given row in each
SSTable.
• A substantial amount of space can be wasted
since there is no guarantee as to how quickly
obsolete columns will be merged out of
existence; this is particularly noticeable when
there is a high ratio of deletes.
• Space can also be a problem as SSTables grow
larger from repeated compactions, since an
obsolete SSTable cannot be removed until the
merged SSTable is completely written.
• Leveled compaction creates SSTables of a
fixed, relatively small size that are grouped
into "levels." Within each level, SSTables are
guaranteed to be non-overlapping. Each level
is ten times as large as the previous. This
solves the above problems with size-tiered
compaction:
• Leveled compaction guarantees that 90% of all
reads will be satisfied from a single SSTable
(assuming nearly-uniform row size).
• At most 10% of space will be wasted by obsolete
rows.
• Only enough space for 10x the SSTable size needs
to be reserved for temporary use by
compaction.
• Leveled compaction isn't great for insertion-
dominated workloads where there isn't any
overlap in the data being written to the
datastore.
B-Trees
• B-Trees are another means to improve the
performance of Hash Indexes. The B-tree is
a generalization of a binary search tree in
that a node can have more than two
children. In B-trees, internal (non-leaf)
nodes can have a variable number of child
nodes within some pre-defined range.
• When data is inserted or removed from a
node, its number of child nodes changes. In
order to maintain the pre-defined range,
internal nodes may be joined or split.
• Since a range of child nodes is permitted,
B-trees do not need re-balancing as
frequently as other self-balancing search
trees, but may waste some space, since
nodes are not entirely full.
• Since B-Trees require nodes to be
overwritten when insertions/deletions
happen, care needs to be taken to handle
crashes. To make databases resilient to
crashes, any tree updates are first written
to a write ahead log (WAL) before the tree
is updated.
• Concurrency control is needed since tree
nodes can be updated in place. Read locks can
be completely avoided, while write locks are
still needed.
• Copy-on-write scheme can be used to
avoid WAL for crash recovery. A modified
page is written to a different location and a
new version of the parent pages in the tree
is created, pointing to the new location.
This method can help with concurrency
control.
Secondary indexes
• There are two components that
constitute an index:
• The key in an index is the thing that
queries search for
• The value can be one of the two things:
• actual row OR
• a reference to the row stored elsewhere, on
a heap file
• The heap file approach is efficient if you
want to update the value without
affecting the keys. However, if the
value's size changes and it needs to be
moved to a different location, you must
leave a forwarding reference at the old
location pointing to the new one.
• Clustered index avoids the extra hop
by storing the value alongside the
index.
• Ex: the primary key of a table could be a
clustered index and the secondary indexes
just refer to the primary index.
• Covering index stores some of a table's
columns within the index. This allows
answering queries by using the index
alone.
• Note: Clustered and covering indexes
speed up reads at the cost of potentially
adding overhead to writes.
• R-Trees are used to map multiple keys
to a value.
• Use case: geospatial search, where you
want to map x and y coordinates to an address.
In-memory databases
• While most databases use hard-disks to store
the data, for durability and cost-effectiveness,
there might be cases where you might prefer
in-memory databases.
• In-memory databases are useful
• If your dataset is small enough to fit in memory
• If your dataset is used for cache purposes only
(i.e., it is OK to lose data on machine restart)
• If you want to use data-models that are not
possible through disk-based databases. For
example, Redis offers priority queues and sets in
addition to the traditional database-like interface.
• Most of the performance of in-memory
databases comes from the fact that you don't
need to maintain additional data structures for
representing data stored on the disk.
• In memory databases can also offer durability
by
• writing a log of changes to disk,
• writing periodic snapshots to disk,
• replicating the in-memory state on other machines
• In-memory database architecture could be
extended to support datasets larger than the
available memory, without bringing back the
overhead of disk-centric architecture. This anti-
caching approach works by evicting the LRU
data from memory to disk when there isn't
enough memory. This is similar to how OS
operates, except at a record level instead of at
page level.
• Some in-memory databases are Memcached,
Redis, and MemSQL.
Data Analytics
OLTP vs OLAP
• Transaction processing has different access patterns than data analytics. The table below compares the two.
| Property | OLTP | OLAP |
|----------|------|------|
| Main read pattern | Small number of records fetched per query | Aggregate over a large number of records |
| Main write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or event stream |
| Primarily used by | End user/customer, via web application | Internal analyst, for decision support |
| What data represents | Latest state of data (current point in time) | History of events that happened over time |
| Dataset size | GB or TB | TB or PB |
Data warehouse access patterns
• In data warehouses, the database
schema looks like:
• A star schema. Typically it has:
• A fact table, with tens of columns that serve
as foreign keys to other tables. For example,
a "sales facts" table has rows, each of which
captures all the facts of each and every sale
that happened.
• Several dimension tables that are connected
to the fact table. Example dimensions to the
sales table are product, store, date,
customer, promotion tables.
• A snowflake schema where the dimensions
are further broken down into sub-
dimensions. For example, the product
table can be broken down into brand and
category tables.
• Column-oriented storage stores all
column data in a given table together
instead of storing records together. This
is useful when only a small subset of
columns in a record are accessed in a
given query.
• This is typically the case in OLAP.
• When the number of potential values that
can be stored in a column is relatively
small, column compression is useful.
• In column-oriented storage, administrators
can pick which columns to sort based on
typical query patterns. Sorting can
certainly help with compression, especially
for the primary sort column (a sketch
follows this section).
• Materialized views that cache the
results of aggregates of potential
queries help improve the performance
of such queries.
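
A small Python sketch of the column-oriented layout and compression idea above, assuming rows are plain dicts. Run-length encoding is shown as one simple form of column compression; it works best when the column is sorted.

```python
from itertools import groupby

# Row-oriented view of a tiny "sales facts" table
rows = [
    {"product": "apple",  "store": "s1", "qty": 3},
    {"product": "apple",  "store": "s2", "qty": 1},
    {"product": "banana", "store": "s1", "qty": 5},
    {"product": "banana", "store": "s1", "qty": 2},
]

# Column-oriented storage: all values of one column are stored together
columns = {name: [r[name] for r in rows] for name in rows[0]}

def rle(values):
    """Run-length encode a column as (value, run_length) pairs."""
    return [(v, sum(1 for _ in g)) for v, g in groupby(values)]

print(columns["product"])        # ['apple', 'apple', 'banana', 'banana']
print(rle(columns["product"]))   # [('apple', 2), ('banana', 2)]

# A query touching one column reads only that column, not whole rows
total_qty = sum(columns["qty"])
print(total_qty)                 # 11
```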
Replication
Motivation and challenges
• Replication means keeping a copy
of the same data on multiple
machines that are connected via a
network. We replicate data
• To keep data geographically close
to users
• To allow the system to keep working
despite machine failures
• To increase read throughput
• The challenges with replication
come from handling changes to
replicated data. There are three
main categories of replication
algorithms:
• Single-leader
• Multi-leader
• Leaderless
Single leader replication
At a high-level single leader replication does the following:
• One of the replicas is designated the leader. All writes from the clients are sent to the
leader.
• Other replicas are known as followers. Whenever a leader writes data to its local storage,
it sends the data change to all its followers as part of a replication log or change stream.
• Client read requests can be satisfied by any of the replicas (including the leader).
The replication itself can be synchronous or asynchronous.
| Synchronous | Asynchronous |
|---|---|
| In synchronous replication, the leader waits for the replication to be completed by the follower. | In asynchronous replication, the leader doesn't wait. |
| Typically, to minimize write delays when synchronous replication is enabled, one of the followers is made synchronous and the others are asynchronous. | The advantage of asynchronous replication is that it improves write throughput at the cost of letting data consistency diverge across the replicas. |
Handling follower failures
• Handling node outages is necessary to
deal with planned maintenance or
unexpected faults so that we can
reduce their impact on client requests.
• To deal with follower failure:
• Each follower keeps a log of the data
changes it has received from the leader.
• If the follower crashes and is restarted, or
if there were network connectivity issues,
the follower recovers quite easily:
• It looks up its log for the last transaction
that it processed and requests the leader for
all the data changes since then.
• Once it has caught up, it can continue
processing changes as they happen on the
leader.
• New followers can be set up as
existing followers become unavailable.
Setting up a new follower typically
involves the following steps:
• Take a consistent snapshot of the leader's
database at some point in time.
• Copy the snapshot to the new follower
node.
• The follower connects to the leader and
requests all the data changes that have
happened since the snapshot.
• When the follower has processed the
backlog of data changes since the
snapshot, it is caught up and is ready to
process changes as they happen at the
leader.
Handling leader failures
• Handling leader failure is trickier than
follower failures and involves a failover
process:
• Detecting a leader failure
• Choosing a new leader
• Re-configuring the system to use the new
leader
• Detecting a leader failure: Since there is
no foolproof way of detecting what has
gone wrong, most systems simply use a
timeout on "keep alive messages" to
detect whether a node is down.
• Choose too large a timeout and your clients might
suffer when the leader goes down.
• Choose too small a timeout and you may replace a
leader just because it was under heavy load,
thereby putting the stress of a failover on a
system that is already suffering under the load.
• Choosing a new leader: A leader can be
elected amongst the available replicas or
appointed by an already elected controller
node. The best candidate is typically the
replica that is most caught up.
• Reconfiguring the system to use the new
leader: The typical challenges here are
with
• routing the requests from the clients to the
new leader
• avoiding confusion in the system when the old
leader comes back, still thinking that it is the
leader; this situation is known as split brain.
• avoiding lost changes when asynchronous
replication is used and when the new leader
does not have all the writes from the previous
leader. This means some of the writes from
clients will be lost, thereby impacting clients'
durability expectations.
Strategies for leader-based replication
1. Statement-based replication: This is the
simplest strategy. In this model, the leader
logs every write request (a SQL statement)
that it executes and sends it to the
followers. This approach has the following
issues
• Any statement with a non-deterministic result
(like calling rand()) can cause the data to
diverge.
• If statements have side effects (like triggers,
stored procedures, user defined functions).
• If statements have auto-incrementing columns
2. Write ahead log (WAL) shipping: This
strategy avoids the issues with statement-
based replication by simply shipping write
ahead logs to the followers. The shipped
data just contains the bytes of all the writes
to the database. When the followers
process the data, they just build the same
copy of the data structures as found on the
leader. The downside of this approach is
that it requires that all the replicas
understand the same WAL. As a result, all
replicas will have to be updated at the same
time, and software updates cannot be
rolled out incrementally.
3. Logical log replication: This strategy avoids
the issues with WAL shipping method by
decoupling the replication log format from
the storage engine.
4. Trigger-based replication: This strategy is
used where there is a need to have much
more flexibility on what is replicated (like
replicate only a subset of the data or want
to deal with conflict resolution). In this
model, a user supplied application code is
invoked as part of replication. This method
can be bug-prone compared to the
database supplied replication strategy.
Read scaling and replication lag
• When replication is introduced to increase read
throughput, then asynchronous followers will be
introduced. With async followers, we have
replication lag. The following strategies are
used to handle replication lag:
• Read-your-writes consistency ensures a user can
read the writes that they made consistently. There
are various strategies to implement this model when
using a leader-based model.
• When reading from something that the user may have
modified themselves (like their user profile on
Facebook), then read it from the leader; otherwise, read
it from a follower.
• If most things from the application are editable, then the
client remembers the last write timestamp and reads
from the leader until a specific time window expires,
upon which the client can read from a follower.
Alternatively, the client remembers the last update
timestamp and uses that to read from any follower that
has been updated at least to that timestamp. This
approach has a downside when the same user
logs in from multiple devices: it would require
a centralized service that manages the timestamp.
• Monotonic reads is a guarantee that prevents users
from seeing things moving backwards in time (ex: a
comment appears and disappears). It is a lesser
guarantee than strong consistency, but a stronger
guarantee than eventual consistency. Users typically
run into this when their initial read is from a replica
that is further ahead in time. One strategy to
prevent this is to assign a specific replica to each
user. However, when a replica fails, the user needs to
be rerouted to another replica.
• Consistent prefix reads is needed where we want to
maintain causality (Ex: a question/response on an
online forum). This is a problem in partitioned
databases. If the database applies writes in the same
order, reads always see a consistent prefix, so there
will not be any violation of causality. More
discussion later.
Multi-leader replication
In a multi-leader replication model, more than one node accepts writes. Replication still
happens the same way: each node that processes the write forwards that data change to
all the other nodes. This model is useful in the following situations:
• Multi-datacenter operation: With a normal leader-based replication setup, the leader
must be in one of the datacenters, and all writes must go through it. Instead, in a multi-
leader setup, you can have a leader in each data center. Both these models can be
compared using the following metrics:
• Performance: Since the inter-datacenter latency can be hidden from users in multi-leader model,
through asynchronous replication, write performance is much better in multi-leader model.
• Tolerance of datacenter outages: In a single-leader model, datacenter failure with the leader
results in promotion of a follower in another datacenter. In multi-leader model, each datacenter
can continue operating independently of another and can catchup when the failed datacenter
comes back online.
• Tolerance of network problems: In single leader model, problems in the inter-datacenter
link can impact all the writes. A multi-leader model is more resilient to these problems.
• Clients with offline operation: Another situation in which multi-leader replication is
appropriate is if you have an application that needs to continue to work while it is
disconnected from the internet. Calendar apps on a phone are an example of this use case:
they allow users to create a meeting irrespective of whether they are connected or
not and sync up their calendar as and when connectivity is established.
Replication topology
A replication topology describes the
communication paths along which writes are
propagated from one node to another.
• The most general is all-to-all, in which
every leader sends its writes to every other
leader.
• A circular topology is one in which each
node receives writes from one node and
forwards it to one other node.
• Another popular topology is star topology,
in which a designated root node forwards
writes to all of the other nodes.
• Star can be generalized to a tree topology.
• In circular and star topologies, the
messages carry details on all the nodes
that have already seen the message to
prevent re-circulation of the same
message.
• An advantage of all-to-all topology is that it
is resilient to single node failures, which
isn't the case with circular or star
topologies.
• A disadvantage of all-to-all topology is that
the messages could be received by the
nodes in an out-of-order manner due to
different network latencies between the
nodes. The messages cannot be
timestamped since we cannot guarantee
that the clocks are in sync across the
nodes. To order these events correctly,
version vectors can be used.
Multi-leader replication and write conflicts
Multi-leader replication introduces write conflicts. Tradeoffs for
handling write conflicts:
• Synchronous versus asynchronous conflict detection: With
asynchronous replication, the conflicts are not detected at the
time of the write. The detection happens asynchronously.
• Conflict avoidance: The simplest strategy for dealing with conflicts
is to avoid them completely. For example, all the writes to a
specific record (like a specific user's profile) are all routed to the
same data center and use the leader in that datacenter for reading
and writing. Different users might have different "home"
datacenters, but from any user's point of view the configuration is
essentially single-leader.
• Converging to a consistent state: With single-leader model, the
writes are all applied in the same order across all the replicas. If
each replica in the multi-leader model also did the same, then the
state will converge. There are a few strategies to accomplish this:
• Give each write a unique ID/time stamp and ensure that the
highest ID is persisted. This model is prone to data loss.
• Give each replica a unique ID and let writes that originated at
a higher-numbered replica take precedence over the writes
that originated from lower-numbered replica. This is prone to
data loss too.
• Custom conflict resolution logic: As the most appropriate way of
resolving a conflict may depend on the application, most multi-
leader replication tools let you write conflict resolution logic using
application code. This code may run on write or read:
• On write: As soon as the database system detects a conflict in
the log of replicated changes, it calls the conflict handler. The
handler typically cannot prompt a user - it runs in a
background process and must execute quickly.
• On read: When a conflict is detected, all the conflicting writes
are stored. The next time data is read, these multiple versions
of the data are returned to the application. The application
may prompt the user or automatically resolve the conflict,
and write the result back to the database.
• Automatic conflict resolution: There has been some interesting
research into automatically resolving conflicts caused by
concurrent data modification. A few lines of research are worth
mentioning:
• Conflict-free replicated data types
• Mergeable persistent data structures
• Operational transformations
Leaderless replication
Leaderless Replication is ideal for systems that:
• Require high availability and low latency
• Can tolerate occasional stale reads (Eventual
Consistency)
In both single-leader and multi-leader replication
models, the leader determines the order in which the
writes are processed, and followers apply the leader's
writes in the same order. In a leaderless replication
model, the client sends its writes simultaneously to
several replicas or to a coordinator node that does this
on the behalf of the client, without enforcing any
ordering on the writes.
One of the main challenges with leaderless replication
model is the case wherein a client attempts to read data
from a node that didn't yet process the latest write. To
avoid this situation:
• The client can version every write message that it
sends out and sends read requests to several nodes in
parallel. The client then uses the version numbers to
determine which value is the latest.
• To bring replicas that were unavailable and are now
back online back in sync, there are two mechanisms:
• Read repair: When a client makes a read from
several nodes in parallel, it can detect any stale
responses and update those specific records on
that replica with the up-to-date value.
• Anti-entropy process: A background process
could constantly look for differences in the data
between replicas and copy any missing data from
one replica to another. It is worth noting that these
copies don't preserve write order the way
leader-based replication models do. Additionally, it
might take a while before the detection and
catch-up happen.
As you can see, any process to bring data consistency
has an impact on read/write throughput.
Leaderless replication and write conflicts
Another challenge that leaderless replication has to face is
detecting and handling concurrent writes because of the design
to allow multiple clients to write to the same key. Similar to
multi-leader replication, this design too requires handling write
conflicts that arise
• When writes from same client arrive in different order to
different nodes
• When writes from different clients happen on the same key
and arrive in different order at two nodes
• When some writes are lost and never arrive at some nodes
There are a few strategies that can be used to handle concurrent
writes:
• Last writer wins model discards concurrent writes. At a high-
level, in this model clients timestamp writes and nodes
overwrite messages with lower timestamps. This model
cannot handle lost messages or resolving conflicts across
multiple messages.
• Merge concurrent writes:
• The server maintains a version number for every key and
increments the version number every time key is written
and stores the new version number along with the value
written.
• When a client reads a key, the server reads all values
that have not been overwritten, as well as the latest
version number. Client must read a key before writing.
• When a client writes a key, it must include the version
number from the prior read, and it must merge together
all the values that it received in the prior read.
• When the server receives a write with a particular
version number, it can overwrite all values with the
version number or below, but it must keep all values
with a higher version number.
• This algorithm works nicely when inserting new keys or
updating existing keys. For deletes, clients still
need to leave tombstones along with the version number
to help merge the results.
• The algorithm described above is for a single node. When you
have multiple replicas accepting concurrent writes, we need a
version number per replica for every key. The collection of
version numbers from all the replicas is called a version
vector. Like version numbers, version vectors are sent from
the database replicas to the clients when the values are read
and need to be sent back to the database when a value is
subsequently written.
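
A hedged Python sketch of how version vectors can be compared to decide whether one write supersedes another or the two are concurrent and must be merged. Vectors are modeled as dicts of replica ID to counter; this is illustrative, not any particular database's implementation.

```python
def compare(vv_a, vv_b):
    """Return 'a<=b', 'b<=a', 'equal', or 'concurrent' for two version vectors."""
    replicas = set(vv_a) | set(vv_b)
    a_le_b = all(vv_a.get(r, 0) <= vv_b.get(r, 0) for r in replicas)
    b_le_a = all(vv_b.get(r, 0) <= vv_a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "a<=b"      # b has seen everything a has: b supersedes a
    if b_le_a:
        return "b<=a"      # a supersedes b
    return "concurrent"    # siblings: the application or client must merge the values

# usage: replicas r1 and r2 both accepted a write to the same key
v1 = {"r1": 2, "r2": 1}
v2 = {"r1": 1, "r2": 2}                          # neither vector dominates the other
print(compare(v1, v2))                           # -> "concurrent", both values kept as siblings
print(compare({"r1": 1}, {"r1": 2, "r2": 1}))    # -> "a<=b", the newer value wins
```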
Leaderless replication and consistency
Quorums help maintain read/write consistency when
using leaderless replication model. When there are n
replicas:
• A write is deemed successful when it is confirmed by w nodes
• A read is consistent when it queries at least r nodes
• where w + r > n
The quorum condition, w + r > n, allows the system
to tolerate unavailable nodes as follows:
• If w < n, we can still process writes if a node is
unavailable
• If r < n, we can still process reads if a node is
unavailable
• If n = 5, w = 3, r = 3 we can tolerate two
unavailable nodes.
Normally, reads and writes are always sent to all n
replicas in parallel. The parameters w and r
determine how many nodes we wait for before we
consider a read or write successful.
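
A minimal Python sketch of the quorum condition above, assuming replicas are plain in-memory dicts and each stored value carries a version number used to pick the newest among the r responses; a missing value is treated as version 0. This is illustrative only.

```python
N, W, R = 5, 3, 3                       # w + r > n, so read and write sets overlap
replicas = [dict() for _ in range(N)]   # stand-ins for replica nodes

def write(key, value, version, available):
    acks = 0
    for i in available:                 # send to every reachable replica
        replicas[i][key] = (version, value)
        acks += 1
    return acks >= W                    # success only with at least w acknowledgements

def read(key, available):
    if len(available) < R:
        raise RuntimeError("read quorum not reachable")
    responses = [replicas[i].get(key, (0, None)) for i in available[:R]]
    return max(responses)[1]            # highest version among the r responses wins

# usage: node 4 is down during the write; nodes 0 and 1 are down during the read
assert write("x", "v1", version=1, available=[0, 1, 2, 3])   # 4 acks >= w
print(read("x", available=[2, 3, 4]))   # still sees "v1": read and write sets overlap
```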
In a large cluster, a client may connect to some
database nodes, but might not be able to assemble a
quorum, due to a network interruption. In such
cases, database designers might choose to accept
writes instead of returning errors to the application.
In sloppy quorums, writes and reads require w and r
successful responses, but those may include nodes
that are not among the designated n "home" nodes
for a value. Once the network interruption is fixed,
any writes that one node temporarily accepted on
behalf of another node are sent to the appropriate
"home" nodes. This is called hinted handoff. Sloppy
quorums have the following attributes:
• They increase write availability at the cost of read
consistency because when a client reads a value,
the latest value might have been temporarily
written to some nodes outside of n.
• Sloppy quorums offer durability but aren't quorums
in the traditional sense.
Partitioning
The main reason for partitioning data is scalability.
Partitions are defined in such a way that each piece of data belongs to exactly one
partition, with the net effect that each partition is a small database of its own, although the
database may support operations that touch multiple partitions at the same time.
Partitioning of key-value data
One situation that any partitioning methodology has to deal with is skewed workloads and
hotspots. One strategy that can be used when a lot of writes happen to the same key is to append a
2-digit random number to the key, splitting the load across 100 partitions (see the sketch below). This
strategy does make reading results for that key challenging, since it requires piecing data together
from all the involved partitions.
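
A short Python sketch of the hot-key strategy just described, assuming a two-digit random suffix spreads a single hot key over up to 100 logical partitions; reads then have to fan out over all suffixes and merge the results. The store and key names are made up for the example.

```python
import random

NUM_SPLITS = 100
store = {}   # stand-in for a partitioned key-value store

def write_hot(key, value):
    split_key = f"{key}#{random.randrange(NUM_SPLITS):02d}"   # e.g. "celebrity_feed#37"
    store.setdefault(split_key, []).append(value)             # load spreads across partitions

def read_hot(key):
    # reading requires piecing together data from every split of the key
    values = []
    for i in range(NUM_SPLITS):
        values.extend(store.get(f"{key}#{i:02d}", []))
    return values

# usage
for n in range(1000):
    write_hot("celebrity_feed", f"event-{n}")
print(len(read_hot("celebrity_feed")))   # -> 1000, reassembled from up to 100 split keys
```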
| Partition by key range | Partition by hash of a key |
|---|---|
| In this model, data is partitioned by assigning a continuous range of keys to each partition. | In this model, a hash function determines the partition for a given key. The hash function should generate the same hash value for a given key and distribute the keys consistently across partition boundaries. |
| When a certain range of data has more requests than others, it can result in hotspots. This can be avoided by picking appropriate key values that avoid such situations. | Use of a hash function can eliminate the need for human or database intervention in key selection, to an extent. |
| This method of partitioning makes it easy to perform range queries when the data in each partition is stored in sorted order. | One of the disadvantages of this model is that we lose the ability to perform range queries. You can arrive at a compromise by using compound primary keys consisting of several columns. For example, on a social media site, the primary key for updates could be (user_id, update_timestamp); this allows you to retrieve updates made by a user for a range of timestamps. |
Secondary indexes
Partitioning data on secondary indexes poses a
challenge because it doesn't map neatly to
partitions. There are two main approaches to
partitioning a database with secondary indexes.
When partitioning secondary indexes by
document (Local Secondary Index), each
partition is separate:
• Each partition maintains its own secondary
indexes, covering only the documents in that
partition.
• Writes to the database deal with only a single
partition.
• Reads will require a scatter/gather approach
wherein a query on secondary index is sent to
all the partitions and merged together; making
it expensive.
Partitioning secondary indexes by term (Global
Secondary Index):
• Instead of each partition having its own
secondary index, this model constructs a
global index that covers data in all partitions.
The index itself is not stored in one node and is
partitioned across multiple nodes to avoid a
single node from being a bottleneck.
• Read queries based on the secondary index are
efficient since only a single node is ever
touched.
• Writes are slower with this model compared to
a local index. Additionally, since updates to
secondary indexes are asynchronous, clients
should be prepared to handle latency in index
updates.
Rebalancing partitions
Motivation for rebalancing:
• Increased query throughput over time
• Larger dataset
• Failed machines
Requirements for rebalancing:
• After the rebalancing, the load is evenly balanced across
the partitions
• Database should keep accepting reads and writes while
rebalancing is in progress
• Rebalancing is optimal and doesn't result in unnecessary
moves across nodes
Strategies for rebalancing:
• Do not use hash mod N to assign a key to a node. This will
result in moving data across all the nodes when the
number of nodes (N) changes (see the sketch after this list).
• Fixed number of partitions:
• Start by assigning partitions to a set of nodes
• When a node gets too much traffic, move some of
the partitions from that node to a new node
• Choosing the right number of partitions, when your
dataset changes over time, is difficult with this
approach.
• Too many partitions can cause unnecessary
overhead, while too few makes rebalancing
expensive.
• Size of the partition is proportional to the dataset.
• Dynamic partitioning:
• When partition grows beyond a certain size, the
partition is broken into two of similar size. Similarly,
two partitions can be merged.
• This can be used by both key-range and hash
partitioned data.
• Size of the partition is between a fixed min and max.
• Partitioning proportionally to nodes:
• There is a fixed number of partitions per node, so
partition size grows with the dataset and becomes
smaller again when nodes are added.
• When a new node joins a cluster, it splits partition(s)
from other nodes and rebalances.
• This method works with hash partitioned data.
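
A quick Python sketch, under simplifying assumptions, of why hash mod N is a poor assignment rule: adding a single node changes the assignment of most keys, whereas with a fixed number of partitions only the partitions reassigned to the new node move.

```python
import hashlib

def node_for(key, num_nodes):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_nodes                      # hash mod N assignment

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(node_for(k, 10) != node_for(k, 11) for k in keys)
print(f"{moved / len(keys):.0%} of keys move when going from 10 to 11 nodes")
# typically around 91% of keys move; with a fixed number of partitions, only the
# partitions handed to the new node (roughly 1/11 of the data) would move
```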
Rebalancing initiation:
• Manual: Safer, but can cause delays.
• Automatic: Faster but can result in unintended
consequences like overloading the network, nodes, etc.
due to premature initiation.
Request routing
• Send all requests from clients to a routing tier first
• An external coordination service like Zookeeper uses a consensus algorithm to
ensure that a central mapping of partitions to nodes is maintained.
• Allow clients to contact any node:
• A gossip protocol can be used for nodes to share information with each other.
It adds complexity to the nodes. However, eliminates the need for an external
coordination service.
• Require that clients be aware of the partition assignments
• In this model, the complexity is moved to the client.
Transactions
ACID Transactions
The safety guarantees offered by transactions are
often described by the well-known acronym ACID,
which stands for Atomicity, Consistency, Isolation,
and Durability.
• Atomicity describes what happens if a client
wants to make several writes, but a fault occurs
after some of the writes. If the writes are
grouped together into an atomic transaction,
and the transaction cannot be completed due
to a fault, then the transaction is aborted, and
the database must discard/undo the writes
done so far.
• The idea of consistency is that you have certain
statements about your data (invariants) that
must always be true (ex: in an accounting
system, credits and debits must always be
balanced). However, consistency is dependent
on the application's invariants and it is
application's responsibility to preserve
consistency.
• Isolation in the sense of ACID means that
concurrently executing transactions are
isolated from each other: they don't stomp
over each other. Database textbooks talk about
serializability, although it is rarely implemented.
Snapshot isolation and other weaker isolation
levels are what databases implement.
• Durability is the promise that once a
transaction has committed successfully, any
data it has written will not be forgotten, even if
there is hardware fault or the database crashes.
Durability "guarantees" should always be taken
with a grain of salt.
We will focus on atomicity and isolation since that is where databases have the most impact on transactions.
Consistency is mostly application dependent, while durability is mostly an invariant across different database
implementations.
Atomicity and isolation
Atomicity and isolation describe what the database should
do when a client makes several writes within the same
transaction.
• Atomicity and isolation are applicable to single object
writes too. Here are a couple of examples
• What happens when a network connection is lost
after the first 10KB of a JSON fragment is sent?
• What happens when a client attempts to read a
record, half-way through a write to a large record?
• We need multi-object transactions for the following
reasons:
• In relational models, we want foreign keys to
reference valid data.
• In a document data model where join functionality
is lacking, denormalization is encouraged. In those
cases, you still end up updating several documents
and need to keep them consistent.
• In databases with secondary indexes, you need to
update secondary indexes, which are different
database objects from the database's point of view.
• Atomicity is a key feature of transactions because it
allows a transaction to be aborted and safely retried if
an error occurs, instead of being left in a half-finished
state. Even so, retrying is not always easy for an
application:
• If the transaction actually succeeded but a network
failure prevented the server from acknowledging it,
a retry would apply the writes twice
• If the error is due to overload, retrying will make
things worse
• If the transaction has any side-effects outside of the
database, like sending emails to clients
• Isolation hides concurrency issues from developers as
clients try to update the same record at the same time.
Serializable isolation means that database guarantees
that transactions have the same effect as if they ran
serially. Serializable isolation has performance costs. We
will look at a couple of non-serializable isolations next.
Read committed: A Non-serializable isolation
Read committed is the most basic level of transaction
isolation that offers two guarantees: No dirty reads and No
dirty writes.
No dirty reads: When reading from the database, you will
only see data that has been committed. This avoids cases
wherein:
• Clients read half-accurate data because not all objects
have been committed yet.
• Clients read data from transactions that might eventually
be rolled back.
No dirty writes: When writing to the database, you will only
overwrite data that has been committed. This avoids cases
wherein:
• Multiple writes stomp over each other in such a manner
that the result leaves the database in an
inconsistent state.
Implementation notes:
• Dirty writes are avoided by enforcing that writing to an
object requires acquiring a lock and holding onto it until
the end of the transaction.
• Dirty reads could be prevented by requiring reads to
also acquire the same lock, but that would be a
performance bottleneck. Instead, the database maintains
both the old and new values of every object updated by a
pending transaction and serves the old values for read
requests until the transaction is committed.
Drawbacks:
• Nonrepeatable reads, or read skew. For example, a
transaction B may start and complete while another
transaction A is in progress; different parts of
transaction A then see different views of the database
that are inconsistent from transaction A's standpoint.
This is especially unacceptable for backups, analytic
queries, and integrity checks.
Snapshot Isolation: A Non-serializable isolation
Snapshot isolation is the most common solution to
nonrepeatable read problem. The idea is that each
transaction sees a consistent snapshot of the database -
that is, the transaction sees all the data that was committed
in the database at the start of the transaction. Even if the
data is subsequently changed by another transaction, each
transaction sees only the old data from that point in time.
Snapshot isolation is implemented by a technique known as
Multi version concurrency control. Implementation notes:
• Databases use a generalization of the mechanism we saw
for preventing dirty reads, wherein the database keeps
track of several different committed versions of an object
because various in-progress transactions may need to see
the state of the database at different points in time.
Because it maintains several versions of an object side by
side, this technique is referred to as MVCC.
• When a transaction reads from the database, transaction
IDs are used to decide which objects it can see, and
which are invisible. By carefully defining visibility rules,
the database can present a consistent snapshot of the
database to the application (a sketch follows this section):
• At the start of a transaction, the database makes a
list of all the other transactions that are in progress
at that time. Any writes that those transactions have
made are ignored.
• Any writes made by aborted transactions are
ignored.
• Any writes made by transactions with a later
transaction ID are ignored, regardless of whether
those transactions have committed.
• All other writes are visible to the application's
queries.
• Indexes pose an interesting challenge for multi-version
database. One option is to have the index simply point to
all versions of an object and require an index query to
filter out any object versions that are not visible.
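
A compact Python sketch of the visibility rules listed above, assuming each object version records the transaction ID that created it and the reading transaction captures the set of in-progress transaction IDs when it starts. This is a simplification of MVCC, not a particular engine's code.

```python
def visible(created_by, reader_txid, in_progress_at_start, aborted):
    """Return True if an object version written by `created_by` is visible to the reader."""
    if created_by in in_progress_at_start:
        return False            # writer was still running when the snapshot was taken
    if created_by in aborted:
        return False            # writes of aborted transactions are ignored
    if created_by > reader_txid:
        return False            # writer started later, whether or not it has committed
    return True                 # everything else is visible

# usage: transaction 7 starts while transactions 5 and 6 are still in progress
in_progress = {5, 6}
aborted = {3}
for writer in [2, 3, 5, 9]:
    print(writer, visible(writer, reader_txid=7,
                          in_progress_at_start=in_progress, aborted=aborted))
# 2 True (committed before the snapshot), 3 False (aborted),
# 5 False (in progress at snapshot), 9 False (later transaction)
```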
Lost updates in snapshot isolation
Snapshot isolation works well for read only transactions in the presence of concurrent writes. However, lost
updates can happen when a write by one client can be overwritten by a write from another client
unintentionally.
There are a variety of solutions to this problem:
• Many databases provide atomic update operations that eliminate the need for read-modify-write cycles in
application code, which can cause consistency issues when writes from multiple clients stomp on each other.
• Applications can explicitly lock objects that are going to be updated. This is useful when the database itself
cannot offer a similar capability. This implementation is prone to bugs in application code.
• The former two approaches avoid the issues with read-modify-write cycles by forcing sequential execution.
Another approach is for the database to automatically detect a lost update, abort the transaction, and
force the client to retry its read-modify-write cycle.
• In databases that don't provide transactions, you sometimes find an atomic compare-and-set operation. The
purpose of this operation is to avoid lost updates by allowing an update to happen only if the value has not
changed since you last read it (see the sketch below).
• In replicated databases, preventing lost updates takes another dimension. In those datastores, the typical
approach is to let concurrent writes through and let the application code perform conflict resolution.
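
A small Python illustration of the compare-and-set idea mentioned above, using a single-process dict for clarity (a real database performs the comparison atomically on the server): the update succeeds only if the value is still what the client read.

```python
store = {"counter": 10}

def compare_and_set(key, expected, new_value):
    """Apply the write only if the value hasn't changed since it was read."""
    if store.get(key) != expected:
        return False           # someone else updated it; caller must re-read and retry
    store[key] = new_value
    return True

# usage: two clients both read 10 and each attempts a read-modify-write
read_a = store["counter"]
read_b = store["counter"]
print(compare_and_set("counter", read_a, read_a + 1))   # True: counter becomes 11
print(compare_and_set("counter", read_b, read_b + 1))   # False: client B must re-read
                                                        # and retry, so its update is not
                                                        # silently lost
```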
Write skews and phantoms in snapshot isolation
Most examples of write skews follow
the following pattern that result in
phantoms:
1. A SELECT query checks whether
some requirement is satisfied by
searching for rows that match
search criteria.
2. Depending on the result of the
first query, the application code
decides how to continue.
3. If the application decides to go
ahead, it makes a write (INSERT,
UPDATE, or DELETE) to the
database and commits the
transaction. The effect of this
write changes the precondition
of the decision in step 2. i.e., if
you run step 1 after committing
the write, you get a different
result.
Techniques to avoid write skews:
• Some databases allow you to
configure constraints, which are
then enforced by the database (like
uniqueness, foreign key constraints,
or restrictions on a particular value).
• Explicit locking is another means to
avoid write skews (FOR UPDATE)
Phantoms
The effect where a write in one
transaction changes the result of a
search query in another transaction is
referred to as a phantom. The problem
with phantoms is that there is no
object in step 1 to attach a lock to. This
problem is avoided by materializing
conflicts, wherein objects are
introduced into the database
artificially and locks are attached to
those objects before proceeding with
adding/updating the real objects in the
database.
Serial execution: A serializable isolation
The simplest way to avoid concurrency problems is to
remove concurrency entirely and execute one transaction at
a time. i.e., do actual serial execution.
Why is this an option now when it wasn't before?
• RAM is now cheap enough to fit the entire database in
memory. As a result, transactions can complete faster
than when portions of the database were on disk.
• Database designers have realized that OLTP transactions
are usually short and make few reads/writes, unlike
OLAP.
How to handle multi-statement transactions?
While the operations involved in the transaction are few, if
they are separated by long periods of inactivity because
they are initiated by a client across the internet, then
serializing transactions will impact performance. Hence,
databases allow multi-statement transactions only as stored
procedures.
What are the disadvantages of stored procedures?
• Each database vendor has its own language for stored
procedures. This is changing more recently with support
for stored procedures written in general purpose
programming language.
• Code running in a database is hard to debug and manage
• A badly written stored procedure can harm the database
far more than a statement issued by an application
• Partitioned databases pose a challenge to this model. You
are fine if you can partition the data such that no
locking is needed across partitions; this increases the
database throughput. However, for any transactions that
span multiple partitions, the stored procedures have to
run in lock-step fashion across partitions, and the
throughput of the database goes down significantly.
Two Phase Locking: A serializable isolation
Overview
For around 30 years, one widely used algorithm for
serializability in databases is two-phase locking (2PL).
In 2PL, several transactions can read objects concurrently as
long as none write to it. As soon as a transaction wants to
write an object, exclusive access is required:
• If transaction A read an object and transaction B wants to
write to the object, then B waits until A commits or
aborts the transaction.
• If transaction A has written an object and transaction B
wants to read the object, B needs to wait until A commits
or aborts.
Performance
While snapshot isolation promises that readers never block
writers and writers never block readers, 2PL provides
serializability and avoids many of snapshot isolation's
concurrency issues by breaking that promise. This also
results in degraded performance, especially at higher
percentiles.
Phantoms and 2PL
Note that the problem of phantoms exists even with 2PL -
that is one transaction changing the results of another
transaction's search query. This problem is solved with
predicate locks wherein instead of getting a lock for a single
object, the lock belongs to multiple objects that match a
search criteria.
Performance of predicate locks
Implementing predicate locks is not performant when
checking locks across multiple active transactions. Most
databases implement predicate locks using index range
locks. Holding an index range lock is like holding a lock to a
range of objects. This optimization reduces the number of
locks that need to be checked before allowing a concurrent
transaction to proceed.
Serializable snapshot isolation: A serializable isolation
Overview
Unlike 2PL and serial execution, serializable snapshot isolation
is an optimistic concurrency control technique. i.e., instead of
blocking if something potentially dangerous happens,
transactions continue anyway. When a transaction wants to
commit, the database checks whether isolation was violated; if
so, the transaction is aborted and must be retried.
Advantages
• Optimistic concurrency performs well when there is little
contention and badly when there is a lot.
• Contention can be reduced with commutative atomic
operations. Ex: it doesn't matter the order in which you
increment the like counter on a tweet.
SSI, an improvement on snapshot isolation
SSI is basically snapshot isolation wherein reads within a
transaction happen from a consistent snapshot of the
database. This is the main difference with earlier concurrency
control techniques. On top of snapshot isolation, SSI adds an
algorithm for detecting serialization conflicts among writes to
determine which transactions to abort. The detection needs to
identify the following scenarios
• Detecting reads of a stale MVCC version of an object
(uncommitted write occurred before the read) - In order to
accomplish this, the database keeps track of whether any
read ignored another transaction’s write due to MVCC rules.
When a transaction wants to commit, the database checks
whether any of the ignored writes have now been
committed. If so, the transaction must be aborted.
• Detecting writes that affect prior reads - To accomplish this,
the database notifies each transaction of any relevant writes
made by other transactions. When a transaction
proceeds to commit, it checks if any conflicting writes from
other transactions have been committed; if so, the
transaction is aborted. If not, the transaction is committed.
Performance of serializable snapshot isolation
• SSI is better than 2PL for read heavy workloads because
they can run without needing any locks.
• SSI is better than serial execution from the standpoint of not
being limited to single CPU
Consistency and consensus
Linearizability
The idea behind linearizability is to eliminate the
replication lag for the client by:
• Making all operations atomic
• Making it appear as if there is only one copy of the
data
In a linearizable system, when a client completes a
write, all clients reading from the database see the
same value. I.e., linearizability is a recency guarantee.
Serializability vs Linearizability

| Serializability | Linearizability |
|---|---|
| Serializability is an isolation property of transactions, where every transaction may read and write multiple objects. It guarantees that transactions behave the same as if they had executed in some serial order. | Linearizability is a recency guarantee on reads and writes of a register. It doesn't group operations together into transactions, so it does not prevent problems such as write skew. |
| 2PL and actual serial execution are typically linearizable. | SSI is not linearizable, since writes from two different transactions are not visible to each other until committed. |
Linearizable systems
Usages of linearizable systems
• Locking and leader election - Single leader replication systems
need to ensure that there is a single leader. This can be
accomplished by expecting replicas to acquire a lock to become a
leader. The lock needs to be linearizable.
• Constraints and uniqueness guarantees - Unique username, need
for an account balance to never go negative, booking a seat in a
flight, all require that there is a single up to date value that all the
nodes agree on.
• Cross channel timing dependencies - Without recency guarantee of
linearizability, race conditions can cause problems (see figure 9-5).
Implementing linearizable systems
Since linearizable systems behave as if there is only one copy of the
data, one option is to not have replicas at all. But, the reason why we
have replicas is to be fault-tolerant. So, let's look at replication
methods and assess if they are linearizable.
• Single-leader replication: leader has the primary copy for writes
and the followers maintain backup copies. You can potentially make
this model linearizable by one of the two options:
• Forcing reads from the leader alone. This option assumes that
you know for sure who the leader is (i.e., you can avoid split
brain). Also, with async replication, when failover happens you
might lose committed writes, which violates durability and
linearizability.
• Allowing reads from synchronously updated followers only
• Consensus algorithms: Consensus algorithms are similar to single-
leader model and have measures to prevent split brain and stale
replicas. Hence, they are linearizable.
• Multi-leader replication: Systems with multi-leader replication are
generally not linearizable because they concurrently process writes
on multiple nodes and asynchronously replicate them to other
nodes.
• Leaderless replication: Leaderless replication offers consistency
through quorums. But replication lag can cause readers to see
different values (see figure 9-6). It is possible to address this
through read repair synchronously; but several caveats still apply
(see page 335 for full discussion).
The cost of linearizability
• If your application requires linearizability, and there is network
partition, then the replicas with user requests either wait for
disconnected replicas or return an error. If your application does
not require linearizability, then each replica can process requests
independently, even when disconnected from other replicas. i.e., a
system is either consistent or available when partitioned.
• Distributed databases that choose not to provide linearizability
guarantees do so primarily to increase performance, not so much
for fault tolerance. Linearizability is slow all the time, not just
during a network fault.
Ordering and causality
Ordering has been discussed so far in the following contexts:
• In Single leader replication, the order of leader’s writes determines
the order in which followers apply writes.
• Serializability is about ensuring that transactions behave as if they
were executed in some sequence order.
• Timestamps and clocks are used to introduce order into a
disorderly world.
Ordering helps preserve causality. Here are a few examples:
• Consistent prefix reads offer causal dependency between question
and answer
• Multi-leader replication can violate causality
• When detecting concurrent writes, we need the "happened before"
relationship.
• Snapshot isolation offers causal consistency by reading from a
consistent snapshot.
• Serializable Snapshot Isolation identifies causal dependencies
across transactions
• Cross channel timing dependencies can result in causal
inconsistencies
Definitions
• A total order allows any two elements to be compared; i.e., given
any two elements you can say which one is greater.
• A group of elements are said to be partially ordered if some
elements can be ordered while some cannot.
• Causality implies partial order and not total order due to
concurrent events.
• Linearizability guarantees total order because the system acts as if
there is only one copy. I.e., linearizability is stronger than causal
consistency. You can build a system that is not linearizable, avoids
linearizability's performance penalty, and still offers causal consistency.
Capturing causal dependencies
In order to maintain causality, you need to know which operation
happened before another. This is a partial order - concurrent
operations can be processed in any order, but if one operation
happened before another then they must be run in that order across
all the replicas. This was discussed in two different places so far:
• Concurrent writes to the same key need to be detected so that
updates are not lost. Causal consistency goes further - it needs to
track causal dependencies across the entire database, not just for a
single key. Version vectors can be generalized to do this.
• Serializable snapshot isolation does conflict resolution by tracking
version numbers to track causality across related operations.
Sequence number ordering
We can use sequence numbers, which act as counters that
are incremented for every operation, to ensure a total order
that is consistent with causality across all the operations
performed in a database. In a database with a single leader
replication, the replication log defines a total order of write
operations that is consistent with causality.
Lamport timestamps
While the aforementioned strategy works for single leader
replication model, it doesn't work for multi-leader or
leaderless replication models where there is no single entity
generating the sequence numbers.
Below are the key characteristics of Lamport timestamps
that help accomplish the goal of causality:
• Unique timestamps: The Lamport timestamp is simply a
pair of (counter, node ID). Two nodes may have the same
counter value, but by including the node ID in the
timestamp, each timestamp is made unique.
• Ordering across timestamps: Of two timestamps, the one with
the greater counter is the greater timestamp; if the counters
are equal, the one with the greater node ID wins.
• Consistent with causality:
• Every client keeps track of the maximum counter value it
has seen so far and includes it in every request.
• When a node receives a request with a maximum
counter value greater than its own counter, it
immediately sets its counter to the maximum.
Additionally, for every operation that it performs, the
node increments its counter by one.
• This ensures causality because every causal
dependency results in an increased timestamp.
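A minimal sketch of a Lamport clock in Python, assuming a simple request/response setting; the class and method names are illustrative, not from the book:

from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class LamportTimestamp:
    counter: int
    node_id: int          # ties on the counter are broken by node ID

class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def local_event(self):
        # Every operation the node performs increments its counter.
        self.counter += 1
        return LamportTimestamp(self.counter, self.node_id)

    def on_request(self, max_counter_seen):
        # A node that sees a larger counter immediately jumps forward to it,
        # then increments for the operation it is about to perform.
        self.counter = max(self.counter, max_counter_seen) + 1
        return LamportTimestamp(self.counter, self.node_id)

node1, node2 = LamportClock(1), LamportClock(2)
t1 = node1.local_event()
t2 = node2.on_request(max_counter_seen=t1.counter)   # causally after t1
assert t1 < t2   # (counter, node ID) pairs give a total order consistent with causality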
Limitation of timestamp ordering
Despite all this, timestamp ordering alone is not sufficient.
Consider a system that needs to ensure that usernames are
unique, and two users try to create the same username
concurrently. Each node needs to decide right now whether its
request succeeds; it cannot wait to be told the total order later.
i.e., you want to be sure that no other node can insert a claim
for the same username ahead of your operation in the total
order before you declare the operation successful.
Total order broadcast
• Total order broadcast is described as a protocol for
exchanging messages between nodes that requires
two safety properties to always be satisfied:
• Reliable delivery - No messages are lost: if a
message is delivered to a node it is delivered to
all the nodes.
• Totally ordered delivery - Messages are
delivered to every node in the same order.
• A way to look at total order broadcast is that it is a
way of creating a log: delivering a message is like
appending to the log. Since all nodes must deliver
the same messages in the same order, all nodes can
read the log and see the same sequence of messages
• Since messages cannot be inserted retroactively into
an earlier position in the order, it is stronger than
timestamp ordering.
Use cases
• Consensus services such as Zookeeper and etcd
implement total order broadcast
• It is leveraged by database replication services to
ensure that every replica processes the writes in the
same order. This principle is known as state machine
replication.
• It can be used to implement serializable transactions
as discussed in actual serial execution, if every
message represents a deterministic transaction to be
executed.
Atomic commit and two-phase commit
Single node vs distributed atomic commits: For transactions
that execute at a single database node, atomicity is
implemented by the database engine. On a single node,
commit depends on the order in which data is durably written
to disk: first the data, then the commit record. Once the
commit record is written, the transaction is committed. For
transactions involving multiple nodes, it is not sufficient to
send a commit request to all the nodes and commit the
transaction independently on each one: the commit might then
succeed on some nodes and fail on others, violating the
atomicity guarantee (a commit, once made, must be irrevocable).
Two-phase commit (2PC) uses a coordinator, which often runs in
the application process that initiates the distributed
transaction. The sequence of promises in 2PC looks as below:
• An application gets a transaction ID from the coordinator.
• The application initiates a transaction across all the
participating nodes using the transaction ID.
• When the application is ready to commit, the coordinator
sends a prepare request to all the participants. If any of the
participants do not respond, the transaction is aborted
across all the participants.
• When the participant receives the prepare request (phase
1), it makes sure that it can commit the transaction under
all circumstances. If so, it replies "yes" to the coordinator.
• When the coordinator receives responses, it decides on
whether to commit the transaction or abort and writes the
decision to a transaction log on the disk. This is called
commit point.
• Once the coordinator's decision is written to disk, it sends
the commit/abort request (phase 2) to all the participants. If
the request fails or times out, the coordinator must retry
forever, no matter how many retries it takes.
• Coordinator failure - A coordinator failure before the
prepare request is simple to handle, because a participant
can safely abort the transaction after a timeout.
However, if the coordinator crashes after the prepare
request, then the participants wait until the coordinator
comes back up. Whenever the coordinator comes back, it
checks the transaction log for any in-doubt transactions and
aborts any that don't have a commit record.
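A simplified, in-process sketch of the 2PC flow described above; all names are illustrative stand-ins (a real coordinator talks to participants over the network and writes its decision to a durable log):

class Participant:
    # Illustrative participant: prepare() must return True only if the node can
    # commit the transaction under all circumstances, even after a crash.
    def __init__(self, name):
        self.name = name
    def prepare(self, txid):
        return True
    def finish(self, txid, decision):
        print(f"{self.name}: {decision} {txid}")

def write_decision_to_log(txid, decision):
    # Stand-in for the coordinator's durable transaction log (the commit point).
    print(f"coordinator: decided to {decision} {txid}")

def two_phase_commit(participants, txid):
    # Phase 1: collect votes; a missing or negative vote forces an abort.
    try:
        votes = [p.prepare(txid) for p in participants]
    except Exception:
        votes = [False]
    decision = "commit" if all(votes) else "abort"
    write_decision_to_log(txid, decision)        # commit point
    # Phase 2: the decision is irrevocable; a real coordinator retries forever
    # until every participant has acknowledged it.
    for p in participants:
        p.finish(txid, decision)
    return decision

two_phase_commit([Participant("db1"), Participant("db2")], txid="T1")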
Fault tolerant consensus
The goal of consensus is simply to get several nodes to
agree on something. There are several situations in
which it is important for nodes to agree:
• Leader election - In a database with a single-leader
it is important for all nodes to agree on which node
is the leader. The leadership position might become
contestable when some nodes cannot communicate
with others. In those situations, without consensus
the system might suffer from split-brain resulting in
multiple leaders accepting writes that lead to data
inconsistency.
• Atomic commit - In a database that supports
transactions spanning several nodes, we can have
situations where a transaction might succeed in
some and fail in others. If we want atomicity of
transactions, we must get all the nodes to agree on
the outcome of the transaction.
A fault-tolerant consensus algorithm does not block if
a majority of processes are working. Requirements for
consensus are:
• Validity – Only proposed values may be selected
• Uniform agreement – No two nodes may select
different values
• Integrity – A node can select only a single value
• Termination – Every node will eventually decide on a
value
Paxos
Nice slides on Paxos, one of the best-known fault-tolerant
consensus algorithms, are here.
• Consensus algorithms decide on a sequence of
values, which makes them total order broadcast
algorithms.
Leaders and consensus algorithms
Consensus algorithms need a leader. However, there
needs to be a consensus on who the leader is. This
circular dependency is broken by epoch numbering
and quorums. Before any value can be decided on, the nodes
first vote to elect a leader for a given epoch number. A
leader elected in this manner can then coordinate the
consensus.
Limitation
One of the limitations of a consensus algorithm is that
a majority of nodes should be available.
Batch processing
Batch processing with Unix tools
Chain of commands in Unix
cat /var/log/nginx/access.log |
  awk '{print $7}' | sort | uniq -c |
  sort -r -n | head -n 5
Custom program
counts = Hash.new(0)                  # default count of 0 for unseen URLs
File.open('/var/log/nginx/access.log') do |file|
  file.each do |line|
    url = line.split[6]               # 7th whitespace-separated field is the URL
    counts[url] += 1
  end
end
top5 = counts.map { |url, count| [count, url] }.sort.reverse[0...5]
top5.each { |count, url| puts "#{count} #{url}" }
Sorting versus in-memory aggregation
The custom program's in-memory hash
table works well when the set of distinct
URLs fits in memory. The Unix approach
scales nicely to large files, because sort
spills to disk when the data does not fit
in memory.
The Unix Philosophy.
Documented in The Art of Unix
Programming and on Wikipedia as below
• Make each program do one thing well.
• Expect the output of one program to
become the input of another, as yet
unknown, program.
• Design and build software, even
operating systems, to be tried early,
ideally within weeks. Don't hesitate to
throw away clumsy parts and rebuild
them.
• Use tools in preference to unskilled
help to lighten a programming task,
even if you have to detour to build the
tools and expect to throw some of
them out after you've finished using
them.
Additional points
• A uniform interface - a file is that
common interface that serves as input
and output for any Unix program
• Separation of logic and wiring -
separating the input/output wiring
from the program logic makes it easier
to compose small tools into bigger
systems
• Transparency and experimentation -
Success of Unix tools comes from the
ease of experimenting and
understanding.
MapReduce
Unix batch processing and MapReduce
• MapReduce is like Unix tools, but distributed across
thousands of machines to parallelize the computation.
• A single MapReduce job is like a Unix process: it takes one
or more inputs and produces one or more outputs, without
any side effects other than producing the output.
• While Unix tools use stdin and stdout, MapReduce jobs
read and write files on a distributed filesystem.
MapReduce and Job Execution - MapReduce is a
programming framework with which you can write code to
process large datasets in a distributed filesystem like HDFS.
The MapReduce job execution pattern for our earlier
examples is:
• Read a set of input files and break them up into records. In
the web server example, each record is one line in the log.
This is handled by the framework's input format parser.
• Call the mapper function to extract a key and value from
each input record. In the preceding example, awk plays the
role of the mapper. This is a step for which you supply
custom code.
• Sort all key-value pairs by key. This is an implicit step in
MapReduce and you don't write code for this.
• Call the reducer function to iterate over the sorted key-
value pairs. In the previous example, uniq did this job. This
is another step where you supply custom code.
MapReduce programming model
To create a MapReduce job, you need to implement two
callback functions, the mapper and reducer, which behave as
follows:
• Mapper: The mapper is called once for every input record,
and its job is to extract the key and value from the input
record. For each input, it may generate any number of key-
value pairs. It does not keep any state from one input
record to the next, so each record is handled
independently.
• Reducer: The MapReduce framework takes the key-value
pairs produced by the mappers, collects all the values
belonging to the same key, and calls the reducer with an
iterator over that collection of values. The reducer can
produce output records.
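A small single-process sketch of the mapper/reducer contract for the web-log example above; the miniature framework is simulated in memory and the function names are illustrative:

from collections import defaultdict

def mapper(line):
    # Called once per input record; emits key-value pairs.
    fields = line.split()
    if len(fields) > 6:
        yield fields[6], 1                  # key = requested URL, value = 1

def reducer(url, counts):
    # Called once per key with all the values collected for that key.
    yield url, sum(counts)

def run_mapreduce(records, mapper, reducer):
    groups = defaultdict(list)              # the framework's implicit shuffle + sort
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    for key in sorted(groups):
        yield from reducer(key, groups[key])

log_lines = [
    '1.2.3.4 - - [10/Oct/2024:13:55:36 -0700] "GET /home HTTP/1.1" 200 123',
    '1.2.3.4 - - [10/Oct/2024:13:55:40 -0700] "GET /about HTTP/1.1" 200 45',
    '5.6.7.8 - - [10/Oct/2024:13:56:02 -0700] "GET /home HTTP/1.1" 200 123',
]
print(list(run_mapreduce(log_lines, mapper, reducer)))   # [('/about', 1), ('/home', 2)]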
Distributed execution of MapReduce
Running Map Tasks
• The MapReduce scheduler determines the number
of Map tasks by the number of input file blocks.
• It copies the Mapper code to one of the machines
that stores a replica of the input file and runs the
Map task there - this principle is referred to as putting
the computation near the data.
• The Map task reads the input file, passing one
record at a time to the Mapper callback. The Mapper's
output is a set of key-value pairs.
• The Map task writes its output sorted by key (using an
approach similar to SSTables and LSM-Trees), partitioned
so that it can be supplied to multiple Reduce tasks.
Running Reduce Tasks
• The number of reduce tasks is configured by the
author.
• The MapReduce framework uses a hash value
mapping to ensure that the records with the same
key reach the same Reducer.
• Whenever a Mapper finishes processing, its output files
are downloaded by the Reduce tasks (this step is known
as the shuffle).
• The MapReduce framework merges the sorted
results from multiple Mappers.
• The Reducer is called with a key and an iterator that
incrementally scans over all records with the same
key. The output file generated by the Reducer is
stored locally and replicated on other machines in
the distributed file system.
MapReduce workflows
A single MapReduce job is usually not enough to solve a
real problem, so it is common to chain jobs together
to form a workflow. Large organizations commonly run
workflows with 50 - 100 jobs chained together. Given
this pattern, a rich ecosystem of tooling exists for
setting up and managing MapReduce workflows.
Joins in batch processing
Joins in the context of batch processing are about resolving
all occurrences of some association within a dataset. In
relational databases the association is a foreign key, in
document databases a document reference, and in graphs an
edge. A batch job ends up reading the full contents of all the
files supplied to it; in that sense, batch processing acts
like OLAP.
Reduce-Side Joins and Grouping
Sort-merge joins – An example approach is as follows:
• A user-activity mapper that emits user-activity records
keyed by user ID
• A user-database mapper that emits user-profile records
keyed by user ID
• The MapReduce framework partitions the mapper output
by key and then sorts key-value pairs. The result of that is
that records for the same user ID become adjacent to each
other in the reducer input. The MapReduce job can also
sort the records such that the Reducer sees the record from
the user database first, followed by user activity - this
technique is known as secondary sort.
• The reducer then performs the actual join logic easily.
This approach is called sort-merge join since the mapper does
the sorting and the reducers merge the sorted list of records
from both sides of the join.
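A single-process sketch of the sort-merge join with a secondary sort; the 0/1 tag stands in for the secondary sort key that makes the user-database record arrive at the reducer before the activity records (data and names are illustrative):

from itertools import groupby
from operator import itemgetter

users = {"u1": "1985-01-01", "u2": "1992-06-30"}            # user ID -> date of birth
activity = [("u1", "view:page1"), ("u2", "view:page2"), ("u1", "click:buy")]

# Mapper output: key = user ID; tag 0 = user record, tag 1 = activity record.
mapped  = [(uid, 0, dob) for uid, dob in users.items()]
mapped += [(uid, 1, event) for uid, event in activity]

# The framework's shuffle: sort by (key, tag) so records for the same user are
# adjacent and the user-database record comes first.
mapped.sort(key=itemgetter(0, 1))

# Reducer: join the user record with every activity record for that user.
joined = []
for uid, records in groupby(mapped, key=itemgetter(0)):
    records = list(records)
    dob = records[0][2]                     # guaranteed first by the secondary sort
    joined += [(uid, dob, event) for _, _, event in records[1:]]

print(joined)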
Group by
GROUP BY clause in SQL helps accomplish the following
aggregation tasks:
• Counting the number of records
• Adding up the values in one particular field
• Picking the top k records based on some ranking
The simplest way of implementing such a grouping operation
with MapReduce is to set up the mappers so that the key-value
pairs they produce use the desired grouping key. The
partitioning and sorting process then brings together all the
records with the same key in the same reducer. Thus, grouping
and joining look quite similar when implemented on top of
MapReduce.
Handling skew
The pattern of bringing all records with the same key to the same place breaks down if there is a large amount
of data related to a single key. This happens in situations like a Twitter celebrity followed by millions of users.
The problem with skewed data is that when a reducer is assigned to handle the data from the hotspot, then it
holds up the entire MapReduce job since the job is not done until all the reducers are done.
There are a few techniques for handling joins involving hotkeys:
• In the skewed join method, a sampling job first identifies the hot keys. When performing the join, records related to
a hot key are sent to one of several reducers chosen at random. For the other input to the join, records relating to the hot
key need to be replicated across all the reducers handling that key.
• In the sharded join method, hot keys are to be specified explicitly instead of being determined by a sampling
job.
• Another approach is to store records related to hot keys in separate files from the rest. When performing
join on that table, a map-side join could be used for the hot keys.
• When grouping records by a hot key and aggregating them, you can perform the grouping in two stages (see the sketch after this list):
• MapReduce sends records to a random reducer, so that each reducer performs the grouping on a
subset of records and outputs a compacted aggregated value per key.
• A second MapReduce job then combines the values from the first-stage reducers to a single value per
key.
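A sketch of the two-stage grouping for a hot key, simulated in one process; the random reducer assignment and the reducer count are arbitrary choices for illustration:

import random
from collections import Counter

events = [("celebrity", 1)] * 1000 + [("regular_user", 1)] * 10
NUM_REDUCERS = 4

# Stage 1: scatter the hot key across reducers at random; each reducer
# pre-aggregates its subset and emits one compact value per key.
stage1 = [Counter() for _ in range(NUM_REDUCERS)]
for key, value in events:
    stage1[random.randrange(NUM_REDUCERS)][key] += value

# Stage 2: a second job combines the partial counts into a single value per key.
final = Counter()
for partial in stage1:
    final.update(partial)

print(final)          # Counter({'celebrity': 1000, 'regular_user': 10})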
Map-side joins
The advantage of Reduce-Side joins is that there are no
assumptions about the input data. The downside is that you
sort, copy and merge data from multiple Mappers and Reducer
and this can be an overhead. If we can make certain
simplifying assumptions about the data, then we can eliminate
some of these inefficiencies. Below are a couple of means to
accomplish that:
• Broadcast hash joins - The simplest way to perform a map-
side join applies when one side of the join is a small dataset
that fits entirely in memory. When each mapper starts up, it
reads that small dataset from the distributed filesystem into
an in-memory hash table. It then scans over the user events
and joins each event with the matching user record from the
hash table (see the sketch after this list). This is referred
to as a broadcast join because the small dataset is effectively
broadcast to all the mappers, each of which loads it into a
hash table to perform the join.
• Partitioned hash joins - If both the inputs to the map-side
join are partitioned in the same way, then the hash join
approach can be applied to each partition independently.
For example, in figure 10-2, the user and activity logs could
be partitioned on the ending digit of user_id. This would
enable each mapper to load a smaller dataset into its hash
table. This is possible only if both inputs have the same
number of partitions and records are assigned to partitions
using the same key and the same hash function.
• Map-side merge joins - Another variant of map-side join
applies if the input is not only partitioned in the same way,
but also sorted on the same key. In this case, it does not
matter whether the input fits in memory, since the mapper
can perform the same merging operation that a reducer
normally does.
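A minimal sketch of a broadcast hash join in a single process, assuming a small user table that each mapper can hold in memory (data and names are illustrative):

# Small side of the join: loaded entirely into an in-memory hash table when
# each "mapper" starts up.
user_table = {"u1": {"dob": "1985-01-01"}, "u2": {"dob": "1992-06-30"}}

# Large side: user events streamed one record at a time.
user_events = [("u1", "view"), ("u3", "view"), ("u2", "purchase")]

def broadcast_hash_join(events, users):
    for user_id, event in events:
        user = users.get(user_id)      # O(1) in-memory lookup; no shuffle needed
        if user is not None:
            yield user_id, event, user["dob"]

print(list(broadcast_hash_join(user_events, user_table)))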
Some observations about map-side joins are listed below.
• The output of a reduce-side join is sorted by the join key,
whereas the output of a map-side join is partitioned and
sorted in the same way as the large input.
• Map-side joins make assumptions about the size, sorting
and partitioning of the input data.
Batch workflow output and Hadoop
Output of batch workflows
• Building search indexes
• Key-value stores can be built as the output of a batch process. The
generated files can then be loaded by databases that serve the results
of the analytics performed by the batch process.
• Batch processing follows the same philosophy as Unix. Not only do
they offer better performance, but also become much easier to
maintain:
• If you introduce a bug into the code and the output is wrong,
you can roll back to a previous version of the code and rerun the job.
• Feature development can proceed more quickly due to the
ease of rollback.
• If a map or reduce task fails, the MapReduce framework
automatically reschedules it and runs it again on the same
input. If the failure is due to a bug in the code, the framework
gives up after a specific number of retries; if it is a transient
issue, the job is resilient to it.
• Same set of files can be used for different jobs for different
purposes.
• Like Unix tools, MapReduce jobs separate logic from wiring
which makes it easy to reuse code.
Comparison of Hadoop and Distributed Databases
• Diversity of storage: While databases expect you to structure the
data according to a specific model, files in a distributed file system
are just byte sequences, which can be written using any data model
and encoding. This allows the following:
• The ability to bring in disparate data from multiple sources to
process and join them.
• The onus of making sense of the data is on the consumer, not
the producer. This allows analysis to be decided on after the
data has already been collected, which is almost always the
case.
• Diversity of processing model: Hadoop allows you to write code to
perform complex tasks that might not be possible through simple
SQL query analysis that a database would support.
• Designing for frequent faults: MapReduce tasks are written to be
rerun several times as the scheduler can choose to preempt and
restart a job. This makes MapReduce resilient to crashes and faults.
• Usage of memory and disk: MapReduce tasks run over large
datasets with heavy disk I/O and relatively little memory,
whereas analytic databases rely heavily on memory. Their
queries typically complete in seconds, as opposed to batch
jobs that can take minutes or longer.
Beyond MapReduce
Materialization of intermediate state
In MapReduce the intermediate state is saved to a file
before it is passed on to the next job - this is referred to
as the materialization of intermediate state. By
contrast, Unix pipes stream the results from one job to
the next. The downsides of MapReduce approach are:
• A MapReduce job can only start when all the tasks in
the preceding jobs have completed.
• Mappers are often redundant: they just read back
the same file that was just written by a reducer.
• Storing the intermediate state in a distributed
filesystem means those files are replicated across
several nodes, which is overkill for temporary
state.
Dataflow engines
To overcome these downsides, dataflow engines came
into existence. Unlike MapReduce jobs, these functions
need not take strict roles of alternating map and
reduce, but instead can be assembled in more flexible
ways. We call these functions operators, and the
dataflow engine provides several different options for
connecting one operator's output to another's input. The
framework provides fault tolerance by keeping track of how
each piece of intermediate data was computed, so it can
recompute lost results when a fault occurs.
Graphs and Iterative processing
For graph like data models, the batch processing can be
slightly different since many graph algorithms are
expressed by traversing one edge at a time, joining
vertices, until a condition is met. Such iterative
algorithms often take the following form:
1. An external scheduler runs a batch process to
calculate one step of the algorithm.
2. When the batch process completes, the scheduler
checks whether it has finished.
3. If it has not yet finished, the scheduler goes back
to step 1.
This approach is very inefficient to implement with
MapReduce, since each iteration typically reads the
entire input dataset rather than just the small part of
the graph that has changed since the last iteration.
The Pregel processing model
• Like MapReduce, wherein mappers "send messages" to
a reducer that handles a specific key, in Pregel, one
vertex can "send a message" to another vertex, and
typically those messages are sent along the edges in a
graph.
• In each iteration, a function is called for each vertex,
passing it all the messages that were sent to it, like a
reducer. Each vertex then generates messages that are
guaranteed to be sent to the target vertices at the start
of the next iteration by copying all the messages over
the network.
• Unlike MapReduce, a vertex remembers its state in
memory from one iteration to another, so the function
needs to process only the new messages.
• If no new messages are sent in some part of the graph,
no work needs to be done.
• Fault tolerance: There are a few aspects of this that are
worth calling out:
• Pregel implementations guarantee that messages
are processed exactly once at their destination
vertex in the following iteration irrespective of
network delays or dropped packets.
• Checkpoints are taken at the end of each iteration,
wherein the state of each vertex is serialized.
When a fault occurs, the framework restarts from the
previous checkpoint and replays the pending messages
into that restored state.
• Parallel execution: The messages are passed across
vertices and each vertex can be run on any machine.
However, finding the right way to partition a graph such
that minimal messages are sent across the network is
an active area of research. If the entire graph can be
processed on a single machine, that offers the best
performance. If, however, the graph does not fit in memory
on a single machine, a distributed algorithm like Pregel is
unavoidable.
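A toy single-machine sketch of the Pregel-style superstep loop, using single-source shortest paths; vertex state persists across iterations and only vertices that received new messages do any work (graph and names are illustrative):

edges = {"A": [("B", 1), ("C", 4)], "B": [("C", 1)], "C": []}   # weighted adjacency list

distance = {v: float("inf") for v in edges}   # per-vertex state kept across iterations
inbox = {"A": [0]}                            # initial message to the source vertex

while inbox:                                  # stop when no new messages were sent
    outbox = {}
    for vertex, messages in inbox.items():
        best = min(messages)
        if best < distance[vertex]:           # state improved: propagate along edges
            distance[vertex] = best
            for neighbor, weight in edges[vertex]:
                outbox.setdefault(neighbor, []).append(best + weight)
    inbox = outbox                            # messages delivered at the next superstep

print(distance)                               # {'A': 0, 'B': 1, 'C': 2}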
Stream processing
Batch processing Vs Stream processing
Batch processing assumes that the input data is bounded. That isn't the case for many real-world scenarios. For scenarios
that deal with unbounded data in a responsive/timely manner, stream processing is necessary.
Batch and stream processing terminology:
Batch processing → Stream processing
• A record → An event
• A file is written once and then potentially read by multiple jobs → An event is generated once by a producer and potentially processed by multiple consumers
• A file is a collection of related records → Related events are grouped together into a topic or stream
• Lacking event notification mechanisms, consumers must poll files/databases for new data, which creates a lot of unnecessary overhead on the entire system → If a simple pipe or TCP connection between the producer and consumer is not sufficient (because you have multiple consumers), then you need a messaging system
Tradeoffs
What happens if the producers send messages faster than the
consumers can process them?
• Drop messages
• Buffer messages in a queue
• Apply flow control to slow the producer down
What happens if nodes crash or temporarily go offline?
• Make the system durable by writing to disk or replication
• Maintain high throughput by ignoring lost messages
A comparison of messaging systems
Direct messaging from producers to consumers
• UDP multicast: It is used in the financial industry. Applications need to
handle lost messages by requesting a retransmit.
• Webhooks: Consumers expose a service on the network that
producers push data to. Throttling is applied by the consumer.
Consumer crashes can be handled in this model by having the producer
retry until all consumers acknowledge delivery. However, when the
producer crashes, this state is lost.
Message brokers
A message broker is a kind of database that is optimized for handling
message streams; producers and consumers register with it as clients.
This design moves durability to the message broker, which handles
clients that come and go (connect, disconnect, crash) by writing events
to disk. Consumers are asynchronous: a producer sends a message, which is
buffered on the broker, and the broker delivers messages to consumers
according to their queue backlog. The differences between message brokers
and databases are summarized below.
Multiple consumers
When multiple consumers read messages on the same topic, two main
patterns emerge:
• Load balancing: Each message is delivered to one of the consumers,
so the consumers can share the work of processing the messages in
the topic. The broker may assign messages to consumers arbitrarily.
• Fan-out: Each message is delivered to all the consumers – like
several different batch jobs that read the same input file.
Acknowledgements and redelivery
Message brokers use acknowledgements to ensure that messages are
received and processed by consumers. When a message is not
acknowledged, a broker that is load balancing messages may choose to
redeliver it to another consumer. When that happens, messages can be
delivered out of order. It is therefore necessary that messages be
self-contained and processable independently whenever load balancing
is used.
Database → Message broker
• Data is kept around until it is explicitly deleted → Typically, messages are deleted once they have been delivered to all the consumers
• Databases allow secondary indexing and searching of data → Consumers can subscribe to a topic based on a matching pattern, but the mechanisms are very different
• When data is queried, clients just get a snapshot and aren't notified when the data changes → Consumers typically subscribe to events and are notified when the data changes; there is no way to make arbitrary queries
Partitioned logs
A messaging system in which messages are deleted once they have
been delivered to all consumers has a downside: if you add a new
consumer (or an old consumer recovers from a crash), that consumer
has no way to receive earlier messages. Log-based message brokers
address this gap as follows:
• When a producer sends a message, it is appended to the end of
the log. A consumer gets a message by reading the log
sequentially and waiting for event notifications when it reaches
the end of the log.
• For throughput/scalability, the log can be partitioned, and
different partitions hosted on multiple machines. Within each
partition, the broker assigns a monotonically increasing sequence
number, or offset, to every message. i.e., messages within a
partition are ordered. But, not across the partitions within the
same message topic/stream.
• Fan-out messaging is implemented by letting every consumer
read the log independently, without affecting other consumers.
Load balancing can be implemented by assigning entire partitions
to consumers in a group. This form of load balancing has a few
downsides:
• The number of nodes sharing the workload is limited by the
number of partitions
• If a message in a partition takes a long time to process,
other messages in that partition are held up
• Since consumers keep track of which messages they have
processed via consumer offsets, the broker has less bookkeeping
overhead for acknowledgements and delivery tracking, and hence
better throughput. The consumer's current offset only needs to be
recorded periodically, so that processing can resume from there if
the consumer becomes unresponsive or crashes.
• Typically, message brokers maintain a circular buffer such that
any old messages that don't fit in a bounded buffer are
discarded. With this model, a disk with 6 TB capacity and
150MB/s write throughput can write messages for 11 hours
before running out of disk space. i.e., consumers can lag by 11
hours before the messages are discarded. This has the following
implications:
• There is enough time for operators to react when a consumer
crashes and an unprocessed backlog starts piling up.
• Even when a consumer is falling behind, it impacts just that
consumer and not the overall system. As a result, it is
feasible to test/debug a production log without impacting
other services.
• Since the lifecycle of a message in the log is controlled by
producer throughput and not the consumer's, replaying
older messages is possible. This enables the kind of iterative
development and experimentation that batch processing systems
allow.
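A toy sketch of a partitioned log with consumer offsets, kept entirely in memory; a real log-based broker persists the segments on disk and records the offsets durably:

class PartitionedLog:
    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def send(self, key, message):
        # Hashing the key keeps all messages for that key in one ordered partition.
        partition = hash(key) % len(self.partitions)
        self.partitions[partition].append(message)
        return partition, len(self.partitions[partition]) - 1    # (partition, offset)

    def poll(self, partition, offset):
        # Consumers read sequentially from their last recorded offset onward.
        return self.partitions[partition][offset:]

log = PartitionedLog(num_partitions=2)
partition, _ = log.send("user-1", "click")
log.send("user-1", "purchase")

offsets = {0: 0, 1: 0}                    # per-partition progress tracked by the consumer
for message in log.poll(partition, offsets[partition]):
    print(message)                        # click, purchase
offsets[partition] = len(log.partitions[partition])   # recorded periodically; replayable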
Keeping heterogeneous systems in sync
One of the problems that occurs in heterogeneous data systems is that
of keeping systems in sync. As the same or related data appears in
several places, they need to be kept in sync with one another: if an
item is updated in the database, it also needs to be updated in the
cache, search indexes and data warehouse. There are several ways to
keep them in sync:
Batch process - The database is exported, transformed and then
loaded into the data warehouse as a batch process. Any additional
processing like search index generation, or other derived data systems
might be created using batch processes.
Dual writes - When batch processing is too slow for your needs, you
can use dual writes, i.e., the application itself writes to each of the
systems directly. Dual writes have problems:
• Concurrent clients can cause the data to become permanently
corrupted or out of sync, unless you use concurrency detection
mechanisms like version vectors.
• Without atomic commit/2PC implemented across the heterogeneous
systems, the two systems can get out of sync when the value is
committed to one data store but the client fails to commit it to
the other.
Single-leader replicated database - If somehow there was a single
leader across the database, the search index and all the other derived
datastores, then it becomes easy to ensure that the data is replicated
across all the data stores without some of the concurrency issues
outlined with dual writes. However, such a model isn't possible given
the breadth of datastores that are available.
Change data capture: CDC is the process of observing all data changes
written to a database and extracting them in a form in which they can
be replicated to other systems. This model makes one database the
leader and turns others to followers. A log-based message broker is
well suited for transporting the change events from the source
database, since it preserves the ordering of the messages. There are a
few ways to implement CDC to generate a change log:
• Using database triggers - This has performance overhead
• Parsing the replication log - Schema changes can make this
challenging, but Facebook's Wormhole and Yahoo!'s Sherpa
do something like this.
• Snapshot model - When building a full-text index, you need the full
database; the most recent few changes alone are not sufficient.
For scenarios like this, you need the ability to start from a consistent
initial snapshot to which changes can then be applied incrementally.
• Log compaction – Please see next slide
Keeping heterogeneous systems in sync – Cont’d
• Log compaction is an alternative to the snapshot model discussed
above. It allows you to keep track of the net result of the data
stored in a compact manner by ignoring the history of the changes
made thus far. A compacted log is proportional to the size of the
data stored and independent of the number of writes done.
• As part of log compaction, the storage engine periodically
looks for log records with the same key, throws away any
duplicates, and keeps only the most recent updates for each
key by compacting the log in the background. Deletes are
usually marked by setting the value of the key to a magic value
like null.
• A derived data system can start a new consumer from offset 0
of the log-compacted topic, and sequentially scan all the
messages in the log to obtain a full copy of the database
contents without taking another snapshot of the CDC source
database.
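A minimal sketch of log compaction: scan the changelog, keep only the most recent value per key, and treat a null (None) value as the deletion tombstone described above:

def compact(log_records):
    # log_records: iterable of (key, value) in write order; value None = tombstone.
    latest = {}
    for key, value in log_records:
        latest[key] = value                       # later writes supersede earlier ones
    # Keep only the most recent update per key; tombstoned keys are dropped entirely.
    return [(k, v) for k, v in latest.items() if v is not None]

changelog = [("u1", "alice"), ("u2", "bob"), ("u1", "alice_smith"), ("u2", None)]
print(compact(changelog))                         # [('u1', 'alice_smith')]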
Event Sourcing: Similar to CDC, event sourcing involves storing all
changes to the application state as a log of change events. The two
main differences between CDC and Event sourcing are:
• With CDC, the application is oblivious to the fact that CDC is
happening. With event sourcing, the application is explicitly aware of
the events that are written to the event log.
• The data in CDC is mutable and updated/deleted. Events in the
event store are append only and are not updated/deleted since
events capture the explicit user-initiated actions, state changes, etc.
Because of how it is designed, event sourcing can help application
developers draw richer insights from usage patterns than what can
be done using the low-level CDC model.
• To derive the current state from the event log, applications must
take the log of events and transform it into application state that
is suitable for showing to a user (see the sketch at the end of this
list). This transformation can be arbitrary logic, but it must be
deterministic so that you arrive at the same state no matter how
many times you rerun it.
• Log compaction must work differently for event sourcing. Since
later events do not supersede earlier ones but instead record the
history of everything that has happened, you need the full log of
events to rebuild the final state.
• Applications typically take a snapshot of the current state as a
performance optimization to not have to process the entire event
log. However, the entire event log is often needed to show a
timeline of events to the user.
• It should be noted that requests from users are best viewed as
commands. A command may be processed successfully, be rejected, or
fail. An event is typically emitted only after the command has
been successfully executed/committed, i.e., once it has become a
fact in the system.
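A small sketch of deriving current state from an event log with a deterministic fold, using a toy shopping cart; the event shapes are illustrative:

def apply_event(state, event):
    # Deterministic: replaying the same events always yields the same state.
    cart = dict(state)
    if event["type"] == "item_added":
        cart[event["item"]] = cart.get(event["item"], 0) + event["qty"]
    elif event["type"] == "item_removed":
        cart.pop(event["item"], None)
    return cart

def current_state(event_log):
    state = {}
    for event in event_log:                       # replay the full, append-only log
        state = apply_event(state, event)
    return state

events = [
    {"type": "item_added", "item": "book", "qty": 1},
    {"type": "item_added", "item": "pen", "qty": 2},
    {"type": "item_removed", "item": "pen"},
]
print(current_state(events))                      # {'book': 1}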
State, streams and immutability
Immutability helps batch processing by allowing users to
experiment with existing input files. Immutability is what
makes event sourcing and change data capture powerful.
Current state of an application is typically stored in a
database. A database is optimized for reading this state
through queries. To support state changes, the database
supports updates and deletes.
Whenever you have a state that changes, that state is the
result of the events that mutated it over time. The key idea is
that the mutable state and an append-only change log that
tracks the stream of immutable events over time are the two
sides of the same coin.
Advantages of immutable events:
• Immutability makes it easier to diagnose bugs and recover
from problems due to deploying buggy code.
• Immutable events also capture more information than just
the current state. For example, they make it possible to
track a user's full journey through an e-commerce website,
which would be lost if we kept only the final transactions/purchases.
• Deriving several views from the same event log - by
separating mutable state from the immutable event log,
you can derive several different read-oriented
representations from the same log of events. Having an
explicit translation step from the event log to a database
also lets you evolve your application over time.
• Concurrency control - CDC and event sourcing address some of the
challenges with derived data, but they update the derived data
asynchronously. If you want the derived data to be updated
immediately, you are back to distributed transactions across
heterogeneous databases. Instead, with an append-only changelog
as the source of truth, a user action requires only a single write -
appending the event to the log - which is easy to make atomic; the
current state is then derived from that log.
Limitations of immutability:
• If the changelog grows disproportionately large compared to the
size of the actual data, the system may run into performance
problems.
• From a privacy standpoint, you may need the ability to go back
and completely delete all data about a specific user from the
changelog, rather than just appending another delete entry.
Processing streams
Broadly, there are three ways that streams are processed:
• You can take the data in the events and write it to a database, a
cache, a search index, etc., from where it can be queried by other
clients.
• You can push the events/notifications to the users in some way
like alerts, notifications, etc.
• You can process one or more input streams to produce one or
more output streams, similar to batch processing. One crucial
difference is that a batch process ends at some point because its
input is finite, whereas a stream never really ends, so a stream
processor can run forever.
Uses of stream processing - Stream processing can be used in
many scenarios that require raising alerts or alarms, such as
fraud detection, trading systems, manufacturing systems, and
military and intelligence systems. Beyond these, here are a few
other places where stream processing is used:
• Complex event processing: Similar to regular expressions, CEP
systems allow you to search for certain patterns of events in a
stream. When a match is found, the engine emits a complex
event with the details of the event pattern that was detected.
Unlike traditional datastores, where the data is persistent and queries
are transient, in CEP systems the queries are stored long-term, and
events from the input streams continuously flow past them, looking
for queries whose patterns they match. Ex: IBM InfoSphere Streams,
TIBCO StreamBase.
• Stream analytics: While CEP is mostly concerned with finding a
specific sequence of events, stream analytics is more oriented
toward aggregations and statistical metrics over a large number
of events (see the sketch after this list) - for example:
• Measuring the rate of some type of event
• Calculating the rolling average of a value over a time period
• Comparing current statistics to previous time intervals
Some example products are: Apache Storm, Flink, Kafka Streams,
Azure Stream Analytics
• Search on streams: CEP allows searching for patterns consisting
of multiple events. Sometimes, there is a need to search for
individual events based on complex criteria, such as full-text
search queries. Conventional search engines first index the
documents and then run queries over the index. By contrast,
when searching a stream, the queries are stored and documents
are run past the queries. To optimize, it is possible to index the
queries and the documents to narrow down the queries that may
match.
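As an illustration of the stream-analytics case above, a sketch of an exponentially weighted moving average computed incrementally over an unbounded stream; the smoothing factor alpha is an arbitrary choice:

def ewma(values, alpha=0.3):
    # Exponentially weighted moving average; only the running average is kept in memory.
    avg = None
    for v in values:
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        yield avg

response_times_ms = [120, 130, 500, 110, 115]     # e.g., per-request latencies
for current in ewma(response_times_ms):
    print(round(current, 1))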
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Designing data intensive applications

  • 5. SSTables and LSM Trees 1. When writes come in, add the key-value pairs to an in-memory balanced tree like red-black tree, often referred to as memtable. 2. When memtable becomes bigger than a certain size (a few MB), it is written to the disk as a Sorted String Table (SSTable) segment. While the SST is being written, writes can happen to a new instance of memtable. 3. When reads come in, look it up in the memtable, then in the most recent on- disk segment, and so on. 4. To speed up lookups in the SSTable, you maintain a sparse index in memory that maps a subset of keys to offsets in the SSTable. To lookup a key, you do a binary search on the sparse index to identify the range to search for he key-value in the SSTable. 5. From time to time, run a merging and compaction process (that is a merge sort) in the background that retains the sorted order of the merged segments. 6. To handle crashes, in addition to writing to the memtable, the records are written to an append-only log that is unsorted. This is fine since the log is used only during crash recovery. This log is discarded once the memtable is backed up by an SSTable. 7. Some performance optimizations: i. With LSM-tree algorithm, looking up keys that are not present in the store can be expensive resulting in searching through multiple SSTables. To optimize this scenario, you can use Bloom filters. ii. Size-tiered and leveled compaction are used to improve the performance of SSTables
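A minimal Python sketch of the write/flush/read path described above (names such as LSMStore and SSTableSegment are illustrative, not any particular engine's API): writes go into an in-memory table, which is flushed to a sorted on-disk-style segment once it exceeds a size threshold; reads check the memtable first, then segments from newest to oldest, using a sparse index to narrow the scan.

import bisect

class SSTableSegment:
    """Immutable sorted run of key-value pairs with a sparse in-memory index."""
    def __init__(self, sorted_items, index_every=2):
        self.items = sorted_items                      # [(key, value), ...] sorted by key
        # Sparse index: every Nth key -> its position in the segment.
        self.sparse = [(k, i) for i, (k, _) in enumerate(sorted_items) if i % index_every == 0]

    def get(self, key):
        keys = [k for k, _ in self.sparse]
        pos = bisect.bisect_right(keys, key) - 1       # last indexed key <= lookup key
        if pos < 0:
            return None
        start = self.sparse[pos][1]
        for k, v in self.items[start:]:                # scan forward within the segment
            if k == key:
                return v
            if k > key:
                return None
        return None

class LSMStore:
    def __init__(self, memtable_limit=4):
        self.memtable = {}          # stand-in for a balanced tree; sorted at flush time
        self.segments = []          # newest segment last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        self.segments.append(SSTableSegment(sorted(self.memtable.items())))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for seg in reversed(self.segments):            # newest to oldest
            v = seg.get(key)
            if v is not None:
                return v
        return None

store = LSMStore()
for i in range(10):
    store.put(f"k{i:02d}", i)
print(store.get("k03"), store.get("missing"))          # 3 None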
  • 6. SSTables and LSM Trees (diagram slide): sparse index in memory (step 4); merging and compaction in the background (step 5)
  • 7. Size-tiered and leveled compaction
    • In the size-tiered compaction strategy, once enough similar-sized SSTables are present (four by default), they are merged. As new SSTables are created, nothing happens at first; once there are four, they are compacted together, and again when another four accumulate. When enough second-tier SSTables have been created, they are combined into a third tier, the third tier into a fourth, and so on.
    Problems with size-tiered compaction in update-heavy workloads:
    • Performance can be inconsistent because there is no guarantee as to how many SSTables a row may be spread across: in the worst case, a given row could have columns in every SSTable.
    • A substantial amount of space can be wasted, since there is no guarantee as to how quickly obsolete columns will be merged out of existence; this is particularly noticeable when there is a high ratio of deletes.
    • Space can also be a problem as SSTables grow larger from repeated compactions, since an obsolete SSTable cannot be removed until the merged SSTable is completely written.
    • Leveled compaction creates SSTables of a fixed, relatively small size that are grouped into "levels." Within each level, SSTables are guaranteed to be non-overlapping, and each level is ten times as large as the previous one. This addresses the problems above with tiered compaction:
    • Leveled compaction guarantees that 90% of all reads are satisfied from a single SSTable (assuming nearly uniform row size).
    • At most 10% of space is wasted by obsolete rows.
    • Only enough space for 10x the SSTable size needs to be reserved for temporary use by compaction.
    • Leveled compaction isn't great for insertion-dominated workloads where there isn't any overlap in the data being written to the datastore.
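A rough sketch of the size-tiered trigger described above, assuming the common default of merging once four similar-sized runs accumulate in a tier; the merge keeps only the newest value for each key. This is illustrative, not any engine's actual implementation.

from collections import defaultdict

FANOUT = 4   # merge once four similar-sized runs accumulate (a common default)

def merge_runs(runs):
    """Merge several sorted runs, keeping the newest value for duplicate keys.
    Runs are ordered oldest to newest, so later values win."""
    merged = {}
    for run in runs:
        for k, v in run:
            merged[k] = v
    return sorted(merged.items())

class SizeTieredCompactor:
    def __init__(self):
        self.tiers = defaultdict(list)    # tier number -> list of runs

    def add_run(self, run, tier=0):
        self.tiers[tier].append(run)
        if len(self.tiers[tier]) >= FANOUT:
            batch = self.tiers[tier]
            self.tiers[tier] = []
            self.add_run(merge_runs(batch), tier + 1)   # promoted run may cascade upward

c = SizeTieredCompactor()
for i in range(16):                       # 16 small flushes collapse upward into tier 2
    c.add_run([(f"k{i % 6}", i)])
print({t: len(runs) for t, runs in c.tiers.items()})     # {0: 0, 1: 0, 2: 1}
print(c.tiers[2][0])                      # one merged run with the latest value per key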
  • 8. B-Trees
    • B-trees are another widely used indexing structure, an alternative to the log-structured indexes above. The B-tree is a generalization of a binary search tree in that a node can have more than two children. In B-trees, internal (non-leaf) nodes can have a variable number of child nodes within some pre-defined range.
    • When data is inserted into or removed from a node, its number of child nodes changes. In order to maintain the pre-defined range, internal nodes may be joined or split.
    • Since a range of child nodes is permitted, B-trees do not need rebalancing as frequently as other self-balancing search trees, but they may waste some space, since nodes are not entirely full.
    • Since B-trees require nodes to be overwritten in place when insertions/deletions happen, care needs to be taken to handle crashes. To make databases resilient to crashes, any tree updates are first written to a write-ahead log (WAL) before the tree itself is updated.
    • Concurrency control is needed since tree nodes are updated in place. Read locks can be completely avoided, while write locks are still needed.
    • A copy-on-write scheme can be used instead of a WAL for crash recovery: a modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing to the new location. This method also helps with concurrency control.
  • 9. Secondary indexes
    • There are two components that constitute an index:
    • The key in an index is the thing that queries search for.
    • The value can be one of two things: the actual row, OR a reference to the row stored elsewhere, in a heap file.
    • The heap file approach is efficient if you want to update a value without affecting the keys. However, if the value's size changes and it needs to be moved to a different location, a forwarding reference must be left at the old location pointing to the new one.
    • A clustered index avoids the extra hop by storing the row alongside the index. For example, the primary key of a table could be a clustered index, with the secondary indexes just referring to the primary key.
    • A covering index stores some of a table's columns within the index. This allows some queries to be answered from the index alone.
    • Note: clustered and covering indexes speed up reads at the cost of added overhead on writes.
    • R-trees are used to index multiple keys (dimensions) at once. Use case: geospatial search, where you want to map x and y coordinates to an address.
  • 10. In-memory databases
    • While most databases use disks to store data, for durability and cost-effectiveness, there are cases where you might prefer in-memory databases.
    • In-memory databases are useful:
    • If your dataset is small enough to fit in memory
    • If your dataset is used for caching purposes only (i.e., it is acceptable to lose data on machine restart)
    • If you want to use data models that are not practical with disk-based databases. For example, Redis offers priority queues and sets in addition to the traditional database-like interface.
    • Much of the performance of in-memory databases comes from the fact that you don't need to maintain the additional data structures required to represent data on disk.
    • In-memory databases can also offer durability by writing a log of changes to disk, writing periodic snapshots to disk, or replicating the in-memory state to other machines.
    • The in-memory database architecture can be extended to support datasets larger than the available memory without bringing back the overhead of a disk-centric architecture. This anti-caching approach works by evicting the least recently used data from memory to disk when there isn't enough memory. This is similar to how operating systems swap pages, except at the record level instead of the page level.
    • Some in-memory databases are Memcached, Redis, and MemSQL.
  • 12. OLTP vs OLAP
    • Transaction processing (OLTP) has different access patterns than data analytics (OLAP). A comparison:
    • Main read pattern: OLTP fetches a small number of records per query; OLAP aggregates over a large number of records.
    • Main write pattern: OLTP uses random-access, low-latency writes from user input; OLAP uses bulk import (ETL) or an event stream.
    • Primarily used by: OLTP serves the end user/customer via a web application; OLAP serves internal analysts for decision support.
    • What the data represents: OLTP holds the latest state of the data (current point in time); OLAP holds a history of events that happened over time.
    • Dataset size: OLTP is typically GB to TB; OLAP is typically TB to PB.
  • 13. Data warehouse access patterns
    • In data warehouses, the database schema is typically:
    • A star schema, which has:
    • A fact table, with tens of columns, many of which serve as foreign keys to other tables. For example, each row of a "sales facts" table captures all the facts about one sale.
    • Several dimension tables connected to the fact table. Example dimensions for the sales table are product, store, date, customer, and promotion tables.
    • A snowflake schema, where the dimensions are further broken down into sub-dimensions. For example, the product table can be broken down into brand and category tables.
    • Column-oriented storage stores all the values of a column together instead of storing records together. This is useful when a query accesses only a small subset of the columns in each record, which is typically the case in OLAP.
    • When the number of distinct values that can appear in a column is relatively small, column compression is effective.
    • In column-oriented storage, the administrator can pick which columns to sort by based on typical query patterns. Sorted columns also compress well, especially the first sort column.
    • Materialized views that cache the results of common aggregate queries help improve the performance of those queries.
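To make the row-versus-column layout concrete, here is a small illustrative sketch (the sample data is made up): each column is stored as its own array, so an aggregate over one column never touches the others, and a low-cardinality column can be dictionary/bitmap encoded.

# Row-oriented: each sale stored as one record.
rows = [
    {"date": "2024-01-01", "product": "apple",  "store": "S1", "qty": 3},
    {"date": "2024-01-01", "product": "banana", "store": "S2", "qty": 5},
    {"date": "2024-01-02", "product": "apple",  "store": "S1", "qty": 2},
]

# Column-oriented: one array per column; a query touching only `qty` reads just that array.
columns = {name: [r[name] for r in rows] for name in rows[0]}
print(sum(columns["qty"]))                      # aggregate without touching other columns

# Simple compression for a low-cardinality column: one bitmap per distinct value
# (run-length encoding of these bitmaps would be a further step).
def bitmap_encode(column):
    distinct = sorted(set(column))
    return {v: [1 if x == v else 0 for x in column] for v in distinct}

print(bitmap_encode(columns["product"]))
# {'apple': [1, 0, 1], 'banana': [0, 1, 0]}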
  • 15. Motivation and challenges
    • Replication means keeping a copy of the same data on multiple machines that are connected via a network. We replicate data:
    • To keep data geographically close to users
    • To allow the system to keep working despite machine failures
    • To increase read throughput
    • The challenge with replication comes from handling changes to replicated data. There are three main categories of replication algorithms: single-leader, multi-leader, and leaderless.
  • 16. Single-leader replication
    At a high level, single-leader replication works as follows:
    • One of the replicas is designated the leader. All writes from clients are sent to the leader.
    • The other replicas are known as followers. Whenever the leader writes data to its local storage, it sends the data change to all its followers as part of a replication log or change stream.
    • Client read requests can be satisfied by any of the replicas (including the leader).
    The replication itself can be synchronous or asynchronous:
    • Synchronous: the leader waits for the follower to confirm the replication before acknowledging the write.
    • Asynchronous: the leader doesn't wait.
    • Typically, to minimize write delays when synchronous replication is enabled, one follower is made synchronous and the others are asynchronous.
    • The advantage of asynchronous replication is that it improves write throughput, at the cost of the replicas' data temporarily diverging.
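A toy sketch of the flow above, assuming one synchronous follower and one asynchronous follower (the class and method names are invented for illustration): the leader appends to its replication log, waits for the synchronous follower before acknowledging, and ships queued changes to asynchronous followers later.

class Follower:
    def __init__(self, name):
        self.name, self.data, self.log = name, {}, []

    def apply(self, offset, key, value):
        self.log.append((offset, key, value))
        self.data[key] = value

class Leader:
    def __init__(self, sync_follower, async_followers):
        self.data, self.log = {}, []
        self.sync_follower = sync_follower
        self.async_followers = async_followers
        self.pending = []                      # changes not yet pushed to async followers

    def write(self, key, value):
        offset = len(self.log)
        self.log.append((key, value))
        self.data[key] = value
        # Synchronous follower: wait for it (here, a direct call) before acknowledging.
        self.sync_follower.apply(offset, key, value)
        # Asynchronous followers: queue the change; it is shipped later.
        self.pending.append((offset, key, value))
        return "acknowledged"

    def ship_async(self):
        for change in self.pending:
            for f in self.async_followers:
                f.apply(*change)
        self.pending.clear()

sync_f, async_f = Follower("sync"), Follower("async")
leader = Leader(sync_f, [async_f])
leader.write("user:1", "Alice")
print(sync_f.data, async_f.data)     # sync follower is up to date, async one lags
leader.ship_async()
print(async_f.data)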
  • 17. Handling follower failures • Handling node outages is necessary to deal with planned maintenance or unexpected faults so that we can reduce their impact on client requests. • To deal with follower failure: • Each follower keeps a log of the data changes it has received from the leader. • If the follower crashes and is restarted, or if there were network connectivity issues, the follower recovers quite easily: • It looks up its log for the last transaction that it processed and requests the leader for all the data changes since then. • Once it has caught up, it can continue processing changes as they happen on the leader. • New followers could be setup as existing followers become unavailable. Setting up new followers typically involves the following steps • Take a consistent snapshot of the leader's database at some point in time. • Copy the snapshot to the new follower node. • The follower connects to the leader and requests all the data changes that have happened since the snapshot. • When the follower has processed the backlog of data changes since the snapshot, it is caught up and is ready to process changes as they happen at the leader.
  • 18. Handling leader failures • Handling leader failure is trickier than follower failures and involves a failover process: • Detecting a leader failure • Choosing a new leader • Re-configuring the system to use the new leader • Detecting a leader failure: Since there is no foolproof way of detecting what has gone wrong, most systems simply use a timeout on "keep alive messages" to detect whether a node is down. • Choose a large timeout and your clients might suffer when leader goes down. • Choose too small a timeout and you change a leader just because the leader was under a heavy load; thereby resulting in putting the stress of a failover on a system that is already suffering under the load. • Choosing a new leader: A leader can be elected amongst the available replicas or appointed by an already elected controller node. The best candidate is typically the replica that is most caught up. • Reconfiguring the system to use the new leader: The typical challenges here are with • routing the requests from the clients to the new leader • avoiding confusion in the system when the old leader comes back into the system thinking that it is the leader. This results in split brain. • avoiding lost changes when asynchronous replication is used and when the new leader does not have all the writes from the previous leader. This means some of the writes from clients will be lost, thereby impacting clients' durability expectations.
  • 19. Strategies for leader-based replication
    1. Statement-based replication: the simplest strategy. The leader logs every write request (e.g., a SQL statement) that it executes and sends it to the followers. This approach has the following issues:
    • Any statement with a non-deterministic result (like calling rand()) can cause the data to diverge.
    • Statements with side effects (triggers, stored procedures, user-defined functions) may behave differently on each replica.
    • Statements that use auto-incrementing columns must be executed in exactly the same order on every replica.
    2. Write-ahead log (WAL) shipping: this strategy avoids the issues with statement-based replication by simply shipping write-ahead logs to the followers. The shipped data contains the bytes of all writes to the database; when the followers process it, they build the same copy of the data structures as found on the leader. The downside is that it requires all replicas to understand the same WAL format, so all replicas have to be upgraded at the same time and software updates cannot be rolled out incrementally.
    3. Logical log replication: this strategy avoids the issues with WAL shipping by decoupling the replication log format from the storage engine.
    4. Trigger-based replication: this strategy is used where there is a need for much more flexibility in what is replicated (e.g., replicating only a subset of the data, or handling conflict resolution). User-supplied application code is invoked as part of replication. This method can be more bug-prone than the database's built-in replication.
  • 20. Read scaling and replication lag • When replication is introduced to increase read throughput, then asynchronous followers will be introduced. With async followers, we have replication lags. Following strategies are used to handle replication lag: • Read-your-writes consistency ensures a user can read the writes that they made consistently. There are various strategies to implement this model when using a leader-based model. • When reading from something that the user may have modified themselves (like their user profile on Facebook), then read it from the leader; otherwise, read it from a follower. • If most things from the application are editable, then the client remembers the last write timestamp and reads from the leader until a specific time window expires, upon which the client can read from a follower. Alternatively, the client remembers the last update timestamp and uses that to read from any follower that has been updated at least to that timestamp. This approach has the downside when the same customer can log onto multiple devices. This would require a centralized server that manages the timestamp. • Monotonic reads is a guarantee that prevents users from seeing things moving backwards in time (ex: a comment appears and disappears). It is a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency. Users typically run into this when their initial read is from a replica that is further ahead in time. One strategy to prevent this is to assign a specific replica to each user. However, when a replica fails, the user needs to be rerouted to another replica. • Consistent prefix reads is needed where we want to maintain causality (Ex: a question/response on an online forum). This is a problem in partitioned databases. If the database applies writes in the same order, read always see a consistent prefix, so there will not be any violation of causality. More discussion later.
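A minimal sketch of the "remember your last write timestamp" variant of read-your-writes described above, using a logical clock and a single follower; all names are illustrative.

import itertools

class Replica:
    def __init__(self):
        self.data = {}
        self.applied_ts = 0                  # logical timestamp of the last applied write

clock = itertools.count(1)
leader, follower = Replica(), Replica()

def write(key, value):
    ts = next(clock)
    leader.data[key] = value
    leader.applied_ts = ts
    return ts                                 # client remembers its last-write timestamp

def replicate():
    follower.data.update(leader.data)
    follower.applied_ts = leader.applied_ts

def read_your_writes(key, last_write_ts):
    # Read from a follower only if it has caught up to the client's last write;
    # otherwise fall back to the leader.
    replica = follower if follower.applied_ts >= last_write_ts else leader
    return replica.data.get(key)

ts = write("profile", "new bio")
print(read_your_writes("profile", ts))        # served by the leader: 'new bio'
replicate()
print(read_your_writes("profile", ts))        # the follower has caught up now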
  • 21. Multi-leader replication In a multi-leader replication model, more than one node accepts writes. Replication still happens the same way: each node that processes the write forwards that data change to all the other nodes. This model is useful in the following situations: • Multi-datacenter operation: With a normal leader-based replication setup, the leader must be in one of the datacenters, and all writes must go through it. Instead, in a multi- leader setup, you can have a leader in each data center. Both these models can be compared using the following metrics: • Performance: Since the inter-datacenter latency can be hidden from users in multi-leader model, through asynchronous replication, write performance is much better in multi-leader model. • Tolerance of datacenter outages: In a single-leader model, datacenter failure with the leader results in promotion of a follower in another datacenter. In multi-leader model, each datacenter can continue operating independently of another and can catchup when the failed datacenter comes back online. • Tolerance of network problems: In single leader model, problems in the inter-datacenter link can impact all the writes. A multi-leader model is more resilient to these problems. • Clients with offline operation: Another situation in which multi-leader replication is appropriate is if you have an application that needs to continue to work while it is disconnected from the internet. Calendar apps in a phone is an example of this use case in that it allows users to create a meeting irrespective of whether they are connected or not and syncs up their calendar as and when the connectivity is established.
  • 22. Replication topology A replication topology describes the communication paths along which writes are propagated from one node to another. • The most general is all-to-all, in which every leader sends its writes to every other leader. • A circular topology is one in which each node receives writes from one node and forwards it to one other node. • Another popular topology is star topology, in which a designated root node forwards writes to all of the other nodes. • Star can be generalized to a tree topology. • In circular and star topologies, the messages carry details on all the nodes that have already seen the message to prevent re-circulation of the same message. • An advantage of all-to-all topology is that it is resilient to single node failures, which isn't the case with circular or star topologies. • A disadvantage of all-to-all topology is that the messages could be received by the nodes in an out-of-order manner due to different network latencies between the nodes. The messages cannot be timestamped since we cannot guarantee that the clocks are in sync across the nodes. To order these events correctly, version vectors can be used.
  • 23. Multi-leader replication and write conflicts
    Multi-leader replication introduces write conflicts. Trade-offs for handling write conflicts:
    • Synchronous versus asynchronous conflict detection: with asynchronous replication, conflicts are not detected at the time of the write; detection happens asynchronously.
    • Conflict avoidance: the simplest strategy for dealing with conflicts is to avoid them entirely. For example, all writes to a specific record (like a specific user's profile) are routed to the same datacenter, and the leader in that datacenter is used for reading and writing. Different users may have different "home" datacenters, but from any one user's point of view the configuration is essentially single-leader.
    • Converging to a consistent state: with the single-leader model, writes are applied in the same order across all replicas. If each replica in the multi-leader model also did the same, the state would converge. There are a few strategies to accomplish this:
    • Give each write a unique ID/timestamp and ensure that the write with the highest ID wins. This model is prone to data loss.
    • Give each replica a unique ID and let writes that originated at a higher-numbered replica take precedence over writes that originated at a lower-numbered replica. This is prone to data loss too.
    • Custom conflict resolution logic: as the most appropriate way of resolving a conflict may depend on the application, most multi-leader replication tools let you write conflict resolution logic in application code. This code may run on write or on read:
    • On write: as soon as the database system detects a conflict in the log of replicated changes, it calls the conflict handler. The handler typically cannot prompt a user: it runs as a background process and must execute quickly.
    • On read: when a conflict is detected, all the conflicting writes are stored. The next time the data is read, the multiple versions are returned to the application. The application may prompt the user or automatically resolve the conflict, and write the result back to the database.
    • Automatic conflict resolution: there has been some interesting research into automatically resolving conflicts caused by concurrent data modification. A few lines of research are worth mentioning: conflict-free replicated data types (CRDTs), mergeable persistent data structures, and operational transformation.
  • 24. Leaderless replication Leaderless Replication is ideal for systems that: • Require high availability and low latency • Can tolerate occasional stale reads (Eventual Consistency) In both single-leader and multi-leader replication models, the leader determines the order in which the writes are processed, and followers apply the leader's writes in the same order. In a leaderless replication model, the client sends its writes simultaneously to several replicas or to a coordinator node that does this on the behalf of the client, without enforcing any ordering on the writes. One of the main challenges with leaderless replication model is the case wherein a client attempts to read data from a node that didn't yet process the latest write. To avoid this situation: • The client can version every write message that it sends out and sends read requests to several nodes in parallel. The client then uses the version numbers to determine which value is the latest. • To repair replicas that were unavailable and now back online back in sync, there are two mechanisms: • Read repair: When a client makes a read from several nodes in parallel, it can detect any stale responses and update those specific records on that replica with the up-to-date value. • Anti-entropy process: A background process could constantly look for differences in the data between replicas and copy any missing data from one replica to another. It is worth noting that the copies don't maintain the write order similar to leader-based replication models. Additionally, it might take a while before the detection and catch up happens eventually. As you can see, any process to bring data consistency has an impact on read/write throughput.
  • 25. Leaderless replication and write conflicts
    Another challenge with leaderless replication is detecting and handling concurrent writes, because the design allows multiple clients to write to the same key. Similar to multi-leader replication, write conflicts arise:
    • When writes from the same client arrive in a different order at different nodes
    • When writes from different clients happen on the same key and arrive in a different order at two nodes
    • When some writes are lost and never arrive at some nodes
    There are a few strategies that can be used to handle concurrent writes:
    • Last writer wins discards concurrent writes. At a high level, clients timestamp their writes and nodes overwrite values that carry lower timestamps. This model cannot handle lost messages or resolve conflicts across multiple messages.
    • Merging concurrent writes (see the sketch after this slide):
    • The server maintains a version number for every key, increments the version number every time the key is written, and stores the new version number along with the value written.
    • When a client reads a key, the server returns all values that have not been overwritten, as well as the latest version number. A client must read a key before writing it.
    • When a client writes a key, it must include the version number from the prior read, and it must merge together all the values that it received in that prior read.
    • When the server receives a write with a particular version number, it can overwrite all values with that version number or below, but it must keep all values with a higher version number (those are concurrent with the incoming write).
    • This algorithm works nicely when inserting new keys or updating existing keys. For deletes, clients still need to leave tombstones along with the version number to help with merging the results.
    • The algorithm described above is for a single node. When multiple replicas accept concurrent writes, we need a version number per replica for every key. The collection of version numbers from all the replicas is called a version vector. Like version numbers, version vectors are sent from the database replicas to the clients when values are read and must be sent back to the database when a value is subsequently written.
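A single-node sketch of the version-number algorithm described above (the shopping-cart style usage is illustrative): the server keeps the "siblings" that have not been causally overwritten, and the client merges them before writing back.

class VersionedStore:
    """Single-node sketch of the 'merge concurrent writes' algorithm: per key, the
    server keeps a version counter and the values (siblings) not yet causally overwritten."""
    def __init__(self):
        self.store = {}                       # key -> (version, {write_version: value})

    def read(self, key):
        version, siblings = self.store.get(key, (0, {}))
        return version, list(siblings.values())

    def write(self, key, value, based_on_version):
        version, siblings = self.store.get(key, (0, {}))
        new_version = version + 1
        # Values written at or below the version the client read are superseded
        # (the client merged them into `value`); concurrent ones are kept as siblings.
        kept = {v: val for v, val in siblings.items() if v > based_on_version}
        kept[new_version] = value
        self.store[key] = (new_version, kept)
        return new_version

cart = VersionedStore()
cart.write("cart", ["milk"], 0)                            # client A, had not read anything
cart.write("cart", ["eggs"], 0)                            # client B, also unaware of A
print(cart.read("cart"))                                   # (2, [['milk'], ['eggs']]): siblings
version, siblings = cart.read("cart")
merged = sorted({item for s in siblings for item in s})    # client-side merge
cart.write("cart", merged, version)
print(cart.read("cart"))                                   # (3, [['eggs', 'milk']])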
  • 26. Leaderless replication and consistency
    Quorums help maintain read/write consistency in the leaderless replication model. When there are n replicas:
    • A write is deemed successful when it is confirmed by w replicas
    • A read is consistent when it is answered by r replicas
    • where w + r > n
    The quorum condition, w + r > n, allows the system to tolerate unavailable nodes as follows:
    • If w < n, we can still process writes if a node is unavailable
    • If r < n, we can still process reads if a node is unavailable
    • If n = 5, w = 3, r = 3, we can tolerate two unavailable nodes.
    Normally, reads and writes are sent to all n replicas in parallel; the parameters w and r determine how many nodes we wait for before we consider a read or write successful.
    In a large cluster, a client may be able to connect to some database nodes but not assemble a quorum, due to a network interruption. In such cases, database designers might choose to accept writes instead of returning errors to the application. In sloppy quorums, writes and reads still require w and r successful responses, but those may include nodes that are not among the designated n "home" nodes for a value. Once the network interruption is fixed, any writes that one node temporarily accepted on behalf of another node are sent to the appropriate "home" nodes. This is called hinted handoff. Sloppy quorums have the following attributes:
    • They increase write availability at the cost of read consistency, because when a client reads a value, the latest value might have been temporarily written to nodes outside of the n home nodes.
    • Sloppy quorums offer durability, but they aren't quorums in the traditional sense.
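A toy sketch of strict quorums with n = 5, w = 3, r = 3: a write is acknowledged by w replicas, a read queries r replicas and takes the value with the highest version, and the overlap guaranteed by w + r > n ensures the latest value is seen. (This ignores sloppy quorums and read repair.)

import random

N, W, R = 5, 3, 3                      # w + r > n, so read and write sets must overlap

replicas = [dict() for _ in range(N)]  # each replica: key -> (version, value)

def quorum_write(key, value, version):
    acks = 0
    for rep in random.sample(replicas, N):        # simulate: only w replicas acknowledge
        rep[key] = (version, value)
        acks += 1
        if acks >= W:
            break                                  # remaining replicas miss this write
    return acks >= W

def quorum_read(key):
    responses = [rep[key] for rep in random.sample(replicas, R) if key in rep]
    return max(responses)[1] if responses else None   # highest version wins

quorum_write("x", "v1", version=1)
quorum_write("x", "v2", version=2)
print(quorum_read("x"))                # always 'v2': the read set overlaps the write set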
  • 27. Partitioning
    The main reason for partitioning data is scalability. Partitions are defined so that each piece of data belongs to exactly one partition; the net effect is that each partition is a small database of its own, although the database may support operations that touch multiple partitions at the same time.
  • 28. Partitioning of key-value data
    One situation that any partitioning scheme has to deal with is skewed workloads and hotspots. One strategy, when a lot of writes happen to the same key, is to append a 2-digit random number to the key to split the load across 100 partitions. This does make reading results for that key more challenging, since it requires piecing data together from all the involved partitions.
    Partition by key range:
    • Data is partitioned by assigning a continuous range of keys to each partition.
    • When a certain range of keys receives more requests than others, it can result in hotspots. This can be mitigated by choosing key values that avoid such situations.
    • This method makes it easy to perform range queries, since the data within each partition is stored in sorted order.
    Partition by hash of a key:
    • A hash function determines the partition for a given key. The hash function should generate the same hash value for a given key and distribute keys evenly across partition boundaries.
    • Using a hash function largely eliminates the need for human or database intervention in key selection.
    • A disadvantage of this model is that we lose the ability to perform range queries. A compromise is to use compound primary keys consisting of several columns; for example, on a social media site the primary key for updates could be (user_id, update_timestamp), which allows you to retrieve updates made by one user for a range of timestamps.
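A small sketch contrasting the two schemes above; the boundary values and the use of MD5 are arbitrary choices for illustration.

import bisect
import hashlib

NUM_PARTITIONS = 4

# Key-range partitioning: continuous key ranges, good for range scans,
# but a popular range becomes a hotspot.
range_boundaries = ["g", "n", "t"]     # partition 0: < 'g', 1: 'g'-'m', 2: 'n'-'s', 3: >= 't'

def range_partition(key):
    return bisect.bisect_right(range_boundaries, key)

# Hash partitioning: spreads keys evenly, but adjacent keys land on different
# partitions, so range queries must hit every partition.
def hash_partition(key):
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

for key in ["alice", "bob", "carol", "zoe"]:
    print(key, "range ->", range_partition(key), "hash ->", hash_partition(key))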
  • 29. Secondary indexes Partitioning data on secondary indexes poses a challenge because it doesn't map neatly to partitions. There are two main approaches to partitioning a database with secondary indexes. When partitioning secondary indexes by document (Local Secondary Index), each partition is separate: • Each partition maintains its own secondary indexes, covering only the documents in that partition. • Writes to the database deal with only a single partition. • Reads will require a scatter/gather approach wherein a query on secondary index is sent to all the partitions and merged together; making it expensive. Partitioning secondary indexes by term (Global Secondary Index): • Instead of each partition having its own secondary index, this model constructs a global index that covers data in all partitions. The index itself is not stored in one node and is partitioned across multiple nodes to avoid a single node from being a bottleneck. • Read queries based on the secondary index are efficient since only a single node is ever touched. • Writes are slower with this model compared to local index. Additionally, since updating secondary indexes are asynchronous, clients should be prepared to handle latency in updates.
  • 30. Rebalancing partitions
    Motivation for rebalancing:
    • Query throughput increases over time
    • The dataset grows larger
    • Machines fail
    Requirements for rebalancing:
    • After rebalancing, the load is evenly balanced across the partitions
    • The database keeps accepting reads and writes while rebalancing is in progress
    • Rebalancing is efficient and doesn't move more data between nodes than necessary
    Strategies for rebalancing:
    • Do not use hash mod N to assign a key to a node: this moves data across almost all the nodes whenever the number of nodes (N) changes.
    • Fixed number of partitions (see the sketch after this slide):
    • Start by assigning partitions to a set of nodes.
    • When a node gets too much traffic, move some of its partitions to a new node.
    • Choosing the right number of partitions is difficult when your dataset size changes over time: too many partitions cause unnecessary overhead, while too few make rebalancing expensive. The size of each partition grows in proportion to the dataset.
    • Dynamic partitioning:
    • When a partition grows beyond a certain size, it is split into two of similar size; conversely, two small partitions can be merged.
    • This can be used with both key-range and hash-partitioned data.
    • The size of each partition stays between a fixed minimum and maximum.
    • Partitioning proportionally to nodes:
    • The number of partitions is proportional to the number of nodes, i.e., a fixed number of partitions per node.
    • When a new node joins the cluster, it splits partitions held by other nodes and takes over part of them.
    • This method works with hash-partitioned data.
    Rebalancing initiation:
    • Manual: safer, but can cause delays.
    • Automatic: faster, but can have unintended consequences, like overloading the network or nodes due to premature initiation.
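An illustrative sketch of the "fixed number of partitions" strategy: keys always hash to the same partition, and rebalancing only reassigns whole partitions, with a new node stealing partitions until it holds roughly a fair share. The assignment logic here is deliberately simplistic.

from collections import Counter
import hashlib

NUM_PARTITIONS = 16          # fixed up front, much larger than the expected node count

def partition_of(key):
    # Keys always map to the same partition; only whole partitions move between nodes.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_PARTITIONS

def initial_assignment(nodes):
    return {p: nodes[p % len(nodes)] for p in range(NUM_PARTITIONS)}

def add_node(assignment, new_node):
    """Rebalance by letting the new node steal partitions until it holds a fair share."""
    target = NUM_PARTITIONS // (len(set(assignment.values())) + 1)
    counts = Counter(assignment.values())
    result, stolen = dict(assignment), 0
    for p, node in assignment.items():
        if stolen == target:
            break
        if counts[node] > target:
            result[p], counts[node], stolen = new_node, counts[node] - 1, stolen + 1
    return result

old = initial_assignment(["node-a", "node-b", "node-c"])
new = add_node(old, "node-d")
moved = sum(1 for p in old if old[p] != new[p])
print(f"{moved} of {NUM_PARTITIONS} partitions moved")           # only a handful move
print("key 'user:42' -> partition", partition_of("user:42"),
      "->", new[partition_of("user:42")])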
  • 31. Request routing
    • Send all requests from clients to a routing tier first: an external coordination service like ZooKeeper uses a consensus algorithm to maintain an authoritative mapping of partitions to nodes.
    • Allow clients to contact any node: a gossip protocol can be used for nodes to share the partition assignments with each other. This adds complexity to the nodes, but eliminates the need for an external coordination service.
    • Require that clients be aware of the partition assignments: in this model, the complexity is moved to the client.
  • 33. ACID Transactions The safety guarantees offered by transactions are often described by the well-known acronym ACID, which stands for Atomicity, Consistency, Isolation, and Durability. • Atomicity describes what happens if a client wants to make several writes, but a fault occurs after some of the writes. If the writes are grouped together into an atomic transaction, and the transaction cannot be completed due to a fault, then the transaction is aborted, and the database must discard/undo the writes done so far. • The idea of consistency is that you have certain statements about your data (invariants) that must always be true (ex: in an accounting system, credits and debits must always be balanced). However, consistency is dependent on the application's invariants and it is application's responsibility to preserve consistency. • Isolation in the sense of ACID means that concurrently executing transactions are isolated from each other: they don't stomp over each other. Database textbooks talk about serializability, although it is rarely implemented. Snapshot isolation and other weaker isolation levels are what databases implement. • Durability is the promise that once a transaction has committed successfully, any data it has written will not be forgotten, even if there is hardware fault or the database crashes. Durability "guarantees" should always be taken with a grain of salt. We will focus on atomicity and isolation since that is where databases have the most impact on transactions. Consistency is mostly application dependent, while durability is mostly an invariant across different database implementations.
  • 34. Atomicity and isolation
    Atomicity and isolation describe what the database should do if a client makes several writes within the same transaction.
    • Atomicity and isolation are applicable to single-object writes too. A couple of examples:
    • What happens when a network connection is lost after the first 10KB of a JSON fragment has been sent?
    • What happens when a client attempts to read a record halfway through a write to a large record?
    • We need multi-object transactions for the following reasons:
    • In relational models, we want foreign keys to reference valid data.
    • In a document data model where join functionality is lacking, denormalization is encouraged. In those cases, you still end up updating several documents and need to keep them consistent.
    • In databases with secondary indexes, you need to update the secondary indexes, which are different database objects from the database's point of view.
    • Atomicity is a key feature of transactions because it allows a transaction to be aborted and safely retried if an error occurs. If the database does not abandon a failed transaction but instead leaves it in a half-finished state, retrying is not easy for an application. Retrying is also problematic when:
    • The transaction succeeded, but a network failure prevented the server's acknowledgment from reaching the client.
    • The error is due to overload; retrying will make things worse.
    • The transaction has side effects outside of the database, like sending emails to clients.
    • Isolation hides concurrency issues from developers as clients try to update the same record at the same time. Serializable isolation means that the database guarantees that transactions have the same effect as if they had run serially. Serializable isolation has performance costs, so we will look at a couple of non-serializable isolation levels next.
  • 35. Read committed: a non-serializable isolation level
    Read committed is the most basic level of transaction isolation and offers two guarantees: no dirty reads and no dirty writes.
    No dirty reads: when reading from the database, you will only see data that has been committed. This avoids cases where:
    • Clients read partially updated data because not all of a transaction's objects have been committed yet.
    • Clients read data from transactions that might eventually be rolled back.
    No dirty writes: when writing to the database, you will only overwrite data that has been committed. This avoids cases where:
    • Multiple writes stomp over each other in such a way that the database ends up in an inconsistent state.
    Implementation notes:
    • Dirty writes are prevented by requiring that writing to an object acquires a lock that is held until the end of the transaction.
    • Dirty reads could be prevented by requiring reads to acquire the same lock, but that would be a performance bottleneck. Instead, the database keeps both the old and new values of every object updated by a pending transaction, and serves the old values to read requests until the transaction commits.
    Drawbacks:
    • Nonrepeatable reads, or read skew. For example, a transaction B may start and complete while another transaction A is in progress; different parts of transaction A then see views of the database that are inconsistent from A's standpoint. This is especially unacceptable for backups, analytic queries, and integrity checks.
  • 36. Snapshot Isolation: A Non-serializable isolation Snapshot isolation is the most common solution to nonrepeatable read problem. The idea is that each transaction sees a consistent snapshot of the database - that is, the transaction sees all the data that was committed in the database at the start of the transaction. Even if the data is subsequently changed by another transaction, each transaction sees only the old data from that point in time. Snapshot isolation is implemented by a technique known as Multi version concurrency control. Implementation notes: • Databases use a generalization of the mechanism we saw for preventing dirty reads, wherein the database keeps track of several different committed versions of an object because various in-progress transactions may need to see the state of the database at different points in time. Because it maintains several versions of an object side by side, this technique is referred to as MVCC. • When a transaction reads from the database, transaction IDs are used to decide which objects it can see, and which are invisible. By carefully defining visibility rules the database can present a consistent snapshot of the database to the application: • At the start of a transaction, the database makes a list of all the other transactions that are in progress at that time. Any writes that those transactions have made are ignored. • Any writes made by aborted transactions are ignored. • Any writes made by transactions with a later transaction ID are ignored, regardless of whether those transactions have committed. • All other writes are visible to the application's queries. • Indexes pose an interesting challenge for multi-version database. One option is to have the index simply point to all versions of an object and require an index query to filter out any object versions that are not visible.
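A compact sketch of the visibility rules listed above, using a monotonically increasing transaction ID as both identity and snapshot marker; this ignores garbage collection of old versions and many other details, and the names are illustrative.

class MVCCStore:
    """Sketch of snapshot-isolation visibility rules."""
    def __init__(self):
        self.versions = {}            # key -> list of (created_by_txid, value)
        self.committed = set()
        self.aborted = set()
        self.next_txid = 1

    def begin(self):
        txid = self.next_txid
        self.next_txid += 1
        # Snapshot: transactions still in progress at this moment stay invisible.
        in_progress = {t for t in range(1, txid)
                       if t not in self.committed and t not in self.aborted}
        return {"txid": txid, "in_progress": in_progress}

    def write(self, tx, key, value):
        self.versions.setdefault(key, []).append((tx["txid"], value))

    def commit(self, tx):
        self.committed.add(tx["txid"])

    def read(self, tx, key):
        for creator, value in reversed(self.versions.get(key, [])):
            invisible = (creator in tx["in_progress"]            # concurrent at snapshot time
                         or creator in self.aborted               # rolled back
                         or creator > tx["txid"]                  # started later
                         or (creator not in self.committed and creator != tx["txid"]))
            if not invisible:
                return value
        return None

db = MVCCStore()
t1 = db.begin(); db.write(t1, "balance", 100); db.commit(t1)
t2 = db.begin()                        # t2's snapshot includes t1
t3 = db.begin(); db.write(t3, "balance", 42); db.commit(t3)
print(db.read(t2, "balance"))          # 100: t3 committed after t2's snapshot
print(db.read(db.begin(), "balance"))  # 42: a new transaction sees t3's write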
  • 37. Lost updates in snapshot isolation Snapshot isolation works well for read only transactions in the presence of concurrent writes. However, lost updates can happen when a write by one client can be overwritten by a write from another client unintentionally. There are a variety of solutions to this problem: • Many databases provide atomic update operations that eliminate the need for read-modify-update cycles in application code that can cause consistency issues when writes from multiple clients stomp on each other. • Applications can explicitly lock objects that are going to be updated. This is useful when the database itself cannot offer a similar capability. This implementation is prone to bugs in application code. • The former two approaches avoid the issues with read-modify-write cycles by forcing sequential execution. Another approach is for the database to automatically detect a lost update to abort the transaction and force the client to retry its read-modify-update cycle. • In databases that don't provide transactions, you sometimes find an atomic compare-and-set operation. The purpose of this operation is to avoid lost updates by allowing an update to happen only if the value has not changed since you last read. • In replicated databases, preventing lost updates takes another dimension. In those datastores, the typical approach is to let concurrent writes through and let the application code perform conflict resolution.
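A minimal sketch of a compare-and-set operation protecting a read-modify-write cycle (the Record class and its lock are illustrative; a real database performs the comparison atomically on the server):

import threading

class Record:
    """Compare-and-set: the update succeeds only if the value is unchanged since it
    was read, which prevents lost updates without a full transaction."""
    def __init__(self, value):
        self.value = value
        self._lock = threading.Lock()      # makes the compare + set one atomic step

    def compare_and_set(self, expected, new_value):
        with self._lock:
            if self.value != expected:
                return False                # someone else updated it; caller must retry
            self.value = new_value
            return True

def increment_with_retry(record):
    while True:
        current = record.value              # read
        if record.compare_and_set(current, current + 1):   # write only if unchanged
            return

counter = Record(0)
threads = [threading.Thread(target=lambda: [increment_with_retry(counter) for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter.value)    # 4000: no lost updates despite concurrent read-modify-write cycles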
  • 38. Write skew and phantoms in snapshot isolation
    Most examples of write skew follow this pattern, which results in phantoms:
    1. A SELECT query checks whether some requirement is satisfied by searching for rows that match a search condition.
    2. Depending on the result of the first query, the application code decides how to continue.
    3. If the application decides to go ahead, it makes a write (INSERT, UPDATE, or DELETE) to the database and commits the transaction. The effect of this write changes the precondition of the decision in step 2; i.e., if you repeated the query from step 1 after committing the write, you would get a different result.
    Techniques to avoid write skew:
    • Some databases allow you to configure constraints, which are then enforced by the database (uniqueness, foreign key constraints, or restrictions on a particular value).
    • Explicit locking (SELECT ... FOR UPDATE) is another means of avoiding write skew.
    Phantoms:
    The effect where a write in one transaction changes the result of a search query in another transaction is referred to as a phantom. The problem with phantoms is that in step 1 there may be no object to attach a lock to. This is addressed by materializing conflicts, wherein rows are introduced into the database artificially and locks are attached to those rows before proceeding with adding/updating the real objects.
  • 39. Serial execution: a serializable isolation level
    The simplest way to avoid concurrency problems is to remove concurrency entirely and execute one transaction at a time, i.e., actual serial execution. Why is this an option now when it wasn't before?
    • RAM has become cheap enough that for many workloads the entire dataset fits in memory, so transactions complete much faster than when portions of the database lived on disk.
    • Database designers have realized that OLTP transactions are usually short and make few reads/writes, unlike OLAP.
    How to handle multi-statement transactions? While the operations involved in a transaction are few, if they are separated by long periods of inactivity because they are initiated by a client across the internet, serializing transactions would hurt performance. Hence, such databases allow multi-statement transactions only as stored procedures.
    What are the disadvantages of stored procedures?
    • Each database vendor has its own language for stored procedures. This is changing, with more recent support for stored procedures written in general-purpose programming languages.
    • Code running in a database is hard to debug and manage.
    • A badly written stored procedure can harm the database far more than a badly written statement issued by an application.
    • Partitioned databases pose a challenge for this model. You are fine if you can partition the data such that no locking is needed across partitions; this increases the database throughput. However, any transaction that spans multiple partitions requires the stored procedures to run in lock-step across those partitions, and the throughput of the database drops significantly.
  • 40. Two Phase Locking: A serializable isolation Overview For around 30 years, one widely used algorithm for serializability in databases is two-phase locking (2PL). In 2PL, several transactions can read objects concurrently as long as none write to it. As soon as a transaction wants to write an object, exclusive access is required: • If transaction A read an object and transaction B wants to write to the object, then B waits until A commits or aborts the transaction. • If transaction A has written an object and transaction B wants to read the object, B needs to wait until A commits or aborts. Performance While snapshot isolation promises that readers never block writers and writers never block readers, 2PL provides serializability and avoids a lot of concurrency issues with snapshot isolation by breaking that promise. This also results in degraded performance at especially higher percentiles. Phantoms and 2PL Note that the problem of phantoms exists even with 2PL - that is one transaction changing the results of another transaction's search query. This problem is solved with predicate locks wherein instead of getting a lock for a single object, the lock belongs to multiple objects that match a search criteria. Performance of predicate locks Implementing predicate locks is not performant when checking locks across multiple active transactions. Most databases implement predicate locks using index range locks. Holding an index range lock is like holding a lock to a range of objects. This optimization reduces the number of locks that need to be checked before allowing a concurrent transaction to proceed.
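A toy lock manager sketching the shared/exclusive rules of 2PL described above (deadlock detection, lock upgrades, and predicate/index-range locks are omitted; the class and method names are invented for illustration):

import threading
from collections import defaultdict

class TwoPhaseLockManager:
    """Sketch of 2PL lock modes: many readers may hold a shared lock, a writer needs
    an exclusive lock, and all locks are held until the transaction commits or aborts."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = defaultdict(set)    # object -> transaction ids holding shared locks
        self._writer = {}                   # object -> transaction id holding exclusive lock
        self._held = defaultdict(set)       # txid -> objects it has locked

    def lock_shared(self, txid, obj):
        with self._cond:
            while obj in self._writer and self._writer[obj] != txid:
                self._cond.wait()           # a writer holds it: block until commit/abort
            self._readers[obj].add(txid)
            self._held[txid].add(obj)

    def lock_exclusive(self, txid, obj):
        with self._cond:
            while ((obj in self._writer and self._writer[obj] != txid)
                   or self._readers[obj] - {txid}):
                self._cond.wait()           # other readers or a writer hold it: block
            self._writer[obj] = txid
            self._held[txid].add(obj)

    def release_all(self, txid):            # called at commit or abort (end of phase two)
        with self._cond:
            for obj in self._held.pop(txid, set()):
                self._readers[obj].discard(txid)
                if self._writer.get(obj) == txid:
                    del self._writer[obj]
            self._cond.notify_all()

mgr = TwoPhaseLockManager()
mgr.lock_shared("T1", "row:7")
mgr.lock_shared("T2", "row:7")     # shared locks coexist
mgr.release_all("T1"); mgr.release_all("T2")
mgr.lock_exclusive("T3", "row:7")  # would block if T1 or T2 still held their locks
mgr.release_all("T3")
print("done")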
  • 41. Serializable snapshot isolation: A serializable isolation Overview Unlike 2PL and serial execution, serializable snapshot isolation is an optimistic concurrency control technique. i.e., instead of blocking if something potentially dangerous happens, transactions continue anyway. When a transaction wants to commit, the database checks whether isolation was violated; if so, the transaction is aborted and must be retried. Advantages • Optimistic concurrency performs well when there is less contention and badly when there is a lot. • Contention can be reduced with commutative atomic operations. Ex: it doesn't matter the order in which you increment the like counter on a tweet. SSI, an improvement on snapshot isolation SSI is basically snapshot isolation wherein reads within a transaction happen from a consistent snapshot of the database. This is the main difference with earlier concurrency control techniques. On top of snapshot isolation, SSI adds an algorithm for detecting serialization conflicts among writes to determine which transactions to abort. The detection needs to identify the following scenarios • Detecting reads of a stale MVCC version of an object (uncommitted write occurred before the read) - In order to accomplish this, the database keeps track of whether any read ignored another transaction’s write due to MVCC rules. When a transaction wants to commit, the database checks whether any of the ignored writes have now been committed. If so, the transaction must be aborted. • Detecting writes that affect prior reads - To accomplish this, the database notifies each transaction of any writes made by writes from other transactions. When a transaction proceeds to commit, it checks if any conflicting writes from other transactions have been committed; if so, the transaction is aborted. If not, the transaction is committed. Performance of serializable snapshot isolation • SSI is better than 2PL for read heavy workloads because they can run without needing any locks. • SSI is better than serial execution from the standpoint of not being limited to single CPU
  • 43. Linearizability
    The idea behind linearizability is to eliminate the replication lag from the client's point of view by:
    • Making all operations atomic
    • Making it appear as if there is only one copy of the data
    In a linearizable system, as soon as one client completes a write, all clients reading from the database see the new value; i.e., linearizability is a recency guarantee.
    Serializability vs linearizability:
    • Serializability is an isolation property of transactions, where every transaction may read and write multiple objects. It guarantees that transactions behave the same as if they had executed in some serial order.
    • Linearizability is a recency guarantee on reads and writes of a single object (a register). It doesn't group operations together into transactions, so it does not prevent problems such as write skew.
    • Implementations based on 2PL or actual serial execution are typically linearizable as well. Serializable snapshot isolation (SSI) is serializable but not linearizable, since reads are served from a consistent snapshot and writes from other transactions are not visible until they commit.
• 44. Linearizable systems
Usages of linearizable systems
• Locking and leader election - single-leader replication systems need to ensure that there is only one leader. This can be accomplished by requiring a node to acquire a lock in order to become leader; that lock must be linearizable.
• Constraints and uniqueness guarantees - unique usernames, the need for an account balance to never go negative, booking a seat on a flight: all require a single, up-to-date value that all the nodes agree on.
• Cross-channel timing dependencies - without the recency guarantee of linearizability, race conditions can cause problems (see figure 9-5).
Implementing linearizable systems
Since linearizable systems behave as if there is only one copy of the data, one option is to not have replicas at all. But the reason we have replicas is fault tolerance. So let's look at the replication methods and assess whether they are linearizable.
• Single-leader replication: the leader has the primary copy used for writes and the followers maintain backup copies. You can potentially make this model linearizable with one of two options:
• Forcing reads from the leader alone. This assumes that you know for sure who the leader is (i.e., you can avoid split brain). Also, with async replication, failover may lose committed writes, which violates both durability and linearizability.
• Allowing reads from synchronously updated followers only.
• Consensus algorithms: consensus algorithms resemble the single-leader model but include measures to prevent split brain and stale replicas. Hence, they can be used to build linearizable systems.
• Multi-leader replication: systems with multi-leader replication are generally not linearizable because they concurrently process writes on multiple nodes and asynchronously replicate them to other nodes.
• Leaderless replication: leaderless replication offers consistency through quorums, but replication lag can still cause readers to see different values (see figure 9-6). It is possible to address this by performing read repair synchronously, but several caveats still apply (see page 335 for the full discussion).
The cost of linearizability
• If your application requires linearizability and a network partition occurs, then replicas that are cut off from the others must either wait until the partition heals or return an error. If your application does not require linearizability, each replica can process requests independently even when disconnected from other replicas. i.e., a system can be either consistent or available when partitioned, not both.
• Distributed databases that choose not to provide linearizability guarantees do so primarily to increase performance, not so much for fault tolerance. Linearizability is slow all the time, not just during a network fault.
• 45. Ordering and causality
Ordering has come up so far in the following contexts:
• In single-leader replication, the order of the leader's writes determines the order in which followers apply writes.
• Serializability is about ensuring that transactions behave as if they were executed in some sequential order.
• Timestamps and clocks are used to introduce order into a disorderly world.
Ordering helps preserve causality. A few examples:
• Consistent prefix reads preserve the causal dependency between a question and its answer.
• Multi-leader replication can violate causality.
• When detecting concurrent writes, we need the "happened before" relationship.
• Snapshot isolation provides causal consistency: a transaction reads from a snapshot that is consistent with causality.
• Serializable snapshot isolation identifies causal dependencies across transactions.
• Cross-channel timing dependencies can result in causal inconsistencies.
Definitions
• A total order allows any two elements to be compared; i.e., given any two elements you can always say which one is greater.
• A set of elements is partially ordered if some pairs of elements can be compared while others cannot.
• Causality defines a partial order, not a total order, because concurrent events are incomparable.
• Linearizability guarantees a total order because the system behaves as if there is only one copy of the data. i.e., linearizability is stronger than causal consistency. You can, however, build a system that is causally consistent without incurring the performance penalty of linearizability.
Capturing causal dependencies
In order to maintain causality, you need to know which operation happened before another. This is a partial order: concurrent operations may be processed in any order, but if one operation happened before another, they must be processed in that order on every replica. This has come up in two places so far:
• Concurrent writes to the same key need to be detected so that updates are not lost. Causal consistency goes further - it needs to track causal dependencies across the entire database, not just for a single key. Version vectors can be generalized to do this.
• Serializable snapshot isolation tracks version numbers to detect when one operation is causally dependent on another.
• 46. Sequence number ordering
We can use sequence numbers, counters that are incremented for every operation, to create a total order that is consistent with causality across all the operations performed in a database. In a database with single-leader replication, the replication log defines a total order of write operations that is consistent with causality.
Lamport timestamps
While the aforementioned strategy works for the single-leader replication model, it doesn't work for multi-leader or leaderless replication models where there is no single entity generating the sequence numbers. The key characteristics of Lamport timestamps that make them consistent with causality are (see the sketch after this slide):
• Unique timestamps: a Lamport timestamp is simply a pair of (counter, node ID). Two nodes may have the same counter value, but including the node ID makes each timestamp unique.
• Ordering across timestamps: of two timestamps, the one with the greater counter is the greater timestamp; when counters are equal, the one with the greater node ID wins.
• Consistency with causality:
• Every client keeps track of the maximum counter value it has seen so far and includes it in every request.
• When a node receives a request with a maximum counter value greater than its own, it immediately advances its counter to that maximum. In addition, the node increments its counter for every operation it performs.
• This ensures causality because every causal dependency results in an increased timestamp.
Limitation of timestamp ordering
Timestamp ordering alone is not sufficient. Consider a system that needs to ensure that usernames are unique, and two users concurrently try to create the same username. The node handling each request needs to decide right now whether to accept it; it is not enough to find out afterwards which request had the lower timestamp. i.e., you need to be sure that no other node can insert a claim for the same username ahead of your operation in the total order before you can declare the operation successful.
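A minimal sketch of Lamport timestamps as described above; the class and method names (LamportClock, tick, observe) are assumptions for illustration. For example, a node whose counter is 1 that receives a request carrying a maximum counter of 5 jumps to 5 and stamps its next operation (6, its node ID).

# Illustrative Lamport clock: timestamps are (counter, node_id) pairs.
class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def observe(self, seen_counter):
        # Requests carry the maximum counter the client has seen; the node
        # fast-forwards its own counter so causality is never violated.
        self.counter = max(self.counter, seen_counter)

    def tick(self):
        # Every operation increments the counter and gets a unique timestamp.
        self.counter += 1
        return (self.counter, self.node_id)

# Tuple comparison gives the total order: the greater counter wins,
# and the node id breaks ties between equal counters.
a = LamportClock("A"); b = LamportClock("B")
b.observe(5)                 # a request arrives carrying max counter 5
assert b.tick() == (6, "B")
assert a.tick() < b.tick()   # (1, "A") < (7, "B")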
• 47. Total order broadcast
• Total order broadcast is a protocol for exchanging messages between nodes that requires two safety properties to always be satisfied:
• Reliable delivery - no messages are lost: if a message is delivered to one node, it is delivered to all nodes.
• Totally ordered delivery - messages are delivered to every node in the same order.
• One way to look at total order broadcast is that it is a way of creating a log: delivering a message is like appending to the log. Since all nodes must deliver the same messages in the same order, all nodes can read the log and see the same sequence of messages.
• Since messages cannot be inserted retroactively into an earlier position in the order, total order broadcast is stronger than timestamp ordering.
Use cases
• Consensus services such as ZooKeeper and etcd implement total order broadcast.
• It is leveraged by database replication to ensure that every replica processes the writes in the same order; this principle is known as state machine replication.
• It can be used to implement serializable transactions, as discussed under actual serial execution, if every message represents a deterministic transaction to be executed.
• 48. Atomic commit and two-phase commit
Single-node vs distributed atomic commits: for transactions that execute at a single database node, atomicity is implemented by the storage engine. On a single node, commit depends on the order in which data is durably written to disk: first the data, then the commit record. Once the commit record is written, the transaction is committed.
For transactions involving multiple nodes, it is not sufficient to send a commit request to all the nodes and independently commit the transaction on each one. If you do, the commit might succeed on some nodes and fail on others, which violates the atomicity guarantee (a commit, once made, is irrevocable).
Two-phase commit (2PC) uses a coordinator, which typically runs within the application process that initiates the distributed transaction. The system of promises in 2PC looks like this (sketched below):
• The application obtains a transaction ID from the coordinator.
• The application begins a transaction on each of the participating nodes using that transaction ID.
• When the application is ready to commit, the coordinator sends a prepare request to all the participants. If any participant does not respond, the transaction is aborted on all participants.
• When a participant receives the prepare request (phase 1), it makes sure that it can commit the transaction under all circumstances. If so, it replies "yes" to the coordinator.
• When the coordinator has received all responses, it decides whether to commit or abort the transaction and writes that decision to its transaction log on disk. This is the commit point.
• Once the coordinator's decision is written to disk, it sends the commit or abort request (phase 2) to all the participants. If a request fails or times out, the coordinator must retry forever, no matter how many retries it takes.
• Coordinator failure - a coordinator failure before the prepare request is simple to handle, because a participant can safely abort the transaction after a timeout. However, if the coordinator crashes after the prepare request, the participants must wait until the coordinator comes back up. When it recovers, the coordinator reads its transaction log to determine the status of all in-doubt transactions and aborts any that don't have a commit record.
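The coordinator's side of the protocol can be sketched as below. The participant prepare/commit/abort calls stand in for network RPCs and are assumptions for illustration, not a real driver API; a production coordinator must also recover in-doubt transactions from its log after a crash.

# Simplified sketch of a 2PC coordinator (participant methods are hypothetical RPC stubs).
def two_phase_commit(coordinator_log, participants, txn_id):
    # Phase 1: ask every participant whether it can promise to commit.
    votes = []
    for p in participants:
        try:
            votes.append(p.prepare(txn_id))   # participant answers "yes" or "no"
        except TimeoutError:
            votes.append("no")                # no response is treated as a "no"

    decision = "commit" if all(v == "yes" for v in votes) else "abort"

    # Commit point: the decision becomes binding once it is durably logged.
    coordinator_log.append((txn_id, decision))

    # Phase 2: deliver the decision; keep retrying until every participant acknowledges.
    for p in participants:
        while True:
            try:
                if decision == "commit":
                    p.commit(txn_id)
                else:
                    p.abort(txn_id)
                break
            except TimeoutError:
                continue                      # retry forever, as the protocol requires
    return decision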
• 49. Fault tolerant consensus
The goal of consensus is simply to get several nodes to agree on something. There are several situations in which it is important for nodes to agree:
• Leader election - in a database with a single leader, it is important for all nodes to agree on which node is the leader. Leadership can become contested when some nodes cannot communicate with others. In those situations, without consensus the system might suffer from split brain, with multiple leaders accepting writes and the data becoming inconsistent.
• Atomic commit - in a database that supports transactions spanning several nodes, a transaction might succeed on some nodes and fail on others. If we want atomicity of transactions, we must get all the nodes to agree on the outcome of the transaction.
A fault-tolerant consensus algorithm does not block as long as a majority of processes are working. The requirements for consensus are:
• Validity – only proposed values may be decided
• Uniform agreement – no two nodes may decide different values
• Integrity – a node may decide only a single value (no node decides twice)
• Termination – every node that does not crash eventually decides on a value
Paxos
Paxos is one of the best-known fault-tolerant consensus algorithms.
• Consensus algorithms decide on a sequence of values, which makes them total order broadcast algorithms.
Leaders and consensus algorithms
Consensus algorithms need a leader, yet agreeing on who the leader is itself requires consensus. This circular dependency is broken with epoch numbering and quorums: before any proposal is voted on, nodes vote for a leader within a given epoch number, and a leader elected in this manner can then coordinate the consensus for that epoch.
Limitation
One limitation of consensus algorithms is that a majority of nodes must be available for the algorithm to make progress.
• 51. Batch processing with Unix tools
Chain of commands in Unix
cat /var/log/nginx/access.log | awk '{print $7}' | sort | uniq -c | sort -r -n | head -n 5
Custom program (Ruby)
counts = Hash.new(0)
File.open('/var/log/nginx/access.log') do |file|
  file.each do |line|
    url = line.split[6]
    counts[url] += 1
  end
end
top5 = counts.map { |url, count| [count, url] }.sort.reverse[0...5]
top5.each { |count, url| puts "#{count} #{url}" }
Sorting versus in-memory aggregation
The custom program is better suited when the set of distinct URLs fits in memory. The Unix approach scales nicely to large files, because sort spills to disk when the data doesn't fit in memory.
The Unix Philosophy
Documented in The Art of Unix Programming and on Wikipedia as below:
• Make each program do one thing well.
• Expect the output of one program to become the input of another, as yet unknown, program.
• Design and build software, even operating systems, to be tried early, ideally within weeks. Don't hesitate to throw away clumsy parts and rebuild them.
• Use tools in preference to unskilled help to lighten a programming task, even if you have to detour to build the tools and expect to throw some of them out after you've finished using them.
Additional points
• A uniform interface - the file is the common interface that serves as input and output for any Unix program.
• Separation of logic and wiring - separating the input/output wiring from the program logic makes it easier to compose small tools into bigger systems.
• Transparency and experimentation - the success of Unix tools comes from the ease of experimenting with and understanding them.
• 52. MapReduce
Unix batch processing and MapReduce
• MapReduce is like Unix tools, but distributed across thousands of machines to parallelize the computation.
• A single MapReduce job is like a single Unix process: it takes one or more inputs and produces one or more outputs, without any side effects other than producing the output.
• While Unix tools use stdin and stdout, MapReduce jobs read and write files on a distributed filesystem.
MapReduce and job execution - MapReduce is a programming framework with which you can write code to process large datasets in a distributed filesystem like HDFS. The MapReduce job execution pattern for our earlier example is:
• Read a set of input files, and break them up into records. In the web server example, each record is one line in the log. This is handled by the framework's input format parser.
• Call the mapper function to extract a key and value from each input record. In the preceding example, awk plays the role of the mapper. This is a step for which you supply custom code.
• Sort all the key-value pairs by key. This is an implicit step in MapReduce; you don't write code for it.
• Call the reducer function to iterate over the sorted key-value pairs. In the previous example, uniq -c did this job. This is the other step for which you supply custom code.
MapReduce programming model
To create a MapReduce job, you implement two callback functions, the mapper and the reducer, which behave as follows (see the sketch after this slide):
• Mapper: the mapper is called once for every input record, and its job is to extract the key and value from the record. For each input record it may generate any number of key-value pairs. It keeps no state from one record to the next, so each record is handled independently.
• Reducer: the MapReduce framework takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the reducer with an iterator over that collection of values. The reducer can then produce output records.
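A minimal sketch of what the two callbacks could look like for the URL-counting example, together with the sort-and-group step the framework performs. The field position and function names are assumptions mirroring the earlier awk | sort | uniq -c pipeline, not any specific MapReduce API.

# Illustrative mapper and reducer for counting URLs in an nginx access log.
from itertools import groupby

def mapper(log_line):
    # Emit the requested URL (7th whitespace-separated field) with a count of 1.
    fields = log_line.split()
    if len(fields) > 6:
        yield fields[6], 1

def reducer(url, counts):
    # The framework hands the reducer all values for one key; we just sum them.
    yield url, sum(counts)

def run_job(lines):
    # Stand-in for the framework: sort mapper output by key, then group and reduce.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield from reducer(key, (count for _, count in group))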
• 53. Distributed execution of MapReduce
Running map tasks
• The MapReduce scheduler determines the number of map tasks from the number of input file blocks.
• It copies the mapper code to one of the machines that stores a replica of the input block and runs the map task there - this principle is referred to as putting the computation near the data.
• The map task reads the input file, passing one record at a time to the mapper callback. The output of the mapper is a set of key-value pairs.
• The map task writes its output sorted by key (using techniques similar to SSTables and LSM-Trees), partitioned so that it can be supplied to multiple reduce tasks.
Running reduce tasks
• The number of reduce tasks is configured by the job author.
• The MapReduce framework uses a hash of the key to ensure that records with the same key reach the same reducer.
• Whenever a mapper finishes, its output files are downloaded by the reduce tasks (this process is known as the shuffle).
• The MapReduce framework merges the sorted results from the multiple mappers.
• The reducer is called with a key and an iterator that incrementally scans over all the records with the same key. The output file generated by the reducer is stored on the reducer's local disk and replicated to other machines in the distributed filesystem.
MapReduce workflows
A single MapReduce job can rarely solve a real problem on its own, so it is common to chain jobs together into a workflow. Workflows of 50 to 100 chained jobs are not uncommon in large organizations, and rich tooling and ecosystems exist for setting up and managing such MapReduce workflows.
• 54. Joins in batch processing
In batch processing, a join means resolving all occurrences of some association within a dataset. In relational databases, associations are expressed through foreign keys; in document databases, through document references; and in graph databases, through edges. In the context of batch processing, the job reads the entire contents of all the files supplied to it; in that sense, batch processing is analogous to OLAP.
• 55. Reduce-Side Joins and Grouping
Sort-merge joins – an example approach (sketched below):
• A user-activity mapper emits activity records keyed by user ID.
• A user-database mapper emits user profile records keyed by user ID.
• The MapReduce framework partitions the mapper output by key and then sorts the key-value pairs, with the result that all records for the same user ID become adjacent in the reducer input. The job can additionally sort the records so that the reducer sees the record from the user database first, followed by the user activity records - a technique known as secondary sort.
• The reducer can then perform the actual join logic easily.
This approach is called a sort-merge join, since mapper output is sorted by key and the reducers merge the sorted lists of records from both sides of the join.
Group by
The GROUP BY clause in SQL supports aggregation tasks such as:
• Counting the number of records in each group
• Adding up the values in one particular field
• Picking the top k records according to some ranking function
The simplest way to implement such a grouping operation with MapReduce is to set up the mappers so that the key-value pairs they produce use the desired grouping key. The partitioning and sorting process then brings all the records with the same key together in the same reducer. Thus, grouping and joining look quite similar when implemented on top of MapReduce.
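A sketch of the reducer side of such a sort-merge join. It assumes the secondary sort described above has already placed each user's profile record ahead of that user's activity records; the record fields ("url", "age") are purely illustrative.

# Illustrative reduce-side join: profile record first, then activity events.
def join_reducer(user_id, records):
    records = iter(records)
    profile = next(records)                  # e.g. {"type": "user", "age": 31}
    for activity in records:                 # e.g. {"type": "view", "url": "/cart"}
        yield {
            "user_id": user_id,
            "viewed_url": activity["url"],
            "viewer_age": profile["age"],
        }

# Example input for one key, as the reducer would see it after the secondary sort:
records = [{"type": "user", "age": 31}, {"type": "view", "url": "/cart"}]
print(list(join_reducer("u42", records)))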
• 56. Handling skew
The pattern of bringing all records with the same key to the same place breaks down if there is a very large amount of data related to a single key - for example, a Twitter celebrity followed by millions of users. The problem with skewed data is that the reducer assigned to the hot key holds up the entire MapReduce job, since the job is not done until all of its reducers have finished. There are a few techniques for handling joins involving hot keys:
• In the skewed join method, a sampling job first identifies the hot keys. When performing the join, records relating to a hot key are sent to one of several reducers chosen at random. The other input to the join must then have its records for that hot key replicated to all the reducers handling the key.
• In the sharded join method, the hot keys are specified explicitly rather than determined by a sampling job.
• Another approach is to store records related to hot keys in separate files from the rest; when joining against that table, a map-side join can be used for the hot keys.
• When grouping records by a hot key and aggregating them, the grouping can be performed in two stages (sketched below):
• The first MapReduce job sends records to a random reducer, so that each reducer performs the grouping on a subset of records for the hot key and outputs a more compact aggregated value per key.
• A second MapReduce job then combines the values from all the first-stage reducers into a single value per key.
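A minimal sketch of the two-stage aggregation for a hot key, shown here for a simple count; the salting fan-out of 10 and the function names are assumptions for illustration.

# Illustrative two-stage aggregation: spread a hot key over several reducers, then combine.
import random

def stage1_mapper(record, fanout=10):
    # Add a random suffix so records for the same hot key land on different reducers.
    yield (record["key"], random.randrange(fanout)), 1

def stage1_reducer(salted_key, counts):
    key, _suffix = salted_key
    yield key, sum(counts)          # partial count for one (key, suffix) slice

def stage2_reducer(key, partial_counts):
    yield key, sum(partial_counts)  # combine the partial counts into the final total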
• 57. Map-side joins
The advantage of reduce-side joins is that they make no assumptions about the input data. The downside is that sorting, copying, and merging data between mappers and reducers can be expensive. If we can make certain simplifying assumptions about the data, we can eliminate some of these inefficiencies. There are a few ways to do so:
• Broadcast hash joins - the simplest map-side join applies when one side of the join is a small dataset that fits in memory. When each mapper starts up, it reads that small dataset from the distributed filesystem into an in-memory hash table (sketched below). The mapper then scans the user activity events from its input and joins each event with the user record looked up by user ID in the hash table. This is called a broadcast join because the small dataset is effectively broadcast to all the mappers.
• Partitioned hash joins - if both inputs to the map-side join are partitioned in the same way, the hash-join approach can be applied to each partition independently. For example, in figure 10-2, the user and activity logs could both be partitioned on the last digit of the user ID, so each mapper only needs to load a much smaller hash table. This works only if both inputs have the same number of partitions, with records assigned to partitions by the same key and the same hash function.
• Map-side merge joins - another variant applies if the inputs are not only partitioned in the same way but also sorted on the same key. In this case it does not matter whether the inputs fit in memory, because the mapper can perform the same merging operation that a reducer would normally do.
Some observations about map-side joins:
• The output of a reduce-side join is partitioned and sorted by the join key, whereas the output of a map-side join is partitioned and sorted in the same way as the large input.
• Map-side joins make assumptions about the size, sorting, and partitioning of their input data.
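A sketch of the broadcast hash join described in the first bullet above: the small user table is loaded into an in-memory dictionary when the mapper starts, and each activity record is joined locally. Tab-separated records and the function names are assumptions for illustration.

# Illustrative broadcast hash join mapper.
def make_join_mapper(small_user_table_lines):
    users = {}
    for line in small_user_table_lines:           # small side: assumed to fit in memory
        user_id, profile = line.rstrip("\n").split("\t", 1)
        users[user_id] = profile

    def mapper(activity_line):
        user_id, event = activity_line.rstrip("\n").split("\t", 1)
        if user_id in users:                      # local lookup replaces the shuffle
            yield user_id, (event, users[user_id])
    return mapper

mapper = make_join_mapper(["u1\tage=31", "u2\tage=25"])
print(list(mapper("u1\tviewed /cart")))           # [('u1', ('viewed /cart', 'age=31'))]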
• 58. Batch workflow output and Hadoop
Output of batch workflows
• Building search indexes.
• Key-value stores can be built as the output of a batch process; the generated files can then be loaded by databases that serve the results of the analytics performed by the batch jobs.
• Batch processing follows the same philosophy as Unix, which not only offers good performance but also makes systems much easier to maintain:
• If you introduce a bug into the code and the output is wrong, you can roll back to a previous version of the code and rerun the job.
• Feature development can proceed more quickly because of this ease of rollback.
• If a map or reduce task fails, the MapReduce framework automatically reschedules it and reruns it on the same input. If the failure is due to a bug in the code, the framework gives up after a specific number of retries; if it is a transient issue, the job is resilient to it.
• The same set of input files can be used for different jobs serving different purposes.
• Like Unix tools, MapReduce jobs separate logic from wiring, which makes it easy to reuse code.
Comparison of Hadoop and distributed databases
• Diversity of storage: while databases require you to structure data according to a particular model, files in a distributed filesystem are just byte sequences that can be written using any data model and encoding. This enables:
• Bringing together disparate data from multiple sources in order to process and join it.
• Putting the onus of making sense of the data on the consumer rather than the producer, which allows the data to be analyzed after it has been collected - as is almost always the case in practice.
• Diversity of processing models: Hadoop lets you write code for complex processing tasks that could not be expressed as the SQL queries a database supports.
• Designing for frequent faults: MapReduce tasks are written so that they can be rerun, since the scheduler may preempt and restart them. This makes MapReduce resilient to crashes and faults.
• Usage of memory and disk: MapReduce tasks run over large datasets with heavy disk I/O and relatively little memory, whereas analytic databases typically rely on keeping much of their working set in memory. Database queries tend to return within seconds, whereas batch jobs can take minutes or longer.
• 59. Beyond MapReduce
Materialization of intermediate state
In MapReduce, the intermediate state between jobs is written to files before being passed on to the next job - this is referred to as materialization of intermediate state. By contrast, Unix pipes stream the output of one process directly to the next. The downsides of the MapReduce approach are:
• A MapReduce job can only start when all tasks in the preceding jobs have completed.
• Mappers are often redundant: they just read back the same file that was just written by a reducer.
• Storing intermediate state in a distributed filesystem means those files are replicated across several nodes, which is overkill for temporary data.
Dataflow engines
Dataflow engines were created to overcome these downsides. Unlike MapReduce, the processing functions need not take the strict alternating roles of map and reduce, but can instead be assembled in more flexible ways. These functions are called operators, and the dataflow engine offers several options for connecting one operator's output to another's input. Fault tolerance is provided by the framework: it tracks how each piece of intermediate data was computed, so that lost state can be recomputed when a fault occurs.
Graphs and iterative processing
For graph-like data models, batch processing looks somewhat different, since many graph algorithms work by traversing one edge at a time and joining vertices until some condition is met. Such iterative algorithms often take the following form:
1. An external scheduler runs a batch process to calculate one step of the algorithm.
2. When the batch process completes, the scheduler checks whether the algorithm has finished.
3. If not, the scheduler goes back to step 1.
This approach is very inefficient when implemented with MapReduce, because each iteration typically reads the entire input dataset rather than just the small part of the graph that has changed since the previous iteration.
• 60. The Pregel processing model
• Just as in MapReduce a mapper conceptually "sends a message" to the reducer that handles a particular key, in Pregel one vertex can "send a message" to another vertex, typically along an edge of the graph.
• In each iteration (superstep), a function is called for each vertex, passing it all the messages that were sent to it - much like a reducer. Each vertex may in turn generate messages, which are guaranteed to be delivered to their target vertices at the start of the next iteration (by copying them over the network).
• Unlike MapReduce, a vertex remembers its state in memory from one iteration to the next, so the function only needs to process the new messages.
• If no new messages are being sent in some part of the graph, no work needs to be done there.
• Fault tolerance: a few aspects are worth calling out:
• Pregel implementations guarantee that messages are processed exactly once at their destination vertex in the following iteration, regardless of network delays or dropped packets.
• Checkpoints are taken at the end of each iteration, in which the state of every vertex is serialized. When a fault occurs, the framework rolls the computation back to the last checkpoint and restarts from that restored state.
• Parallel execution: messages are passed between vertices, and each vertex can run on any machine. However, finding a way to partition a graph so that a minimal number of messages cross the network is an active area of research. If the entire graph fits on a single machine, that offers the best performance; if the graph is too big to fit in one machine's memory, a distributed algorithm like Pregel is unavoidable.
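A single-machine sketch of the vertex-centric model described above: each superstep calls a per-vertex function with the messages sent to it in the previous superstep, vertex state persists across supersteps, and the computation stops when no more messages are sent. Here the vertices simply propagate the minimum value they have seen; the function name and graph representation are assumptions for illustration, not any Pregel API.

# Illustrative Pregel-style iteration: propagate the minimum value along edges.
def pregel_min(edges, initial_values, max_supersteps=50):
    # edges: {vertex: [neighbor, ...]}, initial_values: {vertex: number}
    state = dict(initial_values)
    inbox = {v: [] for v in state}
    for v in state:                               # superstep 0: announce own value
        for nbr in edges.get(v, []):
            inbox[nbr].append(state[v])
    for _ in range(max_supersteps):
        outbox = {v: [] for v in state}
        sent = False
        for v, messages in inbox.items():
            if not messages:
                continue                          # no new messages: vertex stays idle
            candidate = min(messages)
            if candidate < state[v]:
                state[v] = candidate              # state survives between supersteps
                for nbr in edges.get(v, []):
                    outbox[nbr].append(candidate)
                    sent = True
        if not sent:
            break                                 # converged: no messages in flight
        inbox = outbox
    return state

print(pregel_min({"a": ["b"], "b": ["c"], "c": []}, {"a": 1, "b": 5, "c": 9}))
# {'a': 1, 'b': 1, 'c': 1}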
• 62. Batch processing vs stream processing
Batch processing assumes that the input data is bounded, which isn't the case for many real-world scenarios. For scenarios that must deal with unbounded data in a responsive/timely manner, stream processing is necessary.
Batch and stream processing terminology:
• A record (batch) corresponds to an event (stream).
• A file is written once and then potentially read by multiple jobs; an event is generated once by a producer and potentially processed by multiple consumers.
• A file is a collection of related records; related events are grouped together into a topic or stream.
Lacking an event notification mechanism, consumers must poll files or databases for new data, which creates a lot of unnecessary overhead on the entire system. If a simple pipe or TCP connection between the producer and consumer is not sufficient (because you have multiple producers or consumers), you need a messaging system.
Tradeoffs
What happens if the producers send messages faster than the consumers can process them?
• Drop messages
• Buffer messages in a queue
• Apply flow control to slow the producer down
What happens if nodes crash or temporarily go offline?
• Make the system durable by writing to disk and/or replicating
• Accept the loss of some messages in exchange for higher throughput
• 63. A comparison of messaging systems
Direct messaging from producers to consumers
• UDP multicast: used in the financial industry. Applications must handle lost messages by requesting a retransmit.
• Webhooks: consumers expose a service on the network that producers push data to, and any throttling is applied by the consumer. Consumer crashes can be handled by having the producer retry until the consumer acknowledges delivery; however, if the producer crashes, the state of unsent messages is lost.
Message brokers
A message broker is a kind of database optimized for handling message streams, with producers and consumers connecting to it as clients. Durability moves to the broker, which tolerates clients that come and go (connect, disconnect, crash) by writing events to disk. Consumers are asynchronous: a producer sends a message, which is buffered on the broker, and the broker delivers messages to consumers as they work through their backlog. Differences between message brokers and databases are summarized below.
Multiple consumers
When multiple consumers read messages from the same topic, two main patterns emerge:
• Load balancing: each message is delivered to one of the consumers, so the consumers share the work of processing the messages in the topic. The broker may assign messages to consumers arbitrarily.
• Fan-out: each message is delivered to all of the consumers – like several different batch jobs reading the same input file.
Acknowledgements and redelivery
Message brokers use acknowledgements to ensure that messages are received and processed by consumers. When a message is not acknowledged, a broker that is load-balancing messages may choose to redeliver it to another consumer, in which case messages can be delivered out of order. It is therefore necessary that messages be self-contained and processable independently when load balancing is used.
Databases vs message brokers
• Data in a database is kept until it is explicitly deleted, whereas a broker typically deletes a message once it has been delivered to all of its consumers.
• Databases support secondary indexes and ad hoc searching of data; consumers can subscribe to a subset of a topic via pattern matching, but the mechanisms are very different.
• When a database is queried, the client gets a snapshot and is not notified when the data later changes; consumers, by contrast, subscribe to a topic and are notified when new events arrive, but cannot make arbitrary queries.
• 64. Partitioned logs
A messaging system in which messages are deleted once they have been delivered to all consumers has a downside: a newly added consumer (or an old consumer recovering from a crash) has no way to receive prior messages. Log-based message brokers address this gap as follows:
• When a producer sends a message, it is appended to the end of a log. A consumer receives messages by reading the log sequentially and waits for a notification when it reaches the end of the log.
• For throughput and scalability, the log can be partitioned, with different partitions hosted on different machines. Within each partition, the broker assigns a monotonically increasing sequence number, or offset, to every message; messages within a partition are therefore totally ordered, but there is no ordering guarantee across partitions of the same topic.
• Fan-out messaging falls out naturally: every consumer can read the log independently without affecting the others. Load balancing is implemented by assigning entire partitions to consumers in a group. This form of load balancing has a few downsides:
• The number of nodes sharing the workload is limited to the number of partitions.
• If a single message in a partition is slow to process, the messages behind it in that partition are held up.
• Since consumers keep track of their progress through consumer offsets, there is far less bookkeeping on the broker for acknowledgements and delivery, which improves throughput. The broker only needs to record the consumer offsets periodically so that it can recover when a consumer becomes unresponsive or crashes.
• Typically the broker treats the log as a bounded circular buffer, discarding old messages that no longer fit. With this model, a disk with 6 TB of capacity and a 150 MB/s write throughput can absorb messages for roughly 11 hours before running out of space; i.e., a consumer can lag by about 11 hours before messages start being discarded (see the back-of-the-envelope check after this slide). This has the following implications:
• There is enough time for the system or its operators to react when a consumer crashes and its unprocessed backlog starts piling up.
• Even when a consumer falls far behind, it only affects that consumer, not the overall system. As a result, it is feasible to test or debug against a production log without impacting other services.
• Since the lifetime of a message in the log is governed by the producer's throughput rather than by consumption, older messages can be replayed. This enables the kind of iterative development and experimentation that batch processing systems allow.
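A quick back-of-the-envelope check of the retention window quoted above (decimal units assumed):

# 6 TB circular buffer filled at 150 MB/s: how long until old messages are overwritten?
buffer_bytes = 6 * 10**12             # 6 TB
write_rate_bytes_per_s = 150 * 10**6  # 150 MB/s
hours = buffer_bytes / write_rate_bytes_per_s / 3600
print(round(hours, 1))                # ~11.1 hours of retained messages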
• 65. Keeping heterogeneous systems in sync
A recurring problem in heterogeneous data systems is keeping the systems in sync. As the same or related data appears in several places, it needs to be kept consistent: if an item is updated in the database, it also needs to be updated in the cache, the search indexes, and the data warehouse. There are several ways to keep them in sync:
Batch process - the database is exported, transformed, and loaded into the data warehouse as a batch process. Additional derived data, such as search indexes, can likewise be created by batch processes.
Dual writes - when batch processing is too slow for your needs, the application itself can write to the different systems directly. Dual writes have problems:
• Concurrent clients can cause the systems to become permanently corrupted/out of sync unless you use concurrency-detection mechanisms such as version vectors.
• Without atomic commit/2PC across the heterogeneous systems, the two systems can get out of sync when a value is committed to one data store but the client fails to commit it to the other.
Single-leader replicated database - if there were somehow a single leader spanning the database, the search index, and all the other derived data stores, it would be easy to ensure that data is replicated to all of them without the concurrency issues of dual writes. However, such a model isn't practical given the breadth of data stores in use.
Change data capture - CDC is the process of observing all data changes written to a database and extracting them in a form in which they can be replicated to other systems. This model makes one database the leader and turns the others into followers. A log-based message broker is well suited to transporting the change events from the source database, since it preserves the ordering of messages. There are a few ways to implement CDC and generate a change log:
• Database triggers - these have performance overhead.
• Parsing the replication log - schema changes can make this challenging, but Facebook's Wormhole and Yahoo!'s Sherpa do something along these lines.
• Snapshot model - when building a full-text index you need the full database; just the most recent changes are not sufficient. For scenarios like this, you need the ability to start from a consistent snapshot to which changes are then applied incrementally.
• Log compaction – see the next slide.
• 66. Keeping heterogeneous systems in sync – Cont'd
• Log compaction is an alternative to the snapshot model discussed above. It keeps only the net result of the data in a compact form by discarding the history of changes (see the sketch after this slide). A compacted log is proportional to the size of the data stored, independent of the total number of writes ever made.
• During log compaction, the storage engine periodically looks for log records with the same key, throws away duplicates, and keeps only the most recent update for each key, compacting the log in the background. Deletes are marked by setting a key's value to a special tombstone value such as null.
• A derived data system can start a new consumer from offset 0 of a log-compacted topic and sequentially scan all the messages to obtain a full copy of the database contents, without having to take another snapshot of the CDC source database.
Event sourcing
Similar to CDC, event sourcing involves storing all changes to the application state as a log of change events. The two main differences between CDC and event sourcing are:
• With CDC, the application is oblivious to the fact that CDC is happening; with event sourcing, the application explicitly writes events to the event log.
• Data captured by CDC is mutable - records get updated and deleted - whereas events in an event store are append-only and are never updated or deleted, since each event captures an explicit user-initiated action or state change. Because of this design, event sourcing lets application developers draw richer insights from usage patterns than the lower-level CDC model does.
• To derive the current state from the event log, applications must take the log of events and transform it into application state that is suitable for showing to a user. This transformation can be arbitrary logic, but it must be deterministic so that you arrive at the same state no matter how many times you rerun it.
• Log compaction must work differently for event sourcing: later events do not supersede earlier ones but rather record the full history of what has happened, so you need all the events to rebuild the final state.
• Applications typically snapshot the current state as a performance optimization so that they don't have to reprocess the entire event log every time; however, the full event log is still needed, for example to show a timeline of events to the user.
• Note that requests from users should be viewed as commands. A command may be processed successfully, be rejected, or fail; an event is typically emitted only after the command has been successfully validated and executed - i.e., it has become a fact in the system.
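A minimal sketch of the compaction step described at the start of this slide: keep only the latest value per key, treating a null/None value as a deletion tombstone. This is purely illustrative; real storage engines compact segment files in the background rather than a whole in-memory log.

# Illustrative log compaction: latest value per key wins; None marks a deletion.
def compact(log):
    latest = {}
    for key, value in log:                 # later records supersede earlier ones
        latest[key] = value
    return [(k, v) for k, v in latest.items() if v is not None]

log = [("user:1", "Ada"), ("user:2", "Bob"),
       ("user:1", "Ada L."), ("user:2", None)]   # last write deletes user:2
assert compact(log) == [("user:1", "Ada L.")]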
• 67. State, streams and immutability
Immutability is what makes batch processing pleasant - users can experiment with existing input files without fear of damaging them - and it is also what makes event sourcing and change data capture powerful. The current state of an application is typically stored in a database, which is optimized for reading that state through queries; to support state changes, the database supports updates and deletes. Whenever state changes, that state is the result of the events that mutated it over time. The key idea is that mutable state and an append-only log of immutable change events are two sides of the same coin.
Advantages of immutable events:
• Immutability makes it easier to diagnose bugs and to recover from problems caused by deploying buggy code.
• Immutable events capture more information than the current state alone. For example, they make it possible to reconstruct a user's journey through an e-commerce website, which would not be possible if we only kept track of completed transactions/purchases.
• Deriving several views from the same event log - by separating mutable state from the immutable event log, you can derive several different read-oriented representations from the same log of events. Having an explicit translation step from the event log to a database also makes it easier to evolve the application over time.
• Concurrency control - CDC and event sourcing address some of the challenges of derived data, but they update the derived data asynchronously. If you want derived data to be updated immediately, you are back to distributed transactions across heterogeneous stores. Instead, with an append-only changelog as the single source of truth, appending an event to the log can be the one atomic write, and all other state is derived deterministically from that log.
Limitations of immutability:
• If the changelog grows disproportionately large relative to the size of the actual data, the system may run into performance problems.
• From a privacy standpoint, you need to be able to go back and actually delete all the data about a specific user from the changelog, rather than merely appending a deletion entry.
• 68. Processing streams
Broadly, there are three ways in which streams are processed:
• You can take the data in the events and write it to a database, a cache, a search index, or a similar store, from where it can be queried by other clients.
• You can push the events to users in some way, such as alerts or notifications.
• You can process one or more input streams to produce one or more output streams, analogous to batch processing. One crucial difference is that a batch job ends at some point because its input is finite, whereas a stream never ends, so a stream processor can run forever.
Uses of stream processing
Stream processing is useful wherever alerts or alarms need to be raised: fraud detection, trading systems, manufacturing systems, military and intelligence systems. Beyond those, a few other places where stream processing is used:
• Complex event processing (CEP): similar in spirit to regular expressions, CEP systems let you search for certain patterns of events in a stream. When a match is found, the engine emits a complex event with the details of the pattern that was detected. Unlike traditional datastores, where data is persistent and queries are transient, in CEP systems the queries are stored long-term, and events from the input streams continuously flow past them in search of a match. Ex: IBM InfoSphere Streams, TIBCO StreamBase.
• Stream analytics: while CEP is mostly concerned with finding specific sequences of events, stream analytics is oriented toward aggregations and statistical metrics over a large number of events - for example:
• Measuring the rate of some type of event
• Calculating the rolling average of a value over some time period
• Comparing current statistics to previous time intervals
Example products: Apache Storm, Flink, Kafka Streams, Azure Stream Analytics.
• Search on streams: CEP allows searching for patterns consisting of multiple events, but sometimes you need to search for individual events matching complex criteria, such as full-text search queries. Conventional search engines first index the documents and then run queries over the index; by contrast, when searching a stream, the queries are stored and the documents are run past the queries. As an optimization, it is possible to index both the queries and the documents in order to narrow down the set of queries that may match.