Cassandra consistency

Hinted Handoff(HH)
A hint is written to the coordinator node when a replica is down

Read Repair(RR)
Background digest query on-read to find and update out-of-date replicas*
* carried out in the background unless CL:ALL

http://www.planetcassandra.org/data-replication-in-nosql-databases-explained/#
Updates (insert, update, delete)

https://uberdev.wordpress.com/2015/11/29/cassandra-developer-certiﬁcation-study-notes-read-path/

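SSTables are immutable: once a Memtable has been flushed to disk, the resulting SSTable can no longer be written to. A single partition may span multiple SSTables, but it never spans multiple nodes.
Partition/Primary Index: the partition keys and the start position of each row in the Data file (metadata about the data; an index).
Partition/Index Summary: a sample of the Partition Index, kept in memory (metadata about the metadata; an index of the index).
Bloom Filter: checks whether a row (partition key) is in an SSTable; if it is not, the SSTable is not read.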
http://docs.datastax.com/en/cassandra/2.2/cassandra/dml/dmlHowDataWritten.html

https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlAboutReads.html
http://www.datastax.com/dev/blog/maximizing-cache-benefit-with-cassandra
[Figure: read request flow, steps ① to ⑦. Labels: Memtable, RowCache (hit: Y, miss: N), "a pk is found in key cache".]
Read Request Flow
Row cache & Key cache
The row cache is not write-through. If a write comes in for the row,
the cache for that row is invalidated and is not cached again until
the row is read. Similarly, if a partition is updated, the entire partition
is evicted from the cache. When the desired partition data is not
found in the row cache, then the Bloom filter is checked.
In short: the row cache is not updated on write. When a row is updated, its cached copy becomes invalid
and is removed from the row cache until the row is next read.
A Bloom filter can establish that a SSTable does not contain certain
partition data. A Bloom filter can also find the likelihood that partition
data is stored in a SSTable. However, because the Bloom filter is a
probabilistic function, it can result in false positives. Not all SSTables
identified by the Bloom filter will have data. If the Bloom filter does
not rule out an SSTable, Cassandra checks the partition key cache.
The partition key cache stores a cache of the partition index off-heap.
If a partition key is found in the key cache, the read can go directly to the
compression offset map to find the compressed block on disk that
has the data.
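The lookup order described above (row cache, then Bloom filter, then key cache) can be sketched as a toy model. This is not Cassandra's code: `SSTable`, `read_partition` and the trace strings are hypothetical names, a plain set stands in for the Bloom filter (so it never gives false positives), and the Memtable, the compression offset map and the merging of several SSTables are left out.

```python
class SSTable:
    """Toy SSTable; all names here are illustrative, not Cassandra classes."""

    def __init__(self, name, rows):
        self.name = name
        self.rows = dict(rows)       # partition key -> value ("on disk")
        self.bloom = set(self.rows)  # stand-in for a Bloom filter (no false positives)


def read_partition(key, row_cache, key_cache, sstables, trace):
    """Look up key, appending to trace each structure that was consulted."""
    if key in row_cache:
        trace.append("row cache hit")
        return row_cache[key]
    value = None
    for sst in sstables:             # assumed to be ordered newest first
        if key not in sst.bloom:
            trace.append(f"{sst.name}: ruled out by Bloom filter")
            continue
        if (sst.name, key) in key_cache:
            trace.append(f"{sst.name}: key cache hit")
        else:
            trace.append(f"{sst.name}: partition index lookup")
            key_cache.add((sst.name, key))
        value = sst.rows.get(key)
        break                        # simplification: the newest SSTable wins
    if value is not None:
        row_cache[key] = value       # the row is cached again only once it is read
    return value
```

A first read of a key walks the SSTables and warms both caches; a repeated read is served from the row cache, and after the row cache entry is dropped the key cache still saves the partition index lookup.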

https://2012.nosql-matters.org/cgn/wp-content/uploads/2012/06/Sylvain_Lebresne-Cassandra_Storage_Engine.pdf
Write & Read Example

Compaction
SSTable Storage Format

http://distributeddatastore.blogspot.com/2013/08/cassandra-sstable-storage-format.html
Index.db
Data.db
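The index file stores every key (no sampling).
Because the keys and values in the MD5 table are of uniform size,
the index file and the data file end up roughly the same size.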
Regular Column Tombstone Column

Full Index & Sample Index
Index.db / Summary.db
1. Row key length (short/2 bytes)
2. Key (N bytes)
3. Offset in SSTable data file (long/8 bytes)
4. Promoted size (int/4 bytes)
00000000 00 04 72 6f 77 41 00 00 00 00 00 00 00 00 00 00 |..rowA..........|
00000010 00 00 00 04 72 6f 77 42 00 00 00 00 00 00 00 5f |....rowB......._|
00000020 00 00 00 00 00 0a 72 6f 77 45 78 63 6c 75 64 65 |......rowExclude|
00000030 00 00 00 00 00 00 00 be 00 00 00 00 |............|
0000003c
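The four-field entry layout listed above can be checked against the hex dump with a short parser. This is a sketch under two assumptions: the integers are big-endian (which the dump suggests, since `00 04` precedes the 4-byte key `rowA`), and the promoted size is the number of extra bytes following the entry (zero for every entry in this dump). `INDEX_BYTES` and `parse_index` are illustrative names.

```python
import struct

# The 60 bytes shown in the hex dump above.
INDEX_BYTES = bytes.fromhex(
    "00 04 72 6f 77 41 00 00 00 00 00 00 00 00 00 00 "
    "00 00 00 04 72 6f 77 42 00 00 00 00 00 00 00 5f "
    "00 00 00 00 00 0a 72 6f 77 45 78 63 6c 75 64 65 "
    "00 00 00 00 00 00 00 be 00 00 00 00"
)


def parse_index(buf):
    """Return (key, data_file_offset, promoted_size) for each index entry."""
    entries = []
    pos = 0
    while pos < len(buf):
        (key_len,) = struct.unpack_from(">H", buf, pos)   # 1. key length, 2 bytes
        pos += 2
        key = buf[pos:pos + key_len].decode("utf-8")      # 2. key, N bytes
        pos += key_len
        offset, promoted_size = struct.unpack_from(">qi", buf, pos)  # 3. and 4.
        pos += 12 + promoted_size                         # skip promoted data, if any
        entries.append((key, offset, promoted_size))
    return entries


for entry in parse_index(INDEX_BYTES):
    print(entry)
# ('rowA', 0, 0)
# ('rowB', 95, 0)
# ('rowExclude', 190, 0)
```

The offsets 0, 95 (0x5f) and 190 (0xbe) are the positions of the three rows in Data.db.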

http://www.datastax.com/dev/blog/cassandra-error-handling-done-right

http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure
When a timeout is not a failure

Rapid Read Protection(speculative_retry/dynamic snitch)
https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlClientRequestsRead.html
http://www.planetcassandra.org/blog/rapid-read-protection-in-cassandra-202/
https://issues.apache.org/jira/browse/CASSANDRA-5932
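1. The client requests data from the coordinator node; the coordinator routes the request
to the best-performing node (replica) and returns the result to the client.
Reads only. A read requests the full data from just one replica; depending on the consistency level and the read repair chance,
the other replicas are asked only for a checksum (digest), not the data, so picking the most suitable replica matters.
The dynamic snitch monitors the read performance of the replicas and picks the best one based on history.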
ALTER TABLE users WITH speculative_retry = '10ms';
ALTER TABLE users WITH speculative_retry = '99percentile';
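Pros: lowers read latency when some nodes perform poorly
Cons: generates extra requests, so throughput drops
Notes:
1) Does not apply at consistency level ALL, because that level already reads from every replica
2) On small clusters rapid read protection also lowers throughput; on larger clusters the effect is less noticeable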
Recovering from replica node failure with rapid read protection

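2. If the node the request was routed to fails before it responds to the coordinator,
the client's request eventually times out.
3. Rapid read protection lets the coordinator monitor outstanding requests:
when the original replica responds more slowly than expected,
the coordinator sends extra requests to the nodes holding other replicas.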
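The decision the coordinator makes in rapid read protection can be sketched as a threshold check. This is a minimal illustration, not Cassandra's implementation: `retry_threshold_ms` and `should_speculate` are hypothetical names, only the two setting forms shown in the ALTER TABLE statements above ('10ms' and '99percentile') are handled, and the nearest-rank percentile is an assumed choice.

```python
import math


def retry_threshold_ms(setting, latency_history_ms):
    """Turn a speculative_retry-style setting into a wait threshold in milliseconds."""
    if setting.endswith("percentile"):
        pct = float(setting[:-len("percentile")])
        ordered = sorted(latency_history_ms)
        # Nearest-rank percentile over the observed read latencies (assumed method).
        rank = max(1, math.ceil(pct * len(ordered) / 100))
        return float(ordered[rank - 1])
    if setting.endswith("ms"):
        return float(setting[:-2])
    raise ValueError(f"unsupported setting: {setting!r}")


def should_speculate(elapsed_ms, setting, latency_history_ms):
    """True if the coordinator should send an extra read to another replica."""
    return elapsed_ms > retry_threshold_ms(setting, latency_history_ms)
```

With a history of 1 ms to 100 ms reads, '99percentile' gives a 99 ms threshold, so a read still outstanding after 120 ms would trigger an extra request, while one at 50 ms would not.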

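Nothing is absolute: never enabling speculative execution is bad, but always enabling it is not a good idea either.
Enable speculative execution for only 90% of requests, so that just 10% of requests are left unprotected.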

Data Consistency
Paxos consensus protocol
Lightweight Transaction (CAS); two-phase commit
https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlAboutDataConsistency.html
Linearizable consistency
Tunable Consistency:
R: the consistency level of read operations
W: the consistency level of write operations
N: the number of replicas
Strong consistency is guaranteed when R + W > N
Eventual consistency occurs when R + W <= N
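The R + W > N rule above is easy to check for concrete settings. A small sketch, with R and W taken as the number of replicas each level requires; `quorum` and `is_strongly_consistent` are illustrative helper names.

```python
def quorum(n):
    """Number of replicas in a majority of n."""
    return n // 2 + 1


def is_strongly_consistent(r, w, n):
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


N = 3
print(is_strongly_consistent(quorum(N), quorum(N), N))  # QUORUM + QUORUM: True
print(is_strongly_consistent(1, N, N))                  # read ONE, write ALL: True
print(is_strongly_consistent(1, 1, N))                  # ONE + ONE: False
```

With N = 3, QUORUM reads and writes (2 + 2 > 3) always share at least one replica, whereas ONE/ONE (1 + 1 <= 3) can miss the latest write.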

Client read or write requests can go to any node in the cluster because all nodes in Cassandra are peers. When a client
connects to a node and issues a read or write request, that node serves as the coordinator for that particular client operation.
The job of the coordinator is to act as a proxy between the client application and the nodes (or replicas) that own the data
being requested. The coordinator determines which nodes in the ring should get the request based on the cluster's configured
partitioner and replica placement strategy.
https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/
Coordinator

Consistency refers to how up-to-date and synchronized a row of Cassandra data is on all of its replicas.
Using repair operations, Cassandra data will eventually be consistent in all replicas. Repairs work to
decrease the variability in replica data, but at a given time, stale data can be present.
The consistency level determines the number of replicas that need to acknowledge the read or write
operation success to the client application. For read operations, the read consistency level specifies how
many replicas must respond to a read request before returning data to the client application. For write
operations, the write consistency level specifies how many replicas must respond to a write request
before the write is considered successful.
Even at low consistency levels, Cassandra writes to all replicas of the partition key, including replicas in
other data centers. The write consistency level just specifies when the coordinator can report to the client
application that the write operation is considered completed.
If a read operation reveals inconsistency among replicas, Cassandra initiates a read repair to
update the inconsistent data. Write operations will use hinted handoffs to ensure the writes are
completed when replicas are down or otherwise not responsive to the write request.
Typically, a client specifies a consistency level that is less than the replication factor specified by the
keyspace. Another common practice is to write at a consistency level of QUORUM and read at a
consistency level of QUORUM. The choices made depend on the client application's needs, and Cassandra
provides maximum flexibility for application design. There is a tradeoff between operation latency and
consistency: higher consistency incurs higher latency, lower consistency permits lower latency. You can
control latency by tuning consistency.
Consistency Level(CL): How many replicas must respond to declare success?
Hinted Handoff(HH): A hint is written to the coordinator node when a replica is down
Read Repair(RR): Background digest query on-read to find and update out-of-date replicas
https://docs.datastax.com/en/cassandra/2.2/cassandra/dml/dmlAboutDataConsistency.html
Consistency Level

Direct Read
Digest Read
Compare In Memory
Decide Which Latest
What if n4 is newer than n3: is another Direct Read issued to n4?
(n4 returned only a digest, so getting the full data requires a Direct Read.)
In this situation, n3 will also pull the newer data from n4.

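Although the replicas are stored on n2, n3 and n4, and n2 can be regarded as the primary replica,
the coordinator picks the replica on the fastest node based on historical data.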

CL=ONE?
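Reads the data from the least-loaded node (what if that node does not hold the latest data?)
Are replicas compared pairwise, or is the Direct Read compared against the Digest Reads?
At CL=ONE the read_repair_chance setting applies: only 10% of requests perform a Read Repair.
The chance has no effect for CL > ONE: at CL=QUORUM/ALL, every request that finds an inconsistency triggers a repair.
read_repair_chance is ignored if the ConsistencyLevel
is greater than ONE and read repair always occurs.
Write=ALL, Read=ONE guarantees strong consistency, while only 10% of requests start a background Read Repair.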

Read repair means that when a query is made against a given key, we perform a digest query against all the replicas of the key and push the
most recent version to any out-of-date replicas. If a lower ConsistencyLevel than ALL was specified, this is done in the background after
returning the data from the closest replica to the client; otherwise(CL=ALL), it is done before returning the data. This means that in almost all
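cases, at most the first instance of a query will return old data (the first query may receive stale data, but later identical queries see fresh data because the replicas have been repaired).
Read Repair mechanism: a query first asks the closest node for the data [1], then sends Digest requests to the other nodes; after all replicas have been compared, the version with the latest timestamp
is pushed to the out-of-date replicas. The consistency level only changes when Read Repair happens: at ONE or QUORUM, Read Repair starts in the background after the closest node's data [1] has been returned to the client.
At consistency level ALL, Read Repair completes before the data is returned to the client.
Whatever the consistency level, the full data is requested only from the closest node; even if that node's data is not the latest, it is what the client receives, so stale data may be returned.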
https://wiki.apache.org/cassandra/ReadRepair
https://docs.datastax.com/en/cassandra/2.2/cassandra/dml/dmlClientRequestsRead.html
http://www.datastax.com/dev/blog/common-mistakes-and-misconceptions
There are three types of read requests that a coordinator can send to a replica:
+ A direct read request
+ A digest request
+ A background read repair request
The coordinator node contacts one replica node with a direct read request. Then the coordinator sends a digest request to a number of
replicas determined by the consistency level specified by the client. The digest request checks the data in the replica node to make sure it
is up to date. Then the coordinator sends a digest request to all remaining replicas. If any replica nodes have out of date data, a
background read repair request is sent. Read repair requests ensure that the requested row is made consistent on all replicas.
For a digest request the coordinator first contacts the replicas specified by the consistency level. The coordinator sends these requests to
the replicas that are currently responding the fastest. The nodes contacted respond with a digest of the requested data; if multiple nodes are
contacted, the rows from each replica are compared in memory to see if they are consistent. If they are not, then the replica that has the
most recent data (based on the timestamp) is used by the coordinator to forward the result back to the client. To ensure that all replicas have
the most recent version of the data, read repair is carried out to update out-of-date replicas.
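CL=ONE: Direct Read from one node, but only 10% of requests trigger a background Read Repair (covering the remaining two replicas).
CL=QUORUM: Direct Read from one node and a Digest Read to one other node, which satisfies QUORUM; once these two nodes are confirmed consistent,
the Direct Read data is returned to the client, and then a Digest Read is sent to the last node (what if that last node is the one holding the latest data?).
CL=ALL: Direct Read from one node and Digest Reads to the other two; Read Repair runs to make all nodes consistent, then the Direct Read data is returned to the client.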
Read & Read Repair
Read repair is not directly related to repair, but both play a role in the overall anti-entropy system in Cassandra. The read_repair_chance setting originally
defaulted to 1. That is, at a consistency level of ONE, every read would check whether the data just read is consistent
with the other replicas. This was good, because if you ever read stale data, the next time you read the same row you would probably read something
more up to date. The bad part was that every read became RF reads (and typically your RF is set to at least 3), meaning that replica reads
happen more often and require more IO. In newer versions of Cassandra the default for this value is 0.1, and it is set on a per-column-family basis,
which means 10% of your requests will trigger a background read repair. This is more than enough for typical scenarios.
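The effect of read_repair_chance can be illustrated with a small simulation. This is only a sketch of the rule as stated in these notes (the chance applies at CL=ONE and is ignored at higher levels); `read_repair_runs` is a hypothetical helper, not a Cassandra API.

```python
import random


def read_repair_runs(consistency_level, read_repair_chance, rng):
    """Decide whether a read also checks the remaining replicas."""
    if consistency_level != "ONE":
        return True  # per the notes above: chance is ignored for CL > ONE
    return rng.random() < read_repair_chance


rng = random.Random(42)  # seeded so the run is repeatable
trials = 10_000
repairs = sum(read_repair_runs("ONE", 0.1, rng) for _ in range(trials))
print(f"{repairs / trials:.1%} of CL=ONE reads triggered a read repair")
```

With a chance of 0.1 roughly one read in ten fans out to all replicas, instead of every read becoming RF reads as with the old default of 1.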

When data is read to satisfy a query and return a result, all replicas are queried for the data needed. The first replica
node receives a direct read request and supplies the full data. The other
nodes contacted receive a digest request and return a digest, or hash of the data. A
digest is requested because generally the hash is smaller than the data itself.
A comparison of the digests allows the coordinator to return the most up-to-date data to the query (open questions: can a digest itself be returned
to the client? What if the Direct Read is not the latest? Can a digest be compared with the Direct Read?). If the digests are
the same for enough replicas to meet the consistency level, the data is returned. If the
consistency level of the read query is ALL, the comparison must be completed before the results are returned; otherwise for all lower
consistency levels, it is done in the background.
The coordinator compares the digests, and if a mismatch is discovered, a request for the full data is sent to the mismatched
nodes (open question: is the full data that ends up on the mismatched nodes the Direct Read result, or the version with the latest timestamp among the digests?). The most current data found
in a full data comparison is used to reconcile any inconsistent data on other replicas.
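The direct read, digest comparison and reconciliation described above can be sketched with in-memory replicas. This is a toy model under stated assumptions: each replica is a dict of key to (value, timestamp), SHA-256 is just a convenient hash for the digest, the repair is done before returning (as the text describes for CL=ALL), and `digest` and `coordinator_read` are hypothetical names.

```python
import hashlib


def digest(row):
    """Hash of a (value, timestamp) row; smaller to ship than the row itself."""
    value, timestamp = row
    return hashlib.sha256(f"{value}:{timestamp}".encode()).hexdigest()


def coordinator_read(replicas, key):
    """Return (row, repaired_replica_names) for key.

    replicas maps replica name -> {key: (value, timestamp)}; the first
    replica gets the direct read, the others get digest requests.
    """
    names = list(replicas)
    data = replicas[names[0]][key]                       # direct read: full data
    digests = [digest(replicas[n][key]) for n in names[1:]]
    if all(d == digest(data) for d in digests):
        return data, []                                  # replicas agree
    # Mismatch: fetch full data and let the newest timestamp win.
    full = {n: replicas[n][key] for n in names}
    newest = max(full.values(), key=lambda row: row[1])
    stale = [n for n, row in full.items() if row != newest]
    for n in stale:
        replicas[n][key] = newest                        # read repair
    return newest, stale
```

In this model a stale direct read is not returned as-is: the digest mismatch forces a full comparison, which is one possible answer to the open questions noted in the text.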
http://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsRepairNodesTOC.html
http://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsRepairNodesReadRepair.html
Node repair makes data on a replica consistent with data on other nodes and is important for every Cassandra cluster. Repair is the process
of correcting the inconsistencies so that eventually, all nodes have the same and most up-to-date data.
Repair can occur in the following ways:
✅ Hinted Handoff
During the write path, if a node that should receive data is unavailable, hints are written to the coordinator. When the node comes back online,
the coordinator can hand off the hints so that the node can catch up and write the data.
✅ Read Repair
During the read path, a query acquires data from several nodes. The acquired data from each node is checked against each other node. If a
node has outdated data, the most recent data is written back to the node.
✅ Anti-Entropy Repair
For maintenance purposes or recovery, manually run anti-entropy repair to rectify inconsistencies on any nodes(by nodetool repair).
Repair

Hint TTL: max_hint_window_in_ms = 3 hours
If a node stays down for more than 3 hours, no further hints are stored for it.
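The hint window rule can be written as a one-line check. A minimal sketch: `should_store_hint` and its parameters are hypothetical names, and only the 3-hour value of max_hint_window_in_ms comes from the note above.

```python
MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000  # 3 hours, as in the note above


def should_store_hint(down_since_ms, now_ms, window_ms=MAX_HINT_WINDOW_MS):
    """True while the replica has been down for no longer than the hint window."""
    return now_ms - down_since_ms <= window_ms


HOUR_MS = 60 * 60 * 1000
print(should_store_hint(0, 1 * HOUR_MS))  # True: still inside the window
print(should_store_hint(0, 4 * HOUR_MS))  # False: hints are no longer stored
```

Writes missed after the window closes have to be recovered by read repair or anti-entropy repair instead.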

Low Latency, Low Consistency: low latency is only possible with low consistency
High Latency, High Consistency: high consistency incurs high latency
Read
Write

https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlClientRequestsWrite.html
The coordinator sends a write request to all replicas that own the row being written. As long as all replica nodes are up and available, they will get the
write regardless of the consistency level specified by the client. The write consistency level determines how many replica nodes must respond with a
success acknowledgment in order for the write to be considered successful. Success means that the data was written to the commit log and the
memtable as described in how data is written.
In a single data center 12 node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row. If the write
consistency level specified by the client is ONE, the first node [R1] to complete the write responds back to the coordinator, which then proxies the
success message back to the client [write response]. A consistency level of ONE means that it is possible that 2 of the 3 replicas [R2,R3] could miss
the write if they happened to be down at the time the request was made.
That node [coordinator] forwards the write to all replicas of that row. It responds to the client once it receives write acknowledgments from the number
of nodes specified by the consistency level.
1. If the coordinator cannot write to enough replicas to meet the requested CL, it throws an Unavailable Exception and does not perform any writes.
2. If there are enough replicas available but the required writes don't finish within the timeout window, the coordinator throws a Timeout Exception.
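The two failure cases above can be sketched as a small decision function. This is an illustration only: the exception classes and `coordinate_write` are hypothetical stand-ins, and each replica is reduced to one of three states ('ack', 'slow', 'down').

```python
class UnavailableException(Exception):
    """Not enough live replicas to even attempt the write."""


class WriteTimeoutException(Exception):
    """Enough live replicas, but too few acknowledged in time."""


def coordinate_write(replica_states, required_acks):
    """Return the replicas that acknowledged, or raise as in cases 1 and 2.

    replica_states maps replica name -> 'ack' (answered in time),
    'slow' (alive but missed the timeout) or 'down'.
    """
    alive = [n for n, state in replica_states.items() if state != "down"]
    if len(alive) < required_acks:
        # Case 1: fail fast, no write is sent to any replica.
        raise UnavailableException(f"{len(alive)} alive, {required_acks} required")
    acked = [n for n in alive if replica_states[n] == "ack"]
    if len(acked) < required_acks:
        # Case 2: the writes were sent, but too few finished within the timeout.
        raise WriteTimeoutException(f"{len(acked)} acks, {required_acks} required")
    return acked
```

The difference matters to the client: in the first case nothing was written, while after a timeout some replicas may already hold the write.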
Write consistency

DC:2, RF:3, CL:QUORUM=>
All data centers, two replicas
In multiple data center deployments, Cassandra
optimizes write performance by choosing one
coordinator node. The coordinator node contacted
by the client application forwards the write request
to each replica node in all the data centers.
If using a consistency level of LOCAL_ONE or
LOCAL_QUORUM, only the nodes in the same
data center as the coordinator node must respond
to the client request in order for the request to
succeed. This way, geographical latency does not
impact client request response times.

https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlClientRequestsReadExp.html
DC:1, RF:3, CL:QUORUM=>2
In a single data center cluster with a replication factor of 3, and a read consistency level of QUORUM, 2 of the 3 replicas for the given row
must respond to fulfill the read request. If the contacted replicas have different versions of the row, the replica with the most recent version will
return the requested data [to Client]. In the background, the third replica is checked for consistency with the first two, and if needed, a read
repair is initiated for the out-of-date replicas.
Read consistency

DC:1, RF:3, CL:ONE=>1
In a single data center cluster with a replication factor of 3, and a read consistency level of ONE, the closest replica for the given row is
contacted to fulfill the read request. In the background a read repair is potentially initiated, based on the read_repair_chance setting of the
table, for the other replicas.

In a two data center cluster with a RF=3, and a
read consistency of QUORUM, 4 replicas for the
given row must respond to fulfill the read request.
The 4 replicas can be from any data center. In the
background, the remaining replicas are checked
for consistency with the first four, and if needed,
a read repair is initiated for the out-of-date replicas.
DC:2, RF:3, CL:QUORUM=>
Any data center, four replicas

DC:2, RF:3, CL:LOCAL_QUORUM=>
Local data center, two replicas
In a multiple data center cluster with a RF=3,
and a read consistency of LOCAL_QUORUM,
2 replicas in the same DC as the coordinator
node for the given row must respond to fulfill
the read request. In the background, the
remaining replicas are checked for consistency
with the first 2, and if needed, a read repair is
initiated for the out-of-date replicas.
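The replica counts quoted in these read examples follow from simple arithmetic on the replication factors. A sketch, assuming RF is given per data center; `required_replicas` and the data center names are illustrative.

```python
def required_replicas(consistency_level, rf_per_dc, local_dc):
    """Number of replicas that must respond for the given consistency level."""
    total_rf = sum(rf_per_dc.values())
    if consistency_level in ("ONE", "LOCAL_ONE"):
        return 1
    if consistency_level == "QUORUM":
        return total_rf // 2 + 1              # majority across all data centers
    if consistency_level == "LOCAL_QUORUM":
        return rf_per_dc[local_dc] // 2 + 1   # majority in the coordinator's DC
    if consistency_level == "ALL":
        return total_rf
    raise ValueError(f"unsupported consistency level: {consistency_level}")


one_dc = {"dc1": 3}
two_dcs = {"dc1": 3, "dc2": 3}
print(required_replicas("QUORUM", one_dc, "dc1"))         # 2
print(required_replicas("QUORUM", two_dcs, "dc1"))        # 4
print(required_replicas("LOCAL_QUORUM", two_dcs, "dc1"))  # 2
```

This reproduces the figures in the text: 2 of 3 for a single data center, 4 of 6 across two data centers, and 2 for LOCAL_QUORUM.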

DC:2, RF:3, CL:ONE=>
Any DC, one replica
In a multiple data center cluster with a RF=3,
and a read consistency of ONE, the closest replica
for the given row, regardless of data center,
is contacted to fulfill the read request. In the
background a read repair is potentially initiated,
based on the read_repair_chance setting of the
table, for the other replicas.

DC:2, RF:3, CL:LOCAL_ONE=>
Local data center, one replica
In a multiple data center cluster with a RF=3,
and a read consistency of LOCAL_ONE, the
closest replica for the given row in the same
data center as the coordinator node is
contacted to fulfill the read request. In the
background a read repair is potentially initiated,
based on the read_repair_chance setting of the
table, for the other replicas.

[Figure: Bloom filter lookup for key1 across several SSTables, each with its own Bloom filter. Query key1 asks each filter "Am I here?"; when a filter answers "No, you're NOT here!", the read trusts it ("OK, I believe you!") and goes on to the next SSTable.]

bloom_filter_fp_chance (false positive chance)
Determines the percent chance of the Bloom filter returning a false positive, that is, reporting
that a partition exists in an SSTable when in fact it does not.
False positives are possible; false negatives are not possible.
If you increase the percent chance of false positives, then you lower memory usage via a smaller filter size at the expense of more disk seeks
due to an increase in false positives.
If you decrease the percent chance of false positives, then you increase memory usage via a larger filter size for the benefit of fewer disk
seeks thanks to fewer false positives.
https://grockdoc.com/cassandra/2.1/articles/tuning-reads-via-the-bloom-filter_88c8f57a-71d0-41ee-b77f-617c64ad4739/
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html
False positive matches are possible, but false negatives are not. In other words,
a query returns either “possibly in set” or “definitely not in set”.
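The "possibly in set" versus "definitely not in set" behaviour, and the memory cost of a lower false positive chance, can be seen in a minimal Bloom filter. This is a generic textbook sketch, not Cassandra's implementation: the class, its SHA-256 based hashing and the one-byte-per-bit storage are illustrative choices, and the sizing uses the standard formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter sized from an expected item count and a target fp chance."""

    def __init__(self, expected_items, fp_chance):
        # Standard sizing: bits m = -n ln(p) / (ln 2)^2, hash count k = (m / n) ln 2.
        self.size = max(1, math.ceil(-expected_items * math.log(fp_chance) / math.log(2) ** 2))
        self.hashes = max(1, round(self.size / expected_items * math.log(2)))
        self.bits = bytearray(self.size)  # one byte per bit, for readability

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        """False means definitely absent; True means possibly present."""
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter(expected_items=1000, fp_chance=0.01)
for i in range(1000):
    bf.add(f"key{i}")
print(all(bf.might_contain(f"key{i}") for i in range(1000)))  # True: no false negatives
print(BloomFilter(1000, 0.1).size < BloomFilter(1000, 0.01).size)  # True: higher fp chance, smaller filter
```

A key that was added is always reported as possibly present, while raising the false positive chance from 0.01 to 0.1 roughly halves the number of bits needed.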

http://www.datastax.com/dev/blog/improving-compaction-in-cassandra-with-cardinality-estimation

https://issues.apache.org/jira/browse/CASSANDRA-6474

https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesManualRepair.html

http://www.datastax.com/dev/blog/more-efﬁcient-repairs

http://christopher-batey.blogspot.com/2015/02/cassandra-anti-pattern-misuse-of.html

https://www.pythian.com/blog/guide-to-cassandra-thread-pools/
