21.
https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlAboutReads.html
http://www.datastax.com/dev/blog/maximizing-cache-benefit-with-cassandra
[Figure: read request flow in steps ① to ⑦, showing the Memtable, the RowCache, Y/N branches, and the case where a partition key is found in the key cache]
Read Request Flow
Row cache & Key cache
The row cache is not write-through. If a write comes in for the row,
the cache for that row is invalidated and is not cached again until
the row is read. Similarly, if a partition is updated, the entire partition
is evicted from the cache. When the desired partition data is not
found in the row cache, then the Bloom filter is checked.
The RowCache is not write-through: if a row is updated, that row is invalidated in the RowCache
and stays out of the RowCache until the next time the row is read.
A Bloom filter can establish that an SSTable does not contain certain
partition data. It can also estimate the likelihood that partition
data is stored in an SSTable. However, because the Bloom filter is a
probabilistic function, it can return false positives: not all SSTables
identified by the Bloom filter will have the data. If the Bloom filter does
not rule out an SSTable, Cassandra checks the partition key cache.
The partition key cache stores a cache of the partition index off-heap.
If a partition key is found in the key cache, the read can go directly to the
compression offset map to find the compressed block on disk that
has the data.
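The lookup order described above (row cache, then Bloom filter, then key cache) can be sketched as a toy model. This is not Cassandra's implementation: the `SSTable` class, its fields, and both functions are hypothetical names, a plain set stands in for the Bloom filter, the memtable and commit log are omitted, and the read stops at the first SSTable that has the row instead of merging across SSTables.

```python
from dataclasses import dataclass, field


@dataclass
class SSTable:
    """Toy stand-in for an SSTable (all names here are hypothetical)."""
    data: dict                                      # partition key -> row, "on disk"
    bloom: set = field(default_factory=set)         # keys the filter reports as "maybe present"
    key_cache: dict = field(default_factory=dict)   # partition keys with a cached index entry


def read_partition(key, row_cache, sstables):
    """Return (row, trace), checking row cache -> Bloom filter -> key cache."""
    trace = []
    if key in row_cache:
        trace.append("row cache hit")
        return row_cache[key], trace
    trace.append("row cache miss")
    for i, sst in enumerate(sstables):
        if key not in sst.bloom:
            trace.append(f"sstable {i}: ruled out by Bloom filter")
            continue
        if key in sst.key_cache:
            trace.append(f"sstable {i}: key cache hit")
        else:
            trace.append(f"sstable {i}: key cache miss")
        row = sst.data.get(key)
        if row is None:
            # The filter said "maybe", but the data is not there.
            trace.append(f"sstable {i}: Bloom filter false positive")
            continue
        sst.key_cache[key] = True
        row_cache[key] = row  # the row is cached again only because it was read
        return row, trace
    return None, trace


def write_partition(key, row, row_cache, sstable):
    """A write invalidates the cached row instead of updating it (not write-through)."""
    sstable.data[key] = row
    sstable.bloom.add(key)
    row_cache.pop(key, None)
```

Reading a key twice yields a row cache hit on the second read; calling `write_partition` for that key evicts it, so the following read falls through to the Bloom filter and key cache again.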
49. Data Consistency
Paxos consensus protocol
Lightweight Transaction (CAS), two-phase commit
https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlAboutDataConsistency.html
Linearizable consistency
Tunable Consistency:
R: the consistency level of read operations
W: the consistency level of write operations
N: the number of replicas
Strong consistency is guaranteed: R + W > N
Eventual consistency occurs: R + W <= N
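The R + W > N rule can be checked with a few lines of arithmetic. The sketch below is only an illustration; the function names are made up for this example, and `quorum` uses the usual majority formula (floor of N/2, plus 1).

```python
def quorum(n: int) -> int:
    """Majority of n replicas."""
    return n // 2 + 1


def is_strongly_consistent(r: int, w: int, n: int) -> bool:
    """True when every read set overlaps every write set, i.e. R + W > N."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("R and W must be between 1 and N")
    return r + w > n


if __name__ == "__main__":
    n = 3
    combos = [
        ("ONE / ONE", 1, 1),
        ("QUORUM / QUORUM", quorum(n), quorum(n)),
        ("ONE / ALL", 1, n),
    ]
    for label, r, w in combos:
        kind = "strong" if is_strongly_consistent(r, w, n) else "eventual"
        print(f"N={n}, read/write {label}: R={r}, W={w} -> {kind}")
```

With N=3, reading and writing at QUORUM gives R=2 and W=2, so R + W = 4 > 3 and at least one replica in every read set has seen the latest write.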
50. Client read or write requests can go to any node in the cluster because all nodes in Cassandra are peers. When a client
connects to a node and issues a read or write request, that node serves as the coordinator for that particular client operation.
The job of the coordinator is to act as a proxy between the client application and the nodes (or replicas) that own the data
being requested. The coordinator determines which nodes in the ring should get the request based on the cluster's configured
partitioner and replica placement strategy.
https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/
Coordinator
51. Consistency refers to how up-to-date and synchronized a row of Cassandra data is on all of its replicas.
Using repair operations, Cassandra data will eventually be consistent in all replicas. Repairs work to
decrease the variability in replica data, but at a given time, stale data can be present.
The consistency level determines the number of replicas that need to acknowledge the read or write
operation success to the client application. For read operations, the read consistency level specifies how
many replicas must respond to a read request before returning data to the client application. For write
operations, the write consistency level specifies how many replicas must respond to a write request
before the write is considered successful.
Even at low consistency levels, Cassandra writes to all replicas of the partition key, including replicas in
other data centers. The write consistency level just specifies when the coordinator can report to the client
application that the write operation is considered completed.
If a read operation reveals inconsistency among replicas, Cassandra initiates a read repair to
update the inconsistent data. Write operations use hinted handoffs to ensure the writes are
completed when replicas are down or otherwise not responsive to the write request.
Typically, a client specifies a consistency level that is less than the replication factor specified by the
keyspace. Another common practice is to write at a consistency level of QUORUM and read at a
consistency level of QUORUM. The choices made depend on the client application's needs, and Cassandra
provides maximum flexibility for application design. There is a tradeoff between operation latency and
consistency: higher consistency incurs higher latency, lower consistency permits lower latency. You can
control latency by tuning consistency.
Consistency Level (CL): How many replicas must respond to declare success?
Hinted Handoff (HH): A hint is written to the coordinator node when a replica is down
Read Repair (RR): Background digest query on read to find and update out-of-date replicas
https://docs.datastax.com/en/cassandra/2.2/cassandra/dml/dmlAboutDataConsistency.html
Consistency Level
58. Direct Read
Digest Read
Compare In Memory
Decide Which Latest
What if n4 is newer than n3? Does the coordinator issue another Direct Read to n4?
(n4 returned only a digest, so getting the full data requires a Direct Read.)
In this situation, n3 will also be updated from the newer data at n4.
❓
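62. Read repair means that when a query is made against a given key, we perform a digest query against all the replicas of the key and push the
most recent version to any out-of-date replicas. If a ConsistencyLevel lower than ALL was specified, this is done in the background after
returning the data from the closest replica to the client; otherwise (CL=ALL), it is done before returning the data. This means that in almost all
cases, at most the first instance of a query will return old data (the first query may receive stale data, but later identical queries see fresh data because the replicas have been repaired).
The read repair mechanism: a query first asks the closest node for the data [1], then sends digest requests to the other nodes; after all replicas are compared, the replica data with the latest timestamp is
pushed to the out-of-date replicas. The consistency levels differ only in when read repair happens: at ONE or QUORUM, read repair starts in the background only after the closest node's data [1] has been returned to the client.
At consistency level ALL, read repair completes before the data is returned to the client.
Whatever the consistency level, the full data is requested only from the closest node; even if that node's data is not the latest, it is still returned to the client, so stale data may be returned.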
https://wiki.apache.org/cassandra/ReadRepair
https://docs.datastax.com/en/cassandra/2.2/cassandra/dml/dmlClientRequestsRead.html
http://www.datastax.com/dev/blog/common-mistakes-and-misconceptions
There are three types of read requests that a coordinator can send to a replica:
+ A direct read request
+ A digest request
+ A background read repair request
The coordinator node contacts one replica node with a direct read request. Then the coordinator sends a digest request to a number of
replicas determined by the consistency level specified by the client. The digest request checks the data in the replica node to make sure it
is up to date. Then the coordinator sends a digest request to all remaining replicas. If any replica nodes have out of date data, a
background read repair request is sent. Read repair requests ensure that the requested row is made consistent on all replicas.
For a digest request the coordinator first contacts the replicas specified by the consistency level. The coordinator sends these requests to
the replicas that are currently responding the fastest. The nodes contacted respond with a digest of the requested data; if multiple nodes are
contacted, the rows from each replica are compared in memory to see if they are consistent. If they are not, then the replica that has the
most recent data (based on the timestamp) is used by the coordinator to forward the result back to the client. To ensure that all replicas have
the most recent version of the data, read repair is carried out to update out-of-date replicas.
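CL=ONE: Direct Read from one node, but only 10% of requests trigger a background Read Repair (of the remaining two replicas).
CL=QUORUM: Direct Read from one node and a Digest Read to one other node, which satisfies QUORUM; once these two nodes are confirmed consistent,
the Direct Read data is returned to the client, and a Digest Read is then sent to the last node (what if that last node holds the newest data?).
CL=ALL: Direct Read from one node and Digest Reads to the other two; Read Repair runs to make all nodes consistent, then the Direct Read data is returned to the client.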
Read & Read Repair
Read repair is not directly related to repair, but both play a role in the overall anti-entropy system in Cassandra. The read_repair_chance setting
started out as 1. That is, at a consistency level of ONE, every read would check whether the data just read is consistent
with the other replicas. This was good, because if you ever read stale data, the next time you read the same row you would probably read something
more up to date. The downside was that every read became RF reads (and RF is typically set to at least 3), so reads
happened more often and required more I/O. In newer versions of Cassandra the default for this value is 0.1, and it is set per column family,
which means 10% of your requests trigger a background read repair. This is more than enough for typical scenarios.
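The I/O cost of read_repair_chance can be estimated with a small simulation. This is a rough sketch under simplifying assumptions, not Cassandra code: every request is a CL=ONE read, a triggered read repair touches all RF - 1 other replicas, and `replica_reads` is a hypothetical helper name.

```python
import random


def replica_reads(num_requests: int, rf: int, read_repair_chance: float,
                  rng: random.Random) -> int:
    """Count replica reads for CL=ONE requests under a given read_repair_chance."""
    total = 0
    for _ in range(num_requests):
        total += 1                       # the direct read that serves the client
        if rng.random() < read_repair_chance:
            total += rf - 1              # background check of the other replicas
    return total


if __name__ == "__main__":
    for chance in (1.0, 0.1, 0.0):
        reads = replica_reads(10_000, rf=3, read_repair_chance=chance,
                              rng=random.Random(42))
        print(f"read_repair_chance={chance}: {reads} replica reads for 10000 requests")
```

With a chance of 1.0 every request costs RF replica reads; with 0.1 the expected cost is about 1 + 0.1 × (RF - 1) replica reads per request.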
63. When data is read to satisfy a query and return a result, all replicas are queried for the data needed. The first replica
node receives a direct read request and supplies the full data to the coordinator. The other
nodes contacted receive a digest request and return a digest, or hash, of the data. A
digest is requested because the hash is generally smaller than the data itself.
A comparison of the digests allows the coordinator to return the most up-to-date data to the query. (Open questions: can a digest be
returned directly to the client? What if the direct read is not the latest version? Can a digest be compared with the direct read?) If the digests are
the same for enough replicas to meet the consistency level, the data is returned. If the
consistency level of the read query is ALL, the comparison must be completed before the results are returned; otherwise, for all lower
consistency levels, it is done in the background.
The coordinator compares the digests, and if a mismatch is discovered, a request for the full data is sent to the mismatched
nodes. (Open question: is the full data used here the direct-read data, or the version with the latest timestamp among the digests?) The most current data found
in a full data comparison is used to reconcile any inconsistent data on other replicas.
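One way to read the digest flow above is sketched below. It is a toy model, not the actual coordinator logic: `digest`, `coordinator_read`, and the `(timestamp, value)` row layout are assumptions made for illustration, SHA-256 is an arbitrary hash choice, and the sketch ignores whether the repair runs in the foreground or the background.

```python
import hashlib
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (write timestamp, value); a hypothetical row layout


def digest(row: Row) -> str:
    """Hash of a row; much smaller than the row itself for large values."""
    ts, value = row
    return hashlib.sha256(f"{ts}:{value}".encode()).hexdigest()


def coordinator_read(replicas: Dict[str, Row],
                     contacted: List[str]) -> Tuple[Row, List[str]]:
    """Direct read from contacted[0], digest reads from the others.

    Returns the row sent to the client and the names of repaired replicas.
    """
    data_node, digest_nodes = contacted[0], contacted[1:]
    full = replicas[data_node]
    digests = {name: digest(replicas[name]) for name in digest_nodes}
    if all(d == digest(full) for d in digests.values()):
        return full, []  # all contacted replicas agree
    # Mismatch: fetch full rows and reconcile on the newest timestamp.
    rows = {name: replicas[name] for name in contacted}
    newest = max(rows.values(), key=lambda row: row[0])
    repaired = [name for name, row in rows.items() if row != newest]
    for name in repaired:
        replicas[name] = newest  # read repair: push the newest version
    return newest, repaired
```

In this reading, a digest is never returned to the client; it only tells the coordinator whether full rows need to be fetched and compared.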
http://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsRepairNodesTOC.html
http://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsRepairNodesReadRepair.html
Node repair makes data on a replica consistent with data on other nodes and is important for every Cassandra cluster. Repair is the process
of correcting the inconsistencies so that eventually, all nodes have the same and most up-to-date data.
Repair can occur in the following ways:
✅ Hinted Handoff
During the write path, if a node that should receive data is unavailable, hints are written to the coordinator. When the node comes back online,
the coordinator can hand off the hints so that the node can catch up and write the data.
✅ Read Repair
During the read path, a query acquires data from several nodes. The acquired data from each node is checked against each other node. If a
node has outdated data, the most recent data is written back to the node.
✅ Anti-Entropy Repair
For maintenance purposes or recovery, manually run anti-entropy repair (using nodetool repair) to rectify inconsistencies on any nodes.
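The hinted handoff step in the list above can be illustrated with a minimal sketch. The `Coordinator` class and its methods are hypothetical, and real hints also expire and are stored durably, which this toy version leaves out.

```python
from collections import defaultdict


class Coordinator:
    """Toy coordinator that keeps hints for replicas that are down."""

    def __init__(self, replicas):
        self.replicas = replicas                   # name -> dict acting as node storage
        self.up = {name: True for name in replicas}
        self.hints = defaultdict(list)             # name -> writes that node missed

    def write(self, key, value):
        """Write to every live replica; store a hint for each one that is down."""
        acks = 0
        for name, store in self.replicas.items():
            if self.up[name]:
                store[key] = value
                acks += 1
            else:
                self.hints[name].append((key, value))
        return acks

    def node_came_back(self, name):
        """Replay the stored hints so the node catches up."""
        self.up[name] = True
        for key, value in self.hints.pop(name, []):
            self.replicas[name][key] = value


if __name__ == "__main__":
    coord = Coordinator({"a": {}, "b": {}, "c": {}})
    coord.up["c"] = False
    print("acks while c is down:", coord.write("k", "v"))
    coord.node_came_back("c")
    print("c after handoff:", coord.replicas["c"])
```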
Repair
79. https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlClientRequestsWrite.html
The coordinator sends a write request to all replicas that own the row being written. As long as all replica nodes are up and available, they will get the
write regardless of the consistency level specified by the client. The write consistency level determines how many replica nodes must respond with a
success acknowledgment in order for the write to be considered successful. Success means that the data was written to the commit log and the
memtable as described in how data is written.
In a single data center 12 node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row. If the write
consistency level specified by the client is ONE, the first node [R1] to complete the write responds back to the coordinator, which then proxies the
success message back to the client [write response]. A consistency level of ONE means that it is possible that 2 of the 3 replicas [R2,R3] could miss
the write if they happened to be down at the time the request was made.
That node [coordinator] forwards the write to all replicas of that row. It responds to the client once it receives write acknowledgments from the number
of nodes specified by the consistency level.
1. If the coordinator cannot write to enough replicas to meet the requested CL, it throws an Unavailable Exception and does not perform any writes.
2. If there are enough replicas available but the required writes don't finish within the timeout window, the coordinator throws a Timeout Exception.
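The two failure cases can be sketched as follows. The exception classes, the `coordinate_write` function, and the latency-based model are all hypothetical and stand in for the coordinator's real behavior; the point is only the order of the checks: availability first, then acknowledgments within the timeout.

```python
class UnavailableError(Exception):
    """Not enough live replicas to satisfy the consistency level."""


class WriteTimeoutError(Exception):
    """Enough replicas were alive, but too few acknowledged in time."""


def coordinate_write(alive, ack_latency_ms, required_acks, timeout_ms):
    """Return the time (ms) at which the coordinator can answer the client.

    alive: replica name -> bool
    ack_latency_ms: replica name -> time the replica needs to acknowledge
    """
    live = [name for name, ok in alive.items() if ok]
    if len(live) < required_acks:
        # Checked up front, so no write is attempted at all.
        raise UnavailableError(f"{len(live)} live replicas, need {required_acks}")
    # The write goes to every live replica regardless of the consistency level.
    acks = sorted(ack_latency_ms[name] for name in live
                  if ack_latency_ms[name] <= timeout_ms)
    if len(acks) < required_acks:
        raise WriteTimeoutError(f"{len(acks)} acks within {timeout_ms} ms")
    return acks[required_acks - 1]


if __name__ == "__main__":
    alive = {"R1": True, "R2": True, "R3": True}
    latency = {"R1": 5, "R2": 20, "R3": 40}
    print("CL=ONE answers after", coordinate_write(alive, latency, 1, 100), "ms")
    print("CL=QUORUM answers after", coordinate_write(alive, latency, 2, 100), "ms")
```

At CL=ONE the client hears back as soon as the fastest replica acknowledges, even though all three replicas receive the write.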
Write consistency
82. DC:2, RF:3, CL:QUORUM => all data centers, two replicas
In multiple data center deployments, Cassandra
optimizes write performance by choosing one
coordinator node. The coordinator node contacted
by the client application forwards the write request
to each replica node in all the data centers.
If using a consistency level of LOCAL_ONE or
LOCAL_QUORUM, only the nodes in the same
data center as the coordinator node must respond
to the client request in order for the request to
succeed. This way, geographical latency does not
impact client request response times.
83. https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlClientRequestsReadExp.html
DC:1, RF:3, CL:QUORUM=>2
In a single data center cluster with a replication factor of 3, and a read consistency level of QUORUM, 2 of the 3 replicas for the given row
must respond to fulfill the read request. If the contacted replicas have different versions of the row, the replica with the most recent version will
return the requested data [to Client]. In the background, the third replica is checked for consistency with the first two, and if needed, a read
repair is initiated for the out-of-date replicas.
Read consistency
84. DC:1, RF:3, CL:ONE=>1
In a single data center cluster with a replication factor of 3, and a read consistency level of ONE, the closest replica for the given row is
contacted to fulfill the read request. In the background a read repair is potentially initiated, based on the read_repair_chance setting of the
table, for the other replicas.
85. In a two data center cluster with RF=3, and a
read consistency of QUORUM, 4 replicas for the
given row must respond to fulfill the read request.
The 4 replicas can be from any data center. In the
background, the remaining replicas are checked
for consistency with the first four, and if needed,
a read repair is initiated for the out-of-date replicas.
DC:2, RF:3, CL:QUORUM => 4 replicas from any data center
86. DC:2, RF:3, CL:LOCAL_QUORUM => 2 replicas in the local data center
In a multiple data center cluster with RF=3,
and a read consistency of LOCAL_QUORUM,
2 replicas in the same DC as the coordinator
node for the given row must respond to fulfill
the read request. In the background, the
remaining replicas are checked for consistency
with the first 2, and if needed, a read repair is
initiated for the out-of-date replicas.
87. DC:2, RF:3, CL:ONE => 1 replica in any data center
In a multiple data center cluster with RF=3,
and a read consistency of ONE, the closest replica
for the given row, regardless of data center,
is contacted to fulfill the read request. In the
background a read repair is potentially initiated,
based on the read_repair_chance setting of the
table, for the other replicas.
88. DC:2, RF:3, CL:LOCAL_ONE => 1 replica in the local data center
In a multiple data center cluster with RF=3,
and a read consistency of LOCAL_ONE, the
closest replica for the given row in the same
data center as the coordinator node is
contacted to fulfill the read request. In the
background a read repair is potentially initiated,
based on the read_repair_chance setting of the
table, for the other replicas.
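The replica counts in the read examples above follow from simple arithmetic, sketched below. The function name and the dictionary layout are made up for illustration, and the sketch assumes that "RF=3" in the two data center examples means 3 replicas in each data center, which is what makes QUORUM come out to 4.

```python
def required_replicas(cl: str, rf_per_dc: dict, local_dc: str) -> int:
    """Number of replicas that must respond for a given consistency level."""
    total_rf = sum(rf_per_dc.values())
    if cl in ("ONE", "LOCAL_ONE"):
        return 1
    if cl == "QUORUM":
        return total_rf // 2 + 1              # majority across all data centers
    if cl == "LOCAL_QUORUM":
        return rf_per_dc[local_dc] // 2 + 1   # majority within the coordinator's DC
    if cl == "ALL":
        return total_rf
    raise ValueError(f"unsupported consistency level: {cl}")


if __name__ == "__main__":
    one_dc = {"dc1": 3}
    two_dcs = {"dc1": 3, "dc2": 3}
    print("DC:1 QUORUM       =>", required_replicas("QUORUM", one_dc, "dc1"))
    print("DC:2 QUORUM       =>", required_replicas("QUORUM", two_dcs, "dc1"))
    print("DC:2 LOCAL_QUORUM =>", required_replicas("LOCAL_QUORUM", two_dcs, "dc1"))
    print("DC:2 LOCAL_ONE    =>", required_replicas("LOCAL_ONE", two_dcs, "dc1"))
```

ONE and LOCAL_ONE both need a single response; they differ only in which data centers the responding replica may come from, which this count does not capture.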
92. bloom_filter_fp_chance
Determines the percent chance of the Bloom filter returning a false positive, that is,
reporting that a partition exists in an SSTable when in fact it does not.
False positives are possible; false negatives are not.
If you increase the percent chance of false positives, then you lower memory usage via a smaller filter size at the expense of more disk seeks
due to an increase in false positives.
If you decrease the percent chance of false positives, then you increase memory usage via a larger filter size for the benefit of fewer disk
seeks thanks to fewer false positives.
https://grockdoc.com/cassandra/2.1/articles/tuning-reads-via-the-bloom-filter_88c8f57a-71d0-41ee-b77f-617c64ad4739/
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html
False positive matches are possible, but false negatives are not. In other words,
a query returns either “possibly in set” or “definitely not in set”.
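The memory versus false-positive trade-off can be made concrete with the textbook sizing formulas for an optimally configured Bloom filter: about -ln(p) / (ln 2)^2 bits per key and about -log2(p) hash functions for a target false positive chance p. The sketch below applies those generic formulas; the function names are made up for this example, and Cassandra's actual sizing for a given bloom_filter_fp_chance may differ.

```python
import math


def bloom_bits_per_key(fp_chance: float) -> float:
    """Bits per stored key for an optimal Bloom filter with the given FP chance."""
    if not 0.0 < fp_chance < 1.0:
        raise ValueError("fp_chance must be strictly between 0 and 1")
    return -math.log(fp_chance) / (math.log(2) ** 2)


def bloom_hash_count(fp_chance: float) -> int:
    """Number of hash functions that is optimal for the given FP chance."""
    if not 0.0 < fp_chance < 1.0:
        raise ValueError("fp_chance must be strictly between 0 and 1")
    return max(1, round(-math.log2(fp_chance)))


if __name__ == "__main__":
    for p in (0.1, 0.01, 0.001):
        print(f"fp_chance={p}: {bloom_bits_per_key(p):.2f} bits/key, "
              f"{bloom_hash_count(p)} hash functions")
```

Lowering the false positive chance from 0.1 to 0.01 roughly doubles the filter size per key, which is the memory cost paid for fewer unnecessary disk seeks.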