Redis Cluster 
design tradeoffs 
@antirez - Pivotal
What is performance? 
• Low latency. 
• IOPS. 
• Operations quality and data model.
Go Cluster 
• Redis Cluster must cover the same use cases as Redis. 
• Tradeoffs are inherently needed in distributed systems. 
• CAP? Merge values? Strong consistency and consensus? How to replicate values?
CP systems 
CAP: the price of consistency is added latency. 
(diagram: the client sends a write to S1, which forwards it to S2, S3 and S4)
CP systems 
Reply to client after majority ACKs 
(diagram: S1 replies to the client only after a majority of S2, S3, S4 have ACKed)
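The majority-ACK rule above can be sketched as a toy model (the callable replicas and the counting are illustrative, not Redis code):

```python
# Toy model of a CP-style write: the coordinator replies to the
# client only once a majority of replicas has ACKed the write.

def write_with_majority_ack(value, replicas):
    """replicas: list of callables returning True on a successful ACK.
    Returns True (i.e. reply to the client) once a majority ACKed."""
    needed = len(replicas) // 2 + 1  # strict majority
    acks = 0
    for replica in replicas:
        if replica(value):
            acks += 1
        if acks >= needed:
            return True   # safe to reply: a majority holds the write
    return False          # no quorum: the write cannot be acknowledged
```

With four servers this means waiting for three ACKs before replying, which is exactly where the added latency comes from.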
And… there is the disk 
(diagram: S1, S2 and S3, each with its own disk) 
CP algorithms may require fsync-before-ack. 
Durability and consistency are not always orthogonal.
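fsync-before-ack, sketched in a few lines (the log path and record format are illustrative): the ACK must mean the write survives a crash, not merely that it reached the OS page cache.

```python
import os

def append_and_ack(log_path, record):
    """Append a record to the log and fsync it before acknowledging."""
    with open(log_path, "ab") as f:
        f.write(record + b"\n")
        f.flush()              # push Python's buffer to the OS...
        os.fsync(f.fileno())   # ...and force the OS to hit the disk
    return "ACK"
```

The fsync on every write is what makes this expensive: the disk, not the network, becomes the latency floor.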
AP systems 
Eventual consistency with merges? 
(note: merge is not strictly part of EC) 
(diagram: two clients read key A from S1 and S2 and observe diverged values: 
A = {1,2,3,8,12,13,14} on one side, A = {2,3,8,11,12,1} on the other)
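One possible merge for the diverged set values above is a simple union (add-wins). This is a sketch of the idea, not something vanilla Redis Cluster does:

```python
def merge_sets(a, b):
    """Merge two diverged set values by taking the union (add-wins).
    Union is commutative and deterministic, so both sides converge
    to the same value once the partition heals."""
    return a | b
```

The tradeoff is visible immediately: deletions performed on one side are resurrected by the merge, which is why merges are a design decision and not a free lunch.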
Many kinds of consistencies 
• The “C” of CAP is strong consistency. 
• It is not the only available tradeoff, of course. 
• Consistency is the set of liveness and safety properties a given system provides. 
• “Eventual consistency” alone says almost nothing: which liveness/safety properties hold, if not “C”?
Redis Cluster 
Sharding and replication (asynchronous). 
(diagram: the client talks to two shards — one serving keys A,B,C, one serving D,E,F — and each shard has a master plus two replicas)
Asynchronous replication 
(diagram: the client writes to the master for A,B,C; the master replies to the client immediately and replicates to its slaves, which ACK asynchronously)
Full Mesh 
(diagram: every node — masters and slaves alike — is connected to every other node)
• Heartbeats. 
• Nodes gossip. 
• Failover auth. 
• Config update.
No proxy, but redirections 
(diagram: clients asking for keys A or D are redirected to the node serving the right slot among A,B,C / D,E,F / G,H,I / L,M,N / O,P,Q / R,S,T)
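A redirection-aware client can be sketched like this. The `send` transport, the node names, and the slot cache are hypothetical; the `MOVED <slot> <host:port>` reply format is the real Redis Cluster one:

```python
# Toy client-side handling of a -MOVED redirection: the client keeps
# a slot -> node cache, and a MOVED reply both fixes the cache and
# triggers one retry against the right node.

def execute(command, slot, slot_map, send, default_node):
    node = slot_map.get(slot, default_node)
    reply = send(node, command)
    if isinstance(reply, str) and reply.startswith("MOVED"):
        # "MOVED <slot> <host:port>": learn the new owner, retry once.
        _, moved_slot, target = reply.split()
        slot_map[int(moved_slot)] = target
        reply = send(target, command)
    return reply
```

After the first redirection the cache is warm, so later commands for that slot go straight to the right node — this is how the cluster avoids a proxy hop in the common case.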
Failure detection 
• Failure reports within a window of time (via gossip). 
• Trigger for the actual failover. 
• Two main states: PFAIL -> FAIL.
Failure detection 
S1 is not responding? 
(diagram: S2, S3 and S4 each independently flag S1 = PFAIL)
Failure detection 
PFAIL state propagates 
(diagram: S3 holds S1 = PFAIL and learns via gossip that S2 and S4 report it too)
Failure detection 
PFAIL state propagates 
(diagram: having collected enough reports, S3 promotes S1 = PFAIL to S1 = FAIL)
Failure detection 
Force FAIL state 
(diagram: S3 broadcasts the FAIL state, forcing S2 and S4 to mark S1 = FAIL as well)
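The PFAIL-to-FAIL promotion above can be sketched as report counting within a window. The function names and the window constant are illustrative; the idea — escalate only when fresh reports reach a majority of masters — follows the slides:

```python
# Toy PFAIL -> FAIL escalation: a node promotes its local PFAIL flag
# for a target to FAIL once a majority of masters reported the target
# as failing within a recent time window (gossip-delivered reports).

NODE_TIMEOUT = 30.0  # seconds; illustrative value

def evaluate_failure(reports, total_masters, now):
    """reports: {reporter_id: report_time} entries about the target
    (the local PFAIL observation counts as one report).
    Returns 'FAIL' when fresh reports reach a majority, else 'PFAIL'."""
    window = NODE_TIMEOUT * 2                 # reports older than this are stale
    fresh = [r for r, t in reports.items() if now - t <= window]
    needed = total_masters // 2 + 1
    return "FAIL" if len(fresh) >= needed else "PFAIL"
```

The window matters: without it, old reports from a long-healed glitch could add up over time and trigger a spurious failover.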
Global slots config 
• A master FAIL state triggers a failover. 
• Cluster needs a coherent view of the configuration. 
• Who is serving this slot currently? 
• Slots config must eventually converge.
Raft and failover 
• Config propagation is solved using ideas from the Raft algorithm (just a subset). 
• Raft is a consensus algorithm built on top of different “layers”. 
• The Raft paper is already a classic (highly recommended). 
• Full Raft is not needed for the Redis Cluster slots config.
Failover and config 
(diagram: one master has failed; its slaves increment Epoch = Epoch+1 — a logical clock — and ask the remaining masters: “Vote for me!”)
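The vote itself can be sketched as follows (class and function names are illustrative): a master grants at most one vote per epoch, which is the piece of Raft that prevents two slaves from winning the same election.

```python
# Toy epoch-based failover vote: a master grants its vote to a
# slave's request only if the request's epoch is newer than any
# epoch it has already voted in.

class Master:
    def __init__(self):
        self.last_vote_epoch = 0

    def handle_vote_request(self, request_epoch):
        if request_epoch > self.last_vote_epoch:
            self.last_vote_epoch = request_epoch
            return True   # vote granted
        return False      # already voted in this (or a newer) epoch

def run_election(candidate_epoch, masters):
    """The candidate slave wins with a majority of master votes."""
    votes = sum(m.handle_vote_request(candidate_epoch) for m in masters)
    return votes >= len(masters) // 2 + 1
```

Because votes are keyed on the epoch, a second candidate in the same epoch cannot collect a majority; it must bump the epoch and try again.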
Too easy? 
• Why don’t we need full Raft? 
• Because our config is idempotent: when the partition heals we can simply throw away stale slots configs in favor of newer versions. 
• The same algorithm is used in Sentinel v2 and works well.
Config propagation 
• After a successful failover, the new slot config is broadcasted. 
• If there are partitions, config gets updated when they heal (it is broadcasted from time to time, plus stale-config detection and UPDATE messages). 
• Config with the greater Epoch always wins.
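“Config with the greater Epoch always wins” is a one-function rule. A sketch, with illustrative data structures:

```python
# A node overwrites its view of who serves a slot only when the
# advertised config carries a strictly higher epoch, so stale
# broadcasts arriving from a healed partition are simply ignored.

def apply_slot_config(local, slot, node, epoch):
    """local: {slot: (serving_node, epoch)}. Returns True if updated."""
    _, current_epoch = local.get(slot, (None, 0))
    if epoch > current_epoch:
        local[slot] = (node, epoch)
        return True
    return False  # stale or equal epoch: keep the current config
```

This is what makes the config idempotent: applying the same messages in any order, any number of times, converges every node to the highest-epoch view.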
Redis Cluster consistency? 
• Eventually consistent: last failover wins. 
• In the “vanilla” design the number of lost writes is unbounded. 
• Mechanisms to avoid unbounded data loss.
Failure mode… #1 
(diagram: the client writes to the master for A,B,C; the master fails before replicating, a slave is promoted, and the write is lost)
Failure mode #2 
(diagram: a partition splits the cluster; the client keeps writing to an A,B,C master stranded on the minority side, while the majority side — with D,E,F and G,H,I — elects a new master for A,B,C)
Bounded divergences 
(diagram: after node-timeout the minority side stops accepting writes for A,B,C, while the majority side — D,E,F and G,H,I — continues to serve)
More data safety? 
• OP logging until the async ACK is received. 
• Re-played to the master when the node turns into a slave. 
• “Safe” connections, on demand. 
• Example: SADD (idempotent + commutative). 
• SET-LWW foo bar <wall-clock>.
Multi key ops 
• Hey, hashtags! 
• {user:1000}.following and {user:1000}.followers hash to the same slot. 
• Unavailable for small windows, but no data exchange between nodes.
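This works because of how keys map to the 16384 hash slots: if the key contains a non-empty `{...}` section, only that substring is hashed. Redis Cluster uses the CRC16-XMODEM variant (polynomial 0x1021, init 0); a compact reimplementation:

```python
# Key -> hash slot mapping as specified by Redis Cluster, including
# the hashtag rule: {user:1000}.following and {user:1000}.followers
# both hash only "user:1000" and so land in the same slot.

def crc16(data: bytes) -> int:
    """CRC16-XMODEM (poly 0x1021, init 0, no reflection)."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = (crc << 1) ^ 0x1021 if crc & 0x8000 else crc << 1
            crc &= 0xFFFF
    return crc

def key_hash_slot(key: str) -> int:
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end != -1 and end != start + 1:   # ignore an empty "{}"
            key = key[start + 1:end]         # hash only the tag
    return crc16(key.encode()) % 16384
```

Putting both of a user’s keys in one slot is what makes multi-key operations on them possible at all: the cluster never moves data between nodes to serve a command.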
Multi key ops 
(availability) 
• Single key ops: always available during resharding. 
• Multi key ops are available if: 
  • No manual resharding of this hash slot is in progress. 
  • Resharding is in progress, but the source or destination node has all the keys. 
• Otherwise we get a -TRYAGAIN error.
(diagram: SUNION key_A key_B returns -TRYAGAIN when the keys — {User:1}.key_A and {User:2}.Key_B — are not all on one node; when both {User:1}.key_A and {User:1}.Key_B sit on the same node, SUNION key_A key_B returns its output)
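Since -TRYAGAIN is transient (it ends when the slot migration completes), the natural client reaction is to back off and retry. A sketch; `send` is a hypothetical transport and the retry parameters are illustrative:

```python
import time

# Client-side handling of -TRYAGAIN during resharding: retry the
# multi-key command with a short linear backoff instead of failing.

def call_with_tryagain(send, command, retries=5, backoff=0.05):
    for attempt in range(retries):
        reply = send(command)
        if reply != "-TRYAGAIN":
            return reply
        time.sleep(backoff * (attempt + 1))   # linear backoff
    raise RuntimeError("slot still resharding after %d retries" % retries)
```
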
Redis Cluster ETA 
• Release Candidate available. 
• We’ll go stable in Q1 2015. 
• Ask me anything.
Salvatore Sanfilippo – How Redis Cluster works, and why - NoSQL matters Barcelona 2014

In this talk the algorithmic details of Redis Cluster will be exposed in order to show what the design tensions were in the clustered version of a high-performance database supporting complex data types, the selected tradeoffs, and their effect on the availability and consistency of the resulting solution. Other non-chosen solutions in the design space will be illustrated for completeness.
