SlideShare ist ein Scribd-Unternehmen logo
1 von 47
Downloaden Sie, um offline zu lesen
Dynamic Dynamos:
Riak and Cassandra
Jason Brown, @jasobrown
Senior Software Engineer, Netflix
#riconwest 2013-Oct-30
CHOICE
Choice
The whole "NoSQL movement" is really about
choice. At scale there will never be a single
solution that is best for everyone.
@jtuple, “Absolute Consistency”, Riak ML, 2012-Jan-11
http://lists.basho.com/pipermail/riak-users_lists.basho.com/2012-January/007157.html
whoami
● @Netflix, > 5 years
● Apache Cassandra committer
● wannabe distributed systems geek
Netflix and Cassandra
●
●
●
●

long-time Oracle shop
Aug 2008
needed new db for cloud migration
2010 - selected cassandra
○ dynamo-style, masterless system
○ multi-datacenter support
○ written in Java
Netflix’s C* Prod Deployment
Production clusters

> 65

Production nodes

> 2300

Multi-region clusters

> 40

Most regions used

4 (three clusters)

Total data

~300 TB

Largest cluster

288 nodes (actually, 576 nodes)

Max reads/writes

300k rps / 1.3m wps
Jason’s whiteboard, summer 2013
<image of white board here>
Why Riak?
● Another dynamo-style system
● vector clocks
● not java
● not jvm
Talk Structure
1. Comparisons
a. Write/Read path
b. Conflict resolution
c. Anti-entropy
d. Multiple Datacenter support

2. Riak @ Netflix
Data modeling
Riak
● key -> value
Cassandra
● columnar layout
● row key with one to many columns
Virtual Nodes
● Split hash ring into smaller chunks
● Physical node responsible for 1..n tokens
Cassandra
● purely for routing
Riak
● burrowed deep into the code base
Write Path - Cassandra
● Coordinator gets request
● Determine replica nodes, in all DCs
● Send to all in local DC
● Send to one replica in each remote DC
● All respond back to coordinator
○ block for consistency_level nodes

● Execute triggers
Tunable Consistency
Coordinator blocks for specified replica count to
respond
Consistency Levels:
● ALL
● EACH_QUORUM
● LOCAL_QUORUM / LOCAL_ONE
● ONE / TWO / THREE
● ANY
Put Path - Riak
●
●
●
●
●

Node gets request
determine vnodes (preflist)
if node not in preflist, forward
run precommit hooks
perform coordinating put
○ looks for previous riak object
○ increment vclock

● calls other preflist vnodes to put
● prepare return value (mult vclocks)
● coordinator vnode runs postcommit hooks
riak “consistency levels”
● n (n_val) - vnode replication count
● r - read count
● w - write count
○ {all | one | quorum | <int>}

● pr / pw - primary read/write count
● dw - durable write
bucket defaults can be overriden on request
That was the happy path
...
what about partitions?
Hinted Handoff - Riak
Sloppy quorum
● preflist fallbacks to secondary vnodes
○ skip unavailable primary vnode
○ use next available vnode in ring

● Send data to vnode when available
Put data is written, and available for reads
Hinted Handoff - Cassandra
Coordinator stores hints
○ for unavailable nodes
○ if replica fails to respond

Replay hints to node when available
Mutation is stored, but data not read available
CL.ANY - stores a hint if no replicas available
Read path - Riak
●
●
●
●

coordinator gets request
determine preflist
send request to all vnodes in preflist
when read count of vnodes return
○ merge values
○ possibly read repair
Read Repair - Riak
● compare vclocks for object
● if resolvable differences, ship newest object
to out of date vnodes
● return object with latest vclock to client
Read Path - Cassandra
● Determine replicas to invoke
○ based on consistency level

● First replica responds with full data set,
others send digests
● Coordinator waits for consistency_level
nodes to respond
Consistent Read - Cassandra
● compare digest of columns from replicas
● If any mismatches:
○ re-request full data set from same replicas
○ compare full data sets, send updates
○ block until out of date replicas respond

● Return merged data set to client
Read Repair - Cassandra
Converge requested data across all replicas
Piggy-backs on normal reads, but waits for all
replicas to respond (async)
Follows same alg as consistent reads
Conflict resolution - Riak
Vector clocks
●
●
●
●
●
●

logical clock per object
array of {incrementer, version, timestamp}
maintains causal relationships
safe in face of ‘concurrent’ writes
performance penalty
resolution burden pushed to caller
Conflict Resolution - Cassandra
Last Writer Wins
● every column has timestamp value
● “whatever timestamp caller passed in”
● “What time is it?”
○

http://aphyr.com/posts/299-the-trouble-with-timestamps

● faster
● system resolves conflicts
Anti-entropy
Converge cold data
Merkle Tree exchange
Stream inconsistencies
IO/CPU intensive
Anti-entropy - Cassandra
Node repair - converges ranges owned by a
node with all replicas
● Initiator identifies peers
● Each participant reads range from disk,
generates Merkle Tree, return MT
● Initiator compares all MTs
● Range exchange
Anti-Entropy - Riak
AAE - conceptually similar to Cassandra
Merkle Tree updated on every write
Leaf nodes contain keys, not hash value
Tree is rebuilt periodically
Each execution only between two vnodes
Multi Datacenter support
Cassandra
● in the box
● node interconnects are plain TCP scokets
○ two connections per node pair

● queries not restricted to local DC
○ read repair
○ node repair
Riak
● included in RiakEE (MDC)
● local nodes use disterl
● remote nodes use TCP
● queries do not span multiple regions
● repl types:
○ realtime
○ fullsync

● AAE
Riak @ Netflix
(hypothetically)
Subscriber data
c* = wide row implementation
● row key = custId (long)
● column per distinct attribute
○
○
○
○

subscriberId
name
subscription details
holds

riak = fit reasonably well with JSON/text blob
Movie Ratings
c* implementation:
● new ratings stored in individual columns
● recurring job to aggregate into JSON blob
● reads grab JSON + incremental updates

Riak = JSON blob already, append new ratings
Viewing History
Time-series of ‘viewable’ events
● one column per event
● playback/bookmark serialized JSON blob
● 7-8 months worth of playback data

Riak - time-series data doesn’t feel like a
natural fit
“Large blob” storage
● Team wanted to store images in c*
● key -> one column
● blob size
Right in the wheelhouse for Riak/RiakCS
Operations
Priam (https://github.com/Netflix/priam)
● backup/restore
● Cassandra bootstrap / token assignment
● configuration management
● supports multi-region deployments
Every node decentralized from peers
Priam for Riak
(perceived) challenges in supporting Riak
● some degree of centralization
○ cluster launch
○ backups for eLevelDb

● prod -> test refresh (riak reip)
● MDC
BI Integration
Aegthithus - pipeline for importing into BI
● grab nightly backup snapshot for cluster
● convert to JSON
● merge, dedupe, find new data
● import into Hive, Teradata, etc
Downside is (semi-) stale data into BI
Alternative BI Integration
Live, secondary cluster
● C* - just another datacenter in cluster
● Riak - MultiDataCenter (MDC) sloution
All mutations sent to secondary cluster
● what happens when things get slow?
● now part of c* repairs & riak full-sync
Wrap up
● Choice
● Cassandra and Riak are great databases
○ resilient to failure
○ flexible data modeling
○ strong communities

● Running databases in the cloud ain’t easy
Thank you, Basho!
Mark Phillips, Jordan West, Joe Blomstedt,
Andrew Thompson, @evanmcc,
Basho Tech Support
Q & A time

@jasobrown

Weitere ähnliche Inhalte

Andere mochten auch

Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowAdrian Cockcroft
 
Netflix Cloud Platform and Open Source
Netflix Cloud Platform and Open SourceNetflix Cloud Platform and Open Source
Netflix Cloud Platform and Open Sourceaspyker
 
Amazon Web Services Security
Amazon Web Services SecurityAmazon Web Services Security
Amazon Web Services SecurityJason Chan
 
Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4aspyker
 
Netflix Webkit-Based UI for TV Devices
Netflix Webkit-Based UI for TV DevicesNetflix Webkit-Based UI for TV Devices
Netflix Webkit-Based UI for TV DevicesMatt McCarthy
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thingaspyker
 
Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Adrian Cockcroft
 
Careers in Security
Careers in SecurityCareers in Security
Careers in SecurityJason Chan
 
The Psychology of Security Automation
The Psychology of Security AutomationThe Psychology of Security Automation
The Psychology of Security AutomationJason Chan
 
Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Programaspyker
 
Splitting the Check on Compliance and Security
Splitting the Check on Compliance and SecuritySplitting the Check on Compliance and Security
Splitting the Check on Compliance and SecurityJason Chan
 
Defending Netflix from Abuse
Defending Netflix from AbuseDefending Netflix from Abuse
Defending Netflix from AbuseJason Chan
 
Big data para principiantes
Big data para principiantesBig data para principiantes
Big data para principiantesCarlos Toxtli
 
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesAdrian Cockcroft
 

Andere mochten auch (16)

Netflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search RoadshowNetflix Global Applications - NoSQL Search Roadshow
Netflix Global Applications - NoSQL Search Roadshow
 
Netflix Cloud Platform and Open Source
Netflix Cloud Platform and Open SourceNetflix Cloud Platform and Open Source
Netflix Cloud Platform and Open Source
 
Amazon Web Services Security
Amazon Web Services SecurityAmazon Web Services Security
Amazon Web Services Security
 
Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4Netflix OSS Meetup Season 4 Episode 4
Netflix OSS Meetup Season 4 Episode 4
 
Netflix Webkit-Based UI for TV Devices
Netflix Webkit-Based UI for TV DevicesNetflix Webkit-Based UI for TV Devices
Netflix Webkit-Based UI for TV Devices
 
Netflix and Containers: Not A Stranger Thing
Netflix and Containers:  Not A Stranger ThingNetflix and Containers:  Not A Stranger Thing
Netflix and Containers: Not A Stranger Thing
 
Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013Bottleneck analysis - Devopsdays Silicon Valley 2013
Bottleneck analysis - Devopsdays Silicon Valley 2013
 
Careers in Security
Careers in SecurityCareers in Security
Careers in Security
 
The Psychology of Security Automation
The Psychology of Security AutomationThe Psychology of Security Automation
The Psychology of Security Automation
 
Riak perf wins
Riak perf winsRiak perf wins
Riak perf wins
 
Netflix Open Source: Building a Distributed and Automated Open Source Program
Netflix Open Source:  Building a Distributed and Automated Open Source ProgramNetflix Open Source:  Building a Distributed and Automated Open Source Program
Netflix Open Source: Building a Distributed and Automated Open Source Program
 
Splitting the Check on Compliance and Security
Splitting the Check on Compliance and SecuritySplitting the Check on Compliance and Security
Splitting the Check on Compliance and Security
 
Defending Netflix from Abuse
Defending Netflix from AbuseDefending Netflix from Abuse
Defending Netflix from Abuse
 
Big data para principiantes
Big data para principiantesBig data para principiantes
Big data para principiantes
 
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with NotesYow Conference Dec 2013 Netflix Workshop Slides with Notes
Yow Conference Dec 2013 Netflix Workshop Slides with Notes
 
Culture
CultureCulture
Culture
 

Kürzlich hochgeladen

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilV3cube
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 

Kürzlich hochgeladen (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 

Ricon2013 preso upload

  • 1. Dynamic Dynamos: Riak and Cassandra Jason Brown, @jasobrown Senior Software Engineer, Netflix #riconwest 2013-Oct-30
  • 2.
  • 4. Choice The whole "NoSQL movement" is really about choice. At scale there will never be a single solution that is best for everyone. @jtuple, “Absolute Consistency”, Riak ML, 2012-Jan-11 http://lists.basho.com/pipermail/riak-users_lists.basho.com/2012-January/007157.html
  • 5. whoami ● @Netflix, > 5 years ● Apache Cassandra committer ● wannabe distributed systems geek
  • 6. Netflix and Cassandra ● ● ● ● long-time Oracle shop Aug 2008 needed new db for cloud migration 2010 - selected cassandra ○ dynamo-style, masterless system ○ multi-datacenter support ○ written in Java
  • 7. Netflix’s C* Prod Deployment Production clusters > 65 Production nodes > 2300 Multi-region clusters > 40 Most regions used 4 (three clusters) Total data ~300 TB Largest cluster 288 nodes (actually, 576 nodes) Max reads/writes 300k rps / 1.3m wps
  • 8. Jason’s whiteboard, summer 2013 <image of white board here>
  • 9. Why Riak? ● Another dynamo-style system ● vector clocks ● not java ● not jvm
  • 10. Talk Structure 1. Comparisons a. Write/Read path b. Conflict resolution c. Anti-entropy d. Multiple Datacenter support 2. Riak @ Netflix
  • 11. Data modeling Riak ● key -> value Cassandra ● columnar layout ● row key with one to many columns
  • 12. Virtual Nodes ● Split hash ring into smaller chunks ● Physical node responsible for 1..n tokens Cassandra ● purely for routing Riak ● burrowed deep into the code base
  • 13. Write Path - Cassandra ● Coordinator gets request ● Determine replica nodes, in all DCs ● Send to all in local DC ● Send to one replica in each remote DC ● All respond back to coordinator ○ block for consistency_level nodes ● Execute triggers
  • 14.
  • 15.
  • 16. Tunable Consistency Coordinator blocks for specified replica count to respond Consistency Levels: ● ALL ● EACH_QUORUM ● LOCAL_QUORUM / LOCAL_ONE ● ONE / TWO / THREE ● ANY
  • 17. Put Path - Riak ● ● ● ● ● Node gets request determine vnodes (preflist) if node not in preflist, forward run precommit hooks perform coordinating put ○ looks for previous riak object ○ increment vclock ● calls other preflist vnodes to put ● prepare return value (mult vclocks) ● coordinator vnode runs postcommit hooks
  • 18.
  • 19.
  • 20. riak “consistency levels” ● n (n_val) - vnode replication count ● r - read count ● w - write count ○ {all | one | quorum | <int>} ● pr / pw - primary read/write count ● dw - durable write bucket defaults can be overriden on request
  • 21. That was the happy path ... what about partitions?
  • 22. Hinted Handoff - Riak Sloppy quorum ● preflist fallbacks to secondary vnodes ○ skip unavailable primary vnode ○ use next available vnode in ring ● Send data to vnode when available Put data is written, and available for reads
  • 23. Hinted Handoff - Cassandra Coordinator stores hints ○ for unavailable nodes ○ if replica fails to respond Replay hints to node when available Mutation is stored, but data not read available CL.ANY - stores a hint if no replicas available
  • 24. Read path - Riak ● ● ● ● coordinator gets request determine preflist send request to all vnodes in preflist when read count of vnodes return ○ merge values ○ possibly read repair
  • 25. Read Repair - Riak ● compare vclocks for object ● if resolvable differences, ship newest object to out of date vnodes ● return object with latest vclock to client
  • 26. Read Path - Cassandra ● Determine replicas to invoke ○ based on consistency level ● First replica responds with full data set, others send digests ● Coordinator waits for consistency_level nodes to respond
  • 27. Consistent Read - Cassandra ● compare digest of columns from replicas ● If any mismatches: ○ re-request full data set from same replicas ○ compare full data sets, send updates ○ block until out of date replicas respond ● Return merged data set to client
  • 28. Read Repair - Cassandra Converge requested data across all replicas Piggy-backs on normal reads, but waits for all replicas to respond (async) Follows same alg as consistent reads
  • 29. Conflict resolution - Riak Vector clocks ● ● ● ● ● ● logical clock per object array of {incrementer, version, timestamp} maintains causal relationships safe in face of ‘concurrent’ writes performance penalty resolution burden pushed to caller
  • 30. Conflict Resolution - Cassandra Last Writer Wins ● every column has timestamp value ● “whatever timestamp caller passed in” ● “What time is it?” ○ http://aphyr.com/posts/299-the-trouble-with-timestamps ● faster ● system resolves conflicts
  • 31. Anti-entropy Converge cold data Merkle Tree exchange Stream inconsistencies IO/CPU intensive
  • 32. Anti-entropy - Cassandra Node repair - converges ranges owned by a node with all replicas ● Initiator identifies peers ● Each participant reads range from disk, generates Merkle Tree, return MT ● Initiator compares all MTs ● Range exchange
  • 33. Anti-Entropy - Riak AAE - conceptually similar to Cassandra Merkle Tree updated on every write Leaf nodes contain keys, not hash value Tree is rebuilt periodically Each execution only between two vnodes
  • 34. Multi Datacenter support Cassandra ● in the box ● node interconnects are plain TCP scokets ○ two connections per node pair ● queries not restricted to local DC ○ read repair ○ node repair
  • 35. Riak ● included in RiakEE (MDC) ● local nodes use disterl ● remote nodes use TCP ● queries do not span multiple regions ● repl types: ○ realtime ○ fullsync ● AAE
  • 37. Subscriber data c* = wide row implementation ● row key = custId (long) ● column per distinct attribute ○ ○ ○ ○ subscriberId name subscription details holds riak = fit reasonably well with JSON/text blob
  • 38. Movie Ratings c* implementation: ● new ratings stored in individual columns ● recurring job to aggregate into JSON blob ● reads grab JSON + incremental updates Riak = JSON blob already, append new ratings
  • 39. Viewing History Time-series of ‘viewable’ events ● one column per event ● playback/bookmark serialized JSON blob ● 7-8 months worth of playback data Riak - time-series data doesn’t feel like a natural fit
  • 40. “Large blob” storage ● Team wanted to store images in c* ● key -> one column ● blob size Right in the wheelhouse for Riak/RiakCS
  • 41. Operations Priam (https://github.com/Netflix/priam) ● backup/restore ● Cassandra bootstrap / token assignment ● configuration management ● supports multi-region deployments Every node decentralized from peers
  • 42. Priam for Riak (perceived) challenges in supporting Riak ● some degree of centralization ○ cluster launch ○ backups for eLevelDb ● prod -> test refresh (riak reip) ● MDC
  • 43. BI Integration Aegthithus - pipeline for importing into BI ● grab nightly backup snapshot for cluster ● convert to JSON ● merge, dedupe, find new data ● import into Hive, Teradata, etc Downside is (semi-) stale data into BI
  • 44. Alternative BI Integration Live, secondary cluster ● C* - just another datacenter in cluster ● Riak - MultiDataCenter (MDC) sloution All mutations sent to secondary cluster ● what happens when things get slow? ● now part of c* repairs & riak full-sync
  • 45. Wrap up ● Choice ● Cassandra and Riak are great databases ○ resilient to failure ○ flexible data modeling ○ strong communities ● Running databases in the cloud ain’t easy
  • 46. Thank you, Basho! Mark Phillips, Jordan West, Joe Blomstedt, Andrew Thompson, @evanmcc, Basho Tech Support
  • 47. Q & A time @jasobrown