SlideShare ist ein Scribd-Unternehmen logo
1 von 54
Downloaden Sie, um offline zu lesen
NO COMPROMISES:
DISTRIBUTED TRANSACTIONS WITH
CONSISTENCY,
AVAILABILITY AND
PERFORMANCE
DRAGOJEVIC ET. AL., SOSP '15
TODAY
> Overview of FaRM, plus technological context
> No proofs this time! (yay)
> Only cursory overview of recovery protocol
WHAT'S TO
LOVE?
1. CHALLENGE
TO ORTHODOXY
2. FORWARD LOOKING
.. (WITHOUT BEING OVERLY SPECULATIVE)
3. ENGINEERING
IS GREAT
DO WE NEED TO
COMPROMISE?
1980S: DISKS ARE SLOW AND MEMORY IS SMALL
1980S: DISKS ARE SLOW AND MEMORY IS SMALL
... SO LET'S INVENT GRACE JOIN AND FRIENDS.1
1
'Implementation techniques for main memory database systems', DeWitt et. al., SIGMOD'84
1990S: WANS ARE SLOW!
1990S: WANS ARE SLOW!
... SO LET'S BUILD A CROSS-SITE OPTIMIZER2
2
'Mariposa: a wide-area distributed database system', Stonebraker et. al.
2000S: MEMORY IS SLOW!
2000S: MEMORY IS SLOW!... SO LET'S BUILD A CACHE-EFFICIENT JOIN ALGORITHM (X-100)3
3
'Database Architecture Optimized for the new Bottleneck, Memory Access', Boncz et. al., VLDB'99
2010: DISKS ARE SLOW AGAIN!
2010: DISKS ARE SLOW AGAIN!
... SO LET'S PUT LOTS OF THEM IN A SINGLE MACHINE!
DATABASE SYSTEM DESIGN CAN BE VIEWED AS
AN EXERCISE IN CHASING A MOVING TARGET.
2015: CPUS ARE
GOING TO
BECOME SLOW
2015: CPUS ARE GOING TO BECOME SLOW
... WHAT CAN WE DO ABOUT IT?
WHY ARE CPUS GOING TO BECOME SLOW?
> Non-volatile storage is going to get much, much quicker
> Message latency is going to decrease
WHY ARE CPUS GOING TO BECOME SLOW?
> Non-volatile storage is going to get much, much quicker
> Message latency is going to decrease
AND BOTH WILL BECOME AFFORDABLE IN
DATACENTERS
FASTER NON-VOLATILE STORAGE
> Add a UPS to main memory
> When power is lost, write to SSD!
> NV-DRAM is not new, but this is a cheap (effective) hack.
LOW-LATENCY IN-DATACENTER MESSAGING
> Remote Direct Memory Access (RDMA) is a low-latency
link (v1) or IP (v2)-level protocol
> Allows machines to directly access memory of remote
peers
> with no CPU involvement at all!
> Infiniband was expensive, but RDMA-over-Ethernet
(RoCE) is cheaper and becoming popular.
DISTRIBUTED
DATABASE
CONTEXT
DURABILITY REQUIRES WRITES TO NON-VOLATILE STORAGE
MESSAGING IS EXTREMELY CPU EXPENSIVE
THE CPU COST OF AN RPC:
> Interrupt for kernel service
> Memory copy into kernel
> Copy into userspace
> Wake-up handler thread
> De-serialize message
> Do something
THE CPU COST OF AN RPC:
> Interrupt for kernel service
> Memory copy into kernel
> Copy into userspace
> Wake-up handler thread
> De-serialize message
> Do something
4
4
'Profiling a warehouse-scale computer', Kanev et. al. ISCA'15
RDMA
> No CPU on the usual write or read path
> NIC has its own set of page tables (without paging)
> Address memory regions directly
> FaRM uses two data structures:
> Transactional log
> Messaging ring-buffer
FARM
TWO PAPERS:
> 'No compromises...', Dragojevic et. al., SOSP'15
> 'FaRM: Fast Remote Memory', Dragojevic et. al., NSDI'14
MAIN CONTRIBUTIONS:
> Very low-latency, high-throughput transactional
system.
> Very fast failure detection / recovery protocol
> Unusual distributed system architecture based on
Vertical Paxos
> Commit protocol optimised for RDMA / low message
count
WHAT YOU GET: ABSTRACTIONS
> Global address space of addressable memory
> Transactional API, including lock-free reads
PROGRAMMING MODEL
> Application threads run in FARM servers
> Can perform arbitrary logic during transaction (but no
side-effects, please!)
> May have to deal with anomolies on read, thanks to
optimistic commit
SYSTEM ARCHITECTURE
ADDRESSABLE MEMORY: REGIONS
> Memory is partitioned into 2GB regions, pinned into
memory on each machine
> Regions are served by a primary, but have f backups
> Region->primary mapping is maintained by the
'configuration manager'
> Regions may be co-located at application's behest
HOW A CHUNK OF MEMORY BECOMES A
REGION
> Two-phase commit from CM (initiated by machine)
> Ensures that all replicas have mapping before it gets
used
REGION MAPPING RECOVERY?
> State is present in the cluster, so if CM fails can
recover it from active replicas.
> Individual machines cache mapping after fetching
through RDMA
TRANSACTIONAL
PROTOCOL
OPTIMISTIC CONCURRENCY:
TRANSACTIONS MAY FAIL AFTER LOCK ACQUISITION
COMMIT PROTOCOL
COMMIT PROTOCOL
COMMIT PROTOCOL NOTES
> All communication is over RDMA
> Total message delays not fewer than Paxos
> But total number of messages is:
4P(2f + 1) vs Pw(f+3) + Pr
> And some of those are extremely cheap
*
FAILURE DETECTION
AND RECOVERY
LEASES
> i.e. registration + keepalive, created by three-way-
handshake
> 5ms leases for 90-node cluster, with 1ms-frequency
retries!!
LEASES - HOW THEY DID IT
> Preallocation of lease manager memory
> Pin code in RAM
> Keep hardware threads free
> Use unreliable transport
SEVEN-STEP PROCESS TOWARDS RECOVERY
1. Suspect - block external requests
2. Probe - check for correlated failures
3. Update configuration - atomically move configuration
to next version in ZK
4. Remap regions - recover replication guarantee from
existing replicas
SEVEN-STEP PROCESS: COMMIT PROTOCOL
1. Send new configuration - replicas are informed of
new configuration
2. Apply new configuration - replicas update their
configurations in parallel, and wait...
3. Commit new configuration - replicas are told to
start serving requests again
Commit protocol ensures consistent membership state,
TRANSACTION RECOVERY
THANKS! QUESTIONS?
@HENRYR / HENRY.ROBINSON@GMAIL.COM
Papers We Love - FaRM distributed transactions (Henry Robinson)

Weitere ähnliche Inhalte

Was ist angesagt?

Cpu高效编程技术
Cpu高效编程技术Cpu高效编程技术
Cpu高效编程技术
Feng Yu
 
CS4344 Lecture 9: Traffic Analysis
CS4344 Lecture 9: Traffic AnalysisCS4344 Lecture 9: Traffic Analysis
CS4344 Lecture 9: Traffic Analysis
Wei Tsang Ooi
 

Was ist angesagt? (20)

Understanding DPDK
Understanding DPDKUnderstanding DPDK
Understanding DPDK
 
LF_OVS_17_Ingress Scheduling
LF_OVS_17_Ingress SchedulingLF_OVS_17_Ingress Scheduling
LF_OVS_17_Ingress Scheduling
 
Packet Framework - Cristian Dumitrescu
Packet Framework - Cristian DumitrescuPacket Framework - Cristian Dumitrescu
Packet Framework - Cristian Dumitrescu
 
How to Speak Intel DPDK KNI for Web Services.
How to Speak Intel DPDK KNI for Web Services.How to Speak Intel DPDK KNI for Web Services.
How to Speak Intel DPDK KNI for Web Services.
 
Enable DPDK and SR-IOV for containerized virtual network functions with zun
Enable DPDK and SR-IOV for containerized virtual network functions with zunEnable DPDK and SR-IOV for containerized virtual network functions with zun
Enable DPDK and SR-IOV for containerized virtual network functions with zun
 
Cpu高效编程技术
Cpu高效编程技术Cpu高效编程技术
Cpu高效编程技术
 
The n00bs guide to ovs dpdk
The n00bs guide to ovs dpdkThe n00bs guide to ovs dpdk
The n00bs guide to ovs dpdk
 
Linux Network Stack
Linux Network StackLinux Network Stack
Linux Network Stack
 
CS4344 Lecture 9: Traffic Analysis
CS4344 Lecture 9: Traffic AnalysisCS4344 Lecture 9: Traffic Analysis
CS4344 Lecture 9: Traffic Analysis
 
Performance Lessons learned in vRouter - Stephen Hemminger
Performance Lessons learned in vRouter - Stephen HemmingerPerformance Lessons learned in vRouter - Stephen Hemminger
Performance Lessons learned in vRouter - Stephen Hemminger
 
Dpdk applications
Dpdk applicationsDpdk applications
Dpdk applications
 
Open vSwitch Implementation Options
Open vSwitch Implementation Options Open vSwitch Implementation Options
Open vSwitch Implementation Options
 
OpenvSwitch Deep Dive
OpenvSwitch Deep DiveOpenvSwitch Deep Dive
OpenvSwitch Deep Dive
 
Introduction to DPDK
Introduction to DPDKIntroduction to DPDK
Introduction to DPDK
 
Open vSwitch - Stateful Connection Tracking & Stateful NAT
Open vSwitch - Stateful Connection Tracking & Stateful NATOpen vSwitch - Stateful Connection Tracking & Stateful NAT
Open vSwitch - Stateful Connection Tracking & Stateful NAT
 
First steps on CentOs7
First steps on CentOs7First steps on CentOs7
First steps on CentOs7
 
DPDK Support for New HW Offloads
DPDK Support for New HW OffloadsDPDK Support for New HW Offloads
DPDK Support for New HW Offloads
 
Virtual Distro Dispatcher - A light-weight Desktop-as-a-Service solution
Virtual Distro Dispatcher - A light-weight Desktop-as-a-Service solutionVirtual Distro Dispatcher - A light-weight Desktop-as-a-Service solution
Virtual Distro Dispatcher - A light-weight Desktop-as-a-Service solution
 
Mercurial
MercurialMercurial
Mercurial
 
Dpdk performance
Dpdk performanceDpdk performance
Dpdk performance
 

Ähnlich wie Papers We Love - FaRM distributed transactions (Henry Robinson)

Intro (Distributed computing)
Intro (Distributed computing)Intro (Distributed computing)
Intro (Distributed computing)
Sri Prasanna
 
Basic ccna interview questions and answers ~ sysnet notes
Basic ccna interview questions and answers ~ sysnet notesBasic ccna interview questions and answers ~ sysnet notes
Basic ccna interview questions and answers ~ sysnet notes
Vamsi Krishna Kalavala
 

Ähnlich wie Papers We Love - FaRM distributed transactions (Henry Robinson) (20)

Coa presentation3
Coa presentation3Coa presentation3
Coa presentation3
 
Offloading for Databases - Deep Dive
Offloading for Databases - Deep DiveOffloading for Databases - Deep Dive
Offloading for Databases - Deep Dive
 
DataStax: Extreme Cassandra Optimization: The Sequel
DataStax: Extreme Cassandra Optimization: The SequelDataStax: Extreme Cassandra Optimization: The Sequel
DataStax: Extreme Cassandra Optimization: The Sequel
 
2009-01-28 DOI NBC Red Hat on System z Performance Considerations
2009-01-28 DOI NBC Red Hat on System z Performance Considerations2009-01-28 DOI NBC Red Hat on System z Performance Considerations
2009-01-28 DOI NBC Red Hat on System z Performance Considerations
 
L05 parallel
L05 parallelL05 parallel
L05 parallel
 
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDLIRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
IRJET- Design And VLSI Verification of DDR SDRAM Controller Using VHDL
 
Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...
 
Intel® RDT Hands-on Lab
Intel® RDT Hands-on LabIntel® RDT Hands-on Lab
Intel® RDT Hands-on Lab
 
os
osos
os
 
VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...
VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...
VMworld 2013: Silent Killer: How Latency Destroys Performance...And What to D...
 
Practical steps to mitigate DDoS attacks
Practical steps to mitigate DDoS attacksPractical steps to mitigate DDoS attacks
Practical steps to mitigate DDoS attacks
 
Intro (Distributed computing)
Intro (Distributed computing)Intro (Distributed computing)
Intro (Distributed computing)
 
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. TanenbaumA Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
A Reimplementation of NetBSD Based on a Microkernel by Andrew S. Tanenbaum
 
MARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 AltreonicMARC ONERA Toulouse2012 Altreonic
MARC ONERA Toulouse2012 Altreonic
 
Mainframe
MainframeMainframe
Mainframe
 
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
Технологии работы с дисковыми хранилищами и файловыми системами Windows Serve...
 
Towards Software Defined Persistent Memory
Towards Software Defined Persistent MemoryTowards Software Defined Persistent Memory
Towards Software Defined Persistent Memory
 
Basic ccna interview questions and answers ~ sysnet notes
Basic ccna interview questions and answers ~ sysnet notesBasic ccna interview questions and answers ~ sysnet notes
Basic ccna interview questions and answers ~ sysnet notes
 
What a Modern Database Enables_Srini Srinivasan.pdf
What a Modern Database Enables_Srini Srinivasan.pdfWhat a Modern Database Enables_Srini Srinivasan.pdf
What a Modern Database Enables_Srini Srinivasan.pdf
 
Seastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration SummitSeastar at Linux Foundation Collaboration Summit
Seastar at Linux Foundation Collaboration Summit
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
Earley Information Science
 

Kürzlich hochgeladen (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Papers We Love - FaRM distributed transactions (Henry Robinson)

  • 1. NO COMPROMISES: DISTRIBUTED TRANSACTIONS WITH CONSISTENCY, AVAILABILITY AND PERFORMANCE DRAGOJEVIC ET. AL., SOSP '15
  • 2. TODAY > Overview of FaRM, plus technological context > No proofs this time! (yay) > Only cursory overview of recovery protocol
  • 3.
  • 4.
  • 7. 2. FORWARD LOOKING .. (WITHOUT BEING OVERLY SPECULATIVE)
  • 9. DO WE NEED TO COMPROMISE?
  • 10. 1980S: DISKS ARE SLOW AND MEMORY IS SMALL
  • 11. 1980S: DISKS ARE SLOW AND MEMORY IS SMALL ... SO LET'S INVENT GRACE JOIN AND FRIENDS.1 1 'Implementation techniques for main memory database systems', DeWitt et. al., SIGMOD'84
  • 13. 1990S: WANS ARE SLOW! ... SO LET'S BUILD A CROSS-SITE OPTIMIZER2 2 'Mariposa: a wide-area distributed database system', Stonebraker et. al.
  • 15. 2000S: MEMORY IS SLOW!... SO LET'S BUILD A CACHE-EFFICIENT JOIN ALGORITHM (X-100)3 3 'Database Architecture Optimized for the new Bottleneck, Memory Access', Boncz et. al., VLDB'99
  • 16. 2010: DISKS ARE SLOW AGAIN!
  • 17. 2010: DISKS ARE SLOW AGAIN! ... SO LET'S PUT LOTS OF THEM IN A SINGLE MACHINE!
  • 18. DATABASE SYSTEM DESIGN CAN BE VIEWED AS AN EXERCISE IN CHASING A MOVING TARGET.
  • 19. 2015: CPUS ARE GOING TO BECOME SLOW
  • 20. 2015: CPUS ARE GOING TO BECOME SLOW ... WHAT CAN WE DO ABOUT IT?
  • 21. WHY ARE CPUS GOING TO BECOME SLOW? > Non-volatile storage is going to get much, much quicker > Message latency is going to decrease
  • 22. WHY ARE CPUS GOING TO BECOME SLOW? > Non-volatile storage is going to get much, much quicker > Message latency is going to decrease AND BOTH WILL BECOME AFFORDABLE IN DATACENTERS
  • 23. FASTER NON-VOLATILE STORAGE > Add a UPS to main memory > When power is lost, write to SSD! > NV-DRAM is not new, but this is a cheap (effective) hack.
  • 24. LOW-LATENCY IN-DATACENTER MESSAGING > Remote Direct Memory Access (RDMA) is a low-latency link (v1) or IP (v2)-level protocol > Allows machines to directly access memory of remote peers > with no CPU involvement at all! > Infiniband was expensive, but RDMA-over-Ethernet (RoCE) is cheaper and becoming popular.
  • 26. DURABILITY REQUIRES WRITES TO NON-VOLATILE STORAGE
  • 27. MESSAGING IS EXTREMELY CPU EXPENSIVE
  • 28. THE CPU COST OF AN RPC: > Interrupt for kernel service > Memory copy into kernel > Copy into userspace > Wake-up handler thread > De-serialize message > Do something
  • 29. THE CPU COST OF AN RPC: > Interrupt for kernel service > Memory copy into kernel > Copy into userspace > Wake-up handler thread > De-serialize message > Do something
  • 30. 4 4 'Profiling a warehouse-scale computer', Kanev et. al. ISCA'15
  • 31. RDMA > No CPU on the usual write or read path > NIC has its own set of page tables (without paging) > Address memory regions directly > FaRM uses two data structures: > Transactional log > Messaging ring-buffer
  • 32. FARM
  • 33. TWO PAPERS: > 'No compromises...', Dragojevic et. al., SOSP'15 > 'FaRM: Fast Remote Memory', Dragojevic et. al., NSDI'14
  • 34.
  • 35. MAIN CONTRIBUTIONS: > Very low-latency, high-throughput transactional system. > Very fast failure detection / recovery protocol > Unusual distributed system architecture based on Vertical Paxos > Commit protocol optimised for RDMA / low message count
  • 36. WHAT YOU GET: ABSTRACTIONS > Global address space of addressable memory > Transactional API, including lock-free reads
  • 37. PROGRAMMING MODEL > Application threads run in FARM servers > Can perform arbitrary logic during transaction (but no side-effects, please!) > May have to deal with anomolies on read, thanks to optimistic commit
  • 39. ADDRESSABLE MEMORY: REGIONS > Memory is partitioned into 2GB regions, pinned into memory on each machine > Regions are served by a primary, but have f backups > Region->primary mapping is maintained by the 'configuration manager' > Regions may be co-located at application's behest
  • 40. HOW A CHUNK OF MEMORY BECOMES A REGION > Two-phase commit from CM (initiated by machine) > Ensures that all replicas have mapping before it gets used
  • 41. REGION MAPPING RECOVERY? > State is present in the cluster, so if CM fails can recover it from active replicas. > Individual machines cache mapping after fetching through RDMA
  • 43. OPTIMISTIC CONCURRENCY: TRANSACTIONS MAY FAIL AFTER LOCK ACQUISITION
  • 46. COMMIT PROTOCOL NOTES > All communication is over RDMA > Total message delays not fewer than Paxos > But total number of messages is: 4P(2f + 1) vs Pw(f+3) + Pr > And some of those are extremely cheap *
  • 48. LEASES > i.e. registration + keepalive, created by three-way- handshake > 5ms leases for 90-node cluster, with 1ms-frequency retries!!
  • 49. LEASES - HOW THEY DID IT > Preallocation of lease manager memory > Pin code in RAM > Keep hardware threads free > Use unreliable transport
  • 50. SEVEN-STEP PROCESS TOWARDS RECOVERY 1. Suspect - block external requests 2. Probe - check for correlated failures 3. Update configuration - atomically move configuration to next version in ZK 4. Remap regions - recover replication guarantee from existing replicas
  • 51. SEVEN-STEP PROCESS: COMMIT PROTOCOL 1. Send new configuration - replicas are informed of new configuration 2. Apply new configuration - replicas update their configurations in parallel, and wait... 3. Commit new configuration - replicas are told to start serving requests again Commit protocol ensures consistent membership state,
  • 53. THANKS! QUESTIONS? @HENRYR / HENRY.ROBINSON@GMAIL.COM