RADOS level replication
ISSUES
• Maintaining causal ordering
• Guarantee replication correctness when some OSDs go down
• Some other considerations
[Figure: the system run time divided into a sequence of consecutive time slices]
Sage Weil suggested (to preserve causal order):
• Split the whole system run time into a series of time slices.
• All client ops issued during the same time slice belong to the same transaction.
• At time slice boundaries, clients have to pause long enough for the physical clocks to advance far enough that causal order cannot be violated across the time slice boundary.
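
As a minimal sketch of the slicing idea above, client ops can be grouped by the time slice their timestamp falls into. The names `slice_index`, `group_by_slice` and `epoch` are hypothetical and do not come from Ceph; the slide does not specify how slices are numbered.

```python
from typing import Dict, Iterable, List, Tuple


def slice_index(op_timestamp: float, epoch: float, time_slice: float) -> int:
    """Return the index of the time slice that contains op_timestamp.

    epoch is the (assumed) start of the system run time; slice k covers
    the half-open interval [epoch + k * time_slice, epoch + (k + 1) * time_slice).
    """
    if time_slice <= 0:
        raise ValueError("time_slice must be positive")
    return int((op_timestamp - epoch) // time_slice)


def group_by_slice(
    ops: Iterable[Tuple[str, float]], epoch: float, time_slice: float
) -> Dict[int, List[str]]:
    """Group (op_name, timestamp) pairs into one 'transaction' per time slice."""
    groups: Dict[int, List[str]] = {}
    for name, ts in ops:
        groups.setdefault(slice_index(ts, epoch, time_slice), []).append(name)
    return groups
```

An op stamped exactly on a boundary is assigned to the later slice here; that choice is an assumption of this sketch.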
Some more details
• How to make all clients pause at exactly every time slice boundary?
• How to make all client Ops in the same time slice either all replicated
or all not-replicated?
Making all clients pause at time slice boundaries
• The monitor sends a timestamp Tts_bound to all clients for every time slice boundary.
• All clients set their pause timers according to the following rule:
  • CONDITION: The local system clock has to be synchronized with the same NTP server as the monitors, and the time skew must be small enough.
  • Tpause_timer_expire = Tts_bound + TimeSlice - Tlocal (the delay, measured from the current local time Tlocal, until the timer expires)
• When the pause timer expires, if the client’s local system clock still satisfies the CONDITION, its worker threads have to pause for the same period of time, Ppause.
• Ppause must be much larger than the time synchronization error bound.
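
The timer rule above can be sketched as follows. This is an illustration under stated assumptions, not Ceph code: the function names, the `skew` and `max_skew` parameters, and the choice to express the deadline on a monotonic clock are all hypothetical.

```python
from typing import Optional


def pause_timer_delay(t_ts_bound: float, time_slice: float, t_local: float) -> float:
    """Delay until the pause timer expires: Tts_bound + TimeSlice - Tlocal."""
    return t_ts_bound + time_slice - t_local


def schedule_pause(
    t_ts_bound: float,
    time_slice: float,
    t_local: float,
    skew: float,
    max_skew: float,
    monotonic_now: float,
) -> Optional[float]:
    """Return the timer deadline on a monotonic clock, or None.

    None is returned when the CONDITION is not met (measured skew too
    large) or when the boundary has already passed locally.
    """
    if abs(skew) > max_skew:
        return None  # CONDITION violated: do not arm the timer
    delay = pause_timer_delay(t_ts_bound, time_slice, t_local)
    if delay < 0:
        return None  # boundary already in the past according to the local clock
    # The deadline is relative to a monotonic reading, so later wall-clock
    # adjustments do not move it.
    return monotonic_now + delay
```

In a real client, `monotonic_now` would come from something like `time.monotonic()`, so that a later clock step by the synchronization service cannot shift the expiry.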
[Figure: Making all clients pause at time slice boundaries. Timeline from Tts_bound to Tts_bound + TimeSlice showing the system clocks of Client1, Client2 and Client3, each client’s Ppause, and the time sync error bound.]
Putting all ops of the same time slice into the same transaction
• After pausing for Ppause, every client reports the pause to the monitor, together with Tts_bound_last, the boundary timestamp according to which it set this pause.
• OSDs in the master cluster periodically report their latest replicated client op to the monitor.
• When all clients have finished the pauses they set according to Tts_bound_last, and all OSDs have started to replicate client ops whose timestamps are later than Tts_bound_last, the monitor sends Tts_bound_last + TimeSlice to the backup cluster.
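
The monitor-side check described above might look like the following sketch. `BoundaryTracker` and its methods are hypothetical names, and the two conditions are implemented literally as stated on the slide; a real monitor would also have to handle client and OSD membership changes.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set

NEG_INF = float("-inf")


@dataclass
class BoundaryTracker:
    clients: Set[str]
    osds: Set[str]
    time_slice: float
    # client -> latest Tts_bound_last for which it reported a finished pause
    paused: Dict[str, float] = field(default_factory=dict)
    # osd -> timestamp of the latest client op it reported as replicated
    replicated: Dict[str, float] = field(default_factory=dict)

    def report_pause(self, client: str, t_ts_bound_last: float) -> None:
        self.paused[client] = max(self.paused.get(client, NEG_INF), t_ts_bound_last)

    def report_replicated(self, osd: str, t_op_latest: float) -> None:
        self.replicated[osd] = max(self.replicated.get(osd, NEG_INF), t_op_latest)

    def confirmed_boundary(self, t_ts_bound_last: float) -> Optional[float]:
        """Return Tts_bound_last + TimeSlice once both conditions hold, else None."""
        clients_done = all(
            self.paused.get(c, NEG_INF) >= t_ts_bound_last for c in self.clients
        )
        osds_past = all(
            self.replicated.get(o, NEG_INF) > t_ts_bound_last for o in self.osds
        )
        if clients_done and osds_past:
            return t_ts_bound_last + self.time_slice
        return None
```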
Putting all ops of the same time slice into the same transaction
• OSDs in the master cluster keep replicating client ops to the backup cluster regardless of the time slice constraints.
• OSDs in the backup cluster cache the ops in their journals, and write them back to the backing store only once a confirmed time slice boundary (Tts_bound + TimeSlice) has been sent by the monitor of the master cluster.
• When the time slice boundary (Tts_bound + TimeSlice) is received, all ops with timestamps earlier than that boundary are written back to the backing store.
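
A minimal sketch of the backup-side behaviour described above: ops are held in a journal and applied only when a confirmed boundary arrives. `BackupJournal` is a hypothetical in-memory stand-in, not the Ceph OSD journal, and applying ops in timestamp order is an assumption of this sketch.

```python
from typing import List, Tuple


class BackupJournal:
    def __init__(self) -> None:
        self.pending: List[Tuple[float, str]] = []  # (timestamp, op) cached in the journal
        self.store: List[str] = []                  # ops written back to the backing store

    def cache(self, timestamp: float, op: str) -> None:
        """Cache a replicated op; nothing reaches the backing store yet."""
        self.pending.append((timestamp, op))

    def on_confirmed_boundary(self, boundary: float) -> int:
        """Write back every cached op with a timestamp earlier than boundary.

        Returns the number of ops written back.
        """
        ready = sorted(
            (entry for entry in self.pending if entry[0] < boundary),
            key=lambda entry: entry[0],
        )
        self.pending = [entry for entry in self.pending if entry[0] >= boundary]
        self.store.extend(op for _, op in ready)
        return len(ready)
```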
Put together
[Figure: architecture diagram. Master cluster: clients, OSDs and a monitor, exchanging Tts_bound, Tts_bound_last and Top_latest. Backup cluster: OSDs and a monitor, receiving Tts_bound_last + TimeSlice. Transfer nodes and ops flows connect the two clusters.]
Key points
• The expiration time of the clients’ pause timers must NOT be influenced by time synchronization services such as ntp or chrony.
• If a client fails to pause correctly, the affected time slice should be merged into later time slices instead of being replicated on its own.
• The journal space of the OSDs in the backup cluster should be large enough to cache the ops of multiple time slices, since the sending of the confirmed time slice boundary can be delayed.
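
The merging rule in the second key point can be illustrated with a short sketch. `merge_slices` is a hypothetical helper: it takes the end boundary of each slice together with a flag saying whether all clients paused correctly at that boundary, and returns the intervals that would be applied as one transaction each.

```python
from typing import List, Tuple


def merge_slices(
    start: float, slices: List[Tuple[float, bool]]
) -> List[Tuple[float, float]]:
    """Merge time slices whose closing boundary was not paused correctly.

    slices is a list of (boundary_end, all_clients_paused). A slice whose
    boundary failed is not confirmed on its own; its ops are carried into
    the next slice that ends on a good boundary.
    """
    transactions: List[Tuple[float, float]] = []
    tx_start = start
    for boundary_end, all_clients_paused in slices:
        if all_clients_paused:
            transactions.append((tx_start, boundary_end))
            tx_start = boundary_end
        # otherwise keep tx_start, so the slice merges into a later one
    return transactions
```

Trailing slices after the last good boundary stay unconfirmed, which is why the backup journals need room for several slices.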
Guarantee replication correctness when some OSDs go down
• The following two conditions should be enough to guarantee correctness:
  • For all OSDs in the acting set, an “original op journal” entry is removed only after its corresponding “original op” has been replicated.
  • In the recovery/backfill phase, the recovery source replicates all journal entries related to the recovering object before pushing it.
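
The first condition can be sketched as a journal that only trims entries from its oldest end once they are marked replicated. `OpJournal` is a hypothetical in-memory model, not the Ceph journal; trimming only a contiguous replicated prefix is an assumption of this sketch.

```python
from typing import Dict, List


class OpJournal:
    def __init__(self) -> None:
        # op_id -> replicated flag; dicts keep insertion order (oldest first)
        self.entries: Dict[int, bool] = {}

    def append(self, op_id: int) -> None:
        """Journal a new original op; it is not yet replicated."""
        self.entries[op_id] = False

    def mark_replicated(self, op_id: int) -> None:
        """Record that the backup cluster acknowledged this op."""
        if op_id in self.entries:
            self.entries[op_id] = True

    def trim(self) -> List[int]:
        """Remove replicated entries from the oldest end and return their ids.

        Trimming stops at the first entry that has not been replicated,
        so an unreplicated op is never dropped from the journal.
        """
        removed: List[int] = []
        for op_id in list(self.entries):
            if not self.entries[op_id]:
                break
            del self.entries[op_id]
            removed.append(op_id)
        return removed
```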
Guarantee replication correctness when some OSDs go down
• Can we use the current OSD journal?
• op_replication_head - op_replication_tail <= some_threshold
• If the condition above holds, journal_head should point to the same “op” as op_replication_head.
• Otherwise, only journal_head moves forward.
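
One possible reading of the rule above, written out as a sketch: when a new op is journaled, journal_head always advances, and op_replication_head follows it only while the replication window stays within the threshold. The function `advance_heads` and the use of integer positions are assumptions; the slide does not define the exact semantics.

```python
from typing import Tuple


def advance_heads(
    journal_head: int,
    op_replication_head: int,
    op_replication_tail: int,
    some_threshold: int,
) -> Tuple[int, int]:
    """Journal one new op and return (journal_head, op_replication_head)."""
    journal_head += 1
    if op_replication_head - op_replication_tail <= some_threshold:
        # Window within the threshold: replication head tracks the journal head.
        op_replication_head = journal_head
    # Otherwise only journal_head has moved forward.
    return journal_head, op_replication_head
```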
Guarantee replication correctness when some OSDs go down
• Can we use the current OSD journal?
• Replication is controlled by the acting primary.
• Replica OSDs should report their journal space usage to the acting primary, for example by adding this information to the reply to the CEPH_OSD_OP_REPOP message.
Guarantee replication correctness when some OSDs go down
• Details about this issue are in the document from the last CDM: http://tracker.ceph.com/attachments/download/2903/ceph_rados-level_replication.pdf :-)
Some other considerations
• Making all clients pause periodically does not seem necessary. Clients could instead declare whether or not they need point-in-time consistency, so that only those that need it have to pause.

Sep 6 cdm

Editor's notes

  1. When an OP is replicated, a response should be sent back to the replicating OSD, which, in turn, informs other OSDs in the acting set that the journal entry corresponding to the replicated OP can be removed.