4. Some more details
• How to make all clients pause exactly at every time slice boundary?
• How to make all client ops in the same time slice either all replicated or all not replicated?
5. Making all clients pause at time slice boundaries
• The monitor sends a timestamp Tts_bound to all clients for every time slice boundary.
• All clients set their pause timers according to the following rule (see the sketch below):
• CONDITION: the local system clock must be synchronized with the same NTP server as the monitors, and the time skew must be small enough.
• Tpause_timer_expire = Tts_bound + TimeSlice - Tlocal
• When the pause timer expires, if the client's local system clock still satisfies the CONDITION, its worker threads have to pause for the same period of time, Ppause.
• Ppause must be much larger than the time synchronization error bound.
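A minimal sketch of the client-side rule, assuming hypothetical helpers clock_is_synced() (checks the CONDITION against NTP) and pause_worker_threads(); the constants and the single-threaded wait are illustrative only:

```python
import time

TIME_SLICE = 1.0   # TimeSlice, seconds (illustrative value)
P_PAUSE = 0.050    # Ppause, must be much larger than the sync error bound

def schedule_pause(t_ts_bound, clock_is_synced, pause_worker_threads):
    """Arm the pause timer for the boundary Tts_bound + TimeSlice."""
    if not clock_is_synced():                    # CONDITION must hold when arming
        return
    t_local = time.time()                        # NTP-synced local wall clock, Tlocal
    delay = t_ts_bound + TIME_SLICE - t_local    # Tpause_timer_expire
    time.sleep(max(delay, 0.0))                  # fixed-duration wait until expiry
    # Re-check the CONDITION when the timer expires; only then pause.
    if clock_is_synced():
        pause_worker_threads(P_PAUSE)
```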
6. Making all clients pause at time slice boundaries
• Diagram: the system clocks of Client1, Client2, and Client3 on a common timeline, with the time sync error bound around Tts_bound and Tts_bound + TimeSlice, and each client's Ppause interval covering the boundary.
7. Making all ops in the same time slice in the same transaction
• After pausing for Ppause, each client reports the pause to the monitor, along with the Tts_bound_last according to which it set this pause.
• OSDs in the master cluster periodically report their latest replicated client op to the monitor.
• When all clients have finished the pauses they set according to Tts_bound_last, and all OSDs have started to replicate client ops whose timestamps are later than Tts_bound_last, the monitor sends Tts_bound_last + TimeSlice to the backup cluster (see the sketch below).
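A minimal sketch of the monitor-side decision, assuming hypothetical state fields; client_pause_bound and osd_latest_replicated_op stand for the client and OSD reports described above:

```python
TIME_SLICE = 1.0  # TimeSlice, seconds (illustrative value)

class MonitorState:
    def __init__(self, clients, osds):
        # Last Tts_bound each client has confirmed pausing for.
        self.client_pause_bound = {c: None for c in clients}
        # Timestamp of the latest client op each master OSD has replicated.
        self.osd_latest_replicated_op = {o: 0.0 for o in osds}

    def confirm_boundary(self, t_ts_bound_last):
        """Return the boundary to send to the backup cluster, or None."""
        all_clients_paused = all(
            b is not None and b >= t_ts_bound_last
            for b in self.client_pause_bound.values())
        all_osds_past_boundary = all(
            t > t_ts_bound_last
            for t in self.osd_latest_replicated_op.values())
        if all_clients_paused and all_osds_past_boundary:
            return t_ts_bound_last + TIME_SLICE
        return None
```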
8. Making all ops in the same time slice in the same transaction
• OSDs in the master cluster keep replicating client ops to the backup cluster regardless of the time slice constraints.
• OSDs in the backup cluster cache the ops in their journal, and write them back to the backing store only when a confirmed time slice boundary (Tts_bound + TimeSlice) has been sent by the monitor of the master cluster.
• When the time slice boundary (Tts_bound + TimeSlice) is received, all ops whose timestamps are earlier than that boundary are written back to the backing store (see the sketch below).
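A minimal sketch of the backup-side behavior, assuming a simple in-memory journal keyed by op timestamp; the real OSD journal is on disk, and all names here are illustrative:

```python
import heapq
import itertools

class BackupOsdJournal:
    """Caches replicated ops until a confirmed time slice boundary arrives."""

    def __init__(self, backing_store):
        self.backing_store = backing_store   # object with an apply(op) method
        self._pending = []                   # min-heap of (timestamp, seq, op)
        self._seq = itertools.count()        # tie-breaker for equal timestamps

    def cache_op(self, timestamp, op):
        # Ops arrive continuously, regardless of time slice boundaries.
        heapq.heappush(self._pending, (timestamp, next(self._seq), op))

    def on_confirmed_boundary(self, boundary):
        # Write back every cached op stamped earlier than the boundary, in
        # timestamp order; ops at or after the boundary stay in the journal.
        while self._pending and self._pending[0][0] < boundary:
            _, _, op = heapq.heappop(self._pending)
            self.backing_store.apply(op)
```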
9. Putting it together
• Diagram: clients send ops to the OSDs of the master cluster; the master cluster's monitor distributes Tts_bound to the clients, collects Tts_bound_last back from them and Top_latest from the OSDs, and sends the confirmed Tts_bound_last + TimeSlice to the backup cluster's monitor; ops flow from the master OSDs through transfer nodes to the OSDs of the backup cluster.
10. Key points
• The expiration time of clients' pause timers must NOT be influenced by time synchronization services like ntp or chrony (see the sketch after this list).
• If one client fails to pause correctly, this time slice should be merged into later time slices instead of being replicated.
• The OSD journal space of the OSDs in the backup cluster should be large enough to cache multiple time slices' ops, since the sending of the confirmed time slice boundary can be delayed.
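A minimal sketch of the timer key point: the countdown should be driven by a monotonic clock (or a fixed-duration sleep), not by repeatedly comparing against the wall clock, which ntp or chrony may step or slew while the timer is armed. Function names are illustrative:

```python
import time

def wait_until_boundary_unsafe(t_expire):
    # Anti-pattern: polling the NTP-disciplined wall clock; a step or slew
    # while waiting moves the effective expiration time.
    while time.time() < t_expire:
        time.sleep(0.001)

def wait_until_boundary_safe(delay):
    # The delay is computed once from the wall clock, then counted down on
    # the monotonic clock, which clock adjustments cannot touch.
    deadline = time.monotonic() + delay
    while time.monotonic() < deadline:
        time.sleep(0.001)
```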
11. ISSUES
• Maintaining causal ordering
• Guaranteeing replication correctness when some OSDs go down
• Some other considerations
12. Guaranteeing replication correctness when some OSDs go down
• The following two conditions should be enough to guarantee correctness (see the sketch after this list):
• For all OSDs in the acting set, an "original op journal" entry is removed only after its corresponding "original op" has been replicated.
• In the recovery/backfill phase, the recovery source replicates all journal entries related to the recovering object before pushing it.
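A toy model of the two conditions, not Ceph code; journal layout, the replicate_entry/push_object callables, and the per-object lookup are assumptions made for illustration:

```python
class ReplicatedOpJournal:
    """Toy 'original op journal' kept by an OSD in the acting set."""

    def __init__(self):
        self.entries = []                     # list of [op_id, object_id, replicated]

    def append(self, op_id, object_id):
        self.entries.append([op_id, object_id, False])

    def mark_replicated(self, op_id):
        for entry in self.entries:
            if entry[0] == op_id:
                entry[2] = True

    def trim(self):
        # Condition 1: an entry may be removed only after its op is replicated.
        self.entries = [e for e in self.entries if not e[2]]

    def entries_for(self, object_id):
        return [e for e in self.entries if e[1] == object_id]

def recover_object(journal, object_id, replicate_entry, push_object):
    # Condition 2: the recovery source replicates every journal entry for the
    # recovering object before pushing the object itself.
    for entry in journal.entries_for(object_id):
        replicate_entry(entry)
    push_object(object_id)
```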
13. Guaranteeing replication correctness when some OSDs go down
• Can we use the current OSD journal?
• op_replication_head - op_replication_tail <= some_threshold
• If the condition above holds, journal_head should point to the same op as op_replication_head.
• Otherwise, only journal_head moves forward (see the sketch below).
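A minimal sketch of reusing one journal with a separate replication head and tail, following the rule on this slide literally; the counters, the threshold value, and the interpretation of the pointers are assumptions:

```python
SOME_THRESHOLD = 1024   # illustrative backlog limit, in journal entries

class DualHeadJournal:
    """Toy model: one OSD journal with an extra replication head/tail."""

    def __init__(self):
        self.journal_head = 0          # newest entry written to the journal
        self.op_replication_head = 0   # newest entry queued for replication
        self.op_replication_tail = 0   # oldest entry not yet replicated

    def append_op(self):
        self.journal_head += 1
        backlog = self.op_replication_head - self.op_replication_tail
        if backlog <= SOME_THRESHOLD:
            # Condition holds: the replication head tracks the journal head,
            # so the new op is also queued for replication.
            self.op_replication_head = self.journal_head
        # Otherwise only journal_head moves forward; replication has to catch
        # up (advance op_replication_tail) before more ops are queued.

    def ack_replicated(self, count=1):
        self.op_replication_tail = min(
            self.op_replication_tail + count, self.op_replication_head)
```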
14. Guaranteeing replication correctness when some OSDs go down
• Can we use the current OSD journal?
• Replication is controlled by the acting primary.
• Replica OSDs should report their journal space usage info to the acting primary, for example by adding this info to the reply to the CEPH_OSD_OP_REPOP message (see the sketch below).
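A minimal sketch of the idea, not Ceph code: the acting primary keeps the latest journal usage reported by each replica (here assumed to be piggybacked on a generic rep-op reply) and can throttle when any replica's journal is close to full. Field and threshold names are illustrative:

```python
FULL_RATIO = 0.9   # illustrative threshold for "journal nearly full"

class ActingPrimary:
    def __init__(self, replicas):
        # Fraction of journal space used, as last reported by each replica.
        self.replica_journal_usage = {r: 0.0 for r in replicas}

    def on_repop_reply(self, replica, reply):
        # The reply is assumed to carry a journal_usage field (0.0 - 1.0),
        # piggybacked on the existing rep-op acknowledgement.
        self.replica_journal_usage[replica] = reply["journal_usage"]

    def can_accept_new_ops(self):
        # Throttle client ops while any replica's journal is nearly full.
        return all(u < FULL_RATIO for u in self.replica_journal_usage.values())
```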
15. Guaranteeing replication correctness when some OSDs go down
• Details about this issue are in the document from the last CDM: http://tracker.ceph.com/attachments/download/2903/ceph_rados-level_replication.pdf :-)
16. ISSUES
• Maintaining causal ordering
• Guaranteeing replication correctness when some OSDs go down
• Some other considerations
17. Some other considerations
• Making all clients pause periodically does not seem necessary; instead, clients could identify whether or not they need point-in-time consistency, so that only those that need it have to pause.
Editor's notes
When an OP is replicated, a response should be sent back to the replicating OSD, which, in turn, informs other OSDs in the acting set that the journal entry corresponding to the replicated OP can be removed.