2. Outline
● Spanner True Time (TT)
– Clock uncertainty
● Issues:
– Special hardware responsible for maintaining tightly
synchronized clock
– Transaction delays
● Proposed Augmented-Time (AT)
3. Spanner & True Time
(in Review)
● Spanner read and writes:
– T1, T2 two transactions
– If ph.T1<ph.T2, ts.T1 < ts.T2
● TT API:
– TT.now() → TT interval [earliest, latest]
– Error bound: ε=(latest – earliest)/2
● Error bound < 6ms (using special purpose time
master machines)
●
4. Implementation
A zone masterEach zone
(a unit of admin.
deployment)
100-1000 span servers
The zone master assigns
data to span servers
Span servers serve
data to clients
● Each span-server is responsible to 100-1000
tablets (a bag of mapping)
– (key:string, timestamp:int64) → string
5. Paxos state machines
● Implemented on top of each tablet to support
replication
– Leader replica → lock table (ranges of keys to lock
states)
– Transaction manager → to support distribued
transactions
● A Transaction involves
Only 1 Paxos group:
Bypass transaction manager
More than 1 Paxos group:
Perform 2 phase commit
7. Transaction Types
● Read-write:
– Use a 2-phase locking and 2-phase commit
● Client issues a read to the leader of the appropriate group
● Acquires read locks and reads the most recent data
● 2-phase commit
– Leader assign transaction timestamp when all the locks are
acquired (but not released yet) si=TT.now().latest
– Commit-wait for the uncertainty in TT:
● The leader blocks clients from seeing data committed by Ti until si <
TT.now().earliest
● Read-only:
– Assign a timestmp s_read (s_read = TT.now.latest)
– Execuate a snapshot read at s_read without locking
8. Causality in Distributed Systems
● Approaches for ordering events:
1. Ignore wall clocks (assuming infinite uncertainty), and track
events using logical clocks
vector clocks: e → f iff vc.e < vc.f
2. Use a dedicated time-synchronization machine, order
events by wall-clock time
● Pros and Cons:
– Vector clocks: Size of VC can be too large
– Spanner TrueTime:
● No communication required
● Commit-wait for uncertainty
Snapshot read:
Reading a consistent-cut in the past,
very difficult to achieve without TT
9. Augmented Time (AT)
● at.j is a vector:
● at.j[k] knowledge that j has about wallclock of k
● Same update rule as VC
● Update at.j[j] by the wallclock of j
– Synchronization assumption: at.j[k] cannot be less than at.j[j]-ε
– No need to keep track of at.j[k] if no communication between j & k
within the last ε time
– Limiting the size of at.j to only those that communicated with j in
the last ε time
ε small
TT
ε large
VC
ε
ε small
TT
ε small
TT
10. Augmented Time (AT)
● Integrating AT into Spanner's tablets:
– Backward compatible
– (key:string, timestamp:TTstamp) → 'string , at.j'
11. Transactions with AT
● Read/Write Transactions:
– Similar to TT execution, except that leader's
timestamp is AT (a vector) instead of a single TT
– No delay
● Snapshot Reads:
– Similar to TT
– Two cases for reading x with latest updated time tx
1. Latest update to x is older than ε → trivial case
2. x has another update within tx-ε
12. Snapshot Read:
handling overlapping uncertainty intervals
● Issue due to no commit-wait
● Another update at t'x:
– order events at tx and t'x according to AT
– Identify the latest version of x
– This ordering is not possible with single TT
13. Discussion
● AT is wait-free → higher throughout
● Hidden backchannel dependencies
– Can be resolved by client-notification-wait → not
reducing write throughput
● No need for dedicated GPS and atomic clocks
→ use NTP with moderate uncertainty
14. Take Home Message
● TT:
– Demands highly synchronized systems
(with GPS and atomic clocks)
– Provides snapshot reads in the past
– Has to wait for the uncertainty interval
– If keeping vector clocks, size of vectors will be too large
● AT:
– Size of vector is reasonably small
– No need to wait for the uncertainty
– Higher write throughout