The document discusses the tradeoff between throughput and latency in parallel systems. It provides examples of how different algorithms for shared counters can impact latency and throughput. Specifically, it shows that increasing throughput, such as through more parallelism or replication, often leads to worse latency due to the increased communication between cores. The document concludes that there is generally a tradeoff between throughput and latency based on the number of readers, writers, and contention level in a parallel system.
1. Does Better Throughput Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
Monday 7 January 2013
2. Example: On-Line Transaction Processing
✦ Large “database” (100 GB) of information
✦ Constant stream of incoming updates & queries
✦ Need many cores to handle the work
✦ Cores need to communicate updates
✦ Roll-ups sum over many variables
✦ Tricks:
✦ Caching - updates must sync with invalidates
✦ Replication - updates must propagate
3. Assumptions
✦ Too much computation for one core
✦ Not trivially scalable; needs communication
✦ Inputs constantly changing
✦ No sub-space radio: communication finite and limiting
6. Latency
✦ Inter-core
✦ Data structure/algorithm level
✦ Time needed for cause (input, computation result) on one core to affect another (Δt)
What is best possible latency (on a given platform)?
7. Measure w/ Ring Counter
Core 1: while (1) A = D;
Core 2: while (1) B = A;
Core 3: while (1) C = B;
Core 4: while (1) D = C + 1;

Latency Baseline ≣ Time / Count / Number-of-Cores
8. Ring Counter Latency Baselines
[Two plots: min/max latency (ns, 0 to 100) vs. # threads (1 to 8; 4 cores, 2-way SMT) - one for normal loads & stores, one for normal loads & stores + memory barrier]
Other platforms? Signals? Atomics?
9. The Intuition
✦ After you have optimized:
✦ Suppose relative latency is 10
✦ Relative throughput is 1/4
✦ If you then raise throughput to 1/2
✦ Latency will increase to 20
Space of best algorithms exhibits this trade-off
12. Shared Counter
From McKenney’s PerfBook

✦ Serial
✦ write: C += delta; read: C
✦ latency: tiny; throughput: single-core
✦ Mutex
✦ write: lock, C += delta, unlock; read: C
✦ latency: small, unless writers convoy; throughput: higher, but writers have locking & contention overhead
✦ Lock-Free
✦ write: C +=atomic delta; read: C
✦ latency: writers can starve under contention; throughput: higher for low-contention writers
✦ Per-thread
✦ write: per-thread-C += delta; read: sum(all C’s)
✦ latency: high if many cores; throughput: higher for writers, lower for readers
✦ Per-thread + cache
✦ write: per-thread-C += delta; read: read sum (another thread maintains the sum)
✦ latency: higher (summing thread may be idle); throughput: high for both readers and writers
✦ Race & Repair
✦ write: C += delta; read: C
✦ latency: higher under contention (lost counts); throughput: high for both readers and writers
19. Conclusions
✦ Throughput: how well parallelism gets work done
✦ Latency: how fast one core responds to another
✦ Lots of dimensions: # readers, # writers, contention
✦ Throughput vs Latency:
✦ throughput -> parallel -> distributed/replicated -> more latency