Does Better Throughput
Require Worse Latency?
David Ungar, Doug Kimelman, Sam Adams, Mark Wegman
IBM T. J. Watson Research Center
Monday 7 January 2013
Example:
On-Line Transaction Processing
✦ Large “database” (100 GB) of information
✦ Constant stream of incoming updates & queries
✦ Need many cores to handle the work
✦ Cores need to communicate updates
✦ roll-ups sum over many variables
✦ Trick:
✦ Caching - updates must sync with invalidates
✦ Replication - updates must propagate
Assumptions
✦ Too much computation for one core
✦ Not trivially scalable;
✦ needs communication
✦ Inputs constantly changing
✦ No sub-space radio:
✦ communication finite and limiting
Throughput vs Latency
Throughput ~ Scaling
[Chart: scaling from 1 core to 100 cores (vertical axis 0 to 100), with reference lines for throughput = 1.0 and throughput = 0.25]
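One way to read the two reference lines in the chart is sketched below. It assumes that "throughput" means speedup divided by core count, so throughput = 1.0 is perfect scaling and throughput = 0.25 means 100 cores do the work of 25. That definition is an assumption (the slide does not spell it out), and the function names are hypothetical.

```c
#include <stdio.h>

/* Relative throughput, read here as speedup per core (an assumed definition). */
double relative_throughput(double speedup, int cores) {
    return speedup / cores;
}

/* Speedup reached on `cores` cores at a given relative throughput. */
double speedup_at(double throughput, int cores) {
    return throughput * cores;
}

/* Prints the speedup along both reference lines at the chart's tick marks. */
void print_scaling_table(void) {
    const int cores[] = {1, 25, 50, 75, 100};
    printf("%5s %12s %12s\n", "cores", "T = 1.0", "T = 0.25");
    for (size_t i = 0; i < sizeof cores / sizeof cores[0]; i++)
        printf("%5d %12.2f %12.2f\n", cores[i],
               speedup_at(1.0, cores[i]), speedup_at(0.25, cores[i]));
}
```

Calling `print_scaling_table()` from a `main` lists the speedup each line implies at 1, 25, 50, 75 and 100 cores.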
Latency
✦ Inter-core
✦ Data structure/algorithm level
✦ Time needed for a cause (input, computation result) on one core to affect another: Δt
What is best possible latency (on a given platform)?
Measure w/ Ring Counter
Core 1: while (1) A = D;
Core 2: while (1) B = A;
Core 3: while (1) C = B;
Core 4: while (1) D = C + 1;
Latency Baseline ≣ Time / Count / Number-of-Cores
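The ring counter can be sketched in runnable form as below. This is a minimal sketch, not the authors' benchmark: it assumes POSIX threads and C11 atomics, uses relaxed atomic accesses as a stand-in for "normal loads & stores" (plain shared variables would be a data race in C11, and the compiler could hoist the load out of the loop), and assumes a 64-byte cache line. The names `ring_latency_ns`, `ring_worker` and `MAX_THREADS` are hypothetical, and threads are not pinned to cores.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <time.h>

#define MAX_THREADS 64

/* One ring slot per thread, aligned to an assumed 64-byte cache line so
 * that neighbouring slots do not share a line. */
struct slot { _Alignas(64) _Atomic long v; };

static struct slot ring[MAX_THREADS];
static atomic_int stop_flag;
static int ring_size;   /* set before the threads start */
static int with_fence;  /* nonzero: full memory barrier after every store */

/* Thread i copies its predecessor's slot into its own; the last thread
 * adds 1, so the value grows by one per full trip around the ring. */
static void *ring_worker(void *arg) {
    int i = (int)(intptr_t)arg;
    int prev = (i + ring_size - 1) % ring_size;
    long bump = (i == ring_size - 1) ? 1 : 0;
    while (!atomic_load_explicit(&stop_flag, memory_order_relaxed)) {
        long v = atomic_load_explicit(&ring[prev].v, memory_order_relaxed);
        atomic_store_explicit(&ring[i].v, v + bump, memory_order_relaxed);
        if (with_fence)
            atomic_thread_fence(memory_order_seq_cst);
    }
    return NULL;
}

/* Returns nanoseconds per hop: elapsed time / count / number of threads.
 * Returns -1.0 for arguments outside the supported range. */
double ring_latency_ns(int threads, long target_count, int fence) {
    pthread_t tid[MAX_THREADS];
    struct timespec t0, t1;

    if (threads < 1 || threads > MAX_THREADS || target_count < 1)
        return -1.0;
    ring_size = threads;
    with_fence = fence;
    atomic_store(&stop_flag, 0);
    for (int i = 0; i < threads; i++)
        atomic_store(&ring[i].v, 0);
    for (int i = 0; i < threads; i++) {
        int rc = pthread_create(&tid[i], NULL, ring_worker,
                                (void *)(intptr_t)i);
        assert(rc == 0);
        (void)rc;
    }

    /* Time how long the counter takes to advance by target_count. */
    _Atomic long *counter = &ring[threads - 1].v;
    long start = atomic_load(counter);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while (atomic_load(counter) - start < target_count)
        sched_yield();  /* keep the observer from hogging a core */
    long end = atomic_load(counter);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    atomic_store(&stop_flag, 1);
    for (int i = 0; i < threads; i++)
        pthread_join(tid[i], NULL);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / (double)(end - start) / threads;
}
```

For example, `ring_latency_ns(4, 1000000, 0)` would estimate the per-hop latency for four threads without barriers, and passing `1` as the last argument adds a full fence after each store. Results depend heavily on the machine and on where the scheduler places the threads.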
Ring Counter Latency Baselines
[Two charts: min and max latency (ns, axis 0 to 100) vs. # threads (1 to 8; 4 cores, 2-way SMT). First chart: normal loads & stores. Second chart: normal loads & stores + memory barrier.]
Other platforms? Signals? Atomics?
The Intuition
✦ After you have optimized:
✦ Suppose relative latency is 10
✦ Relative throughput is 1/4
✦ If you then raise throughput to 1/2
✦ Latency will increase to 20
Space of best algorithms exhibits this trade-off
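The arithmetic in these bullets can be written out as follows, assuming (as the example numbers suggest, not as a stated law) that along the frontier of best algorithms relative latency grows in direct proportion to relative throughput. The symbols L and T are introduced here only for illustration.

```latex
\[
  \frac{L_2}{L_1} = \frac{T_2}{T_1}
  \quad\Longrightarrow\quad
  L_2 = L_1 \cdot \frac{T_2}{T_1}
      = 10 \cdot \frac{1/2}{1/4}
      = 20
\]
```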
Variables
✦ # readers
✦ # writers
✦ contention
✦ reading/writing
Which Instructions
✦ Normal loads & stores
✦ Atomic loads & stores
✦ Signals
✦ Memory barriers
Shared Counter
From McKenney’s PerfBook
| Approach | Write code | Read code | Latency | Throughput |
|---|---|---|---|---|
| Serial | C += delta | C | tiny | single-core |
| Mutex | lock, C += delta, unlock | C | small, unless writers convoy | higher, but writers have locking & contention overhead |
| Lock-free | C +=atomic delta | C | if contention, writers can starve | higher for low-contention writers |
| Per-thread | per-thread-C += delta | sum(all C’s) | high if many cores | higher for writers, lower for readers |
| Per-thread + cache | per-thread-C += delta | another thread maintains sum; read sum | higher: summing thread may be idle | high for both readers and writers |
| Race & Repair | C += delta | C | higher under contention: lost counts | high for both readers and writers |
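Three of the rows above (Mutex, Lock-free, Per-thread) can be sketched in C as follows. This is a minimal sketch assuming POSIX threads and C11 atomics, not code from the talk or from McKenney's book; every function name, `NTHREADS` and `ADDS_PER_THREAD` are hypothetical. It departs from the table in one place: the mutex reader takes the lock rather than reading `C` directly, to stay within C11's data-race rules. Race & Repair is left out because an unsynchronized `C += delta` on a plain variable is undefined behavior in C11.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

#define NTHREADS 4              /* hypothetical fixed writer count */
#define ADDS_PER_THREAD 100000L /* hypothetical workload size */

/* Mutex row: lock, C += delta, unlock. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long mutex_counter;

void mutex_add(long delta) {
    pthread_mutex_lock(&counter_lock);
    mutex_counter += delta;
    pthread_mutex_unlock(&counter_lock);
}

long mutex_read(void) {
    pthread_mutex_lock(&counter_lock);
    long v = mutex_counter;
    pthread_mutex_unlock(&counter_lock);
    return v;
}

/* Lock-free row: one atomic read-modify-write per update. */
static _Atomic long atomic_counter;

void atomic_add(long delta) { atomic_fetch_add(&atomic_counter, delta); }
long atomic_read(void) { return atomic_load(&atomic_counter); }

/* Per-thread row: each writer owns a slot (padded to an assumed 64-byte
 * cache line); readers pay for the writers' speed by summing every slot. */
struct padded { _Alignas(64) _Atomic long v; };
static struct padded per_thread[NTHREADS];

void per_thread_add(int tid, long delta) {
    /* Only thread `tid` writes this slot, so a relaxed load and store
     * are enough; no atomic read-modify-write is needed. */
    long v = atomic_load_explicit(&per_thread[tid].v, memory_order_relaxed);
    atomic_store_explicit(&per_thread[tid].v, v + delta, memory_order_relaxed);
}

long per_thread_read(void) {
    long sum = 0;
    for (int i = 0; i < NTHREADS; i++)
        sum += atomic_load_explicit(&per_thread[i].v, memory_order_relaxed);
    return sum;
}

static void *writer(void *arg) {
    int tid = (int)(intptr_t)arg;
    for (long i = 0; i < ADDS_PER_THREAD; i++) {
        mutex_add(1);
        atomic_add(1);
        per_thread_add(tid, 1);
    }
    return NULL;
}

/* Runs NTHREADS writers to completion once; returns 1 if all three
 * counters then hold the expected total, 0 otherwise. */
int run_counters(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, writer, (void *)(intptr_t)i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    long expected = (long)NTHREADS * ADDS_PER_THREAD;
    return mutex_read() == expected && atomic_read() == expected
        && per_thread_read() == expected;
}
```

The sketch only checks that the three variants agree once the writers have finished. It does not measure the latency or throughput differences the table describes, and a read taken while writers are still running may see a per-thread sum that lags the other two counters.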
Conclusions
✦ Throughput: how well parallelism gets work done
✦ Latency: how fast one core responds to another
✦ Lots of dimensions: # readers, # writers, contention
✦ Throughput vs Latency:
✦ throughput -> parallel -> distributed/replicated -> more latency