Lock-Free Algorithms For Ultimate Performance

  Martin Thompson - @mjpt777
Watch the video with slide synchronization on InfoQ.com:
http://www.infoq.com/presentations/Lock-Free-Algorithms

Presented at QCon San Francisco (www.qconsf.com)
Modern Hardware Overview
Modern Hardware (Intel Sandy Bridge)

[Diagram: two sockets, each with cores C1 ... Cn, per-core L1 and L2 caches, a shared L3 and a memory controller (MC) with local DRAM; the sockets are linked by QPI.]

Approximate access latencies:
 • Registers/Buffers   <1ns
 • L1                  ~3-4 cycles     ~1ns
 • L2                  ~10-12 cycles   ~3ns
 • L3                  ~40-45 cycles   ~15ns
 • QPI                 ~40ns
 • DRAM                ~65ns
Memory Ordering

[Diagram: Core 1 ... Core n. Each core has registers, execution units, a memory order buffer (MOB) with a store buffer and a load buffer, line-fill/write-combining (LF/WC) buffers, and its own L1 and L2 caches; all cores share the L3 cache.]
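Cache Structure & Coherence

[Diagram: cache hierarchy of one core and its connection to the rest of the chip.]
 • Core side: MOB, LF/WC buffers, L0(I) – 1.5k µops
 • L1(D) – 32K and L1(I) – 32K, with TLB and pre-fetchers
 • L2 – 256K, with TLB and pre-fetchers
 • L3 – 8-20MB, reached over the ring bus
 • System agent: memory controller (memory channels) and QPI (QPI bus)
 • Caches are SRAM organised in 64-byte “cache-lines”
 • Coherence uses the MESI+F state model
 • Data path widths labelled in the diagram: 128 bits, 16 bytes, 256 bits, 128 bits, 32 bytes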
Memory Models
Hardware Memory Models

Memory consistency models describe how threads may
interact through shared memory consistently.
 • Program Order (PO) for a single thread
 • Sequential Consistency (SC) [Lamport 1979]
   > What you expect a program to do! (for race-free programs)
 • Strict Consistency (Linearizability)
   > Some special instructions
 • Total Store Order (TSO)
   > SPARC model that is weaker than SC
 • x86/64 is TSO + (Total Lock Order & Causal Consistency)
   > http://www.youtube.com/watch?v=WUfvvFD5tAA
 • Other Processors can have weaker models
Intel x86/64 Memory Model
http://www.multicoreinfo.com/research/papers/2008/damp08-intel64.pdf
http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.html


1.   Loads are not reordered with other loads.
2.   Stores are not reordered with other stores.
3.   Stores are not reordered with older loads.
4.   Loads may be reordered with older stores to different locations but
     not with older stores to the same location.
5.   In a multiprocessor system, memory ordering obeys causality
     (memory ordering respects transitive visibility).
6.   In a multiprocessor system, stores to the same location have a total
     order.
7.   In a multiprocessor system, locked instructions have a total order.
8.   Loads and stores are not reordered with locked instructions.
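
Rule 4 is the one reordering that x86 permits: a load may pass an older store to a different location. A minimal Java sketch of the classic store/load litmus test follows. The class and method names are hypothetical, not from the talk. With plain fields both threads could read 0; declaring the fields volatile asks the JVM to prevent that outcome (on x86 it typically does so with a locked instruction or fence after the store, see rule 8).

```java
final class StoreLoadLitmus {
    // volatile forbids the store->load reordering that rule 4 would otherwise allow.
    static volatile int x, y;
    // Plain fields: written by the worker threads, read after join().
    static int r1, r2;

    // Runs the litmus test repeatedly and counts how often both loads saw 0.
    static int countBothZero(final int iterations) {
        int bothZero = 0;
        for (int i = 0; i < iterations; i++) {
            x = 0;
            y = 0;
            final Thread t1 = new Thread(() -> { x = 1; r1 = y; });
            final Thread t2 = new Thread(() -> { y = 1; r2 = x; });
            t1.start();
            t2.start();
            try {
                t1.join();
                t2.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
            if (r1 == 0 && r2 == 0) {
                bothZero++;
            }
        }
        return bothZero;
    }
}
```

Removing the volatile modifier from x and y makes the both-zero outcome legal; whether it is actually observed depends on the JIT and the hardware.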
Language/Runtime Memory Models

Some languages/runtimes have a well-defined memory
model for portability:
 •   Java Memory Model (Java 5)
 •   C++11
 •   Erlang
 •   Go

For most other languages we are at the mercy of the compiler
 • Instruction reordering
 • C “volatile” is inadequate
 • Register allocation for caching values
 • No mapping to the hardware memory model
 • Fences/Barriers need to be applied
Measuring What Is Going On
Model Specific Registers (MSR)

Many and varied uses
 • Timestamp Invariant Counter
 • Memory Type Range Registers

Performance Counters!!!
 •   L2/L3 Cache Hits/Misses
 •   TLB Hits/Misses
 •   QPI Transfer Rates
 •   Instruction and Cycle Counts
 •   Lots of others....
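Accessing MSRs

#include <stdint.h>

// RDMSR/WRMSR are privileged instructions: they only work in ring 0.
// ECX selects the MSR; the 64-bit value is split across EDX:EAX.
void rdmsr(uint32_t msr, uint32_t* lo, uint32_t* hi)
{
    asm volatile("rdmsr" : "=a" (*lo), "=d" (*hi) : "c" (msr));
}

void wrmsr(uint32_t msr, uint32_t lo, uint32_t hi)
{
    asm volatile("wrmsr" :: "c" (msr), "a" (lo), "d" (hi));
}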
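Accessing MSRs On Linux

import java.io.RandomAccessFile;
import java.nio.*;
import java.nio.channels.FileChannel;

// Needs the "msr" kernel module and root privileges.
RandomAccessFile f = new RandomAccessFile("/dev/cpu/0/msr", "rw");
FileChannel ch = f.getChannel();
ByteBuffer buffer = ByteBuffer.allocate(8);
buffer.order(ByteOrder.nativeOrder());
ch.read(buffer, msrNumber);   // the file offset selects the MSR number
long value = buffer.getLong(0);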
Accessing MSRs Made Easy!

Intel VTune
  • http://software.intel.com/en-us/intel-vtune-amplifier-xe

Linux “perf stat”
  • http://linux.die.net/man/1/perf-stat

likwid - Lightweight Performance Counters
   • http://code.google.com/p/likwid/
Biggest Performance Enemy: “Contention!”
Contention
• Managing Contention
 > Locks
 > CAS Techniques
• Little’s & Amdahl’s Laws
 > L = λW
 > Sequential Component Constraint
• Single Writer Principle
• Shared Nothing Designs
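
The two laws named above can be made concrete with a few lines of arithmetic. This is a minimal sketch with hypothetical names: Little's Law gives the mean number of items in a system (L) from the arrival rate (λ) and the mean time each item spends there (W), and Amdahl's Law bounds the speedup from N processors when a fraction of the work must run sequentially.

```java
final class ContentionLaws {
    // Little's Law: L = lambda * W.
    static double littlesLaw(final double arrivalRate, final double meanTimeInSystem) {
        return arrivalRate * meanTimeInSystem;
    }

    // Amdahl's Law: speedup = 1 / (s + (1 - s) / n),
    // where s is the sequential fraction and n the number of processors.
    static double amdahlSpeedup(final double sequentialFraction, final int processors) {
        return 1.0 / (sequentialFraction + (1.0 - sequentialFraction) / processors);
    }
}
```

For example, with 5% sequential work, 32 processors give a speedup of roughly 12.5x, and no number of processors can exceed 20x. That is why the sequential, contended part of an algorithm dominates.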
Locks
Software Locks

• Mutex, Semaphore, Critical Section, etc.
 >   What happens when un-contended?
 >   What happens when contention occurs?
 >   What if we need condition variables?
 >   What is the cost of software locks?
 >   Can they be optimised?
Hardware Locks

• Atomic Instructions
 > Compare And Swap/Set
 > Lock instructions on x86
   – LOCK XADD is a bit special


• Used to update sequences and pointers
• What are the costs of these operations?

• Guess how software locks are created?

• TSX (Transactional Synchronization Extensions)
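
As an illustration of using atomic instructions to update a sequence, here is a minimal Java sketch (the class and method names are hypothetical). The first method claims the next value with a compare-and-set retry loop, which can fail and spin under contention. The second uses getAndIncrement, which a JVM on x86 can typically compile to a single LOCK XADD that always succeeds; that mapping is an assumption about the JIT, not a guarantee of the API.

```java
import java.util.concurrent.atomic.AtomicLong;

final class SequenceClaim {
    private final AtomicLong sequence = new AtomicLong(0);

    // CAS loop: read, compute, and retry if another thread got there first.
    long claimWithCas() {
        long current;
        do {
            current = sequence.get();
        } while (!sequence.compareAndSet(current, current + 1));
        return current;
    }

    // Fetch-and-add: one atomic operation, no retry loop in user code.
    long claimWithFetchAdd() {
        return sequence.getAndIncrement();
    }
}
```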
Let’s Look At A Lock-Free Algorithm
OneToOneQueue – Take 1

public final class OneToOneConcurrentArrayQueue<E>
    implements Queue<E>
{
    private final E[] buffer;

    private volatile long tail = 0;
    private volatile long head = 0;

    public OneToOneConcurrentArrayQueue(int capacity)
    {
        buffer = (E[])new Object[capacity];
    }
OneToOneQueue – Take 1

public boolean offer(final E e)
{
    final long currentTail = tail;
    final long wrapPoint = currentTail - buffer.length;
    if (head <= wrapPoint)
    {
        return false;
    }

    buffer[(int)(currentTail % buffer.length)] = e;

    tail = currentTail + 1;

    return true;
}
OneToOneQueue – Take 1

public E poll()
{
    final long currentHead = head;
    if (currentHead >= tail)
    {
        return null;
    }

    final int index =
        (int)(currentHead % buffer.length);
    final E e = buffer[index];
    buffer[index] = null;

    head = currentHead + 1;

    return e;
}
Concurrent Queue Performance Results

                         Ops/Sec (Millions)   Mean Latency (ns)
LinkedBlockingQueue             4.3            ~32,000 / ~500
ArrayBlockingQueue              3.5            ~32,000 / ~600
ConcurrentLinkedQueue           13             NA / ~180
ConcurrentArrayQueue            13             NA / ~150

Note: None of these tests are run with thread affinity set, Sandy Bridge 2.4 GHz
Latency: Blocking - put() & take() / Non-Blocking - offer() & poll()
Let’s Apply Some “Mechanical Sympathy”
Mechanical Sympathy In Action

Knowing the cost of operations
 • Remainder Operation
 • Volatile writes and lock instructions

Why so many cache misses?
 • False Sharing
 • Algorithm Opportunities
   > “Smart Batching”
 • Memory layout
Operation Costs
Signalling

// Lock
pthread_mutex_lock(&lock);
sequence = i;
pthread_cond_signal(&condition);
pthread_mutex_unlock(&lock);

// Soft Barrier
asm volatile("" ::: "memory");
sequence = i;

// Fence
asm volatile("" ::: "memory");
sequence = i;
asm volatile("lock addl $0x0,(%rsp)");
Signalling Costs

                   Lock      Fence    Soft

Million Ops/Sec      9.4      45.7    108.1

L2 Hit Ratio        17.26    28.17    13.32

L3 Hit Ratio        0.78     29.60    27.99

Instructions       12846 M   906 M    801 M

CPU Cycles         28278 M   5808 M   1475 M

Ins/Cycle           0.45      0.16     0.54
OneToOneQueue – Take 2

public final class OneToOneConcurrentArrayQueue2<E>
    implements Queue<E>
{
    private final int mask;
    private final E[] buffer;

    private final AtomicLong tail = new AtomicLong(0);
    private final AtomicLong head = new AtomicLong(0);

    public OneToOneConcurrentArrayQueue2(int capacity)
    {
        capacity = findNextPositivePowerOfTwo(capacity);
        mask = capacity - 1;
        buffer = (E[])new Object[capacity];
    }
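
The constructor above calls findNextPositivePowerOfTwo, which the slides do not show. One possible implementation is sketched below; the body is an assumption, only the method name comes from the slide. Rounding the capacity up to a power of two is what lets Take 2 replace the remainder operation with a bitwise AND against mask.

```java
final class PowerOfTwo {
    // Returns the smallest power of two that is >= value (minimum 1).
    // Sketch only: values above 2^30 cannot be represented as a positive int.
    static int findNextPositivePowerOfTwo(final int value) {
        if (value <= 1) {
            return 1;
        }
        if (value > (1 << 30)) {
            throw new IllegalArgumentException("capacity too large: " + value);
        }
        return 1 << (32 - Integer.numberOfLeadingZeros(value - 1));
    }
}
```

For a non-negative sequence and a power-of-two capacity, `sequence & (capacity - 1)` yields the same index as `sequence % capacity`, without the cost of an integer division.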
OneToOneQueue – Take 2

public boolean offer(final E e)
{
    final long currentTail = tail.get();
    final long wrapPoint = currentTail - buffer.length;
    if (head.get() <= wrapPoint)
    {
        return false;
    }

    buffer[(int)currentTail & mask] = e;
    tail.lazySet(currentTail + 1);

    return true;
}
OneToOneQueue – Take 2

public E poll()
{
    final long currentHead = head.get();
    if (currentHead >= tail.get())
    {
        return null;
    }

    final int index = (int)currentHead & mask;
    final E e = buffer[index];
    buffer[index] = null;
    head.lazySet(currentHead + 1);

    return e;
}
Concurrent Queue Performance Results

                         Ops/Sec (Millions)   Mean Latency (ns)
LinkedBlockingQueue             4.3            ~32,000 / ~500
ArrayBlockingQueue              3.5            ~32,000 / ~600
ConcurrentLinkedQueue           13             NA / ~180
ConcurrentArrayQueue            13             NA / ~150
ConcurrentArrayQueue2           45             NA / ~120

Note: None of these tests are run with thread affinity set, Sandy Bridge 2.4 GHz
Latency: Blocking - put() & take() / Non-Blocking - offer() & poll()
Cache Misses
False Sharing and Cache Lines

[Diagram: 64-byte “cache-lines”. Unpadded: *address1 (thread a) and *address2 (thread b) sit in the same cache line. Padded: each address sits in its own cache line.]
False Sharing Testing



int64_t* address = seq->address;

for (int i = 0; i < ITERATIONS; i++)
{
    int64_t value = *address;
    ++value;
    *address = value;
    asm volatile("lock addl $0x0,(%rsp)");
}
False Sharing Test Results


                    Unpadded    Padded

  Million Ops/sec      12.4      104.9

   L2 Hit Ratio        1.16%    23.05%

   L3 Hit Ratio        2.51%    39.18%

   Instructions       4559 M    4508 M

   CPU Cycles         63480 M   7551 M

  Ins/Cycle Ratio      0.07      0.60
OneToOneQueue – Take 3

public final class OneToOneConcurrentArrayQueue3<E>
    implements Queue<E>
{
    private final int capacity;
    private final int mask;
    private final E[] buffer;

    private final AtomicLong tail = new PaddedAtomicLong(0);
    private final AtomicLong head = new PaddedAtomicLong(0);

    public static class PaddedLong
    {
        public long value = 0, p1, p2, p3, p4, p5, p6;
    }

    private final PaddedLong tailCache = new PaddedLong();
    private final PaddedLong headCache = new PaddedLong();
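
PaddedAtomicLong is used above but not defined on the slides. A minimal sketch of one way to write it follows; the field names and the helper method are assumptions. The extra long fields are intended to keep each counter from sharing a 64-byte cache line with its neighbour, although the JVM does not guarantee field layout, so the effect should be measured rather than assumed.

```java
import java.util.concurrent.atomic.AtomicLong;

final class PaddedAtomicLong extends AtomicLong {
    // Padding: 6 longs (48 bytes) in addition to the inherited value and object header.
    public volatile long p1, p2, p3, p4, p5, p6 = 7L;

    PaddedAtomicLong(final long initialValue) {
        super(initialValue);
    }

    // Reads the padding so that it is less likely to be treated as dead fields.
    public long sumPaddingToPreventOptimisation() {
        return p1 + p2 + p3 + p4 + p5 + p6;
    }
}
```

Because it extends AtomicLong, it can be assigned to the AtomicLong fields exactly as the class above does, and get() and lazySet() behave as before.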
OneToOneQueue – Take 3

public boolean offer(final E e)
{
    final long currentTail = tail.get();
    final long wrapPoint = currentTail - capacity;
    if (headCache.value <= wrapPoint)
    {
        headCache.value = head.get();
        if (headCache.value <= wrapPoint)
        {
            return false;
        }
    }

    buffer[(int)currentTail & mask] = e;
    tail.lazySet(currentTail + 1);

    return true;
}
OneToOneQueue – Take 3

public E poll()
{
    final long currentHead = head.get();
    if (currentHead >= tailCache.value)
    {
        tailCache.value = tail.get();
        if (currentHead >= tailCache.value)
        {
            return null;
        }
    }

    final int index = (int)currentHead & mask;
    final E e = buffer[index];
    buffer[index] = null;
    head.lazySet(currentHead + 1);

    return e;
}
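
The slides show Take 3 in fragments and omit the constructor. The following is a self-contained sketch that assembles offer() and poll() into one compilable class with a small producer/consumer check. It is an approximation, not the author's code: the class name, the constructor body and transferSum are assumptions, cache-line padding is left out for brevity, and it is only safe with exactly one producer thread and one consumer thread.

```java
import java.util.concurrent.atomic.AtomicLong;

final class OneToOneQueueSketch<E> {
    private final int capacity;
    private final int mask;
    private final E[] buffer;
    private final AtomicLong tail = new AtomicLong(0); // written by the producer only
    private final AtomicLong head = new AtomicLong(0); // written by the consumer only
    private long headCache = 0; // producer's last observed head
    private long tailCache = 0; // consumer's last observed tail

    @SuppressWarnings("unchecked")
    OneToOneQueueSketch(final int requestedCapacity) {
        // Round up to a power of two so that "& mask" can replace "% capacity".
        capacity = requestedCapacity <= 1
            ? 1 : 1 << (32 - Integer.numberOfLeadingZeros(requestedCapacity - 1));
        mask = capacity - 1;
        buffer = (E[]) new Object[capacity];
    }

    boolean offer(final E e) {
        if (e == null) {
            throw new NullPointerException("null elements are not allowed");
        }
        final long currentTail = tail.get();
        final long wrapPoint = currentTail - capacity;
        if (headCache <= wrapPoint) {
            headCache = head.get(); // refresh the cached head only when we appear full
            if (headCache <= wrapPoint) {
                return false;
            }
        }
        buffer[(int) currentTail & mask] = e;
        tail.lazySet(currentTail + 1); // ordered store that publishes the element
        return true;
    }

    E poll() {
        final long currentHead = head.get();
        if (currentHead >= tailCache) {
            tailCache = tail.get(); // refresh the cached tail only when we appear empty
            if (currentHead >= tailCache) {
                return null;
            }
        }
        final int index = (int) currentHead & mask;
        final E e = buffer[index];
        buffer[index] = null;
        head.lazySet(currentHead + 1); // hands the slot back to the producer
        return e;
    }

    // Usage example: one producer thread sends 1..count, the caller consumes and sums.
    static long transferSum(final int count) {
        final OneToOneQueueSketch<Long> queue = new OneToOneQueueSketch<>(1024);
        final Thread producer = new Thread(() -> {
            for (long i = 1; i <= count; i++) {
                while (!queue.offer(i)) {
                    Thread.yield();
                }
            }
        });
        producer.start();
        long sum = 0;
        for (int received = 0; received < count; ) {
            final Long value = queue.poll();
            if (value == null) {
                Thread.yield();
            } else {
                sum += value;
                received++;
            }
        }
        return sum;
    }
}
```

The cached copies mean the producer reads the consumer's counter (and vice versa) only when the queue looks full or empty, which reduces the cache-coherence traffic between the two cores.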
Concurrent Queue Performance Results

                         Ops/Sec (Millions)   Mean Latency (ns)
LinkedBlockingQueue             4.3            ~32,000 / ~500
ArrayBlockingQueue              3.5            ~32,000 / ~600
ConcurrentLinkedQueue           13             NA / ~180
ConcurrentArrayQueue            13             NA / ~150
ConcurrentArrayQueue2           45             NA / ~120
ConcurrentArrayQueue3           150            NA / ~100

Note: None of these tests are run with thread affinity set, Sandy Bridge 2.4 GHz
Latency: Blocking - put() & take() / Non-Blocking - offer() & poll()
How Far Can We Go With Lock-Free Algorithms?
Further Adventures With Lock-Free Algorithms

•   State Machines
•   CAS operations
•   Wait-Free in addition to Lock-Free algorithms
•   Thread Affinity
•   x86 and busy spinning and back off
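
To illustrate the last bullet, here is a minimal sketch of a wait loop that busy spins first and then backs off. The class name and the thresholds (100 spins, then 100 yields, then 1 microsecond parks) are arbitrary assumptions for illustration, not tuned values from the talk.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.LockSupport;

final class BackoffSpin {
    // Waits until the flag becomes true and returns how many failed checks it took.
    static int awaitTrue(final AtomicBoolean flag) {
        int attempts = 0;
        while (!flag.get()) {
            if (attempts >= 200) {
                LockSupport.parkNanos(1_000L); // longest back-off: sleep briefly
            } else if (attempts >= 100) {
                Thread.yield();                // medium back-off: give up the time slice
            }
            // Below 100 attempts: pure busy spin for the lowest latency.
            attempts++;
        }
        return attempts;
    }
}
```

Busy spinning gives the lowest latency when the wait is short and a core can be dedicated to it; backing off trades some latency for less wasted CPU and less pressure on the shared cache line.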
Questions?


Blog: http://mechanical-sympathy.blogspot.com/
Code: https://github.com/mjpt777/examples
Twitter: @mjpt777

  “The most amazing achievement of the computer
  software industry is its continuing cancellation of the
  steady and staggering gains made by the computer
  hardware industry.”
                                          - Henry Petroski

Lock-Free Algorithms For Ultimate Performance

  • 1. Lock-Free Algorithms For Ultimate Performance Martin Thompson - @mjpt777
  • 2. Watch the video with slide synchronization on InfoQ.com! http://www.infoq.com/presentations /Lock-Free-Algorithms InfoQ.com: News & Community Site • 750,000 unique visitors/month • Published in 4 languages (English, Chinese, Japanese and Brazilian Portuguese) • Post content from our QCon conferences • News 15-20 / week • Articles 3-4 / week • Presentations (videos) 12-15 / week • Interviews 2-3 / week • Books 1 / month
  • 3. Presented at QCon San Francisco www.qconsf.com Purpose of QCon - to empower software development by facilitating the spread of knowledge and innovation Strategy - practitioner-driven conference designed for YOU: influencers of change and innovation in your teams - speakers and topics driving the evolution and innovation - connecting and catalyzing the influencers and innovators Highlights - attended by more than 12,000 delegates since 2007 - held in 9 cities worldwide
  • 4. Modern Hardware Overview
  • 5. Modern Hardware (Intel Sandy Bridge) C1 ... Cn Registers/Buffers <1ns C1 ... Cn L1 ... L1 ~3-4 cycles ~1ns L1 ... L1 L2 ... L2 ~10-12 cycles ~3ns L2 ... L2 L3 ~40-45 cycles ~15ns L3 MC MC QPI ~40ns DRAM DRAM DRAM DRAM ~65ns DRAM DRAM DRAM DRAM
  • 6.
  • 7. Memory Ordering Core 1 Core 2 Core n Registers Registers Execution Units Execution Units Store Buffer Load Buffer MOB MOB LF/WC LF/WC L1 L1 Buffers Buffers L2 L2 L3
  • 8. Cache Structure & Coherence MOB L0(I) – 1.5k µops 64-byte “Cache-lines” 128 bits 16 Bytes TLB LF/WC Pre-fetchers Buffers L1(D) - 32K L1(I) – 32K 256 bits 128 bits SRAM TLB Pre-fetchers L2 - 256K 32 Bytes Ring Bus QPI Bus QPI MESI+F State Model Memory Controller Memory Channels L3 – 8-20MB System Agent
  • 10. Hardware Memory Models Memory consistency models describe how threads may interact through shared memory consistently. • Program Order (PO) for a single thread • Sequential Consistency (SC) [Lamport 1979] > What you expect a program to do! (for race free) • Strict Consistency (Linearizability) > Some special instructions • Total Store Order (TSO) > Sparc model that is stronger than SC • x86/64 is TSO + (Total Lock Order & Causal Consistency) > http://www.youtube.com/watch?v=WUfvvFD5tAA • Other Processors can have weaker models
  • 11. Intel x86/64 Memory Model http://www.multicoreinfo.com/research/papers/2008/damp08-intel64.pdf http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3a- part-1-manual.html 1. Loads are not reordered with other loads. 2. Stores are not reordered with other stores. 3. Stores are not reordered with older loads. 4. Loads may be reordered with older stores to different locations but not with older stores to the same location. 5. In a multiprocessor system, memory ordering obeys causality (memory ordering respects transitive visibility). 6. In a multiprocessor system, stores to the same location have a total order. 7. In a multiprocessor system, locked instructions have a total order. 8. Loads and stores are not reordered with locked instructions.
  • 12. Language/Runtime Memory Models Some languages/Runtimes have a well defined memory model for portability: • Java Memory Model (Java 5) • C++ 11 • Erlang • Go For most other languages we are at the mercy of the compiler • Instruction reordering • C “volatile” is inadequate • Register allocation for caching values • No mapping to the hardware memory model • Fences/Barriers need to be applied
  • 13. Measuring What Is Going On
  • 14. Model Specific Registers (MSR) Many and varied uses • Timestamp Invariant Counter • Memory Type Range Registers Performance Counters!!! • L2/L3 Cache Hits/Misses • TLB Hits/Misses • QPI Transfer Rates • Instruction and Cycle Counts • Lots of others....
  • 15. Accessing MSRs void rdmsr(uint32_t msr, uint32_t* lo, uint32_t* hi) { asm volatile(“rdmsr” : “=a” lo, “=d” hi : “c” msr); } void wrmsr(uint32_t msr, uint32_t lo, uint32_t hi) { asm volatile(“wrmsr” :: “c” msr, “a” lo, “d” hi); }
  • 16. Accessing MSRs On Linux f = new RandomAccessFile(“/dev/cpu/0/msr”, “rw”); ch = f.getChannel(); buffer.order(ByteOrder.nativeOrder()); ch.read(buffer, msrNumber); long value = buffer.getLong(0);
  • 17. Accessing MSRs Made Easy! Intel VTune • http://software.intel.com/en-us/intel-vtune-amplifier-xe Linux “perf stat” • http://linux.die.net/man/1/perf-stat likwid - Lightweight Performance Counters • http://code.google.com/p/likwid/
  • 18. Biggest Performance Enemy “Contention!”
  • 19. Contention • Managing Contention > Locks > CAS Techniques • Little’s & Amdahl’s Laws > L = λW > Sequential Component Constraint • Single Writer Principle • Shared Nothing Designs
  • 20. Locks
  • 21. Software Locks • Mutex, Semaphore, Critical Section, etc. > What happens when un-contended? > What happens when contention occurs? > What if we need condition variables? > What are the cost of software locks? > Can they be optimised?
  • 22. Hardware Locks • Atomic Instructions > Compare And Swap/Set > Lock instructions on x86 – LOCK XADD is a bit special • Used to update sequences and pointers • What are the costs of these operations? • Guess how software locks are created? • TSX (Transactional Synchronization Extensions)
  • 23. Let’s Look At A Lock- Free Algorithm
  • 24. OneToOneQueue – Take 1 public final class OneToOneConcurrentArrayQueue<E> implements Queue<E> { private final E[] buffer; private volatile long tail = 0; private volatile long head = 0; public OneToOneConcurrentArrayQueue(int capacity) { buffer = (E[])new Object[capacity]; }
  • 25. OneToOneQueue – Take 1 public boolean offer(final E e) { final long currentTail = tail; final long wrapPoint = currentTail - buffer.length; if (head <= wrapPoint) { return false; } buffer[(int)(currentTail % buffer.length)] = e; tail = currentTail + 1; return true; }
  • 26. OneToOneQueue – Take 1 public E poll() { final long currentHead = head; if (currentHead >= tail) { return null; } final int index = (int)(currentHead % buffer.length); final E e = buffer[index]; buffer[index] = null; head = currentHead + 1; return e; }
  • 27. Concurrent Queue Performance Results Ops/Sec (Millions) Mean Latency (ns) LinkedBlockingQueue 4.3 ~32,000 / ~500 ArrayBlockingQueue 3.5 ~32,000 / ~600 ConcurrentLinkedQueue 13 NA / ~180 ConcurrentArrayQueue 13 NA / ~150 Note: None of these tests are run with thread affinity set, Sandy Bridge 2.4 GHz Latency: Blocking - put() & take() / Non-Blocking - offer() & poll()
  • 29. Mechanical Sympathy In Action Knowing the cost of operations • Remainder Operation • Volatile writes and lock instructions Why so many cache misses? • False Sharing • Algorithm Opportunities ˃ “Smart Batching” • Memory layout
  • 31. Signalling // Lock pthread_mutex_lock(&lock); sequence = i; pthread_cond_signal(&condition); pthread_mutex_unlock(&lock); // Soft Barrier asm volatile(“” ::: “memory”); sequence = i; // Fence asm volatile(“” ::: “memory”); sequence = i; asm volatile(“lock addl $0x0,(%rsp)”);
  • 32. Signalling Costs
                          Lock       Fence     Soft
      Million Ops/Sec     9.4        45.7      108.1
      L2 Hit Ratio        17.26      28.17     13.32
      L3 Hit Ratio        0.78       29.60     27.99
      Instructions        12846 M    906 M     801 M
      CPU Cycles          28278 M    5808 M    1475 M
      Ins/Cycle           0.45       0.16      0.54
  • 33. OneToOneQueue – Take 2
      public final class OneToOneConcurrentArrayQueue2<E> implements Queue<E> {
          private final int mask;
          private final E[] buffer;
          private final AtomicLong tail = new AtomicLong(0);
          private final AtomicLong head = new AtomicLong(0);

          public OneToOneConcurrentArrayQueue2(int capacity) {
              capacity = findNextPositivePowerOfTwo(capacity);
              mask = capacity - 1;
              buffer = (E[])new Object[capacity];
          }
  • 34. OneToOneQueue – Take 2
      public boolean offer(final E e) {
          final long currentTail = tail.get();
          final long wrapPoint = currentTail - buffer.length;
          if (head.get() <= wrapPoint) {
              return false;
          }
          buffer[(int)currentTail & mask] = e;
          tail.lazySet(currentTail + 1);
          return true;
      }
  • 35. OneToOneQueue – Take 2
      public E poll() {
          final long currentHead = head.get();
          if (currentHead >= tail.get()) {
              return null;
          }
          final int index = (int)currentHead & mask;
          final E e = buffer[index];
          buffer[index] = null;
          head.lazySet(currentHead + 1);
          return e;
      }
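Take 2 replaces the volatile writes of Take 1 with `AtomicLong.lazySet()`, an ordered store that keeps earlier writes ahead of it without the full fence a volatile write implies. The sketch below isolates that publish pattern, assuming a single writer: a plain store of the payload followed by `lazySet()` of a counter, with a reader that polls the counter using `get()`. The names `LazySetPublishDemo` and `publishAndConsume` are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical demo of single-writer publication with lazySet. */
final class LazySetPublishDemo {
    /** Writer publishes messages one by one; reader sums them. Returns the sum. */
    static long publishAndConsume(final int messages) {
        final long[] payload = new long[messages];
        final AtomicLong published = new AtomicLong(-1); // index of last published slot

        final Thread writer = new Thread(() -> {
            for (int i = 0; i < messages; i++) {
                payload[i] = i * 2L;   // plain store of the data
                published.lazySet(i);  // ordered store: the data is visible before the index
            }
        });
        writer.start();

        long sum = 0;
        for (int i = 0; i < messages; i++) {
            while (published.get() < i) {
                Thread.yield(); // slot i has not been published yet
            }
            sum += payload[i];
        }
        try {
            writer.join();
        } catch (InterruptedException ex) {
            throw new IllegalStateException(ex);
        }
        return sum;
    }

    public static void main(final String[] args) {
        System.out.println(publishAndConsume(1_000_000));
    }
}
```

This is only safe because one thread writes the counter. With several writers, a CAS or fetch-and-add would be needed to claim slots.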
  • 36. Concurrent Queue Performance Results
                                Ops/Sec (Millions)   Mean Latency (ns)
      LinkedBlockingQueue       4.3                  ~32,000 / ~500
      ArrayBlockingQueue        3.5                  ~32,000 / ~600
      ConcurrentLinkedQueue     13                   NA / ~180
      ConcurrentArrayQueue      13                   NA / ~150
      ConcurrentArrayQueue2     45                   NA / ~120
      Note: None of these tests are run with thread affinity set, Sandy Bridge 2.4 GHz
      Latency: Blocking - put() & take() / Non-Blocking - offer() & poll()
  • 38. False Sharing and Cache Lines
      [Figure: 64-byte cache lines. Unpadded: *address1 (thread a) and *address2 (thread b). Padded: *address1 (thread a) and *address2 (thread b).]
  • 39. False Sharing Testing
      int64_t* address = seq->address;
      for (int i = 0; i < ITERATIONS; i++) {
          int64_t value = *address;
          ++value;
          *address = value;
          asm volatile("lock addl $0x0,(%rsp)");
      }
  • 40. False Sharing Test Results
                            Unpadded    Padded
      Million Ops/sec       12.4        104.9
      L2 Hit Ratio          1.16%       23.05%
      L3 Hit Ratio          2.51%       39.18%
      Instructions          4559 M      4508 M
      CPU Cycles            63480 M     7551 M
      Ins/Cycle Ratio       0.07        0.60
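In Java, padding of the kind measured above is often approximated by adding unused `long` fields after the hot value so that two counters are less likely to land on the same 64-byte cache line. The sketch below shows one possible shape for the `PaddedAtomicLong` that Take 3 uses; its body does not appear on the slides, so this version is an assumption. The harness mirrors the read, increment, write loop of the test slide with one writer per counter. Field layout is ultimately up to the JVM, so padding like this is a heuristic rather than a guarantee.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical demo; the PaddedAtomicLong body is an assumption, not the talk's code. */
final class PaddingDemo {
    /** AtomicLong followed by six padding longs (object header + value + padding >= 64 bytes). */
    static class PaddedAtomicLong extends AtomicLong {
        public volatile long p1, p2, p3, p4, p5, p6 = 7L;

        PaddedAtomicLong(final long initialValue) {
            super(initialValue);
        }

        /** Reads the padding so the fields are not trivially unused. */
        public long sumPadding() {
            return p1 + p2 + p3 + p4 + p5 + p6;
        }
    }

    /** Two threads, each the only writer of its own counter; returns the combined total. */
    static long runTwoWriters(final int iterations) {
        final AtomicLong a = new PaddedAtomicLong(0);
        final AtomicLong b = new PaddedAtomicLong(0);
        final Thread t1 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                a.set(a.get() + 1); // single writer, so read-modify-write needs no CAS
            }
        });
        final Thread t2 = new Thread(() -> {
            for (int i = 0; i < iterations; i++) {
                b.set(b.get() + 1);
            }
        });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException ex) {
            throw new IllegalStateException(ex);
        }
        return a.get() + b.get();
    }

    public static void main(final String[] args) {
        final long start = System.nanoTime();
        final long total = runTwoWriters(10_000_000);
        System.out.println(total + " updates in " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}
```

Swapping `PaddedAtomicLong` for a plain `AtomicLong` in `runTwoWriters` gives an unpadded variant to time against, although whether the two counters actually share a line in that case depends on where the JVM allocates them.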
  • 41. OneToOneQueue – Take 3
      public final class OneToOneConcurrentArrayQueue3<E> implements Queue<E> {
          private final int capacity;
          private final int mask;
          private final E[] buffer;
          private final AtomicLong tail = new PaddedAtomicLong(0);
          private final AtomicLong head = new PaddedAtomicLong(0);

          public static class PaddedLong {
              public long value = 0, p1, p2, p3, p4, p5, p6;
          }

          private final PaddedLong tailCache = new PaddedLong();
          private final PaddedLong headCache = new PaddedLong();
  • 42. OneToOneQueue – Take 3
      public boolean offer(final E e) {
          final long currentTail = tail.get();
          final long wrapPoint = currentTail - capacity;
          if (headCache.value <= wrapPoint) {
              headCache.value = head.get();
              if (headCache.value <= wrapPoint) {
                  return false;
              }
          }
          buffer[(int)currentTail & mask] = e;
          tail.lazySet(currentTail + 1);
          return true;
      }
  • 43. OneToOneQueue – Take 3
      public E poll() {
          final long currentHead = head.get();
          if (currentHead >= tailCache.value) {
              tailCache.value = tail.get();
              if (currentHead >= tailCache.value) {
                  return null;
              }
          }
          final int index = (int)currentHead & mask;
          final E e = buffer[index];
          buffer[index] = null;
          head.lazySet(currentHead + 1);
          return e;
      }
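The effect of `headCache` and `tailCache` in Take 3 is that each side re-reads the other side's shared counter only when its cached copy says the queue looks full or empty. The sketch below makes that visible with a single-threaded script and an instrumentation counter. It uses a fixed capacity of 8, plain `AtomicLong` instead of the padded variant, and a hypothetical `sharedHeadReads` field that is not part of the original design.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, simplified Take 3 that counts reads of the shared head counter. */
final class CachedHeadDemo {
    private final int capacity = 8;
    private final int mask = capacity - 1;
    private final Object[] buffer = new Object[capacity];
    private final AtomicLong tail = new AtomicLong(0);
    private final AtomicLong head = new AtomicLong(0);
    private long headCache = 0; // producer's last known head
    private long tailCache = 0; // consumer's last known tail
    long sharedHeadReads = 0;   // instrumentation only (assumption, not in the slides)

    boolean offer(final Object e) {
        final long currentTail = tail.get();
        final long wrapPoint = currentTail - capacity;
        if (headCache <= wrapPoint) {
            headCache = head.get(); // refresh only when the cache says "full"
            sharedHeadReads++;
            if (headCache <= wrapPoint) {
                return false;
            }
        }
        buffer[(int) currentTail & mask] = e;
        tail.lazySet(currentTail + 1);
        return true;
    }

    Object poll() {
        final long currentHead = head.get();
        if (currentHead >= tailCache) {
            tailCache = tail.get(); // refresh only when the cache says "empty"
            if (currentHead >= tailCache) {
                return null;
            }
        }
        final int index = (int) currentHead & mask;
        final Object e = buffer[index];
        buffer[index] = null;
        head.lazySet(currentHead + 1);
        return e;
    }

    /** Fill, overflow once, drain, fill again; returns how often head was re-read. */
    static long countHeadReads() {
        final CachedHeadDemo queue = new CachedHeadDemo();
        for (int i = 0; i < 8; i++) {
            if (!queue.offer(i)) throw new IllegalStateException("unexpected full");
        }
        if (queue.offer(8)) throw new IllegalStateException("expected full");
        for (int i = 0; i < 8; i++) {
            if (!Integer.valueOf(i).equals(queue.poll())) throw new IllegalStateException("order");
        }
        for (int i = 0; i < 8; i++) {
            if (!queue.offer(i)) throw new IllegalStateException("unexpected full");
        }
        return queue.sharedHeadReads;
    }

    public static void main(final String[] args) {
        System.out.println(countHeadReads());
    }
}
```

In this script the producer makes 17 `offer()` calls but reads the shared head only twice: once for the failed offer on a full queue, and once for the first offer after the drain. Take 2 would read it on every call.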
  • 44. Concurrent Queue Performance Results
                                Ops/Sec (Millions)   Mean Latency (ns)
      LinkedBlockingQueue       4.3                  ~32,000 / ~500
      ArrayBlockingQueue        3.5                  ~32,000 / ~600
      ConcurrentLinkedQueue     13                   NA / ~180
      ConcurrentArrayQueue      13                   NA / ~150
      ConcurrentArrayQueue2     45                   NA / ~120
      ConcurrentArrayQueue3     150                  NA / ~100
      Note: None of these tests are run with thread affinity set, Sandy Bridge 2.4 GHz
      Latency: Blocking - put() & take() / Non-Blocking - offer() & poll()
  • 45. How Far Can We Go With Lock Free Algorithms?
  • 46. Further Adventures With Lock-Free Algorithms
      • State Machines
      • CAS operations
      • Wait-Free in addition to Lock-Free algorithms
      • Thread Affinity
      • x86 and busy spinning and back off
  • 47. Questions?
      Blog: http://mechanical-sympathy.blogspot.com/
      Code: https://github.com/mjpt777/examples
      Twitter: @mjpt777
      “The most amazing achievement of the computer software industry is its continuing cancellation of the steady and staggering gains made by the computer hardware industry.” - Henry Petroski