Writing High-Performance Software
arvid@bittorrent.com
Performance ⟺ Longer Battery Life
(Not only for when things need to run fast)
Memory Cache
Typical memory cache hierarchy (Core i5 Sandy Bridge)

Main memory (16 GiB)
L3 (6 MiB), shared by all cores
L2 (256 kiB), one per core
L1 (32 kiB), one per core
Core 1 | Core 2 | Core 3 | Core 4

BitTorrent, Inc. | Writing High-Performance Software

For Internal Presentation Purposes Only, Not For External Distribution .
Memory Latency
Memory latencies Core i5 Sandy Bridge

[Bar chart, built up over five slides: access latency of register, L1 cache, L2 cache, L3 cache and DRAM. The time axis is rescaled as each level is added: 0 to 0.5 ns for the register alone, 0 to 1.3 ns with L1, 0 to 4 ns with L2, 0 to 15 ns with L3 and 0 to 90 ns with DRAM. The final slide carries the annotation "61.8 x".]

http://www.7-cpu.com/cpu/IvyBridge.html

Memory Latency

• When a CPU is waiting for memory, it is busy (i.e. you will see 100% CPU
usage, even if your bottleneck is waiting for memory)
• Memory access patterns are a significant factor in performance
• Constant cache misses make your program run up to 2 orders of
magnitude slower than constant cache hits

Memory Cache

[Diagram: "the memory you requested" shown as a small region inside a cache line; the whole cache line is "the memory pulled into the cache"]

Memory Latency

• CPUs prefetch memory automatically if they can recognize your access
pattern (sequential is easy)
• CPUs predict branches in order to prefetch instruction memory
• Memory access pattern is not only determined by data access but also
by control flow (indirect jumps stall execution on a memory lookup)

Memory Cache
For linear memory reads, the CPU will pre-fetch memory

[Diagram: a row of eight 64-byte cache lines]

Memory Cache
For random memory reads, there is no pre-fetch and most memory accesses will cause a cache miss

[Diagram: a row of eight 64-byte cache lines]

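The difference between linear and random reads can be measured with a small experiment. The following is a minimal sketch, not something from the slides: the function names (`sum_in_order`, `sequential_order`, `random_order`, `traversal_time_us`) are made up for illustration. Both traversals touch every element exactly once and do the same arithmetic; only the order of the memory accesses differs.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Visit every element of `data` once, in the order given by `order`, and
// return the sum. The work is the same for any permutation of the indices;
// only the memory access pattern changes.
std::uint64_t sum_in_order(std::vector<std::uint32_t> const& data,
                           std::vector<std::uint32_t> const& order) {
    std::uint64_t sum = 0;
    for (std::uint32_t idx : order) sum += data[idx];
    return sum;
}

// Indices 0, 1, 2, ... n-1: a linear walk that the prefetcher can follow.
std::vector<std::uint32_t> sequential_order(std::size_t n) {
    std::vector<std::uint32_t> order(n);
    std::iota(order.begin(), order.end(), 0u);
    return order;
}

// The same indices, shuffled: consecutive reads usually land on
// different cache lines.
std::vector<std::uint32_t> random_order(std::size_t n, unsigned seed) {
    std::vector<std::uint32_t> order = sequential_order(n);
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}

// Wall-clock time of one traversal in microseconds; the sum is written
// to `sum_out` so the compiler cannot discard the loop.
long long traversal_time_us(std::vector<std::uint32_t> const& data,
                            std::vector<std::uint32_t> const& order,
                            std::uint64_t& sum_out) {
    auto const start = std::chrono::steady_clock::now();
    sum_out = sum_in_order(data, order);
    auto const stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
        .count();
}
```

To see an effect, `data` should be much larger than the L3 cache (for example 64 Mi elements). How large the gap is depends on the machine, so no numbers are claimed here.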
Data Structures

• Array of pointers to objects and linked lists:
  more cache pressure / cache misses
• Array of objects:
  less cache pressure / cache hits

Data Structures

• One optimization is to refactor your single list of heterogeneous objects
into one list per type.
• Objects would be laid out sequentially in memory
• Virtual function dispatch could become static

Data Structures

std::vector<std::unique_ptr<shape>> shapes;
for (auto& s : shapes) s->draw();
// pointers need dereferencing + vtable lookup

std::vector<rectangle> rectangles;
std::vector<circle> circles;
for (auto& s : rectangles) s.draw();
for (auto& s : circles) s.draw();
// objects packed back-to-back, sequential memory access, no vtable lookup

Data Structures

• For heap allocated objects, put the most commonly used (“hot”) fields in
the first cache line
• Avoid unnecessary padding

Data Structures

Padding

struct A {
    int a;
    void* b;
    int c;
};

struct A [24 Bytes]
0: [int : 4] a
--- 4 Bytes padding ---
8: [void* : 8] b
16: [int : 4] c
--- 4 Bytes padding ---

struct B {
    void* b;
    int a;
    int c;
};

struct B [16 Bytes]
0: [void* : 8] b
8: [int : 4] a
12: [int : 4] c

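The layouts above can be checked by the compiler instead of by hand. A minimal sketch: the helper names `padding_of_A` and `padding_of_B` are invented here, and the 24-byte and 16-byte figures only hold on platforms with 8-byte pointers and 4-byte `int`, as on the slide.

```cpp
#include <cstddef>

struct A {
    int a;
    void* b;
    int c;
};

struct B {
    void* b;
    int a;
    int c;
};

// Bytes lost to padding: total size minus the sum of the member sizes.
constexpr std::size_t padding_of_A() {
    return sizeof(A) - (sizeof(int) + sizeof(void*) + sizeof(int));
}

constexpr std::size_t padding_of_B() {
    return sizeof(B) - (sizeof(void*) + sizeof(int) + sizeof(int));
}

// Putting the pointer first never makes the struct larger.
static_assert(sizeof(B) <= sizeof(A), "reordering should not add padding");

// The reordered struct keeps its first member at offset 0.
static_assert(offsetof(B, b) == 0, "b is the first member of B");
```

A `static_assert` on `sizeof` next to a hot structure is a cheap way to notice when a later change reintroduces padding.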
Context Switching

• One significant source of cache misses is switching context, and
switching the data set being worked on
• Context switch
  • Thread / process switching
  • User space -> kernel space

Context Switching

• Lower the cost of context switching by amortizing it over as much work
as possible
• Reduce the number of system calls by passing as much work as
possible per call
• Reduce thread wake-ups/sleeps by batching work

Context Switching

When a thread wakes up, do as much work as possible before going to sleep:
• Drain the socket of received bytes
• Drain the job queue

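Draining the job queue can be sketched as follows. This is one possible shape, assuming a mutex-protected queue of `std::function` jobs; the class name `job_queue` and its interface are invented for the example. The point is that the lock is taken once per batch instead of once per job.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

class job_queue {
public:
    using job = std::function<void()>;

    // Called by producers, possibly from other threads.
    void post(job j) {
        std::lock_guard<std::mutex> l(m_mutex);
        m_jobs.push_back(std::move(j));
    }

    // Run every job that is queued right now and return how many ran.
    // The whole batch is swapped out under a single lock acquisition,
    // and the jobs run without holding the lock.
    std::size_t drain() {
        std::vector<job> batch;
        {
            std::lock_guard<std::mutex> l(m_mutex);
            batch.swap(m_jobs);
        }
        for (auto& j : batch) j();
        return batch.size();
    }

private:
    std::mutex m_mutex;
    std::vector<job> m_jobs;
};
```

A worker thread would call `drain()` in a loop and only go back to sleep once it returns 0. Jobs posted while a batch is running are picked up by the next call.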
Context Switching (traffic analogy)

[Animated illustration in two parts: "One car at a time", then "The whole queue at a time"]

Context Switching

• Every time the socket becomes readable, read and handle one request

buf = socket.read_one_request()
req = parse_request(buf)
handle_req(socket, req)

Context Switching

• Drain the socket each time it becomes readable
• Parse and handle each request that was received

buf.append(socket.read_all())
req, buf = parse_request(buf)
while req is not None:
    handle_req(socket, req)
    req, buf = parse_request(buf)

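The drain-and-parse loop above translates to C++ roughly as follows. This is a sketch under a stated assumption: requests are framed as lines terminated by '\n', which the slides do not specify. `parse_request` mirrors the pseudocode; `handle_readable` is a made-up name for the per-event entry point.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Assumed framing: one request per '\n'-terminated line.
// Removes and returns the first complete request in `buf`, or returns
// nullopt (leaving `buf` untouched) if no complete request is buffered.
std::optional<std::string> parse_request(std::string& buf) {
    std::size_t const end = buf.find('\n');
    if (end == std::string::npos) return std::nullopt;
    std::string req = buf.substr(0, end);
    buf.erase(0, end + 1);
    return req;
}

// Called once per readable event with everything that was read from the
// socket. Handles every complete request in one go; a trailing partial
// request stays in `buf` until more bytes arrive.
std::vector<std::string> handle_readable(std::string& buf,
                                         std::string const& received) {
    buf += received;
    std::vector<std::string> handled;
    while (auto req = parse_request(buf)) {
        handled.push_back(std::move(*req));
    }
    return handled;
}
```

For example, feeding `"GET a\nGET b\nGE"` handles two requests and leaves `"GE"` buffered; a later `"T c\n"` completes the third.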
Context Switching

• Write all responses in a single call at the end

buf.append(socket.read_all())
socket.cork()  # don't flush buffer to socket until all messages are handled
req, buf = parse_request(buf)
while req is not None:
    handle_req(socket, req)
    req, buf = parse_request(buf)
socket.uncork()

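What `cork()` and `uncork()` might do can be sketched in user space. Everything here is hypothetical: `fake_socket` is a stand-in that only counts writes, and `corked_writer` is an invented wrapper, not the API used on the slide. On Linux the `TCP_CORK` socket option offers a comparable effect inside the kernel.

```cpp
#include <cstddef>
#include <string>

// Stand-in for a real socket: it only records what reached the "wire"
// and how many write calls it took.
struct fake_socket {
    std::size_t write_calls = 0;
    std::string wire;
    void write(std::string const& data) {
        ++write_calls;
        wire += data;
    }
};

class corked_writer {
public:
    explicit corked_writer(fake_socket& s) : m_socket(s) {}

    // From now on, responses are accumulated instead of written.
    void cork() { m_corked = true; }

    void send(std::string const& data) {
        if (m_corked) m_pending += data;
        else m_socket.write(data);
    }

    // Write everything accumulated since cork() in a single call.
    void uncork() {
        m_corked = false;
        if (m_pending.empty()) return;
        m_socket.write(m_pending);
        m_pending.clear();
    }

private:
    fake_socket& m_socket;
    std::string m_pending;
    bool m_corked = false;
};
```

Three responses sent between `cork()` and `uncork()` reach the socket as one write, in the order they were produced.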
Socket Programming

• There are two ways to read from sockets
  • Wait for readable event then read (POSIX)
  • Read async. then wait for completion event (Win32)

Socket Programming

// wait for the socket to become readable
struct kevent ev[100];
int events = kevent(queue, nullptr, 0, ev, 100, nullptr);
for (int i = 0; i < events; ++i) {
    // copy data from kernel space to user space
    int size = read(ev[i].ident, buffer, buffer_size);
    /* ... */
}

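The readiness model can be combined with the earlier advice to drain the socket. The sketch below swaps in POSIX `poll()` for `kevent()` so that it also builds outside the BSDs; the function names `set_non_blocking` and `wait_and_drain` and the 4096-byte chunk size are assumptions made for the example.

```cpp
#include <cerrno>
#include <cstddef>
#include <string>

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

// Put a descriptor into non-blocking mode so that read() reports
// "nothing left" instead of putting the thread to sleep.
bool set_non_blocking(int fd) {
    int const flags = fcntl(fd, F_GETFL, 0);
    return flags != -1 && fcntl(fd, F_SETFL, flags | O_NONBLOCK) != -1;
}

// Wait until `fd` is readable, then keep reading until the kernel has
// no more data. `fd` must be non-blocking. Appends the bytes to `out`.
// Returns false on error or timeout.
bool wait_and_drain(int fd, std::string& out, int timeout_ms) {
    pollfd p{};
    p.fd = fd;
    p.events = POLLIN;
    int const ready = poll(&p, 1, timeout_ms);
    if (ready <= 0) return false;  // error or timeout

    char chunk[4096];  // arbitrary chunk size chosen for this sketch
    for (;;) {
        ssize_t const n = read(fd, chunk, sizeof(chunk));
        if (n > 0) {
            out.append(chunk, static_cast<std::size_t>(n));
            continue;
        }
        if (n == 0) return true;       // peer closed the connection
        if (errno == EINTR) continue;  // interrupted, retry
        // EAGAIN / EWOULDBLOCK means the socket has been drained
        return errno == EAGAIN || errno == EWOULDBLOCK;
    }
}
```

One wake-up therefore costs one `poll()` call plus as many `read()` calls as it takes to empty the kernel buffer, instead of one wake-up per request.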
Socket Programming

// initiate async. read into buffer
WSABUF b = { buffer_size, buffer };
DWORD transferred = 0, flags = 0;
WSAOVERLAPPED ov; // [ initialization ]
int ret = WSARecv(s, &b, 1, &transferred, &flags, &ov, nullptr);

// wait for operations to complete
WSAOVERLAPPED* ol;
ULONG_PTR key;
BOOL r = GetQueuedCompletionStatus(port, &transferred, &key, &ol, INFINITE);

// query status
ret = WSAGetOverlappedResult(s, &ov, &transferred, FALSE, &flags);

Socket Programming

• Passing in a buffer up-front is preferable because:
  • NIC driver can in theory receive data directly into your buffer and
    save a copy
  • If there is a memory copy, it can be done asynchronously, not
    blocking your thread

Socket Programming

• Problem: What buffer size should be used?
  • Too large will waste memory
  • Too small will waste system calls
    (since we need multiple calls to drain the socket)

Socket Programming

• Problem: What buffer size should be used?
  • Start with some reasonable buffer size
  • If an async read fills the whole buffer, increase size
  • If an async read returns significantly less than the buffer size,
    decrease size
  • Size adjustments should be proportional to the buffer size

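The sizing rule above fits in a few lines. This is a sketch with assumptions spelled out: "significantly less" is taken to mean less than half the buffer, the adjustment is a factor of two in either direction, and the lower and upper bounds in `buffer_policy` are placeholders. None of these values come from the slides.

```cpp
#include <algorithm>
#include <cstddef>

// Bounds for the adaptive buffer. Both values are assumptions made for
// this sketch; a real program would derive them from its own workload.
struct buffer_policy {
    std::size_t min_size = 1024;
    std::size_t max_size = 1024 * 1024;
};

// Given the buffer size used for the last read and the number of bytes
// that read returned, pick the size for the next read.
// Adjustments are proportional to the current size (double or halve).
std::size_t next_buffer_size(std::size_t current, std::size_t bytes_read,
                             buffer_policy const& p = {}) {
    // The read filled the whole buffer: there was probably more to read.
    if (bytes_read >= current) return std::min(current * 2, p.max_size);

    // The read used less than half of the buffer (the assumed meaning of
    // "significantly less"): shrink, but never below the minimum.
    if (bytes_read < current / 2) return std::max(current / 2, p.min_size);

    // Otherwise the size is about right.
    return current;
}
```

With a 4096-byte buffer, a read of 4096 bytes grows the next buffer to 8192, a read of 1000 bytes shrinks it to 2048, and a read of 3000 bytes leaves it unchanged.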
Context Switching

Adapt batch size to the computer’s natural granularity
Higher load should lead to larger batches, fewer context switches and higher efficiency.
Use of magic numbers is a red flag

Message Queues

• Events on message queues may come in batches
• Example: we receive one message per 16 kiB block read from disk.

void conn::on_disk_read(buffer const& buf) {
    m_socket.write(&buf[0], buf.size());
}

Message Queues

• Problem: We want to flush our sockets right before we go to sleep, i.e.
when we have drained the message queue, without starvation

Message Queues

void conn::on_disk_read(buffer const& buf) {
    // instead of writing to the socket, accumulate the buffers
    m_buf.insert(m_buf.end(), &buf[0], &buf[0] + buf.size());
    // if there is no outstanding flush message, post one
    if (m_has_flush_msg) return;
    m_has_flush_msg = true;
    m_queue.post(std::bind(&conn::flush, this));
}

// flush all buffers when all messages have been handled
void conn::flush() {
    m_socket.write(&m_buf[0], m_buf.size());
    m_buf.clear();
    m_has_flush_msg = false;
}

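The flush-message pattern can be tried out in isolation. The sketch below is self-contained and rests on stand-ins that are not from the slides: `message_queue` is a minimal single-threaded FIFO, and `counting_socket` only counts calls and bytes in place of a real socket.

```cpp
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Minimal single-threaded FIFO message queue (stand-in for m_queue).
class message_queue {
public:
    void post(std::function<void()> msg) { m_msgs.push_back(std::move(msg)); }

    // Run messages in FIFO order until the queue is empty, including
    // messages posted by the handlers themselves.
    void run() {
        while (!m_msgs.empty()) {
            std::function<void()> msg = std::move(m_msgs.front());
            m_msgs.pop_front();
            msg();
        }
    }

private:
    std::deque<std::function<void()>> m_msgs;
};

// Stand-in for a socket: counts write calls and bytes.
struct counting_socket {
    std::size_t write_calls = 0;
    std::size_t bytes_written = 0;
    void write(char const*, std::size_t len) {
        ++write_calls;
        bytes_written += len;
    }
};

class conn {
public:
    conn(message_queue& q, counting_socket& s) : m_queue(q), m_socket(s) {}

    void on_disk_read(std::vector<char> const& buf) {
        // Accumulate instead of writing to the socket right away.
        m_buf.insert(m_buf.end(), buf.begin(), buf.end());
        // Post a flush message unless one is already in the queue.
        if (m_has_flush_msg) return;
        m_has_flush_msg = true;
        m_queue.post([this] { flush(); });
    }

private:
    // Runs behind every message that was queued when it was posted.
    void flush() {
        m_has_flush_msg = false;
        if (m_buf.empty()) return;
        m_socket.write(m_buf.data(), m_buf.size());
        m_buf.clear();
    }

    message_queue& m_queue;
    counting_socket& m_socket;
    std::vector<char> m_buf;
    bool m_has_flush_msg = false;
};
```

If three 16 kiB disk-read messages are queued before the loop runs, the flush message lands behind them, so the socket sees one write of 48 kiB. Because the flush is an ordinary message in the FIFO, other connections' messages are not starved.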
Message Queues

[Diagram: a FIFO message queue feeding the message handler, with the flush message marked in the queue]

Contact
arvid@bittorrent.com

Genislab builds better products and faster go-to-market with Lean project man...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
Arizona Broadband Policy Past, Present, and Future Presentation 3/25/24
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 

Writing High-Performance Software by Arvid Norberg

  • 2. Performance ⟺ Longer Battery Life (Not only for when things need to run fast)
  • 3. Memory Cache Typical memory cache hierarchy (Core i5 Sandy Bridge) Main memory (16 GiB) L3 (6 MiB) L2 (256 kiB) L2 (256 kiB) L2 (256 kiB) L2 (256 kiB) L1 (32 kiB) L1 (32 kiB) L1 (32 kiB) L1 (32 kiB) Core 1 Core 2 Core 3 Core 4 BitTorrent, Inc. | Writing High-Performance Software For Internal Presentation Purposes Only, Not For External Distribution .
  • 4-8. Memory Latency Memory latencies, Core i5 Sandy Bridge [bar chart built up over five slides: register, L1 cache, L2 cache, L3 cache, DRAM; horizontal axis from 0 ns to 90 ns on the final slide; annotation on the final slide: 61.8 x] http://www.7-cpu.com/cpu/IvyBridge.html
  • 9. Memory Latency • When a CPU is waiting for memory, it is busy (i.e. you will see 100% CPU usage, even if your bottleneck is waiting for memory) • Memory access patterns are a significant factor in performance • Constant cache misses make your program run up to 2 orders of magnitude slower than constant cache hits
  • 10. Memory Cache [diagram labels: "The memory you requested"; "cache line"; "The memory pulled into the cache"]
  • 11. Memory Latency • CPUs prefetch memory automatically if they can recognize your access pattern (sequential is easy) • CPUs predict branches in order to prefetch instruction memory • The memory access pattern is determined not only by data access but also by control flow (indirect jumps stall execution on a memory lookup)
  • 12-13. Memory Cache For linear memory reads, the CPU will pre-fetch memory [diagram: a row of eight 64-byte cache lines]
  • 14-15. Memory Cache For random memory reads, there is no pre-fetch and most memory accesses will cause a cache miss [diagram: a row of eight 64-byte cache lines]
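The difference between linear and random reads can be tried out with a small experiment. The following is a minimal sketch, not code from the talk: `sum_in_order` and `make_order` are hypothetical helper names, and the fixed seed is an arbitrary choice. Both variants do the same arithmetic and differ only in the order in which `data` is visited.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Sum `data`, visiting its elements in the order given by `order`.
// With order = 0, 1, 2, ... the reads of `data` are linear; with a
// shuffled order they jump around the array.
std::uint64_t sum_in_order(std::vector<std::uint32_t> const& data,
                           std::vector<std::size_t> const& order)
{
    assert(data.size() == order.size());
    std::uint64_t sum = 0;
    for (std::size_t idx : order) sum += data[idx];
    return sum;
}

// Build the index sequence 0..n-1, optionally shuffled. The seed is
// fixed (an arbitrary assumption) so that runs are repeatable.
std::vector<std::size_t> make_order(std::size_t n, bool shuffled)
{
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t(0));
    if (shuffled) {
        std::mt19937 rng(42);
        std::shuffle(order.begin(), order.end(), rng);
    }
    return order;
}
```

To see the effect, time both calls (for example with `std::chrono::steady_clock`) on an array that is much larger than the last-level cache. The results are identical; only the running time should differ, by an amount that depends on the machine.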
  • 17. Data Structures • Array of pointers to objects, and linked lists: more cache pressure / cache misses • Array of objects: less cache pressure / cache hits
  • 18. Data Structures • One optimization is to refactor your single list of heterogeneous objects into one list per type. • Objects would be laid out sequentially in memory • Virtual function dispatch could become static
  • 19-22. Data Structures
      // Pointers need dereferencing + vtable lookup
      std::vector<std::unique_ptr<shape>> shapes;
      for (auto& s : shapes) s->draw();
      // Objects packed back-to-back, sequential memory access, no vtable lookup
      std::vector<rectangle> rectangles;
      std::vector<circle> circles;
      for (auto& s : rectangles) s.draw();
      for (auto& s : circles) s.draw();
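The two layouts from the slides can be put side by side in a compilable form. This is a sketch under assumptions: the slides do not define `shape`, `rectangle` or `circle`, so the definitions below are hypothetical, and `area()` stands in for `draw()` so that the result can be checked.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical class hierarchy; the slides only name the types.
struct shape {
    virtual ~shape() = default;
    virtual double area() const = 0;
};

struct rectangle final : shape {
    double w, h;
    rectangle(double w_, double h_) : w(w_), h(h_) {}
    double area() const override { return w * h; }
};

struct circle final : shape {
    double r;
    explicit circle(double r_) : r(r_) {}
    double area() const override { return 3.141592653589793 * r * r; }
};

// One heterogeneous list: every element is a separate heap allocation,
// reached through a pointer, and every call goes through the vtable.
double total_area(std::vector<std::unique_ptr<shape>> const& shapes)
{
    double sum = 0.0;
    for (auto const& s : shapes) sum += s->area();
    return sum;
}

// One list per type: objects are stored contiguously, and because the
// static type is known and `final`, the compiler can resolve the calls
// statically.
double total_area(std::vector<rectangle> const& rectangles,
                  std::vector<circle> const& circles)
{
    double sum = 0.0;
    for (auto const& s : rectangles) sum += s.area();
    for (auto const& s : circles) sum += s.area();
    return sum;
}
```

Both overloads compute the same total. The per-type version gives up the original ordering of the elements, which is acceptable only when the order of processing does not matter.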
  • 23. Data Structures • For heap-allocated objects, put the most commonly used (“hot”) fields in the first cache line • Avoid unnecessary padding
  • 24. Data Structures Padding
      struct A { int a; void* b; int c; };
      struct A [24 Bytes]
         0: [int : 4] a
            --- 4 Bytes padding ---
         8: [void* : 8] b
        16: [int : 4] c
            --- 4 Bytes padding ---
      struct B { void* b; int a; int c; };
      struct B [16 Bytes]
         0: [void* : 8] b
         8: [int : 4] a
        12: [int : 4] c
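The layout figures on the padding slide can be checked with `sizeof` and `offsetof`. This is a small sketch; the 24-byte and 16-byte sizes assume a typical 64-bit platform with 4-byte `int` and 8-byte pointers, so the code computes the padding instead of hard-coding it. The names `payload`, `padding_a` and `padding_b` are introduced here for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Field order from the slide: the pointer sits between two ints, so the
// compiler has to pad after `a` and after `c` to keep `b` aligned.
struct A { int a; void* b; int c; };

// Same fields, largest alignment first: the two ints share one slot.
struct B { void* b; int a; int c; };

// Bytes occupied by the fields themselves, ignoring padding.
constexpr std::size_t payload = sizeof(int) + sizeof(void*) + sizeof(int);

// Bytes lost to padding in each layout (8 and 0 on the assumed platform).
constexpr std::size_t padding_a = sizeof(A) - payload;
constexpr std::size_t padding_b = sizeof(B) - payload;

// Offset of the pointer member in each layout (8 and 0 on the assumed
// platform).
constexpr std::size_t offset_of_b_in_a = offsetof(A, b);
constexpr std::size_t offset_of_b_in_b = offsetof(B, b);
```

Printing `sizeof(A)`, `sizeof(B)` and the constants above on the target platform shows what the reordering actually saves there.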
  • 26-27. Context Switching • One significant source of cache misses is switching context, and switching the data set being worked on • Context switch • Thread / process switching • User space -> kernel space
  • 28. Context Switching • Lower the cost of context switching by amortizing it over as much work as possible • Reduce the number of system calls by passing as much work as possible per call • Reduce thread wake-ups/sleeps by batching work
  • 29. Context Switching When a thread wakes up, do as much work as possible before going to sleep: • Drain the socket of received bytes • Drain the job queue
  • 30-35. Context Switching (traffic analogy) One car at a time
  • 36-38. Context Switching (traffic analogy) The whole queue at a time
  • 39. Context Switching • Every time the socket becomes readable, read and handle one request
      buf = socket.read_one_request()
      req = parse_request(buf)
      handle_req(socket, req)
  • 40. Context Switching • Drain the socket each time it becomes readable • Parse and handle each request that was received
      buf.append(socket.read_all())
      req, buf = parse_request(buf)
      while req is not None:
          handle_req(socket, req)
          req, buf = parse_request(buf)
  • 41. Context Switching • Write all responses in a single call at the end
      buf.append(socket.read_all())
      # Don’t flush buffer to socket until all messages are handled
      socket.cork()
      req, buf = parse_request(buf)
      while req is not None:
          handle_req(socket, req)
          req, buf = parse_request(buf)
      socket.uncork()
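The drain-then-respond loop can be sketched in C++ without a real socket. Everything here is an assumption made for illustration: `fake_socket` only records writes, requests are newline-terminated, and the response format is made up. Note that this sketch batches the responses in a user-space string and writes once, which is a different mechanism from the cork/uncork calls on the slide, although the goal (one flush per wake-up) is the same.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a socket: it records every write call so
// that the number of writes (i.e. system calls) can be inspected.
struct fake_socket {
    std::vector<std::string> writes;
    void write(std::string const& bytes) { writes.push_back(bytes); }
};

// Extract one newline-terminated request from the front of `buf`.
// Returns false and leaves `buf` untouched if no complete request is
// buffered yet.
bool parse_request(std::string& buf, std::string& req)
{
    std::size_t const end = buf.find('\n');
    if (end == std::string::npos) return false;
    req = buf.substr(0, end);
    buf.erase(0, end + 1);
    return true;
}

// Called once per readable event, after the socket has been drained
// into `buf`. Handles every complete request, accumulates the responses
// and sends them with a single write. Returns the number of requests
// handled; a trailing partial request stays in `buf`.
std::size_t handle_readable(fake_socket& sock, std::string& buf)
{
    std::string responses;
    std::string req;
    std::size_t handled = 0;
    while (parse_request(buf, req)) {
        responses += "ok " + req + "\n";
        ++handled;
    }
    if (!responses.empty()) sock.write(responses);
    return handled;
}
```

With three complete requests and one partial request in the buffer, the socket sees a single write carrying three responses, and the partial request waits for the next readable event.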
  • 42. Socket Programming • There are two ways to read from sockets • Wait for readable event then read (POSIX) • Read async. then wait for completion event (Win32)
  • 43-44. Socket Programming
      struct kevent ev[100];
      // Wait for the socket to become readable
      int events = kevent(queue, nullptr, 0, ev, 100, nullptr);
      for (int i = 0; i < events; ++i) {
          // Copy data from kernel space to user space
          int size = read(ev[i].ident, buffer, buffer_size);
          /* ... */
      }
  • 45-46. Socket Programming
      // Initiate async. read into buffer
      WSABUF b = { buffer_size, buffer };
      DWORD transferred = 0, flags = 0;
      WSAOVERLAPPED ov; // [ initialization ]
      int ret = WSARecv(s, &b, 1, &transferred, &flags, &ov, nullptr);
      // Wait for operations to complete
      WSAOVERLAPPED* ol;
      ULONG_PTR key;
      BOOL r = GetQueuedCompletionStatus(port, &transferred, &key, &ol, INFINITE);
      // Query status
      ret = WSAGetOverlappedResult(s, &ov, &transferred, false, &flags);
  • 47. Socket Programming • Passing in a buffer up-front is preferable because: • NIC driver can in theory receive data directly into your buffer and save a copy • If there is a memory copy, it can be done asynchronously, not blocking your thread
  • 48. Socket Programming • Problem: What buffer size should be used? • Too large a buffer wastes memory • Too small a buffer wastes system calls (since we need multiple calls to drain the socket)
  • 49. Socket Programming • Problem: What buffer size should be used? • Start with some reasonable buffer size • If an async read fills the whole buffer, increase size • If an async read returns significantly less than the buffer size, decrease size. Size adjustments should be proportional to the buffer size
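The buffer sizing rule can be sketched as a small class. This is one possible reading of the slide, not the author's implementation: the class name `adaptive_buffer_size`, the growth factor (+50%), the shrink factor (-25%) and the "less than half" threshold for "significantly less" are all assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Tracks the buffer size to use for the next read on one socket.
class adaptive_buffer_size {
public:
    adaptive_buffer_size(std::size_t initial, std::size_t min_size,
                         std::size_t max_size)
        : m_min(min_size)
        , m_max(max_size)
        , m_size(std::min(std::max(initial, min_size), max_size))
    {
        assert(min_size <= max_size);
    }

    std::size_t size() const { return m_size; }

    // Call after each completed read with the number of bytes received.
    void on_read(std::size_t bytes_read)
    {
        if (bytes_read >= m_size) {
            // The read filled the whole buffer: there was probably more
            // data waiting, so grow in proportion to the current size.
            m_size = std::min(m_max, m_size + m_size / 2);
        } else if (bytes_read < m_size / 2) {
            // The read used well under the buffer size: shrink, again
            // in proportion to the current size.
            m_size = std::max(m_min, m_size - m_size / 4);
        }
        // Otherwise the size is about right; leave it alone.
    }

private:
    std::size_t m_min;
    std::size_t m_max;
    std::size_t m_size;
};
```

Because each step is a fraction of the current size, a busy socket reaches a large buffer in a few reads and an idle one decays back toward the minimum. The specific fractions would need tuning against real traffic.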
  • 50. Context Switching Adapt batch size to the computer’s natural granularity. Higher load should lead to larger batches, fewer context switches and higher efficiency. Use of magic numbers is a red flag.
  • 52. Message Queues • Events on message queues may come in batches • Example: we receive one message per 16 kiB block read from disk.
      void conn::on_disk_read(buffer const& buf) {
          m_socket.write(&buf[0], buf.size());
      }
  • 53. Message Queues • Problem: We want to flush our sockets right before we go to sleep, i.e. when we have drained the message queue, without starvation
  • 54-55. Message Queues
      void conn::on_disk_read(buffer const& buf) {
          // Instead of writing to the socket, accumulate the buffers
          m_buf.insert(m_buf.end(), buf.begin(), buf.end());
          // If there is no outstanding flush message, post one
          if (m_has_flush_msg) return;
          m_has_flush_msg = true;
          m_queue.post(std::bind(&conn::flush, this));
      }
      void conn::flush() {
          // Flush all buffers when all messages have been handled
          m_socket.write(&m_buf[0], m_buf.size());
          // Reset so that the next disk read starts a new batch
          m_buf.clear();
          m_has_flush_msg = false;
      }
  • 56. Message Queues [diagram labels: "Flush message"; "FIFO message queue"; "Message handler"]
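The flush-message technique can be exercised end to end with a toy queue. This is a self-contained sketch under assumptions: `message_queue` is a plain `std::deque` of callables rather than a real event loop, `run` is a hypothetical dispatcher, and the socket write is replaced by recording the size of each write in `write_sizes`.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Toy FIFO message queue: each message is a callable.
using message_queue = std::deque<std::function<void()>>;

struct conn {
    explicit conn(message_queue& q) : m_queue(q) {}

    // Handler for one disk-read message: accumulate the data and make
    // sure exactly one flush message is outstanding.
    void on_disk_read(std::vector<char> const& buf)
    {
        m_buf.insert(m_buf.end(), buf.begin(), buf.end());
        if (m_has_flush_msg) return;
        m_has_flush_msg = true;
        m_queue.push_back([this] { flush(); });
    }

    // Runs after every message that was queued ahead of it, because the
    // queue is FIFO. Stands in for a single socket write.
    void flush()
    {
        write_sizes.push_back(m_buf.size());
        m_buf.clear();
        m_has_flush_msg = false;
    }

    // Size of each simulated socket write, for inspection.
    std::vector<std::size_t> write_sizes;

private:
    message_queue& m_queue;
    std::vector<char> m_buf;
    bool m_has_flush_msg = false;
};

// Dispatch messages until the queue is empty. Handlers may post new
// messages while the loop is running.
void run(message_queue& q)
{
    while (!q.empty()) {
        auto msg = std::move(q.front());
        q.pop_front();
        msg();
    }
}
```

If three 16 kiB disk-read messages are already queued when the first one is handled, the flush message lands behind all of them, so the connection performs one write of 48 kiB instead of three writes. Messages posted after the flush message simply start the next batch, so no handler is starved.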