4. Memory Latency
[Bar chart: memory latencies, Core i5 Sandy Bridge. Bars: register. Axis: 0 ns to 0.5 ns.]
http://www.7-cpu.com/cpu/IvyBridge.html
BitTorrent, Inc. | Writing High-Performance Software
For Internal Presentation Purposes Only, Not For External Distribution .
5. Memory Latency
[Bar chart: memory latencies, Core i5 Sandy Bridge. Bars: register, L1 cache. Axis: 0 ns to 1.3 ns.]
http://www.7-cpu.com/cpu/IvyBridge.html
6. Memory Latency
[Bar chart: memory latencies, Core i5 Sandy Bridge. Bars: register, L1 cache, L2 cache. Axis: 0 ns to 4 ns.]
http://www.7-cpu.com/cpu/IvyBridge.html
7. Memory Latency
[Bar chart: memory latencies, Core i5 Sandy Bridge. Bars: register, L1 cache, L2 cache, L3 cache. Axis: 0 ns to 15 ns.]
http://www.7-cpu.com/cpu/IvyBridge.html
8. Memory Latency
[Bar chart: memory latencies, Core i5 Sandy Bridge. Bars: register, L1 cache, L2 cache, L3 cache, DRAM. Axis: 0 ns to 90 ns. Annotation: 61.8 x.]
http://www.7-cpu.com/cpu/IvyBridge.html
9. Memory Latency
• While a CPU is waiting for memory, it counts as busy (i.e. you will see
100% CPU usage, even if your bottleneck is waiting for memory)
• Memory access patterns are a significant factor in performance
• Constant cache misses make your program run up to 2 orders of
magnitude slower than constant cache hits
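The effect of the access pattern can be explored with a minimal sketch like the one below. Nothing in it comes from the slides beyond the claim itself: `sum_in_order`, `sequential_order` and `shuffled_order` are names made up for illustration. Both traversals do the same arithmetic and only the order of memory accesses differs, so timing them on an array much larger than the last-level cache should isolate the cost of cache misses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <vector>

// Visit every element named in `order` and add it up.
std::uint64_t sum_in_order(std::vector<std::uint32_t> const& data,
                           std::vector<std::size_t> const& order) {
    std::uint64_t total = 0;
    for (std::size_t i : order) {
        assert(i < data.size());
        total += data[i];
    }
    return total;
}

// 0, 1, 2, ... n-1: consecutive elements share cache lines and the
// hardware prefetcher can follow the pattern.
std::vector<std::size_t> sequential_order(std::size_t n) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    return order;
}

// The same indices, shuffled: on arrays much larger than the cache,
// most accesses are expected to miss.
std::vector<std::size_t> shuffled_order(std::size_t n, unsigned seed) {
    std::vector<std::size_t> order = sequential_order(n);
    std::mt19937 rng(seed);
    std::shuffle(order.begin(), order.end(), rng);
    return order;
}
```

In both cases the index array itself is read sequentially, so any extra time in the shuffled run comes from the accesses into `data`.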
10. Memory Cache
[Diagram: the memory you requested is a small part of a cache line; the whole cache line is the memory pulled into the cache]
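As a rough illustration of the gap between what is requested and what is loaded, the sketch below counts how many cache lines a read touches. It assumes 64-byte lines, the size used elsewhere in this deck; `cache_lines_touched` is a hypothetical helper, not part of any library.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size in bytes.
constexpr std::size_t cache_line_size = 64;

// Number of cache lines that must be loaded to read `size` bytes
// starting at byte address `addr`.
constexpr std::size_t cache_lines_touched(std::uintptr_t addr, std::size_t size) {
    if (size == 0) return 0;
    std::uintptr_t first = addr / cache_line_size;
    std::uintptr_t last = (addr + size - 1) / cache_line_size;
    return static_cast<std::size_t>(last - first + 1);
}

static_assert(cache_lines_touched(0, 4) == 1, "a 4-byte read pulls in one 64-byte line");
static_assert(cache_lines_touched(62, 4) == 2, "a read straddling a boundary pulls in two");
static_assert(cache_lines_touched(0, 64) == 1, "an aligned 64-byte read fits one line");
```

Requesting 4 bytes therefore costs at least 64 bytes of memory traffic, and twice that if the 4 bytes happen to straddle a line boundary.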
11. Memory Latency
• CPUs prefetch memory automatically if they can recognize your access
pattern (sequential is easy)
• CPUs predict branches in order to prefetch instruction memory
• Memory access pattern is not only determined by data access but also
control flow (indirect jumps stall execution on a memory lookup)
12. Memory Cache
For linear memory reads, the CPU will prefetch memory
[Diagram: a row of consecutive 64-byte cache lines]
14. Memory Cache
For random memory reads, there is no prefetch and most memory accesses will cause a cache miss
[Diagram: a row of 64-byte cache lines]
17. Data Structures
• Array of pointers to objects and linked lists
more cache pressure / cache misses
• Array of objects
less cache pressure / cache hits
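A minimal sketch of the two layouts, assuming a hypothetical `particle` type with 4-byte floats (not from the slides). The loops are identical apart from how the elements are reached.

```cpp
#include <cassert>
#include <memory>
#include <vector>

struct particle {
    float x, y, z;
    float mass;
};

// Array of objects: elements are contiguous, so a 64-byte cache line
// holds four 16-byte particles and the scan is a linear read.
float total_mass(std::vector<particle> const& ps) {
    float sum = 0.f;
    for (particle const& p : ps) sum += p.mass;
    return sum;
}

// Array of pointers: each element is a separate heap allocation, so
// every iteration follows a pointer to a potentially unrelated address.
float total_mass(std::vector<std::unique_ptr<particle>> const& ps) {
    float sum = 0.f;
    for (auto const& p : ps) {
        assert(p != nullptr);
        sum += p->mass;
    }
    return sum;
}
```

Both overloads return the same value; the difference only shows up in cache behaviour when the collection is large.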
18. Data Structures
• One optimization is to refactor your single list of heterogeneous objects
into one list per type.
• Objects would be laid out sequentially in memory
• Virtual function dispatch could become static
20. Data Structures
std::vector<std::unique_ptr<shape>> shapes;
for (auto& s : shapes) s->draw();
std::vector<rectangle> rectangles;
std::vector<circle> circles;
for (auto& s : rectangles) s.draw();
for (auto& s : circles) s.draw();
21. Data Structures
// Pointers need dereferencing + vtable lookup
std::vector<std::unique_ptr<shape>> shapes;
for (auto& s : shapes) s->draw();
std::vector<rectangle> rectangles;
std::vector<circle> circles;
for (auto& s : rectangles) s.draw();
for (auto& s : circles) s.draw();
22. Data Structures
// Pointers need dereferencing + vtable lookup
std::vector<std::unique_ptr<shape>> shapes;
for (auto& s : shapes) s->draw();
// Objects packed back-to-back, sequential memory access, no vtable lookup
std::vector<rectangle> rectangles;
std::vector<circle> circles;
for (auto& s : rectangles) s.draw();
for (auto& s : circles) s.draw();
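The fragments above leave out the type definitions. A self-contained sketch of both versions might look like this; `area()` stands in for `draw()` so that the result can be checked, and the `_v` suffix on the virtual variants is a naming assumption made only to let both versions coexist in one file.

```cpp
#include <cassert>
#include <memory>
#include <vector>

constexpr double pi = 3.14159265358979323846;

// Version 1: one heterogeneous list, dynamic dispatch through a vtable.
struct shape {
    virtual ~shape() = default;
    virtual double area() const = 0;
};
struct rectangle_v final : shape {
    double w, h;
    rectangle_v(double w_, double h_) : w(w_), h(h_) {}
    double area() const override { return w * h; }
};
struct circle_v final : shape {
    double r;
    explicit circle_v(double r_) : r(r_) {}
    double area() const override { return pi * r * r; }
};

double total_area(std::vector<std::unique_ptr<shape>> const& shapes) {
    double sum = 0.0;
    for (auto const& s : shapes) {
        assert(s != nullptr);
        sum += s->area();  // pointer dereference + virtual call
    }
    return sum;
}

// Version 2: one list per concrete type, no base class required.
struct rectangle {
    double w, h;
    double area() const { return w * h; }
};
struct circle {
    double r;
    double area() const { return pi * r * r; }
};

double total_area(std::vector<rectangle> const& rectangles,
                  std::vector<circle> const& circles) {
    double sum = 0.0;
    for (rectangle const& r : rectangles) sum += r.area();  // static call
    for (circle const& c : circles) sum += c.area();        // static call
    return sum;
}
```

The per-type version gives up the original interleaved ordering of shapes, which is acceptable only when the order of processing does not matter.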
23. Data Structures
• For heap allocated objects, put the most commonly used (“hot”) fields in
the first cache line
• Avoid unnecessary padding
24. Data Structures
Padding
struct A {
    int a;
    void* b;
    int c;
};
struct A [24 Bytes]
 0: [int : 4] a
    --- 4 Bytes padding ---
 8: [void* : 8] b
16: [int : 4] c
    --- 4 Bytes padding ---
struct B {
    void* b;
    int a;
    int c;
};
struct B [16 Bytes]
 0: [void* : 8] b
 8: [int : 4] a
12: [int : 4] c
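The layouts above can be checked by the compiler. The sketch below assumes a typical 64-bit ABI (4-byte `int`, 8-byte pointers with 8-byte alignment); `members`, `padding_in_a` and `padding_in_b` are names introduced here for illustration.

```cpp
#include <cstddef>

struct A { int a; void* b; int c; };
struct B { void* b; int a; int c; };

// Bytes lost to padding: total size minus the sum of the member sizes.
constexpr std::size_t members = sizeof(int) + sizeof(void*) + sizeof(int);
constexpr std::size_t padding_in_a = sizeof(A) - members;
constexpr std::size_t padding_in_b = sizeof(B) - members;

// Putting the pointer first should not cost extra padding on
// mainstream ABIs, and on 64-bit targets it removes all of it.
static_assert(padding_in_b <= padding_in_a, "B should not be larger than A");
static_assert(offsetof(B, a) == sizeof(void*), "a directly follows b in B");
```

A `static_assert` on `sizeof` next to a hot struct is a cheap way to notice when a later edit reintroduces padding.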
26. Context Switching
• One significant source of cache misses is switching context, and
switching the data set being worked on
• Context switch
• Thread / process switching
• User space -> kernel space
28. Context Switching
• Lower the cost of context switching by amortizing it over as much work
as possible
• Reduce the number of system calls by passing as much work as
possible per call
• Reduce thread wake-ups/sleeps by batching work
29. Context Switching
When a thread wakes up, do as much work as possible
before going to sleep
Drain the socket of received bytes
Drain the job queue
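Draining a job queue can be sketched as follows. `job_queue` is a hypothetical class, not an API from the slides, and the condition variable a real worker would sleep on is left out to keep the batching step visible.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <utility>

class job_queue {
public:
    void post(std::function<void()> job) {
        assert(job != nullptr);
        std::lock_guard<std::mutex> l(m_mutex);
        m_jobs.push_back(std::move(job));
    }

    // Run every job that is queued right now and return how many ran.
    // The whole batch is taken with a single lock acquisition; jobs
    // posted while the batch runs are picked up by the next call.
    std::size_t drain() {
        std::deque<std::function<void()>> batch;
        {
            std::lock_guard<std::mutex> l(m_mutex);
            batch.swap(m_jobs);
        }
        for (auto& job : batch) job();
        return batch.size();
    }

private:
    std::mutex m_mutex;
    std::deque<std::function<void()>> m_jobs;
};
```

A worker thread would call `drain()` in a loop and go back to sleep only when it returns 0, so one wake-up pays for as many jobs as have accumulated.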
30. Context Switching (traffic analogy)
One car at a time
36. Context Switching (traffic analogy)
The whole queue at a time
39. Context Switching
• Every time the socket becomes readable, read and handle one request
buf = socket.read_one_request()
req = parse_request(buf)
handle_req(socket, req)
40. Context Switching
• Drain the socket each time it becomes readable
• Parse and handle each request that was received
buf.append(socket.read_all())
req, buf = parse_request(buf)
while req is not None:
    handle_req(socket, req)
    req, buf = parse_request(buf)
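`parse_request` is left unspecified in the pseudocode. A minimal C++ sketch of the same drain-and-parse step is shown below, under the assumption (not from the slides) that each request is a line terminated by `'\n'`; `take_requests` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Extract every complete request from `buf` and leave any trailing
// partial request in place for the next read.
std::vector<std::string> take_requests(std::string& buf) {
    std::vector<std::string> requests;
    std::size_t start = 0;
    for (;;) {
        std::size_t end = buf.find('\n', start);
        if (end == std::string::npos) break;
        requests.push_back(buf.substr(start, end - start));
        start = end + 1;
    }
    assert(start <= buf.size());
    buf.erase(0, start);  // one erase for the whole batch
    return requests;
}
```

The caller appends everything it read from the socket to `buf`, calls `take_requests` once, and handles the returned requests in order.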
41. Context Switching
• Write all responses in a single call at the end
buf.append(socket.read_all())
# Don't flush buffer to socket until all messages are handled
socket.cork()
req, buf = parse_request(buf)
while req is not None:
    handle_req(socket, req)
    req, buf = parse_request(buf)
socket.uncork()
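The `cork()` / `uncork()` calls in the pseudocode can be approximated in user space by buffering responses and issuing one write at the end. The sketch below is an illustration only: `fake_socket` and `corked_writer` are hypothetical types that count writes instead of touching a real socket. (On Linux, the `TCP_CORK` socket option offers a comparable effect inside the kernel.)

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a socket: records each write call so the
// number of simulated system calls can be observed.
struct fake_socket {
    std::vector<std::string> writes;
    void write(std::string const& bytes) { writes.push_back(bytes); }
};

class corked_writer {
public:
    explicit corked_writer(fake_socket& s) : m_socket(s) {}

    void cork() { m_corked = true; }

    // While corked, responses are only appended to the pending buffer.
    void send(std::string const& response) {
        m_pending += response;
        if (!m_corked) flush();
    }

    // Uncorking flushes everything that accumulated in one write.
    void uncork() {
        m_corked = false;
        flush();
    }

private:
    void flush() {
        assert(!m_corked);  // only reached when uncorked
        if (m_pending.empty()) return;
        m_socket.write(m_pending);
        m_pending.clear();
    }

    fake_socket& m_socket;
    bool m_corked = false;
    std::string m_pending;
};
```

Three responses sent between `cork()` and `uncork()` reach the socket as one write rather than three.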
42. Socket Programming
• There are two ways to read from sockets
• Wait for readable event then read (POSIX)
• Read async. then wait for completion event (Win32)
43. Socket Programming
#include <sys/types.h>
#include <sys/event.h>
#include <unistd.h>

struct kevent ev[100];
int events = kevent(queue, nullptr, 0, ev, 100, nullptr);
for (int i = 0; i < events; ++i) {
    ssize_t size = read(static_cast<int>(ev[i].ident), buffer, buffer_size);
    /* ... */
}
44. Socket Programming
struct kevent ev[100];
// Wait for the socket to become readable
int events = kevent(queue, nullptr, 0, ev, 100, nullptr);
for (int i = 0; i < events; ++i) {
    // Copy data from kernel space to user space
    ssize_t size = read(static_cast<int>(ev[i].ident), buffer, buffer_size);
    /* ... */
}
45. Socket Programming
WSABUF b = { buffer_size, buffer };
DWORD transferred = 0, flags = 0;
WSAOVERLAPPED ov; // [ initialization ]
int ret = WSARecv(s, &b, 1, &transferred, &flags, &ov, nullptr);

WSAOVERLAPPED* ol;
ULONG_PTR key; // completion key, filled in by GetQueuedCompletionStatus
BOOL r = GetQueuedCompletionStatus(port, &transferred, &key, &ol, INFINITE);

ret = WSAGetOverlappedResult(s, &ov, &transferred, FALSE, &flags);
46. Socket Programming
// Initiate async. read into buffer
WSABUF b = { buffer_size, buffer };
DWORD transferred = 0, flags = 0;
WSAOVERLAPPED ov; // [ initialization ]
int ret = WSARecv(s, &b, 1, &transferred, &flags, &ov, nullptr);

// Wait for operations to complete
WSAOVERLAPPED* ol;
ULONG_PTR key; // completion key, filled in by GetQueuedCompletionStatus
BOOL r = GetQueuedCompletionStatus(port, &transferred, &key, &ol, INFINITE);

// Query status
ret = WSAGetOverlappedResult(s, &ov, &transferred, FALSE, &flags);
47. Socket Programming
• Passing in a buffer up-front is preferable because:
• NIC driver can in theory receive data directly into your buffer and
save a copy
• If there is a memory copy, it can be done asynchronously, not
blocking your thread
48. Socket Programming
• Problem: What buffer size should be used?
• Too large will waste memory
• Too small will waste system calls
(since we need multiple calls to drain the socket)
49. Socket Programming
• Problem: What buffer size should be used?
• Start with some reasonable buffer size
• If an async read fills the whole buffer, increase size
• If an async read returns significantly less than the buffer size,
decrease size
Size adjustments should be proportional to the buffer size
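One possible reading of these rules is sketched below. The growth factor, the "significantly less" threshold and the bounds are all illustrative assumptions rather than values from the slides; `next_buffer_size` is a hypothetical helper. Doubling and halving keep each adjustment proportional to the current size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Illustrative bounds (assumptions): keep the buffer within sane limits.
constexpr std::size_t min_buffer_size = 1024;
constexpr std::size_t max_buffer_size = 1024 * 1024;

// Pick the buffer size for the next read given how much the last one returned.
std::size_t next_buffer_size(std::size_t current, std::size_t bytes_read) {
    assert(bytes_read <= current);
    std::size_t next = current;
    if (bytes_read == current) {
        next = current * 2;  // the read filled the buffer: grow
    } else if (bytes_read < current / 4) {
        next = current / 2;  // the read used under a quarter: shrink
    }
    return std::min(std::max(next, min_buffer_size), max_buffer_size);
}
```

The gap between the grow condition (100% full) and the shrink condition (under 25% full) gives some hysteresis, so a steady stream does not make the size oscillate on every read.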
50. Context Switching
Adapt batch size to the computer’s natural granularity
Higher load should lead to larger batches, fewer context switches and higher efficiency.
Use of magic numbers is a red flag
52. Message Queues
• Events on message queues may come in batches
• Example: we receive one message per 16 kiB block read from disk.
void conn::on_disk_read(buffer const& buf) {
m_socket.write(&buf[0], buf.size());
}
53. Message Queues
• Problem: We want to flush our sockets right before we go to sleep, i.e.
when we have drained the message queue, without starvation
54. Message Queues
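void conn::on_disk_read(buffer const& buf) {
    // assumes buffer exposes begin()/end() like a standard container
    m_buf.insert(m_buf.end(), buf.begin(), buf.end());
    if (m_has_flush_msg) return;
    m_has_flush_msg = true;
    m_queue.post(std::bind(&conn::flush, this));
}
void conn::flush() {
    m_socket.write(m_buf.data(), m_buf.size());
    m_buf.clear();
    m_has_flush_msg = false; // let the next batch post its own flush
}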
55. Message Queues
void conn::on_disk_read(buffer const& buf) {
    // Instead of writing to the socket, accumulate the buffers
    // (assumes buffer exposes begin()/end() like a standard container)
    m_buf.insert(m_buf.end(), buf.begin(), buf.end());
    // If there is no outstanding flush message, post one
    if (m_has_flush_msg) return;
    m_has_flush_msg = true;
    m_queue.post(std::bind(&conn::flush, this));
}
// Flush all buffers when all messages have been handled
void conn::flush() {
    m_socket.write(m_buf.data(), m_buf.size());
    m_buf.clear();
    m_has_flush_msg = false; // let the next batch post its own flush
}
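To see the flush message at work end to end, here is a self-contained sketch. `message_queue` is a hypothetical single-threaded queue, `std::vector<char>` stands in for the deck's `buffer` type, and the socket write is replaced by recording the size of each write; none of these details come from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical single-threaded message queue.
struct message_queue {
    std::deque<std::function<void()>> messages;
    void post(std::function<void()> m) { messages.push_back(std::move(m)); }
    // Handle messages in FIFO order until the queue is empty.
    void run() {
        while (!messages.empty()) {
            std::function<void()> m = std::move(messages.front());
            messages.pop_front();
            m();
        }
    }
};

struct conn {
    explicit conn(message_queue& q) : m_queue(q) {}

    void on_disk_read(std::vector<char> const& buf) {
        m_buf.insert(m_buf.end(), buf.begin(), buf.end());
        if (m_has_flush_msg) return;
        m_has_flush_msg = true;
        // The flush lands behind every message already in the queue.
        m_queue.post([this] { flush(); });
    }

    void flush() {
        assert(m_has_flush_msg);
        write_sizes.push_back(m_buf.size());  // stands in for m_socket.write()
        m_buf.clear();
        m_has_flush_msg = false;
    }

    std::vector<std::size_t> write_sizes;  // one entry per simulated write

private:
    message_queue& m_queue;
    std::vector<char> m_buf;
    bool m_has_flush_msg = false;
};
```

Because the flush is itself a message at the back of the queue, other connections' messages still get their turn, which is how this approach avoids starvation.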