1. Code and memory optimization tricks
Evgeny Muralev
Software Engineer
Sperasoft Inc.
2. About me
• Software engineer at Sperasoft
• Worked on code for EA Sports games (FIFA, NFL, Madden); now working on a Ubisoft AAA title
• Indie game developer in my free time
3. Our Clients
Electronic Arts
Riot Games
Wargaming
BioWare
Ubisoft
Disney
Sony
Our Projects
Dragon Age: Inquisition
FIFA 14
SIMS 4
Mass Effect 2
League of Legends
Grand Theft Auto V
About us
Our Office Locations
USA
Poland
Russia
The Facts
Founded in 2004
300+ employees
Sperasoft on-line
sperasoft.com
linkedin.com/company/sperasoft
twitter.com/sperasoft
facebook.com/sperasoft
5. Developing a AAA title
• Fixed performance requirements
• Min 30fps (33.3ms per frame)
• Performance is king
• a LOT of work to do in one frame!
6. Make code faster?…
• Improved hardware
• Wait for another generation
• Fixed on consoles
• Improved algorithm
• Very important
• Hardware-aware optimization
• Optimizing for (limited) range of hardware
• Microoptimizations for specific architecture!
10. Intel Skylake case study:
Level        Capacity/Associativity              Fastest Latency  Peak Bandwidth (B/cycle)
L1/D         32 KB / 8-way                       4 cycles         96 (2x32 Load + 1x32 Store)
L1/I         32 KB / 8-way                       N/A              N/A
L2           256 KB / 4-way                      12 cycles        64
L3 (shared)  Up to 2 MB per core / up to 16-way  44 cycles        32
http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Brief overview
11. • Out-of-order execution cannot hide big latencies such as an access to main memory
• That is why the processor always tries to prefetch ahead
• Both instructions and data
Brief overview
12. • Linear data access is the best way to help hardware prefetching
• The processor recognizes the pattern and preloads data for upcoming iterations
Vec4D in[SIZE]; // Offset from origin
float ChebyshevDist[SIZE]; // Chebyshev distance from origin
for (auto i = 0; i < SIZE; ++i)
{
ChebyshevDist[i] = Max(in[i].x, in[i].y, in[i].z, in[i].w);
}
Optimizing for data cache
13. • Access patterns must be trivial
• Triggering a prefetch after every cache miss would pollute the cache
• Prefetching cannot happen across page boundaries
• Might trigger an invalid page table walk (on a TLB miss)
Optimizing for data cache
14. • What about traversal of pointer-based data
structures?
• Spoiler: It sucks
Optimizing for data cache
15. • Prefetching is blocked
• next->next is not known
• Cache miss every iteration!
• Increases chance of TLB misses
• * Depending on your memory allocator
struct GameActor
{
// Data…
GameActor* next;
};
while (current != nullptr)
{
// Do some operations on current
// actor…
current = current->next;
}
[Diagram: each hop (current, current->next, current->next->next) is an LLC miss]
Optimizing for data cache
16. Array vs Linked List traversal
[Chart: traversal time vs. number of elements; linear array traversal stays flat while random/pointer access grows steeply]
Optimizing for data cache
17. • Load from memory:
• auto data = *pointerToData;
• Special instructions:
• use intrinsics: _mm_prefetch(char const* p, int hint)
• the hint selects the target cache level (_MM_HINT_T0/T1/T2/NTA); configurable!
Optimizing for data cache
18. • Load from memory != prefetch instructions
• Prefetch instructions may differ depending on the H/W vendor
• e.g. Intel's guide on prefetch instructions:
• They usually retire after virtual-to-physical address translation is completed
• In case of an exception such as a page fault, the software prefetch retires without prefetching any data
Optimizing for data cache
19. • Probably won't help
• Computations don't overlap the memory access time enough
• Remember: an LLC miss costs ~200 cycles vs. ~3-4 cycles for trivial ALU ops
while (current != nullptr)
{
    Prefetch(current->next);
    // Trivial ALU computations on current actor
    current = current->next;
}
Optimizing for data cache
20. while (current != nullptr)
{
    Prefetch(current->next);
    // HighLatencyComputation…
    current = current->next;
}
• May help when the prefetch is overlapped with high-latency work
• Make sure the data is not evicted from the cache before use
Optimizing for data cache
21. • Prefetch far enough ahead to overlap the memory access time
• Prefetch near enough that the data is not evicted from the data cache before use
• Do NOT overprefetch
• Prefetching is not free
• Polluting cache
• Always profile when using software prefetching
Optimizing for data cache
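A minimal sketch of the distance trade-off, using GCC/Clang's __builtin_prefetch (the kDistance of 16 elements is an assumption for illustration; the right value depends on the target's memory latency and must be profiled):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: prefetch a fixed distance ahead of the current element.
// kDistance must be far enough to cover memory latency, but near
// enough that the line is still cached when we finally use it.
float SumWithPrefetch(const std::vector<float>& data)
{
    constexpr std::size_t kDistance = 16; // tuning assumption; profile it
    float sum = 0.0f;
    for (std::size_t i = 0; i < data.size(); ++i)
    {
        if (i + kDistance < data.size())
            __builtin_prefetch(&data[i + kDistance]); // read hint, default locality
        sum += data[i];
    }
    return sum;
}
```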
22. RAM:
… … … … … … a … … … … … … … … … … … … … … … … … … … … … … … … …
Cache:
• Cache operates with blocks called “cache
lines”
• When accessing "a", the whole cache line is loaded
• You can expect a 64-byte-wide cache line on x64
a … … … … … … … … … … … … … … …
Optimizing for data cache
23. struct FooBonus
{
float fooBonus;
float otherData[15];
};
// For every character…
// Assume we have FooBonus FooArray[SIZE];
float Sum{0.0f};
for (auto i = 0; i < SIZE; ++i)
{
    Sum += FooArray[i].fooBonus;
}
Example of poor data layout:
Optimizing for data cache
24. • 64-byte offset between loads
• Each load hits a separate cache line
• 60 of 64 bytes are wasted
addss xmm6,dword ptr [rax-40h]
addss xmm6,dword ptr [rax]
addss xmm6,dword ptr [rax+40h]
addss xmm6,dword ptr [rax+80h]
addss xmm6,dword ptr [rax+0C0h]
addss xmm6,dword ptr [rax+100h]
addss xmm6,dword ptr [rax+140h]
addss xmm6,dword ptr [rax+180h]
add rax,200h
cmp rax,rcx
jl main+0A0h
*MSVC loves x8 loop unrolling
Optimizing for data cache
25. • Look for patterns in how your data is accessed
• Split the data based on access patterns
• Data used together should be located together
• Optimize for the most common case
Optimizing for data cache
26. Cold fields
struct FooBonus
{
MiscData* otherData;
float fooBonus;
};
struct MiscData
{
float otherData[15];
};
Optimizing for data cache
+ 4 bytes of padding for memory alignment on 64-bit
27. • 12-byte offset
• Much less bandwidth is wasted
• Can we do better?
addss xmm6,dword ptr [rax-0Ch]
addss xmm6,dword ptr [rax]
addss xmm6,dword ptr [rax+0Ch]
addss xmm6,dword ptr [rax+18h]
addss xmm6,dword ptr [rax+24h]
addss xmm6,dword ptr [rax+30h]
addss xmm6,dword ptr [rax+3Ch]
addss xmm6,dword ptr [rax+48h]
add rax,60h
cmp rax,rcx
jl main+0A0h
Optimizing for data cache
28. • Maybe there is no need for a pointer to the cold fields at all?
• Make use of Structure of Arrays
• Store and index separate arrays
struct FooBonus
{
float fooBonus;
};
struct MiscData
{
float otherData[15];
};
Optimizing for data cache
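A minimal sketch of the two layouts (type and function names are illustrative, not from the slides): in the SoA version the hot fooBonus field is packed contiguously, so a 64-byte cache line delivers 16 useful floats instead of 1.

```cpp
#include <array>
#include <cassert>
#include <vector>

// Array-of-Structures element: hot and cold data share each cache line.
struct FooBonusAoS
{
    float fooBonus;       // hot: read every frame
    float otherData[15];  // cold: rarely read, but loaded with every line
};

// Structure-of-Arrays: the hot field lives in its own contiguous array.
struct SquadSoA
{
    std::vector<float> fooBonus;                   // hot data, packed
    std::vector<std::array<float, 15>> otherData;  // cold data, same index
};

float SumAoS(const std::vector<FooBonusAoS>& a)
{
    float sum = 0.0f;
    for (const auto& e : a) sum += e.fooBonus;  // strides 64 bytes per load
    return sum;
}

float SumSoA(const SquadSoA& s)
{
    float sum = 0.0f;
    for (float f : s.fooBonus) sum += f;        // strides 4 bytes per load
    return sum;
}
```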
31. • Poor data utilization:
• Wasted bandwidth
• Increasing probability of TLB misses
• More cache misses due to crossing page boundary
Optimizing for data cache
32. • Recognize data access patterns:
• Just analyze the data and how it’s used
• Add logging to getters/setters
• Collect any other useful data (time/counters)
float GameCharacter::GetStamina() const
{
// Active only in debug build
CollectData("GameCharacter::Stamina");
return Stamina;
}
Optimizing for data cache
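The CollectData hook is left undefined on the slide; one possible debug-only sketch (the counter map and macro names are assumptions for illustration) counts accesses per tag:

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Debug-only access counter behind a CollectData-style call.
inline std::unordered_map<std::string, long>& AccessCounters()
{
    static std::unordered_map<std::string, long> counters;
    return counters;
}

#ifndef NDEBUG
#define COLLECT_DATA(tag) (++AccessCounters()[tag])
#else
#define COLLECT_DATA(tag) ((void)0) // compiled out in release builds
#endif

struct GameCharacter
{
    float Stamina{100.0f};

    float GetStamina() const
    {
        COLLECT_DATA("GameCharacter::Stamina"); // active only in debug builds
        return Stamina;
    }
};
```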
33. • What to consider:
• What data is accessed together?
• How often is the data accessed?
• Where is it accessed from?
Optimizing for data cache
34. Instruction lifetime:
• Instruction fetch
• Decoding
• Execution
• Memory access
• Retirement
*of course it is more complex on real hardware
Optimizing branches
37. IF   ID   EX   MEM  WB   (each row is one cycle)
I1
I2   I1
I3   I2   I1
Optimizing branches
38. IF   ID   EX   MEM  WB
I1
I2   I1
I3   I2   I1
I4   I3   I2   I1
Optimizing branches
39. IF   ID   EX   MEM  WB
I1
I2   I1
I3   I2   I1
I4   I3   I2   I1
I5   I4   I3   I2   I1
Optimizing branches
40. • What instructions to fetch after Inst A?
• Condition hasn’t been evaluated yet
• Processor speculatively chooses one of the
paths
• Wrong guess is called branch misprediction
// Instruction A
if (Condition == true)
{
// Instruction B
// Instruction C
}
else
{
// Instruction D
// Instruction E
}
Optimizing branches
41. IF   ID   EX   MEM  WB
A
B    A
C    B    A    <- Mispredicted branch!
A              <- pipeline flush; fetch restarts on the correct path
D    A
• Pipeline flush
• A lot of wasted cycles
Optimizing branches
42. • Try to remove branches altogether
• Especially hard-to-predict branches
• Reduces the chance of branch misprediction
• Doesn't consume Branch Target Buffer resources
Optimizing branches
43. Know bit tricks!
Example: negate a number based on a flag value
Branchy version:
int In;
int Out;
bool bDontNegate;
Out = In;
if (bDontNegate == false)
{
    Out *= -1;
}
Branchless version:
int In;
int Out;
bool bDontNegate;
Out = (bDontNegate ^ (bDontNegate - 1)) * In;
https://graphics.stanford.edu/~seander/bithacks.html
Optimizing branches
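The branchless formula can be checked directly; a minimal sketch wrapping it in a function (the function name is illustrative):

```cpp
#include <cassert>

// Branchless conditional negate (Stanford bit hacks):
// flag == 1: 1 ^ (1 - 1) == 1 ^ 0  ==  1  -> Out =  In
// flag == 0: 0 ^ (0 - 1) == 0 ^ -1 == -1  -> Out = -In
int NegateUnless(bool bDontNegate, int In)
{
    const int flag = static_cast<int>(bDontNegate);
    return (flag ^ (flag - 1)) * In;
}
```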
44. • Compute both branches
Example: X = (A < B) ? CONST1 : CONST2
Optimizing branches
45. • Conditional instructions (setCC and cmovCC)
X = (A < B) ? CONST1 : CONST2
Branchy version:
cmp a, b           ; condition
jbe L30            ; conditional branch
mov ebx, const1    ; ebx holds X
jmp L31            ; unconditional branch
L30:
mov ebx, const2
L31:
Branchless version:
xor ebx, ebx       ; clear ebx (X in the C code)
cmp A, B
setge bl           ; ebx = 0 or 1 (the complement condition)
sub ebx, 1         ; ebx = 11..11 or 00..00
and ebx, const3    ; const3 = const1 - const2
add ebx, const2    ; ebx = const1 or const2
http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Optimizing branches
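The same branchless select can be written in C++ with a mask; this is an equivalent variant of the asm above (assumes two's-complement integers, which C++20 guarantees):

```cpp
#include <cassert>
#include <cstdint>

// Branchless select: returns (a < b) ? const1 : const2.
// mask is all-ones when a < b, all-zeros otherwise; and/or then
// picks one of the two constants without any branch.
int32_t Select(int32_t a, int32_t b, int32_t const1, int32_t const2)
{
    const int32_t mask = -static_cast<int32_t>(a < b); // 0xFFFFFFFF or 0
    return (mask & const1) | (~mask & const2);
}
```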
46. • SIMD mask + blending example
X = (A < B) ? CONST1 : CONST2
// create selector
mask = _mm_cmplt_ps(a, b);
// blend values
res = _mm_blendv_ps(const2, const1, mask);
• mask = 0xFFFFFFFF per lane if (a < b); 0 otherwise
• blend values using the mask
Optimizing branches
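A compilable x86-only sketch of the mask + blend idea; since _mm_blendv_ps requires SSE4.1, this version uses the baseline-SSE and/andnot/or combination instead (the function name is illustrative):

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE: _mm_cmplt_ps, _mm_and_ps, _mm_andnot_ps, _mm_or_ps

// Per-lane branchless select: res[i] = (a[i] < b[i]) ? const1[i] : const2[i]
// _mm_cmplt_ps yields all-ones lanes where a < b and all-zeros elsewhere;
// and/andnot/or then blend the two constants using that mask.
__m128 SelectLt(__m128 a, __m128 b, __m128 const1, __m128 const2)
{
    const __m128 mask = _mm_cmplt_ps(a, b);
    return _mm_or_ps(_mm_and_ps(mask, const1),
                     _mm_andnot_ps(mask, const2));
}
```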
47. Compute-both summary:
• Do it only for hard-to-predict branches
• Obviously both results have to be computed
• Introduces a data dependency that limits out-of-order execution
• Profile!
Optimizing branches
48. • Blue nodes - archers
• Red nodes - swordsmen
Optimizing branches
Example: need to update a squad
49. struct CombatActor
{
// Data…
EUnitType Type; //ARCHER or SWORDSMAN
};
struct Squad
{
CombatActor Units[SIZE][SIZE];
};
void UpdateArmy(const Squad& squad)
{
for (auto i = 0; i < SIZE; ++i)
for (auto j = 0; j < SIZE; ++j)
{
const auto & Unit = squad.Units[i][j];
switch (Unit.Type)
{
case EUnitType::ARCHER:
// Process archer
break;
case EUnitType::SWORDSMAN:
// Process swordsman
break;
default:
// Handle default
break;
}
}
}
• Branching every iteration?
• Bad performance for hard-to-predict branches
Optimizing branches
50. struct CombatActor
{
// Data…
EUnitType Type; //ARCHER or SWORDSMAN
};
struct Squad
{
CombatActor Archers[A_SIZE];
CombatActor Swordsmen[S_SIZE];
};
void UpdateArchers(const Squad & squad)
{
// Just iterate and process, no branching here
// Update archers
}
• Split! And process separately
• No branching in processing methods
• + Better utilization of I-cache!
void UpdateSwordsmen(const Squad & squad)
{
// Just iterate and process, no branching here
// Update swordsmen
}
Optimizing branches
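A minimal runnable sketch of the split layout (the Health field and the update bodies are placeholders, not from the slides): each loop is branch-free, and each per-type update stays hot in the instruction cache.

```cpp
#include <cassert>
#include <vector>

struct CombatActor
{
    float Health;
};

// Split layout: one contiguous array per unit type.
struct Squad
{
    std::vector<CombatActor> Archers;
    std::vector<CombatActor> Swordsmen;
};

void UpdateArchers(Squad& squad)
{
    for (auto& a : squad.Archers)   // no type switch inside the loop
        a.Health += 1.0f;           // placeholder archer update
}

void UpdateSwordsmen(Squad& squad)
{
    for (auto& s : squad.Swordsmen)
        s.Health += 2.0f;           // placeholder swordsman update
}
```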
51. • For very predictable branches:
• Generally prefer predicted-not-taken conditional branches
• Depending on the architecture, a predicted-taken branch may incur slightly more latency
Optimizing branches
52. ; function prologue
cmp dword ptr [data], 0
je END
; set of some ALU instructions…
;…
END:
; function epilogue
; function prologue
cmp dword ptr [data], 0
jne COMP
jmp END
COMP:
; set of some ALU instructions…
;…
END:
; function epilogue
• Imagine cmp dword ptr [data], 0 is likely to evaluate to "false"
• Prefer predicted not taken
Predicted not taken Predicted taken
Optimizing branches
53. • Study branch predictor on target architecture
• Consider whether you really need a branch
• Compute both results
• Bit/Math hacks
• Study the data and split it
• Based on access patterns
• Based on performed computation
Optimizing branches
54. Conclusion
• Know your hardware
• Architecture matters!
• Design code around data, not abstractions
• Hardware is a real thing