Code and memory optimization tricks
Evgeny Muralev
Software Engineer
Sperasoft Inc.
About me
• Software engineer at Sperasoft
• Worked on code for EA Sports games (FIFA, NFL, Madden); Now Ubisoft
AAA title
• Indie game developer in my free time 
Our Clients
Electronic Arts
Riot Games
Wargaming
BioWare
Ubisoft
Disney
Sony
Our Projects
Dragon Age: Inquisition
FIFA 14
SIMS 4
Mass Effect 2
League of Legends
Grand Theft Auto V
About us
Our Office Locations
USA
Poland
Russia
The Facts
Founded in 2004
300+ employees
Sperasoft on-line
sperasoft.com
linkedin.com/company/sperasoft
twitter.com/sperasoft
facebook.com/sperasoft
Agenda
• Brief architecture overview
• Optimizing for data cache
• Optimizing branches (and I-cache)
Developing AAA title
• Fixed performance requirements
• Min 30fps (33.3ms per frame)
• Performance is king
• a LOT of work to do in one frame!
Make code faster?…
• Improved hardware
• Wait for another generation
• Fixed on consoles
• Improved algorithm
• Very important
• Hardware-aware optimization
• Optimizing for (limited) range of hardware
• Microoptimizations for specific architecture!
Brief overview
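[Diagram: memory hierarchy, from CPU registers (REG) through the L1 I and L1 D caches and the L2 I/D cache down to RAM; latencies shown: ~2 cycles, ~20 cycles, ~200 cycles]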
• Last level cache (LLC) miss cost ~ 200 cycles
• Intel Skylake instruction latencies
• ADDPS/ADDSS 4 cycles
• MULPS/MULSS 4 cycles
• DIVSS/DIVPS 11 cycles
• SQRTPS/SQRTSS 13 cycles
Brief overview
Intel Skylake case study:
Level        Capacity / Associativity            Fastest latency  Peak bandwidth (B/cycle)
L1/D         32 KB / 8-way                       4 cycles         96 (2x32 load + 1x32 store)
L1/I         32 KB / 8-way                       N/A              N/A
L2           256 KB / 4-way                      12 cycles        64
L3 (shared)  Up to 2 MB per core / up to 16-way  44 cycles        32
http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
Brief overview
• Out-of-order execution cannot hide large latencies such as an access to main memory
• That is why the processor always tries to prefetch ahead
• Both instructions and data
Brief overview
• Linear data access is the best you can do to help hardware prefetching
• The processor recognizes the pattern and preloads data for the next iterations beforehand
Vec4D in[SIZE];             // Offset from origin
float ChebyshevDist[SIZE];  // Chebyshev distance from origin
for (auto i = 0; i < SIZE; ++i)
{
    // Chebyshev distance is the largest absolute component
    ChebyshevDist[i] = Max(Abs(in[i].x), Abs(in[i].y), Abs(in[i].z), Abs(in[i].w));
}
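A self-contained version of the loop above might look like the following sketch. The `Vec4D` layout and the helper names `ChebyshevDistance` and `ChebyshevDistances` are assumptions made for illustration; the point is only that the input is read front to back with a constant stride, which is the pattern a hardware prefetcher can follow.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 4-component vector: 16 bytes, so four fit in a 64-byte cache line.
struct Vec4D
{
    float x, y, z, w;
};

// Chebyshev distance from the origin: the largest absolute component.
inline float ChebyshevDistance(const Vec4D& v)
{
    return std::max({std::fabs(v.x), std::fabs(v.y), std::fabs(v.z), std::fabs(v.w)});
}

// Reads the input strictly in order (constant stride, no pointer chasing).
inline std::vector<float> ChebyshevDistances(const std::vector<Vec4D>& in)
{
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = ChebyshevDistance(in[i]);
    }
    return out;
}
```

Calling `ChebyshevDistances` on a `std::vector<Vec4D>` returns one distance per input element, in the same order.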
Optimizing for data cache
• Access patterns must be trivial
• Triggering prefetching after every cache miss will
pollute a cache
• Prefetching cannot happen across page
boundaries
• Might trigger invalid page table walk (on TLB miss)
Optimizing for data cache
• What about traversal of pointer-based data
structures?
• Spoiler: It sucks
Optimizing for data cache
• Prefetching is blocked
• next->next is not known
• Cache miss every iteration!
• Increases chance of TLB misses (depending on your memory allocator)
[Diagram: three linked nodes: current, current->next, current->next->next]
struct GameActor
{
// Data…
GameActor* next;
};
while (current != nullptr)
{
// Do some operations on current
// actor…
current = current->next;
}
LLC miss! LLC miss! LLC miss!
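To make the contrast concrete, here is a minimal sketch of the same reduction written twice: once over contiguous storage and once by chasing `next` pointers. The `health` field and both function names are hypothetical. The two functions return the same sum; what differs is that the array version knows the next address in advance, while the list version must finish loading one node before it learns where the next one lives.

```cpp
#include <cstddef>
#include <vector>

struct GameActor
{
    float health = 0.0f;        // hypothetical payload field
    GameActor* next = nullptr;  // link used only by the list traversal
};

// Contiguous traversal: addresses are predictable, so hardware prefetch can help.
inline float SumHealthArray(const std::vector<GameActor>& actors)
{
    float sum = 0.0f;
    for (const GameActor& actor : actors)
    {
        sum += actor.health;
    }
    return sum;
}

// Pointer chasing: the address of the next node is known only after the
// current node has been loaded.
inline float SumHealthList(const GameActor* current)
{
    float sum = 0.0f;
    while (current != nullptr)
    {
        sum += current->health;
        current = current->next;
    }
    return sum;
}
```

Timing the two on large inputs whose list nodes are scattered across the heap is a reasonable way to reproduce the comparison on your own hardware.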
Optimizing for data cache
[Chart: array vs. linked list traversal; time against number of elements, for linear data and random access]
Optimizing for data cache
• Load from memory:
• auto data = *pointerToData;
• Special instructions:
• use intrinsics: _mm_prefetch(const void* p, enum _mm_hint h)
• The hint is configurable!
Optimizing for data cache
• Load from memory != prefetch instructions
• Prefetch instructions may differ depending on H/W vendor
e.g. Intel guide on prefetch instructions:
• Usually retire after virtual-to-physical address translation is completed
• In case of an exception such as a page fault, a software prefetch retires without prefetching any data
Optimizing for data cache
• Probably won’t help
• Computations don’t overlap memory access time enough
• Remember LLC miss ~ 200c vs trivial ALUs ~ 3-4c
while (current != nullptr)
{
    Prefetch(current->next);
    // Trivial ALU computations on current actor
    current = current->next;
}
Optimizing for data cache
while (current != nullptr)
{
    Prefetch(current->next);
    // HighLatencyComputation…
    current = current->next;
}
• May help around high latency
• Make sure data is not evicted from cache before use
Optimizing for data cache
• Prefetch far enough to overlap memory access
time
• Prefetch near enough so it’s not evicted from
data cache
• Do NOT overprefetch
• Prefetching is not free
• Polluting cache
• Always profile when using software prefetching
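The guidelines above can be sketched as follows. Everything here is illustrative: the `PREFETCH` macro, the `value` field and `ExpensiveUpdate` (a stand-in for a high-latency computation) are assumptions, and `_MM_HINT_T0` is just one of the available hints. On x86 the macro maps to `_mm_prefetch`; elsewhere it falls back to the GCC/Clang builtin or to nothing. Whether this helps at all depends on the workload, so profile before keeping it.

```cpp
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0
#define PREFETCH(ptr) _mm_prefetch(reinterpret_cast<const char*>(ptr), _MM_HINT_T0)
#elif defined(__GNUC__)
#define PREFETCH(ptr) __builtin_prefetch(ptr)
#else
#define PREFETCH(ptr) ((void)0)  // no prefetch hint available on this target
#endif

struct GameActor
{
    float value;      // hypothetical payload
    GameActor* next;
};

// Stand-in for work that is long enough to overlap a memory access.
inline float ExpensiveUpdate(float v)
{
    for (int i = 0; i < 64; ++i)
    {
        v = std::sqrt(v * v + 1.0f);
    }
    return v;
}

inline float UpdateAll(const GameActor* current)
{
    float total = 0.0f;
    while (current != nullptr)
    {
        // Request the next node now; a prefetch is only a hint and does not
        // fault, so a null next pointer is harmless here.
        PREFETCH(current->next);
        total += ExpensiveUpdate(current->value);
        current = current->next;
    }
    return total;
}
```

The prefetch changes only timing, never the result, so `UpdateAll` can be compared directly against a version without the hint.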
Optimizing for data cache
• Cache operates with blocks called “cache lines”
• When accessing “a”, the whole cache line is loaded
• You can expect a 64-byte-wide cache line on x64
[Diagram: RAM drawn as a long row of bytes containing “a”; the cache holds the single line that contains “a”]
Optimizing for data cache
Example of poor data layout:
struct FooBonus
{
    float fooBonus;
    float otherData[15];
};
// For every character…
// Assume FooArray is an array of SIZE FooBonus structs
float Sum{0.0f};
for (auto i = 0; i < SIZE; ++i)
{
    Sum += FooArray[i].fooBonus;
}
Actor->Total += Sum;
Optimizing for data cache
• 64-byte offset between loads
• Each load is on a separate cache line
• 60 of every 64 bytes loaded are wasted
addss xmm6,dword ptr [rax-40h]
addss xmm6,dword ptr [rax]
addss xmm6,dword ptr [rax+40h]
addss xmm6,dword ptr [rax+80h]
addss xmm6,dword ptr [rax+0C0h]
addss xmm6,dword ptr [rax+100h]
addss xmm6,dword ptr [rax+140h]
addss xmm6,dword ptr [rax+180h]
add rax,200h
cmp rax,rcx
jl main+0A0h
*MSVC loves x8 loop unrolling
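The waste can be checked with a few lines of arithmetic. This is a sketch under the assumption of a 64-byte cache line and 4-byte floats; `kCacheLineBytes` and `UsefulFraction` are names invented here.

```cpp
#include <cstddef>

struct FooBonus
{
    float fooBonus;
    float otherData[15];
};

constexpr std::size_t kCacheLineBytes = 64;  // assumption: typical x64 line size

static_assert(sizeof(FooBonus) == 64, "expects 16 floats of 4 bytes each");

// Fraction of the bytes pulled into the cache that the summation loop reads.
// Once the stride reaches a full line, every element costs a whole line.
constexpr double UsefulFraction(std::size_t usefulBytes, std::size_t strideBytes)
{
    return static_cast<double>(usefulBytes) /
           static_cast<double>(strideBytes < kCacheLineBytes ? strideBytes : kCacheLineBytes);
}
```

With this layout `UsefulFraction(sizeof(float), sizeof(FooBonus))` evaluates to 4/64, which matches the "60 of 64 bytes wasted" figure on the slide.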
Optimizing for data cache
• Look for patterns in how your data is accessed
• Split the data based on access patterns
• Data used together should be located together
• Look for most common case
Optimizing for data cache
Cold fields
struct FooBonus
{
MiscData* otherData;
float fooBonus;
};
struct MiscData
{
float otherData[15];
};
Optimizing for data cache
• 12 byte offset (+ 4 bytes for memory alignment on 64-bit)
• Much less bandwidth is wasted
• Can do better?!
addss xmm6,dword ptr [rax-0Ch]
addss xmm6,dword ptr [rax]
addss xmm6,dword ptr [rax+0Ch]
addss xmm6,dword ptr [rax+18h]
addss xmm6,dword ptr [rax+24h]
addss xmm6,dword ptr [rax+30h]
addss xmm6,dword ptr [rax+3Ch]
addss xmm6,dword ptr [rax+48h]
add rax,60h
cmp rax,rcx
jl main+0A0h
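A compilable sketch of the hot/cold split, assuming the struct definitions from the slide and a hypothetical `SumBonuses` helper. The exact size of the hot struct depends on the target ABI (pointer size and padding), so the code does not hard-code it; the hot struct only needs to be much smaller than the cold block it points to.

```cpp
#include <cstddef>
#include <vector>

// Cold fields, reached through a pointer only when actually needed.
struct MiscData
{
    float otherData[15];
};

// Hot struct: the value the loop reads, plus a pointer to the cold part.
struct FooBonus
{
    MiscData* otherData;
    float fooBonus;
};

// Touches only the hot structs; the cold blocks never enter the cache here.
inline float SumBonuses(const std::vector<FooBonus>& foos)
{
    float sum = 0.0f;
    for (const FooBonus& foo : foos)
    {
        sum += foo.fooBonus;
    }
    return sum;
}
```

The pointer itself is still dead weight in the summation loop, which is what the next step removes.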
Optimizing for data cache
• Maybe no need to make a pointer to the cold fields?
• Make use of Structure of Arrays
• Store and index different arrays
struct FooBonus
{
float fooBonus;
};
struct MiscData
{
float otherData[15];
};
Optimizing for data cache
• 100% bandwidth utilization
• If everything is 64byte aligned
addss xmm6,dword ptr [rax-4]
addss xmm6,dword ptr [rax]
addss xmm6,dword ptr [rax+4]
addss xmm6,dword ptr [rax+8]
addss xmm6,dword ptr [rax+0Ch]
addss xmm6,dword ptr [rax+10h]
addss xmm6,dword ptr [rax+14h]
addss xmm6,dword ptr [rax+18h]
add rax,20h
cmp rax,rcx
jl main+0A0h
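One way to express the Structure of Arrays layout is sketched below. `FooBonusTable`, `Add` and `SumBonuses` are hypothetical names; the idea is simply that element i is described by `fooBonus[i]` and `misc[i]`, so the hot loop streams through a dense array of floats.

```cpp
#include <cstddef>
#include <vector>

struct MiscData
{
    float otherData[15];
};

// Structure of Arrays: parallel arrays that share an index.
struct FooBonusTable
{
    std::vector<float> fooBonus;  // hot: read by the summation loop
    std::vector<MiscData> misc;   // cold: same index, separate storage

    // Appends one element to both arrays and returns its index.
    std::size_t Add(float bonus, const MiscData& data)
    {
        fooBonus.push_back(bonus);
        misc.push_back(data);
        return fooBonus.size() - 1;
    }
};

// Every byte of every cache line loaded here is a value the loop needs.
inline float SumBonuses(const FooBonusTable& table)
{
    float sum = 0.0f;
    for (float bonus : table.fooBonus)
    {
        sum += bonus;
    }
    return sum;
}
```

The trade-off is that code needing both hot and cold fields of one element now touches two arrays, so this layout pays off when the hot loop dominates.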
Optimizing for data cache
[Chart: B/D utilization for Attempt 1, Attempt 2 and Attempt 3]
Optimizing for data cache
[Chart: time against number of elements]
• Poor data utilization:
• Wasted bandwidth
• Increasing probability of TLB misses
• More cache misses due to crossing page boundary
Optimizing for data cache
• Recognize data access patterns:
• Just analyze the data and how it’s used
• Add logging to getters/setters
• Collect any other useful data (time/counters)
float GameCharacter::GetStamina() const
{
    // Active only in debug build
    CollectData("GameCharacter::Stamina");
    return Stamina;
}
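The slide does not show what `CollectData` does, so the following is only one possible shape for it: a per-field access counter that compiles to nothing when `NDEBUG` is defined. `AccessCounts` and the initial stamina value are assumptions made for the sketch.

```cpp
#include <map>
#include <string>

// Hypothetical global table: field name -> number of accesses.
inline std::map<std::string, unsigned long>& AccessCounts()
{
    static std::map<std::string, unsigned long> counts;
    return counts;
}

// Counts one access; does nothing in builds that define NDEBUG.
inline void CollectData(const char* fieldName)
{
#ifndef NDEBUG
    ++AccessCounts()[fieldName];
#else
    (void)fieldName;
#endif
}

class GameCharacter
{
public:
    float GetStamina() const
    {
        CollectData("GameCharacter::Stamina");
        return Stamina;
    }

private:
    float Stamina = 100.0f;  // placeholder value
};
```

Dumping `AccessCounts()` at the end of a play session shows which fields are hot and which are rarely touched, which is the input needed for a hot/cold split.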
Optimizing for data cache
• What to consider:
• What data is accessed together
• How often data is accessed?
• From where it’s accessed?
Optimizing for data cache
Optimizing branches
Instruction lifetime:
• Instruction fetch (IF)
• Decoding (ID)
• Execution (EX)
• Memory access (MEM)
• Retirement (WB)
*of course it is more complex on real hardware
Pipeline filling up, one row per cycle:
Cycle  IF  ID  EX  MEM  WB
1      I1
2      I2  I1
3      I3  I2  I1
4      I4  I3  I2  I1
5      I5  I4  I3  I2   I1
Optimizing branches
• What instructions to fetch after Inst A?
• Condition hasn’t been evaluated yet
• Processor speculatively chooses one of the
paths
• Wrong guess is called branch misprediction
// Instruction A
if (Condition == true)
{
// Instruction B
// Instruction C
}
else
{
// Instruction D
// Instruction E
}
Optimizing branches
Mispredicted branch!
Cycle  IF  ID  EX  MEM  WB
1      A
2      B   A
3      C   B   A
4                  A
5      D                A
• Pipeline flush: B and C are discarded
• A lot of wasted cycles
Optimizing branches
• Try to remove branches altogether
• Especially hard to predict branches
• Reduces chance of branch misprediction
• Doesn’t take resources of Branch Target Buffer
Optimizing branches
Know bit tricks!
Example: Negate number based on flag value
Branchy version:
int In;
int Out;
bool bDontNegate;
Out = In;
if (bDontNegate == false)
{
    Out *= -1;
}
Branchless version:
int In;
int Out;
bool bDontNegate;
Out = (bDontNegate ^ (bDontNegate - 1)) * In;
https://graphics.stanford.edu/~seander/bithacks.html
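Both variants can be written as small functions and checked against each other. The function names are invented for this sketch; the branchless expression is the one from the bit-hacks page linked above.

```cpp
// Straightforward version with a conditional.
inline int NegateBranchy(int in, bool dontNegate)
{
    int out = in;
    if (!dontNegate)
    {
        out *= -1;
    }
    return out;
}

// flag is 1 or 0, so (flag ^ (flag - 1)) is 1 ^ 0 = 1 or 0 ^ -1 = -1:
// multiplying by it keeps or negates the input without a conditional.
inline int NegateBranchless(int in, bool dontNegate)
{
    const int flag = static_cast<int>(dontNegate);
    return (flag ^ (flag - 1)) * in;
}
```

Whether the branchless form is faster depends on how predictable the flag is, so it is worth measuring both on the target platform.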
Optimizing branches
• Compute both branches
Example: X = (A < B) ? CONST1 : CONST2
Optimizing branches
• Conditional instructions (setCC and cmovCC)
X = (A < B) ? CONST1 : CONST2
Branchy version:
cmp a, b ;Condition
jge L30 ;Conditional branch, taken when a >= b
mov ebx, const1 ;ebx holds X
jmp L31 ;Unconditional branch
L30:
mov ebx, const2
L31:
Branchless version:
xor ebx, ebx ;Clear ebx (X in the C code)
cmp A, B
setge bl ;When ebx = 0 or 1
;OR the complement condition
sub ebx, 1 ;ebx=11..11 or 00..00
and ebx, const3 ;const3 = const1-const2
add ebx, const2 ;ebx=const1 or const2
http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
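The same setge / sub / and / add sequence can be mirrored step by step in C++. This is a sketch with a hypothetical function name; unsigned arithmetic is used so the wrap-around in `const1 - const2` is well defined. A compiler may well produce `cmov` or similar code for the plain ternary on its own, so treat this as an illustration of the technique rather than a guaranteed win.

```cpp
#include <cstdint>

// Returns (a < b) ? const1 : const2 without a conditional jump in the source.
inline std::int32_t SelectBranchless(std::int32_t a, std::int32_t b,
                                     std::int32_t const1, std::int32_t const2)
{
    // setge: 1 when a >= b, 0 when a < b
    const std::uint32_t ge = static_cast<std::uint32_t>(a >= b);
    // sub 1: all ones when a < b, all zeros otherwise
    const std::uint32_t mask = ge - 1u;
    // const3 = const1 - const2 (modulo 2^32)
    const std::uint32_t const3 =
        static_cast<std::uint32_t>(const1) - static_cast<std::uint32_t>(const2);
    // and + add: const2 + (const1 - const2) when a < b, else const2 + 0
    return static_cast<std::int32_t>((mask & const3) + static_cast<std::uint32_t>(const2));
}
```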
Optimizing branches
• SIMD mask + blending example
X = (A < B) ? CONST1 : CONST2
// create selector: each lane of mask is 0xffffffff if (a < b), 0 otherwise
mask = _mm_cmplt_ps(a, b);
// blend values using mask: const1 where the mask is set, const2 elsewhere
res = _mm_blendv_ps(const2, const1, mask);
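A compilable sketch of the mask-and-blend idea follows. Note the substitution: `_mm_blendv_ps` requires SSE4.1, so this version builds the same selection from `_mm_and_ps`, `_mm_andnot_ps` and `_mm_or_ps`, which only need SSE. The function name `SelectLess`, the requirement that `count` is a multiple of 4, and the scalar fallback for non-x86 targets are all assumptions of the sketch.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>  // SSE intrinsics

// out[i] = (a[i] < b[i]) ? const1 : const2, four floats per iteration.
// Assumes count is a multiple of 4.
inline void SelectLess(const float* a, const float* b, float const1, float const2,
                       float* out, std::size_t count)
{
    const __m128 c1 = _mm_set1_ps(const1);
    const __m128 c2 = _mm_set1_ps(const2);
    for (std::size_t i = 0; i < count; i += 4)
    {
        const __m128 va = _mm_loadu_ps(a + i);
        const __m128 vb = _mm_loadu_ps(b + i);
        // Each lane: all ones if a < b, all zeros otherwise.
        const __m128 mask = _mm_cmplt_ps(va, vb);
        // (mask & c1) | (~mask & c2): both results exist, the mask picks one.
        const __m128 res = _mm_or_ps(_mm_and_ps(mask, c1), _mm_andnot_ps(mask, c2));
        _mm_storeu_ps(out + i, res);
    }
}
#else
// Scalar fallback for targets without SSE.
inline void SelectLess(const float* a, const float* b, float const1, float const2,
                       float* out, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i)
    {
        out[i] = (a[i] < b[i]) ? const1 : const2;
    }
}
#endif
```

When SSE4.1 is available, the and/andnot/or triple can be replaced by the single `_mm_blendv_ps(c2, c1, mask)` call shown on the slide.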
Optimizing branches
Compute both summary:
• Do it only for hard-to-predict branches
• Obviously have to compute both results
• Introduces a data dependency that blocks out-of-order execution
• Profile!
Optimizing branches
• Blue nodes - archers
• Red nodes - swordsmen
Optimizing branches
Example: Need to update a squad
struct CombatActor
{
// Data…
EUnitType Type; //ARCHER or SWORDSMAN
};
struct Squad
{
CombatActor Units[SIZE][SIZE];
};
void UpdateArmy(const Squad& squad)
{
    for (auto i = 0; i < SIZE; ++i)
        for (auto j = 0; j < SIZE; ++j)
        {
            const auto& Unit = squad.Units[i][j];
            switch (Unit.Type)
            {
            case EUnitType::ARCHER:
                // Process archer
                break;
            case EUnitType::SWORDSMAN:
                // Process swordsman
                break;
            default:
                // Handle default
                break;
            }
        }
}
• Branching every iteration?
• Bad performance for hard-to-predict branches
Optimizing branches
struct CombatActor
{
// Data…
EUnitType Type; //ARCHER or SWORDSMAN
};
struct Squad
{
CombatActor Archers[A_SIZE];
CombatActor Swordsmen[S_SIZE];
};
void UpdateArchers(const Squad & squad)
{
    // Just iterate and process, no branching here
    // Update archers
}
void UpdateSwordsmen(const Squad & squad)
{
    // Just iterate and process, no branching here
    // Update swordsmen
}
• Split! And process separately
• No branching in processing methods
• + Better utilization of I-cache!
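A runnable sketch of the split might look like this. The `Health` field, the placeholder update rules and `SplitByType` are assumptions; the slides do not say how units reach their per-type arrays. The type check happens once, when a unit is filed into its array, and the update loops contain no per-unit type branch.

```cpp
#include <vector>

enum class EUnitType { ARCHER, SWORDSMAN };

struct CombatActor
{
    EUnitType Type;
    int Health;  // hypothetical payload
};

struct Squad
{
    std::vector<CombatActor> Archers;
    std::vector<CombatActor> Swordsmen;
};

// One-time partition: the only place that inspects the unit type.
inline Squad SplitByType(const std::vector<CombatActor>& mixed)
{
    Squad squad;
    for (const CombatActor& unit : mixed)
    {
        if (unit.Type == EUnitType::ARCHER)
        {
            squad.Archers.push_back(unit);
        }
        else
        {
            squad.Swordsmen.push_back(unit);
        }
    }
    return squad;
}

// Homogeneous loops: same code path for every element.
inline void UpdateArchers(Squad& squad)
{
    for (CombatActor& archer : squad.Archers)
    {
        archer.Health += 1;  // placeholder rule
    }
}

inline void UpdateSwordsmen(Squad& squad)
{
    for (CombatActor& swordsman : squad.Swordsmen)
    {
        swordsman.Health += 2;  // placeholder rule
    }
}
```

In a real game the arrays would more likely be maintained as units spawn and die, rather than rebuilt from a mixed list every frame.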
Optimizing branches
• For very predictable branches:
• Generally prefer predicted not taken conditional
branches
• Depending on the architecture, a predicted-taken branch may have slightly higher latency
Optimizing branches
Predicted not taken:
; function prologue
cmp dword ptr [data], 0
je END
; set of some ALU instructions…
;…
END:
; function epilogue
Predicted taken:
; function prologue
cmp dword ptr [data], 0
jne COMP
jmp END
COMP:
; set of some ALU instructions…
;…
END:
; function epilogue
• Imagine cmp dword ptr [data], 0 is likely to evaluate to “false”
• Prefer predicted not taken
Optimizing branches
• Study branch predictor on target architecture
• Consider whether you really need a branch
• Compute both results
• Bit/Math hacks
• Study the data and split it
• Based on access patterns
• Based on performed computation
Optimizing branches
Conclusion
• Know your hardware
• Architecture matters!
• Design code around data, not abstractions
• Hardware is a real thing
Resources
• http://www.agner.org/optimize/microarchitecture.pdf
• https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
• http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf
• http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
• https://graphics.stanford.edu/~seander/bithacks.html
Questions?
• E-mail: evgeny.muralev@sperasoft.com
• Twitter: @EvgenyGD
• Web: evgenymuralev.com
• www.sperasoft.com
• Follow us on Twitter, LinkedIn and Facebook!
Sperasoft

More Related Content

What's hot

OSDC 2012 | Scaling with MongoDB by Ross Lawley
OSDC 2012 | Scaling with MongoDB by Ross LawleyOSDC 2012 | Scaling with MongoDB by Ross Lawley
OSDC 2012 | Scaling with MongoDB by Ross LawleyNETWAYS
 
Tales from the Field
Tales from the FieldTales from the Field
Tales from the FieldMongoDB
 
Programming Hive Reading #4
Programming Hive Reading #4Programming Hive Reading #4
Programming Hive Reading #4moai kids
 
Windows 10 Nt Heap Exploitation (English version)
Windows 10 Nt Heap Exploitation (English version)Windows 10 Nt Heap Exploitation (English version)
Windows 10 Nt Heap Exploitation (English version)Angel Boy
 
A compact bytecode format for JavaScriptCore
A compact bytecode format for JavaScriptCoreA compact bytecode format for JavaScriptCore
A compact bytecode format for JavaScriptCoreTadeu Zagallo
 
Scaling out SSIS with Parallelism, Diving Deep Into The Dataflow Engine
Scaling out SSIS with Parallelism, Diving Deep Into The Dataflow EngineScaling out SSIS with Parallelism, Diving Deep Into The Dataflow Engine
Scaling out SSIS with Parallelism, Diving Deep Into The Dataflow EngineChris Adkin
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]Aleksei Voitylov
 
Use Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual WaysUse Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual WaysItamar Haber
 
Designing your SaaS Database for Scale with Postgres
Designing your SaaS Database for Scale with PostgresDesigning your SaaS Database for Scale with Postgres
Designing your SaaS Database for Scale with PostgresOzgun Erdogan
 

What's hot (9)

OSDC 2012 | Scaling with MongoDB by Ross Lawley
OSDC 2012 | Scaling with MongoDB by Ross LawleyOSDC 2012 | Scaling with MongoDB by Ross Lawley
OSDC 2012 | Scaling with MongoDB by Ross Lawley
 
Tales from the Field
Tales from the FieldTales from the Field
Tales from the Field
 
Programming Hive Reading #4
Programming Hive Reading #4Programming Hive Reading #4
Programming Hive Reading #4
 
Windows 10 Nt Heap Exploitation (English version)
Windows 10 Nt Heap Exploitation (English version)Windows 10 Nt Heap Exploitation (English version)
Windows 10 Nt Heap Exploitation (English version)
 
A compact bytecode format for JavaScriptCore
A compact bytecode format for JavaScriptCoreA compact bytecode format for JavaScriptCore
A compact bytecode format for JavaScriptCore
 
Scaling out SSIS with Parallelism, Diving Deep Into The Dataflow Engine
Scaling out SSIS with Parallelism, Diving Deep Into The Dataflow EngineScaling out SSIS with Parallelism, Diving Deep Into The Dataflow Engine
Scaling out SSIS with Parallelism, Diving Deep Into The Dataflow Engine
 
Java on arm theory, applications, and workloads [dev5048]
Java on arm  theory, applications, and workloads [dev5048]Java on arm  theory, applications, and workloads [dev5048]
Java on arm theory, applications, and workloads [dev5048]
 
Use Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual WaysUse Redis in Odd and Unusual Ways
Use Redis in Odd and Unusual Ways
 
Designing your SaaS Database for Scale with Postgres
Designing your SaaS Database for Scale with PostgresDesigning your SaaS Database for Scale with Postgres
Designing your SaaS Database for Scale with Postgres
 

Viewers also liked

Unity3D Scripting: State Machine
Unity3D Scripting: State MachineUnity3D Scripting: State Machine
Unity3D Scripting: State MachineSperasoft
 
Unity introduction for programmers
Unity introduction for programmersUnity introduction for programmers
Unity introduction for programmersNoam Gat
 
Game Development with Unity
Game Development with UnityGame Development with Unity
Game Development with Unitydavidluzgouveia
 
Unity Programming
Unity Programming Unity Programming
Unity Programming Sperasoft
 
Introduction to Unity3D and Building your First Game
Introduction to Unity3D and Building your First GameIntroduction to Unity3D and Building your First Game
Introduction to Unity3D and Building your First GameSarah Sexton
 

Viewers also liked (11)

NYPF14 Report - CDA
NYPF14 Report - CDANYPF14 Report - CDA
NYPF14 Report - CDA
 
Securing PHP Applications
Securing PHP ApplicationsSecuring PHP Applications
Securing PHP Applications
 
Unity3D Scripting: State Machine
Unity3D Scripting: State MachineUnity3D Scripting: State Machine
Unity3D Scripting: State Machine
 
Unity introduction for programmers
Unity introduction for programmersUnity introduction for programmers
Unity introduction for programmers
 
Game Development with Unity
Game Development with UnityGame Development with Unity
Game Development with Unity
 
Unity Programming
Unity Programming Unity Programming
Unity Programming
 
Unity3D Programming
Unity3D ProgrammingUnity3D Programming
Unity3D Programming
 
Unity is strength presentation slides
Unity is strength presentation slidesUnity is strength presentation slides
Unity is strength presentation slides
 
Unity presentation
Unity presentationUnity presentation
Unity presentation
 
Unity 3d Basics
Unity 3d BasicsUnity 3d Basics
Unity 3d Basics
 
Introduction to Unity3D and Building your First Game
Introduction to Unity3D and Building your First GameIntroduction to Unity3D and Building your First Game
Introduction to Unity3D and Building your First Game
 

Similar to Code and Memory Optimisation Tricks

Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++Mike Acton
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)RichardWarburton
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonJAXLondon2014
 
Your backend architecture is what matters slideshare
Your backend architecture is what matters slideshareYour backend architecture is what matters slideshare
Your backend architecture is what matters slideshareColin Charles
 
Dynamic Binary Analysis and Obfuscated Codes
Dynamic Binary Analysis and Obfuscated Codes Dynamic Binary Analysis and Obfuscated Codes
Dynamic Binary Analysis and Obfuscated Codes Jonathan Salwan
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMDWei-Ta Wang
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarSpark Summit
 
Java Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey KovalenkoJava Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey KovalenkoValeriia Maliarenko
 
MongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsMongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsServer Density
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_SummaryHiram Fleitas León
 
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...MongoDB
 
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChangerZero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChangerMongoDB
 
Finding Xori: Malware Analysis Triage with Automated Disassembly
Finding Xori: Malware Analysis Triage with Automated DisassemblyFinding Xori: Malware Analysis Triage with Automated Disassembly
Finding Xori: Malware Analysis Triage with Automated DisassemblyPriyanka Aash
 
Sista: Improving Cog’s JIT performance
Sista: Improving Cog’s JIT performanceSista: Improving Cog’s JIT performance
Sista: Improving Cog’s JIT performanceESUG
 
MOW2010: 1TB MySQL Database Migration and HA Infrastructure by Alex Gorbachev...
MOW2010: 1TB MySQL Database Migration and HA Infrastructure by Alex Gorbachev...MOW2010: 1TB MySQL Database Migration and HA Infrastructure by Alex Gorbachev...
MOW2010: 1TB MySQL Database Migration and HA Infrastructure by Alex Gorbachev...Alex Gorbachev
 

Similar to Code and Memory Optimisation Tricks (20)

Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard Warburton
 
Your backend architecture is what matters slideshare
Your backend architecture is what matters slideshareYour backend architecture is what matters slideshare
Your backend architecture is what matters slideshare
 
Dynamic Binary Analysis and Obfuscated Codes
Dynamic Binary Analysis and Obfuscated Codes Dynamic Binary Analysis and Obfuscated Codes
Dynamic Binary Analysis and Obfuscated Codes
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMD
 
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran LonikarExploiting GPU's for Columnar DataFrrames by Kiran Lonikar
Exploiting GPU's for Columnar DataFrrames by Kiran Lonikar
 
Java Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey KovalenkoJava Jit. Compilation and optimization by Andrey Kovalenko
Java Jit. Compilation and optimization by Andrey Kovalenko
 
MongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & AnalyticsMongoDB: Optimising for Performance, Scale & Analytics
MongoDB: Optimising for Performance, Scale & Analytics
 
Fedor Polyakov - Optimizing computer vision problems on mobile platforms
Fedor Polyakov - Optimizing computer vision problems on mobile platforms Fedor Polyakov - Optimizing computer vision problems on mobile platforms
Fedor Polyakov - Optimizing computer vision problems on mobile platforms
 
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
[DBA]_HiramFleitas_SQL_PASS_Summit_2017_Summary
 
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
Ensuring High Availability for Real-time Analytics featuring Boxed Ice / Serv...
 
Top ten-list
Top ten-listTop ten-list
Top ten-list
 
Apache Spark v3.0.0
Apache Spark v3.0.0Apache Spark v3.0.0
Apache Spark v3.0.0
 
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChangerZero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger
Zero to 1 Billion+ Records: A True Story of Learning & Scaling GameChanger
 
test
testtest
test
 
mtl_rubykaigi
mtl_rubykaigimtl_rubykaigi
mtl_rubykaigi
 
Finding Xori: Malware Analysis Triage with Automated Disassembly
Finding Xori: Malware Analysis Triage with Automated DisassemblyFinding Xori: Malware Analysis Triage with Automated Disassembly
Finding Xori: Malware Analysis Triage with Automated Disassembly
 
Sista: Improving Cog’s JIT performance
Sista: Improving Cog’s JIT performanceSista: Improving Cog’s JIT performance
Sista: Improving Cog’s JIT performance
 
MOW2010: 1TB MySQL Database Migration and HA Infrastructure by Alex Gorbachev...
MOW2010: 1TB MySQL Database Migration and HA Infrastructure by Alex Gorbachev...MOW2010: 1TB MySQL Database Migration and HA Infrastructure by Alex Gorbachev...
MOW2010: 1TB MySQL Database Migration and HA Infrastructure by Alex Gorbachev...
 

More from Sperasoft

особенности работы с Locomotion в Unreal Engine 4
особенности работы с Locomotion в Unreal Engine 4особенности работы с Locomotion в Unreal Engine 4
особенности работы с Locomotion в Unreal Engine 4Sperasoft
 
концепт и архитектура геймплея в Creach: The Depleted World
концепт и архитектура геймплея в Creach: The Depleted Worldконцепт и архитектура геймплея в Creach: The Depleted World
концепт и архитектура геймплея в Creach: The Depleted WorldSperasoft
 
Опыт разработки VR игры для UE4
Опыт разработки VR игры для UE4Опыт разработки VR игры для UE4
Опыт разработки VR игры для UE4Sperasoft
 
Организация работы с UE4 в команде до 20 человек
Организация работы с UE4 в команде до 20 человек Организация работы с UE4 в команде до 20 человек
Организация работы с UE4 в команде до 20 человек Sperasoft
 
Gameplay Tags
Gameplay TagsGameplay Tags
Gameplay TagsSperasoft
 
Data Driven Gameplay in UE4
Data Driven Gameplay in UE4Data Driven Gameplay in UE4
Data Driven Gameplay in UE4Sperasoft
 
The theory of relational databases
The theory of relational databasesThe theory of relational databases
The theory of relational databasesSperasoft
 
Automated layout testing using Galen Framework
Automated layout testing using Galen FrameworkAutomated layout testing using Galen Framework
Automated layout testing using Galen FrameworkSperasoft
 
Sperasoft talks: Android Security Threats
Sperasoft talks: Android Security ThreatsSperasoft talks: Android Security Threats
Sperasoft talks: Android Security ThreatsSperasoft
 
Sperasoft Talks: RxJava Functional Reactive Programming on Android
Sperasoft Talks: RxJava Functional Reactive Programming on AndroidSperasoft Talks: RxJava Functional Reactive Programming on Android
Sperasoft Talks: RxJava Functional Reactive Programming on AndroidSperasoft
 
Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft
 
Effective Мeetings
Effective МeetingsEffective Мeetings
Effective МeetingsSperasoft
 
Unreal Engine 4 Introduction
Unreal Engine 4 IntroductionUnreal Engine 4 Introduction
Unreal Engine 4 IntroductionSperasoft
 
JIRA Development
JIRA DevelopmentJIRA Development
JIRA DevelopmentSperasoft
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to ElasticsearchSperasoft
 
MOBILE DEVELOPMENT with HTML, CSS and JS
MOBILE DEVELOPMENT with HTML, CSS and JSMOBILE DEVELOPMENT with HTML, CSS and JS
MOBILE DEVELOPMENT with HTML, CSS and JSSperasoft
 
Quick Intro Into Kanban
Quick Intro Into KanbanQuick Intro Into Kanban
Quick Intro Into KanbanSperasoft
 
ECMAScript 6 Review
ECMAScript 6 ReviewECMAScript 6 Review
ECMAScript 6 ReviewSperasoft
 
Console Development in 15 minutes
Console Development in 15 minutesConsole Development in 15 minutes
Console Development in 15 minutesSperasoft
 
Database Indexes
Database IndexesDatabase Indexes
Database IndexesSperasoft
 

More from Sperasoft (20)

особенности работы с Locomotion в Unreal Engine 4
особенности работы с Locomotion в Unreal Engine 4особенности работы с Locomotion в Unreal Engine 4
особенности работы с Locomotion в Unreal Engine 4
 
концепт и архитектура геймплея в Creach: The Depleted World
концепт и архитектура геймплея в Creach: The Depleted Worldконцепт и архитектура геймплея в Creach: The Depleted World
концепт и архитектура геймплея в Creach: The Depleted World
 
Опыт разработки VR игры для UE4
Опыт разработки VR игры для UE4Опыт разработки VR игры для UE4
Опыт разработки VR игры для UE4
 
Организация работы с UE4 в команде до 20 человек
Организация работы с UE4 в команде до 20 человек Организация работы с UE4 в команде до 20 человек
Организация работы с UE4 в команде до 20 человек
 
Gameplay Tags
Gameplay TagsGameplay Tags
Gameplay Tags
 
Data Driven Gameplay in UE4
Data Driven Gameplay in UE4Data Driven Gameplay in UE4
Data Driven Gameplay in UE4
 
The theory of relational databases
The theory of relational databasesThe theory of relational databases
The theory of relational databases
 
Automated layout testing using Galen Framework
Automated layout testing using Galen FrameworkAutomated layout testing using Galen Framework
Automated layout testing using Galen Framework
 
Sperasoft talks: Android Security Threats
Sperasoft talks: Android Security ThreatsSperasoft talks: Android Security Threats
Sperasoft talks: Android Security Threats
 
Sperasoft Talks: RxJava Functional Reactive Programming on Android
Sperasoft Talks: RxJava Functional Reactive Programming on AndroidSperasoft Talks: RxJava Functional Reactive Programming on Android
Sperasoft Talks: RxJava Functional Reactive Programming on Android
 
Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015Sperasoft‬ talks j point 2015
Sperasoft‬ talks j point 2015
 
Effective Мeetings
Effective МeetingsEffective Мeetings
Effective Мeetings
 
Unreal Engine 4 Introduction
Unreal Engine 4 IntroductionUnreal Engine 4 Introduction
Unreal Engine 4 Introduction
 
JIRA Development
JIRA DevelopmentJIRA Development
JIRA Development
 
Introduction to Elasticsearch
Introduction to ElasticsearchIntroduction to Elasticsearch
Introduction to Elasticsearch
 
MOBILE DEVELOPMENT with HTML, CSS and JS
MOBILE DEVELOPMENT with HTML, CSS and JSMOBILE DEVELOPMENT with HTML, CSS and JS
MOBILE DEVELOPMENT with HTML, CSS and JS
 
Quick Intro Into Kanban
Quick Intro Into KanbanQuick Intro Into Kanban
Quick Intro Into Kanban
 
ECMAScript 6 Review
ECMAScript 6 ReviewECMAScript 6 Review
ECMAScript 6 Review
 
Console Development in 15 minutes
Console Development in 15 minutesConsole Development in 15 minutes
Console Development in 15 minutes
 
Database Indexes
Database IndexesDatabase Indexes
Database Indexes
 

Code and Memory Optimisation Tricks

  • 1. Code and memory optimization tricks Evgeny Muralev Software Engineer Sperasoft Inc.
  • 2. About me • Software engineer at Sperasoft • Worked on code for EA Sports games (FIFA, NFL, Madden); Now Ubisoft AAA title • Indie game developer in my free time 
  • 3. Our Clients Electronic Arts Riot Games Wargaming BioWare Ubisoft Disney Sony Our Projects Dragon Age: Inquisition FIFA 14 SIMS 4 Mass Effect 2 League of Legends Grand Theft Auto V About us Our Office Locations USA Poland Russia The Facts Founded in 2004 300+ employees Sperasoft on-line sperasoft.com linkedin.com/company/sperasoft twitter.com/sperasoft facebook.com/sperasoft
  • 4. Agenda • Brief architecture overview • Optimizing for data cache • Optimizing branches (and I-cache)
  • 5. Developing AAA title • Fixed performance requirements • Min 30fps (33.3ms per frame) • Performance is king • A LOT of work to do in one frame!
  • 6. Make code faster?… • Improved hardware • Wait for another generation • Fixed on consoles • Improved algorithm • Very important • Hardware-aware optimization • Optimizing for (limited) range of hardware • Microoptimizations for specific architecture!
  • 7. Brief overview CPU REG L1 I L1 D L2 I/D RAM REG ~2 cycles ~20 cycles ~200 cycles
  • 8. Brief overview • Last level cache (LLC) miss cost ~ 200 cycles • Intel Skylake instruction latencies • ADDPS/ADDSS 4 cycles • MULPS/MULSS 4 cycles • DIVSS/DIVPS 11 cycles • SQRTPS/SQRTSS 13 cycles
  • 10. Brief overview • Intel Skylake case study:
      Level       | Capacity / Associativity       | Fastest latency | Peak bandwidth (B/cycle)
      L1/D        | 32 KB / 8                      | 4 cycles        | 96 (2x32 load + 1x32 store)
      L1/I        | 32 KB / 8                      | N/A             | N/A
      L2          | 256 KB / 4                     | 12 cycles       | 64
      L3 (shared) | Up to 2 MB per core / up to 16 | 44 cycles       | 32
      Source: http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
  • 11. Brief overview • Out-of-order execution cannot hide big latencies like access to main memory • That’s why the processor always tries to prefetch ahead • Both instructions and data
  • 12. Optimizing for data cache • Linear data access is the best you can do to help hardware prefetching • The processor recognizes the pattern and preloads data for the next iterations beforehand
      Vec4D in[SIZE]; // Offset from origin
      float ChebyshevDist[SIZE]; // Chebyshev distance from origin
      for (auto i = 0; i < SIZE; ++i) {
          ChebyshevDist[i] = Max(in[i].x, in[i].y, in[i].z, in[i].w);
      }
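A minimal, compilable sketch of the loop on slide 12, under stated assumptions: `Vec4D` is taken to be four floats, `ChebyshevFromOrigin` and `ComputeDistances` are hypothetical names, and the sketch takes absolute values of the components, which the slide's `Max` call does not show. Both arrays are walked front to back with a fixed stride, which is the access pattern the hardware prefetcher is said to handle best.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>

// Assumed layout: four packed floats (16 bytes), so four fit in a 64-byte line.
struct Vec4D { float x, y, z, w; };

// Chebyshev distance from the origin: the largest absolute component.
inline float ChebyshevFromOrigin(const Vec4D& v) {
    return std::max({std::fabs(v.x), std::fabs(v.y), std::fabs(v.z), std::fabs(v.w)});
}

// Reads `in` and writes `out` strictly in increasing address order.
template <std::size_t N>
void ComputeDistances(const std::array<Vec4D, N>& in, std::array<float, N>& out) {
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = ChebyshevFromOrigin(in[i]);
    }
}
```

Calling `ComputeDistances(in, out)` on two `std::array` objects of the same length fills `out` in a single linear pass.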
  • 13. Optimizing for data cache • Access patterns must be trivial • Triggering prefetching after every cache miss would pollute the cache • Prefetching cannot happen across page boundaries • Might trigger invalid page table walk (on TLB miss)
  • 14. Optimizing for data cache • What about traversal of pointer-based data structures? • Spoiler: It sucks
  • 15. Optimizing for data cache • Prefetching is blocked • next->next is not known • Cache miss every iteration! • Increases chance of TLB misses* • *Depending on your memory allocator
      struct GameActor {
          // Data…
          GameActor* next;
      };
      while (current != nullptr) {
          // Do some operations on current actor…
          current = current->next;
      }
      (diagram: current, current->next, current->next->next; LLC miss at every step)
  • 16. Optimizing for data cache • Array vs Linked List traversal (chart: time vs. N of elements; series: linear data, random access)
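The two traversals compared on slides 15 and 16 can be sketched side by side. This is an illustration only: the `health` field and the helpers `SumArray`, `SumList` and `BuildList` are hypothetical, and the list nodes are separate heap allocations, so how scattered they end up depends on the allocator, as the slide notes.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct GameActor {
    float health = 0.0f;        // hypothetical payload field
    GameActor* next = nullptr;
};

// Contiguous traversal: the next address is always current + sizeof(GameActor).
float SumArray(const std::vector<GameActor>& actors) {
    float total = 0.0f;
    for (const GameActor& actor : actors) total += actor.health;
    return total;
}

// Pointer chasing: the next address is known only after the current node is loaded.
float SumList(const GameActor* head) {
    float total = 0.0f;
    for (const GameActor* cur = head; cur != nullptr; cur = cur->next) total += cur->health;
    return total;
}

// Builds a list from individually allocated nodes and links them in order.
std::vector<std::unique_ptr<GameActor>> BuildList(std::size_t count, float health) {
    std::vector<std::unique_ptr<GameActor>> nodes;
    nodes.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        nodes.push_back(std::make_unique<GameActor>());
        nodes.back()->health = health;
        if (i > 0) nodes[i - 1]->next = nodes.back().get();
    }
    return nodes;
}
```

Both functions compute the same result; timing them over growing element counts is one way to reproduce the shape of the chart, and any measured gap will vary with hardware and allocator.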
  • 17. Optimizing for data cache • Load from memory: • auto data = *pointerToData; • Special instructions: • use intrinsics: _mm_prefetch(const char* p, int hint) • Configurable! (the hint, e.g. _MM_HINT_T0, selects the target cache level)
  • 18. Optimizing for data cache • Load from memory != prefetch instructions • E.g. Intel guide on prefetch instructions: • Usually retire after virtual to physical address translation is completed • In case of an exception such as a page fault, a software prefetch retires without prefetching any data • Prefetch instructions may differ depending on H/W vendor
  • 19. Optimizing for data cache • Probably won’t help • Computations don’t overlap memory access time enough • Remember LLC miss ~ 200c vs trivial ALUs ~ 3-4c
      while (current != nullptr) {
          Prefetch(current->next);
          // Trivial ALU computations on current actor
          current = current->next;
      }
  • 20. Optimizing for data cache • May help around high latency • Make sure data is not evicted from cache before use
      while (current != nullptr) {
          Prefetch(current->next);
          // HighLatencyComputation…
          current = current->next;
      }
  • 21. Optimizing for data cache • Prefetch far enough ahead to overlap memory access time • Prefetch near enough that the data is not evicted from the data cache before use • Do NOT overprefetch • Prefetching is not free • It pollutes the cache • Always profile when using software prefetching
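The "far enough, near enough" rule can be made concrete with a tunable prefetch distance. The sketch below is an assumption-laden illustration, not the talk's code: `PrefetchForRead`, `Particle` and `SumMass` are hypothetical names, and the access goes through an index table rather than a linked list so that the distance can be expressed as a number of iterations. The wrapper uses `_mm_prefetch` where SSE is available, falls back to `__builtin_prefetch` on GCC or Clang, and otherwise does nothing.

```cpp
#include <cstddef>
#include <vector>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#endif

// Hint only: a prefetch never changes program results, just cache contents.
inline void PrefetchForRead(const void* p) {
#if defined(__SSE__) || defined(_M_X64)
    _mm_prefetch(static_cast<const char*>(p), _MM_HINT_T0);
#elif defined(__GNUC__) || defined(__clang__)
    __builtin_prefetch(p);
#else
    (void)p;  // no known prefetch facility on this compiler
#endif
}

struct Particle {
    float position[3];
    float mass;
};

// Visits particles in the order given by `order` (unpredictable for hardware)
// and requests the element `distance` iterations ahead on every step.
float SumMass(const std::vector<Particle>& particles,
              const std::vector<std::size_t>& order,
              std::size_t distance) {
    float total = 0.0f;
    for (std::size_t i = 0; i < order.size(); ++i) {
        if (i + distance < order.size()) {
            PrefetchForRead(&particles[order[i + distance]]);
        }
        total += particles[order[i]].mass;
    }
    return total;
}
```

The result is identical for every `distance`; only the timing changes, which is why the slide's advice to profile applies when choosing the value.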
  • 22. Optimizing for data cache • Cache operates with blocks called “cache lines” • When accessing “a”, the whole cache line is loaded • You can expect a 64-byte cache line on x64 (diagram: RAM with element “a”; the cache holds the entire line that contains “a”)
  • 23. Optimizing for data cache • Example of poor data layout:
      struct FooBonus {
          float fooBonus;
          float otherData[15];
      };
      // For every character…
      // Assume we have array<FooBonus> structs;
      float Sum{0.0f};
      for (auto i = 0; i < SIZE; ++i) {
          Sum += structs[i].fooBonus;
      }
  • 24. Optimizing for data cache • 64 byte offset between loads • Each is on a separate cache line • 60 of every 64 bytes are wasted • *MSVC loves x8 loop unrolling
      addss xmm6,dword ptr [rax-40h]
      addss xmm6,dword ptr [rax]
      addss xmm6,dword ptr [rax+40h]
      addss xmm6,dword ptr [rax+80h]
      addss xmm6,dword ptr [rax+0C0h]
      addss xmm6,dword ptr [rax+100h]
      addss xmm6,dword ptr [rax+140h]
      addss xmm6,dword ptr [rax+180h]
      add rax,200h
      cmp rax,rcx
      jl main+0A0h
  • 25. Optimizing for data cache • Look for patterns in how your data is accessed • Split the data based on access patterns • Data used together should be located together • Look for the most common case
  • 26. Optimizing for data cache
      struct MiscData {
          float otherData[15];
      };
      struct FooBonus {
          MiscData* otherData; // Cold fields
          float fooBonus; // + 4 bytes for memory alignment on 64bit
      };
  • 27. Optimizing for data cache • 12 byte offset • Much less bandwidth is wasted • Can we do better?!
      addss xmm6,dword ptr [rax-0Ch]
      addss xmm6,dword ptr [rax]
      addss xmm6,dword ptr [rax+0Ch]
      addss xmm6,dword ptr [rax+18h]
      addss xmm6,dword ptr [rax+24h]
      addss xmm6,dword ptr [rax+30h]
      addss xmm6,dword ptr [rax+3Ch]
      addss xmm6,dword ptr [rax+48h]
      add rax,60h
      cmp rax,rcx
      jl main+0A0h
  • 28. Optimizing for data cache • Maybe there is no need for a pointer to the cold fields? • Make use of Structure of Arrays • Store and index different arrays
      struct FooBonus {
          float fooBonus;
      };
      struct MiscData {
          float otherData[15];
      };
  • 29. Optimizing for data cache • 100% bandwidth utilization • If everything is 64 byte aligned
      addss xmm6,dword ptr [rax-4]
      addss xmm6,dword ptr [rax]
      addss xmm6,dword ptr [rax+4]
      addss xmm6,dword ptr [rax+8]
      addss xmm6,dword ptr [rax+0Ch]
      addss xmm6,dword ptr [rax+10h]
      addss xmm6,dword ptr [rax+14h]
      addss xmm6,dword ptr [rax+18h]
      add rax,20h
      cmp rax,rcx
      jl main+0A0h
  • 30. Optimizing for data cache • B/D Utilization (chart: time vs. N of elements; series: Attempt 1, Attempt 2, Attempt 3)
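The first and third layouts from slides 23 to 29 can be compared in a few lines. This is a sketch under assumptions: `FooBonusAoS`, `FooData`, `SumAoS`, `SumSoA` and `UsefulFraction` are hypothetical names, `float` is assumed to be 4 bytes, and the utilization figure only counts bytes touched per stride, ignoring alignment.

```cpp
#include <cstddef>
#include <vector>

// Attempt 1: hot field and cold data interleaved in one struct.
struct FooBonusAoS {
    float fooBonus;
    float otherData[15];
};
static_assert(sizeof(FooBonusAoS) == 64, "each element spans one 64-byte cache line");

// Attempt 3: hot and cold data live in separate, index-matched arrays.
struct MiscData {
    float otherData[15];
};
struct FooData {
    std::vector<float> fooBonus;  // hot: read every frame
    std::vector<MiscData> misc;   // cold: same index, rarely read
};

// Stride between loads is sizeof(FooBonusAoS): one useful float per line.
float SumAoS(const std::vector<FooBonusAoS>& items) {
    float sum = 0.0f;
    for (const FooBonusAoS& item : items) sum += item.fooBonus;
    return sum;
}

// Stride between loads is sizeof(float): every loaded byte is used.
float SumSoA(const FooData& data) {
    float sum = 0.0f;
    for (float bonus : data.fooBonus) sum += bonus;
    return sum;
}

// Fraction of each stride that the loop actually reads.
constexpr double UsefulFraction(std::size_t usefulBytes, std::size_t strideBytes) {
    return static_cast<double>(usefulBytes) / static_cast<double>(strideBytes);
}
```

With these definitions `UsefulFraction(sizeof(float), sizeof(FooBonusAoS))` is 4/64, matching the "60 of 64 bytes wasted" remark, while the split layout reaches 1.0.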
  • 31. Optimizing for data cache • Poor data utilization: • Wasted bandwidth • Increased probability of TLB misses • More cache misses due to crossing page boundaries
  • 32. Optimizing for data cache • Recognize data access patterns: • Just analyze the data and how it’s used • Add logging to getters/setters • Collect any other useful data (time/counters)
      float GameCharacter::GetStamina() const {
          // Active only in debug build
          CollectData("GameCharacter::Stamina");
          return Stamina;
      }
  • 33. Optimizing for data cache • What to consider: • What data is accessed together? • How often is the data accessed? • From where is it accessed?
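One possible shape for the logging idea on slide 32 is a counter keyed by field name. Everything here is hypothetical: `AccessLog` stands in for the slide's `CollectData`, whose real implementation is not shown, and a nullable pointer replaces the debug-only switch mentioned in the slide's comment.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Counts how many times each named field was read.
class AccessLog {
public:
    void Record(const std::string& name) { ++counts_[name]; }

    std::uint64_t Count(const std::string& name) const {
        const auto it = counts_.find(name);
        return it == counts_.end() ? 0 : it->second;
    }

private:
    std::unordered_map<std::string, std::uint64_t> counts_;
};

class GameCharacter {
public:
    // Passing nullptr disables logging (assumed stand-in for a release build).
    GameCharacter(float stamina, AccessLog* log) : stamina_(stamina), log_(log) {}

    float GetStamina() const {
        if (log_ != nullptr) log_->Record("GameCharacter::Stamina");
        return stamina_;
    }

private:
    float stamina_;
    AccessLog* log_;
};
```

Dumping the counters after a representative play session gives a first answer to "how often is the data accessed"; grouping fields with similar counts is then a candidate split.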
  • 34. Optimizing branches • Instruction lifetime: • Instruction fetch • Decoding • Execution • Memory Access • Retirement • *Of course it is more complex on real hardware
  • 35. Optimizing branches
      IF  ID  EX  MEM WB
      I1
  • 36. Optimizing branches
      IF  ID  EX  MEM WB
      I1
      I2  I1
  • 37. Optimizing branches
      IF  ID  EX  MEM WB
      I1
      I2  I1
      I3  I2  I1
  • 38. Optimizing branches
      IF  ID  EX  MEM WB
      I1
      I2  I1
      I3  I2  I1
      I4  I3  I2  I1
  • 39. Optimizing branches
      IF  ID  EX  MEM WB
      I1
      I2  I1
      I3  I2  I1
      I4  I3  I2  I1
      I5  I4  I3  I2  I1
  • 40. Optimizing branches • What instructions to fetch after Inst A? • Condition hasn’t been evaluated yet • Processor speculatively chooses one of the paths • Wrong guess is called branch misprediction
      // Instruction A
      if (Condition == true) {
          // Instruction B
          // Instruction C
      } else {
          // Instruction D
          // Instruction E
      }
  • 41. Optimizing branches • Mispredicted branch! • Pipeline Flush • A lot of wasted cycles
      IF  ID  EX  MEM WB
      A
      B   A
      C   B   A
                  A
      D               A
  • 42. Optimizing branches • Try to remove branches altogether • Especially hard-to-predict branches • Reduces chance of branch misprediction • Doesn’t take up resources of the Branch Target Buffer
  • 43. Optimizing branches • Know bit tricks! • Example: Negate number based on flag value
      Branchy version:
      int In; int Out; bool bDontNegate;
      Out = In;
      if (bDontNegate == false) {
          Out *= -1;
      }
      Branchless version:
      int In; int Out; bool bDontNegate;
      Out = (bDontNegate ^ (bDontNegate - 1)) * In;
      Source: https://graphics.stanford.edu/~seander/bithacks.html
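The bit trick from slide 43 can be checked against the branchy form directly. The function names below are hypothetical; the sketch assumes two's complement integers and, like the original, is undefined for the most negative `int` because negating it overflows.

```cpp
// Branchless: flag ^ (flag - 1) is +1 when flag == 1 and -1 when flag == 0.
inline int NegateUnlessFlag(int value, bool dontNegate) {
    const int flag = static_cast<int>(dontNegate);  // 1 or 0
    const int sign = flag ^ (flag - 1);             // 1 ^ 0 = 1, 0 ^ -1 = -1
    return sign * value;
}

// Reference version with an explicit branch, for comparison.
inline int NegateBranchy(int value, bool dontNegate) {
    if (dontNegate) {
        return value;
    }
    return -value;
}
```

Whether the branchless form is faster depends on how predictable the flag is; a compiler may also turn the branchy version into a conditional move on its own, so measuring is the only reliable check.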
  • 44. Optimizing branches • Compute both branches • Example: X = (A < B) ? CONST1 : CONST2
  • 45. Optimizing branches • Conditional instructions (setCC and cmovCC) • X = (A < B) ? CONST1 : CONST2
      Branchy version:
      cmp a, b          ; Condition
      jbe L30           ; Conditional branch
      mov ebx, const1   ; ebx holds X
      jmp L31           ; Unconditional branch
      L30:
      mov ebx, const2
      L31:
      Branchless version:
      xor ebx, ebx      ; Clear ebx (X in the C code)
      cmp A, B
      setge bl          ; When ebx = 0 or 1
                        ; OR the complement condition
      sub ebx, 1        ; ebx=11..11 or 00..00
      and ebx, const3   ; const3 = const1-const2
      add ebx, const2   ; ebx=const1 or const2
      Source: http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf
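The branchless sequence on slide 45 (setge, sub, and, add) maps onto ordinary integer arithmetic. The sketch below mirrors it step by step under assumptions: `SelectLess` is a hypothetical name and unsigned 32-bit operands are used so that the wrap-around in `const1 - const2` is well defined. A compiler is free to emit different instructions for it.

```cpp
#include <cstdint>

// Returns const1 when a < b, otherwise const2, without a conditional jump in the source.
inline std::uint32_t SelectLess(std::uint32_t a, std::uint32_t b,
                                std::uint32_t const1, std::uint32_t const2) {
    const std::uint32_t ge = static_cast<std::uint32_t>(a >= b);  // setge: 1 or 0
    const std::uint32_t mask = ge - 1u;           // sub 1: 0x00000000 or 0xFFFFFFFF
    return (mask & (const1 - const2)) + const2;   // and, then add
}
```

When `a < b` the mask is all ones, so the result is `(const1 - const2) + const2`, which is `const1` modulo 2^32; otherwise the mask is zero and only `const2` remains.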
  • 46. Optimizing branches • SIMD mask + blending example • X = (A < B) ? CONST1 : CONST2
      // create selector: mask = 0xffffffff if (a < b); 0 otherwise
      mask = _mm_cmplt_ps(a, b);
      // blend values using mask
      res = _mm_blendv_ps(const2, const1, mask);
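A four-lane version of the select can be written with baseline SSE only. Note the substitution: the slide uses `_mm_blendv_ps`, which needs SSE4.1, while this sketch builds the same blend from `_mm_and_ps`, `_mm_andnot_ps` and `_mm_or_ps` so that it compiles without extra flags. `SelectLess4` and `SKETCH_HAS_SSE` are hypothetical names, and a scalar fallback is included for non-x86 targets.

```cpp
#include <array>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE 1
#endif

// Per lane: out[i] = (a[i] < b[i]) ? const1 : const2.
inline std::array<float, 4> SelectLess4(const std::array<float, 4>& a,
                                        const std::array<float, 4>& b,
                                        float const1, float const2) {
    std::array<float, 4> out{};
#if defined(SKETCH_HAS_SSE)
    const __m128 va = _mm_loadu_ps(a.data());
    const __m128 vb = _mm_loadu_ps(b.data());
    const __m128 c1 = _mm_set1_ps(const1);
    const __m128 c2 = _mm_set1_ps(const2);
    const __m128 mask = _mm_cmplt_ps(va, vb);  // all ones where a < b, else zero
    // (mask & c1) | (~mask & c2): both candidates exist, the mask picks one.
    const __m128 res = _mm_or_ps(_mm_and_ps(mask, c1), _mm_andnot_ps(mask, c2));
    _mm_storeu_ps(out.data(), res);
#else
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = (a[i] < b[i]) ? const1 : const2;  // scalar fallback
    }
#endif
    return out;
}
```

Both constants are materialized for every lane, which is the "compute both" cost the talk refers to.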
  • 47. Optimizing branches • Compute both summary: • Do it only for hard-to-predict branches • Obviously have to compute both results • Introduces a data dependency that blocks out-of-order execution • Profile!
  • 48. Optimizing branches • Example: Need to update a squad • Blue nodes - archers • Red nodes - swordsmen
  • 49. Optimizing branches • Branching every iteration? • Bad performance for hard-to-predict branches
      struct CombatActor {
          // Data…
          EUnitType Type; // ARCHER or SWORDSMAN
      };
      struct Squad {
          CombatActor Units[SIZE][SIZE];
      };
      void UpdateArmy(const Squad& squad) {
          for (auto i = 0; i < SIZE; ++i)
              for (auto j = 0; j < SIZE; ++j) {
                  const auto& Unit = squad.Units[i][j];
                  switch (Unit.Type) {
                      case EUnitType::ARCHER:
                          // Process archer
                          break;
                      case EUnitType::SWORDSMAN:
                          // Process swordsman
                          break;
                      default:
                          // Handle default
                          break;
                  }
              }
      }
  • 50. Optimizing branches • Split! And process separately • No branching in processing methods • + Better utilization of I-cache!
      struct CombatActor {
          // Data…
          EUnitType Type; // ARCHER or SWORDSMAN
      };
      struct Squad {
          CombatActor Archers[A_SIZE];
          CombatActor Swordsmen[S_SIZE];
      };
      void UpdateArchers(const Squad& squad) {
          // Just iterate and process, no branching here
          // Update archers
      }
      void UpdateSwordsmen(const Squad& squad) {
          // Just iterate and process, no branching here
          // Update swordsmen
      }
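A runnable sketch of the split shown on slide 50, with assumptions flagged: the `health` field, the per-type update amounts and the helper `SplitByType` are invented for illustration, and `std::vector` replaces the fixed-size arrays. The type check happens once, when units are partitioned, instead of once per unit per frame.

```cpp
#include <vector>

enum class EUnitType { ARCHER, SWORDSMAN };

struct CombatActor {
    float health;    // hypothetical data field
    EUnitType type;
};

struct Squad {
    std::vector<CombatActor> archers;
    std::vector<CombatActor> swordsmen;
};

// One-off partition: the only place that inspects the unit type.
Squad SplitByType(const std::vector<CombatActor>& mixed) {
    Squad squad;
    for (const CombatActor& unit : mixed) {
        if (unit.type == EUnitType::ARCHER) {
            squad.archers.push_back(unit);
        } else {
            squad.swordsmen.push_back(unit);
        }
    }
    return squad;
}

// Per-frame updates: tight loops with no type dispatch inside.
void UpdateArchers(Squad& squad) {
    for (CombatActor& archer : squad.archers) archer.health += 1.0f;  // placeholder work
}

void UpdateSwordsmen(Squad& squad) {
    for (CombatActor& swordsman : squad.swordsmen) swordsman.health += 2.0f;  // placeholder work
}
```

Each update loop now runs a single code path, which is also the reason the slide gives for better I-cache utilization.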
  • 51. Optimizing branches • For very predictable branches: • Generally prefer conditional branches that are predicted not taken • Depending on the architecture, a predicted-taken branch may incur slightly more latency
  • 52. Optimizing branches • Imagine cmp dword ptr [data], 0 is likely to evaluate to “false” • Prefer predicted not taken
      Predicted not taken:
      ; function prologue
      cmp dword ptr [data], 0
      je END
      ; set of some ALU instructions…
      ;…
      END:
      ; function epilogue
      Predicted taken:
      ; function prologue
      cmp dword ptr [data], 0
      jne COMP
      jmp END
      COMP:
      ; set of some ALU instructions…
      ;…
      END:
      ; function epilogue
  • 53. Optimizing branches • Study the branch predictor on the target architecture • Consider whether you really need a branch • Compute both results • Bit/Math hacks • Study the data and split it • Based on access patterns • Based on performed computation
  • 54. Conclusion • Know your hardware • Architecture matters! • Design code around data, not abstractions • Hardware is a real thing
  • 55. Resources • http://www.agner.org/optimize/microarchitecture.pdf • https://people.freebsd.org/~lstewart/articles/cpumemory.pdf • http://support.amd.com/TechDocs/47414_15h_sw_opt_guide.pdf • http://www.intel.ru/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf • https://graphics.stanford.edu/~seander/bithacks.html
  • 56. Questions? • E-mail: evgeny.muralev@sperasoft.com • Twitter: @EvgenyGD • Web: evgenymuralev.com
  • 57. • www.sperasoft.com • Follow us on Twitter, LinkedIn and Facebook! Sperasoft

Editor's Notes

  1. Out-of-order (OOO) execution hides small cache latencies. The drawbacks, as just described, are that the access patterns must be trivial and that prefetching cannot happen across page boundaries.