SlideShare ist ein Scribd-Unternehmen logo
1 von 60
MEMORY OPTIMIZATION Christer Ericson Sony Computer Entertainment, Santa Monica (christer_ericson@playstation.sony.com)
Talk contents 1/2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Talk contents 2/2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Problem statement ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Need more justification? 1/3 SIMD instructions consume data at 2-8 times the rate of normal instructions! Instruction parallelism:
Need more justification? 2/3 Improvements to compiler technology double program performance every  ~18 years ! Proebsting’s law: Corollary: Don’t expect the compiler to do it for you!
Need more justification? 3/3 ,[object Object],[object Object],[object Object],On Moore’s law:
Brief cache review ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The memory hierarchy Main memory L2 cache L1 cache CPU ~1-5 cycles ~5-20 cycles ~40-100 cycles 1 cycle Roughly:
Some cache specs ,[object Object],[object Object],L1 cache (I/D) L2 cache PS2 16K/8K †  2-way N/A GameCube 32K/32K ‡  8-way 256K 2-way unified XBOX 16K/16K 4-way 128K 8-way unified PC ~32-64K ~128-512K
Foes: 3 C’s of cache misses ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Friends: Introducing the 3 R’s ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],Compulsory Capacity Conflict Rearrange X (x) X Reduce X X (x) Reuse (x) X
Measuring cache utilization ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Code cache optimization 1/2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Code cache optimization 2/2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Data cache optimization ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Prefetching and preloading ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Software prefetching const int kLookAhead = 4;  // Some elements ahead for (int i = 0; i < 4 * n; i += 4) { Prefetch(elem[i + kLookAhead]); Process(elem[i + 0]); Process(elem[i + 1]); Process(elem[i + 2]); Process(elem[i + 3]); } // Loop through and process all 4n elements for (int i = 0; i < 4 * n; i++) Process(elem[i]);
Greedy prefetching void PreorderTraversal(Node *pNode) { // Greedily prefetch left traversal path Prefetch(pNode->left); // Process the current node Process(pNode); // Greedily prefetch right traversal path Prefetch(pNode->right); // Recursively visit left then right subtree PreorderTraversal(pNode->left); PreorderTraversal(pNode->right); }
Preloading (pseudo-prefetch) (NB: This code reads one element beyond the end of the elem array.)  Elem a = elem[0]; for (int i = 0; i < 4 * n; i += 4) { Elem e = elem[i + 4];  // Cache miss, non-blocking Elem b = elem[i + 1];  // Cache hit Elem c = elem[i + 2];  // Cache hit Elem d = elem[i + 3];  // Cache hit Process(a); Process(b); Process(c); Process(d); a = e; }
Structures ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Field reordering ,[object Object],void Foo(S *p, void *key, int k) { while (p) { if (p->key == key) { p->count[k]++; break; } p = p->pNext; } } struct S { void *key; int count[20]; S *pNext; }; struct S { void *key; S *pNext; int count[20]; };
Hot/cold splitting ,[object Object],[object Object],[object Object],[object Object],Hot fields: Cold fields: struct S { void *key; S *pNext; S2 *pCold; }; struct S2 { int count[10]; };
Hot/cold splitting
Beware compiler padding ,[object Object],Decreasing size! struct X { int8 a; int64 b; int8 c; int16 d; int64 e; float f; }; struct Z { int64 b; int64 e; float f; int16 d; int8 a; int8 c; }; struct Y { int8 a, pad_a[7]; int64 b; int8 c, pad_c[1]; int16 d, pad_d[2]; int64 e; float f, pad_f[1]; };
Cache performance analysis ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Tree data structures ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Breadth-first order ,[object Object],[object Object]
Depth-first order ,[object Object],[object Object]
van Emde Boas layout ,[object Object],[object Object]
A compact static k-d tree ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Linearization caching ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
 
Memory allocation policy ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
The curse of aliasing ,[object Object],Aliasing is also missed opportunities for optimization What value is returned here? Who knows! Aliasing is multiple references to the same storage location int Foo(int *a, int *b) { *a = 1; *b = 2; return *a; } int n; int *p1 = &n; int *p2 = &n;
The curse of aliasing ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
How do we do ‘anti-aliasing’? ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],†  To be defined
Matrix multiplication 1/3 Consider optimizing a 2x2 matrix multiplication: How do we typically optimize it? Right, unrolling! Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ for (int i = 0; i < 2; i++) { for (int j = 0; j < 2; j++) { a[i][j] = 0.0f; for (int k = 0; k < 2; k++) a[i][j] += b[i][k] * c[k][j]; } } }
Matrix multiplication 2/3 ,[object Object],[object Object],[object Object],[object Object],[object Object],Staightforward unrolling results in this: // 16 memory reads, 4 writes Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ a[0][0] = b[0][0]*c[0][0] + b[0][1]*c[1][0]; a[0][1] = b[0][0]*c[0][1] + b[0][1]*c[1][1];  //(1) a[1][0] = b[1][0]*c[0][0] + b[1][1]*c[1][0];  //(2) a[1][1] = b[1][0]*c[0][1] + b[1][1]*c[1][1];  //(3) }
Matrix multiplication 3/3 A correct approach is instead writing it as: … before producing outputs Consume inputs… // 8 memory reads, 4 writes Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ float b00 = b[0][0], b01 = b[0][1]; float b10 = b[1][0], b11 = b[1][1]; float c00 = c[0][0], c01 = c[0][1]; float c10 = c[1][0], c11 = c[1][1]; a[0][0] = b00*c00 + b01*c10; a[0][1] = b00*c01 + b01*c11; a[1][0] = b10*c00 + b11*c10; a[1][1] = b10*c01 + b11*c11; }
Abstraction penalty problem ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
C++ abstraction penalty ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
C++ abstraction penalty Pointer members in classes may alias other members: Code likely to refetch  numVals  each iteration! numVals  not a local variable! May be aliased by  pBuf ! class Buf { public: void Clear() { for (int i = 0; i < numVals; i++) pBuf[i] = 0; } private: int numVals, *pBuf; }
C++ abstraction penalty We know that aliasing won’t happen, and can manually solve the aliasing issue by writing code as: class Buf { public: void Clear() { for (int i = 0, n = numVals; i < n; i++) pBuf[i] = 0; } private: int numVals, *pBuf; }
C++ abstraction penalty Since  pBuf[i]  can only alias  numVals  in the first iteration, a quality compiler can fix this problem by peeling the loop once, turning it into: Q: Does  your  compiler do this optimization?! void Clear() { if (numVals >= 1) { pBuf[0] = 0; for (int i = 1, n = numVals; i < n; i++) pBuf[i] = 0; } }
Type-based alias analysis ,[object Object],[object Object],Use language types to disambiguate memory references!
Type-based alias analysis ,[object Object],[object Object],[object Object],[object Object],[object Object]
Compatibility of C/C++ types ,[object Object],[object Object],[object Object],[object Object],[object Object]
What TBAA can do for you into this: It can turn this: Possible aliasing between v[i]  and  *n No aliasing possible so fetch  *n  once! void Foo(float *v, int *n) { for (int i = 0; i < *n; i++) v[i] += 1.0f; } void Foo(float *v, int *n) { int t = *n; for (int i = 0; i < t; i++) v[i] += 1.0f; }
What TBAA can also do ,[object Object],[object Object],Required by standard Allowed By gcc Illegal C/C++ code! uint32 i; float f; i = *((uint32 *)&f); uint32 i; union { float f; uchar8 c[4]; } u; u.f = f; i = (u.c[3]<<24L)+ (u.c[2]<<16L)+ ...; uint32 i; union { float f; uint32 i; } u; u.f = f; i = u.i;
Restrict-qualified pointers ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Using the restrict keyword Given this code: You really want the compiler to treat it as if written: But because of possible aliasing it cannot! void Foo(float v[], float *c, int n) { for (int i = 0; i < n; i++) v[i] = *c + 1.0f; } void Foo(float v[], float *c, int n) { float tmp = *c + 1.0f; for (int i = 0; i < n; i++) v[i] = tmp; }
Using the restrict keyword giving for the first version: and for the second version: For example, the code might be called as: The compiler must be conservative, and cannot perform the optimization! v[] = 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 v[] = 1, 1, 1, 1, 1, 2, 2, 2, 2, 2 float a[10]; a[4] = 0.0f; Foo(a, &a[4], 10);
Solving the aliasing problem The fix? Declaring the output as  restrict : ,[object Object],[object Object],[object Object],[object Object],[object Object],void Foo(float * restrict v, float *c, int n) { for (int i = 0; i < n; i++)  v[i] = *c + 1.0f; }
‘ const’ doesn’t help Some might think this would work: ,[object Object],[object Object],[object Object],[object Object],Since  *c  is const,  v[i]  cannot write to it, right? void Foo(float v[], const float *c, int n) { for (int i = 0; i < n; i++)  v[i] = *c + 1.0f; }
SIMD + restrict = TRUE ,[object Object],Independent loads and stores. Operations can be performed in parallel! Stores may alias loads. Must perform operations sequentially. void VecAdd(int *a, int *b, int *c) { for (int i = 0; i < 4; i++)  a[i] = b[i] + c[i];  } void VecAdd(int * restrict a, int *b, int *c) { for (int i = 0; i < 4; i++)  a[i] = b[i] + c[i];  }
Restrict-qualified pointers ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Tips for avoiding aliasing ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
That’s it! – Resources 1/2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]
Resources 2/2 ,[object Object],[object Object],[object Object],[object Object],[object Object],[object Object]

Weitere ähnliche Inhalte

Was ist angesagt?

Pitfalls of Object Oriented Programming by SONY
Pitfalls of Object Oriented Programming by SONYPitfalls of Object Oriented Programming by SONY
Pitfalls of Object Oriented Programming by SONYAnaya Medias Swiss
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineNarann29
 
System Booting Process overview
System Booting Process overviewSystem Booting Process overview
System Booting Process overviewRajKumar Rampelli
 
JVM JIT-compiler overview @ JavaOne Moscow 2013
JVM JIT-compiler overview @ JavaOne Moscow 2013JVM JIT-compiler overview @ JavaOne Moscow 2013
JVM JIT-compiler overview @ JavaOne Moscow 2013Vladimir Ivanov
 
M|18 How to use MyRocks with MariaDB Server
M|18 How to use MyRocks with MariaDB ServerM|18 How to use MyRocks with MariaDB Server
M|18 How to use MyRocks with MariaDB ServerMariaDB plc
 
Linux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver OverviewLinux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver OverviewRajKumar Rampelli
 
How we optimized our Game - Jake & Tess' Finding Monsters Adventure
How we optimized our Game - Jake & Tess' Finding Monsters AdventureHow we optimized our Game - Jake & Tess' Finding Monsters Adventure
How we optimized our Game - Jake & Tess' Finding Monsters AdventureFelipe Lira
 
Five Rendering Ideas from Battlefield 3 & Need For Speed: The Run
Five Rendering Ideas from Battlefield 3 & Need For Speed: The RunFive Rendering Ideas from Battlefield 3 & Need For Speed: The Run
Five Rendering Ideas from Battlefield 3 & Need For Speed: The RunElectronic Arts / DICE
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14AMD Developer Central
 
Qemu device prototyping
Qemu device prototypingQemu device prototyping
Qemu device prototypingYan Vugenfirer
 
Malware Analysis - x86 Disassembly
Malware Analysis - x86 DisassemblyMalware Analysis - x86 Disassembly
Malware Analysis - x86 DisassemblyNatraj G
 
Mca ii os u-1 introduction to os
Mca  ii  os u-1 introduction to osMca  ii  os u-1 introduction to os
Mca ii os u-1 introduction to osRai University
 
Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Adrian Huang
 
“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core
“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core
“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor CoreAMD
 
C++ Restrictions for Game Programming.
C++ Restrictions for Game Programming.C++ Restrictions for Game Programming.
C++ Restrictions for Game Programming.Richard Taylor
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Johan Andersson
 

Was ist angesagt? (20)

Pitfalls of Object Oriented Programming by SONY
Pitfalls of Object Oriented Programming by SONYPitfalls of Object Oriented Programming by SONY
Pitfalls of Object Oriented Programming by SONY
 
Advanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering PipelineAdvanced Scenegraph Rendering Pipeline
Advanced Scenegraph Rendering Pipeline
 
System Booting Process overview
System Booting Process overviewSystem Booting Process overview
System Booting Process overview
 
JVM JIT-compiler overview @ JavaOne Moscow 2013
JVM JIT-compiler overview @ JavaOne Moscow 2013JVM JIT-compiler overview @ JavaOne Moscow 2013
JVM JIT-compiler overview @ JavaOne Moscow 2013
 
M|18 How to use MyRocks with MariaDB Server
M|18 How to use MyRocks with MariaDB ServerM|18 How to use MyRocks with MariaDB Server
M|18 How to use MyRocks with MariaDB Server
 
Linux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver OverviewLinux Kernel MMC Storage driver Overview
Linux Kernel MMC Storage driver Overview
 
How we optimized our Game - Jake & Tess' Finding Monsters Adventure
How we optimized our Game - Jake & Tess' Finding Monsters AdventureHow we optimized our Game - Jake & Tess' Finding Monsters Adventure
How we optimized our Game - Jake & Tess' Finding Monsters Adventure
 
Case study windows
Case study windowsCase study windows
Case study windows
 
Five Rendering Ideas from Battlefield 3 & Need For Speed: The Run
Five Rendering Ideas from Battlefield 3 & Need For Speed: The RunFive Rendering Ideas from Battlefield 3 & Need For Speed: The Run
Five Rendering Ideas from Battlefield 3 & Need For Speed: The Run
 
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14
 
Qemu device prototyping
Qemu device prototypingQemu device prototyping
Qemu device prototyping
 
Malware Analysis - x86 Disassembly
Malware Analysis - x86 DisassemblyMalware Analysis - x86 Disassembly
Malware Analysis - x86 Disassembly
 
Multiprocessor
MultiprocessorMultiprocessor
Multiprocessor
 
Mca ii os u-1 introduction to os
Mca  ii  os u-1 introduction to osMca  ii  os u-1 introduction to os
Mca ii os u-1 introduction to os
 
Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...Process Address Space: The way to create virtual address (page table) of user...
Process Address Space: The way to create virtual address (page table) of user...
 
“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core
“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core
“Zen 3”: AMD 2nd Generation 7nm x86-64 Microprocessor Core
 
Evolution of os
Evolution of osEvolution of os
Evolution of os
 
Hardware e Componenti
Hardware e ComponentiHardware e Componenti
Hardware e Componenti
 
C++ Restrictions for Game Programming.
C++ Restrictions for Game Programming.C++ Restrictions for Game Programming.
C++ Restrictions for Game Programming.
 
Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)Parallel Futures of a Game Engine (v2.0)
Parallel Futures of a Game Engine (v2.0)
 

Ähnlich wie CACHE-OPTIMIZED MATRIX MULTIPLICATION

Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)RichardWarburton
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonJAXLondon2014
 
Gdc03 ericson memory_optimization
Gdc03 ericson memory_optimizationGdc03 ericson memory_optimization
Gdc03 ericson memory_optimizationbrettlevin
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictabilityRichardWarburton
 
Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta PyData
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictabilityRichardWarburton
 
PPU Optimisation Lesson
PPU Optimisation LessonPPU Optimisation Lesson
PPU Optimisation Lessonslantsixgames
 
Caching in (DevoxxUK 2013)
Caching in (DevoxxUK 2013)Caching in (DevoxxUK 2013)
Caching in (DevoxxUK 2013)RichardWarburton
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMDWei-Ta Wang
 
DConf 2016: Keynote by Walter Bright
DConf 2016: Keynote by Walter Bright DConf 2016: Keynote by Walter Bright
DConf 2016: Keynote by Walter Bright Andrei Alexandrescu
 
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docxECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docxtidwellveronique
 
ECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docxECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docxtidwellveronique
 
Coding for multiple cores
Coding for multiple coresCoding for multiple cores
Coding for multiple coresLee Hanxue
 

Ähnlich wie CACHE-OPTIMIZED MATRIX MULTIPLICATION (20)

Performance and predictability (1)
Performance and predictability (1)Performance and predictability (1)
Performance and predictability (1)
 
Performance and Predictability - Richard Warburton
Performance and Predictability - Richard WarburtonPerformance and Predictability - Richard Warburton
Performance and Predictability - Richard Warburton
 
Gdc03 ericson memory_optimization
Gdc03 ericson memory_optimizationGdc03 ericson memory_optimization
Gdc03 ericson memory_optimization
 
Scope Stack Allocation
Scope Stack AllocationScope Stack Allocation
Scope Stack Allocation
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta Daniel Krasner - High Performance Text Processing with Rosetta
Daniel Krasner - High Performance Text Processing with Rosetta
 
Performance and predictability
Performance and predictabilityPerformance and predictability
Performance and predictability
 
PPU Optimisation Lesson
PPU Optimisation LessonPPU Optimisation Lesson
PPU Optimisation Lesson
 
Caching in (DevoxxUK 2013)
Caching in (DevoxxUK 2013)Caching in (DevoxxUK 2013)
Caching in (DevoxxUK 2013)
 
04 Cache Memory
04  Cache  Memory04  Cache  Memory
04 Cache Memory
 
Caching in
Caching inCaching in
Caching in
 
Happy To Use SIMD
Happy To Use SIMDHappy To Use SIMD
Happy To Use SIMD
 
C language
C languageC language
C language
 
DConf 2016: Keynote by Walter Bright
DConf 2016: Keynote by Walter Bright DConf 2016: Keynote by Walter Bright
DConf 2016: Keynote by Walter Bright
 
C language
C languageC language
C language
 
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docxECECS 472572 Final Exam ProjectRemember to check the errat.docx
ECECS 472572 Final Exam ProjectRemember to check the errat.docx
 
ECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docxECECS 472572 Final Exam ProjectRemember to check the err.docx
ECECS 472572 Final Exam ProjectRemember to check the err.docx
 
Coding for multiple cores
Coding for multiple coresCoding for multiple cores
Coding for multiple cores
 
Cache recap
Cache recapCache recap
Cache recap
 
Cache recap
Cache recapCache recap
Cache recap
 

Kürzlich hochgeladen

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Kürzlich hochgeladen (20)

TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

CACHE-OPTIMIZED MATRIX MULTIPLICATION

  • 1. MEMORY OPTIMIZATION Christer Ericson Sony Computer Entertainment, Santa Monica (christer_ericson@playstation.sony.com)
  • 2.
  • 3.
  • 4.
  • 5. Need more justification? 1/3 SIMD instructions consume data at 2-8 times the rate of normal instructions! Instruction parallelism:
  • 6. Need more justification? 2/3 Improvements to compiler technology double program performance every ~18 years ! Proebsting’s law: Corollary: Don’t expect the compiler to do it for you!
  • 7.
  • 8.
  • 9. The memory hierarchy Main memory L2 cache L1 cache CPU ~1-5 cycles ~5-20 cycles ~40-100 cycles 1 cycle Roughly:
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18. Software prefetching const int kLookAhead = 4; // Some elements ahead for (int i = 0; i < 4 * n; i += 4) { Prefetch(elem[i + kLookAhead]); Process(elem[i + 0]); Process(elem[i + 1]); Process(elem[i + 2]); Process(elem[i + 3]); } // Loop through and process all 4n elements for (int i = 0; i < 4 * n; i++) Process(elem[i]);
  • 19. Greedy prefetching void PreorderTraversal(Node *pNode) { // Greedily prefetch left traversal path Prefetch(pNode->left); // Process the current node Process(pNode); // Greedily prefetch right traversal path Prefetch(pNode->right); // Recursively visit left then right subtree PreorderTraversal(pNode->left); PreorderTraversal(pNode->right); }
  • 20. Preloading (pseudo-prefetch) (NB: This code reads one element beyond the end of the elem array.) Elem a = elem[0]; for (int i = 0; i < 4 * n; i += 4) { Elem e = elem[i + 4]; // Cache miss, non-blocking Elem b = elem[i + 1]; // Cache hit Elem c = elem[i + 2]; // Cache hit Elem d = elem[i + 3]; // Cache hit Process(a); Process(b); Process(c); Process(d); a = e; }
  • 21.
  • 22.
  • 23.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31.
  • 32.
  • 33.  
  • 34.
  • 35.
  • 36.
  • 37.
  • 38. Matrix multiplication 1/3 Consider optimizing a 2x2 matrix multiplication: How do we typically optimize it? Right, unrolling! Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ for (int i = 0; i < 2; i++) { for (int j = 0; j < 2; j++) { a[i][j] = 0.0f; for (int k = 0; k < 2; k++) a[i][j] += b[i][k] * c[k][j]; } } }
  • 39.
  • 40. Matrix multiplication 3/3 A correct approach is instead writing it as: … before producing outputs Consume inputs… // 8 memory reads, 4 writes Mat22mul(float a[2][2], float b[2][2], float c[2][2]){ float b00 = b[0][0], b01 = b[0][1]; float b10 = b[1][0], b11 = b[1][1]; float c00 = c[0][0], c01 = c[0][1]; float c10 = c[1][0], c11 = c[1][1]; a[0][0] = b00*c00 + b01*c10; a[0][1] = b00*c01 + b01*c11; a[1][0] = b10*c00 + b11*c10; a[1][1] = b10*c01 + b11*c11; }
  • 41.
  • 42.
  • 43. C++ abstraction penalty Pointer members in classes may alias other members: Code likely to refetch numVals each iteration! numVals not a local variable! May be aliased by pBuf ! class Buf { public: void Clear() { for (int i = 0; i < numVals; i++) pBuf[i] = 0; } private: int numVals, *pBuf; }
  • 44. C++ abstraction penalty We know that aliasing won’t happen, and can manually solve the aliasing issue by writing code as: class Buf { public: void Clear() { for (int i = 0, n = numVals; i < n; i++) pBuf[i] = 0; } private: int numVals, *pBuf; }
  • 45. C++ abstraction penalty Since pBuf[i] can only alias numVals in the first iteration, a quality compiler can fix this problem by peeling the loop once, turning it into: Q: Does your compiler do this optimization?! void Clear() { if (numVals >= 1) { pBuf[0] = 0; for (int i = 1, n = numVals; i < n; i++) pBuf[i] = 0; } }
  • 46.
  • 47.
  • 48.
  • 49. What TBAA can do for you into this: It can turn this: Possible aliasing between v[i] and *n No aliasing possible so fetch *n once! void Foo(float *v, int *n) { for (int i = 0; i < *n; i++) v[i] += 1.0f; } void Foo(float *v, int *n) { int t = *n; for (int i = 0; i < t; i++) v[i] += 1.0f; }
  • 50.
  • 51.
  • 52. Using the restrict keyword Given this code: You really want the compiler to treat it as if written: But because of possible aliasing it cannot! void Foo(float v[], float *c, int n) { for (int i = 0; i < n; i++) v[i] = *c + 1.0f; } void Foo(float v[], float *c, int n) { float tmp = *c + 1.0f; for (int i = 0; i < n; i++) v[i] = tmp; }
  • 53. Using the restrict keyword giving for the first version: and for the second version: For example, the code might be called as: The compiler must be conservative, and cannot perform the optimization! v[] = 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 v[] = 1, 1, 1, 1, 1, 2, 2, 2, 2, 2 float a[10]; a[4] = 0.0f; Foo(a, &a[4], 10);
  • 54.
  • 55.
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.