Intel Software Conference 2014 Brazil
May 2014
Leonardo Borges
Notes on NUMA architecture
Non-Uniform Memory Access (NUMA)
FSB architecture
- All memory in one location
Starting with Nehalem
- Memory located in multiple places
Latency to memory depends on location
Local memory
- Highest BW
- Lowest latency
Remote Memory
- Higher latency
[Figure: Socket 0 and Socket 1 connected by QPI]
Ensure software is NUMA-optimized for best performance
Notes for Intel Software Conference – Brazil, May 2014
Non-Uniform Memory Access (NUMA)
Locality matters
- Remote memory access latency is ~1.7x that of local memory
- Local memory bandwidth can be up to 2x greater than remote
Intel® QPI = Intel® QuickPath Interconnect
[Figure: Node 0 (CPU0 + DRAM) and Node 1 (CPU1 + DRAM) linked by Intel® QPI; a local memory access stays within a node, a remote memory access crosses QPI]
BIOS:
- NUMA mode (NUMA Enabled)
  First half of memory space on Node 0, second half on Node 1
  Should be default on Nehalem (!)
- Non-NUMA (NUMA Disabled)
  Even/odd cache lines assigned to Nodes 0/1: line interleaving
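The two BIOS mappings above can be sketched as address-to-node functions. This is a minimal illustration, not how any particular chipset decodes addresses: it assumes exactly two nodes, an equal split of memory between them, and 64-byte cache lines. The function names and the `CACHE_LINE_BYTES` constant are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u   /* assumption: 64-byte cache lines */

/* NUMA mode (assumed two nodes, equal split):
 * first half of the physical address space on node 0, second half on node 1. */
int node_numa_enabled(uint64_t phys_addr, uint64_t total_bytes)
{
    assert(total_bytes > 0 && phys_addr < total_bytes);
    return phys_addr < total_bytes / 2 ? 0 : 1;
}

/* Non-NUMA mode (line interleaving):
 * even cache lines on node 0, odd cache lines on node 1. */
int node_line_interleaved(uint64_t phys_addr)
{
    return (int)((phys_addr / CACHE_LINE_BYTES) % 2u);
}
```

Under these assumptions, a thread streaming through a buffer in non-NUMA mode alternates nodes every 64 bytes, so roughly half of its accesses are remote no matter which socket it runs on. In NUMA mode the same buffer can sit entirely on one node.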
Local Memory Access Example
CPU0 requests cache line X, not present in any CPU0 cache
Step 1:
- CPU0 requests data from its DRAM
- CPU0 snoops CPU1 to check if data is present
Step 2:
- DRAM returns data
- CPU1 returns snoop response
Local memory latency is the maximum latency of the two responses
Nehalem optimized to keep key latencies close to each other
[Figure: CPU0 and CPU1, each with its own DRAM, connected by QPI]
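The statement that local latency is the maximum of the two responses can be written as a one-line model. This is only a sketch of that sentence; the function name is hypothetical and the inputs are whatever latency unit the reader chooses.

```c
#include <assert.h>

/* A local access completes only when both the DRAM data and the
 * snoop response from the other socket have arrived, so the slower
 * of the two determines the observed latency. */
double local_access_latency(double dram_latency, double snoop_latency)
{
    assert(dram_latency >= 0.0 && snoop_latency >= 0.0);
    return dram_latency > snoop_latency ? dram_latency : snoop_latency;
}
```

This is why keeping the two latencies close matters: if the snoop response were much slower than DRAM, every local access would pay the snoop cost.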
Remote Memory Access Example
CPU0 requests cache line X, not present in any CPU0 cache
- CPU0 requests data from CPU1
- Request sent over QPI to CPU1
- CPU1’s IMC makes request to its DRAM
- CPU1 snoops internal caches
- Data returned to CPU0 over QPI
Remote memory latency is a function of having a low-latency interconnect
[Figure: CPU0 and CPU1, each with its own DRAM, connected by QPI]
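To get a feel for what remote accesses cost, the ~1.7x remote-to-local latency ratio quoted earlier in these notes can be plugged into a simple weighted average. This is a back-of-the-envelope model, not a measurement: it assumes every access is either fully local or fully remote and ignores caches and bandwidth limits. The function and parameter names are hypothetical.

```c
#include <assert.h>

/* Average memory latency when remote_fraction of the accesses go to the
 * other node and a remote access costs remote_ratio times a local one. */
double average_latency(double local_latency, double remote_ratio,
                       double remote_fraction)
{
    assert(local_latency >= 0.0 && remote_ratio >= 1.0);
    assert(remote_fraction >= 0.0 && remote_fraction <= 1.0);
    double remote_latency = local_latency * remote_ratio;
    return (1.0 - remote_fraction) * local_latency
         + remote_fraction * remote_latency;
}
```

With a ratio of 1.7 and half of the accesses remote, the model gives an average of about 1.35x the local latency. With no remote accesses it returns the local latency unchanged.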
Non-Uniform Memory Access and Parallel Execution
Process-parallel execution:
- NUMA friendly: data belongs only to the process
- E.g. MPI
- Affinity pinning maximizes local memory access
- Standard for HPC
Shared-memory threading:
- More problematic: the same thread may require data from multiple NUMA nodes
- E.g. OpenMP, TBB, explicit threading
- OS-scheduled thread migration can aggravate the situation
- NUMA and non-NUMA modes should be compared
Operating System Differences
Operating systems allocate data differently
Linux*
- malloc reserves the memory
- The physical page is assigned when the data is first touched (first touch)
  Many HPC codes initialize memory from a single ‘master’ thread, so all pages land on that thread's node!
- A couple of extensions are available via numactl and libnuma, e.g.
  numactl --interleave=all /bin/program
  numactl --cpunodebind=1 --membind=1 /bin/program
  numactl --hardware
  numa_run_on_node(3) // run thread on node 3
Microsoft Windows*
- malloc assigns the physical page on allocation
- This default allocation policy is not NUMA friendly
- Microsoft Windows has NUMA-friendly APIs
  VirtualAlloc reserves memory (like malloc on Linux*)
  Physical pages are assigned at first use
For more details:
http://kernel.org/pub/linux/kernel/people/christoph/pmig/numamemory.pdf
http://msdn.microsoft.com/en-us/library/aa363804.aspx
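The Linux behaviour described above (memory is reserved at allocation, the physical page is assigned at first touch) can be observed with a small sketch. It is Linux-specific and uses `mmap` and `mincore` rather than `malloc`, so that page boundaries are under control. It shows only when a page becomes resident, not which NUMA node it lands on. The function names and the `DEMO_MAX_PAGES` limit are hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* for mincore() and MAP_ANONYMOUS */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define DEMO_MAX_PAGES 64        /* hypothetical upper bound for this sketch */

/* Number of pages in [base, base + npages * page_size) that are
 * resident in physical memory, or -1 if mincore() fails. */
long resident_pages(void *base, size_t npages)
{
    unsigned char vec[DEMO_MAX_PAGES];
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    assert(npages > 0 && npages <= DEMO_MAX_PAGES);
    if (mincore(base, npages * page, vec) != 0)
        return -1;
    long count = 0;
    for (size_t i = 0; i < npages; i++)
        count += vec[i] & 1;     /* low bit set means the page is resident */
    return count;
}

/* Reserve npages of anonymous memory, count resident pages, write to the
 * first page, and count again. Returns 0 on success, -1 on failure. */
int first_touch_demo(size_t npages, long *before, long *after)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = npages * page;
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;
    *before = resident_pages(buf, npages);
    *(volatile char *)buf = 1;   /* first touch of page 0 */
    *after = resident_pages(buf, npages);
    munmap(buf, len);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

On a typical Linux kernel one would expect `before` to be smaller than `after`, since only the touched page needs physical backing. Sandboxed or virtualized environments may report residency differently. On a NUMA system the touched page is placed on the node of the touching thread, which is why single-threaded initialization concentrates all data on one node.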
Other Ways to Set Process Affinity
taskset: sets or retrieves the CPU affinity of a process
Intel MPI: use the I_MPI_PIN and I_MPI_PIN_PROCESSOR_LIST environment variables
KMP_AFFINITY with the Intel compilers' OpenMP runtime
- compact: binds OpenMP thread n+1 as close as possible to OpenMP thread n
- scatter: distributes threads as evenly as possible across the entire system; the opposite of compact
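Affinity can also be set from inside a program. The sketch below uses the Linux `sched_getaffinity` and `sched_setaffinity` calls to pin the calling thread to one CPU, roughly what `taskset` does from the command line. It is Linux/glibc-specific, and the two function names are hypothetical helpers, not part of any library.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              /* for cpu_set_t and the CPU_* macros */
#endif
#include <assert.h>
#include <sched.h>

/* Number of CPUs the calling thread may currently run on, or -1 on error. */
int allowed_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Pin the calling thread to the lowest-numbered CPU it is currently
 * allowed to use. Returns that CPU number, or -1 on failure. */
int pin_to_first_allowed_cpu(void)
{
    cpu_set_t allowed, one;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &allowed)) {
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            assert(CPU_COUNT(&one) == 1);
            return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
        }
    }
    return -1;
}
```

Picking the first allowed CPU rather than hard-coding CPU 0 keeps the sketch working inside containers or batch jobs that restrict the CPU set. In practice the middleware variables listed above are usually the simpler route.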
NUMA Application Level Tuning:
Shared Memory Threading Example: TRIAD
Parallelized time-consuming hotspot “TRIAD” (e.g. from the STREAM benchmark) using OpenMP
int main(void) {
  …
  #pragma omp parallel
  {
    // Parallelized TRIAD loop…
    // Work-sharing loop inside the enclosing parallel region
    #pragma omp for private(j)
    for (j=0; j<N; j++)
      a[j] = b[j]+scalar*c[j];
  } // end omp parallel
  …
} // end main
Parallelizing hotspots may not be sufficient for NUMA
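NUMA Shared Memory Threading Example (Linux*)
KMP_AFFINITY=compact,0,verbose
int main(void) {
  …
  #pragma omp parallel
  {
    // Parallel initialization: each thread touches its own part of the arrays
    #pragma omp for private(i) schedule(static)
    for (i=0; i<N; i++)
      { a[i] = 10.0; b[i] = 10.0; c[i] = 10.0; }
    …
    // Parallelized TRIAD loop…
    // Same static schedule, so each thread revisits the indices it initialized
    #pragma omp for private(j) schedule(static)
    for (j=0; j<N; j++)
      a[j] = b[j]+scalar*c[j];
  } // end omp parallel
  …
} // end main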
Notes on the example:
- KMP_AFFINITY: environment variable to pin affinity
- Initialization loop: each thread initializes its data, pinning the pages to local memory
- TRIAD loop: the same thread that initialized the data uses the data
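For readers who want to try the pattern, here is a self-contained sketch of the example above with the elided parts filled in by assumption: heap-allocated arrays, an explicit `schedule(static)` on both loops, and a function wrapper named `triad_first_touch` (hypothetical) that returns the last element so the result can be checked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Allocate a, b, c, initialize them in parallel (first touch), run TRIAD
 * with the same static schedule, and return the last element of a.
 * Returns -1.0 if allocation fails. */
double triad_first_touch(size_t n, double scalar)
{
    assert(n > 0);
    double *a = malloc(n * sizeof *a);
    double *b = malloc(n * sizeof *b);
    double *c = malloc(n * sizeof *c);
    if (!a || !b || !c) {
        free(a); free(b); free(c);
        return -1.0;
    }
    long len = (long)n;

    #pragma omp parallel
    {
        /* Each thread touches, and so places, its own block of pages. */
        #pragma omp for schedule(static)
        for (long i = 0; i < len; i++) {
            a[i] = 10.0; b[i] = 10.0; c[i] = 10.0;
        }

        /* Same schedule: each thread works on the block it initialized. */
        #pragma omp for schedule(static)
        for (long j = 0; j < len; j++)
            a[j] = b[j] + scalar * c[j];
    }

    double last = a[len - 1];
    free(a); free(b); free(c);
    return last;
}
```

The code needs the compiler's OpenMP option to run in parallel (for example `-fopenmp` with GCC, or the corresponding Intel compiler option). Without it the pragmas are ignored and the function runs serially with the same result. The locality benefit also depends on threads staying put, which is what the KMP_AFFINITY setting shown above is for.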
NUMA Optimization Summary
NUMA adds complexity to software parallelization and optimization
Optimize for latency and for bandwidth
- In most cases the goal is to minimize latency
- Use local memory
- Keep memory near the thread that accesses it
- Keep the thread near the memory it uses
Rely on quality middleware for CPU affinitization
- Example: Intel Compiler OpenMP or MPI environment variables
Application-level tuning may be required to minimize the effects of the NUMA first-touch policy
