SlideShare ist ein Scribd-Unternehmen logo
1 von 49
Downloaden Sie, um offline zu lesen
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Performance Analysis with TAU, part II
George S. Markomanolis
8 August 2019
MPI+OpenMP
33 Open slide master to edit
MiniWeather MPI+OpenMP compilation
• module load pgi
• module load tau
• export TAU_MAKEFILE=/sw/summit/…/ibm64linux/lib/Makefile.tau-
pgi-papi-mpi-pdt-openmp-opari-pgi
• Export TAU_OPTIONS=‘-optLinking=-lpnetcdf –optPreProcess’
• Replace mpicxx with tau_cxx.sh in the Makefile
• make openmp
44 Open slide master to edit
MiniWeather MPI compilation and execute
• Compile with: make openmp
• Execution:
export TAU_METRICS=TIME:PAPI_TOT_INS:PAPI_TOT_CYC:PAPI_FP_OPS
export TAU_PROFILE=1
export TAU_TRACK_MESSAGE=1
export TAU_COMM_MATRIX=1
#TAU_CALLPATH=1
#TAU_CALLPATH_DEPTH=10
jsrun -n 64 -r 8 -a 1 -c 4 -b packed:4 ./miniWeather_mpi_openmp
55 Open slide master to edit
Paraprof - OpenMP
66 Open slide master to edit
Paraprof – OpenMP - TIME
77 Open slide master to edit
Paraprof – OpenMP - parallel loop
88 Open slide master to edit
Paraprof – OpenMP - IPC
GPU
1010 Open slide master to edit
MiniWeather MPI+OpenACC compilation
• module load tau
• Use mpicxx in the Makefile not tau_cxx.sh (for now)
• make openacc
1111 Open slide master to edit
MiniWeather MPI compilation and execute
• Compile with: make openacc
• Execution:
export TAU_METRICS=TIME
export TAU_PROFILE=1
export TAU_TRACK_MESSAGE=1
export TAU_COMM_MATRIX=1
jsrun -n 6 -r 6 --smpiargs="-gpu" -g 1 tau_exec -T mpi,pgi,pdt -openacc
./miniWeather_mpi_openacc
• CUPTI metrics for OpenACC to be available up to SC19
1212 Open slide master to edit
Paraprof – OpenACC
1313 Open slide master to edit
Paraprof – OpenACC
1414 Open slide master to edit
Paraprof – OpenACC
From the main window right click one label and select “Show User Event
Statistics Window”
1515 Open slide master to edit
CUPTI Metrics
• https://docs.nvidia.com/cupti/Cupti/r_main.html#metrics-reference
1616 Open slide master to edit
LSMS MPI+OpenMP+CUDA execution
module load gcc
export TAU_METRICS=TIME,achieved_occupancy
jsrun --smpiargs="-gpu" --nrs 2 --tasks_per_rs 1 --gpu_per_rs 1 --
rs_per_host 2 --cpu_per_rs 8 --bind rs tau_exec -T
mpi,pdt,papi,cupti,openmp -ompt -cupti $EXECUTABLE $INPUT
• Error: Only counters for a single GPU device model can be collected
at the same time.
• Achieved_occupancy= CUDA.Tesla_V100-SXM2-
16GB.domain_d.active_warps/CUDA.Tesla_V100-SXM2-
16GB.domain_d.active_cycles
1717 Open slide master to edit
LSMS MPI+OpenMP+CUDA execution
• module load gcc
• jsrun --smpiargs="-gpu" --nrs 2 --tasks_per_rs 1 --gpu_per_rs 1 --
rs_per_host 2 --cpu_per_rs 8 --bind rs tau_exec -T
mpi,pdt,papi,cupti,openmp -ompt -cupti $EXECUTABLE $INPUT
1818 Open slide master to edit
LSMS - Paraprof – MPI+OpenMP+CUDA
Options -> Uncheck Stack Bars Together
1919 Open slide master to edit
LSMS - Paraprof – MPI+OpenMP+CUDA
• Click on any color
2020 Open slide master to edit
LSMS - Paraprof – MPI+OpenMP+CUDA
• Statistics for thread on CPU
2121 Open slide master to edit
LSMS - Paraprof – MPI+OpenMP+CUDA
• Statistics for thread on GPU
2222 Open slide master to edit
LSMS - Paraprof – MPI+OpenMP+CUDA
• User event window
2323 Open slide master to edit
Benchmark for demonstration
export TAU_METRICS=TIME,achieved_occupancy
jsrun -n 2 -r 2 -g 1 tau_exec -T mpi,pdt,papi,cupti,openmp -ompt -cupti
./add
• Output folders:
MULTI__TAUGPU_TIME
MULTI__CUDA.Tesla_V100-SXM2-16GB.domain_d.active_warps
MULTI__CUDA.Tesla_V100-SXM2-16GB.domain_d.active_cycles
MULTI__achieved_occupancy
2424 Open slide master to edit
Bechmark - Paraprof - MPI+OpenMP+CUDA
2525 Open slide master to edit
LSMS - Paraprof – MPI+OpenMP+CUDA
• Select the metric achieved occupancy
2626 Open slide master to edit
LSMS - Paraprof – MPI+OpenMP+CUDA
• Click on the colored bar
• The achieved occupancy for this simple benchmark is 6.2%
2727 Open slide master to edit
TAU and GPU
• Similar approach for other metrics, not all of them can be used.
• TAU provides a tool called tau_cupti_avail where we can see the
list of available metrics, then we have to figured out which CUPTI
metrics use these ones.
2828 Open slide master to edit
TAU and tracing
• export TAU_TRACE=1
• export TAU_TRACE_FORMAT=otf2
• Currently, supported MPI and OpenSHMEM applications
• Use Vampir for visualization
2929 Open slide master to edit
Overhead
0
10
20
30
40
50
60
Inpercentage
TAU Overhead
Profiling (no extra information) Profiling (with PAPI and callpath) Tracing (no option) Tracing (with PAPI and callpath)
Should use PDT to exclude files/routines that cause overhead
3030 Open slide master to edit
TAU mechanisms
• TAU_THROTTLE
– TAU by default excludes from the instrumentation routines that could cause
overhead
– Rule: If a routine is called more than 100,000 times and it spends up to 10
usecs/call, then exclude it.
– Adjustable: TAU_THROTTLE_NUMCALLS, TAU_THROTTLE_PERCALL
3131 Open slide master to edit
Selective Instrumentation
• Do not instrument routine sort*(int *)
File select.tau:
BEGIN_EXCLUDE_LIST
void sort_#(int *)
END_EXCLUDE_LIST
TAU_OPTIONS=“-optTauSelectFile=select.tau”
• Dynamic phase
BEGIN_INSTRUMENT_SECTION
dynamic phase name=“phase1” file=“miniWeather_mpi.cpp” line=300 to line=327
END_INSTRUMENT_SECTION
3232 Open slide master to edit
Static Phase
File phases.tau:
BEGIN_INSTRUMENT_SECTION
static phase name="phase1" file="miniWeather_mpi.cpp" line=300 to
line=327
static phase name="phase2" file="miniWeather_mpi.cpp" line=333 to
line=346
END_INSTRUMENT_SECTION
TAU_OPTIONS=“-optTauSelectFile=phases.tau”
3333 Open slide master to edit
Creating static Phases
3434 Open slide master to edit
Creating static Phases
3535 Open slide master to edit
Creating static Phases
3636 Open slide master to edit
PerfExplorer
• PerfExplorer is a framework for parallel performance data mining and
knowledge discovery
• Unfortunately not working efficient on Summit for now
• Perfexplorer should be installed, if it is not configured, it will propose
the next steps
• In this example we execute MiniWeather with MPI for 8,16,32,64
processes, we load them to paraprof, select the name of the
experiment, right click and select “Upload Trial to DB”
3737 Open slide master to edit
From Paraprof to the DB
Right click on the highlighted name of the experiment with the path and
select “Upload Trial to DB”, repeat for all the experiments
3838 Open slide master to edit
PerfExplorer – Total Execution
Select the name of the experiment (arrow) and then select Charts -> Total
Execution Time -> Select the metric TIME-> Click OK
All the menus
start from the
main window of
the PerfExplorer
3939 Open slide master to edit
PerfExplorer – Total Execution
Click Charts -> Stacked Bar Chart -> Select the metric TIME -> Click OK
Although the duration
of computation
decreases as we
increase the number of
the processes, some
MPI calls remain similar
duration and the
MPI_File_write_at is
increasing.
4040 Open slide master to edit
PerfExplorer – Tip
If the data include more than one
metric, and the user wants to visualize
another metric, then click in any of the
experiments in the green arrow and
then back to the name of the
experiments (grey arrow). Then in the
next visualization, PerfExplorer will
require to choose which metric to
visualize.
4141 Open slide master to edit
PerfExplorer – PAPI_TOT_INS
Click Charts -> Stacked Bar Chart -> Select the metric TIME -> Click OK
MPI calls are
increasing
instructions as we
increase the
number of MPI
processes.
4242 Open slide master to edit
PerfExplorer – Relative efficiency
Click Charts -> Relative Efficiency -> “The problem size remains constant” -
> Click OK
4343 Open slide master to edit
PerfExplorer – Relative Efficiency by event
Click Charts -> Relative Efficiency by Event -> Metric TIME -> “The problem
size remains constant” -> Click OK
4444 Open slide master to edit
PerfExplorer – Relative Speedup
Click Charts -> Relative Speedup -> Metric TIME -> Click OK -> “The
problem size remains constant” -> Click OK
4545 Open slide master to edit
PerfExplorer – Relative Speedup
Click Charts -> Relative Speedup by Event -> Metric TIME -> Click OK ->
“The problem size remains constant” -> Click OK
4646 Open slide master to edit
PerfExplorer – Relative Speedup
Click Charts -> Correlate Events with Total Runtime -> Metric TIME -> Click
OK -> “The problem size remains constant” -> Click OK
4747 Open slide master to edit
PerfExplorer – Relative Speedup
Click Charts -> Runtime Breakdown -> Metric TIME -> Click OK -> “The
problem size remains constant” -> Click OK
4848 Open slide master to edit
PerfExplorer – Relative Speedup
Click Charts -> Runtime Breakdown -> Metric PAPI_TOT_INS -> Click OK ->
“The problem size remains constant” -> Click OK
4949 Open slide master to edit
Conclusions
• TAU is a promising tool, with some good features to be improved
related to GPU performance analysis
• Interesting functionalities to identify loop issues and create dynamic
phases of an iterative analysis
• We would like OTF2 traces with CUDA/OpenACC
• OpenACC with support of CUPTI metrics is coming around to SC19
• It supports Python instrumentation but not activated during
compilation

Weitere ähnliche Inhalte

Ähnlich wie Performance Analysis with TAU on Summit Supercomputer, part II

Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...
Odoo
 
CHAPTER READING TASK OPERATING SYSTEM
CHAPTER READING TASK OPERATING SYSTEMCHAPTER READING TASK OPERATING SYSTEM
CHAPTER READING TASK OPERATING SYSTEM
Nur Atiqah Mohd Rosli
 
Sheet1Address60741024Offsetaddress0102450506074120484026230723002p.docx
Sheet1Address60741024Offsetaddress0102450506074120484026230723002p.docxSheet1Address60741024Offsetaddress0102450506074120484026230723002p.docx
Sheet1Address60741024Offsetaddress0102450506074120484026230723002p.docx
bagotjesusa
 

Ähnlich wie Performance Analysis with TAU on Summit Supercomputer, part II (20)

TAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platformTAU E4S ON OpenPOWER /POWER9 platform
TAU E4S ON OpenPOWER /POWER9 platform
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part I
 
HiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOSHiPEAC 2019 Tutorial - Maestro RTOS
HiPEAC 2019 Tutorial - Maestro RTOS
 
Reproducible AI Using PyTorch and MLflow
Reproducible AI Using PyTorch and MLflowReproducible AI Using PyTorch and MLflow
Reproducible AI Using PyTorch and MLflow
 
Exploring OpenFaaS autoscalability on Kubernetes with the Chaos Toolkit
Exploring OpenFaaS autoscalability on Kubernetes with the Chaos ToolkitExploring OpenFaaS autoscalability on Kubernetes with the Chaos Toolkit
Exploring OpenFaaS autoscalability on Kubernetes with the Chaos Toolkit
 
Performance_Programming
Performance_ProgrammingPerformance_Programming
Performance_Programming
 
Bgoug 2019.11 test your pl sql - not your patience
Bgoug 2019.11   test your pl sql - not your patienceBgoug 2019.11   test your pl sql - not your patience
Bgoug 2019.11 test your pl sql - not your patience
 
BITS 1213 - OPERATING SYSTEM (PROCESS,THREAD,SYMMETRIC MULTIPROCESSOR,MICROKE...
BITS 1213 - OPERATING SYSTEM (PROCESS,THREAD,SYMMETRIC MULTIPROCESSOR,MICROKE...BITS 1213 - OPERATING SYSTEM (PROCESS,THREAD,SYMMETRIC MULTIPROCESSOR,MICROKE...
BITS 1213 - OPERATING SYSTEM (PROCESS,THREAD,SYMMETRIC MULTIPROCESSOR,MICROKE...
 
POUG2019 - Test your PL/SQL - your database will love you
POUG2019 - Test your PL/SQL - your database will love youPOUG2019 - Test your PL/SQL - your database will love you
POUG2019 - Test your PL/SQL - your database will love you
 
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
PuppetConf 2016: An Introduction to Measuring and Tuning PE Performance – Cha...
 
Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...Tips on how to improve the performance of your custom modules for high volume...
Tips on how to improve the performance of your custom modules for high volume...
 
Os task
Os taskOs task
Os task
 
CHAPTER READING TASK OPERATING SYSTEM
CHAPTER READING TASK OPERATING SYSTEMCHAPTER READING TASK OPERATING SYSTEM
CHAPTER READING TASK OPERATING SYSTEM
 
Postgres Vision 2018: Making Postgres Even Faster
Postgres Vision 2018: Making Postgres Even FasterPostgres Vision 2018: Making Postgres Even Faster
Postgres Vision 2018: Making Postgres Even Faster
 
pm1
pm1pm1
pm1
 
Sheet1Address60741024Offsetaddress0102450506074120484026230723002p.docx
Sheet1Address60741024Offsetaddress0102450506074120484026230723002p.docxSheet1Address60741024Offsetaddress0102450506074120484026230723002p.docx
Sheet1Address60741024Offsetaddress0102450506074120484026230723002p.docx
 
Horovod ubers distributed deep learning framework by Alex Sergeev from Uber
Horovod ubers distributed deep learning framework  by Alex Sergeev from UberHorovod ubers distributed deep learning framework  by Alex Sergeev from Uber
Horovod ubers distributed deep learning framework by Alex Sergeev from Uber
 
Analyze database system using a 3 d method
Analyze database system using a 3 d methodAnalyze database system using a 3 d method
Analyze database system using a 3 d method
 
The hitchhiker’s guide to Prometheus
The hitchhiker’s guide to PrometheusThe hitchhiker’s guide to Prometheus
The hitchhiker’s guide to Prometheus
 
The hitchhiker’s guide to Prometheus
The hitchhiker’s guide to PrometheusThe hitchhiker’s guide to Prometheus
The hitchhiker’s guide to Prometheus
 

Mehr von George Markomanolis

Mehr von George Markomanolis (14)

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI Supercomputer
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
Performance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part IIPerformance Analysis with Scalasca, part II
Performance Analysis with Scalasca, part II
 
Introducing IO-500 benchmark
Introducing IO-500 benchmarkIntroducing IO-500 benchmark
Introducing IO-500 benchmark
 
Experience using the IO-500
Experience using the IO-500Experience using the IO-500
Experience using the IO-500
 
Harshad - Handle Darshan Data
Harshad - Handle Darshan DataHarshad - Handle Darshan Data
Harshad - Handle Darshan Data
 
Lustre Best Practices
Lustre Best Practices Lustre Best Practices
Lustre Best Practices
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
 
markomanolis_phd_defense
markomanolis_phd_defensemarkomanolis_phd_defense
markomanolis_phd_defense
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 

Kürzlich hochgeladen

Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Kürzlich hochgeladen (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Performance Analysis with TAU on Summit Supercomputer, part II

  • 1. ORNL is managed by UT-Battelle, LLC for the US Department of Energy Performance Analysis with TAU, part II George S. Markomanolis 8 August 2019
  • 3. 33 Open slide master to edit MiniWeather MPI+OpenMP compilation • module load pgi • module load tau • export TAU_MAKEFILE=/sw/summit/…/ibm64linux/lib/Makefile.tau- pgi-papi-mpi-pdt-openmp-opari-pgi • Export TAU_OPTIONS=‘-optLinking=-lpnetcdf –optPreProcess’ • Replace mpicxx with tau_cxx.sh in the Makefile • make openmp
  • 4. 44 Open slide master to edit MiniWeather MPI compilation and execute • Compile with: make openmp • Execution: export TAU_METRICS=TIME:PAPI_TOT_INS:PAPI_TOT_CYC:PAPI_FP_OPS export TAU_PROFILE=1 export TAU_TRACK_MESSAGE=1 export TAU_COMM_MATRIX=1 #TAU_CALLPATH=1 #TAU_CALLPATH_DEPTH=10 jsrun -n 64 -r 8 -a 1 -c 4 -b packed:4 ./miniWeather_mpi_openmp
  • 5. 55 Open slide master to edit Paraprof - OpenMP
  • 6. 66 Open slide master to edit Paraprof – OpenMP - TIME
  • 7. 77 Open slide master to edit Paraprof – OpenMP - parallel loop
  • 8. 88 Open slide master to edit Paraprof – OpenMP - IPC
  • 9. GPU
  • 10. 1010 Open slide master to edit MiniWeather MPI+OpenACC compilation • module load tau • Use mpicxx in the Makefile not tau_cxx.sh (for now) • make openacc
  • 11. 1111 Open slide master to edit MiniWeather MPI compilation and execute • Compile with: make openacc • Execution: export TAU_METRICS=TIME export TAU_PROFILE=1 export TAU_TRACK_MESSAGE=1 export TAU_COMM_MATRIX=1 jsrun -n 6 -r 6 --smpiargs="-gpu" -g 1 tau_exec -T mpi,pgi,pdt -openacc ./miniWeather_mpi_openacc • CUPTI metrics for OpenACC to be available up to SC19
  • 12. 1212 Open slide master to edit Paraprof – OpenACC
  • 13. 1313 Open slide master to edit Paraprof – OpenACC
  • 14. 1414 Open slide master to edit Paraprof – OpenACC From the main window right click one label and select “Show User Event Statistics Window”
  • 15. 1515 Open slide master to edit CUPTI Metrics • https://docs.nvidia.com/cupti/Cupti/r_main.html#metrics-reference
  • 16. 1616 Open slide master to edit LSMS MPI+OpenMP+CUDA execution module load gcc export TAU_METRICS=TIME,achieved_occupancy jsrun --smpiargs="-gpu" --nrs 2 --tasks_per_rs 1 --gpu_per_rs 1 -- rs_per_host 2 --cpu_per_rs 8 --bind rs tau_exec -T mpi,pdt,papi,cupti,openmp -ompt -cupti $EXECUTABLE $INPUT • Error: Only counters for a single GPU device model can be collected at the same time. • Achieved_occupancy= CUDA.Tesla_V100-SXM2- 16GB.domain_d.active_warps/CUDA.Tesla_V100-SXM2- 16GB.domain_d.active_cycles
  • 17. 1717 Open slide master to edit LSMS MPI+OpenMP+CUDA execution • module load gcc • jsrun --smpiargs="-gpu" --nrs 2 --tasks_per_rs 1 --gpu_per_rs 1 -- rs_per_host 2 --cpu_per_rs 8 --bind rs tau_exec -T mpi,pdt,papi,cupti,openmp -ompt -cupti $EXECUTABLE $INPUT
  • 18. 1818 Open slide master to edit LSMS - Paraprof – MPI+OpenMP+CUDA Options -> Uncheck Stack Bars Together
  • 19. 1919 Open slide master to edit LSMS - Paraprof – MPI+OpenMP+CUDA • Click on any color
  • 20. 2020 Open slide master to edit LSMS - Paraprof – MPI+OpenMP+CUDA • Statistics for thread on CPU
  • 21. 2121 Open slide master to edit LSMS - Paraprof – MPI+OpenMP+CUDA • Statistics for thread on GPU
  • 22. 2222 Open slide master to edit LSMS - Paraprof – MPI+OpenMP+CUDA • User event window
  • 23. 2323 Open slide master to edit Benchmark for demonstration export TAU_METRICS=TIME,achieved_occupancy jsrun -n 2 -r 2 -g 1 tau_exec -T mpi,pdt,papi,cupti,openmp -ompt -cupti ./add • Output folders: MULTI__TAUGPU_TIME MULTI__CUDA.Tesla_V100-SXM2-16GB.domain_d.active_warps MULTI__CUDA.Tesla_V100-SXM2-16GB.domain_d.active_cycles MULTI__achieved_occupancy
  • 24. 2424 Open slide master to edit Bechmark - Paraprof - MPI+OpenMP+CUDA
  • 25. 2525 Open slide master to edit LSMS - Paraprof – MPI+OpenMP+CUDA • Select the metric achieved occupancy
  • 26. 2626 Open slide master to edit LSMS - Paraprof – MPI+OpenMP+CUDA • Click on the colored bar • The achieved occupancy for this simple benchmark is 6.2%
  • 27. 2727 Open slide master to edit TAU and GPU • Similar approach for other metrics, not all of them can be used. • TAU provides a tool called tau_cupti_avail where we can see the list of available metrics, then we have to figured out which CUPTI metrics use these ones.
  • 28. 2828 Open slide master to edit TAU and tracing • export TAU_TRACE=1 • export TAU_TRACE_FORMAT=otf2 • Currently, supported MPI and OpenSHMEM applications • Use Vampir for visualization
  • 29. 2929 Open slide master to edit Overhead 0 10 20 30 40 50 60 Inpercentage TAU Overhead Profiling (no extra information) Profiling (with PAPI and callpath) Tracing (no option) Tracing (with PAPI and callpath) Should use PDT to exclude files/routines that cause overhead
  • 30. 3030 Open slide master to edit TAU mechanisms • TAU_THROTTLE – TAU by default excludes from the instrumentation routines that could cause overhead – Rule: If a routine is called more than 100,000 times and it spends up to 10 usecs/call, then exclude it. – Adjustable: TAU_THROTTLE_NUMCALLS, TAU_THROTTLE_PERCALL
  • 31. 3131 Open slide master to edit Selective Instrumentation • Do not instrument routine sort*(int *) File select.tau: BEGIN_EXCLUDE_LIST void sort_#(int *) END_EXCLUDE_LIST TAU_OPTIONS=“-optTauSelectFile=select.tau” • Dynamic phase BEGIN_INSTRUMENT_SECTION dynamic phase name=“phase1” file=“miniWeather_mpi.cpp” line=300 to line=327 END_INSTRUMENT_SECTION
  • 32. 3232 Open slide master to edit Static Phase File phases.tau: BEGIN_INSTRUMENT_SECTION static phase name="phase1" file="miniWeather_mpi.cpp" line=300 to line=327 static phase name="phase2" file="miniWeather_mpi.cpp" line=333 to line=346 END_INSTRUMENT_SECTION TAU_OPTIONS=“-optTauSelectFile=phases.tau”
  • 33. 3333 Open slide master to edit Creating static Phases
  • 34. 3434 Open slide master to edit Creating static Phases
  • 35. 3535 Open slide master to edit Creating static Phases
  • 36. 3636 Open slide master to edit PerfExplorer • PerfExplorer is a framework for parallel performance data mining and knowledge discovery • Unfortunately not working efficient on Summit for now • Perfexplorer should be installed, if it is not configured, it will propose the next steps • In this example we execute MiniWeather with MPI for 8,16,32,64 processes, we load them to paraprof, select the name of the experiment, right click and select “Upload Trial to DB”
  • 37. 3737 Open slide master to edit From Paraprof to the DB Right click on the highlighted name of the experiment with the path and select “Upload Trial to DB”, repeat for all the experiments
  • 38. 3838 Open slide master to edit PerfExplorer – Total Execution Select the name of the experiment (arrow) and then select Charts -> Total Execution Time -> Select the metric TIME-> Click OK All the menus start from the main window of the PerfExplorer
  • 39. 3939 Open slide master to edit PerfExplorer – Total Execution Click Charts -> Stacked Bar Chart -> Select the metric TIME -> Click OK Although the duration of computation decreases as we increase the number of the processes, some MPI calls remain similar duration and the MPI_File_write_at is increasing.
  • 40. 4040 Open slide master to edit PerfExplorer – Tip If the data include more than one metric, and the user wants to visualize another metric, then click in any of the experiments in the green arrow and then back to the name of the experiments (grey arrow). Then in the next visualization, PerfExplorer will require to choose which metric to visualize.
  • 41. 4141 Open slide master to edit PerfExplorer – PAPI_TOT_INS Click Charts -> Stacked Bar Chart -> Select the metric TIME -> Click OK MPI calls are increasing instructions as we increase the number of MPI processes.
  • 42. 4242 Open slide master to edit PerfExplorer – Relative efficiency Click Charts -> Relative Efficiency -> “The problem size remains constant” - > Click OK
  • 43. 4343 Open slide master to edit PerfExplorer – Relative Efficiency by event Click Charts -> Relative Efficiency by Event -> Metric TIME -> “The problem size remains constant” -> Click OK
  • 44. 4444 Open slide master to edit PerfExplorer – Relative Speedup Click Charts -> Relative Speedup -> Metric TIME -> Click OK -> “The problem size remains constant” -> Click OK
  • 45. 4545 Open slide master to edit PerfExplorer – Relative Speedup Click Charts -> Relative Speedup by Event -> Metric TIME -> Click OK -> “The problem size remains constant” -> Click OK
  • 46. 4646 Open slide master to edit PerfExplorer – Relative Speedup Click Charts -> Correlate Events with Total Runtime -> Metric TIME -> Click OK -> “The problem size remains constant” -> Click OK
  • 47. 4747 Open slide master to edit PerfExplorer – Relative Speedup Click Charts -> Runtime Breakdown -> Metric TIME -> Click OK -> “The problem size remains constant” -> Click OK
  • 48. 4848 Open slide master to edit PerfExplorer – Relative Speedup Click Charts -> Runtime Breakdown -> Metric PAPI_TOT_INS -> Click OK -> “The problem size remains constant” -> Click OK
  • 49. 4949 Open slide master to edit Conclusions • TAU is a promising tool, with some good features to be improved related to GPU performance analysis • Interesting functionalities to identify loop issues and create dynamic phases of an iterative analysis • We would like OTF2 traces with CUDA/OpenACC • OpenACC with support of CUPTI metrics is coming around to SC19 • It supports Python instrumentation but not activated during compilation