SlideShare a Scribd company logo
1 of 39
Download to read offline
ORNL is managed by UT-Battelle, LLC for the US Department of Energy
Performance Analysis with Scalasca part II
George S. Markomanolis
8 August 2019
22 Open slide master to edit
MiniWeather MPI+OpenACC
• Compilations
• MPI+OpenACC: make PREP="scorep --mpp=mpi --openacc –
pdt" openacc
• Cube4 does not show OpenACC information
33 Open slide master to edit
LSMS compilation - Issues
• scorep --cuda nvcc…
/gpfs/alpine/gen110/scratch/gmarkoma/scorep/LSMS3/Source
/LSMS_3/src/Accelerator/buildKKRMatrix_kernels.cu(242): error:
identifier "omp_get_thread_num" is undefined
• Adding the option --keep-files, does not delete the temporary
files, so we can observe what is wrong
44 Open slide master to edit
LSMS compilation – Issues (cont.)
Original .cu file Temporary opari.cu file during
instrumentation
#ifdef _OPENMP
#include <omp.h>
#else
inline int omp_get_max_threads()
{return 1;}
inline int omp_get_num_threads()
{return 1;}
inline int omp_get_thread_num()
{return 0;}
#endif
#ifdef _OPENMP
#else
inline int omp_get_max_threads()
{return 1;}
inline int omp_get_num_threads()
{return 1;}
inline int omp_get_thread_num()
{return 0;}
#endif
Adding manually the #include <omp.h> to the temporary file, it solves the issue
55 Open slide master to edit
LSMS – Profiling one compute node
66 Open slide master to edit
LSMS on 128 compute nodes – Compute variation
77 Open slide master to edit
LSMS on 128 compute nodes – MPI_Allreduce variation
88 Open slide master to edit
LSMS on 128 compute nodes – OpenMP Barrier
99 Open slide master to edit
LSMS on 128 compute nodes – OpenMP Barrier
1010 Open slide master to edit
LSMS on 128 compute nodes – OpenMP Barrier - Sunburst
1111 Open slide master to edit
LSMS – Buffer for tracing
• scalasca -examine -s /gpfs/alpine/…/scorep_lsms_Ox8_sum/
INFO: Score report written to /gpfs/…/scorep_lsms_Ox8_sum/scorep.score
vim /gpfs/…/scorep_lsms_Ox8_sum/scorep.score
Estimated aggregate size of event trace: 24TB
Estimated requirements for largest trace buffer (max_buf): 36GB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 36GB
(warning: The memory requirements cannot be satisfied by Score-P to avoid intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the maximum
supported memory or reduce requirements using USR regions filters.)
flt type max_buf[B] visits time[s] time[%] time/visit[us] region
ALL 37,758,243,091 2,906,934,475 3738.13 100.0 1.29 ALL
USR 37,757,438,730 2,906,899,463 3714.45 99.4 1.28 USR
OMP 757,272 32,688 18.59 0.5 568.66 OMP
MPI 29,168 932 2.58 0.1 2772.73 MPI
COM 17,880 1,390 2.50 0.1 1798.55 COM
SCOREP 41 2 0.00 0.0 242.20 SCOREP
USR 22,486,328,384 1,729,717,541 260.77 7.0 0.15 dfv_m
USR 9,921,071,888 763,159,376 126.99 3.4 0.17 fit
USR 1,064,960,000 81,920,000 343.03 9.2 4.19 void bulirsch_stoer_integrator
USR 846,102,582 65,083,854 19.65 0.5 0.30 void plm_normalized_(int*, double*, double*)
1212 Open slide master to edit
Selective instrumentation
• File exclude.filt, we select the USR calls with high volume of max_buf and the ones with many calls:
SCOREP_REGION_NAMES_BEGIN EXCLUDE
dfv_m
fit
*bulirsch_stoer_integrator_*
*plm_normalized_*
...
SCOREP_REGION_NAMES_END
• Apply on the profiling data:
scalasca -examine -s -f exclude.filt scorep_lsms_Ox7_sum/
1313 Open slide master to edit
Selective instrumentation (cont.)
• vim scorep_lsms_Ox7_sum/scorep.score_exclude.filt
Estimated aggregate size of event trace: 3100MB
Estimated requirements for largest trace buffer (max_buf): 6MB
Estimated memory requirements (SCOREP_TOTAL_MEMORY): 20MB
(hint: When tracing set SCOREP_TOTAL_MEMORY=20MB to avoid intermediate flushes or reduce requirements using USR regions filters.)
flt type max_buf[B] visits time[s] time[%] time/visit[us] region
- ALL 37,768,557,978 997,234,022,247 1508955.54 100.0 1.51 ALL
- USR 37,767,383,966 997,216,546,439 1237070.04 82.0 1.24 USR
- MPI 1,160,452 5,948,480 139997.90 9.3 23535.07 MPI
- OMP 662,613 10,983,168 129188.73 8.6 11762.43 OMP
- COM 302,424 543,392 2698.66 0.2 4966.33 COM
- SCOREP 41 768 0.21 0.0 270.75 SCOREP
* ALL 5,851,181 108,920,629 1061962.47 70.4 9749.87 ALL-FLT
+ FLT 37,763,922,579 997,125,101,618 446993.07 29.6 0.45 FLT
* USR 3,725,651 91,444,821 790076.96 52.4 8639.93 USR-FLT
- MPI 1,160,452 5,948,480 139997.90 9.3 23535.07 MPI-FLT
- OMP 662,613 10,983,168 129188.73 8.6 11762.43 OMP-FLT
* COM 302,424 543,392 2698.66 0.2 4966.33 COM-FLT
- SCOREP 41 768 0.21 0.0 270.75 SCOREP-FLT
Execute: scalasca -analyze -q -t -f /full_path/exclude.filt jsrun …
1414 Open slide master to edit
LSMS on 128 compute nodes – Late Receivers (occ)
1515 Open slide master to edit
LSMS on 128 compute nodes – Late Sender (sec.)
1616 Open slide master to edit
LSMS on 128 compute nodes – Short-term Idleness delay
1717 Open slide master to edit
LSMS on 128 compute nodes – Short-term Idleness delay
1818 Open slide master to edit
Vampir with OTF2 trace
OTF2 trace is created inside the scorep folder after the scalasca –analyze,
example of 640 MPI processes
1919 Open slide master to edit
Vampir with OTF2 trace
2020 Open slide master to edit
Vampir with OTF2 trace
2121 Open slide master to edit
Vampir with OTF2 trace
2222 Open slide master to edit
Vampir with OTF2 trace
2323 Open slide master to edit
Vampir with OTF2 trace
2424 Open slide master to edit
Vampir with OTF2 trace
2525 Open slide master to edit
Vampir with OTF2 trace
2626 Open slide master to edit
Vampir with OTF2 trace
2727 Open slide master to edit
Vampir with OTF2 trace
2828 Open slide master to edit
Vampir with OTF2 trace
2929 Open slide master to edit
Vampir with OTF2 trace
3030 Open slide master to edit
Vampir with OTF2 trace
3131 Open slide master to edit
Vampir with OTF2 trace
3232 Open slide master to edit
Vampir with OTF2 trace
3333 Open slide master to edit
Vampir with OTF2 trace
3434 Open slide master to edit
Vampir with OTF2 trace
3535 Open slide master to edit
Vampir with OTF2 trace
3636 Open slide master to edit
Vampir with OTF2 trace II
Zoom in with the mouse
3737 Open slide master to edit
Vampir with OTF2 trace III
Zoom in with the mouse
3838 Open slide master to edit
Vampir with OTF2 trace IV
Add functionalities from the menu
3939 Open slide master to edit
Conclusions
• Scalasca can identify bottlenecks based on some patterns
• Useful for MPI and OpenMP
• It can not provide useful information about GPUs but the
information exists inside the traces
• It can create OTF2 trace files and use Vampir to visualize

More Related Content

Similar to ORNL is managed by UT-Battelle, LLC for the US Department of Energy

Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFBrendan Gregg
 
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Anne Nicolas
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFBrendan Gregg
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFBrendan Gregg
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PGeorge Markomanolis
 
Debugging Ruby Systems
Debugging Ruby SystemsDebugging Ruby Systems
Debugging Ruby SystemsEngine Yard
 
Debugging Ruby
Debugging RubyDebugging Ruby
Debugging RubyAman Gupta
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsBrendan Gregg
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and moreBrendan Gregg
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceBrendan Gregg
 
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFUSENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFBrendan Gregg
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsSysdig
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018Brendan Gregg
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance AnalysisBrendan Gregg
 
44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root
44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root
44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root44CON
 
Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Ontico
 

Similar to ORNL is managed by UT-Battelle, LLC for the US Department of Energy (20)

Kernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPFKernel Recipes 2017: Performance Analysis with BPF
Kernel Recipes 2017: Performance Analysis with BPF
 
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
Kernel Recipes 2017 - Performance analysis Superpowers with Linux BPF - Brend...
 
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPFOSSNA 2017 Performance Analysis Superpowers with Linux BPF
OSSNA 2017 Performance Analysis Superpowers with Linux BPF
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
 
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-PAnalyzing ECP Proxy Apps with the Profiling Tool Score-P
Analyzing ECP Proxy Apps with the Profiling Tool Score-P
 
Debugging Ruby Systems
Debugging Ruby SystemsDebugging Ruby Systems
Debugging Ruby Systems
 
Debugging Ruby
Debugging RubyDebugging Ruby
Debugging Ruby
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and more
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflix
 
LISA2019 Linux Systems Performance
LISA2019 Linux Systems PerformanceLISA2019 Linux Systems Performance
LISA2019 Linux Systems Performance
 
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPFUSENIX ATC 2017 Performance Superpowers with Enhanced BPF
USENIX ATC 2017 Performance Superpowers with Enhanced BPF
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
ATO Linux Performance 2018
ATO Linux Performance 2018ATO Linux Performance 2018
ATO Linux Performance 2018
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems Performance
 
200.1,2-Capacity Planning
200.1,2-Capacity Planning200.1,2-Capacity Planning
200.1,2-Capacity Planning
 
LISA17 Container Performance Analysis
LISA17 Container Performance AnalysisLISA17 Container Performance Analysis
LISA17 Container Performance Analysis
 
44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root
44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root
44CON London 2015 - Jtagsploitation: 5 wires, 5 ways to root
 
Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)Performance tweaks and tools for Linux (Joe Damato)
Performance tweaks and tools for Linux (Joe Damato)
 
SOFA Tutorial
SOFA TutorialSOFA Tutorial
SOFA Tutorial
 

More from George Markomanolis

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerGeorge Markomanolis
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapGeorge Markomanolis
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer George Markomanolis
 
Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IGeorge Markomanolis
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IGeorge Markomanolis
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIGeorge Markomanolis
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerGeorge Markomanolis
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaGeorge Markomanolis
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...George Markomanolis
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIGeorge Markomanolis
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIGeorge Markomanolis
 

More from George Markomanolis (17)

Evaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI SupercomputerEvaluating GPU programming Models for the LUMI Supercomputer
Evaluating GPU programming Models for the LUMI Supercomputer
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
 
Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer Exploring the Programming Models for the LUMI Supercomputer
Exploring the Programming Models for the LUMI Supercomputer
 
Getting started with AMD GPUs
Getting started with AMD GPUsGetting started with AMD GPUs
Getting started with AMD GPUs
 
Introduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part IIntroduction to Extrae/Paraver, part I
Introduction to Extrae/Paraver, part I
 
Performance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part IPerformance Analysis with Scalasca on Summit Supercomputer part I
Performance Analysis with Scalasca on Summit Supercomputer part I
 
Performance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part IIPerformance Analysis with TAU on Summit Supercomputer, part II
Performance Analysis with TAU on Summit Supercomputer, part II
 
How to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit SupercomputerHow to use TAU for Performance Analysis on Summit Supercomputer
How to use TAU for Performance Analysis on Summit Supercomputer
 
Introducing IO-500 benchmark
Introducing IO-500 benchmarkIntroducing IO-500 benchmark
Introducing IO-500 benchmark
 
Experience using the IO-500
Experience using the IO-500Experience using the IO-500
Experience using the IO-500
 
Harshad - Handle Darshan Data
Harshad - Handle Darshan DataHarshad - Handle Darshan Data
Harshad - Handle Darshan Data
 
Lustre Best Practices
Lustre Best Practices Lustre Best Practices
Lustre Best Practices
 
Burst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to OmegaBurst Buffer: From Alpha to Omega
Burst Buffer: From Alpha to Omega
 
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
Optimizing an Earth Science Atmospheric Application with the OmpSs Programmin...
 
markomanolis_phd_defense
markomanolis_phd_defensemarkomanolis_phd_defense
markomanolis_phd_defense
 
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen IIPorting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
Porting an MPI application to hybrid MPI+OpenMP with Reveal tool on Shaheen II
 
Introduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen IIIntroduction to Performance Analysis tools on Shaheen II
Introduction to Performance Analysis tools on Shaheen II
 

Recently uploaded

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditSkynet Technologies
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 

Recently uploaded (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Manual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance AuditManual 508 Accessibility Compliance Audit
Manual 508 Accessibility Compliance Audit
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 

ORNL is managed by UT-Battelle, LLC for the US Department of Energy

  • 1. ORNL is managed by UT-Battelle, LLC for the US Department of Energy Performance Analysis with Scalasca part II George S. Markomanolis 8 August 2019
  • 2. 22 Open slide master to edit MiniWeather MPI+OpenACC • Compilations • MPI+OpenACC: make PREP="scorep --mpp=mpi --openacc – pdt" openacc • Cube4 does not show OpenACC information
  • 3. 33 Open slide master to edit LSMS compilation - Issues • scorep --cuda nvcc… /gpfs/alpine/gen110/scratch/gmarkoma/scorep/LSMS3/Source /LSMS_3/src/Accelerator/buildKKRMatrix_kernels.cu(242): error: identifier "omp_get_thread_num" is undefined • Adding the option --keep-files, does not delete the temporary files, so we can observe what is wrong
  • 4. 44 Open slide master to edit LSMS compilation – Issues (cont.) Original .cu file Temporary opari.cu file during instrumentation #ifdef _OPENMP #include <omp.h> #else inline int omp_get_max_threads() {return 1;} inline int omp_get_num_threads() {return 1;} inline int omp_get_thread_num() {return 0;} #endif #ifdef _OPENMP #else inline int omp_get_max_threads() {return 1;} inline int omp_get_num_threads() {return 1;} inline int omp_get_thread_num() {return 0;} #endif Adding manually the #include <omp.h> to the temporary file, it solves the issue
  • 5. 55 Open slide master to edit LSMS – Profiling one compute node
  • 6. 66 Open slide master to edit LSMS on 128 compute nodes – Compute variation
  • 7. 77 Open slide master to edit LSMS on 128 compute nodes – MPI_Allreduce variation
  • 8. 88 Open slide master to edit LSMS on 128 compute nodes – OpenMP Barrier
  • 9. 99 Open slide master to edit LSMS on 128 compute nodes – OpenMP Barrier
  • 10. 1010 Open slide master to edit LSMS on 128 compute nodes – OpenMP Barrier - Sunburst
  • 11. 1111 Open slide master to edit LSMS – Buffer for tracing • scalasca -examine -s /gpfs/alpine/…/scorep_lsms_Ox8_sum/ INFO: Score report written to /gpfs/…/scorep_lsms_Ox8_sum/scorep.score vim /gpfs/…/scorep_lsms_Ox8_sum/scorep.score Estimated aggregate size of event trace: 24TB Estimated requirements for largest trace buffer (max_buf): 36GB Estimated memory requirements (SCOREP_TOTAL_MEMORY): 36GB (warning: The memory requirements cannot be satisfied by Score-P to avoid intermediate flushes when tracing. Set SCOREP_TOTAL_MEMORY=4G to get the maximum supported memory or reduce requirements using USR regions filters.) flt type max_buf[B] visits time[s] time[%] time/visit[us] region ALL 37,758,243,091 2,906,934,475 3738.13 100.0 1.29 ALL USR 37,757,438,730 2,906,899,463 3714.45 99.4 1.28 USR OMP 757,272 32,688 18.59 0.5 568.66 OMP MPI 29,168 932 2.58 0.1 2772.73 MPI COM 17,880 1,390 2.50 0.1 1798.55 COM SCOREP 41 2 0.00 0.0 242.20 SCOREP USR 22,486,328,384 1,729,717,541 260.77 7.0 0.15 dfv_m USR 9,921,071,888 763,159,376 126.99 3.4 0.17 fit USR 1,064,960,000 81,920,000 343.03 9.2 4.19 void bulirsch_stoer_integrator USR 846,102,582 65,083,854 19.65 0.5 0.30 void plm_normalized_(int*, double*, double*)
  • 12. 1212 Open slide master to edit Selective instrumentation • File exclude.filt, we select the USR calls with high volume of max_buf and the ones with many calls: SCOREP_REGION_NAMES_BEGIN EXCLUDE dfv_m fit *bulirsch_stoer_integrator_* *plm_normalized_* ... SCOREP_REGION_NAMES_END • Apply on the profiling data: scalasca -examine -s -f exclude.filt scorep_lsms_Ox7_sum/
  • 13. 1313 Open slide master to edit Selective instrumentation (cont.) • vim scorep_lsms_Ox7_sum/scorep.score_exclude.filt Estimated aggregate size of event trace: 3100MB Estimated requirements for largest trace buffer (max_buf): 6MB Estimated memory requirements (SCOREP_TOTAL_MEMORY): 20MB (hint: When tracing set SCOREP_TOTAL_MEMORY=20MB to avoid intermediate flushes or reduce requirements using USR regions filters.) flt type max_buf[B] visits time[s] time[%] time/visit[us] region - ALL 37,768,557,978 997,234,022,247 1508955.54 100.0 1.51 ALL - USR 37,767,383,966 997,216,546,439 1237070.04 82.0 1.24 USR - MPI 1,160,452 5,948,480 139997.90 9.3 23535.07 MPI - OMP 662,613 10,983,168 129188.73 8.6 11762.43 OMP - COM 302,424 543,392 2698.66 0.2 4966.33 COM - SCOREP 41 768 0.21 0.0 270.75 SCOREP * ALL 5,851,181 108,920,629 1061962.47 70.4 9749.87 ALL-FLT + FLT 37,763,922,579 997,125,101,618 446993.07 29.6 0.45 FLT * USR 3,725,651 91,444,821 790076.96 52.4 8639.93 USR-FLT - MPI 1,160,452 5,948,480 139997.90 9.3 23535.07 MPI-FLT - OMP 662,613 10,983,168 129188.73 8.6 11762.43 OMP-FLT * COM 302,424 543,392 2698.66 0.2 4966.33 COM-FLT - SCOREP 41 768 0.21 0.0 270.75 SCOREP-FLT Execute: scalasca -analyze -q -t -f /full_path/exclude.filt jsrun …
  • 14. 1414 Open slide master to edit LSMS on 128 compute nodes – Late Receivers (occ)
  • 15. 1515 Open slide master to edit LSMS on 128 compute nodes – Late Sender (sec.)
  • 16. 1616 Open slide master to edit LSMS on 128 compute nodes – Short-term Idleness delay
  • 17. 1717 Open slide master to edit LSMS on 128 compute nodes – Short-term Idleness delay
  • 18. 1818 Open slide master to edit Vampir with OTF2 trace OTF2 trace is created inside the scorep folder after the scalasca –analyze, example of 640 MPI processes
  • 19. 1919 Open slide master to edit Vampir with OTF2 trace
  • 20. 2020 Open slide master to edit Vampir with OTF2 trace
  • 21. 2121 Open slide master to edit Vampir with OTF2 trace
  • 22. 2222 Open slide master to edit Vampir with OTF2 trace
  • 23. 2323 Open slide master to edit Vampir with OTF2 trace
  • 24. 2424 Open slide master to edit Vampir with OTF2 trace
  • 25. 2525 Open slide master to edit Vampir with OTF2 trace
  • 26. 2626 Open slide master to edit Vampir with OTF2 trace
  • 27. 2727 Open slide master to edit Vampir with OTF2 trace
  • 28. 2828 Open slide master to edit Vampir with OTF2 trace
  • 29. 2929 Open slide master to edit Vampir with OTF2 trace
  • 30. 3030 Open slide master to edit Vampir with OTF2 trace
  • 31. 3131 Open slide master to edit Vampir with OTF2 trace
  • 32. 3232 Open slide master to edit Vampir with OTF2 trace
  • 33. 3333 Open slide master to edit Vampir with OTF2 trace
  • 34. 3434 Open slide master to edit Vampir with OTF2 trace
  • 35. 3535 Open slide master to edit Vampir with OTF2 trace
  • 36. 3636 Open slide master to edit Vampir with OTF2 trace II Zoom in with the mouse
  • 37. 3737 Open slide master to edit Vampir with OTF2 trace III Zoom in with the mouse
  • 38. 3838 Open slide master to edit Vampir with OTF2 trace IV Add functionalities from the menu
  • 39. 3939 Open slide master to edit Conclusions • Scalasca can identify bottlenecks based on some patterns • Useful for MPI and OpenMP • It can not provide useful information about GPUs but the information exists inside the traces • It can create OTF2 trace files and use Vampir to visualize