State of Scalasca
Itaru Kitayama
RIKEN AICS
What's Scalasca?
•  Toolset for studying the performance of parallel applications (MPI + OpenMP)
•  Open source, 3-clause BSD license
•  Portable implementation
    •  IBM Blue Gene, Cray XT/XE/XK/XC, SGI Altix, Fujitsu FX10/100, K computer, Linux (x86, Power, ARM), Intel Xeon Phi
•  Depends on the Score-P instrumenter & measurement libraries
•  Supports common data formats
    •  Reads event traces in OTF2 format
    •  Writes analysis reports in CUBE4 format
Scalasca workflow
[Figure: Scalasca workflow diagram. Score-P part: source modules -> instrumenter (compiler / linker) -> instrumented executable; the instrumented target application with the measurement library (and HWC) produces a summary report or local event traces; an optimized measurement configuration feeds back into measurement. Scalasca trace analysis part: parallel wait-state search on the traces -> wait-state report. Report manipulation answers: Which problem? Where in the program? Which process?]
Scalasca Status
•  Scalasca can be used for parallel application performance studies on arm64
•  GNU Autotools have been updated to recognize the arm64 architecture
•  Latest stable version is 2.3.1 (May 2016)
•  Cube v4.4 is upcoming
•  Sampling mode for arm64 is in progress
•  Bug fixes and enhancements will follow
Sampling Mode
•  Important for avoiding excessive overhead due to instrumentation
•  Requires the libunwind package
•  Interrupt sources: POSIX timer, perf, PAPI
•  Works on x86; work is ongoing on arm64
•  Issue: a PLT-entry resolved address passed to libunwind does not work as expected
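As an illustration of how the interrupt source is selected, the sketch below builds the environment for a sampling run. The variable names SCOREP_ENABLE_UNWINDING and SCOREP_SAMPLING_EVENTS and the "source@period" syntax are given as recalled from the Score-P user manual, and the period is only an illustrative value; treat all of them as assumptions and confirm them with `scorep-info config-vars` on the installed version.

```python
import os

def sampling_environment(source="perf_cycles", period=10_000_000, base_env=None):
    """Return a copy of the environment with Score-P sampling switched on.

    The variable names are assumptions taken from the Score-P manual;
    check them with `scorep-info config-vars` before relying on them.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Sampling relies on call-stack unwinding, which is what needs libunwind.
    env["SCOREP_ENABLE_UNWINDING"] = "true"
    # "<source>@<period>" picks the interrupt source and how often it fires.
    env["SCOREP_SAMPLING_EVENTS"] = f"{source}@{period}"
    return env

env = sampling_environment(base_env={})
for name, value in sorted(env.items()):
    print(f"{name}={value}")
```

Passing a different source string (for example a timer- or PAPI-based one) only changes the SCOREP_SAMPLING_EVENTS value; the exact spellings accepted depend on how Score-P was configured.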
libunwind test results on arm64
•  As of 1.3-rc1, AArch64 is listed as "Works well"
•  Test environment:
    •  kernel: 4.14
    •  gcc: 4.8.5 20150623 (Red Hat 4.8.5-16)
    •  glibc: 2.17
    •  hardware: Cavium ThunderX
•  $ make check on arm64 produces:
============================================================================
Testsuite summary for libunwind 1.3-rc1
============================================================================
# TOTAL: 35
# PASS: 26
# SKIP: 0
# XFAIL: 0
# FAIL: 9
# XPASS: 0
# ERROR: 0
============================================================================
See tests/test-suite.log
Please report to libunwind-devel@nongnu.org
============================================================================
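To put the summary above into one number, the following sketch parses the counts and computes the pass rate. The counts are copied from the slide; the function name parse_summary is hypothetical.

```python
import re

# Counts copied from the libunwind 1.3-rc1 test-suite summary on the slide.
SUMMARY = """\
# TOTAL: 35
# PASS: 26
# SKIP: 0
# XFAIL: 0
# FAIL: 9
# XPASS: 0
# ERROR: 0
"""

def parse_summary(text):
    """Map each '# NAME: count' line of the summary to an integer."""
    pairs = re.findall(r"^# (\w+):\s+(\d+)$", text, re.MULTILINE)
    return {name: int(count) for name, count in pairs}

counts = parse_summary(SUMMARY)
pass_rate = counts["PASS"] / counts["TOTAL"]
print(f"{counts['PASS']}/{counts['TOTAL']} passed ({pass_rate:.1%}), "
      f"{counts['FAIL']} failed")
```

Which nine tests fail is not stated on the slide; tests/test-suite.log would be the place to look.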
Cube Status
•  Release v4.4 is upcoming
•  Major changes since stable v4.3:
    •  Packaging
    •  Many plugins for customized performance analysis
        •  KNL vectorization adviser
        •  OTF2 Trace visualizer
        •  Sunburst
        •  ScorePion
    •  Memory footprint reduction (to appear in v4.4 or after)
•  http://www.scalasca.org/software/cube-4.x/download.html
[Figure: Snapshot of Cube GUI on ThunderX]
NPB3.3-MZ-MPI/BT Exercise on ThunderX
•  NAS Parallel Benchmarks suite (sample MZ-MPI version)
•  Available from http://www.nas.nasa.gov/Software/NPB
•  3 benchmarks (all in Fortran77, using OpenMP+MPI)
•  Configurable for various sizes & classes
NPB-MZ-MPI/BT (Block Tridiagonal Solver)
•  What does it do?
    •  Solves a discretized version of the unsteady, compressible Navier-Stokes equations in three spatial dimensions
    •  Performs 200 time-steps on a regular 3-dimensional grid using ADI (alternating direction implicit) and verifies that the solution error is within acceptable limits
    •  Intra-zone computation with OpenMP, inter-zone with MPI
•  Implemented in 20 or so Fortran77 source modules
•  Runs with any number of MPI processes & OpenMP threads
    •  On ThunderX, bt-mz.B.16 with 6 threads per process (16x6) should run in about 30 seconds
    •  CLASS=B is recommended
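The 16x6 configuration mentioned above can be sketched as a small launcher. The helper name scan_command is hypothetical, and the sketch assumes an mpiexec launcher and Scalasca's scan command on PATH; "-s" requests runtime summarization. It only builds the environment and argument list, so it can be inspected without running anything.

```python
import os

def scan_command(ranks, threads, executable, mode="-s"):
    """Build the environment and argv for a Scalasca measurement run."""
    env = dict(os.environ)
    env["OMP_NUM_THREADS"] = str(threads)  # OpenMP threads per MPI process
    argv = ["scan", mode, "mpiexec", "-np", str(ranks), executable]
    return env, argv

env, argv = scan_command(ranks=16, threads=6, executable="./bt-mz.B.16")
print("OMP_NUM_THREADS=" + env["OMP_NUM_THREADS"], " ".join(argv))
print("total threads:", 16 * 6)

# To actually launch (needs Scalasca and MPI installed):
#   import subprocess
#   subprocess.run(argv, env=env, check=True)
```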
NPB-MZ-MPI/BT profile execution
•  Set OMP_NUM_THREADS and launch as an MPI application
-bash-4.2$ scan -s mpiexec -np 16 ./bt-mz.B.16
S=C=A=N: Scalasca 2.3.1 runtime summarization
S=C=A=N: ./scorep_bt-mz_16x6_sum experiment archive
S=C=A=N: Sat Dec 2 12:30:05 2017: Collect start
/home/itaru/opt/openmpi-2.1.1/bin/mpiexec -np 16 ./bt-mz.B.16

NAS Parallel Benchmarks (NPB3.3-MZ-MPI) - BT-MZ MPI+OpenMP Benchmark

Number of zones: 8 x 8
Iterations: 200 dt: 0.000300
Number of active processes: 16

Use the default load factors with threads
Total number of threads: 96 ( 6.0 threads/process)

Calculated speedup = 93.84

Time step 1
Time step 20
Time step 40
Time step 60
Time step 80
Time step 100
Time step 120
Time step 140
Time step 160
Time step 180
Time step 200
Verification being performed for class B
accuracy setting for epsilon = 0.1000000000000E-07
[…]
S=C=A=N: Sat Dec 2 12:30:44 2017: Collect done (status=0) 39s
S=C=A=N: ./scorep_bt-mz_16x6_sum complete.
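The measured run time can be cross-checked from the two "Collect" timestamps in the log above. This is a minimal sketch; the two log lines are copied from the transcript, and the function name collect_seconds is hypothetical.

```python
import re
from datetime import datetime

# The two timestamped lines from the scan log above.
LOG = """\
S=C=A=N: Sat Dec 2 12:30:05 2017: Collect start
S=C=A=N: Sat Dec 2 12:30:44 2017: Collect done (status=0) 39s
"""

STAMP = re.compile(
    r"^S=C=A=N: \w{3} (\w{3} +\d+ [\d:]+ \d{4}): Collect (start|done)",
    re.MULTILINE,
)

def collect_seconds(log):
    """Return the wall-clock seconds between 'Collect start' and 'Collect done'."""
    times = {}
    for stamp, phase in STAMP.findall(log):
        # Collapse repeated spaces so single-digit days parse the same way.
        cleaned = " ".join(stamp.split())
        times[phase] = datetime.strptime(cleaned, "%b %d %H:%M:%S %Y")
    return int((times["done"] - times["start"]).total_seconds())

print(collect_seconds(LOG), "seconds")
```

The result agrees with the 39s that scan itself reports, which includes measurement overhead on top of the roughly 30 seconds quoted for the uninstrumented run.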
NPB-MZ-MPI/BT build configuration definition
•  config/make.def (the Score-P wrapper goes just before the compiler):
# F77 - Fortran compiler
# FFLAGS - Fortran compilation arguments
# F_INC - any -I arguments required for compiling Fortran
# FLINK - Fortran linker
# FLINKFLAGS - Fortran linker arguments
# F_LIB - any -L and -l arguments required for linking Fortran
#
# compilations are done with $(F77) $(F_INC) $(FFLAGS) or
# $(F77) $(FFLAGS)
# linking is done with $(FLINK) $(F_LIB) $(FLINKFLAGS)
#---------------------------------------------------------------------------
#---------------------------------------------------------------------------
# This is the fortran compiler used for fortran programs
#---------------------------------------------------------------------------
F77 = scorep mpif77
#F77 = mpif77
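The edit to make.def amounts to prefixing the compiler command with the scorep wrapper. A minimal sketch of doing that programmatically is shown below; the function name wrap_compiler and the sample input are hypothetical, and editing the file by hand works just as well.

```python
def wrap_compiler(make_def_text, variable="F77", wrapper="scorep"):
    """Prefix the active '<variable> = ...' line with the wrapper command.

    Commented-out lines and lines that are already wrapped are left alone.
    """
    out = []
    for line in make_def_text.splitlines():
        name, sep, value = line.partition("=")
        already_wrapped = value.split()[:1] == [wrapper]
        if sep and name.strip() == variable and not already_wrapped:
            line = f"{name.rstrip()} = {wrapper} {value.strip()}"
        out.append(line)
    return "\n".join(out)

original = "F77 = mpif77\n#F77 = gfortran"
print(wrap_compiler(original))
```

Because already-wrapped lines are skipped, running the function twice leaves the text unchanged.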
Cube Data Representation
[Figure: Cube data representation, showing 16 ranks with 6 threads each]
Summary
•  Scalasca and Score-P have been ported to arm64, and the tools work on real hardware
•  Sampling is the one feature still missing
•  The data visualization and analysis framework will be updated
Thanks to
•  Markus Geimer (JSC)
•  Pavel Saviankou (JSC)
•  Brian Wylie (JSC)
•  Michael Knobloch (JSC)
•  Scalasca and Score-P Communities
