Introduction to benchmark app
● NPB = NAS Parallel Benchmarks.
● A small set of programs designed to
evaluate performance of parallel
supercomputers.
● 5 kernels, 3 pseudo applications.
● 3 versions: Serial, OpenMP, MPI.
● 8 classes of tests:
○ S - small, for quick tests
○ W - workstation size
○ A, B, C - standard tests, ~4x size increase per class
○ D, E, F - large tests, ~16x size increase per class
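As a sketch, building and running one of these kernels (IS, used later in these slides) follows NPB's make conventions; the exact targets and paths below are assumptions based on those conventions, not commands taken from the slides:

```shell
# Build the MPI Integer Sort kernel for class C and 8 processes
# (run from the NPB3.3-MPI directory; make targets per NPB conventions).
make IS NPROCS=8 CLASS=C

# NPB names binaries <kernel>.<class>.<nprocs>:
mpirun -np 8 bin/is.C.8
```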
Testbeds
              Local                Remote
Machine type  Laptop               Server
Processor     Intel Core i3-330M   Intel Xeon E5645
              2.13 GHz             2.40 GHz
Cores         2                    6
Cache (MB)    3                    12
Memory (GB)   3                    24
Instrumentation
● Preload Extrae's MPI trace library
"libmpitrace.so".
● The library intercepts all MPI calls and
records them as trace events.
● Instrumented and executed:
○ NPB version 3.3 stable
○ NPB3.3-MPI
○ IS (Integer Sort) kernel with 2, 4, 8, 16 and 32 procs
● Per experiment:
○ Problem size: class C, approx. 135 million keys
○ Iterations: 10
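A minimal sketch of the preload step, assuming a typical Extrae installation (the install prefix and the XML configuration file name are assumptions):

```shell
# Point Extrae at its XML configuration and preload the MPI tracing
# library so every MPI call is intercepted without recompiling IS.
export EXTRAE_HOME=/opt/extrae            # assumed install prefix
export EXTRAE_CONFIG_FILE=./extrae.xml    # assumed Extrae XML config
export LD_PRELOAD=$EXTRAE_HOME/lib/libmpitrace.so

mpirun -np 8 ./is.C.8   # emits intermediate trace files for later merging
```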
Measurement criterion
Metric              Relevance to NPB-MPI Integer Sort
Computation time    General idea of speed-up.
Communication time  Impact of increasing the number of processes on communication.
Load imbalance      Which processes do less work than others.
Bottlenecks         Where performance bottlenecks occur.
L1 cache misses     How often the CPU had to go to slower memory to find data.
Computation time
● Measured: per-process computation time.
● Local:
○ time increases roughly in proportion to nprocs
○ up to 32 processes
○ poor scalability
● Remote:
○ time decreases as nprocs grows
○ up to 32 processes
○ good scalability
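The scalability claims above can be quantified with the usual speed-up and efficiency definitions; a minimal sketch with invented timings (not the measured data from these experiments):

```python
# Speed-up S(p) = T(1) / T(p); parallel efficiency E(p) = S(p) / p.
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, nprocs):
    return speedup(t_serial, t_parallel) / nprocs

# Hypothetical computation times (seconds), for illustration only:
times = {1: 100.0, 2: 52.0, 4: 27.0, 8: 15.0}
for p, t in sorted(times.items()):
    print(p, round(speedup(times[1], t), 2), round(efficiency(times[1], t, p), 2))
```

Efficiency below 1.0 at higher process counts is the "poor scalability" symptom seen on the local machine.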
Communication time
● Overall communication time is determined
by the process taking maximum time.
● Local:
○ rapid increase in time as the number of processes increases
● Remote:
○ nominal increase in time as the number of processes increases
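The rule above (overall communication time is set by the slowest process) in a minimal sketch; the per-rank times are invented for illustration:

```python
# Overall communication time is governed by the rank that spends the
# most time communicating, not by the average across ranks.
def overall_comm_time(per_rank_comm):
    return max(per_rank_comm)

per_rank = [0.8, 0.9, 2.6, 0.7]   # hypothetical seconds; rank 2 dominates
print(overall_comm_time(per_rank))    # the maximum across ranks
print(sum(per_rank) / len(per_rank))  # the (smaller) average, for contrast
```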
Load Imbalance
● On boada
○ For nprocs = 4, processes {2, 3} are mostly idle ("lazy").
○ For nprocs = 16, processes {5, 6, 7, 8, 12} are mostly idle.
(Figure: Paraver timeline; legend: Exec, Wait, Comm)
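One common way to put a number on this is the imbalance factor max/mean over per-process execution times; a sketch with invented times (a value of 1.0 means perfectly balanced):

```python
# Imbalance factor: max(exec time) / mean(exec time); 1.0 = balanced.
def imbalance(exec_times):
    return max(exec_times) / (sum(exec_times) / len(exec_times))

balanced = [10.0, 10.0, 10.0, 10.0]
lazy_ranks = [10.0, 10.0, 4.0, 4.0]  # ranks 2 and 3 do less work
print(imbalance(balanced))    # 1.0
print(imbalance(lazy_ranks))  # > 1: some ranks sit in Wait
```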
Bottlenecks
● For nprocs = {8, 16, 32}, one or more
processes take noticeably longer.
○ Time is spent in Wait/Waitall calls.
○ Typical wait times on the local machine are around 1000 ms.
○ Typical wait times on the remote machine are around 250 ms.
■ 4x difference (processes on the remote machine have
shorter wait times).
L1 cache misses
● Cache misses on the local machine are more
expensive, typically costing 5x more time.
○ Cache size difference? The local machine has to "look"
in slower memory more often.
■ The i3 has a 3 MB cache.
■ The Xeon has a 12 MB cache.
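A back-of-envelope model of why a larger cache helps: effective access time = hit time + miss rate x miss penalty. All numbers below are assumptions for illustration, not measurements from these runs:

```python
# Effective memory access time (ns): t_eff = hit + miss_rate * penalty.
def effective_access_ns(hit_ns, miss_rate, penalty_ns):
    return hit_ns + miss_rate * penalty_ns

# A 12 MB cache (Xeon) plausibly yields a lower miss rate than a 3 MB
# cache (i3) on the same working set; both miss rates here are invented.
print(effective_access_ns(1.0, 0.10, 100.0))  # hypothetical i3-like
print(effective_access_ns(1.0, 0.02, 100.0))  # hypothetical Xeon-like
```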
Anomalies
● For 32 processes:
○ Time taken to spawn the processes varies.
○ The remote machine takes less time to spawn 32 processes.
○ Possible reason:
■ Acquiring locks and switching between resource
acquisition and release is costly.
● Time taken by "other" jobs also varies:
○ but this generally differs from system to system.
Conclusions
● Instrumentation is necessary to reveal the
performance characteristics of parallel code.
● Extrae provides a handy mechanism for
automatic instrumentation.
● Some interesting observations:
○ IS does not scale well on low-end machines
beyond 16 procs.
○ It scales nicely on a server such as boada.
○ IS becomes communication intensive as
nprocs is increased.
○ Some bottlenecks (long waits) degrade performance.
Instrumentation and
analysis of NPB
Zafar Gilani
EMDC 2012
Measurement Tools and Techniques
UPC