© 2002 IBM Corporation
IBM Software Group, Compilation Technologies
IBM Toronto Lab | September 19, 2003
Poly3D Case Study:
The Impact of Processor Cache Misses on
Performance of CPU-Intensive Applications
Zoran Kulina
Staff Software Engineer
(C/C++ Support)
Performance Review of Poly3D on p630 © 2003 IBM Corporation
Table of contents
Background
Poly3D Profiling Results
Benchmarks
Summary
Background
It is often assumed that the performance of CPU-bound applications (e.g.,
computational science and engineering simulations) scales more or less
linearly with the CPU clock rate.
Users are often surprised when new hardware yields a smaller performance
improvement than expected.
This study profiles Poly3D (a thermal dynamics simulator) and shows how
factors other than CPU speed influence overall application performance.
The study is the outcome of a recent service engagement of mine. I was tasked
with benchmarking application performance to determine why a computer with
more than twice the clock rate failed to yield the expected performance gain.
The Story
Client purchased a new p630 server for their CPU-intensive simulation app
2x clock rate (1000 vs 450 MHz)
4x the memory (8 vs 2 GB)
Expected at least a two-fold increase in application speed
Got an improvement of only about 40-50%
Machine Specifications
              44P-170              p630-6C4
Processor     POWER3-II (1-way)    POWER4 (1-way)
Clock rate    450 MHz              1000 MHz
Memory        2 GB                 8 GB
L1            96 KB                96 KB
L2            8 MB                 1.44 MB
L3            none                 32 MB
Poly3D Overview
Fluid dynamics simulation program
Written in C (originally Fortran 77)
Single-threaded
Uses about 100 megabytes of memory
Performs dot product calculations and other matrix operations
Runs for ~150 seconds on 44P-170
Runs for ~96 seconds on p630
My Method
Run a representative workload on both the old and the new server
Ensure nothing else runs concurrently on the system
Collect hardware utilization metrics using the pmcount utility
Summarize and compare the pmcount metrics on the old vs. the new server
Gather the officially published benchmarks (SPEC CPU2000 and LINPACK) for
both systems. The challenge here is to find matching server
configurations.
Determine how our pmcount metrics compare to the official benchmarks
Take corrective action as needed
My Method (cont’d)
Possible causes
CPU: more than 2x faster, so any shortfall has to come from
caching, memory or I/O
Disk I/O: Poly3D is CPU-intensive, as it mainly performs floating-point
calculations, so disk I/O is not a likely bottleneck. SAN throughput is
nearly identical on both systems anyway.
Memory: p630 has 4x as much memory, so memory capacity is not a likely bottleneck.
Cache: p630 actually has less L2 cache than 44P-170 (1.44 MB vs 8 MB). This is
something that we want to keep an eye on.
Poly3D Profiling Results
Memory Access Distribution
Event                 Description                    Hits             Share
PM_DATA_FROM_L2       Data loaded from L2 cache      940,204,334      82%
PM_DATA_FROM_L3 *     Data loaded from L3 cache      63,310,703       6%
PM_DATA_FROM_L3.5 *   Data loaded from L3.5 cache    55,488,257       5%
PM_DATA_FROM_MEM      Data loaded from memory        77,330,243       7%
Total                                                1,136,333,537
* Total L3 cache access = L3 + L3.5
Obtained using the pmcount utility on p630-6C4
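
The shares in the chart can be reproduced from the raw event counts. The following is a minimal sketch in Python; the dictionary keys and the function name `access_shares` are illustrative and not part of pmcount. The computed values are unrounded, so they differ slightly from the whole-number chart labels.

```python
# Event counts copied from the Memory Access Distribution table (p630-6C4).
COUNTS = {
    "L2": 940_204_334,
    "L3": 63_310_703,
    "L3.5": 55_488_257,
    "MEM": 77_330_243,
}


def access_shares(counts):
    """Return each source's share of all data loads, as a percentage."""
    total = sum(counts.values())
    return {name: 100.0 * hits / total for name, hits in counts.items()}


if __name__ == "__main__":
    for name, pct in access_shares(COUNTS).items():
        print(f"{name:5s} {pct:6.2f}%")
```

Roughly one load in six misses the L2 cache on this run, and those are the loads that pay the L3 or main-memory latency.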
Processor Time Distribution
Activity           Cycles            Seconds   Share
L2 cache access    59,930,474,670    59.93     63%
L3 cache access    11,879,896,000    11.88     13%
Memory access      23,199,072,900    23.20     24%
Total              95,009,443,570    95.01
Obtained by the pmcount utility on p630-6C4
Observations
Memory access constitutes a significant proportion of the execution time (24%)
Cost of one L3 cache access = ~100 cycles
Cost of one memory access = ~300 cycles
118,798,960 L3 accesses x 100 cycles = 11,879,896,000 cycles (11.9 seconds @ 1 GHz)
77,330,243 memory accesses x 300 cycles = 23,199,072,900 cycles (23.2 seconds @ 1 GHz)
Total of 35,078,968,900 cycles or 35.1 seconds spent on L3 cache and memory accesses
This portion of the work costs less on 44P-170 due to its much larger L2 cache
The remaining work is expected to speed up in proportion to the clock speed increase
The target of 70 seconds (or less) was achieved on a 1 GHz p690, which, due to a larger L3 cache,
accessed memory about eight times less often than p630 (9 million vs. 77 million accesses)
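
The arithmetic behind these observations can be checked with a short sketch. It multiplies the pmcount event counts by the approximate per-access latencies quoted above and converts cycles to seconds at 1 GHz. The constant and function names are illustrative. The last step is an inference rather than something the slides state: subtracting both stall figures from the total cycle count gives exactly the "L2 cache access" figure reported for the processor time distribution, which suggests that row is a remainder rather than a separate measurement.

```python
CLOCK_HZ = 1_000_000_000            # p630 POWER4 at 1 GHz
L3_COST_CYCLES = 100                # approximate cost of one L3 access
MEM_COST_CYCLES = 300               # approximate cost of one memory access

L3_ACCESSES = 63_310_703 + 55_488_257   # L3 + L3.5 event counts
MEM_ACCESSES = 77_330_243
TOTAL_CYCLES = 95_009_443_570           # whole run, ~95 seconds


def stall_cycles(accesses, cost_cycles):
    """Cycles spent waiting, assuming a fixed cost per access."""
    return accesses * cost_cycles


def to_seconds(cycles, clock_hz=CLOCK_HZ):
    return cycles / clock_hz


l3_cycles = stall_cycles(L3_ACCESSES, L3_COST_CYCLES)
mem_cycles = stall_cycles(MEM_ACCESSES, MEM_COST_CYCLES)
remaining_cycles = TOTAL_CYCLES - l3_cycles - mem_cycles

if __name__ == "__main__":
    print(f"L3 stalls:     {to_seconds(l3_cycles):6.2f} s")
    print(f"Memory stalls: {to_seconds(mem_cycles):6.2f} s")
    print(f"Everything else: {to_seconds(remaining_cycles):6.2f} s")
```

Because the per-access costs are fixed approximations, the result is an estimate of stall time, not a measurement of it.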
Benchmarks
SPEC CPU2000 and LINPACK Results
                     SPEC CPU2000                        LINPACK (Mflop/s)
                     int    int_base   fp     fp_base    DP     TPP     HPC
44P-170              346    333        434    426        503    1,440   ---
p630-6C4             639    624        886    843        842    2,172   ---
Improvement ratio    1.85   1.87       2.04   1.98       1.67   1.51    ---
Source: IBM eServer pSeries and IBM RS/6000 Performance Report
Greater improvement ratio shown for CPU-intensive benchmarks, i.e. SPEC CPU2000
Lower improvement ratio shown for memory-intensive benchmarks, i.e. LINPACK
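
The improvement ratios are simply the p630-6C4 score divided by the 44P-170 score. A small sketch that recomputes them from the published figures follows; the `SCORES` layout and the function name are illustrative.

```python
# (44P-170, p630-6C4) scores from the SPEC CPU2000 and LINPACK results.
SCORES = {
    "int": (346, 639),
    "int_base": (333, 624),
    "fp": (434, 886),
    "fp_base": (426, 843),
    "DP": (503, 842),
    "TPP": (1440, 2172),
}


def improvement_ratio(old_score, new_score):
    """Higher-is-better scores: ratio of new to old."""
    return new_score / old_score


if __name__ == "__main__":
    for name, (old, new) in SCORES.items():
        print(f"{name:9s} {improvement_ratio(old, new):.2f}")
```

For comparison, the clock rate ratio between the two machines is 1000 / 450, or about 2.22. Only the SPEC floating-point scores come close to it.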
LINPACK Overview
LINear equations software PACKage
Developed by Dr. Jack Dongarra, University of Tennessee
Consists of algorithms that solve dense systems of linear equations using
Gaussian elimination
Uses a matrix of order 100 for the DP benchmark and a matrix of order 1000 for the TPP benchmark
Used to rank the TOP500 supercomputer sites (www.top500.org)
Used to test overall performance rather than just CPU clock rate
Memory reference and CPU usage patterns similar to Poly3D
Problems being solved similar to those of Poly3D
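
To illustrate the kind of work the benchmark performs, here is a minimal sketch of Gaussian elimination with partial pivoting in pure Python. It is not the LINPACK source (which is Fortran) and is not tuned in any way; the function name `solve_dense` is illustrative. The triple loop sweeps repeatedly over the whole matrix, so for larger orders the matrix is streamed through the cache hierarchy many times.

```python
def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    # Work on an augmented copy so the caller's data is left untouched.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]

    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]

        # Eliminate the entries below the pivot.
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]

    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - known) / m[row][row]
    return x


if __name__ == "__main__":
    print(solve_dense([[2, 1], [1, 3]], [3, 5]))
```

A matrix of order n holds n x n double-precision values, so order 1000 occupies 8,000,000 bytes, while order 100 occupies 80,000 bytes.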
LINPACK Cont’d
Source: Performance of Various Computers Using Standard Linear Equations
Software, Dr. Jack Dongarra
Theoretical peak performance is determined by counting the number of floating point
operations (flops) that can be completed in one second
Theoretical peak performance does not take into account factors such as: data
movement between different levels of memory, cache misses, pipeline start-ups,
memory load, bus speed, and others
Actual performance reflects those factors and it also depends on application code
efficiency, compiler optimization, operating system, hardware characteristics, etc
                     DP          TPP         Theoretical Peak   Poly3D
                     (Mflop/s)   (Mflop/s)   (Mflop/s)          (seconds)
44P-170              503         1,440       1,800              150
p630-6C4             842         2,172       4,000              96
Improvement ratio    1.67        1.51        2.22               1.56
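
One way to read this table is to express each LINPACK result as a fraction of theoretical peak, and to compute the Poly3D speedup from the run times. The sketch below does both; the names are illustrative. For a time measurement the ratio is old over new, since lower is better.

```python
# Figures copied from the table above, ordered (44P-170, p630-6C4).
LINPACK_MFLOPS = {"DP": (503, 842), "TPP": (1440, 2172)}
PEAK_MFLOPS = (1800, 4000)
POLY3D_SECONDS = (150, 96)


def efficiency(actual_mflops, peak_mflops):
    """Fraction of theoretical peak actually achieved."""
    return actual_mflops / peak_mflops


def speedup_from_times(old_seconds, new_seconds):
    """Lower-is-better run times: ratio of old to new."""
    return old_seconds / new_seconds


if __name__ == "__main__":
    for name, (old, new) in LINPACK_MFLOPS.items():
        print(f"{name}: {efficiency(old, PEAK_MFLOPS[0]):.1%} of peak on 44P-170, "
              f"{efficiency(new, PEAK_MFLOPS[1]):.1%} on p630-6C4")
    print(f"Poly3D speedup: {speedup_from_times(*POLY3D_SECONDS):.2f}")
```

On these figures the p630-6C4 reaches about 54% of its peak on TPP, against 80% for the 44P-170. The new machine is faster in absolute terms, but it converts less of its peak into delivered performance.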
Summary
The Poly3D memory reference pattern causes a high cache miss rate and extensive
data movement between main memory and the CPU
The smaller L2 cache and a high L3 cache miss rate make Poly3D go to main
memory on p630 more often than on 44P-170
A significant portion of execution time is limited by the speed of main memory
The total amount of memory used by Poly3D (about 100 MB) is greater than the system cache
The Poly3D improvement ratio (1.56) is consistent with LINPACK (1.51 to 1.67)
The difference between actual and peak performance for the p630 LINPACK benchmark is
consistent with other systems
A single benchmark should not be used to judge the overall performance of a system.
Rather, a set of specialized benchmarks can measure overall performance more
accurately
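
The point about working-set size can be made concrete with a rough comparison of the Poly3D footprint against the largest cache level on each machine. This is a back-of-the-envelope sketch using only the sizes quoted in the machine specifications; the names are illustrative, and it ignores the L1 cache and any question of how the cache levels overlap.

```python
WORKING_SET_MB = 100.0  # approximate Poly3D memory use

# Cache sizes in MB from the machine specifications.
CACHES_MB = {
    "44P-170": {"L2": 8.0},
    "p630-6C4": {"L2": 1.44, "L3": 32.0},
}


def largest_cache_mb(machine):
    return max(CACHES_MB[machine].values())


def oversubscription(machine, working_set_mb=WORKING_SET_MB):
    """How many times larger the working set is than the largest cache."""
    return working_set_mb / largest_cache_mb(machine)


if __name__ == "__main__":
    for machine in CACHES_MB:
        print(f"{machine}: working set is "
              f"{oversubscription(machine):.1f}x the largest cache")
```

On both machines the working set exceeds every cache level, so some traffic to main memory is unavoidable. What differs is which level absorbs the misses and at what latency.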
Sources
IBM eServer pSeries and IBM RS/6000 Performance Report (June 2003)
http://www.ibm.com/servers/eserver/pseries/hardware/system_perf.pdf
Performance of Various Computers Using Standard Linear Equations Software, Jack Dongarra
http://www.netlib.org/benchmark/performance.ps
Frequently Asked Questions on the Linpack Benchmark
http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
The LINPACK Benchmark: Past, Present, and Future, Dongarra, Luszczek, and Petitet
http://www.netlib.org/utk/people/JackDongarra/PAPERS/hpl.pdf
