Poly3D Case Study: The Impact of Cache Misses on Performance of CPU-Intensive Applications

© 2002 IBM Corporation
IBM Software Group, Compilation Technologies
IBM Toronto Lab | September 19, 2003
Poly3D Case Study:
The Impact of Processor Cache Misses on
Performance of CPU-Intensive Applications
Zoran Kulina
Staff Software Engineer
(C/C++ Support)

Performance Review of Poly3D on p630 © 2003 IBM Corporation
Table of contents
Background
Poly3D Profiling Results
Benchmarks
Summary

Background
It is often assumed that the performance of CPU-bound applications (e.g.,
computational science and engineering simulations) scales more or less
linearly with the CPU clock rate.
Users are often surprised when new hardware yields a smaller performance
improvement than expected.
This study profiles Poly3D (a thermal dynamics simulator) and shows how
factors other than CPU speed influence overall application performance.
The study is the outcome of a recent service engagement in which I was
tasked with benchmarking the application to determine why a computer with
more than twice the clock rate failed to deliver the expected performance gain.

The Story
Client purchased a new p630 server for their CPU-intensive simulation application
More than 2x the clock rate (1000 vs. 450 MHz)
4x the memory (8 vs. 2 GB)
Expected at least a two-fold increase in application speed
Got an improvement of only about 40-50%

Machine Specifications
              44P-170              p630-6C4
Processor     POWER3-II (1-way)    POWER4 (1-way)
Clock rate    450 MHz              1000 MHz
Memory        2 GB                 8 GB
L1            96 KB                96 KB
L2            8 MB                 1.44 MB
L3            none                 32 MB

Poly3D Overview
Fluid dynamics simulation program
Written in C (originally Fortran 77)
Single-threaded
Uses about 100 megabytes of memory
Performs dot product calculations and other matrix operations
Runs for ~150 seconds on 44P-170
Runs for ~96 seconds on p630
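
The two run times above can be set against the clock rates to see how far the result falls short of linear scaling. The following is a minimal sketch in Python that uses only the figures quoted in this deck; the variable names are illustrative and not part of the original study.

```python
# Figures quoted in the deck; variable names are illustrative.
old_clock_mhz, new_clock_mhz = 450, 1000
old_runtime_s, new_runtime_s = 150, 96

clock_ratio = new_clock_mhz / old_clock_mhz      # what linear scaling would predict
speedup = old_runtime_s / new_runtime_s          # what was actually measured
time_saved = 1 - new_runtime_s / old_runtime_s   # reduction in wall-clock time

print(f"clock ratio: {clock_ratio:.2f}x")
print(f"speedup:     {speedup:.2f}x")
print(f"run time reduced by {time_saved:.0%}")
```

The clock ratio comes out at roughly 2.22 while the measured speedup is roughly 1.56, which is the gap the rest of the study sets out to explain.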

My Method
Run a representative workload on both the old and the new server
Ensure nothing else runs concurrently on the system
Collect hardware utilization metrics using the pmcount utility
Summarize and compare the pmcount metrics on the old vs. the new server
Gather the officially published benchmarks (SPEC CPU2000 and LINPACK) for both systems. The challenge here is to find matching server configurations.
Determine how our pmcount metrics compare to the official benchmarks
Take corrective action as needed

My Method (cont’d)
Possible causes
CPU: the clock rate is more than 2x higher, so any shortfall must come from caching, memory or I/O
Disk I/O: Poly3D is CPU intensive as it mainly performs floating-point calculations, so disk I/O is not a likely bottleneck. SAN throughput is nearly identical on both systems anyway.
Memory: the p630 has 4x as much memory, so it is not a likely bottleneck.
Cache: the p630 actually has less L2 cache than the 44P-170 (1.44 MB vs. 8 MB). This is something that we want to keep an eye on.

Poly3D Profiling Results

Memory Access Distribution
Event                  Description                    Hits             Share
PM_DATA_FROM_L2        Data loaded from L2 cache        940,204,334    82%
PM_DATA_FROM_L3 *      Data loaded from L3 cache         63,310,703     6%
PM_DATA_FROM_L3.5 *    Data loaded from L3.5 cache       55,488,257     5%
PM_DATA_FROM_MEM       Data loaded from memory           77,330,243     7%
Total                                                 1,136,333,537
* Total L3 cache access = L3 + L3.5
Obtained using the pmcount utility on p630-6C4
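
As a cross-check on the table above, the total and the per-level shares can be recomputed from the raw event counts. This is a small sketch in Python; the dictionary keys are shorthand labels chosen here, not pmcount event names.

```python
# Event counts reported by pmcount on the p630-6C4 (from the table above).
loads = {
    "L2": 940_204_334,
    "L3": 63_310_703,
    "L3.5": 55_488_257,
    "memory": 77_330_243,
}

total = sum(loads.values())
shares = {source: count / total for source, count in loads.items()}

for source, share in shares.items():
    print(f"{source:>6}: {share:6.1%}")

# L3 and L3.5 together make up the total L3 cache traffic.
l3_total = loads["L3"] + loads["L3.5"]
print(f"total: {total:,}  combined L3: {l3_total:,}")
```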

Processor Time Distribution
Activity           Cycles            Seconds   Share
L2 cache access    59,930,474,670    59.93     63%
L3 cache access    11,879,896,000    11.88     13%
Memory access      23,199,072,900    23.20     24%
Total              95,009,443,570    95.01
Obtained using the pmcount utility on p630-6C4

Observations
Memory access constitutes a significant proportion of the execution time (24%)
Cost of one L3 cache access = ~100 cycles
Cost of one memory access = ~300 cycles
118,798,960 L3 accesses x 100 cycles = 11,879,896,000 cycles (11.9 seconds @ 1 GHz)
77,330,243 memory accesses x 300 cycles = 23,199,072,900 cycles (23.2 seconds @ 1 GHz)
Total of 35,078,968,900 cycles or 35.1 seconds spent on L3 cache and memory accesses
This portion of the work is expected to cost less on the 44P-170 because of its much larger L2 cache
Only the remaining work is expected to speed up in proportion to the clock rate increase
The target of 70 seconds (or less) was achieved on a 1 GHz p690, which, thanks to its larger L3 cache, accessed main memory about eight times less often than the p630 (9 million vs. 77 million accesses)
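
The cycle estimate in the observations above can be reproduced directly from the pmcount event counts. The sketch below is in Python and takes the per-access costs (~100 and ~300 cycles) from the slide as given; they are approximations, and the constant names are chosen here for illustration.

```python
CLOCK_HZ = 1_000_000_000   # p630 clock rate (1 GHz)
L3_COST_CYCLES = 100       # approximate cost of one L3 access (from the slide)
MEM_COST_CYCLES = 300      # approximate cost of one memory access (from the slide)

l3_accesses = 63_310_703 + 55_488_257   # L3 + L3.5 events
mem_accesses = 77_330_243

l3_cycles = l3_accesses * L3_COST_CYCLES
mem_cycles = mem_accesses * MEM_COST_CYCLES
stall_cycles = l3_cycles + mem_cycles

print(f"L3:     {l3_cycles:,} cycles = {l3_cycles / CLOCK_HZ:.1f} s")
print(f"memory: {mem_cycles:,} cycles = {mem_cycles / CLOCK_HZ:.1f} s")
print(f"total:  {stall_cycles:,} cycles = {stall_cycles / CLOCK_HZ:.1f} s")
```

Because the cost is a count multiplied by a fixed latency, this part of the run time depends on how often the cache misses rather than on the clock rate, which is why it does not shrink with a faster processor.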


Benchmarks
SPEC CPU2000 and LINPACK Results
                     SPEC CPU2000                        LINPACK
                     int     int_base   fp     fp_base   DP     TPP     HPC
44P-170              346     333        434    426       503    1,440   ---
p630-6C4             639     624        886    843       842    2,172   ---
Improvement ratio    1.85    1.87       2.04   1.98      1.67   1.51    ---
Source: IBM eServer pSeries and IBM RS/6000 Performance Report
Greater improvement ratio shown for CPU-intensive benchmarks, i.e. SPEC CPU2000
Lower improvement ratio shown for memory-intensive benchmarks, i.e. LINPACK
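
The improvement ratios in the table are simply the new score divided by the old one. A short Python sketch, using the published scores copied from the table (the dictionary layout is an illustration only):

```python
# Published scores (higher is better): (44P-170, p630-6C4).
scores = {
    "int":      (346, 639),
    "int_base": (333, 624),
    "fp":       (434, 886),
    "fp_base":  (426, 843),
    "DP":       (503, 842),
    "TPP":      (1440, 2172),
}

ratios = {name: new / old for name, (old, new) in scores.items()}

for name, ratio in ratios.items():
    print(f"{name:<9} {ratio:.2f}")
```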

LINPACK Overview
LINear equations software PACKage
Developed by Dr. Jack Dongarra, University of Tennessee
Consists of algorithms that solve dense systems of linear equations using
Gaussian elimination
Uses a matrix of order 100 for the DP benchmark and a matrix of order 1000 for the TPP benchmark
Used by TOP500 Supercomputer sites (www.top500.org)
Used to test overall performance rather than just CPU clock rate
Memory reference and CPU usage patterns similar to Poly3D
Problems being solved similar to those of Poly3D

LINPACK Cont’d
Theoretical peak performance is determined by counting the number of floating-point operations (flops) that can be completed in one second
Theoretical peak performance does not take into account factors such as data movement between different levels of memory, cache misses, pipeline start-ups, memory load, bus speed, and others
Actual performance reflects those factors and also depends on application code efficiency, compiler optimization, operating system, hardware characteristics, etc.
                     DP          TPP         Theoretical Peak   Poly3D
                     (Mflop/s)   (Mflop/s)   (Mflop/s)          (seconds)
44P-170              503         1,440       1,800              150
p630-6C4             842         2,172       4,000              96
Improvement ratio    1.67        1.51        2.22               1.56
Source: Performance of Various Computers Using Standard Linear Equations Software, Dr. Jack Dongarra
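
To make the gap between theoretical peak and actual performance concrete, the LINPACK results can be expressed as a fraction of peak. The Python sketch below assumes 4 floating-point operations per cycle on both processors; that figure is not stated in the deck and is inferred here from the table (1,800 / 450 and 4,000 / 1000 both give 4).

```python
# Assumption: 4 flops per cycle, inferred from the table's peak figures.
FLOPS_PER_CYCLE = 4

systems = {
    # name: (clock in MHz, LINPACK DP, LINPACK TPP), rates in Mflop/s
    "44P-170":  (450, 503, 1440),
    "p630-6C4": (1000, 842, 2172),
}

efficiency = {}
for name, (clock_mhz, dp, tpp) in systems.items():
    peak = clock_mhz * FLOPS_PER_CYCLE          # theoretical peak in Mflop/s
    efficiency[name] = (dp / peak, tpp / peak)  # fraction of peak achieved
    print(f"{name:<9} peak {peak:>5} Mflop/s  DP {dp / peak:.0%}  TPP {tpp / peak:.0%}")
```

Under this assumption both machines reach only a fraction of their peak rate on LINPACK, and the p630 reaches a smaller fraction than the 44P-170, in line with its lower-than-clock-rate improvement ratios.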


Summary
Poly3D's memory reference pattern causes a high cache miss rate and extensive data movement between main memory and the CPU
The smaller L2 cache and a high L3 cache miss rate make Poly3D go to main memory more often on the p630 than on the 44P-170
A significant portion of execution time is bound by the speed of main memory
The total amount of memory used by Poly3D (about 100 MB) is greater than the system cache
Poly3D's improvement ratio (1.56) is consistent with LINPACK (1.51 to 1.67)
The difference between actual and peak performance for the p630 LINPACK benchmark is consistent with other systems
A single benchmark should not be used to judge the overall performance of a system. Rather, a set of specialized benchmarks can measure overall performance more accurately.

Sources
IBM eServer pSeries and IBM RS/6000 Performance Report (June 2003)
http://www.ibm.com/servers/eserver/pseries/hardware/system_perf.pdf
Performance of Various Computers Using Standard Linear Equations Software, Jack Dongarra
http://www.netlib.org/benchmark/performance.ps
Frequently Asked Questions on the Linpack Benchmark
http://www.netlib.org/utk/people/JackDongarra/faq-linpack.html
The LINPACK Benchmark: Past, Present, and Future, Dongarra, Luszczek, and Petitet
http://www.netlib.org/utk/people/JackDongarra/PAPERS/hpl.pdf
