    Universitat Politècnica de Catalunya

Measurement and Tools Project Report


  Dimemas and Multi-level
     Cache Simulations


  Author:                                 Supervisor:
  Mário Almeida                           Alejandro Ramirez Bellido




                  June 22, 2012
Contents

1 Introduction

2 Methodology
  2.1 Dimemas Simulation
  2.2 Multi-Level Cache Simulation

3 Results
  3.1 Dimemas Simulation
  3.2 Multi-Level Cache Simulation

4 Conclusions

A Used Scripts
  A.1 Dimemas instrumentation
      A.1.1 Generating Dimemas Configuration
      A.1.2 Running experiments
      A.1.3 Graph generator
      A.1.4 Generating graphs
  A.2 Pin tool instrumentation
      A.2.1 Generate and Compile Application and DCache tool
      A.2.2 Running the experiments
      A.2.3 Importing the results to a database
      A.2.4 Generating graphs




Abstract
      This report describes the simulation and benchmarking steps taken in
      order to predict the parallel performance of an application using
      Dimemas and cache-level simulations. Using Dimemas [3], the time
      behaviour of the NAS [1] integer sort benchmark was simulated for the
      architecture of the Barcelona supercomputer MareNostrum [4]. The
      performance was evaluated as a function of the architecture latency,
      bandwidth, connectivity and CPU speed. For the cache-level simulations,
      Intel's Pin tool was used to benchmark a simple parallel application as
      a function of the cache and cluster sizes.


1    Introduction
This report describes the simulation and benchmarking steps taken in order
to predict the parallel performance of an application using Dimemas [3] and
Cache-level simulations.
    Previous work was focused on benchmarking a PARSEC [2] ray-tracing
application on the multi-processor Boada server. For this purpose, EXTRAE
and Paraver [5] were used to instrument the application and provide a
detailed quantitative analysis of its performance.
    Following the study of measurement tools and techniques, this report
describes the usage of Dimemas to simulate the time behaviour of another
benchmark application on the Barcelona supercomputer MareNostrum. This
time, the traces used were taken from a NAS benchmark application, also run
on the Boada server. The performance of the application in this simulation
environment was evaluated as a function of the architecture latency,
bandwidth, connectivity and CPU speed.
    To conclude this study on performance analysis, Cache-Level Simulations
were performed using Intel's Pin tool. The chosen application was a simple
parallel application that performs distributed arithmetic operations. It
represents the typical Master-Slave paradigm with an embarrassingly parallel
workload. For evaluating the cache architecture, the total cache miss rates
per cache level were calculated as a function of the cache sizes,
associativity, number of threads and cluster size.


2    Methodology
This section presents the two simulation configurations: the Dimemas
simulation and the Multi-Level Cache simulation. Both subsections describe
the tools used, the configuration values and the metrics employed.

                          Boada Server
                          Bandwidth          1 Gb/s
                          Latency            6-10 us
                          Number of cores    12
                          RAM                24 GB

                    Table 1: Boada server configuration.

2.1    Dimemas Simulation
The application chosen for this experiment was the NAS Parallel Benchmark
integer sort. The NAS benchmark is a set of programs designed to help
evaluate the performance of parallel supercomputers. In this case, the
benchmark was run on the Boada server, whose attributes are described in
table 1.

    In order to perform an architecture simulation, it was decided to use the
MareNostrum supercomputer configuration, whose parameters are shown in
table 2. Note that a simplification was made, since it was considered that
each processor runs a single thread. Starting from MareNostrum's original
architecture, multiple simulations were performed changing its attributes.
For this purpose, the script in section A.1.1 was created to generate Dimemas
configuration files, and another (section A.1.2) to automate their variation.
The changed attributes of the simulated architecture were the latency, CPU
speed, bandwidth and number of buses. All the measurements were stored in an
sqlite3 database and then queried in order to automatically generate, with
gnuplot, the graphs (section A.1.3) presented in section 3.
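
The scripts in appendix A.1 implement this pipeline in full; the condensed
sketch below shows a single measurement step. The script and trace names
(config_gen, in/mpi_ping_32.trf) and the Dimemas3 flags are reconstructed from
the garbled listings in the appendix, so they should be read as an assumption
of what the pipeline looked like rather than as verbatim commands.

    #!/bin/bash
    # One measurement: 32 threads, 16 buses, default latency/bandwidth/CPU.
    NTHREADS=32; BUSES=16; LATENCY=0.000008; BANDWIDTH=250.0; CPU=1.0

    # Generate a Dimemas configuration for this parameter combination (A.1.1).
    ./config_gen in/mpi_ping_32.trf $NTHREADS $BUSES $LATENCY $BANDWIDTH $CPU > run.cfg

    # Simulate with Dimemas and keep its textual report (A.1.2).
    ./Dimemas3 -S 32K -pa run.prv run.cfg > run.detail

    # Extract the predicted execution time and append a CSV row.
    echo -n "$NTHREADS,$BUSES,$LATENCY,$BANDWIDTH,$CPU," >> res.csv
    grep Execution run.detail | awk '{print $3}' >> res.csv

    # Import the CSV into the sqlite3 table queried by the gnuplot scripts (A.1.3).
    sqlite3 res.db 'CREATE TABLE IF NOT EXISTS dimemas (procs INTEGER, buses INTEGER,
      latency REAL, bandwidth REAL, cpu REAL, runtime REAL);'
    printf '.separator ","\n.import res.csv dimemas\n' | sqlite3 res.db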

    To conclude, the changed attributes were iteratively fixed at chosen
near-optimal values in order to find a final architecture that needs fewer
resources while having execution times similar to the original MareNostrum
configuration.

2.2    Multi-Level Cache Simulation
To conclude this study on performance analysis, Cache-Level Simulations
were performed using Intel's Pin tool. The chosen application was a simple
parallel application that performs distributed arithmetic operations. It
represents the typical Master-Slave paradigm with an embarrassingly parallel
workload.
    For evaluating the cache architecture, the Pin dcache tool was changed in
order to support multiple levels of cache shared by parallel processors. The
implemented cache architecture is represented in figure 1. As one might infer
from the figure, the level-two cache is cluster shared and the level-three
cache is globally shared.


[Figure 1 diagram: processors P0-P7 each have a private 16 KB L1 cache and
share one 1 MB L2 cache; processors P8-P15 share a second 1 MB L2 cache; both
clusters share a single 4 MB L3 cache.]
Figure 1: Cache architecture for a cluster size of 8 and a total of 16 processors.
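
The tool in appendix A.2.1 regenerates and recompiles the dcache Pin tool for
each cache configuration and then runs the test application under it. As an
illustration, the baseline of figure 1 could be requested as follows; the
script name gen_and_run.sh is hypothetical (the listing is unnamed in this
report), and the 32-byte line sizes and associativity of 1 are the values
swept in appendix A.2.2 rather than values stated in the text.

    # Arguments: <clusterSize> <L1size(KB)> <L1line> <L1assoc>
    #            <L2size(MB)> <L2line> <L2assoc>
    #            <L3size(MB)> <L3line> <L3assoc> <nThreads>
    ./gen_and_run.sh 8 16 32 1  1 32 1  4 32 1  16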

    For these experiments, the total cache miss rates per cache level were
calculated as a function of the cache sizes, the number of processors and the
cluster size. Some experiments were also performed varying the cache
associativity and the number of cache lines per cache set.
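
Here, the total miss rate of a level is taken to be the aggregate ratio of
misses to accesses at that level, summed over all threads and over all caches
of that level; this matches the per-cache hit and miss counters printed by the
modified dcache tool in appendix A.2.1, but the exact aggregation is an
assumption, since the report does not spell it out:

\[
\text{MissRate}_L = 100 \times
\frac{\sum_{c} \text{Miss}_{L,c}}{\sum_{c} \left(\text{Hits}_{L,c} + \text{Miss}_{L,c}\right)}
\]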


3     Results
In this section the results of both experiments are described, together with
the resulting charts and a discussion of each.

3.1    Dimemas Simulation
Starting with the initial architecture of MareNostrum, the first experiment
consisted of varying the number of buses and observing its impact on the
execution time of our application. The results of this experiment are
depicted in figure 2.

    As can be observed in figure 2, the execution time decreases as the
number of buses increases. This result was expected, since this is a

[Plot: execution time (s) versus number of buses (0-20), one curve per
process count (2, 4, 8, 16, 32).]


Figure 2: Execution time of IntegerSort depending on the number of buses.

multi-threaded application in which data is transferred between threads, and
adding more buses increases the amount of data that can be transferred in
parallel. It can also be seen that from sixteen buses onwards the execution
time starts to stabilize. This is probably because most of the data to be
sent is already sent in parallel, so further increasing the number of buses
does not impact the performance.

    The second experiment consisted of varying the available bandwidth of the
initial MareNostrum configuration. The results are shown in figure 3.


[Plot: execution time (s) versus bandwidth (170-250 MB/s), one curve per
process count (2, 4, 8, 16, 32).]


Figure 3: Execution time of IntegerSort depending on the bandwidth (MB/s).



Figure 3 shows that the bandwidth has a bigger impact on performance if the
application is run on a smaller set of threads. For example, a variation of
40 MB/s can increase the execution time by 20 seconds for four threads, but
for 32 threads the changes are almost unnoticeable. This is probably due to
the fact that the master thread has to send the initial data to all slaves:
by increasing the number of slaves, the data can be divided into smaller
chunks that are sent in parallel, thus taking less time.

    The third experiment consisted of varying the processing capacity of the
CPU. As one can observe in figure 4, increasing the processing power of each
processor decreases the execution time. This impact is more noticeable for
processing capacities below 100%. This parameter leaves little room for
saving resources by decreasing the CPU power, since even a small decrease has
a big impact on the execution time.


[Plot: execution time (s) versus relative CPU speed (0-5), one curve per
process count (2, 4, 8, 16, 32).]


Figure 4: Execution time of IntegerSort depending on the available CPU
(%).

    To conclude the experiments on the variation of the architecture
parameters, figure 5 shows the impact of latency on the execution time.
    For figure 5 a logarithmic scale was chosen for the x axis, since changes
of the same order of magnitude as the initial MareNostrum configuration do
not have a significant impact on the execution time. The latency can be
increased to significantly larger values without affecting the performance
much, since the latency values in MareNostrum are very small. Only for
latency values close to 0.01 seconds do we start seeing bigger increases in
the execution time. This attribute should have a bigger impact for more
communication-intensive

[Plot: execution time (s) versus latency (1e-06 to 1 s, logarithmic scale),
one curve per process count (2, 4, 8, 16, 32).]


   Figure 5: Execution time of IntegerSort depending on the latency (s).

applications.




Figure 6: Execution time of IntegerSort depending on the number of threads.

    To conclude, table 2 presents the differences between two less
resource-demanding configurations and the original MareNostrum configuration,
all achieving similar execution times. The chosen number of threads was 32,
due to its better performance as shown in

        Parameters             MareNostrum    Config 1    Config 2
        CPU (%)                    1.0           0.95        0.9
        Latency (s)             0.000008        0.0001      0.001
        Bandwidth (MB/s)          250.0          240         230
        Number of buses           20+ *           16          16
        Execution time (s)        12.506        13.150      13.779

Table 2: Comparison between the execution times of the initial MareNostrum
configuration and two less resource-demanding configurations.

figure 6.
    Table 2 confirms the predictions made in the previous experiments. The
chosen values increase the execution time by at most around 1.3 seconds,
while reducing most parameters by around 10% and significantly increasing the
latency.

3.2    Multi-Level Cache Simulation
As previously mentioned, the chosen application was a simple parallel ap-
plication that performs distributed arithmetic operations. It represents the
typical Master-Slave paradigm with an embarrassingly parallel workload.

[Plot: total L2 miss rate (%) versus cluster size (2-8), one curve per thread
count (2, 4, 8, 16).]


Figure 7: Miss rate of the L2 cache as a function of the cluster size, for
L1, L2 and L3 sizes of 16 KB, 1 MB and 4 MB respectively.

    For evaluating the cache architecture, its configuration was varied along
multiple factors, such as the cluster size, the cache sizes and the cache
line sizes. To start these experiments, the cache architecture was set as
shown in figure 1: it has 16 processors, each with a private L1 cache of
16 KB. The level-two cache has 1 MB and is shared within a cluster of size 8.
Finally, the level-three cache is globally shared and has a size of 4 MB.
    The first experiment consisted of varying the cluster size, as shown in
figure 7, and verifying its impact on the L2 cache miss rate. As can be seen,
for the thread counts used in this experiment the impact of the cluster size
on the miss rates was not very significant. For up to 4 threads it has almost
no impact at all, but when the system has more than 8 threads it can reduce
the miss rate by about 2%. It is interesting to notice that, in this
experiment, the more threads share the same L2 cache, the lower the miss rate
becomes.
    Since most cache size configurations produced similar variations in the
cluster size experiment, the next step consisted of verifying the impact of
the cache sizes on the miss rates. The first step consisted of varying the
size of the non-shared L1 cache; its results are presented in figure 8.

[Plot: total L1 miss rate (%) versus L1 cache size (16-64 KB), one curve per
thread count (2, 4, 8, 16).]


    Figure 8: Miss rate of the L1 cache for a variable L1 cache size (KB).

    Looking at figure 8, it might seem strange that a smaller number of
threads has such a lower miss rate. This is due to the master/slave paradigm,
in which an increasing number of threads makes the data accesses more sparse.
For bigger numbers of threads the miss rates can reach values close to 15%.
As expected, bigger L1 cache sizes achieve smaller miss rates, although the
difference is not greater than 2%.
    Although the experiments were performed for more L1 cache sizes, in order
to study the impact of the L2 cache size the L1 cache size was fixed at
16 KB. The variation of the L2 cache size is presented in figure 9. As one
can observe, the miss rate of the L2 cache for 2 threads is high, being close
to 50%. This is probably a consequence of the low miss rate of the L1 cache:
the accesses

[Plot: total L2 miss rate (%) versus L2 cache size (1-4 MB), one curve per
thread count (2, 4, 8, 16).]


Figure 9: Miss rate of the L2 cache for a variable L2 cache size (MB) and an
L1 cache size of 16 KB.

that do not produce hits in L1 should have lower predictability. For bigger
numbers of threads the miss rates are still high, although they do not reach
values higher than 33%.

[Plot: total L3 miss rate (%) versus L3 cache size (4-16 MB), one curve per
thread count (2, 4, 8, 16).]


Figure 10: Miss rate of the L3 cache for a variable L3 cache size (MB) and an
L1 cache size of 16 KB.

    Finally, the impact of the L3 cache size on its miss rate is shown in
figure 10. It seems that accesses that do not produce hits in the previous
two levels of cache will hardly produce hits in the third level. The only
exception is the 2-thread case, for which the set of data accessed by each
thread is bigger. This probably shows that either the application does not
justify the use of three levels of cache, or the data accessed by each thread
at each moment is too small.


4    Conclusions
Dimemas allowed us to experiment with the theoretical performance of the
application on the MareNostrum architecture. Through the variation of each
different parameter it was possible to create graphs depicting their impact
on the execution time. By the end of the experiment it was possible to
suggest an architecture with fewer resources that achieves results similar to
the initial MareNostrum architecture. This architecture is presented in
table 2 and confirms the predictions made in the Dimemas experiments. The
chosen values increase the execution time by at most around 1.3 seconds,
while reducing most parameters by around 10% and significantly increasing the
latency.
    For the second experiment, the impact of the cluster size and cache sizes
was presented for a simple parallel arithmetic application. The experiments
showed that the impact of the cluster size on the miss rate was not very
significant; for more than 8 threads it can reduce the miss rate by about 2%.
Overall, the more threads share the same L2 cache, the lower the miss rate
becomes. This is due to the master/slave paradigm, in which an increasing
number of threads makes the data accesses more sparse. As expected, bigger L1
cache sizes achieve smaller miss rates. For big numbers of threads the miss
rates in the L2 cache were high, although they did not reach values higher
than 33%. In general, accesses that did not produce hits in the first two
levels of cache hardly produced hits in the third level. The experiments
showed that either the application does not justify the use of three levels
of cache, or the data accessed by each thread at each moment is too small.
    Scripting the experiments had a huge impact on the time needed to perform
them, since some of the experiments produced thousands of results. The
technique that proved most efficient was to script the generation of results,
output them to an SQL database and run queries over it to generate graphs
with gnuplot.


References
[1] NAS Parallel Benchmarks, http://www.nas.nasa.gov/publications/npb.html.

[2] PARSEC benchmark, http://parsec.cs.princeton.edu/.

[3] Dimemas, http://www.bsc.es/computer-sciences/performance-tools/dimemas.

[4] MareNostrum, http://en.wikipedia.org/wiki/MareNostrum.

[5] Paraver, http://www.bsc.es/computer-sciences/performance-tools/paraver.




A     Used Scripts

A.1    Dimemas instrumentation

A.1.1    Generating Dimemas Configuration

#!/bin/bash

if [ $# -ne 6 ]
then
    echo "$0: Wrong number of arguments."
    echo "$0: <input.trf> <nthreads> <nbuses> <latency> <bandwidth> <%cpuspeed>"
    exit 1
fi

cat begin_of_config

# Bandwidth definition
echo -e "\n\n\"environment information\" {\"\", 0, \"\", 128, $5, $3, 3};;\n"

# Latency and %cpu speed definitions
for (( i=0; i<=127; i++ ))
do
    echo "\"node information\" {0, $i, \"\", 1, 1, 1, 0.0, $4, $6, 0.0, 0.0};;"
done

# File name and number of processors definitions
echo ""
echo -n "\"mapping information\" {\"$1\", $2, [$2] "
echo -n "{0"

for (( i=1; i<=$2-1; i++ ))
do
    echo -n ",$i"
done

echo "}};;"

cat end_of_config



A.1.2    Running experiments

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

cat logo

echo "Removing out folder (force)"
rm -rf out

echo "Creating out folder"
mkdir out
mkdir out/cfg
mkdir out/prv
mkdir out/details
mkdir out/results

echo "Creating sqlite3 database"
sqlite3 out/results/res.db 'CREATE TABLE dimemas (procs INTEGER, buses INTEGER, latency REAL, bandwidth REAL, cpu REAL, runtime REAL);'

echo "Setting default values"
LATENCY="0.000008"
BANDWIDTH="250.0"
BUSES="0"
CPU="1.0"

for i in 02 04 08 16 32
do
    #echo "---------------------------------"
    if [ ${i:0:1} == 0 ]
    then
        #echo "Setting nthreads to ${i:1}"
        nthreads=${i:1}
    else
        #echo "Setting nthreads to ${i}"
        nthreads=$i
    fi

    echo -n "Generating results for $nthreads"

    # BUSES ------------------------------------------------------------
    for j in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
    do
        #echo "Generating configuration file for BUSES = $j"
        ./config_gen in/mpi_ping_$i.trf $nthreads $j $LATENCY $BANDWIDTH $CPU > out/cfg/config-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU.cfg
        #echo "Converting to paraver trace..."
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU.prv out/cfg/config-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU.cfg > out/details/detail-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU
        #echo "Outputing results."
        echo -n "$nthreads,$j,$LATENCY,$BANDWIDTH,$CPU," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU | awk '{print $3}' >> out/results/res-$nthreads.csv
    done

    echo -n "."

    # LATENCY ----------------------------------------------------------
    for j in 0.000001 0.00001 0.0001 0.001 0.01 0.1 1.0
    do
        ./config_gen in/mpi_ping_$i.trf $nthreads $BUSES $j $BANDWIDTH $CPU > out/cfg/config-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU.prv out/cfg/config-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU.cfg > out/details/detail-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU
        echo -n "$nthreads,$BUSES,$j,$BANDWIDTH,$CPU," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU | awk '{print $3}' >> out/results/res-$nthreads.csv
    done

    echo -n "."

    # BANDWIDTH --------------------------------------------------------
    for j in 250.0 245.0 240.0 235.0 230.0 225.0 220.0 215.0 210.0 205.0 200.0 195.0 190.0 185.0 180.0 175.0 170.0
    do
        ./config_gen in/mpi_ping_$i.trf $nthreads $BUSES $LATENCY $j $CPU > out/cfg/config-$nthreads-$BUSES-$LATENCY-$j-$CPU.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$BUSES-$LATENCY-$j-$CPU.prv out/cfg/config-$nthreads-$BUSES-$LATENCY-$j-$CPU.cfg > out/details/detail-$nthreads-$BUSES-$LATENCY-$j-$CPU
        echo -n "$nthreads,$BUSES,$LATENCY,$j,$CPU," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$BUSES-$LATENCY-$j-$CPU | awk '{print $3}' >> out/results/res-$nthreads.csv
    done

    echo -n "."

    # CPU SPEED ----------------------------------------------------------
    for j in 5.0 4.0 3.0 2.0 1.0 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05
    do
        ./config_gen in/mpi_ping_$i.trf $nthreads $BUSES $LATENCY $BANDWIDTH $j > out/cfg/config-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j.prv out/cfg/config-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j.cfg > out/details/detail-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j
        echo -n "$nthreads,$BUSES,$LATENCY,$BANDWIDTH,$j," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j | awk '{print $3}' >> out/results/res-$nthreads.csv
    done
    echo "."

    echo "Importing to database"
    echo ".separator \",\"" > out/results/command
    echo ".import out/results/res-${nthreads}.csv dimemas" >> out/results/command
    sqlite3 out/results/res.db < out/results/command
    rm out/results/command
done

echo "Generating best configuration 1"
./config_gen in/mpi_ping_32.trf 32 16 0.0001 240.0 0.95 > out/cfg/config-32-16-0.0001-240.0-0.95.cfg
./Dimemas3 -S 32K -pa out/prv/paraver-32-16-0.0001-240.0-0.95.prv out/cfg/config-32-16-0.0001-240.0-0.95.cfg > out/details/detail-32-16-0.0001-240.0-0.95
echo -n "32,16,0.0001,240.0,0.95," > out/results/optimal.csv
grep Execution out/details/detail-32-16-0.0001-240.0-0.95 | awk '{print $3}' >> out/results/optimal.csv

echo "Generating best configuration 2"
./config_gen in/mpi_ping_16.trf 16 16 0.0001 230.0 0.9 > out/cfg/config-16-16-0.0001-230.0-0.9.cfg
./Dimemas3 -S 32K -pa out/prv/paraver-16-16-0.0001-230.0-0.9.prv out/cfg/config-16-16-0.0001-230.0-0.9.cfg > out/details/detail-16-16-0.0001-230.0-0.9
echo -n "16,16,0.0001,230.0,0.9," >> out/results/optimal.csv
grep Execution out/details/detail-16-16-0.0001-230.0-0.9 | awk '{print $3}' >> out/results/optimal.csv

./graph_all buses
./graph_all cpu
./graph_all bandwidth
./graph_all latency

echo "All done!"


A.1.3    Graph generator

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

latency="0.000008"
bandwidth="250.0"
buses="0"
cpu="1.0"
aux=""
aux2=""

if [ "$1" == "latency" ]
then
    comp=$latency
    aux="set log x"
    aux2="set mxtics 10"
fi
if [ "$1" == "bandwidth" ]
then
    comp=$bandwidth
fi
if [ "$1" == "buses" ]
then
    comp=$buses
fi
if [ "$1" == "cpu" ]
then
    comp=$cpu
fi

echo "Generating Graph"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0              # dashed
set style line 81 lt rgb "#808080"  # grey

set grid back linestyle 81
set border 3 back linestyle 80  # Remove border on top and right. These
    # borders are useless and make it harder
    # to see plotted lines near the border.
    # Also, put it in grey; no need for so much emphasis on a border.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10    # Makes log scale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 lt rgb "#A00000" lw 2 pt 1
set style line 2 lt rgb "#00A000" lw 2 pt 6
set style line 3 lt rgb "#5060D0" lw 2 pt 2
set style line 4 lt rgb "#F25900" lw 2 pt 9
set style line 5 lw 2 pt 9

#set key top right

#set xrange [0:1]
#set yrange [0:1]

#plot "template.dat" \
#  index 0 title "Example line" w lp ls 1, \
#  "" index 1 title "Another example" w lp ls 2

#set style data lines
set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "Execution time with variable $1"
set xlabel "$1"
$aux
$aux2
set ylabel "Execution time (s)"

plot "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 2 UNION select $1, runtime from dimemas where procs = 2 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 1 title '#Procs = 2', \
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 4 UNION select $1, runtime from dimemas where procs = 4 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 2 title '#Procs = 4', \
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 8 UNION select $1, runtime from dimemas where procs = 8 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 3 title '#Procs = 8', \
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 16 UNION select $1, runtime from dimemas where procs = 16 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 with lines ls 4 title '#Procs = 16', \
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 32 UNION select $1, runtime from dimemas where procs = 32 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 5 title '#Procs = 32'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 10cm,20cm
set output "out/results/$1.pdf"
replot
EOF

echo "Done"


A.1.4    Generating graphs

./graph_all buses
./graph_all latency
./graph_all cpu
./graph_all bandwidth

echo "Generating Graph"
gnuplot << EOF
set datafile separator ","
set nokey

set title "Execution time depending on the number of threads"
set xlabel "Number of threads"

set xtics (0, 2, 4, 8, 16, 32, 34)

set ylabel "Execution time (s)"

set style line 1 lt rgb "#A00000" lw 50

plot "out/results/comparisonThreads.csv" using 1:2 with imp ls 1

set term postscript eps enhanced color
set output "out/results/comparison.pdf"
replot
EOF



     A.2         Pin tool instrumentation
     A.2.1        Generate and Compile Application and DCache tool

 1   #! / b i n / b a s h
 2
 3 #c l u s t e r S i z e
 4 #        c o n s t UINT32 c a c h e S i z e = 256∗KILO ;
 5 #        c o n s t UINT32 l i n e S i z e = 1 ;
 6 #        c o n s t UINT32 a s s o c i a t i v i t y = 2 5 6 ;
 7 # l u s t e r S i z e > <L 1 c a c h e s i z e > <L 1 l i n e S i z e > <L1assoc> <L 2 c a c h e s i z e
    <c
        > <L 2 l i n e S i z e > <L2assoc> <L 3 c a c h e s i z e > <L 3 l i n e S i z e > <L3assoc>
          <nThreads>
 8#             $1                      $2                      $3            $4              $5
                                 $6              $7                   $8               $9
        $10         $11
 9
10 i f [ $# −ne 11 ]
11 then
12       echo ” $0 : Wrong number o f arguments . ”
13       echo ” $0 : <c l u s t e r S i z e > <L 1 c a c h e s i z e > <L 1 l i n e S i z e > <L1assoc
                > <L 2 c a c h e s i z e > <L 2 l i n e S i z e > <L2assoc> <L 3 c a c h e s i z e > <
                 L 3 l i n e S i z e > <L3assoc> <nThreads>”
14        exit 1
15 f i
16
17 threadsAndMaster=$ ( ( $ {11} −1) )
18 #echo ” TreadsAndMaster = $threadsAndMaster ”
19
20 #echo −n ”INPUT=”
21 #echo ” $1 $2 $3 $4 $5 $6 $7 $8 $9 $ {10} $ {11}”
22
23 #echo ” S a v i n g backup o f dcache f i l e ”
24 mv −f dcache . cpp dcache backup . cpp


                                                      20
25
26   echo ”
27   #i n c l u d e <i o s t r e a m >
28   #i n c l u d e <f s t r e a m >
29   #i n c l u d e <c a s s e r t >
30
31   #i n c l u d e  ” p i n .H”
32
33
34   t y p e d e f UINT32 CACHE STATS ; // type o f c a c h e h i t / m i s s c o u n t e r s
35
36   #i n c l u d e  ” p i n c a c h e .H”
37
38   KNOB t r i n g > KnobOutputFile (KNOB MODE WRITEONCE,
         <s                                                                            ” p i n t o o l ” ,
39        ” o ” ,  ” a l l c a c h e . out  ” ,  ” s p e c i f y dcache f i l e name” ) ;
40
41   PIN LOCK l o c k ;
42
43   INT32 numThreads = 0 ;
44   c o n s t INT32 MaxNumThreads = $11 ;
45   c o n s t INT32 c l u s t e r S i z e = $1 ;
46
47   s t r u c t THREAD DATA
48   {
49           UINT64 H i t s ;
50           UINT64 Miss ;
51   };
52
53   THREAD DATA l 1 c o u n t [ MaxNumThreads ] ;
54   THREAD DATA l 2 c o u n t [ c l u s t e r S i z e ] ;
55
56   VOID T h r e a d S t a r t (THREADID t h r e a d i d , CONTEXT ∗ c t x t , INT32 f l a g s ,
        VOID ∗v )
57   {
58       GetLock(& l o c k , t h r e a d i d +1) ;
59       numThreads++;
60       R e l e a s e L o c k (& l o c k ) ;
61
62         ASSERT( numThreads <= MaxNumThreads ,  ”Maximum number o f
               t h r e a d s e x c e e d e d n” ) ;
63 }
64
65 namespace DL1
66 {
67     // 1 s t l e v e l data c a c h e : 32 kB , 32 B l i n e s , 32−way
             associative
68     c o n s t UINT32 c a c h e S i z e = $2 ∗KILO ;
69     c o n s t UINT32 l i n e S i z e = $3 ;
70     c o n s t UINT32 a s s o c i a t i v i t y = $4 ;


                                                        21
71       c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC
                 : : STORE NO ALLOCATE;
 72
 73       c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗
                associativity ) ;
 74       c o n s t UINT32 m a x a s s o c i a t i v i t y = a s s o c i a t i v i t y ;
 75
 76       t y p e d e f CACHE ROUND ROBIN( max sets , m a x a s s o c i a t i v i t y ,
                a l l o c a t i o n ) CACHE;
 77 }
 78 LOCALVAR DL1 : : CACHE d l 1 (  ”L1 Data Cache  ” , DL1 : : c a c h e S i z e , DL1 : :
        l i n e S i z e , DL1 : : a s s o c i a t i v i t y ) ;
 79
 80 namespace UL2
 81 {
 82      // 2nd l e v e l u n i f i e d c a c h e : 2 MB, 64 B l i n e s , d i r e c t mapped
 83       c o n s t UINT32 c a c h e S i z e = $5 ∗MEGA;
 84       c o n s t UINT32 l i n e S i z e = $6 ;
 85       c o n s t UINT32 a s s o c i a t i v i t y = $7 ;
 86       c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC
                 : : STORE ALLOCATE;
 87
 88       c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗
                associativity ) ;
 89
 90       t y p e d e f CACHE DIRECT MAPPED( max sets , a l l o c a t i o n ) CACHE;
 91 }
 92 LOCALVAR UL2 : : CACHE u l 2 (  ”L2 C l u s t e r −s h a r e d Cache  ” , UL2 : :
        c a c h e S i z e , UL2 : : l i n e S i z e , UL2 : : a s s o c i a t i v i t y ) ;
 93
 94 namespace UL3
 95 {
 96      // 3 rd l e v e l u n i f i e d c a c h e : 16 MB, 64 B l i n e s , d i r e c t mapped
 97       c o n s t UINT32 c a c h e S i z e = $8 ∗MEGA;
 98       c o n s t UINT32 l i n e S i z e = $9 ;
 99       c o n s t UINT32 a s s o c i a t i v i t y = $ { 1 0 } ;
100       c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC
                 : : STORE ALLOCATE;
101
102       c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗
                associativity ) ;
103
104       t y p e d e f CACHE DIRECT MAPPED( max sets , a l l o c a t i o n ) CACHE;
105 }
106 LOCALVAR UL3 : : CACHE u l 3 (  ”L3 G l o b a l l y −s h a r e d Cache ” , UL3 : :
        c a c h e S i z e , UL3 : : l i n e S i z e , UL3 : : a s s o c i a t i v i t y ) ;
107
108 LOCALFUN VOID F i n i ( i n t code , VOID ∗ v )
109 {


                                                       22
110         s t d : : o f s t r e a m out ( KnobOutputFile . Value ( ) . c s t r ( ) ) ;
111
112           out <<
113                   ”#n”
114                   ”# DCACHE s t a t s n ”
115                   ”#n” ;
116
117         out << d l 1 ;
118         out << u l 2 ;
119         out << u l 3 ;
120
121         out . c l o s e ( ) ;
122
123 f o r ( i n t i =0; i <numThreads ; i ++)
124     {
125     p r i n t f (  ”%d L1 H i t s : %I64d n” , i , ( u n s i g n e d i n t ) l 1 c o u n t [ i ] . H i t s
               );
126     p r i n t f (  ”%d L1 Miss : %I64d n” , i , ( u n s i g n e d i n t ) l 1 c o u n t [ i ] . Miss
               );
127     p r i n t f (  ”%d L1 Hit r a t e : %f nn” , i , ( 1 0 0 . 0 ∗ l 1 c o u n t [ i ] . H i t s / (
               l 1 c o u n t [ i ] . H i t s+l 1 c o u n t [ i ] . Miss ) ) ) ;
128 }
129
130 f o r ( i n t i =0; i <c l u s t e r S i z e ; i ++)
131     {
132     p r i n t f (  ”%d L2 H i t s : %I64d n” , i , ( u n s i g n e d i n t ) l 2 c o u n t [ i ] . H i t s
               );
133     p r i n t f (  ”%d L2 Miss : %I64d n” , i , ( u n s i g n e d i n t ) l 2 c o u n t [ i ] . Miss
               );
134     p r i n t f (  ”%d L2 Hit r a t e : %f nn” , i , ( 1 0 0 . 0 ∗ l 2 c o u n t [ i ] . H i t s / (
               l 2 c o u n t [ i ] . H i t s+l 2 c o u n t [ i ] . Miss ) ) ) ;
135 }
136 }
137
138 LOCALFUN VOID U l 2 A c c e s s (ADDRINT addr , UINT32 size , CACHE BASE : :
          ACCESS TYPE accessType , THREADID t i d )
139 {
140         // s e c o n d l e v e l u n i f i e d c a c h e
141         c o n s t BOOL d l 2 H i t = u l 2 . A c c e s s ( addr , size , a c c e s s T y p e ) ;
142
143         // t h i r d l e v e l u n i f i e d c a c h e
144     i n t c i d = t i d / ( MaxNumThreads/ c l u s t e r S i z e ) ;
145         i f ( ! dl2Hit )
146     {
147         GetLock(& l o c k , t i d +1) ;
148         l 2 c o u n t [ c i d ] . Miss++;
149         R e l e a s e L o c k (& l o c k ) ;
150         u l 3 . A c c e s s ( addr , size , a c c e s s T y p e ) ;
151     } else


                                                          23
152      l 2 c o u n t [ c i d ] . H i t s ++;
153 }
154
155 LOCALFUN VOID MemRefMulti (ADDRINT addr , UINT32 size , CACHE BASE
        : : ACCESS TYPE accessType , THREADID t i d )
156 {
157      // f i r s t l e v e l D−c a c h e
158      c o n s t BOOL d l 1 H i t = d l 1 . A c c e s s ( addr , size , a c c e s s T y p e ) ;
159
160              i f ( ! dl1Hit ) {
161      l 1 c o u n t [ t i d ] . Miss++;
162      U l 2 A c c e s s ( addr , size , accessType , t i d ) ;
163   }
164   else
165   {
166      l 1 c o u n t [ t i d ] . H i t s ++;
167   }
168 }
169
170 LOCALFUN VOID MemRefSingle (ADDRINT addr , UINT32 size , CACHE BASE
        : : ACCESS TYPE accessType , THREADID t i d )
171 {
172      // f i r s t l e v e l D−c a c h e
173      c o n s t BOOL d l 1 H i t = d l 1 . A c c e s s S i n g l e L i n e ( addr , a c c e s s T y p e ) ;
174
175              i f ( ! dl1Hit ) {
176      l 1 c o u n t [ t i d ] . Miss++;
177      U l 2 A c c e s s ( addr , size , accessType , t i d ) ;
178   }
179   else
180   {
181      l 1 c o u n t [ t i d ] . H i t s ++;
182   }
183 }
184
185 LOCALFUN VOID I n s t r u c t i o n ( INS i n s , VOID ∗v )
186 {
187      i f ( INS IsMemoryRead ( i n s ) )
188      {
189              c o n s t UINT32 s i z e = INS MemoryReadSize ( i n s ) ;
190              c o n s t AFUNPTR countFun = ( s i z e <= 4 ? (AFUNPTR)
                       MemRefSingle : (AFUNPTR) MemRefMulti ) ;
191
192              // o n l y p r e d i c a t e d −on memory i n s t r u c t i o n s a c c e s s D−c a c h e
193              INS InsertPredicatedCall (
194                      i n s , IPOINT BEFORE , countFun ,
195                     IARG MEMORYREAD EA,
196                     IARG MEMORYREAD SIZE,
197                      IARG UINT32 , CACHE BASE : : ACCESS TYPE LOAD,


                                                        24
198                     IARG THREAD ID ,
199               IARG END) ;
200           }
201
202           i f ( INS IsMemoryWrite ( i n s ) )
203           {
204                 c o n s t UINT32 s i z e = INS MemoryWriteSize ( i n s ) ;
205                 c o n s t AFUNPTR countFun = ( s i z e <= 4 ? (AFUNPTR)
                          MemRefSingle : (AFUNPTR) MemRefMulti ) ;
206
207                 // o n l y p r e d i c a t e d −on memory i n s t r u c t i o n s a c c e s s D−c a c h e
208                 INS InsertPredicatedCall (
209                      i n s , IPOINT BEFORE , countFun ,
210                     IARG MEMORYWRITE EA,
211                     IARG MEMORYWRITE SIZE,
212                     IARG UINT32 , CACHE BASE : : ACCESS TYPE STORE,
213               IARG THREAD ID ,
214                     IARG END) ;
215           }
216   }
217
218   GLOBALFUN i n t main ( i n t argc , c h a r ∗ argv [ ] )
219   {
220       P I N I n i t ( argc , argv ) ;
221
222           f o r ( INT32 t =0; t<MaxNumThreads ; t++)
223                   {
224           l1count [ t ] . Hits = 0;
225           l 1 c o u n t [ t ] . Miss =0;
226       }
227
228           f o r ( i n t i =0; i <c l u s t e r S i z e ; i ++)
229           {
230       l 2 c o u n t [ i ] . H i t s =0;
231       l 2 c o u n t [ i ] . Miss =0;
232       }
233
234           PIN AddThreadStartFunction ( ThreadStart , 0 ) ;
235           INS AddInstrumentFunction ( I n s t r u c t i o n , 0 ) ;
236           PIN AddFiniFunction ( F i n i , 0 ) ;
237
238           // Never r e t u r n s
239           PIN StartProgram ( ) ;
240
241       return 0 ; // make c o m p i l e r happy
242   }” > dcache . cpp
243
244   make > makeres
245


                                                           25
echo "
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct
{
    double *a;
    double *b;
    double  sum;
    int     veclen;
} DOTDATA;

#define NUMTHRDS $threadsAndMaster
#define VECLEN 1000000

DOTDATA dotstr;
pthread_t callThd[NUMTHRDS];
pthread_mutex_t mutexsum;

void *dotprod(void *arg)
{
    int i, start, end, len;
    long offset;
    // printf(\"%d\n\", (int) arg);
    double mysum, *x, *y;
    offset = (long) arg;

    // Extrae_eventandcounters(1, 8);

    len = dotstr.veclen;
    // printf(\"%d\n\", len);
    start = offset * (len / NUMTHRDS);
    end   = start + (len / NUMTHRDS);
    x = dotstr.a;
    y = dotstr.b;

    mysum = 0;
    for (i = start; i < end; i++)
        mysum += (x[i] * y[i]);

    // Extrae_eventandcounters(1, 9);

    pthread_mutex_lock(&mutexsum);

    // Extrae_eventandcounters(1, 10);
    dotstr.sum += mysum;
    // Extrae_eventandcounters(1, 11);

    pthread_mutex_unlock(&mutexsum);

    // Extrae_eventandcounters(1, 0);

    // pthread_exit((void *) 0);
}

int main(int argc, char *argv[])
{
    long i;
    double *a, *b;
    void *status;
    pthread_attr_t attr;

    clock_t begin, end;
    double time_spent;

    begin = clock();
    // Extrae_init();

    // Extrae_eventandcounters(1, 1);

    a = (double *) malloc(NUMTHRDS * VECLEN * sizeof(double));
    b = (double *) malloc(NUMTHRDS * VECLEN * sizeof(double));

    // Extrae_eventandcounters(1, 2);

    for (i = 0; i < VECLEN * NUMTHRDS; i++)
    {
        a[i] = 1;
        b[i] = a[i];
    }

    dotstr.veclen = VECLEN;
    dotstr.a = a;
    dotstr.b = b;
    dotstr.sum = 0;

    // Extrae_eventandcounters(1, 3);

    pthread_mutex_init(&mutexsum, NULL);

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    for (i = 0; i < NUMTHRDS; i++)
    {
        // Extrae_eventandcounters(1, 4);
        pthread_create(&callThd[i], &attr, dotprod, (void *) i);
        // Extrae_eventandcounters(1, 3);
    }

    pthread_attr_destroy(&attr);
    // Extrae_eventandcounters(1, 5);

    for (i = 0; i < NUMTHRDS; i++)
    {
        // Extrae_eventandcounters(1, 6);
        pthread_join(callThd[i], &status);
        // Extrae_eventandcounters(1, 7);
    }

    printf(\"Sum = %f\n\", dotstr.sum);
    free(a);
    free(b);

    end = clock();
    time_spent = (double) (end - begin) / CLOCKS_PER_SEC;
    printf(\"Execution time: %f\n\", time_spent);

    // Extrae_fini();

    pthread_mutex_destroy(&mutexsum);
    pthread_exit(NULL);
}
" > dotprod.c

#echo "Compiling dotprod"
gcc -o dotprod dotprod.c -lpthread

#echo "Running pin tool"
cd /scratch/boada-1/etm022/pin
./pin -t /scratch/boada-1/etm022/pin/source/tools/Memory/obj-intel64/dcache.so -- /scratch/boada-1/etm022/pin/source/tools/Memory/dotprod > /scratch/boada-1/etm022/pin/source/tools/Memory/results/res-$1-$2-$3-$4-$5-$6-$7-$8-$9-${10}-${11}.res

mv allcache.out /scratch/boada-1/etm022/pin/source/tools/Memory/results/res-$1-$2-$3-$4-$5-$6-$7-$8-$9-${10}-${11}.allcache

cd /scratch/boada-1/etm022/pin/source/tools/Memory
echo "done!"
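The listing above is the tail of the generation script, which the driver in section A.2.2 invokes as ./genMakeCPP with eleven positional arguments (cluster size, the three L1/L2/L3 cache parameters, and the thread count). As a minimal sketch, assuming the listing is saved as genMakeCPP as the driver expects, one configuration from the sweep could also be generated and measured by hand:

    # One configuration taken from the A.2.2 sweep values: cluster of 8,
    # 16 KB L1 (32 B lines, direct mapped), 1 MB L2, 4 MB L3, 16 threads.
    ./genMakeCPP 8 16 32 1 1 32 1 4 32 1 16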



      A.2.2        Running the experiments


#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#
# clusterSize
#     const UINT32 cacheSize = 256*KILO;
#     const UINT32 lineSize = 1;
#     const UINT32 associativity = 256;
# <clusterSize> <L1cachesize> <L1lineSize> <L1assoc> <L2cachesize> <L2lineSize> <L2assoc> <L3cachesize> <L3lineSize> <L3assoc> <nThreads>
#      $1            $2             $3          $4        $5             $6          $7        $8             $9          $10        $11

rm -rf results
mkdir results

total=$((3 * 4 * 3 * 3 * 2 * 3 * 2))
n=0
res1=$(date +%s.%N)

# clusterSize
for cs in 2 4 8
do
  for mt in 2 4 8 16
  do
    # L1cacheSize
    for l1c in 16 32 64
    do
      # L1lineSize
      for l1l in 32 #64 128
      do
        # L1assoc
        for l1a in 1 #2 4
        do
          # L2cacheSize
          for l2c in 1 2 4
          do
            # L2lineSize
            for l2l in 32 64 #128
            do
              # L2assoc
              for l2a in 1 #2 4
              do
                # L3cacheSize
                for l3c in 4 8 16
                do
                  # L3lineSize
                  for l3l in 32 64 #128
                  do
                    # L3assoc
                    for l3a in 1 #2 4
                    do
                      clear
                      cat logo
                      echo "--------------------------by aknahs"
                      echo -n "Generating [$n/$total]... "
                      res2=$(date +%s.%N)
                      printf "Elapsed: %.3F\n" $(echo "$res2 - $res1" | bc)

                      n=$(($n + 1))
                      #echo "----------------------------------------------------------------"
                      #echo "Generating CPP and Make"
                      #echo "./genMakeCPP $cs $l1c $l1l $l1a $l2c $l2l $l2a $l3c $l3l $l3a $mt"
                      ./genMakeCPP $cs $l1c $l1l $l1a $l2c $l2l $l2a $l3c $l3l $l3a $mt
                      echo "."
                    done
                  done
                done
              done
            done
          done
        done
      done
    done
  done
done
echo "all done."

grep "Total Miss Rate" results/*.allcache | awk 'BEGIN{n=0; printf "Cluster Size,L1 Cache Size,L1 Line Size,L1 Association,L2 Cache Size,L2 Line Size,L2 Association,L3 Cache Size,L3 Line Size,L3 Association,Number of threads,Total Miss Caches\n"} {split($1,a,"."); split(a[1],b,"-"); printf "%d,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n", n%3+1, b[2], b[3], b[4], b[5], b[6], b[7], b[8], b[9], b[10], b[11], b[12], $5; ++n}' >> results/brutaldb.csv
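After the sweep finishes, the grep/awk step above flattens every *.allcache summary into results/brutaldb.csv. A quick sanity check before importing the file (a minimal sketch; the file name and header are the ones produced by the awk BEGIN block above) is:

    head -n 1 results/brutaldb.csv   # the CSV header written by the awk BEGIN block
    wc -l results/brutaldb.csv       # one row per "Total Miss Rate" line found, plus the header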




A.2.3        Importing the results to a database

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

rm -rf power
mkdir power

sqlite3 power/res.db 'CREATE TABLE res (cachelevel INTEGER, cluster INTEGER, l1size INTEGER, l1line INTEGER, l1assoc INTEGER, l2size INTEGER, l2line INTEGER, l2assoc INTEGER, l3size INTEGER, l3line INTEGER, l3assoc INTEGER, threads INTEGER, missrate REAL);'

echo "Importing to database"
echo ".separator \",\"" > power/command
echo ".import brutaldb.csv res" >> power/command
sqlite3 power/res.db < power/command
rm power/command

echo "done"

./graphall
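Once the import has run, the database can be inspected directly before any plots are generated. A minimal check (hypothetical session; the table and column names are those created by the script above) is:

    sqlite3 power/res.db 'select count(*) from res;'
    sqlite3 power/res.db 'select l1size, missrate from res where cachelevel = 1 and threads = 16 limit 5;'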



     A.2.4        Generating graphs

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

#sqlite3 power/res.db 'CREATE TABLE res (cachelevel INTEGER, cluster INTEGER, l1size INTEGER, l1line INTEGER, l1assoc INTEGER, l2size INTEGER, l2line INTEGER, l2assoc INTEGER, l3size INTEGER, l3line INTEGER, l3assoc INTEGER, threads INTEGER, missrate REAL);'

mkdir power/cluster2
mkdir power/cluster4
mkdir power/cluster8

# for instrumentation level
for set in 1 2 3
do
# for each cluster size
for cs in 2 4 8
do
# for each level of cache
for l in 1 2 3
do

if [ $set == 1 ]
then
filename="power/cluster${cs}/L${l}MissRate-L1Size16-cluster${cs}"
sql="select l${l}size, missrate from res where cluster = $cs and l1size = 16 and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache ${l} per L${l} size (Lsize=[16,.,.])"
xlabel="Size of cache L${l}"
fi

if [ $set == 2 ]
then
filename="power/cluster${cs}/L${l}MissRate-cluster${cs}"
sql="select l${l}size, missrate from res where cluster = $cs and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache ${l} per L${l} size"
xlabel="Size of cache L${l}"
fi

if [ $set == 3 ]
then
filename="power/cluster${cs}/L${l}MissRate-L1Size16-L2size1-cluster${cs}"
sql="select l${l}size, missrate from res where cluster = $cs and l1size = 16 and l2size = 1 and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache L${l} per L${l} size (Lsize=[16,1,.])"
xlabel="Size of cache L${l}"
fi

if [[ $set = 1 && $l = 1 ]]
then
    continue
fi

if [[ $set == 3 && ( $l == 1 || $l == 2 ) ]]
then
    continue
fi

echo "Generating Graph for set $set on cache level $l"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0              # dashed
set style line 81 lt rgb "#808080"  # grey

set grid back linestyle 81
set border 3 back linestyle 80  # Remove border on top and right. These
                                # borders are useless and make it harder
                                # to see plotted lines near the border.
    # Also, put it in grey; no need for so much emphasis on a border.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10    # Makes log scale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 lt rgb "#A00000" lw 2 ps 1 pt 1
set style line 2 lt rgb "#00A000" lw 2 ps 1 pt 6
set style line 3 lt rgb "#5060D0" lw 2 ps 1 pt 2
set style line 4 lt rgb "#F25900" lw 2 ps 1 pt 9

#set key top right

#set xrange [0:1]
set yrange [0:100]

#plot "template.dat" \
#  index 0 title "Example line" w lp ls 1, \
#"" index 1 title "Another example" w lp ls 2

#set style data lines
#set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with points ls 1 title '#procs = 2', \
  "< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with points ls 2 title '#procs = 4', \
  "< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with points ls 3 title '#procs = 8', \
  "< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with points ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 30cm,15cm
set output "${filename}.pdf"
replot
EOF
done
done
done

echo "Done"

filename="power/L2MissRate-L1Size32-L2size4-l3size4-varCluster"
sql="select cluster, missrate from res where l1size = 32 and l2size = 4 and l3size = 4 and cachelevel = 2 and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache L2 per cluster size (Lsize=[32,4,4])"
xlabel="Cluster size"

echo "Generating Graph for set variable clusters"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0              # dashed
set style line 81 lt rgb "#808080"  # grey

set grid back linestyle 81
set border 3 back linestyle 80  # Remove border on top and right. These
                                # borders are useless and make it harder
                                # to see plotted lines near the border.
    # Also, put it in grey; no need for so much emphasis on a border.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10    # Makes log scale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 ps 1 pt 1
set style line 2 ps 1 pt 6
set style line 3 ps 1 pt 2
set style line 4 ps 1 pt 9

#set key top right

#set xrange [0:1]
#set yrange [0:1]

#plot "template.dat" \
#  index 0 title "Example line" w lp ls 1, \
#"" index 1 title "Another example" w lp ls 2

#set style data lines
#set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with lp ls 1 title '#procs = 2', \
  "< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with lp ls 2 title '#procs = 4', \
  "< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with lp ls 3 title '#procs = 8', \
  "< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with lp ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 30cm,15cm
set output "${filename}.pdf"
replot
EOF

filename="power/L2MissRate-L1Size16-L2size1-l3size4-varCluster"
sql="select cluster, missrate from res where l1size = 16 and l2size = 1 and l3size = 4 and cachelevel = 2 and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache L2 per cluster size (Lsize=[16,1,4])"
xlabel="Cluster size"

echo "Generating Graph for set variable clusters"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0              # dashed
set style line 81 lt rgb "#808080"  # grey

set grid back linestyle 81
set border 3 back linestyle 80  # Remove border on top and right. These
                                # borders are useless and make it harder
                                # to see plotted lines near the border.
    # Also, put it in grey; no need for so much emphasis on a border.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10    # Makes log scale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 ps 1 pt 1
set style line 2 ps 1 pt 6
set style line 3 ps 1 pt 2
set style line 4 ps 1 pt 9

#set key top right

#set xrange [0:1]
#set yrange [0:1]

#plot "template.dat" \
#  index 0 title "Example line" w lp ls 1, \
#"" index 1 title "Another example" w lp ls 2

#set style data lines
#set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with lp ls 1 title '#procs = 2', \
  "< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with lp ls 2 title '#procs = 4', \
  "< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with lp ls 3 title '#procs = 8', \
  "< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with lp ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 30cm,15cm
set output "${filename}.pdf"
replot
EOF
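Taken together, the pin-tool experiments in this appendix are driven in the order sketched below. This is a hedged summary: ./graphall is the name used in section A.2.3, while the names runall and importdb for the A.2.2 and A.2.3 listings are assumptions for illustration only.

    ./runall      # A.2.2 (name assumed): sweep the cache configurations and build results/brutaldb.csv
    ./importdb    # A.2.3 (name assumed): create power/res.db, import the CSV, then call ./graphall
    ./graphall    # A.2.4: query power/res.db and render the PDF graphs with gnuplot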




                                                    37

Weitere ähnliche Inhalte

Andere mochten auch (6)

Cache memory
Cache memoryCache memory
Cache memory
 
Cache memory
Cache memoryCache memory
Cache memory
 
Cache memory
Cache memoryCache memory
Cache memory
 
cache memory
cache memorycache memory
cache memory
 
Cache memory
Cache memoryCache memory
Cache memory
 
Cache memory presentation
Cache memory presentationCache memory presentation
Cache memory presentation
 

Ähnlich wie Dimemas and Multi-Level Cache Simulations

Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIijtsrd
 
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Databricks
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...Maria Stylianou
 
Lecture 3
Lecture 3Lecture 3
Lecture 3Mr SMAK
 
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Qualcomm Developer Network
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Vincenzo Gulisano
 
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009James McGalliard
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterSudhang Shankar
 
work load characterization
work load characterizationwork load characterization
work load characterizationRaghu Golla
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPAnil Bohare
 

Ähnlich wie Dimemas and Multi-Level Cache Simulations (20)

Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
 
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
FrackingPaper
FrackingPaperFrackingPaper
FrackingPaper
 
Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
 
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
 
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC cluster
 
work load characterization
work load characterizationwork load characterization
work load characterization
 
Dsp lab manual 15 11-2016
Dsp lab manual 15 11-2016Dsp lab manual 15 11-2016
Dsp lab manual 15 11-2016
 
cloud15micros
cloud15microscloud15micros
cloud15micros
 
Fulltext
FulltextFulltext
Fulltext
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
Performance_Programming
Performance_ProgrammingPerformance_Programming
Performance_Programming
 
Introduction to Microcontrollers
Introduction to MicrocontrollersIntroduction to Microcontrollers
Introduction to Microcontrollers
 
UDP Report
UDP ReportUDP Report
UDP Report
 
D031201021027
D031201021027D031201021027
D031201021027
 
Matopt
MatoptMatopt
Matopt
 

Mehr von Mário Almeida

Empirical Study of Android Alarm Usage for Application Scheduling
Empirical Study of Android Alarm Usage for Application SchedulingEmpirical Study of Android Alarm Usage for Application Scheduling
Empirical Study of Android Alarm Usage for Application SchedulingMário Almeida
 
Android reverse engineering - Analyzing skype
Android reverse engineering - Analyzing skypeAndroid reverse engineering - Analyzing skype
Android reverse engineering - Analyzing skypeMário Almeida
 
High-Availability of YARN (MRv2)
High-Availability of YARN (MRv2)High-Availability of YARN (MRv2)
High-Availability of YARN (MRv2)Mário Almeida
 
Flume impact of reliability on scalability
Flume impact of reliability on scalabilityFlume impact of reliability on scalability
Flume impact of reliability on scalabilityMário Almeida
 
Self-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File SystemsSelf-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File SystemsMário Almeida
 
Smith waterman algorithm parallelization
Smith waterman algorithm parallelizationSmith waterman algorithm parallelization
Smith waterman algorithm parallelizationMário Almeida
 
Man-In-The-Browser attacks
Man-In-The-Browser attacksMan-In-The-Browser attacks
Man-In-The-Browser attacksMário Almeida
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News AggregatorMário Almeida
 
Exploiting Availability Prediction in Distributed Systems
Exploiting Availability Prediction in Distributed SystemsExploiting Availability Prediction in Distributed Systems
Exploiting Availability Prediction in Distributed SystemsMário Almeida
 
High Availability of Services in Wide-Area Shared Computing Networks
High Availability of Services in Wide-Area Shared Computing NetworksHigh Availability of Services in Wide-Area Shared Computing Networks
High Availability of Services in Wide-Area Shared Computing NetworksMário Almeida
 
Instrumenting parsecs raytrace
Instrumenting parsecs raytraceInstrumenting parsecs raytrace
Instrumenting parsecs raytraceMário Almeida
 
Architecting a cloud scale identity fabric
Architecting a cloud scale identity fabricArchitecting a cloud scale identity fabric
Architecting a cloud scale identity fabricMário Almeida
 

Mehr von Mário Almeida (14)

Empirical Study of Android Alarm Usage for Application Scheduling
Empirical Study of Android Alarm Usage for Application SchedulingEmpirical Study of Android Alarm Usage for Application Scheduling
Empirical Study of Android Alarm Usage for Application Scheduling
 
Android reverse engineering - Analyzing skype
Android reverse engineering - Analyzing skypeAndroid reverse engineering - Analyzing skype
Android reverse engineering - Analyzing skype
 
Spark
SparkSpark
Spark
 
High-Availability of YARN (MRv2)
High-Availability of YARN (MRv2)High-Availability of YARN (MRv2)
High-Availability of YARN (MRv2)
 
Flume impact of reliability on scalability
Flume impact of reliability on scalabilityFlume impact of reliability on scalability
Flume impact of reliability on scalability
 
Self-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File SystemsSelf-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File Systems
 
Smith waterman algorithm parallelization
Smith waterman algorithm parallelizationSmith waterman algorithm parallelization
Smith waterman algorithm parallelization
 
Man-In-The-Browser attacks
Man-In-The-Browser attacksMan-In-The-Browser attacks
Man-In-The-Browser attacks
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News Aggregator
 
Exploiting Availability Prediction in Distributed Systems
Exploiting Availability Prediction in Distributed SystemsExploiting Availability Prediction in Distributed Systems
Exploiting Availability Prediction in Distributed Systems
 
High Availability of Services in Wide-Area Shared Computing Networks
High Availability of Services in Wide-Area Shared Computing NetworksHigh Availability of Services in Wide-Area Shared Computing Networks
High Availability of Services in Wide-Area Shared Computing Networks
 
Instrumenting parsecs raytrace
Instrumenting parsecs raytraceInstrumenting parsecs raytrace
Instrumenting parsecs raytrace
 
Architecting a cloud scale identity fabric
Architecting a cloud scale identity fabricArchitecting a cloud scale identity fabric
Architecting a cloud scale identity fabric
 
SOAP vs REST
SOAP vs RESTSOAP vs REST
SOAP vs REST
 

Kürzlich hochgeladen

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Kürzlich hochgeladen (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Dimemas and Multi-Level Cache Simulations

  • 1. ` Universitat Politecnica de Catalunya Measurement and Tools Project Report Dimemas and Multi-level Cache Simulations Author: Supervisor: M´rio Almeida a Alejandro Ramirez Bellido June 22, 2012
  • 2. Contents 1 Introduction 2 2 Methodology 2 2.1 Dimemas Simulation . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Multi-Level Cache Simulation . . . . . . . . . . . . . . . . . . 3 3 Results 4 3.1 Dimemas Simulation . . . . . . . . . . . . . . . . . . . . . . . 4 3.2 Multi-Level Cache Simulation . . . . . . . . . . . . . . . . . . 8 4 Conclusions 11 A Used Scripts 13 A.1 Dimemas instrumentation . . . . . . . . . . . . . . . . . . . . 13 A.1.1 Generating Dimemas Configuration . . . . . . . . . . . 13 A.1.2 Running experiments . . . . . . . . . . . . . . . . . . . 13 A.1.3 Graph generator . . . . . . . . . . . . . . . . . . . . . 17 A.1.4 Generating graphs . . . . . . . . . . . . . . . . . . . . 19 A.2 Pin tool instrumentation . . . . . . . . . . . . . . . . . . . . . 20 A.2.1 Generate and Compile Application and DCache tool . 20 A.2.2 Running the experiments . . . . . . . . . . . . . . . . . 28 A.2.3 Importing the results to a database . . . . . . . . . . . 31 A.2.4 Generating graphs . . . . . . . . . . . . . . . . . . . . 31 1
  • 3. Abstract This report describes the simulation and benchmarking steps taken in order to predict the parallel performance of an application using Dimemas and Cache-level simulations. Using Dimemas [3] the time behaviour of NAS [1] integer sort was simulated for the architecture of the Barcelona Super Computer, MareNostrum [4]. The performance was evaluated as a function of the architecture latency, bandwidth, connectivity and CPU speed. For Cache-Level Simulations, Intel’s pin tool was used to benchmark a simple parallel application in function of the cache and cluster sizes. 1 Introduction This report describes the simulation and benchmarking steps taken in order to predict the parallel performance of an application using Dimemas [3] and Cache-level simulations. Previous work was focused on benchmarking a PARSEC [2] ray-tracing application on the multi-processor Boada server. For this purpose EXTRAE and Paraver [5] were used to instrument and provide detailed quantitative analysis of the application performance. Following the study of measurement tools and techniques, this report describes the usage of Dimemas to simulate the time behaviour of another benchmarking application on the Barcelona Super Computer, MareNostrum. This time the used traces were taken from a NAS benchmark application also running on boada server. The performance of the application in this simu- lation environment was evaluated as a function of the architecture latency, bandwidth, connectivity and CPU speed. To conclude this study on performance analysis, Cache-Level Simulations were performed using Intel’s pin tool. The chosen application was a sim- ple parallel application that performs distributed arithmetic operations. It represents the typical Master-Slave paradigm with embarrassingly parallel workload. For evaluating the cache architecture, the total cache miss rates per cache level were calculated as a function of the cache sizes, associativity, number of threads and the cluster size. 2 Methodology This section presents the two different simulation configurations: Dimemas and Multi-Level Cache simulations. Both sections describe the used tools, configuration values and metrics used. 2
  • 4. Boada Server Bandwidth 1 Gb/s Latency 6-10 us Number of cores 12 Ram 24 GB Table 1: Boada server configuration. 2.1 Dimemas Simulation The application chosen for this experiment was the NAS Parallel Benchmark application, integer sort. The NAS benchmark is a set of programs designed to help evaluate the performance of parallel super computers. In this case, the benchmark was done on the boada server which attributes are described in table 1. In order to perform an architecture simulation, it was decided to use the MareNostrum Super Compute configuration which parameters are shown in table 2. Note that a simplification was made, since it was considered that each processor runs a single thread. Starting from MareNostrums original ar- chitecture, multiple simulations were performed changing its attributes. For this purpose, the script in section A.1.1 was created that generates Dimemas configuration files and another to automate its variations. The changed at- tributes in the simulated architecture consisted of latency, CPU speed, band- width and the number of buses. All the measurements were stored in a sqlite3 database and then queried in order to automatically generate the graphs (sec- tion A.1.3) presented on the section 3 using gnuplot. To conclude, the changed attributes were recursively fixed on a chosen optimal value to find a final architecture that needs lesser resources while having similar execution times to the original MareNostrum configuration. 2.2 Multi-Level Cache Simulation To conclude this study on performance analysis, Cache-Level Simulations were performed using Intel’s pin tool. The chosen application was a sim- ple parallel application that performs distributed arithmetic operations. It represents the typical Master-Slave paradigm with embarrassingly parallel workload. For evaluating the cache architecture, the pin tool dcache application was changed in order to support multiple levels of cache shared by parallel pro- 3
  • 5. cessors. The implemented cache architecture is represented in figure 1. As one might infer from the figure, the cache level two is cluster shared and the cache level three is globally shared. P0 L1 . . . . . . L2 P7 L1 L3 P8 L1 . . L2 . . . . Size of L2 Size of L1 P15 L1 = 1 MB = 4 MB Size of L1 = 16 KB Figure 1: Cache architecture for a cluster size of 8 and a total of 16 processors. For this experiments, the total cache miss rates per cache level were calcu- lated as a function of the multiple cache sizes, number of processors and the cluster size. Some experiments were performed in terms of cache associativity and the number of cache lines per cache set. 3 Results In this section the results of both the experiments will be described alongside with the resulting charts, descriptions and discussion. 3.1 Dimemas Simulation Starting with the initial architecture of MareNostrum, the first experiment consisted on varying the number of buses and observing its impact on the ex- ecution time of our application. The results of this experiment are depicted on the Figure 2. As it can be observed from figure 2, the execution time decreases while increasing the number of buses. This result was expected since this is a 4
  • 6. Execution time with variable buses 1600 #Procs = 2 1400 #Procs = 4 #Procs = 8 1200 Execution time(s) #Procs = 16 1000 #Procs = 32 800 600 400 200 0 0 5 10 15 20 buses Figure 2: Execution time of IntegerSort depending on the number of buses. multi-threaded application in which the data is transferred between threads and adding more buses increases the amount of data that can be transferred in parallel. Also it can be seen that from sixteen buses, the execution time starts stabilizing. This is probably because most of the data to be sent, is already sent in parallel and thus the increase of buses does not impact the performance. The second experiment consisted on varying the available bandwidth from the initial MareNostrum configuration. The results are shown in Figure 3. Execution time with variable bandwidth 140 #Procs = 2 120 #Procs = 4 #Procs = 8 Execution time(s) 100 #Procs = 16 #Procs = 32 80 60 40 20 0 170 180 190 200 210 220 230 240 250 bandwidth Figure 3: Execution time of IntegerSort depending on the bandwidth (MB/s). 5
  • 7. Figure 3 shows that the bandwidth as a bigger impact on performance if the application is run on a smaller set of threads. For example, a variation of 40 MB/s can increase the execution time by 20 seconds for four threads, but for 32 threads, the changes are almost unnoticeable. This is probably due to the fact that the master thread has to send the initial data to all slaves. This means that increasing the number of slaves, the data can be di- vided in smaller chunks that can be sent in parallel and thus taking less time. The third experiment consisted on varying the processing capacity of the CPU. As one can observe in figure 4, increasing the processing power of each processor decreases the execution time. This impact is more noticeable if we consider processing capacity smaller than 100%. It is not very tunable in terms of optimizing the usage of resources in terms of decreasing the CPU power since a small decrease has a big impact on the execution time. Execution time with variable cpu 500 #Procs = 2 450 #Procs = 4 400 #Procs = 8 Execution time(s) 350 #Procs = 16 #Procs = 32 300 250 200 150 100 50 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 cpu Figure 4: Execution time of IntegerSort depending on the available CPU (%). To conclude the experiments on the variation of the architecture param- eters, figure 5 shows the impact of latency on the execution time. For figure 5 a logarithmic scale was chosen for the x axis since changes in the same order of the initial MareNostrum configuration do not have a significant impact on the execution time. The latency can be increased to value significantly bigger without affecting much the performance since the latency values in MareNostrum are very small. Only from values of latency close to 0.01 seconds we start seeing bigger increases of the execution time. This attribute should have a bigger impact for more communication intensive 6
  • 8. Execution time with variable latency 600 #Procs = 2 500 #Procs = 4 #Procs = 8 Execution time(s) #Procs = 16 400 #Procs = 32 300 200 100 0 1e-06 1e-05 0.0001 0.001 0.01 0.1 1 latency Figure 5: Execution time of IntegerSort depending on the latency (s). applications. Figure 6: Execution time of IntegerSort depending on the number of threads. To conclude, a comparison is shown in table 2 that presents the dif- ferences between a less resource demanding configuration and the original MareNostrum configuration, both achieving similar execution times. The chosen number of threads was 32 due to its better performance as shown in 7
  • 9. Parameters MareNostrum Config 1 Config 2 Cpu (%) 1.0 0.95 0.9 Latency (s) 0.000008 0.0001 0.001 Bandwidth (MB/s) 250.0 240 230 Number of buses 20+ * 16 16 Execution time (s) 12.506 13.150 13.779 Table 2: Comparison between the execution times of the initial MareNostrum configuration and its less resource demanding configuration. figure 6. The table 2 confirms the predictions made in previous experiments. The chosen values increase the execution time at most 1 second while reducing most parameters by around 10% and increasing significantly the latency. 3.2 Multi-Level Cache Simulation As previously mentioned, the chosen application was a simple parallel ap- plication that performs distributed arithmetic operations. It represents the typical Master-Slave paradigm with embarrassingly parallel workload. MissRate of cache L2 per cluster size (Lsize=[16,1,4]) 50 48 #procs = 2 #procs = 4 46 #procs = 8 Total Miss Rate (%) 44 #procs = 16 42 40 38 36 34 32 30 28 2 3 4 5 6 7 8 Cluster size Figure 7: MissRate of Cache L2 for L1, L2 and L3 sized, respectively 16K, 1M, 4M For evaluating the cache architecture, the cache architecture was changed depending on multiple factors, such as the cluster size, caches sizes and cache line sizes. To start with this experiments the cache architecture was set as shown in figure 1. It has 16 processors with one L1 cache of 16 KB each. 8
  • 10. The cache level two has 1 MB and is cluster shared with a cluster size of 8. And finally, the cache level three is globally shared and has a size of 4 MB. The first experiment consisted on varying the cluster size as shown in figure 7 and verifying its impact on the cache L2 miss rate. As it can be seen, for the number of threads of this experiment the impact on the miss rates of changing the cluster size was not very significant. For up to 4 threads it has almost no impact at all, but when the system has more than 8 threads it can reduce the miss rate by 2%. It is interesting to notice that in this experiment, the more threads sharing the same L2 cache, the lesser the miss rate becomes. Since most cache size configurations produced similar variations for the clus- ter size experiment, the next step consisted on verifying the the impact of the cache sizes on the miss rates. The first step consisted on varying the size of the non-shared cache L1 and its results are presented on figure 8. MissRate of cache 1 per L1 size 15 #procs = 2 14 #procs = 4 #procs = 8 Total Miss Rate (%) 13 #procs = 16 12 11 10 9 8 15 20 25 30 35 40 45 50 55 60 65 Size of cache L1 Figure 8: MissRate of Cache L1 for a variable L1 cache size (KB). Looking at figure 8 it might seem strange that a smaller number of threads has such a lower miss rate. This is because of the master/slave paradigm that for an increasing number of threads makes the accesses to data more sparse. For bigger numbers of threads the miss rates can reach values close to 15%. As expected, bigger sizes of L1 caches achieve smaller miss rates, although the difference isn’t greater than 2%. Although the experiments were performed for more sizes of L1 cache, in order to study the impact of the L2 cache size, the L1 cache size was fixed on 16 KB. The variation of L2 cache size is presented on figure 9. As one can observe, the miss rate of L2 cache for 2 threads is high, being close to 50%. This is probably because of the low miss rate of the L1 cache, the accesses 9
  • 11. MissRate of cache 2 per L2 size (Lsize=[16,.,.]) 50 #procs = 2 48 #procs = 4 46 #procs = 8 Total Miss Rate (%) 44 #procs = 16 42 40 38 36 34 32 30 1 1.5 2 2.5 3 3.5 4 Size of cache L2 Figure 9: MissRate of Cache L2 for a variable L2 cache size (MB) and a L1 cache size of 16KB. that don’t produces hits on L1 should have lower predictability. For bigger numbers of threads, the miss rates are still high although they don’t reach values higher than 33%. MissRate of cache L3 per L3 size (Lsize=[16,1,.]) 100 #procs = 2 #procs = 4 80 #procs = 8 Total Miss Rate (%) #procs = 16 60 40 20 0 4 6 8 10 12 14 16 Size of cache L3 Figure 10: MissRate of Cache L3 for a variable L3 cache size (MB) and a L1 cache size of 16KB. Finally, for the L3 cache size, the impact on the miss rate of the L3 cache size is shown in figure 10. It seems that accesses that don’t produce hits on the previous two levels of cache, will hardly produce hits on the third level of cache. The only exception are the 2 threads for which the set of accessed data is bigger. This probably shows that either the application doesn’t justify the use of three levels of cache, or the data accessed by each thread at each 10
4 Conclusions

Dimemas made it possible to explore the theoretical performance of the application on the MareNostrum architecture. Through the variation of each parameter it was possible to create graphs depicting its impact on the execution time. By the end of the experiment it was possible to suggest an architecture with fewer resources that achieves results similar to the initial MareNostrum architecture. This architecture is presented in table 2 and confirms the predictions made in the Dimemas experiments: the chosen values increase the execution time by roughly one second while reducing most parameters by around 10% and increasing the latency significantly.

For the second experiment, the impact of the cluster size and the cache sizes was presented for a simple parallel arithmetic application. The experiments showed that the impact of the cluster size on the miss rate was not very significant; for more than 8 threads it can reduce the miss rate by about 2%. Overall, the more threads share the same L2 cache, the lower the miss rate becomes. This is a consequence of the master/slave paradigm, in which an increasing number of threads makes the accesses to the data more sparse. As expected, bigger L1 caches achieve smaller miss rates. For large numbers of threads the L2 miss rates were high, although they did not exceed 33%. In general, accesses that did not produce hits in the first two cache levels hardly produced hits in the third level. The experiments showed that either the application does not justify the use of three levels of cache, or the data accessed by each thread at each moment is too small.

Scripting the experiments had a huge impact on the time needed to perform them, since some of the experiments produced thousands of results. The technique that proved most efficient was to script the generation of results, import them into an SQL database and run queries whose output is plotted with gnuplot.
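A minimal sketch of that results pipeline, following the structure of the scripts in appendix A (file, table and column names here are illustrative rather than the exact ones used), is:

    #!/bin/bash
    # Sketch of the CSV -> SQLite -> gnuplot flow used for the experiments.
    # File, table and column names are illustrative.
    sqlite3 res.db 'CREATE TABLE IF NOT EXISTS runs (procs INTEGER, param REAL, runtime REAL);'

    # Import the comma-separated results produced by the experiment scripts.
    printf '.separator ","\n.import results.csv runs\n' | sqlite3 res.db

    # Plot the runtime against the swept parameter; sqlite3 prints "|"-separated rows.
    {
        echo 'set datafile separator "|"'
        echo 'set terminal pdfcairo'
        echo 'set output "runtime.pdf"'
        echo "plot \"< sqlite3 res.db 'select param, runtime from runs where procs = 16'\" using 1:2 with lines title '#procs = 16'"
    } | gnuplot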
References

[1] NAS Parallel Benchmarks, http://www.nas.nasa.gov/publications/npb.html.

[2] PARSEC benchmark suite, http://parsec.cs.princeton.edu/.

[3] Dimemas, http://www.bsc.es/computer-sciences/performance-tools/dimemas.

[4] MareNostrum, http://en.wikipedia.org/wiki/MareNostrum.

[5] Paraver, http://www.bsc.es/computer-sciences/performance-tools/paraver.
A Used Scripts

A.1 Dimemas instrumentation

A.1.1 Generating Dimemas Configuration

#!/bin/bash

if [ $# -ne 6 ]
then
    echo "$0: Wrong number of arguments."
    echo "$0: <input.trf> <nthreads> <nbuses> <latency> <bandwidth> <%cpuspeed>"
    exit 1
fi

cat begin_of_config

# Bandwidth definition
echo -e "\n\n\"environment information\" {\"\", 0, \"\", 128, $5, $3, 3};;\n"

# Latency and %cpu speed definitions
for (( i=0; i<=127; i++ ))
do
    echo "\"node information\" {0, $i, \"\", 1, 1, 1, 0.0, $4, $6, 0.0, 0.0};;"
done

# File name and number of processors definitions
echo ""
echo -n "\"mapping information\" {\"$1\", $2, [$2] "
echo -n "{0"

for (( i=1; i<=$2-1; i++ ))
do
    echo -n ",$i"
done

echo "}};;"

cat end_of_config

A.1.2 Running experiments

1 #!/bin/bash
2 #
  • 15. 3 # S c r i p t by aknahs ( Mario Almeida ) 4 # 5 6 cat logo 7 8 echo ”Removing out f o l d e r ( f o r c e ) ” 9 rm − r f out 10 11 echo ” C r e a t i n g out f o l d e r ” 12 mkdir out 13 mkdir out / c f g 14 mkdir out / prv 15 mkdir out / d e t a i l s 16 mkdir out / r e s u l t s 17 18 echo ” C r e a t i n g s q l i t e 3 d a t a b a s e ” 19 s q l i t e 3 out / r e s u l t s / r e s . db ’CREATE TABLE dimemas ( p r o c s INTEGER, b u s e s INTEGER, l a t e n c y REAL, bandwidth REAL, cpu REAL, runtime REAL) ; ’ 20 21 echo ” S e t t i n g d e f a u l t v a l u e s ” 22 LATENCY=” 0 . 0 0 0 0 0 8 ” 23 BANDWIDTH 2 5 0 . 0 ” =” 24 BUSES=” 0 ” 25 CPU=” 1 . 0 ” 26 27 f o r i i n 02 04 08 16 32 28 do 29 #echo ”−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−” 30 i f [ $ { i : 0 : 1 } == 0 ] 31 then 32 #echo ” S e t t i n g n t h r e a d s t o $ { i : 1 } ” 33 n t h r e a d s=$ { i : 1 } 34 else 35 #echo ” S e t t i n g n t h r e a d s t o $ { i }” 36 n t h r e a d s=$ i 37 fi 38 39 echo −n ” G e n e r a t i n g r e s u l t s f o r $ n t h r e a d s ” 40 41 #BUSES−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− 42 f o r j i n 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 43 do 44 #echo ” G e n e r a t i n g c o n f i g u r a t i o n f i l e f o r BUSES = $ j ” 45 . / c o n f i g g e n i n / m p i p i n g $ i . t r f $ n t h r e a d s $ j $LATENCY $BANDWIDTH $CPU > out / c f g / c o n f i g −$ n t h r e a d s −$j −$LATENCY − $BANDWIDTH −$CPU . c f g 46 #echo ” C o n v e r t i n g t o p a r a v e r t r a c e . . . ” 14
  • 16. 47 . / Dimemas3 −S 32K −pa out / prv / paraver −$ n t h r e a d s −$j −$LATENCY − $BANDWIDTH −$CPU . prv out / c f g / c o n f i g −$ n t h r e a d s −$j −$LATENCY − $BANDWIDTH −$CPU . c f g > out / d e t a i l s / d e t a i l −$ n t h r e a d s −$j − $LATENCY −$BANDWIDTH −$CPU 48 #echo ” O u t p u t i n g r e s u l t s . ” 49 echo −n ” $ n t h r e a d s , $j ,$LATENCY,$BANDWIDTH, $CPU, ” >> out / r e s u l t s / r e s −$ n t h r e a d s . c s v 50 g r e p E x e c u t i o n out / d e t a i l s / d e t a i l −$ n t h r e a d s −$j −$LATENCY − $BANDWIDTH −$CPU | awk ”{ p r i n t $3 } ” >> out / r e s u l t s / r e s − $nthreads . csv 51 done 52 53 echo −n ” . ” 54 55 #LATENCY −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− 56 for j in 0.000001 0.00001 0.0001 0.001 0.01 0 . 1 1 . 0 57 do 58 . / c o n f i g g e n i n / m p i p i n g $ i . t r f $ n t h r e a d s $BUSES $ j $BANDWIDTH $CPU > out / c f g / c o n f i g −$ n t h r e a d s −$BUSES−$j −$BANDWIDTH −$CPU . cfg 59 . / Dimemas3 −S 32K −pa out / prv / paraver −$ n t h r e a d s −$BUSES−$j − $BANDWIDTH −$CPU . prv out / c f g / c o n f i g −$ n t h r e a d s −$BUSES−$j − $BANDWIDTH −$CPU . c f g > out / d e t a i l s / d e t a i l −$ n t h r e a d s −$BUSES− $j −$BANDWIDTH −$CPU 60 echo −n ” $ n t h r e a d s , $BUSES , $j ,$BANDWIDTH, $CPU, ” >> out / r e s u l t s / r e s −$ n t h r e a d s . c s v 61 g r e p E x e c u t i o n out / d e t a i l s / d e t a i l −$ n t h r e a d s −$BUSES−$j − $BANDWIDTH −$CPU | awk ”{ p r i n t $3 } ” >> out / r e s u l t s / r e s − $nthreads . csv 62 done 63 64 echo −n ” . ” 65 66 # BANDWIDTH −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− 67 for j in 250.0 245.0 240.0 235.0 230.0 225.0 220.0 215.0 210.0 205.0 200.0 195.0 190.0 185.0 180.0 175.0 170.0 68 do 69 . / c o n f i g g e n i n / m p i p i n g $ i . t r f $ n t h r e a d s $BUSES $LATENCY $ j $CPU > out / c f g / c o n f i g −$ n t h r e a d s −$BUSES−$LATENCY −$j −$CPU . c f g 70 . / Dimemas3 −S 32K −pa out / prv / paraver −$ n t h r e a d s −$BUSES− $LATENCY −$j −$CPU . prv out / c f g / c o n f i g −$ n t h r e a d s −$BUSES− $LATENCY −$j −$CPU . c f g > out / d e t a i l s / d e t a i l −$ n t h r e a d s −$BUSES− $LATENCY −$j −$CPU 71 echo −n ” $ n t h r e a d s , $BUSES ,$LATENCY, $j , $CPU, ” >> out / r e s u l t s / r e s −$ n t h r e a d s . c s v 72 g r e p E x e c u t i o n out / d e t a i l s / d e t a i l −$ n t h r e a d s −$BUSES−$LATENCY −$j −$CPU | awk ”{ p r i n t $3 }” >> out / r e s u l t s / r e s −$ n t h r e a d s . c s v 73 done 74 15
  • 17. 75 echo −n ” . ” 76 77 #CPU SPEED−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− 78 for j in 5 . 0 4 . 0 3 . 0 2 . 0 1 . 0 0.95 0 . 9 0.85 0 . 8 0.75 0 . 7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.25 0.1 0.05 79 do 80 . / c o n f i g g e n i n / m p i p i n g $ i . t r f $ n t h r e a d s $BUSES $LATENCY $BANDWIDTH $ j > out / c f g / c o n f i g −$ n t h r e a d s −$BUSES−$LATENCY − $BANDWIDTH j . c f g−$ 81 . / Dimemas3 −S 32K −pa out / prv / paraver −$ n t h r e a d s −$BUSES− $LATENCY −$BANDWIDTH j . prv out / c f g / c o n f i g −$ n t h r e a d s −$BUSES− −$ $LATENCY −$BANDWIDTH j . c f g > out / d e t a i l s / d e t a i l −$ n t h r e a d s − −$ $BUSES−$LATENCY −$BANDWIDTH j −$ 82 echo −n ” $ n t h r e a d s , $BUSES ,$LATENCY,$BANDWIDTH, $j , ” >> out / r e s u l t s / r e s −$ n t h r e a d s . c s v 83 g r e p E x e c u t i o n out / d e t a i l s / d e t a i l −$ n t h r e a d s −$BUSES−$LATENCY − $BANDWIDTH j | awk ” { p r i n t $3 } ” >> out / r e s u l t s / r e s − −$ $nthreads . csv 84 85 86 done 87 echo ” . ” 88 echo ” I m p o r t i n g t o d a t a b a s e ” 89 echo ” . s e p a r a t o r ” , ” ” > out / r e s u l t s /command 90 echo ” . import out / r e s u l t s / r e s −$ { n t h r e a d s } . c s v dimemas” >> out / r e s u l t s /command 91 s q l i t e 3 out / r e s u l t s / r e s . db < out / r e s u l t s /command 92 rm out / r e s u l t s /command 93 done 94 95 echo ” G e n e r a t i n g b e s t c o n f i g u r a t i o n 1 ” 96 . / c o n f i g g e n i n / m p i p i n g 3 2 . t r f 32 16 0 . 0 0 0 1 2 4 0 . 0 0 . 9 5 > out / c f g / c o n f i g −32 −16 −0.0001 −240.0 −0.95. c f g 97 . / Dimemas3 −S 32K −pa out / prv / paraver −32 −16 −0.0001 −240.0 −0.95. prv out / c f g / c o n f i g −32 −16 −0.0001 −240.0 −0.95. c f g > out / d e t a i l s / d e t a i l −32 −16 −0.0001 −240.0 −0.95 98 echo −n ” 3 2 , 1 6 , 0 . 0 0 0 1 , 2 4 0 . 0 , 0 . 9 5 , ” > out / r e s u l t s / o p t i m a l . c s v 99 g r e p E x e c u t i o n out / d e t a i l s / d e t a i l −32 −16 −0.0001 −240.0 −0.95 | awk ”{ p r i n t $3 }” >> out / r e s u l t s / o p t i m a l . c s v 100 101 echo ” G e n e r a t i n g b e s t c o n f i g u r a t i o n ” 102 . / c o n f i g g e n i n / m p i p i n g 1 6 . t r f 16 16 0 . 0 0 0 1 2 3 0 . 0 0 . 9 > out / c f g / c o n f i g −16 −16 −0.0001 −230.0 −0.9. c f g 103 . / Dimemas3 −S 32K −pa out / prv / paraver −16 −16 −0.0001 −230.0 −0.9. prv out / c f g / c o n f i g −16 −16 −0.0001 −230.0 −0.9. c f g > out / d e t a i l s / d e t a i l −16 −16 −0.0001 −230.0 −0.9 104 echo −n ” 1 6 , 1 6 , 0 . 0 0 0 1 , 2 3 0 . 0 , 0 . 9 , ” >> out / r e s u l t s / o p t i m a l . c s v 105 g r e p E x e c u t i o n out / d e t a i l s / d e t a i l −16 −16 −0.0001 −230.0 −0.9 | awk ” { p r i n t $3 } ” >> out / r e s u l t s / o p t i m a l . c s v 16
  • 18. 106 107 ./ graphall buses 108 ./ graphall cpu 109 ./ graphall bandwidth 110 ./ graphall latency 111 112 echo ” A l l done ! ” A.1.3 Graph generator 1 #! / b i n / b a s h 2 # 3 # S c r i p t by aknahs ( Mario Almeida ) 4 # 5 6 l a t e n c y=” 0 . 0 0 0 0 0 8 ” 7 bandwidth=” 2 5 0 . 0 ” 8 b u s e s=” 0 ” 9 cpu=” 1 . 0 ” 10 aux=” ” 11 aux2=” ” 12 13 i f [ ” $1 ” == ” l a t e n c y ” ] 14 then 15 comp=$ l a t e n c y 16 aux=” s e t l o g x” 17 aux2=” s e t m x t i c s 10 ” 18 fi 19 i f [ ” $1 ” == ” bandwidth ” ] 20 then 21 comp=$bandwidth 22 fi 23 i f [ ” $1 ” == ” b u s e s ” ] 24 then 25 comp=$ b u s e s 26 fi 27 i f [ ” $1 ” == ” cpu ” ] 28 then 29 comp=$cpu 30 fi 31 32 33 echo ” G e n e r a t i n g Graph” 34 g n u p l o t << EOF 35 set d a t a f i l e s e p a r a t o r ” | ” 36 37 # Line s t y l e f o r a x e s 38 set s t y l e l i n e 80 l t rgb ”#808080” 17
  • 19. 39 40 # Line s t y l e f o r g r i d 41 set s t y l e l i n e 81 l t 0 # dashed 42 set s t y l e l i n e 81 l t rgb ”#808080” # grey 43 44 set grid back l i n e s t y l e 81 45 set b o r d e r 3 back l i n e s t y l e 80 # Remove b o r d e r on t o p and r i g h t . These 46 # b o r d e r s a r e u s e l e s s and make i t h a r d e r 47 # t o s e e p l o t t e d l i n e s near t h e b o r d e r . 48 # Also , p u t i t i n g r e y ; no need f o r so much emphasis on a border . 49 set x t i c s n o m i r r o r 50 set y t i c s n o m i r r o r 51 52 #s e t l o g x 53 #s e t m x t i c s 10 # Makes l o g s c a l e l o o k good . 54 55 # Line s t y l e s : t r y t o p i c k p l e a s i n g c o l o r s , r a t h e r 56 # than s t r i c t l y primary c o l o r s or hard−to−s e e c o l o r s 57 # l i k e g n u p l o t ’ s d e f a u l t y e l l o w . Make t h e l i n e s t h i c k 58 # so t h e y ’ r e e a s y t o s e e i n s m a l l p l o t s i n p a p e r s . 59 set s t y l e l i n e 1 l t rgb ”#A00000 ” lw 2 pt 1 60 set s t y l e l i n e 2 l t rgb ”#00A000” lw 2 pt 6 61 set s t y l e l i n e 3 l t rgb ”#5060D0” lw 2 pt 2 62 set s t y l e l i n e 4 l t rgb ”#F25900 ” lw 2 pt 9 63 set s t y l e l i n e 5 lw 2 pt 9 64 65 #s e t key t o p r i g h t 66 67 #s e t x r a n g e [ 0 : 1 ] 68 #s e t y r a n g e [ 0 : 1 ] 69 70 #p l o t ” t e m p l a t e . d a t ” 71 #i n d e x 0 t i t l e ” Example l i n e ” w l p l s 1 , 72 #”” i n d e x 1 t i t l e ” Another example ” w l p l s 2 73 74 #s e t s t y l e d a t a l i n e s 75 set key o u t s i d e 76 #s e t x t i c s r o t a t e by −45 77 #s e t s i z e r a t i o 0 . 8 78 set t i t l e ” E x e c u t i o n time with v a r i a b l e $1 ” 79 set xlabel ” $1 ” 80 $aux 81 $aux2 82 set ylabel ” E x e c u t i o n time ( s ) ” 83 84 plot ”< s q l i t e 3 out / r e s u l t s / r e s . db ’ s e l e c t $1 , runtime from dimemas where $1 != $comp and p r o c s = 2 UNION s e l e c t $1 , 18
  • 20. runtime from dimemas where p r o c s = 2 and b u s e s = $ b u s e s and l a t e n c y = $ l a t e n c y and bandwidth = $bandwidth and cpu = $cpu ’ ” u s i n g 1 : 2 w l p l s 1 t i t l e ’#Procs = 2 ’ , 85 ”< s q l i t e 3 out / r e s u l t s / r e s . db ’ s e l e c t $1 , runtime from dimemas where $1 != $comp and p r o c s = 4 UNION s e l e c t $1 , runtime from dimemas where p r o c s = 4 and b u s e s = $ b u s e s and l a t e n c y = $ l a t e n c y and bandwidth = $bandwidth and cpu = $cpu ’ ” u s i n g 1 : 2 w l p l s 2 t i t l e ’#Procs = 4 ’ , 86 ”< s q l i t e 3 out / r e s u l t s / r e s . db ’ s e l e c t $1 , runtime from dimemas where $1 != $comp and p r o c s = 8 UNION s e l e c t $1 , runtime from dimemas where p r o c s = 8 and b u s e s = $ b u s e s and l a t e n c y = $ l a t e n c y and bandwidth = $bandwidth and cpu = $cpu ’ ” u s i n g 1 : 2 w l p l s 3 t i t l e ’#Procs = 8 ’ , 87 ”< s q l i t e 3 out / r e s u l t s / r e s . db ’ s e l e c t $1 , runtime from dimemas where $1 != $comp and p r o c s = 16 UNION s e l e c t $1 , runtime from dimemas where p r o c s = 16 and b u s e s = $ b u s e s and l a t e n c y = $ l a t e n c y and bandwidth = $bandwidth and cpu = $cpu ’ ” u s i n g 1 : 2 with l i n e s l s 4 t i t l e ’#Procs = 16 ’ , 88 ”< s q l i t e 3 out / r e s u l t s / r e s . db ’ s e l e c t $1 , runtime from dimemas where $1 != $comp and p r o c s = 32 UNION s e l e c t $1 , runtime from dimemas where p r o c s = 32 and b u s e s = $ b u s e s and l a t e n c y = $ l a t e n c y and bandwidth = $bandwidth and cpu = $cpu ’ ” u s i n g 1 : 2 w l p l s 5 t i t l e ’#Procs = 32 ’ 89 90 set terminal p d f c a i r o f o n t ” G i l l Sans , 7 ” l i n e w i d t h 4 rounded 91 #s e t t e r m i n a l p d f c a i r o s i z e 10cm, 2 0cm 92 set output ” out / r e s u l t s / $1 . pdf ” 93 replot 94 EOF 95 96 echo ”Done” A.1.4 Generating graphs 1 ./ graphall buses 2 ./ graphall latency 3 ./ graphall cpu 4 ./ graphall bandwidth 5 6 echo ” G e n e r a t i n g Graph” 7 g n u p l o t << EOF 8 set d a t a f i l e s e p a r a t o r ” , ” 9 set nokey 10 11 set t i t l e ” E x e c u t i o n time depending on t h e number o f t h r e a d s ” 12 set xlabel ”Number o f t h r e a d s ” 13 14 set x t i c s ( 0 , 2 , 4 , 8 , 1 6 , 3 2 , 3 4 ) 19
  • 21. 15 16 set ylabel ” E x e c u t i o n time ( s ) ” 17 18 set s t y l e l i n e 1 l t rgb ”#A00000 ” lw 50 19 20 plot ” out / r e s u l t s / comparisonThreads . c s v ” u s i n g 1 : 2 with imp l s 1 21 22 set term p o s t s c r i p t eps enhanced c o l o r 23 set output ” out / r e s u l t s / comparison . pdf ” 24 replot 25 EOF A.2 Pin tool instrumentation A.2.1 Generate and Compile Application and DCache tool 1 #! / b i n / b a s h 2 3 #c l u s t e r S i z e 4 # c o n s t UINT32 c a c h e S i z e = 256∗KILO ; 5 # c o n s t UINT32 l i n e S i z e = 1 ; 6 # c o n s t UINT32 a s s o c i a t i v i t y = 2 5 6 ; 7 # l u s t e r S i z e > <L 1 c a c h e s i z e > <L 1 l i n e S i z e > <L1assoc> <L 2 c a c h e s i z e <c > <L 2 l i n e S i z e > <L2assoc> <L 3 c a c h e s i z e > <L 3 l i n e S i z e > <L3assoc> <nThreads> 8# $1 $2 $3 $4 $5 $6 $7 $8 $9 $10 $11 9 10 i f [ $# −ne 11 ] 11 then 12 echo ” $0 : Wrong number o f arguments . ” 13 echo ” $0 : <c l u s t e r S i z e > <L 1 c a c h e s i z e > <L 1 l i n e S i z e > <L1assoc > <L 2 c a c h e s i z e > <L 2 l i n e S i z e > <L2assoc> <L 3 c a c h e s i z e > < L 3 l i n e S i z e > <L3assoc> <nThreads>” 14 exit 1 15 f i 16 17 threadsAndMaster=$ ( ( $ {11} −1) ) 18 #echo ” TreadsAndMaster = $threadsAndMaster ” 19 20 #echo −n ”INPUT=” 21 #echo ” $1 $2 $3 $4 $5 $6 $7 $8 $9 $ {10} $ {11}” 22 23 #echo ” S a v i n g backup o f dcache f i l e ” 24 mv −f dcache . cpp dcache backup . cpp 20
  • 22. 25 26 echo ” 27 #i n c l u d e <i o s t r e a m > 28 #i n c l u d e <f s t r e a m > 29 #i n c l u d e <c a s s e r t > 30 31 #i n c l u d e ” p i n .H” 32 33 34 t y p e d e f UINT32 CACHE STATS ; // type o f c a c h e h i t / m i s s c o u n t e r s 35 36 #i n c l u d e ” p i n c a c h e .H” 37 38 KNOB t r i n g > KnobOutputFile (KNOB MODE WRITEONCE, <s ” p i n t o o l ” , 39 ” o ” , ” a l l c a c h e . out ” , ” s p e c i f y dcache f i l e name” ) ; 40 41 PIN LOCK l o c k ; 42 43 INT32 numThreads = 0 ; 44 c o n s t INT32 MaxNumThreads = $11 ; 45 c o n s t INT32 c l u s t e r S i z e = $1 ; 46 47 s t r u c t THREAD DATA 48 { 49 UINT64 H i t s ; 50 UINT64 Miss ; 51 }; 52 53 THREAD DATA l 1 c o u n t [ MaxNumThreads ] ; 54 THREAD DATA l 2 c o u n t [ c l u s t e r S i z e ] ; 55 56 VOID T h r e a d S t a r t (THREADID t h r e a d i d , CONTEXT ∗ c t x t , INT32 f l a g s , VOID ∗v ) 57 { 58 GetLock(& l o c k , t h r e a d i d +1) ; 59 numThreads++; 60 R e l e a s e L o c k (& l o c k ) ; 61 62 ASSERT( numThreads <= MaxNumThreads , ”Maximum number o f t h r e a d s e x c e e d e d n” ) ; 63 } 64 65 namespace DL1 66 { 67 // 1 s t l e v e l data c a c h e : 32 kB , 32 B l i n e s , 32−way associative 68 c o n s t UINT32 c a c h e S i z e = $2 ∗KILO ; 69 c o n s t UINT32 l i n e S i z e = $3 ; 70 c o n s t UINT32 a s s o c i a t i v i t y = $4 ; 21
  • 23. 71 c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC : : STORE NO ALLOCATE; 72 73 c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗ associativity ) ; 74 c o n s t UINT32 m a x a s s o c i a t i v i t y = a s s o c i a t i v i t y ; 75 76 t y p e d e f CACHE ROUND ROBIN( max sets , m a x a s s o c i a t i v i t y , a l l o c a t i o n ) CACHE; 77 } 78 LOCALVAR DL1 : : CACHE d l 1 ( ”L1 Data Cache ” , DL1 : : c a c h e S i z e , DL1 : : l i n e S i z e , DL1 : : a s s o c i a t i v i t y ) ; 79 80 namespace UL2 81 { 82 // 2nd l e v e l u n i f i e d c a c h e : 2 MB, 64 B l i n e s , d i r e c t mapped 83 c o n s t UINT32 c a c h e S i z e = $5 ∗MEGA; 84 c o n s t UINT32 l i n e S i z e = $6 ; 85 c o n s t UINT32 a s s o c i a t i v i t y = $7 ; 86 c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC : : STORE ALLOCATE; 87 88 c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗ associativity ) ; 89 90 t y p e d e f CACHE DIRECT MAPPED( max sets , a l l o c a t i o n ) CACHE; 91 } 92 LOCALVAR UL2 : : CACHE u l 2 ( ”L2 C l u s t e r −s h a r e d Cache ” , UL2 : : c a c h e S i z e , UL2 : : l i n e S i z e , UL2 : : a s s o c i a t i v i t y ) ; 93 94 namespace UL3 95 { 96 // 3 rd l e v e l u n i f i e d c a c h e : 16 MB, 64 B l i n e s , d i r e c t mapped 97 c o n s t UINT32 c a c h e S i z e = $8 ∗MEGA; 98 c o n s t UINT32 l i n e S i z e = $9 ; 99 c o n s t UINT32 a s s o c i a t i v i t y = $ { 1 0 } ; 100 c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC : : STORE ALLOCATE; 101 102 c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗ associativity ) ; 103 104 t y p e d e f CACHE DIRECT MAPPED( max sets , a l l o c a t i o n ) CACHE; 105 } 106 LOCALVAR UL3 : : CACHE u l 3 ( ”L3 G l o b a l l y −s h a r e d Cache ” , UL3 : : c a c h e S i z e , UL3 : : l i n e S i z e , UL3 : : a s s o c i a t i v i t y ) ; 107 108 LOCALFUN VOID F i n i ( i n t code , VOID ∗ v ) 109 { 22
  • 24. 110 s t d : : o f s t r e a m out ( KnobOutputFile . Value ( ) . c s t r ( ) ) ; 111 112 out << 113 ”#n” 114 ”# DCACHE s t a t s n ” 115 ”#n” ; 116 117 out << d l 1 ; 118 out << u l 2 ; 119 out << u l 3 ; 120 121 out . c l o s e ( ) ; 122 123 f o r ( i n t i =0; i <numThreads ; i ++) 124 { 125 p r i n t f ( ”%d L1 H i t s : %I64d n” , i , ( u n s i g n e d i n t ) l 1 c o u n t [ i ] . H i t s ); 126 p r i n t f ( ”%d L1 Miss : %I64d n” , i , ( u n s i g n e d i n t ) l 1 c o u n t [ i ] . Miss ); 127 p r i n t f ( ”%d L1 Hit r a t e : %f nn” , i , ( 1 0 0 . 0 ∗ l 1 c o u n t [ i ] . H i t s / ( l 1 c o u n t [ i ] . H i t s+l 1 c o u n t [ i ] . Miss ) ) ) ; 128 } 129 130 f o r ( i n t i =0; i <c l u s t e r S i z e ; i ++) 131 { 132 p r i n t f ( ”%d L2 H i t s : %I64d n” , i , ( u n s i g n e d i n t ) l 2 c o u n t [ i ] . H i t s ); 133 p r i n t f ( ”%d L2 Miss : %I64d n” , i , ( u n s i g n e d i n t ) l 2 c o u n t [ i ] . Miss ); 134 p r i n t f ( ”%d L2 Hit r a t e : %f nn” , i , ( 1 0 0 . 0 ∗ l 2 c o u n t [ i ] . H i t s / ( l 2 c o u n t [ i ] . H i t s+l 2 c o u n t [ i ] . Miss ) ) ) ; 135 } 136 } 137 138 LOCALFUN VOID U l 2 A c c e s s (ADDRINT addr , UINT32 size , CACHE BASE : : ACCESS TYPE accessType , THREADID t i d ) 139 { 140 // s e c o n d l e v e l u n i f i e d c a c h e 141 c o n s t BOOL d l 2 H i t = u l 2 . A c c e s s ( addr , size , a c c e s s T y p e ) ; 142 143 // t h i r d l e v e l u n i f i e d c a c h e 144 i n t c i d = t i d / ( MaxNumThreads/ c l u s t e r S i z e ) ; 145 i f ( ! dl2Hit ) 146 { 147 GetLock(& l o c k , t i d +1) ; 148 l 2 c o u n t [ c i d ] . Miss++; 149 R e l e a s e L o c k (& l o c k ) ; 150 u l 3 . A c c e s s ( addr , size , a c c e s s T y p e ) ; 151 } else 23
  • 25. 152 l 2 c o u n t [ c i d ] . H i t s ++; 153 } 154 155 LOCALFUN VOID MemRefMulti (ADDRINT addr , UINT32 size , CACHE BASE : : ACCESS TYPE accessType , THREADID t i d ) 156 { 157 // f i r s t l e v e l D−c a c h e 158 c o n s t BOOL d l 1 H i t = d l 1 . A c c e s s ( addr , size , a c c e s s T y p e ) ; 159 160 i f ( ! dl1Hit ) { 161 l 1 c o u n t [ t i d ] . Miss++; 162 U l 2 A c c e s s ( addr , size , accessType , t i d ) ; 163 } 164 else 165 { 166 l 1 c o u n t [ t i d ] . H i t s ++; 167 } 168 } 169 170 LOCALFUN VOID MemRefSingle (ADDRINT addr , UINT32 size , CACHE BASE : : ACCESS TYPE accessType , THREADID t i d ) 171 { 172 // f i r s t l e v e l D−c a c h e 173 c o n s t BOOL d l 1 H i t = d l 1 . A c c e s s S i n g l e L i n e ( addr , a c c e s s T y p e ) ; 174 175 i f ( ! dl1Hit ) { 176 l 1 c o u n t [ t i d ] . Miss++; 177 U l 2 A c c e s s ( addr , size , accessType , t i d ) ; 178 } 179 else 180 { 181 l 1 c o u n t [ t i d ] . H i t s ++; 182 } 183 } 184 185 LOCALFUN VOID I n s t r u c t i o n ( INS i n s , VOID ∗v ) 186 { 187 i f ( INS IsMemoryRead ( i n s ) ) 188 { 189 c o n s t UINT32 s i z e = INS MemoryReadSize ( i n s ) ; 190 c o n s t AFUNPTR countFun = ( s i z e <= 4 ? (AFUNPTR) MemRefSingle : (AFUNPTR) MemRefMulti ) ; 191 192 // o n l y p r e d i c a t e d −on memory i n s t r u c t i o n s a c c e s s D−c a c h e 193 INS InsertPredicatedCall ( 194 i n s , IPOINT BEFORE , countFun , 195 IARG MEMORYREAD EA, 196 IARG MEMORYREAD SIZE, 197 IARG UINT32 , CACHE BASE : : ACCESS TYPE LOAD, 24
  • 26. 198 IARG THREAD ID , 199 IARG END) ; 200 } 201 202 i f ( INS IsMemoryWrite ( i n s ) ) 203 { 204 c o n s t UINT32 s i z e = INS MemoryWriteSize ( i n s ) ; 205 c o n s t AFUNPTR countFun = ( s i z e <= 4 ? (AFUNPTR) MemRefSingle : (AFUNPTR) MemRefMulti ) ; 206 207 // o n l y p r e d i c a t e d −on memory i n s t r u c t i o n s a c c e s s D−c a c h e 208 INS InsertPredicatedCall ( 209 i n s , IPOINT BEFORE , countFun , 210 IARG MEMORYWRITE EA, 211 IARG MEMORYWRITE SIZE, 212 IARG UINT32 , CACHE BASE : : ACCESS TYPE STORE, 213 IARG THREAD ID , 214 IARG END) ; 215 } 216 } 217 218 GLOBALFUN i n t main ( i n t argc , c h a r ∗ argv [ ] ) 219 { 220 P I N I n i t ( argc , argv ) ; 221 222 f o r ( INT32 t =0; t<MaxNumThreads ; t++) 223 { 224 l1count [ t ] . Hits = 0; 225 l 1 c o u n t [ t ] . Miss =0; 226 } 227 228 f o r ( i n t i =0; i <c l u s t e r S i z e ; i ++) 229 { 230 l 2 c o u n t [ i ] . H i t s =0; 231 l 2 c o u n t [ i ] . Miss =0; 232 } 233 234 PIN AddThreadStartFunction ( ThreadStart , 0 ) ; 235 INS AddInstrumentFunction ( I n s t r u c t i o n , 0 ) ; 236 PIN AddFiniFunction ( F i n i , 0 ) ; 237 238 // Never r e t u r n s 239 PIN StartProgram ( ) ; 240 241 return 0 ; // make c o m p i l e r happy 242 }” > dcache . cpp 243 244 make > makeres 245 25
  • 27. 246 echo ” 247 #i n c l u d e <p t h r e a d . h> 248 #i n c l u d e <s t d i o . h> 249 #i n c l u d e < s t d l i b . h> 250 #i n c l u d e <time . h> 251 typedef struct 252 { 253 double ∗a ; 254 double ∗b ; 255 double sum ; 256 int veclen ; 257 } DOTDATA; 258 259 260 #d e f i n e NUMTHRDS $threadsAndMaster 261 #d e f i n e VECLEN 1000000 262 263 DOTDATA d o t s t r ; 264 p t h r e a d t c a l l T h d [NUMTHRDS] ; 265 p t h r e a d m u t e x t mutexsum ; 266 267 v o i d ∗ dotprod ( v o i d ∗ a r g ) 268 { 269 i n t i , s t a r t , end , l e n ; 270 long o f f s e t ; 271 // p r i n t f ( ”%dn ” , ( i n t ) a r g ) ; 272 d o u b l e mysum , ∗x , ∗y ; 273 o f f s e t = ( long ) arg ; 274 275 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 8 ) ; 276 277 len = dotstr . veclen ; 278 // p r i n t f ( ”%dn” , l e n ) ; 279 s t a r t = o f f s e t ∗ ( l e n /NUMTHRDS) ; 280 end = s t a r t + ( l e n /NUMTHRDS) ; 281 x = dotstr . a ; 282 y = dotstr . b ; 283 284 mysum = 0 ; 285 f o r ( i=s t a r t ; i <end ; i ++) 286 mysum += ( x [ i ] ∗ y [ i ] ) ; 287 288 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 9 ) ; 289 290 p t h r e a d m u t e x l o c k (&mutexsum ) ; 291 292 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 1 0 ) ; 293 d o t s t r .sum += mysum ; 294 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 1 1 ) ; 26
  • 28. 295 296 p t h r e a d m u t e x u n l o c k (&mutexsum ) ; 297 298 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 0 ) ; 299 300 // p t h r e a d e x i t ( ( v o i d ∗ ) 0 ) ; 301 } 302 303 304 i n t main ( i n t argc , c h a r ∗ argv [ ] ) 305 { 306 long i ; 307 d o u b l e ∗a , ∗b ; 308 void ∗ s ta t us ; 309 pthread attr t attr ; 310 311 c l o c k t begin , end ; 312 double time spent ; 313 314 b e g i n = clock ( ) ; 315 // E x t r a e i n i t ( ) ; 316 317 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 1 ) ; 318 319 a = ( d o u b l e ∗ ) m a l l o c (NUMTHRDS∗VECLEN∗ s i z e o f ( d o u b l e ) ) ; 320 b = ( d o u b l e ∗ ) m a l l o c (NUMTHRDS∗VECLEN∗ s i z e o f ( d o u b l e ) ) ; 321 322 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 2 ) ; 323 324 f o r ( i =0; i <VECLEN∗NUMTHRDS; i ++) 325 { 326 a [ i ]=1; 327 b [ i ]=a [ i ] ; 328 } 329 330 d o t s t r . v e c l e n = VECLEN; 331 dotstr . a = a ; 332 dotstr . b = b ; 333 d o t s t r .sum=0; 334 335 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 3 ) ; 336 337 p t h r e a d m u t e x i n i t (&mutexsum , NULL) ; 338 339 p t h r e a d a t t r i n i t (& a t t r ) ; 340 p t h r e a d a t t r s e t d e t a c h s t a t e (& a t t r , PTHREAD CREATE JOINABLE) ; 341 342 f o r ( i =0; i < NUMTHRDS; i ++) 343 { 27
  • 29. 344 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 4 ) ; 345 p t h r e a d c r e a t e (& c a l l T h d [ i ] , &a t t r , dotprod , ( v o i d ∗ ) i ) ; 346 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 3 ) ; 347 } 348 349 p t h r e a d a t t r d e s t r o y (& a t t r ) ; 350 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 5 ) ; 351 352 f o r ( i =0; i < NUMTHRDS; i ++) 353 { 354 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 6 ) ; 355 p t h r e a d j o i n ( c a l l T h d [ i ] , &s t a t u s ) ; 356 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 7 ) ; 357 } 358 359 p r i n t f ( ”Sum = %f n” , d o t s t r .sum) ; 360 free (a) ; 361 free (b) ; 362 363 end=clock ( ) ; 364 t i m e s p e n t= ( d o u b l e ) ( end − b e g i n ) / CLOCKS PER SEC ; 365 p r i n t f ( ” E x e c u t i o n time : %f n ” , t i m e s p e n t ) ; 366 367 // E x t r a e f i n i ( ) ; 368 369 p t h r e a d m u t e x d e s t r o y (&mutexsum ) ; 370 p t h r e a d e x i t (NULL) ; 371 } 372 ” > dotprod . c 373 374 #echo ” Compiling dotprod ” 375 g c c −o dotprod dotprod . c −l p t h r e a d 376 377 #echo ” Running p i n t o o l ” 378 cd / s c r a t c h / boada −1/etm022 / p i n 379 . / p i n −t / s c r a t c h / boada −1/etm022 / p i n / s o u r c e / t o o l s /Memory/ obj− i n t e l 6 4 / dcache . s o −− / s c r a t c h / boada −1/etm022 / p i n / s o u r c e / t o o l s /Memory/ dotprod > / s c r a t c h / boada −1/etm022 / p i n / s o u r c e / t o o l s / Memory/ r e s u l t s / r e s −$1−$2−$3−$4−$5−$6−$7−$8−$9−${10}−$ { 1 1 } . r e s 380 381 mv a l l c a c h e . out / s c r a t c h / boada −1/etm022 / p i n / s o u r c e / t o o l s /Memory/ r e s u l t s / r e s −$1−$2−$3−$4−$5−$6−$7−$8−$9−${10}−$ { 1 1 } . a l l c a c h e 382 383 cd / s c r a t c h / boada −1/etm022 / p i n / s o u r c e / t o o l s /Memory 384 echo ” done ! ” A.2.2 Running the experiments 28
  • 30. 1 #! / b i n / b a s h 2 # 3 #S c r i p t by aknahs ( Mario Almeida ) 4 # 5 #c l u s t e r S i z e 6 # c o n s t UINT32 c a c h e S i z e = 256∗KILO ; 7 # c o n s t UINT32 l i n e S i z e = 1 ; 8 # c o n s t UINT32 a s s o c i a t i v i t y = 2 5 6 ; 9 # l u s t e r S i z e > <L 1 c a c h e s i z e > <L 1 l i n e S i z e > <L1assoc> <L 2 c a c h e s i z e <c > <L 2 l i n e S i z e > <L2assoc> <L 3 c a c h e s i z e > <L 3 l i n e S i z e > <L3assoc> <nThreads> 10 # $1 $2 $3 $4 $5 $6 $7 $8 $9 $10 $11 11 12 rm − r f r e s u l t s 13 mkdir r e s u l t s 14 15 t o t a l=$ ( ( 3 ∗ 4 ∗ 3 ∗ 3 ∗ 2 ∗ 3 ∗ 2 ) ) 16 n=0 17 r e s 1=$ ( date +%s .%N) 18 19 20 #c l u s t e r S i z e 21 f o r c s i n 2 4 8 22 do 23 f o r mt i n 2 4 8 16 24 do 25 #L 1 c a c h e S i z e 26 f o r l 1 c i n 16 32 64 27 do 28 #L 1 l i n e S i z e 29 f o r l 1 l i n 32 #64 128 30 do 31 #L1assoc 32 f o r l 1 a i n 1 #2 4 33 do 34 #L 2 c a c h e S i z e 35 for l 2 c in 1 2 4 36 do 37 #L 2 l i n e S i z e 38 f o r l 2 l i n 32 64 #128 39 do 40 #L2assoc 41 f o r l 2 a i n 1 #2 4 42 do 43 #L 3 c a c h e S i z e 44 f o r l 3 c i n 4 8 16 29
  • 31. 45 do 46 #L 3 l i n e S i z e 47 f o r l 3 l i n 32 64 #128 48 do 49 #L3assoc 50 f o r l 3 a i n 1 #2 4 51 do 52 clear 53 cat logo 54 echo ”−−−−−−−−−−−−−−−−−−−−−−−−−−by aknahs ” 55 echo −n ” G e n e r a t i n g [ $n/ $ t o t a l ] . . . ” 56 r e s 2=$ ( date +%s .%N) 57 p r i n t f ” Elapsed : %.3Fn” $ ( echo ” $ r e s 2 − $ r e s 1 ” | bc ) 58 59 n=$ ( ( $n + 1 ) ) 60 #echo ”−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− 61 #echo ” G e n e r a t i n g CPP and Make” 62 #echo ” . / genMakeCPP $ c s $ l 1 c $ l 1 l $ l 1 a $ l 2 c $ l 2 l $ l 2 a $ l 3 c $ l 3 l $ l 3 a $mt” 63 . / genMakeCPP $ c s $ l 1 c $ l 1 l $ l 1 a $ l 2 c $ l 2 l $ l 2 a $ l 3 c $ l 3 l $ l 3 a $mt 64 echo ” . ” 65 done 66 done 67 done 68 done 69 done 70 done 71 done 72 done 73 done 74 done 75 done 76 echo ” a l l done . ” 77 78 g r e p ” T o t a l Miss Rate ” r e s u l t s / ∗ . a l l c a c h e | awk ’BEGIN{n=0; p r i n t f ” C l u s t e r S i z e , L1 Cache S i z e , L1 L ine S i z e , L1 A s s o c i a t i o n , L2 Cache S i z e , L2 Li ne S i z e , L2 A s s o c i a t i o n , L3 Cache S i z e , L3 L in e S i z e , L3 A s s o c i a t i o n , Number o f t h r e a d s , T o t a l Miss Caches n”} { s p l i t ( $1 , a , ” . ” ) ; s p l i t ( a [ 1 ] , b ,” −”) ; p r i n t f ”%d,%s ,%s ,%s ,%s ,%s ,%s ,%s ,%s ,%s ,%s ,%s ,% s n ” , n%3 +1,b [ 2 ] , b [ 3 ] , b [ 4 ] , b [ 5 ] , b [ 6 ] , b [ 7 ] , b [ 8 ] , b [ 9 ] , b [ 1 0 ] , b [ 1 1 ] , b [ 1 2 ] , $5 ;++n} ’ >> r e s u l t s / b r u t a l d b . c s v 30
  • 32. A.2.3 Importing the results to a database 1 #! / b i n / b a s h 2 # 3 # S c r i p t by aknahs ( Mario Almeida ) 4 # 5 6 rm − r f power 7 mkdir power 8 9 s q l i t e 3 power / r e s . db ’CREATE TABLE r e s ( c a c h e l e v e l INTEGER, c l u s t e r INTEGER, l 1 s i z e INTEGER, l 1 l i n e INTEGER, l 1 a s s o c INTEGER , l 2 s i z e INTEGER, l 2 l i n e INTEGER, l 2 a s s o c INTEGER, l 3 s i z e INTEGER , l 3 l i n e INTEGER, l 3 a s s o c INTEGER, t h r e a d s INTEGER, m i s s r a t e REAL ); ’ 10 11 echo ” I m p o r t i n g t o d a t a b a s e ” 12 echo ” . s e p a r a t o r ” , ” ” > power /command 13 echo ” . import b r u t a l d b . c s v r e s ” >> power /command 14 s q l i t e 3 power / r e s . db < power /command 15 rm power /command 16 17 echo ” done ” 18 19 ./ graphall A.2.4 Generating graphs 1 #! / b i n / b a s h 2 # 3 # S c r i p t by aknahs ( Mario Almeida ) 4 # 5 6 #s q l i t e 3 power / r e s . db ’CREATE TABLE r e s ( c a c h e l e v e l INTEGER, c l u s t e r INTEGER, l 1 s i z e INTEGER, l 1 l i n e INTEGER, l 1 a s s o c INTEGER , l 2 s i z e INTEGER, l 2 l i n e i n e INTEGER, l 2 a s s o c INTEGER, l 3 s i z e INTEGER, l 3 l i n e i n e INTEGER, l 3 a s s o c INTEGER, t h r e a d s INTEGER, m i s s r a t e REAL) ; ’ 7 8 mkdir power / c l u s t e r 2 9 mkdir power / c l u s t e r 4 10 mkdir power / c l u s t e r 8 11 12 #f o r i n s t r u m e n t a t i o n l e v e l 13 f o r set i n 1 2 3 14 do 15 #f o r each c l u s t e r s i z e 31
  • 33. 16 for cs in 2 4 8 17 do 18 #f o r each l e v e l o f c a ch e 19 for l in 1 2 3 20 do 21 22 i f [ $ s e t == 1 ] 23 then 24 f i l e n a m e=” power / c l u s t e r $ { c s }/L${ l } MissRate−L1Size16 −c l u s t e r $ { c s } ” 25 s q l=” s e l e c t l $ { l } s i z e , m i s s r a t e from r e s where c l u s t e r = $ c s and l 1 s i z e = 16 and c a c h e l e v e l = $ { l } and l 1 l i n e = 32 and l 2 l i n e = 32 and l 3 l i n e = 32 ” 26 t i t l e=” MissRate o f c a c h e $ { l } p e r L${ l } s i z e ( L s i z e = [ 1 6 , . , . ] ) ” 27 xlabel=” S i z e o f c a c h e L${ l } ” 28 fi 29 30 i f [ $ s e t == 2 ] 31 then 32 f i l e n a m e=” power / c l u s t e r $ { c s }/L${ l } MissRate−c l u s t e r $ { c s } ” 33 s q l=” s e l e c t l $ { l } s i z e , m i s s r a t e from r e s where c l u s t e r = $ c s and c a c h e l e v e l = $ { l } and l 1 l i n e = 32 and l 2 l i n e = 32 and l 3 l i n e = 32 ” 34 t i t l e=” MissRate o f c a c h e $ { l } p e r L${ l } s i z e ” 35 xlabel=” S i z e o f c a c h e L${ l } ” 36 f i 37 38 i f [ $ s e t == 3 ] 39 then 40 f i l e n a m e=” power / c l u s t e r $ { c s }/L${ l } MissRate−L1Size16 −L 2 s i z e 1 − c l u s t e r $ { c s }” 41 s q l=” s e l e c t l $ { l } s i z e , m i s s r a t e from r e s where c l u s t e r = $ c s and l 1 s i z e = 16 and l 2 s i z e = 1 and c a c h e l e v e l = ${ l } and l 1 l i n e = 32 and l 2 l i n e = 32 and l 3 l i n e = 32 ” 42 t i t l e=” MissRate o f c a c h e L${ l } p e r L${ l } s i z e ( L s i z e = [ 1 6 , 1 , . ] ) ” 43 xlabel=” S i z e o f c a c h e L${ l } ” 44 f i 45 46 i f [ [ $ s e t = 1 && $ l = 1 ] ] 47 then 48 continue 49 f i 50 51 i f [ [ $ s e t == 3 && ( $ l == 1 | | $ l == 2 ) ] ] 52 then 53 continue 54 f i 55 56 32
  • 34. 57 echo ” G e n e r a t i n g Graph f o r s e t $ s e t on c a c h e l e v e l $ l ” 58 g n u p l o t << EOF 59 set d a t a f i l e s e p a r a t o r ” | ” 60 61 # Line s t y l e f o r a x e s 62 set s t y l e l i n e 80 l t rgb ”#808080” 63 64 # Line s t y l e f o r g r i d 65 set s t y l e l i n e 81 l t 0 # dashed 66 set s t y l e l i n e 81 l t rgb ”#808080” # g r e y 67 68 set grid back l i n e s t y l e 81 69 set b o r d e r 3 back l i n e s t y l e 80 # Remove b o r d e r on t o p and r i g h t . These 70 # b o r d e r s a r e u s e l e s s and make i t h a r d e r 71 # t o s e e p l o t t e d l i n e s near t h e b o r d e r . 72 # Also , p u t i t i n g r e y ; no need f o r so much emphasis on a border . 73 set x t i c s n o m i r r o r 74 set y t i c s n o m i r r o r 75 76 #s e t l o g x 77 #s e t m x t i c s 10 # Makes l o g s c a l e l o o k good . 78 79 # Line s t y l e s : t r y t o p i c k p l e a s i n g c o l o r s , r a t h e r 80 # than s t r i c t l y primary c o l o r s or hard−to−s e e c o l o r s 81 # l i k e g n u p l o t ’ s d e f a u l t y e l l o w . Make t h e l i n e s t h i c k 82 # so t h e y ’ r e e a s y t o s e e i n s m a l l p l o t s i n p a p e r s . 83 set s t y l e l i n e 1 l t rgb ”#A00000 ” lw 2 ps 1 pt 1 84 set s t y l e l i n e 2 l t rgb ”#00A000” lw 2 ps 1 pt 6 85 set s t y l e l i n e 3 l t rgb ”#5060D0” lw 2 ps 1 pt 2 86 set s t y l e l i n e 4 l t rgb ”#F25900 ” lw 2 ps 1 pt 9 87 88 #s e t key t o p r i g h t 89 90 #s e t x r a n g e [ 0 : 1 ] 91 set yrange [ 0 : 1 0 0 ] 92 93 #p l o t ” t e m p l a t e . d a t ” 94 #i n d e x 0 t i t l e ” Example l i n e ” w l p l s 1 , 95 #”” i n d e x 1 t i t l e ” Another example ” w l p l s 2 96 97 #s e t s t y l e d a t a l i n e s 98 #s e t key o u t s i d e 99 #s e t x t i c s r o t a t e by −45 100 #s e t s i z e r a t i o 0 . 8 101 set t i t l e ” $ t i t l e ” 102 set xlabel ” $ x l a b e l ” 103 $aux 33
  • 35. 104 $aux2 105 set ylabel ” T o t a l Miss Rate (%)” 106 plot ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 2 ’ ” u s i n g 1 : 2 with p o i n t s l s 1 t i t l e ’#p r o c s = 2 ’ , 107 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 4 ’ ” u s i n g 1 : 2 with p o i n t s l s 2 t i t l e ’#p r o c s = 4 ’ , 108 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 8 ’ ” u s i n g 1 : 2 with p o i n t s l s 3 t i t l e ’#p r o c s = 8 ’ , 109 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 1 6 ’ ” u s i n g 1 : 2 with p o i n t s l s 4 t i t l e ’#p r o c s = 16 ’ 110 111 set terminal p d f c a i r o f o n t ” G i l l Sans , 7 ” l i n e w i d t h 4 rounded 112 #s e t t e r m i n a l p d f c a i r o s i z e 30cm, 1 5cm 113 set output ” ${ f i l e n a m e } . pdf ” 114 replot 115 EOF 116 done 117 done 118 done 119 120 echo ”Done” 121 122 f i l e n a m e=” power / L2MissRate−L1Size32 −L 2 s i z e 4 −l 3 s i z e 4 −v a r C l u s t e r ” 123 s q l=” s e l e c t c l u s t e r , m i s s r a t e from r e s where l 1 s i z e = 32 and l 2 s i z e = 4 and l 3 s i z e = 4 and c a c h e l e v e l = 2 and l 1 l i n e = 32 and l 2 l i n e = 32 and l 3 l i n e = 32 ” 124 t i t l e=” MissRate o f c a c h e L2 p e r c l u s t e r s i z e ( L s i z e = [ 3 2 , 4 , 4 ] ) ” 125 xlabel=” C l u s t e r s i z e ” 126 127 echo ” G e n e r a t i n g Graph f o r s e t v a r i a b l e c l u s t e r s ” 128 g n u p l o t << EOF 129 set d a t a f i l e s e p a r a t o r ” | ” 130 131 # Line s t y l e f o r a x e s 132 set s t y l e l i n e 80 l t rgb ”#808080” 133 134 # Line s t y l e f o r g r i d 135 set s t y l e l i n e 81 l t 0 # dashed 136 set s t y l e l i n e 81 l t rgb ”#808080” # g r e y 137 138 set grid back l i n e s t y l e 81 139 set b o r d e r 3 back l i n e s t y l e 80 # Remove b o r d e r on t o p and r i g h t . These 140 # b o r d e r s a r e u s e l e s s and make i t h a r d e r 141 # t o s e e p l o t t e d l i n e s near t h e b o r d e r . 142 # Also , p u t i t i n g r e y ; no need f o r so much emphasis on a border . 143 set x t i c s n o m i r r o r 144 set y t i c s n o m i r r o r 34
  • 36. 145 146 #s e t l o g x 147 #s e t m x t i c s 10 # Makes l o g s c a l e l o o k good . 148 149 # Line s t y l e s : t r y t o p i c k p l e a s i n g c o l o r s , r a t h e r 150 # than s t r i c t l y primary c o l o r s or hard−to−s e e c o l o r s 151 # l i k e g n u p l o t ’ s d e f a u l t y e l l o w . Make t h e l i n e s t h i c k 152 # so t h e y ’ r e e a s y t o s e e i n s m a l l p l o t s i n p a p e r s . 153 set s t y l e l i n e 1 ps 1 pt 1 154 set s t y l e l i n e 2 ps 1 pt 6 155 set s t y l e l i n e 3 ps 1 pt 2 156 set s t y l e l i n e 4 ps 1 pt 9 157 158 #s e t key t o p r i g h t 159 160 #s e t x r a n g e [ 0 : 1 ] 161 #s e t y r a n g e [ 0 : 1 ] 162 163 #p l o t ” t e m p l a t e . d a t ” 164 #i n d e x 0 t i t l e ” Example l i n e ” w l p l s 1 , 165 #”” i n d e x 1 t i t l e ” Another example ” w l p l s 2 166 167 #s e t s t y l e d a t a l i n e s 168 #s e t key o u t s i d e 169 #s e t x t i c s r o t a t e by −45 170 #s e t s i z e r a t i o 0 . 8 171 set t i t l e ” $ t i t l e ” 172 set xlabel ” $ x l a b e l ” 173 $aux 174 $aux2 175 set ylabel ” T o t a l Miss Rate (%)” 176 plot ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 2 ’ ” u s i n g 1 : 2 with l p l s 1 t i t l e ’#p r o c s = 2 ’ , 177 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 4 ’ ” u s i n g 1 : 2 with l p l s 2 t i t l e ’#p r o c s = 4 ’ , 178 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 8 ’ ” u s i n g 1 : 2 with l p l s 3 t i t l e ’#p r o c s = 8 ’ , 179 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 1 6 ’ ” u s i n g 1 : 2 with l p l s 4 t i t l e ’#p r o c s = 16 ’ 180 181 set terminal p d f c a i r o f o n t ” G i l l Sans , 7 ” l i n e w i d t h 4 rounded 182 #s e t t e r m i n a l p d f c a i r o s i z e 30cm, 1 5cm 183 set output ” ${ f i l e n a m e } . pdf ” 184 replot 185 EOF 186 187 f i l e n a m e=” power / L2MissRate−L1Size16 −L 2 s i z e 1 −l 3 s i z e 4 −v a r C l u s t e r ” 188 s q l=” s e l e c t c l u s t e r , m i s s r a t e from r e s where l 1 s i z e = 16 and l 2 s i z e = 1 and l 3 s i z e = 4 and c a c h e l e v e l = 2 and l 1 l i n e = 32 35
  • 37. and l 2 l i n e = 32 and l 3 l i n e = 32 ” 189 t i t l e=” MissRate o f c a c h e L2 p e r c l u s t e r s i z e ( L s i z e = [ 1 6 , 1 , 4 ] ) ” 190 xlabel=” C l u s t e r s i z e ” 191 192 echo ” G e n e r a t i n g Graph f o r s e t v a r i a b l e c l u s t e r s ” 193 g n u p l o t << EOF 194 set d a t a f i l e s e p a r a t o r ” | ” 195 196 # Line s t y l e f o r a x e s 197 set s t y l e l i n e 80 l t rgb ”#808080” 198 199 # Line s t y l e f o r g r i d 200 set s t y l e l i n e 81 l t 0 # dashed 201 set s t y l e l i n e 81 l t rgb ”#808080” # g r e y 202 203 set grid back l i n e s t y l e 81 204 set b o r d e r 3 back l i n e s t y l e 80 # Remove b o r d e r on t o p and r i g h t . These 205 # b o r d e r s a r e u s e l e s s and make i t h a r d e r 206 # t o s e e p l o t t e d l i n e s near t h e b o r d e r . 207 # Also , p u t i t i n g r e y ; no need f o r so much emphasis on a border . 208 set x t i c s n o m i r r o r 209 set y t i c s n o m i r r o r 210 211 #s e t l o g x 212 #s e t m x t i c s 10 # Makes l o g s c a l e l o o k good . 213 214 # Line s t y l e s : t r y t o p i c k p l e a s i n g c o l o r s , r a t h e r 215 # than s t r i c t l y primary c o l o r s or hard−to−s e e c o l o r s 216 # l i k e g n u p l o t ’ s d e f a u l t y e l l o w . Make t h e l i n e s t h i c k 217 # so t h e y ’ r e e a s y t o s e e i n s m a l l p l o t s i n p a p e r s . 218 set s t y l e l i n e 1 ps 1 pt 1 219 set s t y l e l i n e 2 ps 1 pt 6 220 set s t y l e l i n e 3 ps 1 pt 2 221 set s t y l e l i n e 4 ps 1 pt 9 222 223 #s e t key t o p r i g h t 224 225 #s e t x r a n g e [ 0 : 1 ] 226 #s e t y r a n g e [ 0 : 1 ] 227 228 #p l o t ” t e m p l a t e . d a t ” 229 #i n d e x 0 t i t l e ” Example l i n e ” w l p l s 1 , 230 #”” i n d e x 1 t i t l e ” Another example ” w l p l s 2 231 232 #s e t s t y l e d a t a l i n e s 233 #s e t key o u t s i d e 234 #s e t x t i c s r o t a t e by −45 36
  • 38. 235 #s e t s i z e r a t i o 0 . 8 236 set t i t l e ” $ t i t l e ” 237 set xlabel ” $ x l a b e l ” 238 $aux 239 $aux2 240 set ylabel ” T o t a l Miss Rate (%)” 241 plot ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 2 ’ ” u s i n g 1 : 2 with l p l s 1 t i t l e ’#p r o c s = 2 ’ , 242 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 4 ’ ” u s i n g 1 : 2 with l p l s 2 t i t l e ’#p r o c s = 4 ’ , 243 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 8 ’ ” u s i n g 1 : 2 with l p l s 3 t i t l e ’#p r o c s = 8 ’ , 244 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 1 6 ’ ” u s i n g 1 : 2 with l p l s 4 t i t l e ’#p r o c s = 16 ’ 245 246 set terminal p d f c a i r o f o n t ” G i l l Sans , 7 ” l i n e w i d t h 4 rounded 247 #s e t t e r m i n a l p d f c a i r o s i z e 30cm, 1 5cm 248 set output ” ${ f i l e n a m e } . pdf ” 249 replot 250 EOF 37