    Universitat Politècnica de Catalunya

Measurement and Tools Project Report


  Dimemas and Multi-level
     Cache Simulations


  Author:                                 Supervisor:
  Mário Almeida                           Alejandro Ramirez Bellido




                  June 22, 2012
Contents

1 Introduction

2 Methodology
  2.1 Dimemas Simulation
  2.2 Multi-Level Cache Simulation

3 Results
  3.1 Dimemas Simulation
  3.2 Multi-Level Cache Simulation

4 Conclusions

A Used Scripts
  A.1 Dimemas instrumentation
      A.1.1 Generating Dimemas Configuration
      A.1.2 Running experiments
      A.1.3 Graph generator
      A.1.4 Generating graphs
  A.2 Pin tool instrumentation
      A.2.1 Generate and Compile Application and DCache tool
      A.2.2 Running the experiments
      A.2.3 Importing the results to a database
      A.2.4 Generating graphs




Abstract
      This report describes the simulation and benchmarking steps taken in
      order to predict the parallel performance of an application using
      Dimemas and cache-level simulations. Using Dimemas [3], the time
      behaviour of the NAS [1] integer sort benchmark was simulated for the
      architecture of the Barcelona supercomputer MareNostrum [4]. The
      performance was evaluated as a function of the architecture latency,
      bandwidth, connectivity and CPU speed. For the cache-level simulations,
      Intel's Pin tool was used to benchmark a simple parallel application as
      a function of the cache and cluster sizes.


1    Introduction
This report describes the simulation and benchmarking steps taken in order
to predict the parallel performance of an application using Dimemas [3] and
Cache-level simulations.
    Previous work was focused on benchmarking a PARSEC [2] ray-tracing
application on the multi-processor Boada server. For this purpose, EXTRAE
and Paraver [5] were used to instrument the application and provide a
detailed quantitative analysis of its performance.
    Following the study of measurement tools and techniques, this report
describes the usage of Dimemas to simulate the time behaviour of another
benchmark application on the Barcelona supercomputer MareNostrum. This
time, the traces used were taken from a NAS benchmark application, also run
on the Boada server. The performance of the application in this simulation
environment was evaluated as a function of the architecture latency,
bandwidth, connectivity and CPU speed.
    To conclude this study on performance analysis, Cache-Level Simulations
were performed using Intel's Pin tool. The chosen application was a simple
parallel application that performs distributed arithmetic operations. It
represents the typical Master-Slave paradigm with an embarrassingly parallel
workload. For evaluating the cache architecture, the total cache miss rates
per cache level were calculated as a function of the cache sizes,
associativity, number of threads and cluster size.


2    Methodology
This section presents the two simulation configurations: the Dimemas
simulation and the Multi-Level Cache simulation. Both subsections describe
the tools used, the configuration values and the metrics employed.

                          Boada Server
                          Bandwidth          1 Gb/s
                          Latency            6-10 us
                          Number of cores    12
                          RAM                24 GB

                    Table 1: Boada server configuration.

2.1    Dimemas Simulation
The application chosen for this experiment was the NAS Parallel Benchmark
integer sort. The NAS benchmark is a set of programs designed to help
evaluate the performance of parallel supercomputers. In this case, the
benchmark was run on the Boada server, whose attributes are described in
table 1.

    In order to perform an architecture simulation, it was decided to use the
MareNostrum supercomputer configuration, whose parameters are shown in
table 2. Note that a simplification was made, since it was considered that
each processor runs a single thread. Starting from MareNostrum's original
architecture, multiple simulations were performed changing its attributes.
For this purpose, the script in section A.1.1 was created to generate Dimemas
configuration files, and another (section A.1.2) to automate their variation.
The changed attributes of the simulated architecture were the latency, CPU
speed, bandwidth and number of buses. All the measurements were stored in an
sqlite3 database and then queried in order to automatically generate, with
gnuplot, the graphs (section A.1.3) presented in section 3.
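
The scripts in appendix A.1 implement this pipeline in full; the condensed
sketch below shows a single measurement step. The script and trace names
(config_gen, in/mpi_ping_32.trf) and the Dimemas3 flags are reconstructed from
the garbled listings in the appendix, so they should be read as an assumption
of what the pipeline looked like rather than as verbatim commands.

    #!/bin/bash
    # One measurement: 32 threads, 16 buses, default latency/bandwidth/CPU.
    NTHREADS=32; BUSES=16; LATENCY=0.000008; BANDWIDTH=250.0; CPU=1.0

    # Generate a Dimemas configuration for this parameter combination (A.1.1).
    ./config_gen in/mpi_ping_32.trf $NTHREADS $BUSES $LATENCY $BANDWIDTH $CPU > run.cfg

    # Simulate with Dimemas and keep its textual report (A.1.2).
    ./Dimemas3 -S 32K -pa run.prv run.cfg > run.detail

    # Extract the predicted execution time and append a CSV row.
    echo -n "$NTHREADS,$BUSES,$LATENCY,$BANDWIDTH,$CPU," >> res.csv
    grep Execution run.detail | awk '{print $3}' >> res.csv

    # Import the CSV into the sqlite3 table queried by the gnuplot scripts (A.1.3).
    sqlite3 res.db 'CREATE TABLE IF NOT EXISTS dimemas (procs INTEGER, buses INTEGER,
      latency REAL, bandwidth REAL, cpu REAL, runtime REAL);'
    printf '.separator ","\n.import res.csv dimemas\n' | sqlite3 res.db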

    To conclude, the changed attributes were iteratively fixed at chosen
near-optimal values in order to find a final architecture that needs fewer
resources while having execution times similar to the original MareNostrum
configuration.

2.2    Multi-Level Cache Simulation
To conclude this study on performance analysis, Cache-Level Simulations
were performed using Intel's Pin tool. The chosen application was a simple
parallel application that performs distributed arithmetic operations. It
represents the typical Master-Slave paradigm with an embarrassingly parallel
workload.
    For evaluating the cache architecture, the Pin dcache tool was changed in
order to support multiple levels of cache shared by parallel processors. The
implemented cache architecture is represented in figure 1. As one might infer
from the figure, the level-two cache is cluster shared and the level-three
cache is globally shared.


[Figure 1 diagram: processors P0-P7 each have a private 16 KB L1 cache and
share one 1 MB L2 cache; processors P8-P15 share a second 1 MB L2 cache; both
clusters share a single 4 MB L3 cache.]
Figure 1: Cache architecture for a cluster size of 8 and a total of 16 processors.
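
The tool in appendix A.2.1 regenerates and recompiles the dcache Pin tool for
each cache configuration and then runs the test application under it. As an
illustration, the baseline of figure 1 could be requested as follows; the
script name gen_and_run.sh is hypothetical (the listing is unnamed in this
report), and the 32-byte line sizes and associativity of 1 are the values
swept in appendix A.2.2 rather than values stated in the text.

    # Arguments: <clusterSize> <L1size(KB)> <L1line> <L1assoc>
    #            <L2size(MB)> <L2line> <L2assoc>
    #            <L3size(MB)> <L3line> <L3assoc> <nThreads>
    ./gen_and_run.sh 8 16 32 1  1 32 1  4 32 1  16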

    For these experiments, the total cache miss rates per cache level were
calculated as a function of the cache sizes, the number of processors and the
cluster size. Some experiments were also performed varying the cache
associativity and the number of cache lines per cache set.
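
Here, the total miss rate of a level is taken to be the aggregate ratio of
misses to accesses at that level, summed over all threads and over all caches
of that level; this matches the per-cache hit and miss counters printed by the
modified dcache tool in appendix A.2.1, but the exact aggregation is an
assumption, since the report does not spell it out:

\[
\text{MissRate}_L = 100 \times
\frac{\sum_{c} \text{Miss}_{L,c}}{\sum_{c} \left(\text{Hits}_{L,c} + \text{Miss}_{L,c}\right)}
\]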


3     Results
In this section the results of both experiments are described, together with
the resulting charts and a discussion of each.

3.1    Dimemas Simulation
Starting with the initial architecture of MareNostrum, the first experiment
consisted of varying the number of buses and observing its impact on the
execution time of our application. The results of this experiment are
depicted in figure 2.

    As can be observed in figure 2, the execution time decreases as the
number of buses increases. This result was expected, since this is a

[Plot: execution time (s) versus number of buses (0-20), one curve per
process count (2, 4, 8, 16, 32).]


Figure 2: Execution time of IntegerSort depending on the number of buses.

multi-threaded application in which data is transferred between threads, and
adding more buses increases the amount of data that can be transferred in
parallel. It can also be seen that from sixteen buses onwards the execution
time starts to stabilize. This is probably because most of the data to be
sent is already sent in parallel, so further increasing the number of buses
does not impact the performance.

    The second experiment consisted of varying the available bandwidth of the
initial MareNostrum configuration. The results are shown in figure 3.


[Plot: execution time (s) versus bandwidth (170-250 MB/s), one curve per
process count (2, 4, 8, 16, 32).]


Figure 3: Execution time of IntegerSort depending on the bandwidth (MB/s).



Figure 3 shows that the bandwidth has a bigger impact on performance if the
application is run on a smaller set of threads. For example, a variation of
40 MB/s can increase the execution time by 20 seconds for four threads, but
for 32 threads the changes are almost unnoticeable. This is probably due to
the fact that the master thread has to send the initial data to all slaves:
by increasing the number of slaves, the data can be divided into smaller
chunks that are sent in parallel, thus taking less time.

    The third experiment consisted of varying the processing capacity of the
CPU. As one can observe in figure 4, increasing the processing power of each
processor decreases the execution time. This impact is more noticeable for
processing capacities below 100%. This parameter leaves little room for
saving resources by decreasing the CPU power, since even a small decrease has
a big impact on the execution time.


[Plot: execution time (s) versus relative CPU speed (0-5), one curve per
process count (2, 4, 8, 16, 32).]


Figure 4: Execution time of IntegerSort depending on the available CPU
(%).

    To conclude the experiments on the variation of the architecture
parameters, figure 5 shows the impact of latency on the execution time.
    For figure 5 a logarithmic scale was chosen for the x axis, since changes
of the same order of magnitude as the initial MareNostrum configuration do
not have a significant impact on the execution time. The latency can be
increased to significantly larger values without affecting the performance
much, since the latency values in MareNostrum are very small. Only for
latency values close to 0.01 seconds do we start seeing bigger increases in
the execution time. This attribute should have a bigger impact for more
communication-intensive

[Plot: execution time (s) versus latency (1e-06 to 1 s, logarithmic scale),
one curve per process count (2, 4, 8, 16, 32).]


   Figure 5: Execution time of IntegerSort depending on the latency (s).

applications.




Figure 6: Execution time of IntegerSort depending on the number of threads.

    To conclude, table 2 presents the differences between two less
resource-demanding configurations and the original MareNostrum configuration,
all achieving similar execution times. The chosen number of threads was 32,
due to its better performance as shown in

        Parameters             MareNostrum    Config 1    Config 2
        CPU (%)                    1.0           0.95        0.9
        Latency (s)             0.000008        0.0001      0.001
        Bandwidth (MB/s)          250.0          240         230
        Number of buses           20+ *           16          16
        Execution time (s)        12.506        13.150      13.779

Table 2: Comparison between the execution times of the initial MareNostrum
configuration and two less resource-demanding configurations.

figure 6.
    Table 2 confirms the predictions made in the previous experiments. The
chosen values increase the execution time by at most around 1.3 seconds,
while reducing most parameters by around 10% and significantly increasing the
latency.

3.2    Multi-Level Cache Simulation
As previously mentioned, the chosen application was a simple parallel ap-
plication that performs distributed arithmetic operations. It represents the
typical Master-Slave paradigm with an embarrassingly parallel workload.

[Plot: total L2 miss rate (%) versus cluster size (2-8), one curve per thread
count (2, 4, 8, 16).]


Figure 7: Miss rate of the L2 cache as a function of the cluster size, for
L1, L2 and L3 sizes of 16 KB, 1 MB and 4 MB respectively.

    For evaluating the cache architecture, its configuration was varied along
multiple factors, such as the cluster size, the cache sizes and the cache
line sizes. To start these experiments, the cache architecture was set as
shown in figure 1: it has 16 processors, each with a private L1 cache of
16 KB. The level-two cache has 1 MB and is shared within a cluster of size 8.
Finally, the level-three cache is globally shared and has a size of 4 MB.
    The first experiment consisted of varying the cluster size, as shown in
figure 7, and verifying its impact on the L2 cache miss rate. As can be seen,
for the thread counts used in this experiment the impact of the cluster size
on the miss rates was not very significant. For up to 4 threads it has almost
no impact at all, but when the system has more than 8 threads it can reduce
the miss rate by about 2%. It is interesting to notice that, in this
experiment, the more threads share the same L2 cache, the lower the miss rate
becomes.
    Since most cache size configurations produced similar variations in the
cluster size experiment, the next step consisted of verifying the impact of
the cache sizes on the miss rates. The first step consisted of varying the
size of the non-shared L1 cache; its results are presented in figure 8.

[Plot: total L1 miss rate (%) versus L1 cache size (16-64 KB), one curve per
thread count (2, 4, 8, 16).]


    Figure 8: Miss rate of the L1 cache for a variable L1 cache size (KB).

    Looking at figure 8, it might seem strange that a smaller number of
threads has such a lower miss rate. This is due to the master/slave paradigm,
in which an increasing number of threads makes the data accesses more sparse.
For bigger numbers of threads the miss rates can reach values close to 15%.
As expected, bigger L1 cache sizes achieve smaller miss rates, although the
difference is not greater than 2%.
    Although the experiments were performed for more L1 cache sizes, in order
to study the impact of the L2 cache size the L1 cache size was fixed at
16 KB. The variation of the L2 cache size is presented in figure 9. As one
can observe, the miss rate of the L2 cache for 2 threads is high, being close
to 50%. This is probably a consequence of the low miss rate of the L1 cache:
the accesses

[Plot: total L2 miss rate (%) versus L2 cache size (1-4 MB), one curve per
thread count (2, 4, 8, 16).]


Figure 9: Miss rate of the L2 cache for a variable L2 cache size (MB) and an
L1 cache size of 16 KB.

that do not produce hits in L1 should have lower predictability. For bigger
numbers of threads the miss rates are still high, although they do not reach
values higher than 33%.

[Plot: total L3 miss rate (%) versus L3 cache size (4-16 MB), one curve per
thread count (2, 4, 8, 16).]


Figure 10: Miss rate of the L3 cache for a variable L3 cache size (MB) and an
L1 cache size of 16 KB.

    Finally, the impact of the L3 cache size on its miss rate is shown in
figure 10. It seems that accesses that do not produce hits in the previous
two levels of cache will hardly produce hits in the third level. The only
exception is the 2-thread case, for which the set of data accessed by each
thread is bigger. This probably shows that either the application does not
justify the use of three levels of cache, or the data accessed by each thread
at each moment is too small.


4    Conclusions
Dimemas allowed us to experiment with the theoretical performance of the
application on the MareNostrum architecture. Through the variation of each
different parameter it was possible to create graphs depicting their impact
on the execution time. By the end of the experiment it was possible to
suggest an architecture with fewer resources that achieves results similar to
the initial MareNostrum architecture. This architecture is presented in
table 2 and confirms the predictions made in the Dimemas experiments. The
chosen values increase the execution time by at most around 1.3 seconds,
while reducing most parameters by around 10% and significantly increasing the
latency.
    For the second experiment, the impact of the cluster size and cache sizes
was presented for a simple parallel arithmetic application. The experiments
showed that the impact of the cluster size on the miss rate was not very
significant; for more than 8 threads it can reduce the miss rate by about 2%.
Overall, the more threads share the same L2 cache, the lower the miss rate
becomes. This is due to the master/slave paradigm, in which an increasing
number of threads makes the data accesses more sparse. As expected, bigger L1
cache sizes achieve smaller miss rates. For big numbers of threads the miss
rates in the L2 cache were high, although they did not reach values higher
than 33%. In general, accesses that did not produce hits in the first two
levels of cache hardly produced hits in the third level. The experiments
showed that either the application does not justify the use of three levels
of cache, or the data accessed by each thread at each moment is too small.
    Scripting the experiments had a huge impact on the time needed to perform
them, since some of the experiments produced thousands of results. The
technique that proved most efficient was to script the generation of results,
output them to an SQL database and run queries over it to generate graphs
with gnuplot.


References
[1] NAS Parallel Benchmarks, http://www.nas.nasa.gov/publications/npb.html.

[2] PARSEC benchmark, http://parsec.cs.princeton.edu/.

[3] Dimemas, http://www.bsc.es/computer-sciences/performance-tools/dimemas.

[4] MareNostrum, http://en.wikipedia.org/wiki/MareNostrum.

[5] Paraver, http://www.bsc.es/computer-sciences/performance-tools/paraver.




A     Used Scripts

A.1    Dimemas instrumentation

A.1.1    Generating Dimemas Configuration

#!/bin/bash

if [ $# -ne 6 ]
then
    echo "$0: Wrong number of arguments."
    echo "$0: <input.trf> <nthreads> <nbuses> <latency> <bandwidth> <%cpuspeed>"
    exit 1
fi

cat begin_of_config

# Bandwidth definition
echo -e "\n\n\"environment information\" {\"\", 0, \"\", 128, $5, $3, 3};;\n"

# Latency and %cpu speed definitions
for (( i=0; i<=127; i++ ))
do
    echo "\"node information\" {0, $i, \"\", 1, 1, 1, 0.0, $4, $6, 0.0, 0.0};;"
done

# File name and number of processors definitions
echo ""
echo -n "\"mapping information\" {\"$1\", $2, [$2] "
echo -n "{0"

for (( i=1; i<=$2-1; i++ ))
do
    echo -n ",$i"
done

echo "}};;"

cat end_of_config



A.1.2    Running experiments

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

cat logo

echo "Removing out folder (force)"
rm -rf out

echo "Creating out folder"
mkdir out
mkdir out/cfg
mkdir out/prv
mkdir out/details
mkdir out/results

echo "Creating sqlite3 database"
sqlite3 out/results/res.db 'CREATE TABLE dimemas (procs INTEGER, buses INTEGER, latency REAL, bandwidth REAL, cpu REAL, runtime REAL);'

echo "Setting default values"
LATENCY="0.000008"
BANDWIDTH="250.0"
BUSES="0"
CPU="1.0"

for i in 02 04 08 16 32
do
    #echo "---------------------------------"
    if [ ${i:0:1} == 0 ]
    then
        #echo "Setting nthreads to ${i:1}"
        nthreads=${i:1}
    else
        #echo "Setting nthreads to ${i}"
        nthreads=$i
    fi

    echo -n "Generating results for $nthreads"

    # BUSES ------------------------------------------------------------
    for j in 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
    do
        #echo "Generating configuration file for BUSES = $j"
        ./config_gen in/mpi_ping_$i.trf $nthreads $j $LATENCY $BANDWIDTH $CPU > out/cfg/config-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU.cfg
        #echo "Converting to paraver trace..."
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU.prv out/cfg/config-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU.cfg > out/details/detail-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU
        #echo "Outputing results."
        echo -n "$nthreads,$j,$LATENCY,$BANDWIDTH,$CPU," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$j-$LATENCY-$BANDWIDTH-$CPU | awk '{print $3}' >> out/results/res-$nthreads.csv
    done

    echo -n "."

    # LATENCY ----------------------------------------------------------
    for j in 0.000001 0.00001 0.0001 0.001 0.01 0.1 1.0
    do
        ./config_gen in/mpi_ping_$i.trf $nthreads $BUSES $j $BANDWIDTH $CPU > out/cfg/config-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU.prv out/cfg/config-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU.cfg > out/details/detail-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU
        echo -n "$nthreads,$BUSES,$j,$BANDWIDTH,$CPU," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$BUSES-$j-$BANDWIDTH-$CPU | awk '{print $3}' >> out/results/res-$nthreads.csv
    done

    echo -n "."

    # BANDWIDTH --------------------------------------------------------
    for j in 250.0 245.0 240.0 235.0 230.0 225.0 220.0 215.0 210.0 205.0 200.0 195.0 190.0 185.0 180.0 175.0 170.0
    do
        ./config_gen in/mpi_ping_$i.trf $nthreads $BUSES $LATENCY $j $CPU > out/cfg/config-$nthreads-$BUSES-$LATENCY-$j-$CPU.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$BUSES-$LATENCY-$j-$CPU.prv out/cfg/config-$nthreads-$BUSES-$LATENCY-$j-$CPU.cfg > out/details/detail-$nthreads-$BUSES-$LATENCY-$j-$CPU
        echo -n "$nthreads,$BUSES,$LATENCY,$j,$CPU," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$BUSES-$LATENCY-$j-$CPU | awk '{print $3}' >> out/results/res-$nthreads.csv
    done

    echo -n "."

    # CPU SPEED ----------------------------------------------------------
    for j in 5.0 4.0 3.0 2.0 1.0 0.95 0.9 0.85 0.8 0.75 0.7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.15 0.1 0.05
    do
        ./config_gen in/mpi_ping_$i.trf $nthreads $BUSES $LATENCY $BANDWIDTH $j > out/cfg/config-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j.cfg
        ./Dimemas3 -S 32K -pa out/prv/paraver-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j.prv out/cfg/config-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j.cfg > out/details/detail-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j
        echo -n "$nthreads,$BUSES,$LATENCY,$BANDWIDTH,$j," >> out/results/res-$nthreads.csv
        grep Execution out/details/detail-$nthreads-$BUSES-$LATENCY-$BANDWIDTH-$j | awk '{print $3}' >> out/results/res-$nthreads.csv
    done
    echo "."

    echo "Importing to database"
    echo ".separator \",\"" > out/results/command
    echo ".import out/results/res-${nthreads}.csv dimemas" >> out/results/command
    sqlite3 out/results/res.db < out/results/command
    rm out/results/command
done

echo "Generating best configuration 1"
./config_gen in/mpi_ping_32.trf 32 16 0.0001 240.0 0.95 > out/cfg/config-32-16-0.0001-240.0-0.95.cfg
./Dimemas3 -S 32K -pa out/prv/paraver-32-16-0.0001-240.0-0.95.prv out/cfg/config-32-16-0.0001-240.0-0.95.cfg > out/details/detail-32-16-0.0001-240.0-0.95
echo -n "32,16,0.0001,240.0,0.95," > out/results/optimal.csv
grep Execution out/details/detail-32-16-0.0001-240.0-0.95 | awk '{print $3}' >> out/results/optimal.csv

echo "Generating best configuration 2"
./config_gen in/mpi_ping_16.trf 16 16 0.0001 230.0 0.9 > out/cfg/config-16-16-0.0001-230.0-0.9.cfg
./Dimemas3 -S 32K -pa out/prv/paraver-16-16-0.0001-230.0-0.9.prv out/cfg/config-16-16-0.0001-230.0-0.9.cfg > out/details/detail-16-16-0.0001-230.0-0.9
echo -n "16,16,0.0001,230.0,0.9," >> out/results/optimal.csv
grep Execution out/details/detail-16-16-0.0001-230.0-0.9 | awk '{print $3}' >> out/results/optimal.csv

./graph_all buses
./graph_all cpu
./graph_all bandwidth
./graph_all latency

echo "All done!"


A.1.3    Graph generator

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

latency="0.000008"
bandwidth="250.0"
buses="0"
cpu="1.0"
aux=""
aux2=""

if [ "$1" == "latency" ]
then
    comp=$latency
    aux="set log x"
    aux2="set mxtics 10"
fi
if [ "$1" == "bandwidth" ]
then
    comp=$bandwidth
fi
if [ "$1" == "buses" ]
then
    comp=$buses
fi
if [ "$1" == "cpu" ]
then
    comp=$cpu
fi

echo "Generating Graph"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0              # dashed
set style line 81 lt rgb "#808080"  # grey

set grid back linestyle 81
set border 3 back linestyle 80  # Remove border on top and right. These
    # borders are useless and make it harder
    # to see plotted lines near the border.
    # Also, put it in grey; no need for so much emphasis on a border.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10    # Makes log scale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 lt rgb "#A00000" lw 2 pt 1
set style line 2 lt rgb "#00A000" lw 2 pt 6
set style line 3 lt rgb "#5060D0" lw 2 pt 2
set style line 4 lt rgb "#F25900" lw 2 pt 9
set style line 5 lw 2 pt 9

#set key top right

#set xrange [0:1]
#set yrange [0:1]

#plot "template.dat" \
#  index 0 title "Example line" w lp ls 1, \
#  "" index 1 title "Another example" w lp ls 2

#set style data lines
set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "Execution time with variable $1"
set xlabel "$1"
$aux
$aux2
set ylabel "Execution time (s)"

plot "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 2 UNION select $1, runtime from dimemas where procs = 2 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 1 title '#Procs = 2', \
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 4 UNION select $1, runtime from dimemas where procs = 4 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 2 title '#Procs = 4', \
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 8 UNION select $1, runtime from dimemas where procs = 8 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 3 title '#Procs = 8', \
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 16 UNION select $1, runtime from dimemas where procs = 16 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 with lines ls 4 title '#Procs = 16', \
     "< sqlite3 out/results/res.db 'select $1, runtime from dimemas where $1 != $comp and procs = 32 UNION select $1, runtime from dimemas where procs = 32 and buses = $buses and latency = $latency and bandwidth = $bandwidth and cpu = $cpu'" using 1:2 w lp ls 5 title '#Procs = 32'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 10cm,20cm
set output "out/results/$1.pdf"
replot
EOF

echo "Done"


A.1.4    Generating graphs

./graph_all buses
./graph_all latency
./graph_all cpu
./graph_all bandwidth

echo "Generating Graph"
gnuplot << EOF
set datafile separator ","
set nokey

set title "Execution time depending on the number of threads"
set xlabel "Number of threads"

set xtics (0, 2, 4, 8, 16, 32, 34)

set ylabel "Execution time (s)"

set style line 1 lt rgb "#A00000" lw 50

plot "out/results/comparisonThreads.csv" using 1:2 with imp ls 1

set term postscript eps enhanced color
set output "out/results/comparison.pdf"
replot
EOF



     A.2         Pin tool instrumentation
     A.2.1        Generate and Compile Application and DCache tool

 1   #! / b i n / b a s h
 2
 3 #c l u s t e r S i z e
 4 #        c o n s t UINT32 c a c h e S i z e = 256∗KILO ;
 5 #        c o n s t UINT32 l i n e S i z e = 1 ;
 6 #        c o n s t UINT32 a s s o c i a t i v i t y = 2 5 6 ;
 7 # l u s t e r S i z e > <L 1 c a c h e s i z e > <L 1 l i n e S i z e > <L1assoc> <L 2 c a c h e s i z e
    <c
        > <L 2 l i n e S i z e > <L2assoc> <L 3 c a c h e s i z e > <L 3 l i n e S i z e > <L3assoc>
          <nThreads>
 8#             $1                      $2                      $3            $4              $5
                                 $6              $7                   $8               $9
        $10         $11
 9
10 i f [ $# −ne 11 ]
11 then
12       echo ” $0 : Wrong number o f arguments . ”
13       echo ” $0 : <c l u s t e r S i z e > <L 1 c a c h e s i z e > <L 1 l i n e S i z e > <L1assoc
                > <L 2 c a c h e s i z e > <L 2 l i n e S i z e > <L2assoc> <L 3 c a c h e s i z e > <
                 L 3 l i n e S i z e > <L3assoc> <nThreads>”
14        exit 1
15 f i
16
17 threadsAndMaster=$ ( ( $ {11} −1) )
18 #echo ” TreadsAndMaster = $threadsAndMaster ”
19
20 #echo −n ”INPUT=”
21 #echo ” $1 $2 $3 $4 $5 $6 $7 $8 $9 $ {10} $ {11}”
22
23 #echo ” S a v i n g backup o f dcache f i l e ”
24 mv −f dcache . cpp dcache backup . cpp


                                                      20
25
26   echo ”
27   #i n c l u d e <i o s t r e a m >
28   #i n c l u d e <f s t r e a m >
29   #i n c l u d e <c a s s e r t >
30
31   #i n c l u d e  ” p i n .H”
32
33
34   t y p e d e f UINT32 CACHE STATS ; // type o f c a c h e h i t / m i s s c o u n t e r s
35
36   #i n c l u d e  ” p i n c a c h e .H”
37
38   KNOB t r i n g > KnobOutputFile (KNOB MODE WRITEONCE,
         <s                                                                            ” p i n t o o l ” ,
39        ” o ” ,  ” a l l c a c h e . out  ” ,  ” s p e c i f y dcache f i l e name” ) ;
40
41   PIN LOCK l o c k ;
42
43   INT32 numThreads = 0 ;
44   c o n s t INT32 MaxNumThreads = $11 ;
45   c o n s t INT32 c l u s t e r S i z e = $1 ;
46
47   s t r u c t THREAD DATA
48   {
49           UINT64 H i t s ;
50           UINT64 Miss ;
51   };
52
53   THREAD DATA l 1 c o u n t [ MaxNumThreads ] ;
54   THREAD DATA l 2 c o u n t [ c l u s t e r S i z e ] ;
55
56   VOID T h r e a d S t a r t (THREADID t h r e a d i d , CONTEXT ∗ c t x t , INT32 f l a g s ,
        VOID ∗v )
57   {
58       GetLock(& l o c k , t h r e a d i d +1) ;
59       numThreads++;
60       R e l e a s e L o c k (& l o c k ) ;
61
62         ASSERT( numThreads <= MaxNumThreads ,  ”Maximum number o f
               t h r e a d s e x c e e d e d n” ) ;
63 }
64
65 namespace DL1
66 {
67     // 1 s t l e v e l data c a c h e : 32 kB , 32 B l i n e s , 32−way
             associative
68     c o n s t UINT32 c a c h e S i z e = $2 ∗KILO ;
69     c o n s t UINT32 l i n e S i z e = $3 ;
70     c o n s t UINT32 a s s o c i a t i v i t y = $4 ;


                                                        21
71       c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC
                 : : STORE NO ALLOCATE;
 72
 73       c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗
                associativity ) ;
 74       c o n s t UINT32 m a x a s s o c i a t i v i t y = a s s o c i a t i v i t y ;
 75
 76       t y p e d e f CACHE ROUND ROBIN( max sets , m a x a s s o c i a t i v i t y ,
                a l l o c a t i o n ) CACHE;
 77 }
 78 LOCALVAR DL1 : : CACHE d l 1 (  ”L1 Data Cache  ” , DL1 : : c a c h e S i z e , DL1 : :
        l i n e S i z e , DL1 : : a s s o c i a t i v i t y ) ;
 79
 80 namespace UL2
 81 {
 82      // 2nd l e v e l u n i f i e d c a c h e : 2 MB, 64 B l i n e s , d i r e c t mapped
 83       c o n s t UINT32 c a c h e S i z e = $5 ∗MEGA;
 84       c o n s t UINT32 l i n e S i z e = $6 ;
 85       c o n s t UINT32 a s s o c i a t i v i t y = $7 ;
 86       c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC
                 : : STORE ALLOCATE;
 87
 88       c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗
                associativity ) ;
 89
 90       t y p e d e f CACHE DIRECT MAPPED( max sets , a l l o c a t i o n ) CACHE;
 91 }
 92 LOCALVAR UL2 : : CACHE u l 2 (  ”L2 C l u s t e r −s h a r e d Cache  ” , UL2 : :
        c a c h e S i z e , UL2 : : l i n e S i z e , UL2 : : a s s o c i a t i v i t y ) ;
 93
 94 namespace UL3
 95 {
 96      // 3 rd l e v e l u n i f i e d c a c h e : 16 MB, 64 B l i n e s , d i r e c t mapped
 97       c o n s t UINT32 c a c h e S i z e = $8 ∗MEGA;
 98       c o n s t UINT32 l i n e S i z e = $9 ;
 99       c o n s t UINT32 a s s o c i a t i v i t y = $ { 1 0 } ;
100       c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC
                 : : STORE ALLOCATE;
101
102       c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗
                associativity ) ;
103
104       t y p e d e f CACHE DIRECT MAPPED( max sets , a l l o c a t i o n ) CACHE;
105 }
106 LOCALVAR UL3 : : CACHE u l 3 (  ”L3 G l o b a l l y −s h a r e d Cache ” , UL3 : :
        c a c h e S i z e , UL3 : : l i n e S i z e , UL3 : : a s s o c i a t i v i t y ) ;
107
108 LOCALFUN VOID F i n i ( i n t code , VOID ∗ v )
109 {


                                                       22
110         s t d : : o f s t r e a m out ( KnobOutputFile . Value ( ) . c s t r ( ) ) ;
111
112           out <<
113                   ”#n”
114                   ”# DCACHE s t a t s n ”
115                   ”#n” ;
116
117         out << d l 1 ;
118         out << u l 2 ;
119         out << u l 3 ;
120
121         out . c l o s e ( ) ;
122
123 f o r ( i n t i =0; i <numThreads ; i ++)
124     {
125     p r i n t f (  ”%d L1 H i t s : %I64d n” , i , ( u n s i g n e d i n t ) l 1 c o u n t [ i ] . H i t s
               );
126     p r i n t f (  ”%d L1 Miss : %I64d n” , i , ( u n s i g n e d i n t ) l 1 c o u n t [ i ] . Miss
               );
127     p r i n t f (  ”%d L1 Hit r a t e : %f nn” , i , ( 1 0 0 . 0 ∗ l 1 c o u n t [ i ] . H i t s / (
               l 1 c o u n t [ i ] . H i t s+l 1 c o u n t [ i ] . Miss ) ) ) ;
128 }
129
130 f o r ( i n t i =0; i <c l u s t e r S i z e ; i ++)
131     {
132     p r i n t f (  ”%d L2 H i t s : %I64d n” , i , ( u n s i g n e d i n t ) l 2 c o u n t [ i ] . H i t s
               );
133     p r i n t f (  ”%d L2 Miss : %I64d n” , i , ( u n s i g n e d i n t ) l 2 c o u n t [ i ] . Miss
               );
134     p r i n t f (  ”%d L2 Hit r a t e : %f nn” , i , ( 1 0 0 . 0 ∗ l 2 c o u n t [ i ] . H i t s / (
               l 2 c o u n t [ i ] . H i t s+l 2 c o u n t [ i ] . Miss ) ) ) ;
135 }
136 }
137
138 LOCALFUN VOID U l 2 A c c e s s (ADDRINT addr , UINT32 size , CACHE BASE : :
          ACCESS TYPE accessType , THREADID t i d )
139 {
140         // s e c o n d l e v e l u n i f i e d c a c h e
141         c o n s t BOOL d l 2 H i t = u l 2 . A c c e s s ( addr , size , a c c e s s T y p e ) ;
142
143         // t h i r d l e v e l u n i f i e d c a c h e
144     i n t c i d = t i d / ( MaxNumThreads/ c l u s t e r S i z e ) ;
145         i f ( ! dl2Hit )
146     {
147         GetLock(& l o c k , t i d +1) ;
148         l 2 c o u n t [ c i d ] . Miss++;
149         R e l e a s e L o c k (& l o c k ) ;
150         u l 3 . A c c e s s ( addr , size , a c c e s s T y p e ) ;
151     } else


                                                          23
152      l 2 c o u n t [ c i d ] . H i t s ++;
153 }
154
155 LOCALFUN VOID MemRefMulti (ADDRINT addr , UINT32 size , CACHE BASE
        : : ACCESS TYPE accessType , THREADID t i d )
156 {
157      // f i r s t l e v e l D−c a c h e
158      c o n s t BOOL d l 1 H i t = d l 1 . A c c e s s ( addr , size , a c c e s s T y p e ) ;
159
160              i f ( ! dl1Hit ) {
161      l 1 c o u n t [ t i d ] . Miss++;
162      U l 2 A c c e s s ( addr , size , accessType , t i d ) ;
163   }
164   else
165   {
166      l 1 c o u n t [ t i d ] . H i t s ++;
167   }
168 }
169
170 LOCALFUN VOID MemRefSingle (ADDRINT addr , UINT32 size , CACHE BASE
        : : ACCESS TYPE accessType , THREADID t i d )
171 {
172      // f i r s t l e v e l D−c a c h e
173      c o n s t BOOL d l 1 H i t = d l 1 . A c c e s s S i n g l e L i n e ( addr , a c c e s s T y p e ) ;
174
175              i f ( ! dl1Hit ) {
176      l 1 c o u n t [ t i d ] . Miss++;
177      U l 2 A c c e s s ( addr , size , accessType , t i d ) ;
178   }
179   else
180   {
181      l 1 c o u n t [ t i d ] . H i t s ++;
182   }
183 }
184
185 LOCALFUN VOID I n s t r u c t i o n ( INS i n s , VOID ∗v )
186 {
187      i f ( INS IsMemoryRead ( i n s ) )
188      {
189              c o n s t UINT32 s i z e = INS MemoryReadSize ( i n s ) ;
190              c o n s t AFUNPTR countFun = ( s i z e <= 4 ? (AFUNPTR)
                       MemRefSingle : (AFUNPTR) MemRefMulti ) ;
191
192              // o n l y p r e d i c a t e d −on memory i n s t r u c t i o n s a c c e s s D−c a c h e
193              INS InsertPredicatedCall (
194                      i n s , IPOINT BEFORE , countFun ,
195                     IARG MEMORYREAD EA,
196                     IARG MEMORYREAD SIZE,
197                      IARG UINT32 , CACHE BASE : : ACCESS TYPE LOAD,


                                                        24
198                     IARG THREAD ID ,
199               IARG END) ;
200           }
201
202           i f ( INS IsMemoryWrite ( i n s ) )
203           {
204                 c o n s t UINT32 s i z e = INS MemoryWriteSize ( i n s ) ;
205                 c o n s t AFUNPTR countFun = ( s i z e <= 4 ? (AFUNPTR)
                          MemRefSingle : (AFUNPTR) MemRefMulti ) ;
206
207                 // o n l y p r e d i c a t e d −on memory i n s t r u c t i o n s a c c e s s D−c a c h e
208                 INS InsertPredicatedCall (
209                      i n s , IPOINT BEFORE , countFun ,
210                     IARG MEMORYWRITE EA,
211                     IARG MEMORYWRITE SIZE,
212                     IARG UINT32 , CACHE BASE : : ACCESS TYPE STORE,
213               IARG THREAD ID ,
214                     IARG END) ;
215           }
216   }
217
218   GLOBALFUN i n t main ( i n t argc , c h a r ∗ argv [ ] )
219   {
220       P I N I n i t ( argc , argv ) ;
221
222           f o r ( INT32 t =0; t<MaxNumThreads ; t++)
223                   {
224           l1count [ t ] . Hits = 0;
225           l 1 c o u n t [ t ] . Miss =0;
226       }
227
228           f o r ( i n t i =0; i <c l u s t e r S i z e ; i ++)
229           {
230       l 2 c o u n t [ i ] . H i t s =0;
231       l 2 c o u n t [ i ] . Miss =0;
232       }
233
234           PIN AddThreadStartFunction ( ThreadStart , 0 ) ;
235           INS AddInstrumentFunction ( I n s t r u c t i o n , 0 ) ;
236           PIN AddFiniFunction ( F i n i , 0 ) ;
237
238           // Never r e t u r n s
239           PIN StartProgram ( ) ;
240
241       return 0 ; // make c o m p i l e r happy
242   }” > dcache . cpp
243
244   make > makeres
245


                                                           25
echo "
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

typedef struct
{
    double *a;
    double *b;
    double  sum;
    int     veclen;
} DOTDATA;

#define NUMTHRDS $threadsAndMaster
#define VECLEN 1000000

DOTDATA dotstr;
pthread_t callThd[NUMTHRDS];
pthread_mutex_t mutexsum;

void *dotprod(void *arg)
{
    int i, start, end, len;
    long offset;
    // printf(\"%d\n\", (int) arg);
    double mysum, *x, *y;
    offset = (long) arg;

    // Extrae_eventandcounters(1, 8);

    len = dotstr.veclen;
    // printf(\"%d\n\", len);
    start = offset * (len / NUMTHRDS);
    end   = start + (len / NUMTHRDS);
    x = dotstr.a;
    y = dotstr.b;

    mysum = 0;
    for (i = start; i < end; i++)
        mysum += (x[i] * y[i]);

    // Extrae_eventandcounters(1, 9);

    pthread_mutex_lock(&mutexsum);

    // Extrae_eventandcounters(1, 10);
    dotstr.sum += mysum;
    // Extrae_eventandcounters(1, 11);

    pthread_mutex_unlock(&mutexsum);

    // Extrae_eventandcounters(1, 0);

    // pthread_exit((void *) 0);
}

int main(int argc, char *argv[])
{
    long i;
    double *a, *b;
    void *status;
    pthread_attr_t attr;

    clock_t begin, end;
    double time_spent;

    begin = clock();
    // Extrae_init();

    // Extrae_eventandcounters(1, 1);

    a = (double *) malloc(NUMTHRDS * VECLEN * sizeof(double));
    b = (double *) malloc(NUMTHRDS * VECLEN * sizeof(double));

    // Extrae_eventandcounters(1, 2);

    for (i = 0; i < VECLEN * NUMTHRDS; i++)
    {
        a[i] = 1;
        b[i] = a[i];
    }

    dotstr.veclen = VECLEN;
    dotstr.a = a;
    dotstr.b = b;
    dotstr.sum = 0;

    // Extrae_eventandcounters(1, 3);

    pthread_mutex_init(&mutexsum, NULL);

    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

    for (i = 0; i < NUMTHRDS; i++)
    {
        // Extrae_eventandcounters(1, 4);
        pthread_create(&callThd[i], &attr, dotprod, (void *) i);
        // Extrae_eventandcounters(1, 3);
    }

    pthread_attr_destroy(&attr);
    // Extrae_eventandcounters(1, 5);

    for (i = 0; i < NUMTHRDS; i++)
    {
        // Extrae_eventandcounters(1, 6);
        pthread_join(callThd[i], &status);
        // Extrae_eventandcounters(1, 7);
    }

    printf(\"Sum = %f\n\", dotstr.sum);
    free(a);
    free(b);

    end = clock();
    time_spent = (double) (end - begin) / CLOCKS_PER_SEC;
    printf(\"Execution time: %f\n\", time_spent);

    // Extrae_fini();

    pthread_mutex_destroy(&mutexsum);
    pthread_exit(NULL);
}
" > dotprod.c

#echo "Compiling dotprod"
gcc -o dotprod dotprod.c -lpthread

#echo "Running pin tool"
cd /scratch/boada-1/etm022/pin
./pin -t /scratch/boada-1/etm022/pin/source/tools/Memory/obj-intel64/dcache.so -- /scratch/boada-1/etm022/pin/source/tools/Memory/dotprod > /scratch/boada-1/etm022/pin/source/tools/Memory/results/res-$1-$2-$3-$4-$5-$6-$7-$8-$9-${10}-${11}.res

mv allcache.out /scratch/boada-1/etm022/pin/source/tools/Memory/results/res-$1-$2-$3-$4-$5-$6-$7-$8-$9-${10}-${11}.allcache

cd /scratch/boada-1/etm022/pin/source/tools/Memory
echo "done!"
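The listing above is the tail of the generation script, which the driver in section A.2.2 invokes as ./genMakeCPP with eleven positional arguments (cluster size, the three L1/L2/L3 cache parameters, and the thread count). As a minimal sketch, assuming the listing is saved as genMakeCPP as the driver expects, one configuration from the sweep could also be generated and measured by hand:

    # One configuration taken from the A.2.2 sweep values: cluster of 8,
    # 16 KB L1 (32 B lines, direct mapped), 1 MB L2, 4 MB L3, 16 threads.
    ./genMakeCPP 8 16 32 1 1 32 1 4 32 1 16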



      A.2.2        Running the experiments


#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#
# clusterSize
#     const UINT32 cacheSize = 256*KILO;
#     const UINT32 lineSize = 1;
#     const UINT32 associativity = 256;
# <clusterSize> <L1cachesize> <L1lineSize> <L1assoc> <L2cachesize> <L2lineSize> <L2assoc> <L3cachesize> <L3lineSize> <L3assoc> <nThreads>
#      $1            $2             $3          $4        $5             $6          $7        $8             $9          $10        $11

rm -rf results
mkdir results

total=$((3 * 4 * 3 * 3 * 2 * 3 * 2))
n=0
res1=$(date +%s.%N)

# clusterSize
for cs in 2 4 8
do
  for mt in 2 4 8 16
  do
    # L1cacheSize
    for l1c in 16 32 64
    do
      # L1lineSize
      for l1l in 32 #64 128
      do
        # L1assoc
        for l1a in 1 #2 4
        do
          # L2cacheSize
          for l2c in 1 2 4
          do
            # L2lineSize
            for l2l in 32 64 #128
            do
              # L2assoc
              for l2a in 1 #2 4
              do
                # L3cacheSize
                for l3c in 4 8 16
                do
                  # L3lineSize
                  for l3l in 32 64 #128
                  do
                    # L3assoc
                    for l3a in 1 #2 4
                    do
                      clear
                      cat logo
                      echo "--------------------------by aknahs"
                      echo -n "Generating [$n/$total]... "
                      res2=$(date +%s.%N)
                      printf "Elapsed: %.3F\n" $(echo "$res2 - $res1" | bc)

                      n=$(($n + 1))
                      #echo "----------------------------------------------------------------"
                      #echo "Generating CPP and Make"
                      #echo "./genMakeCPP $cs $l1c $l1l $l1a $l2c $l2l $l2a $l3c $l3l $l3a $mt"
                      ./genMakeCPP $cs $l1c $l1l $l1a $l2c $l2l $l2a $l3c $l3l $l3a $mt
                      echo "."
                    done
                  done
                done
              done
            done
          done
        done
      done
    done
  done
done
echo "all done."

grep "Total Miss Rate" results/*.allcache | awk 'BEGIN{n=0; printf "Cluster Size,L1 Cache Size,L1 Line Size,L1 Association,L2 Cache Size,L2 Line Size,L2 Association,L3 Cache Size,L3 Line Size,L3 Association,Number of threads,Total Miss Caches\n"} {split($1,a,"."); split(a[1],b,"-"); printf "%d,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n", n%3+1, b[2], b[3], b[4], b[5], b[6], b[7], b[8], b[9], b[10], b[11], b[12], $5; ++n}' >> results/brutaldb.csv
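After the sweep finishes, the grep/awk step above flattens every *.allcache summary into results/brutaldb.csv. A quick sanity check before importing the file (a minimal sketch; the file name and header are the ones produced by the awk BEGIN block above) is:

    head -n 1 results/brutaldb.csv   # the CSV header written by the awk BEGIN block
    wc -l results/brutaldb.csv       # one row per "Total Miss Rate" line found, plus the header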




A.2.3        Importing the results to a database

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

rm -rf power
mkdir power

sqlite3 power/res.db 'CREATE TABLE res (cachelevel INTEGER, cluster INTEGER, l1size INTEGER, l1line INTEGER, l1assoc INTEGER, l2size INTEGER, l2line INTEGER, l2assoc INTEGER, l3size INTEGER, l3line INTEGER, l3assoc INTEGER, threads INTEGER, missrate REAL);'

echo "Importing to database"
echo ".separator \",\"" > power/command
echo ".import brutaldb.csv res" >> power/command
sqlite3 power/res.db < power/command
rm power/command

echo "done"

./graphall
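Once the import has run, the database can be inspected directly before any plots are generated. A minimal check (hypothetical session; the table and column names are those created by the script above) is:

    sqlite3 power/res.db 'select count(*) from res;'
    sqlite3 power/res.db 'select l1size, missrate from res where cachelevel = 1 and threads = 16 limit 5;'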



     A.2.4        Generating graphs

#!/bin/bash
#
# Script by aknahs (Mario Almeida)
#

#sqlite3 power/res.db 'CREATE TABLE res (cachelevel INTEGER, cluster INTEGER, l1size INTEGER, l1line INTEGER, l1assoc INTEGER, l2size INTEGER, l2line INTEGER, l2assoc INTEGER, l3size INTEGER, l3line INTEGER, l3assoc INTEGER, threads INTEGER, missrate REAL);'

mkdir power/cluster2
mkdir power/cluster4
mkdir power/cluster8

# for instrumentation level
for set in 1 2 3
do
# for each cluster size
for cs in 2 4 8
do
# for each level of cache
for l in 1 2 3
do

if [ $set == 1 ]
then
filename="power/cluster${cs}/L${l}MissRate-L1Size16-cluster${cs}"
sql="select l${l}size, missrate from res where cluster = $cs and l1size = 16 and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache ${l} per L${l} size (Lsize=[16,.,.])"
xlabel="Size of cache L${l}"
fi

if [ $set == 2 ]
then
filename="power/cluster${cs}/L${l}MissRate-cluster${cs}"
sql="select l${l}size, missrate from res where cluster = $cs and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache ${l} per L${l} size"
xlabel="Size of cache L${l}"
fi

if [ $set == 3 ]
then
filename="power/cluster${cs}/L${l}MissRate-L1Size16-L2size1-cluster${cs}"
sql="select l${l}size, missrate from res where cluster = $cs and l1size = 16 and l2size = 1 and cachelevel = ${l} and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache L${l} per L${l} size (Lsize=[16,1,.])"
xlabel="Size of cache L${l}"
fi

if [[ $set = 1 && $l = 1 ]]
then
    continue
fi

if [[ $set == 3 && ( $l == 1 || $l == 2 ) ]]
then
    continue
fi

echo "Generating Graph for set $set on cache level $l"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0              # dashed
set style line 81 lt rgb "#808080"  # grey

set grid back linestyle 81
set border 3 back linestyle 80  # Remove border on top and right. These
                                # borders are useless and make it harder
                                # to see plotted lines near the border.
    # Also, put it in grey; no need for so much emphasis on a border.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10    # Makes log scale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 lt rgb "#A00000" lw 2 ps 1 pt 1
set style line 2 lt rgb "#00A000" lw 2 ps 1 pt 6
set style line 3 lt rgb "#5060D0" lw 2 ps 1 pt 2
set style line 4 lt rgb "#F25900" lw 2 ps 1 pt 9

#set key top right

#set xrange [0:1]
set yrange [0:100]

#plot "template.dat" \
#  index 0 title "Example line" w lp ls 1, \
#"" index 1 title "Another example" w lp ls 2

#set style data lines
#set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with points ls 1 title '#procs = 2', \
  "< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with points ls 2 title '#procs = 4', \
  "< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with points ls 3 title '#procs = 8', \
  "< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with points ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 30cm,15cm
set output "${filename}.pdf"
replot
EOF
done
done
done

echo "Done"

filename="power/L2MissRate-L1Size32-L2size4-l3size4-varCluster"
sql="select cluster, missrate from res where l1size = 32 and l2size = 4 and l3size = 4 and cachelevel = 2 and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache L2 per cluster size (Lsize=[32,4,4])"
xlabel="Cluster size"

echo "Generating Graph for set variable clusters"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0              # dashed
set style line 81 lt rgb "#808080"  # grey

set grid back linestyle 81
set border 3 back linestyle 80  # Remove border on top and right. These
                                # borders are useless and make it harder
                                # to see plotted lines near the border.
    # Also, put it in grey; no need for so much emphasis on a border.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10    # Makes log scale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 ps 1 pt 1
set style line 2 ps 1 pt 6
set style line 3 ps 1 pt 2
set style line 4 ps 1 pt 9

#set key top right

#set xrange [0:1]
#set yrange [0:1]

#plot "template.dat" \
#  index 0 title "Example line" w lp ls 1, \
#"" index 1 title "Another example" w lp ls 2

#set style data lines
#set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with lp ls 1 title '#procs = 2', \
  "< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with lp ls 2 title '#procs = 4', \
  "< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with lp ls 3 title '#procs = 8', \
  "< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with lp ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 30cm,15cm
set output "${filename}.pdf"
replot
EOF

filename="power/L2MissRate-L1Size16-L2size1-l3size4-varCluster"
sql="select cluster, missrate from res where l1size = 16 and l2size = 1 and l3size = 4 and cachelevel = 2 and l1line = 32 and l2line = 32 and l3line = 32"
title="MissRate of cache L2 per cluster size (Lsize=[16,1,4])"
xlabel="Cluster size"

echo "Generating Graph for set variable clusters"
gnuplot << EOF
set datafile separator "|"

# Line style for axes
set style line 80 lt rgb "#808080"

# Line style for grid
set style line 81 lt 0              # dashed
set style line 81 lt rgb "#808080"  # grey

set grid back linestyle 81
set border 3 back linestyle 80  # Remove border on top and right. These
                                # borders are useless and make it harder
                                # to see plotted lines near the border.
    # Also, put it in grey; no need for so much emphasis on a border.
set xtics nomirror
set ytics nomirror

#set log x
#set mxtics 10    # Makes log scale look good.

# Line styles: try to pick pleasing colors, rather
# than strictly primary colors or hard-to-see colors
# like gnuplot's default yellow. Make the lines thick
# so they're easy to see in small plots in papers.
set style line 1 ps 1 pt 1
set style line 2 ps 1 pt 6
set style line 3 ps 1 pt 2
set style line 4 ps 1 pt 9

#set key top right

#set xrange [0:1]
#set yrange [0:1]

#plot "template.dat" \
#  index 0 title "Example line" w lp ls 1, \
#"" index 1 title "Another example" w lp ls 2

#set style data lines
#set key outside
#set xtics rotate by -45
#set size ratio 0.8
set title "$title"
set xlabel "$xlabel"
$aux
$aux2
set ylabel "Total Miss Rate (%)"
plot "< sqlite3 power/res.db '$sql and threads = 2'" using 1:2 with lp ls 1 title '#procs = 2', \
  "< sqlite3 power/res.db '$sql and threads = 4'" using 1:2 with lp ls 2 title '#procs = 4', \
  "< sqlite3 power/res.db '$sql and threads = 8'" using 1:2 with lp ls 3 title '#procs = 8', \
  "< sqlite3 power/res.db '$sql and threads = 16'" using 1:2 with lp ls 4 title '#procs = 16'

set terminal pdfcairo font "Gill Sans,7" linewidth 4 rounded
#set terminal pdfcairo size 30cm,15cm
set output "${filename}.pdf"
replot
EOF
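Taken together, the pin-tool experiments in this appendix are driven in the order sketched below. This is a hedged summary: ./graphall is the name used in section A.2.3, while the names runall and importdb for the A.2.2 and A.2.3 listings are assumptions for illustration only.

    ./runall      # A.2.2 (name assumed): sweep the cache configurations and build results/brutaldb.csv
    ./importdb    # A.2.3 (name assumed): create power/res.db, import the CSV, then call ./graphall
    ./graphall    # A.2.4: query power/res.db and render the PDF graphs with gnuplot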




                                                    37

Weitere ähnliche Inhalte

Andere mochten auch (6)

Cache memory
Cache memoryCache memory
Cache memory
 
Cache memory
Cache memoryCache memory
Cache memory
 
Cache memory
Cache memoryCache memory
Cache memory
 
cache memory
cache memorycache memory
cache memory
 
Cache memory
Cache memoryCache memory
Cache memory
 
Cache memory presentation
Cache memory presentationCache memory presentation
Cache memory presentation
 

Ähnlich wie Dimemas and Multi-Level Cache Simulations

Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIijtsrd
 
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Databricks
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...Maria Stylianou
 
Lecture 3
Lecture 3Lecture 3
Lecture 3Mr SMAK
 
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Qualcomm Developer Network
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Vincenzo Gulisano
 
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009James McGalliard
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterSudhang Shankar
 
work load characterization
work load characterizationwork load characterization
work load characterizationRaghu Golla
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPAnil Bohare
 

Ähnlich wie Dimemas and Multi-Level Cache Simulations (20)

Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPIComprehensive Performance Evaluation on Multiplication of Matrices using MPI
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI
 
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
Using Spark Mllib Models in a Production Training and Serving Platform: Exper...
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
FrackingPaper
FrackingPaperFrackingPaper
FrackingPaper
 
Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
 
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
 
Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)Crash course on data streaming (with examples using Apache Flink)
Crash course on data streaming (with examples using Apache Flink)
 
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009Benchmark Analysis of Multi-core Processor Memory Contention April 2009
Benchmark Analysis of Multi-core Processor Memory Contention April 2009
 
Parallel Programming on the ANDC cluster
Parallel Programming on the ANDC clusterParallel Programming on the ANDC cluster
Parallel Programming on the ANDC cluster
 
work load characterization
work load characterizationwork load characterization
work load characterization
 
Dsp lab manual 15 11-2016
Dsp lab manual 15 11-2016Dsp lab manual 15 11-2016
Dsp lab manual 15 11-2016
 
cloud15micros
cloud15microscloud15micros
cloud15micros
 
Fulltext
FulltextFulltext
Fulltext
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
Performance_Programming
Performance_ProgrammingPerformance_Programming
Performance_Programming
 
Introduction to Microcontrollers
Introduction to MicrocontrollersIntroduction to Microcontrollers
Introduction to Microcontrollers
 
UDP Report
UDP ReportUDP Report
UDP Report
 
D031201021027
D031201021027D031201021027
D031201021027
 
Matopt
MatoptMatopt
Matopt
 

Mehr von Mário Almeida

Empirical Study of Android Alarm Usage for Application Scheduling
Empirical Study of Android Alarm Usage for Application SchedulingEmpirical Study of Android Alarm Usage for Application Scheduling
Empirical Study of Android Alarm Usage for Application SchedulingMário Almeida
 
Android reverse engineering - Analyzing skype
Android reverse engineering - Analyzing skypeAndroid reverse engineering - Analyzing skype
Android reverse engineering - Analyzing skypeMário Almeida
 
High-Availability of YARN (MRv2)
High-Availability of YARN (MRv2)High-Availability of YARN (MRv2)
High-Availability of YARN (MRv2)Mário Almeida
 
Flume impact of reliability on scalability
Flume impact of reliability on scalabilityFlume impact of reliability on scalability
Flume impact of reliability on scalabilityMário Almeida
 
Self-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File SystemsSelf-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File SystemsMário Almeida
 
Smith waterman algorithm parallelization
Smith waterman algorithm parallelizationSmith waterman algorithm parallelization
Smith waterman algorithm parallelizationMário Almeida
 
Man-In-The-Browser attacks
Man-In-The-Browser attacksMan-In-The-Browser attacks
Man-In-The-Browser attacksMário Almeida
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News AggregatorMário Almeida
 
Exploiting Availability Prediction in Distributed Systems
Exploiting Availability Prediction in Distributed SystemsExploiting Availability Prediction in Distributed Systems
Exploiting Availability Prediction in Distributed SystemsMário Almeida
 
High Availability of Services in Wide-Area Shared Computing Networks
High Availability of Services in Wide-Area Shared Computing NetworksHigh Availability of Services in Wide-Area Shared Computing Networks
High Availability of Services in Wide-Area Shared Computing NetworksMário Almeida
 
Instrumenting parsecs raytrace
Instrumenting parsecs raytraceInstrumenting parsecs raytrace
Instrumenting parsecs raytraceMário Almeida
 
Architecting a cloud scale identity fabric
Architecting a cloud scale identity fabricArchitecting a cloud scale identity fabric
Architecting a cloud scale identity fabricMário Almeida
 

Mehr von Mário Almeida (14)

Empirical Study of Android Alarm Usage for Application Scheduling
Empirical Study of Android Alarm Usage for Application SchedulingEmpirical Study of Android Alarm Usage for Application Scheduling
Empirical Study of Android Alarm Usage for Application Scheduling
 
Android reverse engineering - Analyzing skype
Android reverse engineering - Analyzing skypeAndroid reverse engineering - Analyzing skype
Android reverse engineering - Analyzing skype
 
Spark
SparkSpark
Spark
 
High-Availability of YARN (MRv2)
High-Availability of YARN (MRv2)High-Availability of YARN (MRv2)
High-Availability of YARN (MRv2)
 
Flume impact of reliability on scalability
Flume impact of reliability on scalabilityFlume impact of reliability on scalability
Flume impact of reliability on scalability
 
Self-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File SystemsSelf-Adapting, Energy-Conserving Distributed File Systems
Self-Adapting, Energy-Conserving Distributed File Systems
 
Smith waterman algorithm parallelization
Smith waterman algorithm parallelizationSmith waterman algorithm parallelization
Smith waterman algorithm parallelization
 
Man-In-The-Browser attacks
Man-In-The-Browser attacksMan-In-The-Browser attacks
Man-In-The-Browser attacks
 
Flume-based Independent News Aggregator
Flume-based Independent News AggregatorFlume-based Independent News Aggregator
Flume-based Independent News Aggregator
 
Exploiting Availability Prediction in Distributed Systems
Exploiting Availability Prediction in Distributed SystemsExploiting Availability Prediction in Distributed Systems
Exploiting Availability Prediction in Distributed Systems
 
High Availability of Services in Wide-Area Shared Computing Networks
High Availability of Services in Wide-Area Shared Computing NetworksHigh Availability of Services in Wide-Area Shared Computing Networks
High Availability of Services in Wide-Area Shared Computing Networks
 
Instrumenting parsecs raytrace
Instrumenting parsecs raytraceInstrumenting parsecs raytrace
Instrumenting parsecs raytrace
 
Architecting a cloud scale identity fabric
Architecting a cloud scale identity fabricArchitecting a cloud scale identity fabric
Architecting a cloud scale identity fabric
 
SOAP vs REST
SOAP vs RESTSOAP vs REST
SOAP vs REST
 

Kürzlich hochgeladen

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 

Kürzlich hochgeladen (20)

presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Dimemas and Multi-Level Cache Simulations

  • 1. ` Universitat Politecnica de Catalunya Measurement and Tools Project Report Dimemas and Multi-level Cache Simulations Author: Supervisor: M´rio Almeida a Alejandro Ramirez Bellido June 22, 2012
  • 2. Contents 1 Introduction 2 2 Methodology 2 2.1 Dimemas Simulation . . . . . . . . . . . . . . . . . . . . . . . 3 2.2 Multi-Level Cache Simulation . . . . . . . . . . . . . . . . . . 3 3 Results 4 3.1 Dimemas Simulation . . . . . . . . . . . . . . . . . . . . . . . 4 3.2 Multi-Level Cache Simulation . . . . . . . . . . . . . . . . . . 8 4 Conclusions 11 A Used Scripts 13 A.1 Dimemas instrumentation . . . . . . . . . . . . . . . . . . . . 13 A.1.1 Generating Dimemas Configuration . . . . . . . . . . . 13 A.1.2 Running experiments . . . . . . . . . . . . . . . . . . . 13 A.1.3 Graph generator . . . . . . . . . . . . . . . . . . . . . 17 A.1.4 Generating graphs . . . . . . . . . . . . . . . . . . . . 19 A.2 Pin tool instrumentation . . . . . . . . . . . . . . . . . . . . . 20 A.2.1 Generate and Compile Application and DCache tool . 20 A.2.2 Running the experiments . . . . . . . . . . . . . . . . . 28 A.2.3 Importing the results to a database . . . . . . . . . . . 31 A.2.4 Generating graphs . . . . . . . . . . . . . . . . . . . . 31 1
  • 3. Abstract This report describes the simulation and benchmarking steps taken in order to predict the parallel performance of an application using Dimemas and Cache-level simulations. Using Dimemas [3] the time behaviour of NAS [1] integer sort was simulated for the architecture of the Barcelona Super Computer, MareNostrum [4]. The performance was evaluated as a function of the architecture latency, bandwidth, connectivity and CPU speed. For Cache-Level Simulations, Intel’s pin tool was used to benchmark a simple parallel application in function of the cache and cluster sizes. 1 Introduction This report describes the simulation and benchmarking steps taken in order to predict the parallel performance of an application using Dimemas [3] and Cache-level simulations. Previous work was focused on benchmarking a PARSEC [2] ray-tracing application on the multi-processor Boada server. For this purpose EXTRAE and Paraver [5] were used to instrument and provide detailed quantitative analysis of the application performance. Following the study of measurement tools and techniques, this report describes the usage of Dimemas to simulate the time behaviour of another benchmarking application on the Barcelona Super Computer, MareNostrum. This time the used traces were taken from a NAS benchmark application also running on boada server. The performance of the application in this simu- lation environment was evaluated as a function of the architecture latency, bandwidth, connectivity and CPU speed. To conclude this study on performance analysis, Cache-Level Simulations were performed using Intel’s pin tool. The chosen application was a sim- ple parallel application that performs distributed arithmetic operations. It represents the typical Master-Slave paradigm with embarrassingly parallel workload. For evaluating the cache architecture, the total cache miss rates per cache level were calculated as a function of the cache sizes, associativity, number of threads and the cluster size. 2 Methodology This section presents the two different simulation configurations: Dimemas and Multi-Level Cache simulations. Both sections describe the used tools, configuration values and metrics used. 2
  • 4. Boada Server Bandwidth 1 Gb/s Latency 6-10 us Number of cores 12 Ram 24 GB Table 1: Boada server configuration. 2.1 Dimemas Simulation The application chosen for this experiment was the NAS Parallel Benchmark application, integer sort. The NAS benchmark is a set of programs designed to help evaluate the performance of parallel super computers. In this case, the benchmark was done on the boada server which attributes are described in table 1. In order to perform an architecture simulation, it was decided to use the MareNostrum Super Compute configuration which parameters are shown in table 2. Note that a simplification was made, since it was considered that each processor runs a single thread. Starting from MareNostrums original ar- chitecture, multiple simulations were performed changing its attributes. For this purpose, the script in section A.1.1 was created that generates Dimemas configuration files and another to automate its variations. The changed at- tributes in the simulated architecture consisted of latency, CPU speed, band- width and the number of buses. All the measurements were stored in a sqlite3 database and then queried in order to automatically generate the graphs (sec- tion A.1.3) presented on the section 3 using gnuplot. To conclude, the changed attributes were recursively fixed on a chosen optimal value to find a final architecture that needs lesser resources while having similar execution times to the original MareNostrum configuration. 2.2 Multi-Level Cache Simulation To conclude this study on performance analysis, Cache-Level Simulations were performed using Intel’s pin tool. The chosen application was a sim- ple parallel application that performs distributed arithmetic operations. It represents the typical Master-Slave paradigm with embarrassingly parallel workload. For evaluating the cache architecture, the pin tool dcache application was changed in order to support multiple levels of cache shared by parallel pro- 3
  • 5. cessors. The implemented cache architecture is represented in figure 1. As one might infer from the figure, the cache level two is cluster shared and the cache level three is globally shared. P0 L1 . . . . . . L2 P7 L1 L3 P8 L1 . . L2 . . . . Size of L2 Size of L1 P15 L1 = 1 MB = 4 MB Size of L1 = 16 KB Figure 1: Cache architecture for a cluster size of 8 and a total of 16 processors. For this experiments, the total cache miss rates per cache level were calcu- lated as a function of the multiple cache sizes, number of processors and the cluster size. Some experiments were performed in terms of cache associativity and the number of cache lines per cache set. 3 Results In this section the results of both the experiments will be described alongside with the resulting charts, descriptions and discussion. 3.1 Dimemas Simulation Starting with the initial architecture of MareNostrum, the first experiment consisted on varying the number of buses and observing its impact on the ex- ecution time of our application. The results of this experiment are depicted on the Figure 2. As it can be observed from figure 2, the execution time decreases while increasing the number of buses. This result was expected since this is a 4
  • 6. Execution time with variable buses 1600 #Procs = 2 1400 #Procs = 4 #Procs = 8 1200 Execution time(s) #Procs = 16 1000 #Procs = 32 800 600 400 200 0 0 5 10 15 20 buses Figure 2: Execution time of IntegerSort depending on the number of buses. multi-threaded application in which the data is transferred between threads and adding more buses increases the amount of data that can be transferred in parallel. Also it can be seen that from sixteen buses, the execution time starts stabilizing. This is probably because most of the data to be sent, is already sent in parallel and thus the increase of buses does not impact the performance. The second experiment consisted on varying the available bandwidth from the initial MareNostrum configuration. The results are shown in Figure 3. Execution time with variable bandwidth 140 #Procs = 2 120 #Procs = 4 #Procs = 8 Execution time(s) 100 #Procs = 16 #Procs = 32 80 60 40 20 0 170 180 190 200 210 220 230 240 250 bandwidth Figure 3: Execution time of IntegerSort depending on the bandwidth (MB/s). 5
  • 7. Figure 3 shows that the bandwidth as a bigger impact on performance if the application is run on a smaller set of threads. For example, a variation of 40 MB/s can increase the execution time by 20 seconds for four threads, but for 32 threads, the changes are almost unnoticeable. This is probably due to the fact that the master thread has to send the initial data to all slaves. This means that increasing the number of slaves, the data can be di- vided in smaller chunks that can be sent in parallel and thus taking less time. The third experiment consisted on varying the processing capacity of the CPU. As one can observe in figure 4, increasing the processing power of each processor decreases the execution time. This impact is more noticeable if we consider processing capacity smaller than 100%. It is not very tunable in terms of optimizing the usage of resources in terms of decreasing the CPU power since a small decrease has a big impact on the execution time. Execution time with variable cpu 500 #Procs = 2 450 #Procs = 4 400 #Procs = 8 Execution time(s) 350 #Procs = 16 #Procs = 32 300 250 200 150 100 50 0 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5 cpu Figure 4: Execution time of IntegerSort depending on the available CPU (%). To conclude the experiments on the variation of the architecture param- eters, figure 5 shows the impact of latency on the execution time. For figure 5 a logarithmic scale was chosen for the x axis since changes in the same order of the initial MareNostrum configuration do not have a significant impact on the execution time. The latency can be increased to value significantly bigger without affecting much the performance since the latency values in MareNostrum are very small. Only from values of latency close to 0.01 seconds we start seeing bigger increases of the execution time. This attribute should have a bigger impact for more communication intensive 6
  • 8. Execution time with variable latency 600 #Procs = 2 500 #Procs = 4 #Procs = 8 Execution time(s) #Procs = 16 400 #Procs = 32 300 200 100 0 1e-06 1e-05 0.0001 0.001 0.01 0.1 1 latency Figure 5: Execution time of IntegerSort depending on the latency (s). applications. Figure 6: Execution time of IntegerSort depending on the number of threads. To conclude, a comparison is shown in table 2 that presents the dif- ferences between a less resource demanding configuration and the original MareNostrum configuration, both achieving similar execution times. The chosen number of threads was 32 due to its better performance as shown in 7
  • 9. Parameters MareNostrum Config 1 Config 2 Cpu (%) 1.0 0.95 0.9 Latency (s) 0.000008 0.0001 0.001 Bandwidth (MB/s) 250.0 240 230 Number of buses 20+ * 16 16 Execution time (s) 12.506 13.150 13.779 Table 2: Comparison between the execution times of the initial MareNostrum configuration and its less resource demanding configuration. figure 6. The table 2 confirms the predictions made in previous experiments. The chosen values increase the execution time at most 1 second while reducing most parameters by around 10% and increasing significantly the latency. 3.2 Multi-Level Cache Simulation As previously mentioned, the chosen application was a simple parallel ap- plication that performs distributed arithmetic operations. It represents the typical Master-Slave paradigm with embarrassingly parallel workload. MissRate of cache L2 per cluster size (Lsize=[16,1,4]) 50 48 #procs = 2 #procs = 4 46 #procs = 8 Total Miss Rate (%) 44 #procs = 16 42 40 38 36 34 32 30 28 2 3 4 5 6 7 8 Cluster size Figure 7: MissRate of Cache L2 for L1, L2 and L3 sized, respectively 16K, 1M, 4M For evaluating the cache architecture, the cache architecture was changed depending on multiple factors, such as the cluster size, caches sizes and cache line sizes. To start with this experiments the cache architecture was set as shown in figure 1. It has 16 processors with one L1 cache of 16 KB each. 8
  • 10. The cache level two has 1 MB and is cluster shared with a cluster size of 8. And finally, the cache level three is globally shared and has a size of 4 MB. The first experiment consisted on varying the cluster size as shown in figure 7 and verifying its impact on the cache L2 miss rate. As it can be seen, for the number of threads of this experiment the impact on the miss rates of changing the cluster size was not very significant. For up to 4 threads it has almost no impact at all, but when the system has more than 8 threads it can reduce the miss rate by 2%. It is interesting to notice that in this experiment, the more threads sharing the same L2 cache, the lesser the miss rate becomes. Since most cache size configurations produced similar variations for the clus- ter size experiment, the next step consisted on verifying the the impact of the cache sizes on the miss rates. The first step consisted on varying the size of the non-shared cache L1 and its results are presented on figure 8. MissRate of cache 1 per L1 size 15 #procs = 2 14 #procs = 4 #procs = 8 Total Miss Rate (%) 13 #procs = 16 12 11 10 9 8 15 20 25 30 35 40 45 50 55 60 65 Size of cache L1 Figure 8: MissRate of Cache L1 for a variable L1 cache size (KB). Looking at figure 8 it might seem strange that a smaller number of threads has such a lower miss rate. This is because of the master/slave paradigm that for an increasing number of threads makes the accesses to data more sparse. For bigger numbers of threads the miss rates can reach values close to 15%. As expected, bigger sizes of L1 caches achieve smaller miss rates, although the difference isn’t greater than 2%. Although the experiments were performed for more sizes of L1 cache, in order to study the impact of the L2 cache size, the L1 cache size was fixed on 16 KB. The variation of L2 cache size is presented on figure 9. As one can observe, the miss rate of L2 cache for 2 threads is high, being close to 50%. This is probably because of the low miss rate of the L1 cache, the accesses 9
  • 11. MissRate of cache 2 per L2 size (Lsize=[16,.,.]) 50 #procs = 2 48 #procs = 4 46 #procs = 8 Total Miss Rate (%) 44 #procs = 16 42 40 38 36 34 32 30 1 1.5 2 2.5 3 3.5 4 Size of cache L2 Figure 9: MissRate of Cache L2 for a variable L2 cache size (MB) and a L1 cache size of 16KB. that don’t produces hits on L1 should have lower predictability. For bigger numbers of threads, the miss rates are still high although they don’t reach values higher than 33%. MissRate of cache L3 per L3 size (Lsize=[16,1,.]) 100 #procs = 2 #procs = 4 80 #procs = 8 Total Miss Rate (%) #procs = 16 60 40 20 0 4 6 8 10 12 14 16 Size of cache L3 Figure 10: MissRate of Cache L3 for a variable L3 cache size (MB) and a L1 cache size of 16KB. Finally, for the L3 cache size, the impact on the miss rate of the L3 cache size is shown in figure 10. It seems that accesses that don’t produce hits on the previous two levels of cache, will hardly produce hits on the third level of cache. The only exception are the 2 threads for which the set of accessed data is bigger. This probably shows that either the application doesn’t justify the use of three levels of cache, or the data accessed by each thread at each 10
4 Conclusions

Dimemas made it possible to explore the theoretical performance of the application on the MareNostrum architecture. Through the variation of each parameter it was possible to create graphs depicting its impact on the execution time. By the end of the experiment it was possible to suggest an architecture with fewer resources that achieves results similar to the initial MareNostrum architecture. This architecture is presented in table 2 and confirms the predictions made in the Dimemas experiments: the chosen values increase the execution time by roughly one second while reducing most parameters by around 10% and increasing the latency significantly.

For the second experiment, the impact of the cluster size and the cache sizes was presented for a simple parallel arithmetic application. The experiments showed that the impact of the cluster size on the miss rate was not very significant; for more than 8 threads it can reduce the miss rate by about 2%. Overall, the more threads share the same L2 cache, the lower the miss rate becomes. This is a consequence of the master/slave paradigm, in which an increasing number of threads makes the accesses to the data more sparse. As expected, bigger L1 caches achieve smaller miss rates. For large numbers of threads the L2 miss rates were high, although they did not exceed 33%. In general, accesses that did not produce hits in the first two cache levels hardly produced hits in the third level. The experiments showed that either the application does not justify the use of three levels of cache, or the data accessed by each thread at each moment is too small.

Scripting the experiments had a huge impact on the time needed to perform them, since some of the experiments produced thousands of results. The technique that proved most efficient was to script the generation of results, import them into an SQL database and run queries whose output is plotted with gnuplot.
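A minimal sketch of that results pipeline, following the structure of the scripts in appendix A (file, table and column names here are illustrative rather than the exact ones used), is:

    #!/bin/bash
    # Sketch of the CSV -> SQLite -> gnuplot flow used for the experiments.
    # File, table and column names are illustrative.
    sqlite3 res.db 'CREATE TABLE IF NOT EXISTS runs (procs INTEGER, param REAL, runtime REAL);'

    # Import the comma-separated results produced by the experiment scripts.
    printf '.separator ","\n.import results.csv runs\n' | sqlite3 res.db

    # Plot the runtime against the swept parameter; sqlite3 prints "|"-separated rows.
    {
        echo 'set datafile separator "|"'
        echo 'set terminal pdfcairo'
        echo 'set output "runtime.pdf"'
        echo "plot \"< sqlite3 res.db 'select param, runtime from runs where procs = 16'\" using 1:2 with lines title '#procs = 16'"
    } | gnuplot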
References

[1] NAS Parallel Benchmarks, http://www.nas.nasa.gov/publications/npb.html.

[2] PARSEC benchmark suite, http://parsec.cs.princeton.edu/.

[3] Dimemas, http://www.bsc.es/computer-sciences/performance-tools/dimemas.

[4] MareNostrum, http://en.wikipedia.org/wiki/MareNostrum.

[5] Paraver, http://www.bsc.es/computer-sciences/performance-tools/paraver.
A Used Scripts

A.1 Dimemas instrumentation

A.1.1 Generating Dimemas Configuration

#!/bin/bash

if [ $# -ne 6 ]
then
    echo "$0: Wrong number of arguments."
    echo "$0: <input.trf> <nthreads> <nbuses> <latency> <bandwidth> <%cpuspeed>"
    exit 1
fi

cat begin_of_config

# Bandwidth definition
echo -e "\n\n\"environment information\" {\"\", 0, \"\", 128, $5, $3, 3};;\n"

# Latency and %cpu speed definitions
for (( i=0; i<=127; i++ ))
do
    echo "\"node information\" {0, $i, \"\", 1, 1, 1, 0.0, $4, $6, 0.0, 0.0};;"
done

# File name and number of processors definitions
echo ""
echo -n "\"mapping information\" {\"$1\", $2, [$2] "
echo -n "{0"

for (( i=1; i<=$2-1; i++ ))
do
    echo -n ",$i"
done

echo "}};;"

cat end_of_config

A.1.2 Running experiments

1 #!/bin/bash
2 #
  • 15. 3 # S c r i p t by aknahs ( Mario Almeida ) 4 # 5 6 cat logo 7 8 echo ”Removing out f o l d e r ( f o r c e ) ” 9 rm − r f out 10 11 echo ” C r e a t i n g out f o l d e r ” 12 mkdir out 13 mkdir out / c f g 14 mkdir out / prv 15 mkdir out / d e t a i l s 16 mkdir out / r e s u l t s 17 18 echo ” C r e a t i n g s q l i t e 3 d a t a b a s e ” 19 s q l i t e 3 out / r e s u l t s / r e s . db ’CREATE TABLE dimemas ( p r o c s INTEGER, b u s e s INTEGER, l a t e n c y REAL, bandwidth REAL, cpu REAL, runtime REAL) ; ’ 20 21 echo ” S e t t i n g d e f a u l t v a l u e s ” 22 LATENCY=” 0 . 0 0 0 0 0 8 ” 23 BANDWIDTH 2 5 0 . 0 ” =” 24 BUSES=” 0 ” 25 CPU=” 1 . 0 ” 26 27 f o r i i n 02 04 08 16 32 28 do 29 #echo ”−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−” 30 i f [ $ { i : 0 : 1 } == 0 ] 31 then 32 #echo ” S e t t i n g n t h r e a d s t o $ { i : 1 } ” 33 n t h r e a d s=$ { i : 1 } 34 else 35 #echo ” S e t t i n g n t h r e a d s t o $ { i }” 36 n t h r e a d s=$ i 37 fi 38 39 echo −n ” G e n e r a t i n g r e s u l t s f o r $ n t h r e a d s ” 40 41 #BUSES−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− 42 f o r j i n 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 43 do 44 #echo ” G e n e r a t i n g c o n f i g u r a t i o n f i l e f o r BUSES = $ j ” 45 . / c o n f i g g e n i n / m p i p i n g $ i . t r f $ n t h r e a d s $ j $LATENCY $BANDWIDTH $CPU > out / c f g / c o n f i g −$ n t h r e a d s −$j −$LATENCY − $BANDWIDTH −$CPU . c f g 46 #echo ” C o n v e r t i n g t o p a r a v e r t r a c e . . . ” 14
  • 16. 47 . / Dimemas3 −S 32K −pa out / prv / paraver −$ n t h r e a d s −$j −$LATENCY − $BANDWIDTH −$CPU . prv out / c f g / c o n f i g −$ n t h r e a d s −$j −$LATENCY − $BANDWIDTH −$CPU . c f g > out / d e t a i l s / d e t a i l −$ n t h r e a d s −$j − $LATENCY −$BANDWIDTH −$CPU 48 #echo ” O u t p u t i n g r e s u l t s . ” 49 echo −n ” $ n t h r e a d s , $j ,$LATENCY,$BANDWIDTH, $CPU, ” >> out / r e s u l t s / r e s −$ n t h r e a d s . c s v 50 g r e p E x e c u t i o n out / d e t a i l s / d e t a i l −$ n t h r e a d s −$j −$LATENCY − $BANDWIDTH −$CPU | awk ”{ p r i n t $3 } ” >> out / r e s u l t s / r e s − $nthreads . csv 51 done 52 53 echo −n ” . ” 54 55 #LATENCY −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− 56 for j in 0.000001 0.00001 0.0001 0.001 0.01 0 . 1 1 . 0 57 do 58 . / c o n f i g g e n i n / m p i p i n g $ i . t r f $ n t h r e a d s $BUSES $ j $BANDWIDTH $CPU > out / c f g / c o n f i g −$ n t h r e a d s −$BUSES−$j −$BANDWIDTH −$CPU . cfg 59 . / Dimemas3 −S 32K −pa out / prv / paraver −$ n t h r e a d s −$BUSES−$j − $BANDWIDTH −$CPU . prv out / c f g / c o n f i g −$ n t h r e a d s −$BUSES−$j − $BANDWIDTH −$CPU . c f g > out / d e t a i l s / d e t a i l −$ n t h r e a d s −$BUSES− $j −$BANDWIDTH −$CPU 60 echo −n ” $ n t h r e a d s , $BUSES , $j ,$BANDWIDTH, $CPU, ” >> out / r e s u l t s / r e s −$ n t h r e a d s . c s v 61 g r e p E x e c u t i o n out / d e t a i l s / d e t a i l −$ n t h r e a d s −$BUSES−$j − $BANDWIDTH −$CPU | awk ”{ p r i n t $3 } ” >> out / r e s u l t s / r e s − $nthreads . csv 62 done 63 64 echo −n ” . ” 65 66 # BANDWIDTH −−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− 67 for j in 250.0 245.0 240.0 235.0 230.0 225.0 220.0 215.0 210.0 205.0 200.0 195.0 190.0 185.0 180.0 175.0 170.0 68 do 69 . / c o n f i g g e n i n / m p i p i n g $ i . t r f $ n t h r e a d s $BUSES $LATENCY $ j $CPU > out / c f g / c o n f i g −$ n t h r e a d s −$BUSES−$LATENCY −$j −$CPU . c f g 70 . / Dimemas3 −S 32K −pa out / prv / paraver −$ n t h r e a d s −$BUSES− $LATENCY −$j −$CPU . prv out / c f g / c o n f i g −$ n t h r e a d s −$BUSES− $LATENCY −$j −$CPU . c f g > out / d e t a i l s / d e t a i l −$ n t h r e a d s −$BUSES− $LATENCY −$j −$CPU 71 echo −n ” $ n t h r e a d s , $BUSES ,$LATENCY, $j , $CPU, ” >> out / r e s u l t s / r e s −$ n t h r e a d s . c s v 72 g r e p E x e c u t i o n out / d e t a i l s / d e t a i l −$ n t h r e a d s −$BUSES−$LATENCY −$j −$CPU | awk ”{ p r i n t $3 }” >> out / r e s u l t s / r e s −$ n t h r e a d s . c s v 73 done 74 15
  • 17. 75 echo −n ” . ” 76 77 #CPU SPEED−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− 78 for j in 5 . 0 4 . 0 3 . 0 2 . 0 1 . 0 0.95 0 . 9 0.85 0 . 8 0.75 0 . 7 0.65 0.6 0.55 0.5 0.45 0.4 0.35 0.3 0.25 0.2 0.25 0.1 0.05 79 do 80 . / c o n f i g g e n i n / m p i p i n g $ i . t r f $ n t h r e a d s $BUSES $LATENCY $BANDWIDTH $ j > out / c f g / c o n f i g −$ n t h r e a d s −$BUSES−$LATENCY − $BANDWIDTH j . c f g−$ 81 . / Dimemas3 −S 32K −pa out / prv / paraver −$ n t h r e a d s −$BUSES− $LATENCY −$BANDWIDTH j . prv out / c f g / c o n f i g −$ n t h r e a d s −$BUSES− −$ $LATENCY −$BANDWIDTH j . c f g > out / d e t a i l s / d e t a i l −$ n t h r e a d s − −$ $BUSES−$LATENCY −$BANDWIDTH j −$ 82 echo −n ” $ n t h r e a d s , $BUSES ,$LATENCY,$BANDWIDTH, $j , ” >> out / r e s u l t s / r e s −$ n t h r e a d s . c s v 83 g r e p E x e c u t i o n out / d e t a i l s / d e t a i l −$ n t h r e a d s −$BUSES−$LATENCY − $BANDWIDTH j | awk ” { p r i n t $3 } ” >> out / r e s u l t s / r e s − −$ $nthreads . csv 84 85 86 done 87 echo ” . ” 88 echo ” I m p o r t i n g t o d a t a b a s e ” 89 echo ” . s e p a r a t o r ” , ” ” > out / r e s u l t s /command 90 echo ” . import out / r e s u l t s / r e s −$ { n t h r e a d s } . c s v dimemas” >> out / r e s u l t s /command 91 s q l i t e 3 out / r e s u l t s / r e s . db < out / r e s u l t s /command 92 rm out / r e s u l t s /command 93 done 94 95 echo ” G e n e r a t i n g b e s t c o n f i g u r a t i o n 1 ” 96 . / c o n f i g g e n i n / m p i p i n g 3 2 . t r f 32 16 0 . 0 0 0 1 2 4 0 . 0 0 . 9 5 > out / c f g / c o n f i g −32 −16 −0.0001 −240.0 −0.95. c f g 97 . / Dimemas3 −S 32K −pa out / prv / paraver −32 −16 −0.0001 −240.0 −0.95. prv out / c f g / c o n f i g −32 −16 −0.0001 −240.0 −0.95. c f g > out / d e t a i l s / d e t a i l −32 −16 −0.0001 −240.0 −0.95 98 echo −n ” 3 2 , 1 6 , 0 . 0 0 0 1 , 2 4 0 . 0 , 0 . 9 5 , ” > out / r e s u l t s / o p t i m a l . c s v 99 g r e p E x e c u t i o n out / d e t a i l s / d e t a i l −32 −16 −0.0001 −240.0 −0.95 | awk ”{ p r i n t $3 }” >> out / r e s u l t s / o p t i m a l . c s v 100 101 echo ” G e n e r a t i n g b e s t c o n f i g u r a t i o n ” 102 . / c o n f i g g e n i n / m p i p i n g 1 6 . t r f 16 16 0 . 0 0 0 1 2 3 0 . 0 0 . 9 > out / c f g / c o n f i g −16 −16 −0.0001 −230.0 −0.9. c f g 103 . / Dimemas3 −S 32K −pa out / prv / paraver −16 −16 −0.0001 −230.0 −0.9. prv out / c f g / c o n f i g −16 −16 −0.0001 −230.0 −0.9. c f g > out / d e t a i l s / d e t a i l −16 −16 −0.0001 −230.0 −0.9 104 echo −n ” 1 6 , 1 6 , 0 . 0 0 0 1 , 2 3 0 . 0 , 0 . 9 , ” >> out / r e s u l t s / o p t i m a l . c s v 105 g r e p E x e c u t i o n out / d e t a i l s / d e t a i l −16 −16 −0.0001 −230.0 −0.9 | awk ” { p r i n t $3 } ” >> out / r e s u l t s / o p t i m a l . c s v 16
  • 18. 106 107 ./ graphall buses 108 ./ graphall cpu 109 ./ graphall bandwidth 110 ./ graphall latency 111 112 echo ” A l l done ! ” A.1.3 Graph generator 1 #! / b i n / b a s h 2 # 3 # S c r i p t by aknahs ( Mario Almeida ) 4 # 5 6 l a t e n c y=” 0 . 0 0 0 0 0 8 ” 7 bandwidth=” 2 5 0 . 0 ” 8 b u s e s=” 0 ” 9 cpu=” 1 . 0 ” 10 aux=” ” 11 aux2=” ” 12 13 i f [ ” $1 ” == ” l a t e n c y ” ] 14 then 15 comp=$ l a t e n c y 16 aux=” s e t l o g x” 17 aux2=” s e t m x t i c s 10 ” 18 fi 19 i f [ ” $1 ” == ” bandwidth ” ] 20 then 21 comp=$bandwidth 22 fi 23 i f [ ” $1 ” == ” b u s e s ” ] 24 then 25 comp=$ b u s e s 26 fi 27 i f [ ” $1 ” == ” cpu ” ] 28 then 29 comp=$cpu 30 fi 31 32 33 echo ” G e n e r a t i n g Graph” 34 g n u p l o t << EOF 35 set d a t a f i l e s e p a r a t o r ” | ” 36 37 # Line s t y l e f o r a x e s 38 set s t y l e l i n e 80 l t rgb ”#808080” 17
  • 19. 39 40 # Line s t y l e f o r g r i d 41 set s t y l e l i n e 81 l t 0 # dashed 42 set s t y l e l i n e 81 l t rgb ”#808080” # grey 43 44 set grid back l i n e s t y l e 81 45 set b o r d e r 3 back l i n e s t y l e 80 # Remove b o r d e r on t o p and r i g h t . These 46 # b o r d e r s a r e u s e l e s s and make i t h a r d e r 47 # t o s e e p l o t t e d l i n e s near t h e b o r d e r . 48 # Also , p u t i t i n g r e y ; no need f o r so much emphasis on a border . 49 set x t i c s n o m i r r o r 50 set y t i c s n o m i r r o r 51 52 #s e t l o g x 53 #s e t m x t i c s 10 # Makes l o g s c a l e l o o k good . 54 55 # Line s t y l e s : t r y t o p i c k p l e a s i n g c o l o r s , r a t h e r 56 # than s t r i c t l y primary c o l o r s or hard−to−s e e c o l o r s 57 # l i k e g n u p l o t ’ s d e f a u l t y e l l o w . Make t h e l i n e s t h i c k 58 # so t h e y ’ r e e a s y t o s e e i n s m a l l p l o t s i n p a p e r s . 59 set s t y l e l i n e 1 l t rgb ”#A00000 ” lw 2 pt 1 60 set s t y l e l i n e 2 l t rgb ”#00A000” lw 2 pt 6 61 set s t y l e l i n e 3 l t rgb ”#5060D0” lw 2 pt 2 62 set s t y l e l i n e 4 l t rgb ”#F25900 ” lw 2 pt 9 63 set s t y l e l i n e 5 lw 2 pt 9 64 65 #s e t key t o p r i g h t 66 67 #s e t x r a n g e [ 0 : 1 ] 68 #s e t y r a n g e [ 0 : 1 ] 69 70 #p l o t ” t e m p l a t e . d a t ” 71 #i n d e x 0 t i t l e ” Example l i n e ” w l p l s 1 , 72 #”” i n d e x 1 t i t l e ” Another example ” w l p l s 2 73 74 #s e t s t y l e d a t a l i n e s 75 set key o u t s i d e 76 #s e t x t i c s r o t a t e by −45 77 #s e t s i z e r a t i o 0 . 8 78 set t i t l e ” E x e c u t i o n time with v a r i a b l e $1 ” 79 set xlabel ” $1 ” 80 $aux 81 $aux2 82 set ylabel ” E x e c u t i o n time ( s ) ” 83 84 plot ”< s q l i t e 3 out / r e s u l t s / r e s . db ’ s e l e c t $1 , runtime from dimemas where $1 != $comp and p r o c s = 2 UNION s e l e c t $1 , 18
  • 20. runtime from dimemas where p r o c s = 2 and b u s e s = $ b u s e s and l a t e n c y = $ l a t e n c y and bandwidth = $bandwidth and cpu = $cpu ’ ” u s i n g 1 : 2 w l p l s 1 t i t l e ’#Procs = 2 ’ , 85 ”< s q l i t e 3 out / r e s u l t s / r e s . db ’ s e l e c t $1 , runtime from dimemas where $1 != $comp and p r o c s = 4 UNION s e l e c t $1 , runtime from dimemas where p r o c s = 4 and b u s e s = $ b u s e s and l a t e n c y = $ l a t e n c y and bandwidth = $bandwidth and cpu = $cpu ’ ” u s i n g 1 : 2 w l p l s 2 t i t l e ’#Procs = 4 ’ , 86 ”< s q l i t e 3 out / r e s u l t s / r e s . db ’ s e l e c t $1 , runtime from dimemas where $1 != $comp and p r o c s = 8 UNION s e l e c t $1 , runtime from dimemas where p r o c s = 8 and b u s e s = $ b u s e s and l a t e n c y = $ l a t e n c y and bandwidth = $bandwidth and cpu = $cpu ’ ” u s i n g 1 : 2 w l p l s 3 t i t l e ’#Procs = 8 ’ , 87 ”< s q l i t e 3 out / r e s u l t s / r e s . db ’ s e l e c t $1 , runtime from dimemas where $1 != $comp and p r o c s = 16 UNION s e l e c t $1 , runtime from dimemas where p r o c s = 16 and b u s e s = $ b u s e s and l a t e n c y = $ l a t e n c y and bandwidth = $bandwidth and cpu = $cpu ’ ” u s i n g 1 : 2 with l i n e s l s 4 t i t l e ’#Procs = 16 ’ , 88 ”< s q l i t e 3 out / r e s u l t s / r e s . db ’ s e l e c t $1 , runtime from dimemas where $1 != $comp and p r o c s = 32 UNION s e l e c t $1 , runtime from dimemas where p r o c s = 32 and b u s e s = $ b u s e s and l a t e n c y = $ l a t e n c y and bandwidth = $bandwidth and cpu = $cpu ’ ” u s i n g 1 : 2 w l p l s 5 t i t l e ’#Procs = 32 ’ 89 90 set terminal p d f c a i r o f o n t ” G i l l Sans , 7 ” l i n e w i d t h 4 rounded 91 #s e t t e r m i n a l p d f c a i r o s i z e 10cm, 2 0cm 92 set output ” out / r e s u l t s / $1 . pdf ” 93 replot 94 EOF 95 96 echo ”Done” A.1.4 Generating graphs 1 ./ graphall buses 2 ./ graphall latency 3 ./ graphall cpu 4 ./ graphall bandwidth 5 6 echo ” G e n e r a t i n g Graph” 7 g n u p l o t << EOF 8 set d a t a f i l e s e p a r a t o r ” , ” 9 set nokey 10 11 set t i t l e ” E x e c u t i o n time depending on t h e number o f t h r e a d s ” 12 set xlabel ”Number o f t h r e a d s ” 13 14 set x t i c s ( 0 , 2 , 4 , 8 , 1 6 , 3 2 , 3 4 ) 19
  • 21. 15 16 set ylabel ” E x e c u t i o n time ( s ) ” 17 18 set s t y l e l i n e 1 l t rgb ”#A00000 ” lw 50 19 20 plot ” out / r e s u l t s / comparisonThreads . c s v ” u s i n g 1 : 2 with imp l s 1 21 22 set term p o s t s c r i p t eps enhanced c o l o r 23 set output ” out / r e s u l t s / comparison . pdf ” 24 replot 25 EOF A.2 Pin tool instrumentation A.2.1 Generate and Compile Application and DCache tool 1 #! / b i n / b a s h 2 3 #c l u s t e r S i z e 4 # c o n s t UINT32 c a c h e S i z e = 256∗KILO ; 5 # c o n s t UINT32 l i n e S i z e = 1 ; 6 # c o n s t UINT32 a s s o c i a t i v i t y = 2 5 6 ; 7 # l u s t e r S i z e > <L 1 c a c h e s i z e > <L 1 l i n e S i z e > <L1assoc> <L 2 c a c h e s i z e <c > <L 2 l i n e S i z e > <L2assoc> <L 3 c a c h e s i z e > <L 3 l i n e S i z e > <L3assoc> <nThreads> 8# $1 $2 $3 $4 $5 $6 $7 $8 $9 $10 $11 9 10 i f [ $# −ne 11 ] 11 then 12 echo ” $0 : Wrong number o f arguments . ” 13 echo ” $0 : <c l u s t e r S i z e > <L 1 c a c h e s i z e > <L 1 l i n e S i z e > <L1assoc > <L 2 c a c h e s i z e > <L 2 l i n e S i z e > <L2assoc> <L 3 c a c h e s i z e > < L 3 l i n e S i z e > <L3assoc> <nThreads>” 14 exit 1 15 f i 16 17 threadsAndMaster=$ ( ( $ {11} −1) ) 18 #echo ” TreadsAndMaster = $threadsAndMaster ” 19 20 #echo −n ”INPUT=” 21 #echo ” $1 $2 $3 $4 $5 $6 $7 $8 $9 $ {10} $ {11}” 22 23 #echo ” S a v i n g backup o f dcache f i l e ” 24 mv −f dcache . cpp dcache backup . cpp 20
  • 22. 25 26 echo ” 27 #i n c l u d e <i o s t r e a m > 28 #i n c l u d e <f s t r e a m > 29 #i n c l u d e <c a s s e r t > 30 31 #i n c l u d e ” p i n .H” 32 33 34 t y p e d e f UINT32 CACHE STATS ; // type o f c a c h e h i t / m i s s c o u n t e r s 35 36 #i n c l u d e ” p i n c a c h e .H” 37 38 KNOB t r i n g > KnobOutputFile (KNOB MODE WRITEONCE, <s ” p i n t o o l ” , 39 ” o ” , ” a l l c a c h e . out ” , ” s p e c i f y dcache f i l e name” ) ; 40 41 PIN LOCK l o c k ; 42 43 INT32 numThreads = 0 ; 44 c o n s t INT32 MaxNumThreads = $11 ; 45 c o n s t INT32 c l u s t e r S i z e = $1 ; 46 47 s t r u c t THREAD DATA 48 { 49 UINT64 H i t s ; 50 UINT64 Miss ; 51 }; 52 53 THREAD DATA l 1 c o u n t [ MaxNumThreads ] ; 54 THREAD DATA l 2 c o u n t [ c l u s t e r S i z e ] ; 55 56 VOID T h r e a d S t a r t (THREADID t h r e a d i d , CONTEXT ∗ c t x t , INT32 f l a g s , VOID ∗v ) 57 { 58 GetLock(& l o c k , t h r e a d i d +1) ; 59 numThreads++; 60 R e l e a s e L o c k (& l o c k ) ; 61 62 ASSERT( numThreads <= MaxNumThreads , ”Maximum number o f t h r e a d s e x c e e d e d n” ) ; 63 } 64 65 namespace DL1 66 { 67 // 1 s t l e v e l data c a c h e : 32 kB , 32 B l i n e s , 32−way associative 68 c o n s t UINT32 c a c h e S i z e = $2 ∗KILO ; 69 c o n s t UINT32 l i n e S i z e = $3 ; 70 c o n s t UINT32 a s s o c i a t i v i t y = $4 ; 21
  • 23. 71 c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC : : STORE NO ALLOCATE; 72 73 c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗ associativity ) ; 74 c o n s t UINT32 m a x a s s o c i a t i v i t y = a s s o c i a t i v i t y ; 75 76 t y p e d e f CACHE ROUND ROBIN( max sets , m a x a s s o c i a t i v i t y , a l l o c a t i o n ) CACHE; 77 } 78 LOCALVAR DL1 : : CACHE d l 1 ( ”L1 Data Cache ” , DL1 : : c a c h e S i z e , DL1 : : l i n e S i z e , DL1 : : a s s o c i a t i v i t y ) ; 79 80 namespace UL2 81 { 82 // 2nd l e v e l u n i f i e d c a c h e : 2 MB, 64 B l i n e s , d i r e c t mapped 83 c o n s t UINT32 c a c h e S i z e = $5 ∗MEGA; 84 c o n s t UINT32 l i n e S i z e = $6 ; 85 c o n s t UINT32 a s s o c i a t i v i t y = $7 ; 86 c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC : : STORE ALLOCATE; 87 88 c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗ associativity ) ; 89 90 t y p e d e f CACHE DIRECT MAPPED( max sets , a l l o c a t i o n ) CACHE; 91 } 92 LOCALVAR UL2 : : CACHE u l 2 ( ”L2 C l u s t e r −s h a r e d Cache ” , UL2 : : c a c h e S i z e , UL2 : : l i n e S i z e , UL2 : : a s s o c i a t i v i t y ) ; 93 94 namespace UL3 95 { 96 // 3 rd l e v e l u n i f i e d c a c h e : 16 MB, 64 B l i n e s , d i r e c t mapped 97 c o n s t UINT32 c a c h e S i z e = $8 ∗MEGA; 98 c o n s t UINT32 l i n e S i z e = $9 ; 99 c o n s t UINT32 a s s o c i a t i v i t y = $ { 1 0 } ; 100 c o n s t CACHE ALLOC : : STORE ALLOCATION a l l o c a t i o n = CACHE ALLOC : : STORE ALLOCATE; 101 102 c o n s t UINT32 m a x s e t s = c a c h e S i z e / ( l i n e S i z e ∗ associativity ) ; 103 104 t y p e d e f CACHE DIRECT MAPPED( max sets , a l l o c a t i o n ) CACHE; 105 } 106 LOCALVAR UL3 : : CACHE u l 3 ( ”L3 G l o b a l l y −s h a r e d Cache ” , UL3 : : c a c h e S i z e , UL3 : : l i n e S i z e , UL3 : : a s s o c i a t i v i t y ) ; 107 108 LOCALFUN VOID F i n i ( i n t code , VOID ∗ v ) 109 { 22
  • 24. 110 s t d : : o f s t r e a m out ( KnobOutputFile . Value ( ) . c s t r ( ) ) ; 111 112 out << 113 ”#n” 114 ”# DCACHE s t a t s n ” 115 ”#n” ; 116 117 out << d l 1 ; 118 out << u l 2 ; 119 out << u l 3 ; 120 121 out . c l o s e ( ) ; 122 123 f o r ( i n t i =0; i <numThreads ; i ++) 124 { 125 p r i n t f ( ”%d L1 H i t s : %I64d n” , i , ( u n s i g n e d i n t ) l 1 c o u n t [ i ] . H i t s ); 126 p r i n t f ( ”%d L1 Miss : %I64d n” , i , ( u n s i g n e d i n t ) l 1 c o u n t [ i ] . Miss ); 127 p r i n t f ( ”%d L1 Hit r a t e : %f nn” , i , ( 1 0 0 . 0 ∗ l 1 c o u n t [ i ] . H i t s / ( l 1 c o u n t [ i ] . H i t s+l 1 c o u n t [ i ] . Miss ) ) ) ; 128 } 129 130 f o r ( i n t i =0; i <c l u s t e r S i z e ; i ++) 131 { 132 p r i n t f ( ”%d L2 H i t s : %I64d n” , i , ( u n s i g n e d i n t ) l 2 c o u n t [ i ] . H i t s ); 133 p r i n t f ( ”%d L2 Miss : %I64d n” , i , ( u n s i g n e d i n t ) l 2 c o u n t [ i ] . Miss ); 134 p r i n t f ( ”%d L2 Hit r a t e : %f nn” , i , ( 1 0 0 . 0 ∗ l 2 c o u n t [ i ] . H i t s / ( l 2 c o u n t [ i ] . H i t s+l 2 c o u n t [ i ] . Miss ) ) ) ; 135 } 136 } 137 138 LOCALFUN VOID U l 2 A c c e s s (ADDRINT addr , UINT32 size , CACHE BASE : : ACCESS TYPE accessType , THREADID t i d ) 139 { 140 // s e c o n d l e v e l u n i f i e d c a c h e 141 c o n s t BOOL d l 2 H i t = u l 2 . A c c e s s ( addr , size , a c c e s s T y p e ) ; 142 143 // t h i r d l e v e l u n i f i e d c a c h e 144 i n t c i d = t i d / ( MaxNumThreads/ c l u s t e r S i z e ) ; 145 i f ( ! dl2Hit ) 146 { 147 GetLock(& l o c k , t i d +1) ; 148 l 2 c o u n t [ c i d ] . Miss++; 149 R e l e a s e L o c k (& l o c k ) ; 150 u l 3 . A c c e s s ( addr , size , a c c e s s T y p e ) ; 151 } else 23
  • 25. 152 l 2 c o u n t [ c i d ] . H i t s ++; 153 } 154 155 LOCALFUN VOID MemRefMulti (ADDRINT addr , UINT32 size , CACHE BASE : : ACCESS TYPE accessType , THREADID t i d ) 156 { 157 // f i r s t l e v e l D−c a c h e 158 c o n s t BOOL d l 1 H i t = d l 1 . A c c e s s ( addr , size , a c c e s s T y p e ) ; 159 160 i f ( ! dl1Hit ) { 161 l 1 c o u n t [ t i d ] . Miss++; 162 U l 2 A c c e s s ( addr , size , accessType , t i d ) ; 163 } 164 else 165 { 166 l 1 c o u n t [ t i d ] . H i t s ++; 167 } 168 } 169 170 LOCALFUN VOID MemRefSingle (ADDRINT addr , UINT32 size , CACHE BASE : : ACCESS TYPE accessType , THREADID t i d ) 171 { 172 // f i r s t l e v e l D−c a c h e 173 c o n s t BOOL d l 1 H i t = d l 1 . A c c e s s S i n g l e L i n e ( addr , a c c e s s T y p e ) ; 174 175 i f ( ! dl1Hit ) { 176 l 1 c o u n t [ t i d ] . Miss++; 177 U l 2 A c c e s s ( addr , size , accessType , t i d ) ; 178 } 179 else 180 { 181 l 1 c o u n t [ t i d ] . H i t s ++; 182 } 183 } 184 185 LOCALFUN VOID I n s t r u c t i o n ( INS i n s , VOID ∗v ) 186 { 187 i f ( INS IsMemoryRead ( i n s ) ) 188 { 189 c o n s t UINT32 s i z e = INS MemoryReadSize ( i n s ) ; 190 c o n s t AFUNPTR countFun = ( s i z e <= 4 ? (AFUNPTR) MemRefSingle : (AFUNPTR) MemRefMulti ) ; 191 192 // o n l y p r e d i c a t e d −on memory i n s t r u c t i o n s a c c e s s D−c a c h e 193 INS InsertPredicatedCall ( 194 i n s , IPOINT BEFORE , countFun , 195 IARG MEMORYREAD EA, 196 IARG MEMORYREAD SIZE, 197 IARG UINT32 , CACHE BASE : : ACCESS TYPE LOAD, 24
  • 26. 198 IARG THREAD ID , 199 IARG END) ; 200 } 201 202 i f ( INS IsMemoryWrite ( i n s ) ) 203 { 204 c o n s t UINT32 s i z e = INS MemoryWriteSize ( i n s ) ; 205 c o n s t AFUNPTR countFun = ( s i z e <= 4 ? (AFUNPTR) MemRefSingle : (AFUNPTR) MemRefMulti ) ; 206 207 // o n l y p r e d i c a t e d −on memory i n s t r u c t i o n s a c c e s s D−c a c h e 208 INS InsertPredicatedCall ( 209 i n s , IPOINT BEFORE , countFun , 210 IARG MEMORYWRITE EA, 211 IARG MEMORYWRITE SIZE, 212 IARG UINT32 , CACHE BASE : : ACCESS TYPE STORE, 213 IARG THREAD ID , 214 IARG END) ; 215 } 216 } 217 218 GLOBALFUN i n t main ( i n t argc , c h a r ∗ argv [ ] ) 219 { 220 P I N I n i t ( argc , argv ) ; 221 222 f o r ( INT32 t =0; t<MaxNumThreads ; t++) 223 { 224 l1count [ t ] . Hits = 0; 225 l 1 c o u n t [ t ] . Miss =0; 226 } 227 228 f o r ( i n t i =0; i <c l u s t e r S i z e ; i ++) 229 { 230 l 2 c o u n t [ i ] . H i t s =0; 231 l 2 c o u n t [ i ] . Miss =0; 232 } 233 234 PIN AddThreadStartFunction ( ThreadStart , 0 ) ; 235 INS AddInstrumentFunction ( I n s t r u c t i o n , 0 ) ; 236 PIN AddFiniFunction ( F i n i , 0 ) ; 237 238 // Never r e t u r n s 239 PIN StartProgram ( ) ; 240 241 return 0 ; // make c o m p i l e r happy 242 }” > dcache . cpp 243 244 make > makeres 245 25
  • 27. 246 echo ” 247 #i n c l u d e <p t h r e a d . h> 248 #i n c l u d e <s t d i o . h> 249 #i n c l u d e < s t d l i b . h> 250 #i n c l u d e <time . h> 251 typedef struct 252 { 253 double ∗a ; 254 double ∗b ; 255 double sum ; 256 int veclen ; 257 } DOTDATA; 258 259 260 #d e f i n e NUMTHRDS $threadsAndMaster 261 #d e f i n e VECLEN 1000000 262 263 DOTDATA d o t s t r ; 264 p t h r e a d t c a l l T h d [NUMTHRDS] ; 265 p t h r e a d m u t e x t mutexsum ; 266 267 v o i d ∗ dotprod ( v o i d ∗ a r g ) 268 { 269 i n t i , s t a r t , end , l e n ; 270 long o f f s e t ; 271 // p r i n t f ( ”%dn ” , ( i n t ) a r g ) ; 272 d o u b l e mysum , ∗x , ∗y ; 273 o f f s e t = ( long ) arg ; 274 275 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 8 ) ; 276 277 len = dotstr . veclen ; 278 // p r i n t f ( ”%dn” , l e n ) ; 279 s t a r t = o f f s e t ∗ ( l e n /NUMTHRDS) ; 280 end = s t a r t + ( l e n /NUMTHRDS) ; 281 x = dotstr . a ; 282 y = dotstr . b ; 283 284 mysum = 0 ; 285 f o r ( i=s t a r t ; i <end ; i ++) 286 mysum += ( x [ i ] ∗ y [ i ] ) ; 287 288 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 9 ) ; 289 290 p t h r e a d m u t e x l o c k (&mutexsum ) ; 291 292 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 1 0 ) ; 293 d o t s t r .sum += mysum ; 294 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 1 1 ) ; 26
  • 28. 295 296 p t h r e a d m u t e x u n l o c k (&mutexsum ) ; 297 298 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 0 ) ; 299 300 // p t h r e a d e x i t ( ( v o i d ∗ ) 0 ) ; 301 } 302 303 304 i n t main ( i n t argc , c h a r ∗ argv [ ] ) 305 { 306 long i ; 307 d o u b l e ∗a , ∗b ; 308 void ∗ s ta t us ; 309 pthread attr t attr ; 310 311 c l o c k t begin , end ; 312 double time spent ; 313 314 b e g i n = clock ( ) ; 315 // E x t r a e i n i t ( ) ; 316 317 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 1 ) ; 318 319 a = ( d o u b l e ∗ ) m a l l o c (NUMTHRDS∗VECLEN∗ s i z e o f ( d o u b l e ) ) ; 320 b = ( d o u b l e ∗ ) m a l l o c (NUMTHRDS∗VECLEN∗ s i z e o f ( d o u b l e ) ) ; 321 322 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 2 ) ; 323 324 f o r ( i =0; i <VECLEN∗NUMTHRDS; i ++) 325 { 326 a [ i ]=1; 327 b [ i ]=a [ i ] ; 328 } 329 330 d o t s t r . v e c l e n = VECLEN; 331 dotstr . a = a ; 332 dotstr . b = b ; 333 d o t s t r .sum=0; 334 335 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 3 ) ; 336 337 p t h r e a d m u t e x i n i t (&mutexsum , NULL) ; 338 339 p t h r e a d a t t r i n i t (& a t t r ) ; 340 p t h r e a d a t t r s e t d e t a c h s t a t e (& a t t r , PTHREAD CREATE JOINABLE) ; 341 342 f o r ( i =0; i < NUMTHRDS; i ++) 343 { 27
  • 29. 344 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 4 ) ; 345 p t h r e a d c r e a t e (& c a l l T h d [ i ] , &a t t r , dotprod , ( v o i d ∗ ) i ) ; 346 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 3 ) ; 347 } 348 349 p t h r e a d a t t r d e s t r o y (& a t t r ) ; 350 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 5 ) ; 351 352 f o r ( i =0; i < NUMTHRDS; i ++) 353 { 354 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 6 ) ; 355 p t h r e a d j o i n ( c a l l T h d [ i ] , &s t a t u s ) ; 356 // E x t r a e e v e n t a n d c o u n t e r s ( 1 , 7 ) ; 357 } 358 359 p r i n t f ( ”Sum = %f n” , d o t s t r .sum) ; 360 free (a) ; 361 free (b) ; 362 363 end=clock ( ) ; 364 t i m e s p e n t= ( d o u b l e ) ( end − b e g i n ) / CLOCKS PER SEC ; 365 p r i n t f ( ” E x e c u t i o n time : %f n ” , t i m e s p e n t ) ; 366 367 // E x t r a e f i n i ( ) ; 368 369 p t h r e a d m u t e x d e s t r o y (&mutexsum ) ; 370 p t h r e a d e x i t (NULL) ; 371 } 372 ” > dotprod . c 373 374 #echo ” Compiling dotprod ” 375 g c c −o dotprod dotprod . c −l p t h r e a d 376 377 #echo ” Running p i n t o o l ” 378 cd / s c r a t c h / boada −1/etm022 / p i n 379 . / p i n −t / s c r a t c h / boada −1/etm022 / p i n / s o u r c e / t o o l s /Memory/ obj− i n t e l 6 4 / dcache . s o −− / s c r a t c h / boada −1/etm022 / p i n / s o u r c e / t o o l s /Memory/ dotprod > / s c r a t c h / boada −1/etm022 / p i n / s o u r c e / t o o l s / Memory/ r e s u l t s / r e s −$1−$2−$3−$4−$5−$6−$7−$8−$9−${10}−$ { 1 1 } . r e s 380 381 mv a l l c a c h e . out / s c r a t c h / boada −1/etm022 / p i n / s o u r c e / t o o l s /Memory/ r e s u l t s / r e s −$1−$2−$3−$4−$5−$6−$7−$8−$9−${10}−$ { 1 1 } . a l l c a c h e 382 383 cd / s c r a t c h / boada −1/etm022 / p i n / s o u r c e / t o o l s /Memory 384 echo ” done ! ” A.2.2 Running the experiments 28
  • 30. 1 #! / b i n / b a s h 2 # 3 #S c r i p t by aknahs ( Mario Almeida ) 4 # 5 #c l u s t e r S i z e 6 # c o n s t UINT32 c a c h e S i z e = 256∗KILO ; 7 # c o n s t UINT32 l i n e S i z e = 1 ; 8 # c o n s t UINT32 a s s o c i a t i v i t y = 2 5 6 ; 9 # l u s t e r S i z e > <L 1 c a c h e s i z e > <L 1 l i n e S i z e > <L1assoc> <L 2 c a c h e s i z e <c > <L 2 l i n e S i z e > <L2assoc> <L 3 c a c h e s i z e > <L 3 l i n e S i z e > <L3assoc> <nThreads> 10 # $1 $2 $3 $4 $5 $6 $7 $8 $9 $10 $11 11 12 rm − r f r e s u l t s 13 mkdir r e s u l t s 14 15 t o t a l=$ ( ( 3 ∗ 4 ∗ 3 ∗ 3 ∗ 2 ∗ 3 ∗ 2 ) ) 16 n=0 17 r e s 1=$ ( date +%s .%N) 18 19 20 #c l u s t e r S i z e 21 f o r c s i n 2 4 8 22 do 23 f o r mt i n 2 4 8 16 24 do 25 #L 1 c a c h e S i z e 26 f o r l 1 c i n 16 32 64 27 do 28 #L 1 l i n e S i z e 29 f o r l 1 l i n 32 #64 128 30 do 31 #L1assoc 32 f o r l 1 a i n 1 #2 4 33 do 34 #L 2 c a c h e S i z e 35 for l 2 c in 1 2 4 36 do 37 #L 2 l i n e S i z e 38 f o r l 2 l i n 32 64 #128 39 do 40 #L2assoc 41 f o r l 2 a i n 1 #2 4 42 do 43 #L 3 c a c h e S i z e 44 f o r l 3 c i n 4 8 16 29
  • 31. 45 do 46 #L 3 l i n e S i z e 47 f o r l 3 l i n 32 64 #128 48 do 49 #L3assoc 50 f o r l 3 a i n 1 #2 4 51 do 52 clear 53 cat logo 54 echo ”−−−−−−−−−−−−−−−−−−−−−−−−−−by aknahs ” 55 echo −n ” G e n e r a t i n g [ $n/ $ t o t a l ] . . . ” 56 r e s 2=$ ( date +%s .%N) 57 p r i n t f ” Elapsed : %.3Fn” $ ( echo ” $ r e s 2 − $ r e s 1 ” | bc ) 58 59 n=$ ( ( $n + 1 ) ) 60 #echo ”−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−− 61 #echo ” G e n e r a t i n g CPP and Make” 62 #echo ” . / genMakeCPP $ c s $ l 1 c $ l 1 l $ l 1 a $ l 2 c $ l 2 l $ l 2 a $ l 3 c $ l 3 l $ l 3 a $mt” 63 . / genMakeCPP $ c s $ l 1 c $ l 1 l $ l 1 a $ l 2 c $ l 2 l $ l 2 a $ l 3 c $ l 3 l $ l 3 a $mt 64 echo ” . ” 65 done 66 done 67 done 68 done 69 done 70 done 71 done 72 done 73 done 74 done 75 done 76 echo ” a l l done . ” 77 78 g r e p ” T o t a l Miss Rate ” r e s u l t s / ∗ . a l l c a c h e | awk ’BEGIN{n=0; p r i n t f ” C l u s t e r S i z e , L1 Cache S i z e , L1 L ine S i z e , L1 A s s o c i a t i o n , L2 Cache S i z e , L2 Li ne S i z e , L2 A s s o c i a t i o n , L3 Cache S i z e , L3 L in e S i z e , L3 A s s o c i a t i o n , Number o f t h r e a d s , T o t a l Miss Caches n”} { s p l i t ( $1 , a , ” . ” ) ; s p l i t ( a [ 1 ] , b ,” −”) ; p r i n t f ”%d,%s ,%s ,%s ,%s ,%s ,%s ,%s ,%s ,%s ,%s ,%s ,% s n ” , n%3 +1,b [ 2 ] , b [ 3 ] , b [ 4 ] , b [ 5 ] , b [ 6 ] , b [ 7 ] , b [ 8 ] , b [ 9 ] , b [ 1 0 ] , b [ 1 1 ] , b [ 1 2 ] , $5 ;++n} ’ >> r e s u l t s / b r u t a l d b . c s v 30
  • 32. A.2.3 Importing the results to a database 1 #! / b i n / b a s h 2 # 3 # S c r i p t by aknahs ( Mario Almeida ) 4 # 5 6 rm − r f power 7 mkdir power 8 9 s q l i t e 3 power / r e s . db ’CREATE TABLE r e s ( c a c h e l e v e l INTEGER, c l u s t e r INTEGER, l 1 s i z e INTEGER, l 1 l i n e INTEGER, l 1 a s s o c INTEGER , l 2 s i z e INTEGER, l 2 l i n e INTEGER, l 2 a s s o c INTEGER, l 3 s i z e INTEGER , l 3 l i n e INTEGER, l 3 a s s o c INTEGER, t h r e a d s INTEGER, m i s s r a t e REAL ); ’ 10 11 echo ” I m p o r t i n g t o d a t a b a s e ” 12 echo ” . s e p a r a t o r ” , ” ” > power /command 13 echo ” . import b r u t a l d b . c s v r e s ” >> power /command 14 s q l i t e 3 power / r e s . db < power /command 15 rm power /command 16 17 echo ” done ” 18 19 ./ graphall A.2.4 Generating graphs 1 #! / b i n / b a s h 2 # 3 # S c r i p t by aknahs ( Mario Almeida ) 4 # 5 6 #s q l i t e 3 power / r e s . db ’CREATE TABLE r e s ( c a c h e l e v e l INTEGER, c l u s t e r INTEGER, l 1 s i z e INTEGER, l 1 l i n e INTEGER, l 1 a s s o c INTEGER , l 2 s i z e INTEGER, l 2 l i n e i n e INTEGER, l 2 a s s o c INTEGER, l 3 s i z e INTEGER, l 3 l i n e i n e INTEGER, l 3 a s s o c INTEGER, t h r e a d s INTEGER, m i s s r a t e REAL) ; ’ 7 8 mkdir power / c l u s t e r 2 9 mkdir power / c l u s t e r 4 10 mkdir power / c l u s t e r 8 11 12 #f o r i n s t r u m e n t a t i o n l e v e l 13 f o r set i n 1 2 3 14 do 15 #f o r each c l u s t e r s i z e 31
  • 33. 16 for cs in 2 4 8 17 do 18 #f o r each l e v e l o f c a ch e 19 for l in 1 2 3 20 do 21 22 i f [ $ s e t == 1 ] 23 then 24 f i l e n a m e=” power / c l u s t e r $ { c s }/L${ l } MissRate−L1Size16 −c l u s t e r $ { c s } ” 25 s q l=” s e l e c t l $ { l } s i z e , m i s s r a t e from r e s where c l u s t e r = $ c s and l 1 s i z e = 16 and c a c h e l e v e l = $ { l } and l 1 l i n e = 32 and l 2 l i n e = 32 and l 3 l i n e = 32 ” 26 t i t l e=” MissRate o f c a c h e $ { l } p e r L${ l } s i z e ( L s i z e = [ 1 6 , . , . ] ) ” 27 xlabel=” S i z e o f c a c h e L${ l } ” 28 fi 29 30 i f [ $ s e t == 2 ] 31 then 32 f i l e n a m e=” power / c l u s t e r $ { c s }/L${ l } MissRate−c l u s t e r $ { c s } ” 33 s q l=” s e l e c t l $ { l } s i z e , m i s s r a t e from r e s where c l u s t e r = $ c s and c a c h e l e v e l = $ { l } and l 1 l i n e = 32 and l 2 l i n e = 32 and l 3 l i n e = 32 ” 34 t i t l e=” MissRate o f c a c h e $ { l } p e r L${ l } s i z e ” 35 xlabel=” S i z e o f c a c h e L${ l } ” 36 f i 37 38 i f [ $ s e t == 3 ] 39 then 40 f i l e n a m e=” power / c l u s t e r $ { c s }/L${ l } MissRate−L1Size16 −L 2 s i z e 1 − c l u s t e r $ { c s }” 41 s q l=” s e l e c t l $ { l } s i z e , m i s s r a t e from r e s where c l u s t e r = $ c s and l 1 s i z e = 16 and l 2 s i z e = 1 and c a c h e l e v e l = ${ l } and l 1 l i n e = 32 and l 2 l i n e = 32 and l 3 l i n e = 32 ” 42 t i t l e=” MissRate o f c a c h e L${ l } p e r L${ l } s i z e ( L s i z e = [ 1 6 , 1 , . ] ) ” 43 xlabel=” S i z e o f c a c h e L${ l } ” 44 f i 45 46 i f [ [ $ s e t = 1 && $ l = 1 ] ] 47 then 48 continue 49 f i 50 51 i f [ [ $ s e t == 3 && ( $ l == 1 | | $ l == 2 ) ] ] 52 then 53 continue 54 f i 55 56 32
  • 34. 57 echo ” G e n e r a t i n g Graph f o r s e t $ s e t on c a c h e l e v e l $ l ” 58 g n u p l o t << EOF 59 set d a t a f i l e s e p a r a t o r ” | ” 60 61 # Line s t y l e f o r a x e s 62 set s t y l e l i n e 80 l t rgb ”#808080” 63 64 # Line s t y l e f o r g r i d 65 set s t y l e l i n e 81 l t 0 # dashed 66 set s t y l e l i n e 81 l t rgb ”#808080” # g r e y 67 68 set grid back l i n e s t y l e 81 69 set b o r d e r 3 back l i n e s t y l e 80 # Remove b o r d e r on t o p and r i g h t . These 70 # b o r d e r s a r e u s e l e s s and make i t h a r d e r 71 # t o s e e p l o t t e d l i n e s near t h e b o r d e r . 72 # Also , p u t i t i n g r e y ; no need f o r so much emphasis on a border . 73 set x t i c s n o m i r r o r 74 set y t i c s n o m i r r o r 75 76 #s e t l o g x 77 #s e t m x t i c s 10 # Makes l o g s c a l e l o o k good . 78 79 # Line s t y l e s : t r y t o p i c k p l e a s i n g c o l o r s , r a t h e r 80 # than s t r i c t l y primary c o l o r s or hard−to−s e e c o l o r s 81 # l i k e g n u p l o t ’ s d e f a u l t y e l l o w . Make t h e l i n e s t h i c k 82 # so t h e y ’ r e e a s y t o s e e i n s m a l l p l o t s i n p a p e r s . 83 set s t y l e l i n e 1 l t rgb ”#A00000 ” lw 2 ps 1 pt 1 84 set s t y l e l i n e 2 l t rgb ”#00A000” lw 2 ps 1 pt 6 85 set s t y l e l i n e 3 l t rgb ”#5060D0” lw 2 ps 1 pt 2 86 set s t y l e l i n e 4 l t rgb ”#F25900 ” lw 2 ps 1 pt 9 87 88 #s e t key t o p r i g h t 89 90 #s e t x r a n g e [ 0 : 1 ] 91 set yrange [ 0 : 1 0 0 ] 92 93 #p l o t ” t e m p l a t e . d a t ” 94 #i n d e x 0 t i t l e ” Example l i n e ” w l p l s 1 , 95 #”” i n d e x 1 t i t l e ” Another example ” w l p l s 2 96 97 #s e t s t y l e d a t a l i n e s 98 #s e t key o u t s i d e 99 #s e t x t i c s r o t a t e by −45 100 #s e t s i z e r a t i o 0 . 8 101 set t i t l e ” $ t i t l e ” 102 set xlabel ” $ x l a b e l ” 103 $aux 33
  • 35. 104 $aux2 105 set ylabel ” T o t a l Miss Rate (%)” 106 plot ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 2 ’ ” u s i n g 1 : 2 with p o i n t s l s 1 t i t l e ’#p r o c s = 2 ’ , 107 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 4 ’ ” u s i n g 1 : 2 with p o i n t s l s 2 t i t l e ’#p r o c s = 4 ’ , 108 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 8 ’ ” u s i n g 1 : 2 with p o i n t s l s 3 t i t l e ’#p r o c s = 8 ’ , 109 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 1 6 ’ ” u s i n g 1 : 2 with p o i n t s l s 4 t i t l e ’#p r o c s = 16 ’ 110 111 set terminal p d f c a i r o f o n t ” G i l l Sans , 7 ” l i n e w i d t h 4 rounded 112 #s e t t e r m i n a l p d f c a i r o s i z e 30cm, 1 5cm 113 set output ” ${ f i l e n a m e } . pdf ” 114 replot 115 EOF 116 done 117 done 118 done 119 120 echo ”Done” 121 122 f i l e n a m e=” power / L2MissRate−L1Size32 −L 2 s i z e 4 −l 3 s i z e 4 −v a r C l u s t e r ” 123 s q l=” s e l e c t c l u s t e r , m i s s r a t e from r e s where l 1 s i z e = 32 and l 2 s i z e = 4 and l 3 s i z e = 4 and c a c h e l e v e l = 2 and l 1 l i n e = 32 and l 2 l i n e = 32 and l 3 l i n e = 32 ” 124 t i t l e=” MissRate o f c a c h e L2 p e r c l u s t e r s i z e ( L s i z e = [ 3 2 , 4 , 4 ] ) ” 125 xlabel=” C l u s t e r s i z e ” 126 127 echo ” G e n e r a t i n g Graph f o r s e t v a r i a b l e c l u s t e r s ” 128 g n u p l o t << EOF 129 set d a t a f i l e s e p a r a t o r ” | ” 130 131 # Line s t y l e f o r a x e s 132 set s t y l e l i n e 80 l t rgb ”#808080” 133 134 # Line s t y l e f o r g r i d 135 set s t y l e l i n e 81 l t 0 # dashed 136 set s t y l e l i n e 81 l t rgb ”#808080” # g r e y 137 138 set grid back l i n e s t y l e 81 139 set b o r d e r 3 back l i n e s t y l e 80 # Remove b o r d e r on t o p and r i g h t . These 140 # b o r d e r s a r e u s e l e s s and make i t h a r d e r 141 # t o s e e p l o t t e d l i n e s near t h e b o r d e r . 142 # Also , p u t i t i n g r e y ; no need f o r so much emphasis on a border . 143 set x t i c s n o m i r r o r 144 set y t i c s n o m i r r o r 34
  • 36. 145 146 #s e t l o g x 147 #s e t m x t i c s 10 # Makes l o g s c a l e l o o k good . 148 149 # Line s t y l e s : t r y t o p i c k p l e a s i n g c o l o r s , r a t h e r 150 # than s t r i c t l y primary c o l o r s or hard−to−s e e c o l o r s 151 # l i k e g n u p l o t ’ s d e f a u l t y e l l o w . Make t h e l i n e s t h i c k 152 # so t h e y ’ r e e a s y t o s e e i n s m a l l p l o t s i n p a p e r s . 153 set s t y l e l i n e 1 ps 1 pt 1 154 set s t y l e l i n e 2 ps 1 pt 6 155 set s t y l e l i n e 3 ps 1 pt 2 156 set s t y l e l i n e 4 ps 1 pt 9 157 158 #s e t key t o p r i g h t 159 160 #s e t x r a n g e [ 0 : 1 ] 161 #s e t y r a n g e [ 0 : 1 ] 162 163 #p l o t ” t e m p l a t e . d a t ” 164 #i n d e x 0 t i t l e ” Example l i n e ” w l p l s 1 , 165 #”” i n d e x 1 t i t l e ” Another example ” w l p l s 2 166 167 #s e t s t y l e d a t a l i n e s 168 #s e t key o u t s i d e 169 #s e t x t i c s r o t a t e by −45 170 #s e t s i z e r a t i o 0 . 8 171 set t i t l e ” $ t i t l e ” 172 set xlabel ” $ x l a b e l ” 173 $aux 174 $aux2 175 set ylabel ” T o t a l Miss Rate (%)” 176 plot ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 2 ’ ” u s i n g 1 : 2 with l p l s 1 t i t l e ’#p r o c s = 2 ’ , 177 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 4 ’ ” u s i n g 1 : 2 with l p l s 2 t i t l e ’#p r o c s = 4 ’ , 178 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 8 ’ ” u s i n g 1 : 2 with l p l s 3 t i t l e ’#p r o c s = 8 ’ , 179 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 1 6 ’ ” u s i n g 1 : 2 with l p l s 4 t i t l e ’#p r o c s = 16 ’ 180 181 set terminal p d f c a i r o f o n t ” G i l l Sans , 7 ” l i n e w i d t h 4 rounded 182 #s e t t e r m i n a l p d f c a i r o s i z e 30cm, 1 5cm 183 set output ” ${ f i l e n a m e } . pdf ” 184 replot 185 EOF 186 187 f i l e n a m e=” power / L2MissRate−L1Size16 −L 2 s i z e 1 −l 3 s i z e 4 −v a r C l u s t e r ” 188 s q l=” s e l e c t c l u s t e r , m i s s r a t e from r e s where l 1 s i z e = 16 and l 2 s i z e = 1 and l 3 s i z e = 4 and c a c h e l e v e l = 2 and l 1 l i n e = 32 35
  • 37. and l 2 l i n e = 32 and l 3 l i n e = 32 ” 189 t i t l e=” MissRate o f c a c h e L2 p e r c l u s t e r s i z e ( L s i z e = [ 1 6 , 1 , 4 ] ) ” 190 xlabel=” C l u s t e r s i z e ” 191 192 echo ” G e n e r a t i n g Graph f o r s e t v a r i a b l e c l u s t e r s ” 193 g n u p l o t << EOF 194 set d a t a f i l e s e p a r a t o r ” | ” 195 196 # Line s t y l e f o r a x e s 197 set s t y l e l i n e 80 l t rgb ”#808080” 198 199 # Line s t y l e f o r g r i d 200 set s t y l e l i n e 81 l t 0 # dashed 201 set s t y l e l i n e 81 l t rgb ”#808080” # g r e y 202 203 set grid back l i n e s t y l e 81 204 set b o r d e r 3 back l i n e s t y l e 80 # Remove b o r d e r on t o p and r i g h t . These 205 # b o r d e r s a r e u s e l e s s and make i t h a r d e r 206 # t o s e e p l o t t e d l i n e s near t h e b o r d e r . 207 # Also , p u t i t i n g r e y ; no need f o r so much emphasis on a border . 208 set x t i c s n o m i r r o r 209 set y t i c s n o m i r r o r 210 211 #s e t l o g x 212 #s e t m x t i c s 10 # Makes l o g s c a l e l o o k good . 213 214 # Line s t y l e s : t r y t o p i c k p l e a s i n g c o l o r s , r a t h e r 215 # than s t r i c t l y primary c o l o r s or hard−to−s e e c o l o r s 216 # l i k e g n u p l o t ’ s d e f a u l t y e l l o w . Make t h e l i n e s t h i c k 217 # so t h e y ’ r e e a s y t o s e e i n s m a l l p l o t s i n p a p e r s . 218 set s t y l e l i n e 1 ps 1 pt 1 219 set s t y l e l i n e 2 ps 1 pt 6 220 set s t y l e l i n e 3 ps 1 pt 2 221 set s t y l e l i n e 4 ps 1 pt 9 222 223 #s e t key t o p r i g h t 224 225 #s e t x r a n g e [ 0 : 1 ] 226 #s e t y r a n g e [ 0 : 1 ] 227 228 #p l o t ” t e m p l a t e . d a t ” 229 #i n d e x 0 t i t l e ” Example l i n e ” w l p l s 1 , 230 #”” i n d e x 1 t i t l e ” Another example ” w l p l s 2 231 232 #s e t s t y l e d a t a l i n e s 233 #s e t key o u t s i d e 234 #s e t x t i c s r o t a t e by −45 36
  • 38. 235 #s e t s i z e r a t i o 0 . 8 236 set t i t l e ” $ t i t l e ” 237 set xlabel ” $ x l a b e l ” 238 $aux 239 $aux2 240 set ylabel ” T o t a l Miss Rate (%)” 241 plot ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 2 ’ ” u s i n g 1 : 2 with l p l s 1 t i t l e ’#p r o c s = 2 ’ , 242 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 4 ’ ” u s i n g 1 : 2 with l p l s 2 t i t l e ’#p r o c s = 4 ’ , 243 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 8 ’ ” u s i n g 1 : 2 with l p l s 3 t i t l e ’#p r o c s = 8 ’ , 244 ”< s q l i t e 3 power / r e s . db ’ $ s q l and t h r e a d s = 1 6 ’ ” u s i n g 1 : 2 with l p l s 4 t i t l e ’#p r o c s = 16 ’ 245 246 set terminal p d f c a i r o f o n t ” G i l l Sans , 7 ” l i n e w i d t h 4 rounded 247 #s e t t e r m i n a l p d f c a i r o s i z e 30cm, 1 5cm 248 set output ” ${ f i l e n a m e } . pdf ” 249 replot 250 EOF 37