Multi-Application Multi-Step Mapping Method
            for Many-Core Network-on-Chips
       Bo Yang∗ , Liang Guang∗‡ , Thomas Canhao Xu∗‡ , Alexander Wei Yin∗‡ , Tero Säntti∗† , Juha Plosila∗†
                             ∗ Department  of Information Technology, University of Turku, Finland
                        † Academy   of Finland, Research Council for Natural Sciences and Engineering
                                      ‡ Turku Center for Computer Science, Turku, Finland

                                     {boyan, liagua, canxu, yinwei, teansa, juplos}@utu.fi


   Abstract—Massive parallel computing performed on many-           rithms in literature are analyzed and compared in [12]. Tang
core Network-on-Chips (NoCs) is the future of computing.            et al. proposed a two-step genetic algorithm and the related
One feasible approach to implement parallel computing is to         software for mapping concurrent applications on a fixed NoC
deploy multiple applications on the NoC simultaneously. In this
paper, we propose a multi-application mapping method starting       architecture [10]. Murali et al. presented a methodology to
with the application mapping which finds a region on the NoC         map multiple use-cases onto the NoC architecture, satisfying
for each application and then task mapping which maps all           the constraints of each use-case [13]. In these works multiple
tasks of the application into each region. In the application       applications reuse the same platforms in different time slots.
mapping step, several strategies based on the maximal empty         The main drawback of these systems is the timing overhead
rectangle (MER) technique are introduced for finding an optimal
region for each application. In the task mapping step, a tree-      incurred by reconfiguring the NoC and loading new appli-
model based algorithm is used with the purpose of reducing the      cations. Also, since various communication constraints and
communication latency and energy consumption. The experiment        traffic characteristics of applications have to be satisfied using
results show that the proposed method can achieve considerable      limited processing elements (PEs), the system design is more
reduction of network latency and energy consumption (up to          complicated and the optimized mapping for each application
18%) for a given set of applications.
                                                                    may not be achieved [13].
                                                                       While the traditional approaches to maximize the serial
                      I. INTRODUCTION
                                                                    performance of processors by maximizing the clock speed
   Over the last 40 years, we have witnessed a series of remark-    and increasing instruction-level parallelism (ILP) are proved
able developments in computer industry. One of them is the          to reach their limits [7] [8], the many-core NoC architectures
increasing processing capability of the system. The increase is     provide more feasibility to deliver higher performance through
not only achieved by the performance improvements between           parallel computing. The massive parallel computing performed
the generations of uniprocessors, but also comes from the           on many-core NoCs is the future of computing [1]. With
advent of multi-core or many-core architectures where tens to       increasing number and computational power of on-chip PEs,
hundreds of processors or cores can be integrated on a single       the parallel computing on many-core NoCs can be realized not
chip. Examples of such architectures are [6] and [17]. A recent     only at the instruction level, but also at the higher task and/or
study at the University of California, Berkeley [1] suggests that   application levels. To realize the higher level parallelism and
it will soon be possible to integrate more than 1000 cores on       make full use of the abundant resources on the NoCs, it is
a single chip since Moore’s Law is still generously delivering      no longer reasonable to only focus on the implementation of
transistors at the rate of twice every couple of years. While       single application with abundant PEs being available on the
the amount of on-chip cores increases, the communication            many-core NoCs. Instead, the design focus should shift from
among them is critical to the system performance and energy         the single-application to the multi-application scenarios. More
consumption. In the last decade, NoC has been proposed as           precisely, multiple applications could be deployed on different
an alternative for the traditional bus and point-to-point ad hoc    regions of the NoC and executed in parallel.
connections in order to address the challenge of increasing            In this paper, we propose a novel mapping method whereby
concurrent communication requirements as well as the diffi-          multiple applications can be simultaneously mapped on the
culty of global synchronization [4].                                many-core NoCs. The mapping method consists of application
   Based on the NoC platforms, a large body of research             mapping and task mapping. The two-step mapping method
addressing the mapping problem has been undertaken in the           first finds a region on the NoC for each application and then
last couple of years [9] [12] [10] [13]. In [9], Hu et al.          maps all tasks of the application into the region. Several
presented a branch and bound algorithm which maps the               strategies based on the MER technique are introduced for
tasks of a single application to nodes and generates a suitable     finding an objective MER for each application. Following the
deadlock-free routing function such that the total commu-           application mapping, a tree-model based algorithm is used to
nication energy consumption is minimized under specified             map all tasks of the application into the objective MER. By
performance constraints. Several well used task mapping algo-       optimizing the layout of both multiple applications and tasks




978-1-4244-8971-8/10/$26.00 © 2010 IEEE
within applications, the proposed method aims at achieving               Using these definitions, the problem of the multi-application
lower network latency and energy consumption for multiple             mapping can be described as follows:
applications on the many-core NoCs.                                      Given a set of TGs and a CRG, find a mapping area (MA)
                                                                      on CRG for each TG which can accommodate all tasks of the
            II. PROBLEM FORMULIZATION
                                                                      TG, also find a position within the MA for each task such that
A. System Model                                                       the lowest overall network delay and communication energy
   The target system is shown in Figure 1, consisting of a            consumption can be achieved for the given set of TGs.
Real-time Operating System (RTOS) and a NoC platform. The             C. Objective Formulization
NoC provides the computation and communication resources
to implement multiple applications. The RTOS schedules the               Since the network delay is proportional to the communi-
given set of applications (e.g. A1 to A6 in Figure 1) and             cation distance between the source and destination nodes on
manages the resources on the NoC. The mapper runs the proposed        the NoC, one feasible way to reduce network delay is to
mapping algorithm to map each application on a feasible               shorten the communication distance among tasks as much as
region and the loader loads all tasks on PEs according to the         possible. This can be achieved in the process of finding the
mapping solution. This work deals with on-line scenarios, i.e.,       optimal MA for an application. We use the nodes average
the RTOS does not know in advance when each application               distance (NAD) mentioned in [10] to evaluate the average
arrives and how many PEs it needs. In this paper, we focus            communication distance within the MA. NAD is defined as
on the mapping algorithm of the mapper.                               the average distance between two randomly selected nodes in
                                                                      NoC architecture. For an X × Y mesh NoC, the NAD is:
                                                                      $$NAD = \frac{X + Y}{3} \times \left(1 - \frac{1}{X \times Y}\right) \qquad (1)$$
                                                                      Equation (1) implies that for a given application, the
                                                                      average communication distance among tasks varies when
                                                                      different areas are used to map the tasks of the application.
                     Fig. 1: System Model                             The more compact the area is, the smaller NAD it achieves.
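Equation (1) can be checked numerically. The following is a minimal sketch, not part of the original method: the function names are hypothetical, and the brute-force check assumes that "two randomly selected nodes" means an ordered pair in which a node may be paired with itself, with distance measured in mesh hops.

```python
from itertools import product
import math


def nad(x: int, y: int) -> float:
    """Nodes average distance of an x-by-y mesh, following Equation (1)."""
    return (x + y) / 3 * (1 - 1 / (x * y))


def nad_brute_force(x: int, y: int) -> float:
    """Average hop distance over all ordered node pairs (self-pairs included)."""
    nodes = list(product(range(x), range(y)))
    total = sum(
        abs(ax - bx) + abs(ay - by)
        for (ax, ay), (bx, by) in product(nodes, nodes)
    )
    return total / len(nodes) ** 2


# Two areas with 16 nodes each: a square and a thin strip.
print(nad(4, 4), nad(2, 8))
print(math.isclose(nad(3, 5), nad_brute_force(3, 5)))
```

Under these assumptions the square area yields the smaller NAD of the two, which is the compactness argument made in the text.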
B. Problem Description                                                   The energy consumption of a communication between tasks
                                                                      ti and tj is determined by both the communication weight wij
   In the single-application mapping scenarios, the mapping
                                                                      and the distance |lij |. To reduce the communication energy
problem is how to find an appropriate position for each task
                                                                      consumption, minimizing the weighted communication of the
of the application subject to particular performance or cost
                                                                      application (WCA) has been proved to be efficient [18]. The
metrics. In the multi-application scenarios, the problem is
                                                                      WCA is defined as the sum of products of the wij and |lij |
extended to search for the optimal positions for both the
                                                                      for all communications in an application as follows:
applications and tasks of the individual application. We first
give the definitions regarding the target application and NoC                  $$WCA = \sum_{\forall i,j} w_{ij} \times |l_{ij}| \qquad (2)$$
architecture used in this paper.
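Equation (2) can be evaluated directly once every task has been assigned a node. The sketch below is illustrative only: the function names, the three-task application and its weights are hypothetical, and |l_ij| is assumed to be the hop count between mesh coordinates under X-Y routing.

```python
def hop_distance(a, b):
    """Hops between mesh coordinates a and b (Manhattan distance)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def wca(edges, placement):
    """Weighted communication of an application, following Equation (2).

    edges: iterable of (ti, tj, wij) tuples.
    placement: dict mapping each task to its (x, y) node.
    """
    return sum(
        w * hop_distance(placement[ti], placement[tj]) for ti, tj, w in edges
    )


# Hypothetical application; the weights are made up for illustration.
edges = [("t1", "t2", 80), ("t2", "t3", 10), ("t1", "t3", 20)]
near = {"t1": (0, 0), "t2": (1, 0), "t3": (0, 1)}
far = {"t1": (0, 0), "t2": (3, 0), "t3": (0, 1)}
print(wca(edges, near), wca(edges, far))
```

Moving the heavily communicating pair t1 and t2 apart raises the WCA, which is why the task mapping step tries to keep such pairs close.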
   Definition 1: We assume that each application has already             Based on these formulizations, the objectives of the pro-
been implemented as a set of tasks. The application is modeled        posed method are transformed into seeking the most compact
by a task graph (TG). A TG is a directed graph TG =                   mapping area MA with smallest NAD and the optimized task
< T, C >, where T = {t1 , t2 , . . . , tp } represents the set        mapping solution with minimized WCA.
of tasks, corresponding to the set of TG vertices, and C =
{(ti , tj , wij )} denotes the set of communications between               III. MULTI-APPLICATION MULTI-STEP MAPPING
tasks, corresponding to the set of TG edges. The edge weight             To reach the two goals mentioned in the previous section,
wij in (ti , tj , wij ) represents the total data amount, sent from   we propose a two-step multi-application mapping method.
ti to tj . The number of tasks p in TG is denoted as the size         The mapping consists of two sequential phases: application
of the given application.                                             mapping (AM) and task mapping (TM). AM deals with
   Definition 2: A NoC is modeled as a communication re-               the mapping of multiple applications and its purpose is to
source graph (CRG). A CRG is a directed graph CRG =                   optimize the layout of multiple applications mapped on the
< N, L >, where N = {n1 , n2 , . . . , nq } denotes the set of        NoC and find the optimal MA with the minimal NAD for each
nodes on the NoC, corresponding to the set of CRG vertices,           application. TM works after AM to conduct the task mapping
and L = {(ni , nj , |lij |)} designates the set of routing paths      of an individual application and achieve the minimized WCA.
from node ni to node nj , corresponding to the edges of CRG.
|lij | represents the communication length from node ni to node       A. Application Mapping (AM)
nj . The number of nodes q in CRG is denoted as the size of             On a 2-D mesh NoC, any sub-mesh or rectangle can be
the NoC. For the sake of simplicity, in this paper, the NoC is        regarded as a piece of compact area. Thus, the problem of
assumed to be a homogeneous 2-D mesh using deterministic X-Y          AM is turned into the problem of managing the rectangles
routing strategy.                                                     on the NoC. To do this, AM adopts the concept of maximal
empty rectangle (MER), which was originally used to solve               smallest size, the one with minimal A(R) is selected.
the placement problem in FPGA design [2].                               • Best Shape Best Size (BShBS): Similar to the previous
   1) MER Technique: A MER is an empty rectangle that is not             one, among all candidate MERs with the same minimal
contained by any other empty rectangles. In our case, a MER             A(R), the one with smallest size is selected.
represents a cluster of free nodes on the NoC that is used to      Whenever an objective MER Rm is selected, AM will choose
map an application. Figure 2 shows an example of application       a mapping area MA with minimal A(M A) in Rm to map the
mapping using the MER technique. At first, the whole surface        given application. In this paper, we define the corner of the
of the NoC is represented by one MER R0 (Figure 2a). After         objective MER Rm which is closest to any corner of the NoC
the mapping of application A1 , the R0 is split into R1 and R2     as the starting point to create the MA. The reasons behind this
(Figure 2b). In Figure 2c, the R1 is further fragmented into       include to reduce fragmentations along the borders of the NoC
R3 and R4 after the application A2 has been mapped. The            as well as to reduce the congestion in the middle area of the
MERs R2 , R3 and R4 can be used for the future application         NoC by leaving free MERs there. The created area MA will
mapping. Let w(R) and h(R) be the width and height of the          be returned as an input for TM phase.
MER R, the normalized aspect ratio A(R) of the MER R is               To deal with the third case, the LS+C strategy is applied.
defined as:                                                            • Largest Size + Combining (LS+C): In this case, the
                 $$A(R) = \frac{\max\{w(R), h(R)\}}{\min\{w(R), h(R)\}} \qquad (3)$$         application has to be mapped on separate MERs. To avoid
                                                                        increasing communication cost between more distant
The aspect ratio A(R) implies the shape of the MER. If it               MERs with small size, LS+C chooses the free MER
equals 1, the MER is a square. Otherwise, it is a standard              with largest number of PEs as the primary area and then
rectangle.                                                              combines the nearest free MERs to get adequate PEs
                                                                        for the application. The combined mapping area MA is
                                                                        returned as an input for TM phase.
                                                                      3) MER Merging: When the execution of an application
                                                                   completes, the area occupied by the application can be released
                                                                   and merged with neighboring free MERs to get larger MERs
                                                                   for the future mappings.
                                                                      Combining these techniques and strategies together, the
           Fig. 2: Application Mapping Using MER (panels (a), (b), (c))
                                                                   algorithm of AM is described as Algorithm 1.
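The MER definition can be made concrete with a brute-force enumeration on a small occupancy grid. This is a minimal sketch under stated assumptions, not the MER management algorithm of [2]: the function names and the (x, y, w, h) rectangle encoding are hypothetical, and the exhaustive search is only practical for small meshes.

```python
from itertools import product


def is_empty(occ, x, y, w, h):
    """True if the w-by-h rectangle with corner (x, y) is on the grid and free."""
    rows, cols = len(occ), len(occ[0])
    if x < 0 or y < 0 or x + w > cols or y + h > rows:
        return False
    return all(not occ[r][c] for r in range(y, y + h) for c in range(x, x + w))


def maximal_empty_rectangles(occ):
    """List every empty rectangle that cannot grow by one row or column."""
    rows, cols = len(occ), len(occ[0])
    mers = []
    for x, y in product(range(cols), range(rows)):
        for w, h in product(range(1, cols - x + 1), range(1, rows - y + 1)):
            if not is_empty(occ, x, y, w, h):
                continue
            grown = [
                (x - 1, y, w + 1, h),  # extend to the left
                (x, y, w + 1, h),      # extend to the right
                (x, y - 1, w, h + 1),  # extend downwards
                (x, y, w, h + 1),      # extend upwards
            ]
            if not any(is_empty(occ, *g) for g in grown):
                mers.append((x, y, w, h))
    return mers


# A 4 x 3 mesh where one application occupies a 2 x 2 corner (1 = occupied).
occ = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
print(maximal_empty_rectangles(occ))
```

As in the split of R0 after the first application is mapped, the free area is described by two overlapping MERs rather than by a partition into disjoint rectangles.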
   2) Objective MER Selection: For a given application with
the size p, AM tries to find an optimal or near-optimal              Algorithm 1: Multi-Application Mapping
objective MER Rm to map the application. Based on the state          Input : TGs: a set of applications, CRG: a 2-D mesh
of MERs on the NoC, the cases that AM possibly faces are:                    with size W × H
(1) the total amount of PEs in all MERs is not adequate to           Output: The mapping areas for applications in A
accommodate the given application; (2) there is at least one       1    Initiate the original MERs list R0 with size W × H.
candidate MER that can accommodate the given application;          2    if the free PEs on the NoC can not accommodate the
(3) the total amount of PEs in all MERs is adequate to                  arriving application Ai then
accommodate the given application, but none of them can            3         Reject the mapping request.
fit the application alone.
   In the first case, the mapping request will be rejected at       4    else if at least one MER can accommodate Ai then
this time and the RTOS can try the mapping later. For the          5        Use appropriate strategy to select one objective MER
second case, we propose the following strategies for finding                 and create the mapping area MA.
the objective MER.                                                 6    else
   • Best Size (BS): BS chooses the candidate MER with the
                                                                   7        Use the LS+C strategy to find a mapping area MA.
     smallest size as the objective MER Rm . Intuitively, this     8    if application Aj is completed then
     strategy tries to keep the big rectangles for the future      9        Merge the area occupied by Aj with neighboring free
     application mapping.                                                   MERs;
   • Best Shape (BSh): It is noteworthy in Equation (1) that,      10   Repeat 2-9 until MA for each application is found.
     an area with the same width X and height Y holds
     the minimal NAD among all areas with size X × Y .                Figure 3 is an example of the application mapping using
     Taking this into consideration, BSh strategy chooses the      Algorithm 1. Four applications with size 25, 16, 16, 9 used in
     candidate MER with the minimal A(R) as the objective          the experiment in Section IV, denoted as FFT(25), X264(16),
     MER Rm . The reason behind BSh is that in such a MER,        TPCH(16) and FFT(9) respectively, are mapped sequentially
     the application is more likely to be mapped in an area close   on a NoC with size 10 × 7. Figure 3a and 3b are the final
     to square so that a smaller NAD can be achieved.              mapping under the BS and BSh strategy respectively. The
   • Best Size Best Shape (BSBSh): BSBSh is extended from          main difference of these two mapping results is the transposed
     BS. If there are several candidate MERs with the same         locations of application X264(16) and TPCH(16). Under both
strategies, the LS+C strategy is used for the application           The major responsibility of the AM algorithm is to manage
FFT(9).                                                          the MERs list. As mentioned in [2], the algorithm of managing
                                                                 MERs is O(n^2) for n mapped applications.
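The three mapping cases and the BS, BSh, BSBSh and BShBS strategies of the application mapping step can be sketched as a single selection function. This is an illustrative sketch with hypothetical names: MERs are reduced to their width and height, the number of free PEs is passed in separately because MERs may overlap, ties beyond the stated keys fall back to list order, and the combining part of LS+C is not modelled.

```python
from typing import NamedTuple


class MER(NamedTuple):
    w: int
    h: int

    @property
    def size(self) -> int:
        return self.w * self.h

    @property
    def aspect(self) -> float:
        """Normalized aspect ratio A(R) of Equation (3)."""
        return max(self.w, self.h) / min(self.w, self.h)


# Sort keys for the four strategies described in the text.
STRATEGY_KEYS = {
    "BS": lambda r: r.size,
    "BSh": lambda r: r.aspect,
    "BSBSh": lambda r: (r.size, r.aspect),
    "BShBS": lambda r: (r.aspect, r.size),
}


def select_objective_mer(mers, free_pes, p, strategy="BS"):
    """Return (case, mer) for an application with p tasks.

    case is "reject", "candidate" or "combine"; mer is None on reject.
    """
    if free_pes < p:
        return "reject", None
    candidates = [r for r in mers if r.size >= p]
    if candidates:
        return "candidate", min(candidates, key=STRATEGY_KEYS[strategy])
    # LS+C: the largest MER becomes the primary area; picking the nearest
    # free MERs to combine with it is left out of this sketch.
    return "combine", max(mers, key=lambda r: r.size)


# Hypothetical state of the NoC.
mers = [MER(2, 8), MER(4, 4), MER(5, 5)]
print(select_objective_mer(mers, free_pes=40, p=16, strategy="BS"))
print(select_objective_mer(mers, free_pes=40, p=16, strategy="BSBSh"))
```

With these inputs BS accepts the first 16-node MER it finds, while BSBSh breaks the size tie in favour of the square one.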
                                                                 C. Task Mapping (TM)
                                                                    After the mapping area MA for a given application has been
                                                                 obtained in the AM phase, the role of TM is to map the
                                                                 tasks of the application with the purpose of minimizing the
                                                                 W CA of the application. To address the task mapping, we
         (a) BS Mapping                  (b) BSh Mapping         propose a tree-model based mapping algorithm. The mapping
      Fig. 3: Application Mapping Using BS and BSh               algorithm consists of two parts: the abstraction of a mapping
                                                                 area MA into an extended tree structure and the mapping of
                                                                 an application onto the extended tree. Figure 5 is an example
                                                                 of mapping the tasks of an application A2 (shown in Figure
                                                                 2c) on the selected MA.
  (a) BSh Mapping for TPCH(16)          (b) WNAD Mapping
         Fig. 4: Application Mapping Using WNAD

B. Weighted NAD
   In Algorithm 1, the MERs which cannot accommodate the
given application would not be selected as an objective MER
Rm as long as there are candidate MERs, although some of
them hold a smaller A(R) than the selected Rm and can
accommodate most tasks of the application. Figure 4a is an
example of application mapping under the BSh strategy. After
                                                                            Fig. 5: Tree-Model Based Task Mapping (panels: Task Graph (TG), Mapping Area (MA), Abstracted Tree; legend: task, mapped node, spare node)
the application FFT(25) and X264(16) have been mapped, the
                                                                    1) Tree Model of MA: The abstraction of a MA into an
candidate MER R1 is selected (shown in Figure 3b), although
                                                                 extended tree structure follows Algorithm 2. Simply put, the
the non-candidate MER R2 has a better shape and a close
                                                                 center point of the MA is chosen as the root node of the tree,
size (15) for the application TPCH(16). This is because the
                                                                 which has the shortest average distance to other nodes in the
combination of several separated MERs is likely to induce
                                                                 MA. The neighbors of the center point are put as the children
higher NAD and WCA than a monolithic MER. However, if
                                                                 nodes of the root node. The procedure continues until all nodes
the task mapping algorithm presented in the following section
                                                                 in the MA are put onto the tree (bottom right of Figure 5). The
is taken into account, it is reasonable to accept some non-
                                                                 structure is called an extended tree since some children may
candidate MER as the objective MER on which most of the
                                                                 have more than one parent node. This extended tree structure
tasks are able to be mapped. Since the task mapping always
                                                                 places the network nodes with shorter average distance (to
chooses the task which affects the WCA most and maps it prior
                                                                 other nodes) onto higher-level tree nodes. Intuitively, a task in
to other tasks, the last selected tasks have limited impact on
                                                                 the application with a large communication volume should be
the overall WCA even if they are mapped on separate MERs.
                                                                 placed on as high level on the tree as possible, in order to
Therefore, we propose another strategy for the objective MER
                                                                 minimize the total communication cost which is proportional
selection, termed as weighted NAD (WNAD). The WNAD of
                                                                 to the average communication distance.
a MER is defined as follows:
                 $$WNAD = \frac{N_{tasks}}{N_{nodes}} \times NAD \qquad (4)$$           Algorithm 2: Tree Abstraction Algorithm
                                                                   Input : mapping area MA
                                                                   Output: An extended tree abstraction
   where the first factor is the weighted ratio. Ntasks is the
number of tasks in the application. Nnodes is the number of      1   Select the center network node as the root node in the
nodes occupied by the tasks if the application is mapped on          tree;
the MER. For a candidate MER, the weighted ratio equals to       2   Traverse the NoC from the center node, record all its
1 and the WNAD strategy is equivalent to the BSh strategy.           neighbors as the child nodes;
The MER with a lower WNAD can accommodate more tasks             3   Repeat 2 for each child node until all nodes are in the
with a smaller NAD. Using the WNAD strategy, both the                tree.
candidate and non-candidate MERs presented in the previous          2) TM Algorithm: The mapping of applications onto the
strategies can be evaluated together to find the objective MER.   tree follows Algorithm 3. We calculate the communication
The Figure 4b is an example of using WNAD strategy to map        volume (CV , Definition 3) of each task in the task graph,
the same set of applications as in Figure 3.                     and place the task with the largest communication volume
onto the root node in the tree. Then we calculate the weighted          A cycle-accurate NoC simulator, Noxim [14], was extended
communication volumes of the remaining tasks to the ones             and used to simulate the four applications’ traffic on a
already mapped onto the tree, termed as affinity to partial tree      10 × 7 NoC and produce network delay and communication
(APT, Definition 4), and place the task with the largest APT         energy consumption under different mapping strategies. The
to the highest node available in the tree. This procedure iterates   workload traces of these four applications were gathered from
until all tasks have been mapped onto the tree.                      Simics [11] where the NoC was configured to model a chip
   Definition 3: Let ti be a task in the task graph TG, and wij       multiprocessor (CMP). Each PE has a core, a private L1
be the communication volume from ti to tj , then                     cache and a shared L2 cache bank. Memory controllers are
                                                                     connected to the top and bottom side of the chip. The static
                    $$CV_{t_i} = \sum_{\forall t_j \in T} (w_{ij} + w_{ji})$$                  non-uniform cache architecture (NUCA) [3] is implemented
                                                                     in our memory/cache architecture, in which data are mapped
CVti is the communication volume of ti .                             to cache banks statically.
  Definition 4: Let T be the set of mapped tasks on a tree,              The average communication distance (ACD), WCA, average
and ti be a task not yet mapped, then                                network latency (ANL), energy consumption (EC) under the
                                                                     different strategies were compared. The ACD is the average
                   AP Tti =             (wij + wji )                 communication distance among all tasks when the application
                              ∀tj ∈T                                 is mapped on the selected objective MER. The ANL is the
                                                                     average number of cycles needed for transferring one packet
AP Tti is the affinity of ti to the partial (mapped) tree.
                                                                     on the NoC.
    Algorithm 3: Task Mapping Algorithm
                                                                     B. Results Analysis
     Input : TG, Abstracted Tree of MA
     Output: Task Mapping on the Tree


                                                                                Average Communication Distance (ACD:hops)
                                                                                                                              4                                             BS+GI
                                                                                                                                                                            BS+Tree

1    Calculate CV for all tasks, and map the task with the                                                                   3.5

                                                                                                                              3
                                                                                                                                                                            BSh+GI
                                                                                                                                                                            BSh+Tree
                                                                                                                                                                            WNAD+GI
                                                                                                                                                                            WNAD+Tree
     largest CV onto the root node;                                                                                          2.5

2    Calculate AP T of all non-mapped tasks, and map the                                                                      2

                                                                                                                             1.5
     task with the largest AP T to the highest level tree node                                                                1

     available;                                                                                                              0.5

                                                                                                                              0
3    repeat 2 until all tasks have been mapped.                                                                                    FFT(25)   X264(16)   TPCH(16)   FFT(9)


                                                                                Fig. 6: ACD Using Different Strategies
   The tree-model based mapping has low complexity and high
                                                                                Weighted Communcation of Application(WCA)




                                                                                                                            100%                                            BS+GI
efficiency. For instance, compared to the greedy incremental                                                                                                                 BS+Tree
                                                                                                                                                                            BSh+GI
                                                                                                                                                                            BSh+Tree
                                                                                                                            80%
(GI) algorithm presented in [12], the tree-based mapping has                                                                                                                WNAD+GI
                                                                                                                                                                            WNAD+Tree


an algorithm complexity of O(N ), where N is the number                                                                     60%


of tasks in TG, while the GI algorithm has an algorithm                                                                     40%


complexity of O(N 2 ). By mapping tasks starting from the root                                                              20%

of the tree, the algorithm minimizes the W CA using the AP T                                                                 0%
                                                                                                                                   FFT(25)   X264(16)   TPCH(16)   FFT(9)
method and consequently reduces the energy consumption and
network delay.                                                                   Fig. 7: WCA Using Different Strategies
                                                                        Figure 6 shows the ACD for each application under different
                        IV.   EXPERIMENT
                                                                     strategies. BS+GI and BS+Tree respectively represent the
A. Experiment Setup                                                  cases where BS is applied to the application mapping and the
   Full system simulations were performed to evaluate the            GI and tree-model based algorithm to the task mapping, and
proposed method under different mapping strategies. Since the        so forth. The variant ACDs of the X264, TPCH and FFT(9)
comparison in [12] shows that the GI algorithm achieves good         under different strategies show the impact of objective MER
results compared with some other algorithms, the GI algorithm        on the ACD. For each of them, the optimal ACD is achieved
was chosen as a reference to evaluate the tree-model based           when they are mapped on an objective MER with minimal
algorithm used in task mapping. The tree-model based and GI          aspect ratio A(R), or intuitively, a rectangle close to square.
algorithms were used together with the BS, BSh and WNAD              Non-optimized mappings result highest ACD for X264 (3.14,
strategies of application mapping to compare the performance         25% higher than the optimal 2.49), TPCH (3.14, 25% higher
of these strategies. Four benchmark applications were selected,      than the optimal 2.49) and FFT (9) (2.49, 23% higher than the
three of them are from the SPLASH-2 [15] and PARSEC                  optimal 2.03). For most applications, the WNAD can obtain
[5] suite: FFT with 25 and 9 cores (FFT(25) and FFT(9)               the same or better solution than that of BS and BSh strategy.
respectively), X264 with 16 cores (X264(16)). Another TPC-           The only exception is the case of TPCH where one primary
H with 16 cores (TPCH(16)) is an ad-hoc, decision support            MER combing another MER is selected for the mapping under
benchmark from TPC [16]. The mapping was conducted in                the WNAD strategy, instead of a monolithic MER under the
order of FFT(25), X264(16), TPCH(16) and FFT(9).                     BS and BSh strategy. This does prove the negative impact
of separate MERs on the ACD. Also note that the task mapping algorithm has a negligible impact on the ACD.
   The normalized WCA of each application under different strategies is displayed in Figure 7. The impact of the application mapping on the WCA is consistent with that on the ACD (shown in Figure 6). However, it is noteworthy that the task mapping has a great impact on the WCA. In all cases, the tree-model based algorithm outperforms the GI algorithm. For example, under the WNAD strategy, the tree-model based algorithm achieves 35%, 17%, 16% and 32% lower WCA than the GI algorithm for each application. Furthermore, WNAD+Tree yields the lowest WCA for each application among all strategies.

   Fig. 8: ANL Using Different Strategies (bar chart; y-axis: normalized Average Network Latency; x-axis: BS, BSh, WNAD; series: GI, Tree-Model Based)

   Fig. 9: EC Using Different Strategies (bar chart; y-axis: normalized Energy Consumption; x-axis: BS, BSh, WNAD; series: GI, Tree-Model Based)

   The normalized simulation results of the ANL and the EC are shown in Figures 8 and 9. As anticipated by the WCA in Figure 7, the tree-model based algorithm achieves lower ANL and EC than the GI algorithm. The ANL of the tree-model based algorithm is 12%, 15% and 13% lower than that of the GI algorithm under the BS, BSh and WNAD strategies, respectively. Similar improvements hold for the EC. Furthermore, the WNAD strategy outperforms BS and BSh and achieves the lowest ANL and EC (about 5% lower on average). For this set of applications, the difference between BS and BSh is negligible with respect to the ANL and EC. The lowest ANL and EC are achieved by WNAD+Tree, which is 18% lower than the worst case under the BSh+GI strategy.

                                                 V. CONCLUSION
   An innovative method for mapping multiple applications onto future many-core NoCs is proposed. The two-step mapping method first finds a region on the NoC for a given application and then maps all tasks of the application into the region. Several strategies based on the MER technique, e.g. BS, BSh and WNAD, are introduced for the application mapping. By using these strategies, the algorithm can efficiently find the optimal objective MER to map the target application. Following the application mapping, a tree-model based algorithm is proposed for the task mapping and compared against an existing GI algorithm. The experiment shows that, in the common case, the MER with minimal aspect ratio is ideal for mapping a given application. Among the proposed strategies for application mapping, WNAD is likely to obtain a better solution than BS and BSh. For the task mapping, the proposed tree-model based algorithm outperforms the GI algorithm, achieving lower network latency and energy consumption. The WNAD+Tree strategy achieves the lowest network latency and energy consumption among all strategies.

                                              VI. ACKNOWLEDGEMENT
   The authors would like to thank the Academy of Finland for the financial support for this work.

                                                  REFERENCES
 [1] Krste Asanovic, Ras Bodik, Bryan C. Catanzaro, Joseph J. Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William L. Plishker, John Shalf, Samuel W. Williams, and Katherine A. Yelick. The landscape of parallel computing research: a view from Berkeley. (UCB/EECS-2006-183), December 2006.
 [2] K. Bazargan, R. Kastner, and M. Sarrafzadeh. Fast template placement for reconfigurable computing systems. IEEE Design & Test of Computers, 17(1):68–83, Jan.-Mar. 2000.
 [3] Bradford M. Beckmann and David A. Wood. Managing wire delay in large chip-multiprocessor caches. In Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture, pages 319–330, December 2004.
 [4] L. Benini and G. De Micheli. Networks on chips: a new SoC paradigm. Computer, 35(1):70–78, Jan 2002.
 [5] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The PARSEC benchmark suite: characterization and architectural implications. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques, pages 72–81, October 2008.
 [6] M. Denneau and H. S. Warren, Jr. 64-bit Cyclops: Principles of operation. IBM Tech-report, 2005.
 [7] P.P. Gelsinger. Microprocessors for the new millennium: Challenges, opportunities, and new frontiers. In Proceedings of The International Solid State Circuits Conference (ISSCC), pages 22–25, 2001.
 [8] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach, 4th Edition. Morgan Kaufmann, 2007.
 [9] Jingcao Hu and Radu Marculescu. Energy- and performance-aware mapping for regular NoC architecture. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol.24, No.4:551–562, 2005.
[10] Tang Lei and Shashi Kumar. A two-step genetic algorithm for mapping task graphs to a network on chip architecture. In DSD '03: Proceedings of the Euromicro Symposium on Digital Systems Design, page 180, Washington, DC, USA, 2003. IEEE Computer Society.
[11] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, February 2002.
[12] C.A.M. Marcon, E.I. Moreno, N.L.V. Calazans, and F.G. Moraes. Evaluation of algorithms for low energy mapping onto NoCs. In Proc. IEEE International Symposium on Circuits and Systems ISCAS 2007, pages 389–392, 2007.
[13] Srinivasan Murali, Martijn Coenen, Andrei Radulescu, Kees Goossens, and Giovanni De Micheli. Mapping and configuration methods for multi-use-case networks on chips. In ASP-DAC '06: Proceedings of the 2006 Asia and South Pacific Design Automation Conference, pages 146–151, Piscataway, NJ, USA, 2006. IEEE Press.
[14] University of Catania. Noxim. http://www.noxim.org/.
[15] Jaswinder Pal Singh, Anoop Gupta, Moriyoshi Ohara, Evan Torrie, and Steven Cameron Woo. The SPLASH-2 programs: Characterization and methodological considerations. Computer Architecture, International Symposium on, 0:24, 1995.
[16] TPC. TPC-H. http://www.tpc.org/tpch/.
[17] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar. An 80-tile sub-100-W teraflops processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits, 43(1):29–41, 2008.
[18] Bo Yang, Thomas Canhao Xu, Tero Santti, and Juha Plosila. Tree-model based mapping for energy-efficient and low-latency network-on-chip. In Design and Diagnostics of Electronic Circuits and Systems (DDECS), pages 189–192, 14-16 2010.
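
The task mapping procedure of Algorithm 3, together with the CV and APT measures of Definitions 3 and 4, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `pair_volume`, `map_tasks_to_tree` and `nodes_by_level` are hypothetical, and the abstracted tree is assumed to be given simply as a list of node identifiers ordered from the root downwards, so that "the highest node available" is the next unused entry. As written, the sketch scans the unmapped tasks at every step and therefore runs in quadratic time; it does not attempt to reproduce the O(N) complexity stated for the tree-model based algorithm.

```python
from typing import Dict, Hashable, List, Tuple

# (source task, destination task) -> communication volume w_ij
Weights = Dict[Tuple[str, str], float]


def pair_volume(w: Weights, a: str, b: str) -> float:
    """Return w_ab + w_ba, treating absent edges as zero volume."""
    return w.get((a, b), 0.0) + w.get((b, a), 0.0)


def map_tasks_to_tree(
    tasks: List[str], w: Weights, nodes_by_level: List[Hashable]
) -> Dict[str, Hashable]:
    """Greedy task placement in the spirit of Algorithm 3.

    nodes_by_level is an assumed representation of the abstracted tree:
    node identifiers sorted from the root (index 0) to the lowest level.
    """
    if not tasks:
        return {}
    if len(tasks) > len(nodes_by_level):
        raise ValueError("more tasks than tree nodes")

    # Step 1: CV of every task (Definition 3); the largest goes to the root.
    cv = {t: sum(pair_volume(w, t, u) for u in tasks if u != t) for t in tasks}
    unmapped = list(tasks)
    first = max(unmapped, key=lambda t: cv[t])
    mapping: Dict[str, Hashable] = {first: nodes_by_level[0]}
    unmapped.remove(first)

    # APT of each unmapped task to the partial tree (Definition 4).
    apt = {t: pair_volume(w, t, first) for t in unmapped}

    # Steps 2 and 3: repeatedly place the task with the largest APT
    # on the highest tree node that is still free.
    for node in nodes_by_level[1:]:
        if not unmapped:
            break
        best = max(unmapped, key=lambda t: apt[t])
        mapping[best] = node
        unmapped.remove(best)
        del apt[best]
        for t in unmapped:
            apt[t] += pair_volume(w, t, best)  # the mapped set T grew by `best`
    return mapping


if __name__ == "__main__":
    demo_weights = {("a", "b"): 10.0, ("b", "c"): 5.0, ("c", "d"): 1.0, ("a", "d"): 2.0}
    print(map_tasks_to_tree(["a", "b", "c", "d"], demo_weights, [0, 1, 2, 3]))
```

In the demo graph, task b has the largest CV (15) and takes the root; a, c and d then follow in decreasing order of their affinity to the tasks already placed. Ties are broken by list order, which is a choice of this sketch and not something the paper specifies.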

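
The observation in the results analysis that a rectangle close to a square gives the lowest ACD can be checked with a small geometric sketch. It assumes a 2D mesh with minimal (Manhattan) routing and uniform traffic between every pair of tiles, and it measures the mean hop count over all distinct tile pairs. This is a simplification: the paper's ACD depends on the actual task placement and traffic, so the values below are not expected to match those reported in Figure 6. The function name `average_hop_distance` is hypothetical.

```python
from itertools import combinations, product


def average_hop_distance(width: int, height: int) -> float:
    """Mean Manhattan distance over all distinct tile pairs of a width x height mesh."""
    tiles = list(product(range(width), range(height)))
    pairs = list(combinations(tiles, 2))
    if not pairs:
        return 0.0  # a single tile has no pairs to communicate between
    total = sum(abs(x1 - x2) + abs(y1 - y2) for (x1, y1), (x2, y2) in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    # Two rectangles that both hold 16 tasks.
    for shape in [(4, 4), (2, 8)]:
        print(shape, round(average_hop_distance(*shape), 2))
```

Under these assumptions the 4 × 4 region averages about 2.67 hops and the 2 × 8 region about 3.33 hops, which agrees with the qualitative trend that a lower aspect ratio shortens the communication distance.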
Weitere ähnliche Inhalte

Was ist angesagt?

Using an Explicit Nucleation Model in PRISIMS-PF to Predict Precipate Microst...
Using an Explicit Nucleation Model in PRISIMS-PF to Predict Precipate Microst...Using an Explicit Nucleation Model in PRISIMS-PF to Predict Precipate Microst...
Using an Explicit Nucleation Model in PRISIMS-PF to Predict Precipate Microst...PFHub PFHub
 
Initial study and implementation of the convolutional Perfectly Matched Layer...
Initial study and implementation of the convolutional Perfectly Matched Layer...Initial study and implementation of the convolutional Perfectly Matched Layer...
Initial study and implementation of the convolutional Perfectly Matched Layer...Arthur Weglein
 
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...M H
 
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_ReportSaptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_ReportSitakanta Mishra
 
Transfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyTransfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyNUPUR YADAV
 
Chapter on Book on Cloud Computing 96
Chapter on Book on Cloud Computing 96Chapter on Book on Cloud Computing 96
Chapter on Book on Cloud Computing 96Michele Cermele
 
Digital Image Watermarking Basics
Digital Image Watermarking BasicsDigital Image Watermarking Basics
Digital Image Watermarking BasicsIOSR Journals
 
Performance comparision 1307.4129
Performance comparision 1307.4129Performance comparision 1307.4129
Performance comparision 1307.4129Pratik Joshi
 
Software effort estimation through clustering techniques of RBFN network
Software effort estimation through clustering techniques of RBFN networkSoftware effort estimation through clustering techniques of RBFN network
Software effort estimation through clustering techniques of RBFN networkIOSR Journals
 
Image transmission in wireless sensor networks
Image transmission in wireless sensor networksImage transmission in wireless sensor networks
Image transmission in wireless sensor networkseSAT Publishing House
 
IRJET-Multiple Object Detection using Deep Neural Networks
IRJET-Multiple Object Detection using Deep Neural NetworksIRJET-Multiple Object Detection using Deep Neural Networks
IRJET-Multiple Object Detection using Deep Neural NetworksIRJET Journal
 
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural NetworksPR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural NetworksJinwon Lee
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitJinwon Lee
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Low Memory Low Complexity Image Compression Using HSSPIHT Encoder
Low Memory Low Complexity Image Compression Using HSSPIHT EncoderLow Memory Low Complexity Image Compression Using HSSPIHT Encoder
Low Memory Low Complexity Image Compression Using HSSPIHT EncoderIJERA Editor
 
Efficient Neural Architecture Search via Parameter Sharing
Efficient Neural Architecture Search via Parameter SharingEfficient Neural Architecture Search via Parameter Sharing
Efficient Neural Architecture Search via Parameter SharingJinwon Lee
 

Was ist angesagt? (17)

Using an Explicit Nucleation Model in PRISIMS-PF to Predict Precipate Microst...
Using an Explicit Nucleation Model in PRISIMS-PF to Predict Precipate Microst...Using an Explicit Nucleation Model in PRISIMS-PF to Predict Precipate Microst...
Using an Explicit Nucleation Model in PRISIMS-PF to Predict Precipate Microst...
 
Initial study and implementation of the convolutional Perfectly Matched Layer...
Initial study and implementation of the convolutional Perfectly Matched Layer...Initial study and implementation of the convolutional Perfectly Matched Layer...
Initial study and implementation of the convolutional Perfectly Matched Layer...
 
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
An Integrated Inductive-Deductive Framework for Data Mapping in Wireless Sens...
 
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_ReportSaptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
Saptashwa_Mitra_Sitakanta_Mishra_Final_Project_Report
 
Transfer Learning in NLP: A Survey
Transfer Learning in NLP: A SurveyTransfer Learning in NLP: A Survey
Transfer Learning in NLP: A Survey
 
Chapter on Book on Cloud Computing 96
Chapter on Book on Cloud Computing 96Chapter on Book on Cloud Computing 96
Chapter on Book on Cloud Computing 96
 
Digital Image Watermarking Basics
Digital Image Watermarking BasicsDigital Image Watermarking Basics
Digital Image Watermarking Basics
 
Performance comparision 1307.4129
Performance comparision 1307.4129Performance comparision 1307.4129
Performance comparision 1307.4129
 
Software effort estimation through clustering techniques of RBFN network
Software effort estimation through clustering techniques of RBFN networkSoftware effort estimation through clustering techniques of RBFN network
Software effort estimation through clustering techniques of RBFN network
 
Image transmission in wireless sensor networks
Image transmission in wireless sensor networksImage transmission in wireless sensor networks
Image transmission in wireless sensor networks
 
Hz2514321439
Hz2514321439Hz2514321439
Hz2514321439
 
IRJET-Multiple Object Detection using Deep Neural Networks
IRJET-Multiple Object Detection using Deep Neural NetworksIRJET-Multiple Object Detection using Deep Neural Networks
IRJET-Multiple Object Detection using Deep Neural Networks
 
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural NetworksPR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Low Memory Low Complexity Image Compression Using HSSPIHT Encoder
Low Memory Low Complexity Image Compression Using HSSPIHT EncoderLow Memory Low Complexity Image Compression Using HSSPIHT Encoder
Low Memory Low Complexity Image Compression Using HSSPIHT Encoder
 
Efficient Neural Architecture Search via Parameter Sharing
Efficient Neural Architecture Search via Parameter SharingEfficient Neural Architecture Search via Parameter Sharing
Efficient Neural Architecture Search via Parameter Sharing
 

Andere mochten auch (8)

41
4141
41
 
49
4949
49
 
52
5252
52
 
55
5555
55
 
61
6161
61
 
62
6262
62
 
94
9494
94
 
My profile
My profileMy profile
My profile
 

Ähnlich wie 53

ENERGY AND LATENCY AWARE APPLICATION MAPPING ALGORITHM & OPTIMIZATION FOR HOM...
ENERGY AND LATENCY AWARE APPLICATION MAPPING ALGORITHM & OPTIMIZATION FOR HOM...ENERGY AND LATENCY AWARE APPLICATION MAPPING ALGORITHM & OPTIMIZATION FOR HOM...
ENERGY AND LATENCY AWARE APPLICATION MAPPING ALGORITHM & OPTIMIZATION FOR HOM...cscpconf
 
CONFIGURABLE TASK MAPPING FOR MULTIPLE OBJECTIVES IN MACRO-PROGRAMMING OF WIR...
CONFIGURABLE TASK MAPPING FOR MULTIPLE OBJECTIVES IN MACRO-PROGRAMMING OF WIR...CONFIGURABLE TASK MAPPING FOR MULTIPLE OBJECTIVES IN MACRO-PROGRAMMING OF WIR...
CONFIGURABLE TASK MAPPING FOR MULTIPLE OBJECTIVES IN MACRO-PROGRAMMING OF WIR...ijassn
 
CONFIGURABLE TASK MAPPING FOR MULTIPLE OBJECTIVES IN MACRO-PROGRAMMING OF WIR...
CONFIGURABLE TASK MAPPING FOR MULTIPLE OBJECTIVES IN MACRO-PROGRAMMING OF WIR...CONFIGURABLE TASK MAPPING FOR MULTIPLE OBJECTIVES IN MACRO-PROGRAMMING OF WIR...
CONFIGURABLE TASK MAPPING FOR MULTIPLE OBJECTIVES IN MACRO-PROGRAMMING OF WIR...ijassn
 
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...
Estimation of Optimized Energy and Latency Constraint for Task Allocation in ...ijcsit
 
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...Stochastic Computing Correlation Utilization in Convolutional Neural Network ...
Stochastic Computing Correlation Utilization in Convolutional Neural Network ...TELKOMNIKA JOURNAL
 
OPTIMIZED TASK ALLOCATION IN SENSOR NETWORKS
OPTIMIZED TASK ALLOCATION IN SENSOR NETWORKSOPTIMIZED TASK ALLOCATION IN SENSOR NETWORKS
OPTIMIZED TASK ALLOCATION IN SENSOR NETWORKSZac Darcy
 
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...
PERFORMANCE FACTORS OF CLOUD COMPUTING DATA CENTERS USING [(M/G/1) : (∞/GDM O...ijgca
 
Massive parallelism with gpus for centrality ranking in complex networks
Massive parallelism with gpus for centrality ranking in complex networksMassive parallelism with gpus for centrality ranking in complex networks
Massive parallelism with gpus for centrality ranking in complex networksijcsit
 
A Novel Framework and Policies for On-line Block of Cores Allotment for Multi...
  • 1. Multi-Application Multi-Step Mapping Method for Many-Core Network-on-Chips

Bo Yang∗, Liang Guang∗‡, Thomas Canhao Xu∗‡, Alexander Wei Yin∗‡, Tero Säntti∗†, Juha Plosila∗†
∗ Department of Information Technology, University of Turku, Finland
† Academy of Finland, Research Council for Natural Sciences and Engineering
‡ Turku Center for Computer Science, Turku, Finland
{boyan, liagua, canxu, yinwei, teansa, juplos}@utu.fi
978-1-4244-8971-8/10$26.00 © 2010 IEEE

Abstract: Massive parallel computing performed on many-core Network-on-Chips (NoCs) is the future of computing. One feasible approach to implementing parallel computing is to deploy multiple applications on the NoC simultaneously. In this paper, we propose a multi-application mapping method that starts with application mapping, which finds a region on the NoC for each application, followed by task mapping, which maps all tasks of the application into that region. In the application mapping step, several strategies based on the maximal empty rectangle (MER) technique are introduced for finding an optimal region for each application. In the task mapping step, a tree-model based algorithm is used with the purpose of reducing the communication latency and energy consumption. The experiment results show that the proposed method can achieve a considerable reduction of network latency and energy consumption (up to 18%) for a given set of applications.

I. INTRODUCTION

Over the last 40 years, we have witnessed a series of remarkable developments in the computer industry. One of them is the increasing processing capability of the system. The increase is not only achieved by the performance improvements between the generations of uniprocessors, but also comes from the advent of multi-core or many-core architectures where tens to hundreds of processors or cores can be integrated on a single chip. Examples of such architectures are [6] and [17]. A recent study at the University of California, Berkeley [1] suggests that it will soon be possible to integrate more than 1000 cores on a single chip, since Moore's Law is still generously delivering transistors at the rate of twice every couple of years. While the number of on-chip cores increases, the communication among them is critical to the system performance and energy consumption. In the last decade, the NoC has been proposed as an alternative to the traditional bus and point-to-point ad hoc connections in order to address the challenge of increasing concurrent communication requirements as well as the difficulty of global synchronization [4].

Based on the NoC platforms, a large body of research addressing the mapping problem has been undertaken in the last couple of years [9] [12] [10] [13]. In [9], Hu et al. presented a branch and bound algorithm which maps the tasks of a single application to nodes and generates a suitable deadlock-free routing function such that the total communication energy consumption is minimized under specified performance constraints. Several well used task mapping algorithms in the literature are analyzed and compared in [12]. Tang et al. proposed a two-step genetic algorithm and the related software for mapping concurrent applications on a fixed NoC architecture [10]. Murali et al. presented a methodology to map multiple use-cases onto the NoC architecture, satisfying the constraints of each use-case [13]. In these works, multiple applications reuse the same platforms in different time slots. The main drawback of these systems is the timing overhead incurred by reconfiguring the NoC and loading new applications. Also, since various communication constraints and traffic characteristics of applications have to be satisfied using limited processing elements (PEs), the system design is more complicated and the optimized mapping for each application may not be achieved [13].

While the traditional approaches to maximizing the serial performance of processors by maximizing the clock speed and increasing instruction-level parallelism (ILP) have been shown to reach their limits [7] [8], the many-core NoC architectures provide more feasibility to deliver higher performance through parallel computing. The massive parallel computing performed on many-core NoCs is the future of computing [1]. With the increasing number and computational power of on-chip PEs, parallel computing on many-core NoCs can be realized not only at the instruction level, but also at the higher task and/or application levels. To realize the higher level parallelism and make full use of the abundant resources on the NoCs, it is no longer reasonable to focus only on the implementation of a single application when abundant PEs are available on the many-core NoCs. Instead, the design focus should shift from the single-application to the multi-application scenarios. More precisely, multiple applications could be deployed on different regions of the NoC and executed in parallel.

In this paper, we propose a novel mapping method whereby multiple applications can be simultaneously mapped on the many-core NoCs. The mapping method consists of application mapping and task mapping. The two-step mapping method first finds a region on the NoC for each application and then maps all tasks of the application into the region. Several strategies based on the MER technique are introduced for finding an objective MER for each application. Following the application mapping, a tree-model based algorithm is used to map all tasks of the application into the objective MER. By optimizing the layout of both multiple applications and tasks
  • 2. within applications, the proposed method aims at achieving lower network latency and energy consumption for multiple applications on the many-core NoCs.

II. PROBLEM FORMULIZATION

A. System Model

The target system is shown in Figure 1, consisting of a Real-time Operating System (RTOS) and a NoC platform. The NoC provides the computation and communication resources to implement multiple applications. The RTOS schedules the given set of applications (e.g. $A_1$ to $A_6$ in Figure 1) and manages the resources on the NoC. The mapper runs the proposed mapping algorithm to map each application on a feasible region and the loader loads all tasks on PEs according to the mapping solution. This work deals with on-line scenarios, i.e., the RTOS does not know in advance when each application arrives and how many PEs it needs. In this paper, we focus on the mapping algorithm of the mapper.

Fig. 1: System Model

B. Problem Description

In the single-application mapping scenarios, the mapping problem is how to find an appropriate position for each task of the application subject to particular performance or cost metrics. In the multi-application scenarios, the problem is extended to search for the optimal positions for both the applications and the tasks of each individual application. We first give the definitions regarding the target application and NoC architecture used in this paper.

Definition 1: We assume that each application has already been implemented as a set of tasks. The application is modeled by a task graph (TG). A TG is a directed graph $TG = \langle T, C \rangle$, where $T = \{t_1, t_2, \ldots, t_p\}$ represents the set of tasks, corresponding to the set of TG vertices, and $C = \{(t_i, t_j, w_{ij})\}$ denotes the set of communications between tasks, corresponding to the set of TG edges. The edge weight $w_{ij}$ in $(t_i, t_j, w_{ij})$ represents the total data amount sent from $t_i$ to $t_j$. The number of tasks $p$ in TG is denoted as the size of the given application.

Definition 2: A NoC is modeled as a communication resource graph (CRG). A CRG is a directed graph $CRG = \langle N, L \rangle$, where $N = \{n_1, n_2, \ldots, n_q\}$ denotes the set of nodes on the NoC, corresponding to the set of CRG vertices, and $L = \{(n_i, n_j, |l_{ij}|)\}$ designates the set of routing paths from node $n_i$ to node $n_j$, corresponding to the edges of CRG. $|l_{ij}|$ represents the communication length from node $n_i$ to node $n_j$. The number of nodes $q$ in CRG is denoted as the size of the NoC. For the sake of simplicity, in this paper, the NoC is assumed to be a homogeneous 2-D mesh using the deterministic X-Y routing strategy.

Using these definitions, the problem of the multi-application mapping can be described as follows:

Given a set of TGs and a CRG, find a mapping area (MA) on the CRG for each TG which can accommodate all tasks of the TG, and also find a position within the MA for each task such that the lowest overall network delay and communication energy consumption can be achieved for the given set of TGs.

C. Objective Formulization

Since the network delay is proportional to the communication distance between the source and destination nodes on the NoC, one feasible way to reduce network delay is to shorten the communication distance among tasks as much as possible. This can be achieved in the process of finding the optimal MA for an application. We use the nodes average distance (NAD) mentioned in [10] to evaluate the average communication distance within the MA. NAD is defined as the average distance between two randomly selected nodes in the NoC architecture. For an $X \times Y$ mesh NoC, the NAD is:

$$\mathrm{NAD} = \frac{X + Y}{3} \times \left(1 - \frac{1}{X \times Y}\right) \qquad (1)$$

Equation (1) implies that for a given application, the average communication distance among tasks varies when different areas are used to map the tasks of the application. The more compact the area is, the smaller the NAD it achieves.

The energy consumption of a communication between tasks $t_i$ and $t_j$ is determined by both the communication weight $w_{ij}$ and the distance $|l_{ij}|$. To reduce the communication energy consumption, minimizing the weighted communication of the application (WCA) has been proved to be efficient [18]. The WCA is defined as the sum of the products of $w_{ij}$ and $|l_{ij}|$ over all communications in an application:

$$\mathrm{WCA} = \sum_{\forall i,j} w_{ij} \times |l_{ij}| \qquad (2)$$

Based on these formulizations, the objectives of the proposed method are transformed into seeking the most compact mapping area MA with the smallest NAD and the optimized task mapping solution with the minimized WCA.

III. MULTI-APPLICATION MULTI-STEP MAPPING

To reach the two goals mentioned in the previous section, we propose a two-step multi-application mapping method. The mapping consists of two sequential phases: application mapping (AM) and task mapping (TM). AM deals with the mapping of multiple applications; its purpose is to optimize the layout of multiple applications mapped on the NoC and to find the optimal MA with the minimal NAD for each application. TM works after AM to conduct the task mapping of an individual application and achieve the minimized WCA.

A. Application Mapping (AM)

On a 2-D mesh NoC, any sub-mesh or rectangle can be regarded as a piece of compact area. Thus, the problem of AM is turned into the problem of managing the rectangles on the NoC. To do this, AM adopts the concept of maximal
  • 3. empty rectangle (MER), which was originally used to solve the placement problem in FPGA design [2].

1) MER Technique: A MER is an empty rectangle that is not contained by any other empty rectangle. In our case, a MER represents a cluster of free nodes on the NoC that is used to map an application. Figure 2 shows an example of application mapping using the MER technique. At first, the whole surface of the NoC is represented by one MER $R_0$ (Figure 2a). After the mapping of application $A_1$, $R_0$ is split into $R_1$ and $R_2$ (Figure 2b). In Figure 2c, $R_1$ is further fragmented into $R_3$ and $R_4$ after the application $A_2$ has been mapped. The MERs $R_2$, $R_3$ and $R_4$ can be used for future application mapping. Let $w(R)$ and $h(R)$ be the width and height of the MER $R$. The normalized aspect ratio $A(R)$ of the MER $R$ is defined as:

$$A(R) = \frac{\max\{w(R), h(R)\}}{\min\{w(R), h(R)\}} \qquad (3)$$

The aspect ratio $A(R)$ indicates the shape of the MER. If it equals 1, the MER is a square. Otherwise, it is a standard rectangle.

Fig. 2: Application Mapping Using MER (panels (a), (b), (c))

2) Objective MER Selection: For a given application with the size $p$, AM tries to find an optimal or near-optimal objective MER $R_m$ to map the application. Based on the state of the MERs on the NoC, the cases that AM possibly faces are: (1) the total amount of PEs in all MERs is not adequate to accommodate the given application; (2) there is at least one candidate MER that can accommodate the given application; (3) the total amount of PEs in all MERs is adequate to accommodate the given application, but none of them can fit the application alone.

In the first case, the mapping request will be rejected at this time and the RTOS can try the mapping later. For the second case, we propose the following strategies for finding the objective MER.

  • Best Size (BS): BS chooses the candidate MER with the smallest size as the objective MER $R_m$. Intuitively, this strategy tries to keep the big rectangles for future application mapping.
  • Best Shape (BSh): It is noteworthy in Equation (1) that an area with the same width X and height Y holds the minimal NAD among all areas with size $X \times Y$. Taking this into consideration, the BSh strategy chooses the candidate MER with the minimal $A(R)$ as the objective MER $R_m$. The reason behind BSh is that in such a MER, the application is more likely to be mapped in an area close to a square so that a smaller NAD can be achieved.
  • Best Size Best Shape (BSBSh): BSBSh is extended from BS. If there are several candidate MERs with the same smallest size, the one with the minimal $A(R)$ is selected.
  • Best Shape Best Size (BShBS): Similar to the previous one, among all candidate MERs with the same minimal $A(R)$, the one with the smallest size is selected.

Whenever an objective MER $R_m$ is selected, AM will choose a mapping area MA with minimal $A(MA)$ in $R_m$ to map the given application. In this paper, we define the corner of the objective MER $R_m$ which is closest to any corner of the NoC as the starting point to create the MA. The reasons behind this are to reduce fragmentation along the borders of the NoC and to reduce the congestion in the middle area of the NoC by leaving free MERs there. The created area MA will be returned as an input for the TM phase.

To deal with the third case, the LS+C strategy is applied.

  • Largest Size + Combining (LS+C): In this case, the application has to be mapped on separate MERs. To avoid the higher communication cost between distant, small MERs, LS+C chooses the free MER with the largest number of PEs as the primary area and then combines the nearest free MERs to obtain enough PEs for the application. The combined mapping area MA is returned as an input for the TM phase.

3) MER Merging: When the execution of an application completes, the area occupied by the application can be released and merged with neighboring free MERs to get larger MERs for future mappings.

Combining these techniques and strategies together, the algorithm of AM is described as Algorithm 1.

Algorithm 1: Multi-Application Mapping
Input : TGs: a set of applications, CRG: a 2-D mesh with size W × H
Output: The mapping areas for the applications in TGs
 1 Initiate the original MERs list R0 with size W × H.
 2 if the free PEs on the NoC cannot accommodate the arriving application Ai then
 3     Reject the mapping request.
 4 else if at least one MER can accommodate Ai then
 5     Use the appropriate strategy to select one objective MER and create the mapping area MA.
 6 else
 7     Use the LS+C strategy to find a mapping area MA.
 8 if application Aj is completed then
 9     Merge the area occupied by Aj with neighboring free MERs;
10 Repeat 2-9 until the MA for each application is found.

Figure 3 is an example of the application mapping using Algorithm 1. Four applications with sizes 25, 16, 16 and 9 used in the experiment in Section IV, denoted as FFT(25), X264(16), TPCH(16) and FFT(9) respectively, are mapped sequentially on a NoC with size 10 × 7. Figures 3a and 3b are the final mappings under the BS and BSh strategies respectively. The main difference between these two mapping results is the transposed locations of the applications X264(16) and TPCH(16). Under both
  • 4. strategies, the LS+C strategy is used for the application FFT(9).

Fig. 3: Application Mapping Using BS and BSh ((a) BS Mapping, (b) BSh Mapping)

Fig. 4: Application Mapping Using WNAD ((a) BSh Mapping for TPCH(16), (b) WNAD Mapping)

B. Weighted NAD

In Algorithm 1, the MERs which cannot accommodate the given application would not be selected as an objective MER $R_m$ as long as there are candidate MERs, although some of them hold a smaller $A(R)$ than the selected $R_m$ and can accommodate most tasks of the application. Figure 4a is an example of application mapping under the BSh strategy. After the applications FFT(25) and X264(16) have been mapped, the candidate MER $R_1$ is selected (shown in Figure 3b), although the non-candidate MER $R_2$ has a better shape and a close size (15) for the application TPCH(16). This is because the combination of several separated MERs is likely to induce a higher NAD and WCA than a monolithic MER. However, if the task mapping algorithm presented in the following section is taken into account, it is reasonable to accept some non-candidate MER as the objective MER on which most of the tasks can be mapped. Since the task mapping always chooses the task which affects the WCA most and maps it prior to other tasks, the last selected tasks have limited impact on the overall WCA even if they are mapped on separate MERs. Therefore, we propose another strategy for the objective MER selection, termed weighted NAD (WNAD). The WNAD of a MER is defined as follows:

$$\mathrm{WNAD} = \frac{N_{tasks}}{N_{nodes}} \times \mathrm{NAD} \qquad (4)$$

where the first factor is the weighted ratio. $N_{tasks}$ is the number of tasks in the application. $N_{nodes}$ is the number of nodes occupied by the tasks if the application is mapped on the MER. For a candidate MER, the weighted ratio equals 1 and the WNAD strategy is equivalent to the BSh strategy. The MER with a lower WNAD can accommodate more tasks with a smaller NAD. Using the WNAD strategy, both the candidate and non-candidate MERs presented in the previous strategies can be evaluated together to find the objective MER. Figure 4b is an example of using the WNAD strategy to map the same set of applications as in Figure 3.

The major responsibility of the AM algorithm is to manage the MERs list. As mentioned in [2], the algorithm of managing MERs is $O(n^2)$ for $n$ mapped applications.

C. Task Mapping (TM)

After the mapping area MA for a given application has been obtained in the AM phase, the role of TM is to map the tasks of the application with the purpose of minimizing the WCA of the application. To address the task mapping, we propose a tree-model based mapping algorithm. The mapping algorithm consists of two parts: the abstraction of a mapping area MA into an extended tree structure and the mapping of an application onto the extended tree. Figure 5 is an example of mapping the tasks of an application $A_2$ (shown in Figure 2c) on the selected MA.

Fig. 5: Tree-Model Based Task Mapping (panels: Task Graph (TG), Mapping Area (MA), Abstracted Tree; the mapping is shown in Steps 1 to 7)

1) Tree Model of MA: The abstraction of a MA into an extended tree structure follows Algorithm 2. Simply put, the center point of the MA, which has the shortest average distance to the other nodes in the MA, is chosen as the root node of the tree. The neighbors of the center point are put as the children nodes of the root node. The procedure continues until all nodes in the MA are put onto the tree (bottom right of Figure 5). The structure is called an extended tree since some children may have more than one parent node. This extended tree structure places the network nodes with shorter average distance (to other nodes) onto higher-level tree nodes. Intuitively, a task in the application with a large communication volume should be placed on as high a level of the tree as possible, in order to minimize the total communication cost, which is proportional to the average communication distance.

Algorithm 2: Tree Abstraction Algorithm
Input : mapping area MA
Output: An extended tree abstraction
 1 Select the center network node as the root node in the tree;
 2 Traverse the NoC from the center node, record all its neighbors as the child nodes;
 3 Repeat 2 for each child node until all nodes are in the tree.

2) TM Algorithm: The mapping of applications onto the tree follows Algorithm 3. We calculate the communication volume (CV, Definition 3) of each task in the task graph, and place the task with the largest communication volume
  • 5. onto the root node in the tree. Then we calculate the weighted communication volumes of the remaining tasks to the ones already mapped onto the tree, termed affinity to partial tree (APT, Definition 4), and place the task with the largest APT on the highest node available in the tree. This procedure iterates until all tasks have been mapped onto the tree.

Definition 3: Let $t_i$ be a task in the task graph TG, and $w_{ij}$ be the communication volume from $t_i$ to $t_j$, then

$$CV_{t_i} = \sum_{\forall t_j \in T} (w_{ij} + w_{ji})$$

$CV_{t_i}$ is the communication volume of $t_i$.

Definition 4: Let $T$ be the set of mapped tasks on a tree, and $t_i$ be a task not yet mapped, then

$$APT_{t_i} = \sum_{\forall t_j \in T} (w_{ij} + w_{ji})$$

$APT_{t_i}$ is the affinity of $t_i$ to the partial (mapped) tree.

Algorithm 3: Task Mapping Algorithm
Input : TG, Abstracted Tree of MA
Output: Task Mapping on the Tree
 1 Calculate CV for all tasks, and map the task with the largest CV onto the root node;
 2 Calculate APT of all non-mapped tasks, and map the task with the largest APT to the highest level tree node available;
 3 Repeat 2 until all tasks have been mapped.

The tree-model based mapping has low complexity and high efficiency. For instance, compared to the greedy incremental (GI) algorithm presented in [12], the tree-based mapping has an algorithm complexity of $O(N)$, where $N$ is the number of tasks in TG, while the GI algorithm has an algorithm complexity of $O(N^2)$. By mapping tasks starting from the root of the tree, the algorithm minimizes the WCA using the APT method and consequently reduces the energy consumption and network delay.

IV. EXPERIMENT

A. Experiment Setup

Full system simulations were performed to evaluate the proposed method under different mapping strategies. Since the comparison in [12] shows that the GI algorithm achieves good results compared with some other algorithms, the GI algorithm was chosen as a reference to evaluate the tree-model based algorithm used in task mapping. The tree-model based and GI algorithms were used together with the BS, BSh and WNAD strategies of application mapping to compare the performance of these strategies. Four benchmark applications were selected. Three of them are from the SPLASH-2 [15] and PARSEC [5] suites: FFT with 25 and 9 cores (FFT(25) and FFT(9) respectively) and X264 with 16 cores (X264(16)). The fourth, TPC-H with 16 cores (TPCH(16)), is an ad hoc decision support benchmark from TPC [16]. The mapping was conducted in the order of FFT(25), X264(16), TPCH(16) and FFT(9).

A cycle-accurate NoC simulator, Noxim [14], was extended and used to simulate the traffic of the four applications on a 10 × 7 NoC and to produce the network delay and communication energy consumption under different mapping strategies. The workload traces of these four applications were gathered from Simics [11], where the NoC was configured to model a chip multiprocessor (CMP). Each PE has a core, a private L1 cache and a shared L2 cache bank. Memory controllers are connected to the top and bottom sides of the chip. The static non-uniform cache architecture (NUCA) [3] is implemented in our memory/cache architecture, in which data are mapped to cache banks statically.

The average communication distance (ACD), WCA, average network latency (ANL) and energy consumption (EC) under the different strategies were compared. The ACD is the average communication distance among all tasks when the application is mapped on the selected objective MER. The ANL is the average number of cycles needed for transferring one packet on the NoC.

B. Results Analysis

Fig. 6: ACD Using Different Strategies (y-axis: Average Communication Distance (ACD), in hops; x-axis: FFT(25), X264(16), TPCH(16), FFT(9); series: BS+GI, BS+Tree, BSh+GI, BSh+Tree, WNAD+GI, WNAD+Tree)

Fig. 7: WCA Using Different Strategies (y-axis: Weighted Communication of Application (WCA), normalized, in percent; same x-axis and series as Fig. 6)

Figure 6 shows the ACD for each application under different strategies. BS+GI and BS+Tree respectively represent the cases where BS is applied to the application mapping and the GI and tree-model based algorithms to the task mapping, and so forth. The varying ACDs of X264, TPCH and FFT(9) under different strategies show the impact of the objective MER on the ACD. For each of them, the optimal ACD is achieved when they are mapped on an objective MER with the minimal aspect ratio $A(R)$, or intuitively, a rectangle close to a square. Non-optimized mappings result in the highest ACD for X264 (3.14, 25% higher than the optimal 2.49), TPCH (3.14, 25% higher than the optimal 2.49) and FFT(9) (2.49, 23% higher than the optimal 2.03). For most applications, WNAD can obtain the same or a better solution than the BS and BSh strategies. The only exception is the case of TPCH, where one primary MER combined with another MER is selected for the mapping under the WNAD strategy, instead of a monolithic MER under the BS and BSh strategies. This does prove the negative impact
  • 6. of separate MERs on the ACD. Also note, the task mapping for application mapping, the WNAD is likely to obtain a better algorithm has negligible impact on the ACD. solution than BS and BSh. For the task mapping, the proposed The normalized WCA of each application under different tree-model based algorithm outperforms the GI algorithm on strategies is displayed in Figure 7. The impact of the applica- achieving lower network latency and energy consumption. tion mapping on the WCA keeps consistent with that on the WNAD+Tree strategy achieves lowest network latency and ACD (shown in Figure 6). However, it is noteworthy that the energy consumption among all strategies. task mapping has a great impact on the WCA. In all cases, VI. ACKNOWLEDGEMENT the tree-model based algorithm outperforms the GI algorithm. The authors would like to thank the Academy of Finland For example, using the WNAD+Tree strategy, the tree-model for the financial support for this work. based algorithm achieves 35%, 17%, 16% and 32% lower WCA than the GI algorithm for each application. Furthermore, R EFERENCES WNAD+Tree contributes the lowest WCA for each application [1] Krste Asanovic, Ras Bodik, Bryan C. Catanzaro, Joseph J. Gebis, Parry among all strategies. Husbands, Kurt Keutzer, David A. Patterson, William L. Plishker, John 100% GI Shalf, Samuel W. Williams, and Katherine A. Yelick. The landscape of Tree−Model Based parallel computing research: a view from berkeley. (UCB/EECS-2006- Average Network Latency (ANL) 80% 183), December 2006. 60% [2] K. Bazargan, R. Kastner, and M. Sarrafzadeh. Fast template placement for reconfigurable computing systems. Design Test of Computers, IEEE, 40% 17(1):68 –83, jan-mar 2000. 20% [3] Bradford M. Beckmann and David A. Wood. Managing wire delay in large chip-multiprocessor caches. In Proceedings of the 37th annual 0% BS BSh WNAD IEEE/ACM International Symposium on Microarchitecture, pages 319– Fig. 8: ANL Using Different Strategies 330, December 2004. 
100% [4] L. Benini and G. De Micheli. Networks on chips: a new soc paradigm. GI Tree−Model Based Computer, 35(1):70–78, Jan 2002. [5] Christian Bienia, Sanjeev Kumar, Jaswinder Pal Singh, and Kai Li. The Energy Consumption (EC) 80% 60% parsec benchmark suite: characterization and architectural implications. In Proceedings of the 17th international conference on Parallel archi- 40% tectures and compilation techniques, pages 72–81, October 2008. 20% [6] M. Denneau and H. S Warren, Jr. 64-bit cyclops: Principles of operation. IBMTech-report, 2005. 0% BS BSh WNAD [7] P.P. Gelsinger. Microprocessors for the new millennium: Challenges, Fig. 9: EC Using Different Strategies opportunities, and new frontiers. In Proceedings of The International Solid State Circuits Conference (ISSCC), pages 22–25, 2001. The normalized simulation results of the ANL and the EC [8] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative are demonstrated in Figure 8 and 9. As anticipated by the Application, 4th Edition. Morgan Kauffman, 2007. WCA in Figure 7, the tree-model based algorithm achieves [9] Radu Marculescu Jingcao Hu. Energy- and performance-aware mapping for regular noc architecture. IEEE Transations On Computer-Aided lower ANL and EC than the GI algorithm. The ANL of tree- Design of Integrated Circuits and Systems, Vol.24, No.4:551–562, 2005. model based algorithm is 12%, 15%, 13% lower than that of [10] Tang Lei and Shashi Kumar. A two-step genetic algorithm for mapping the GI under BS, BSh and WNAD strategies respectively. The task graphs to a network on chip architecture. In DSD ’03: Proceedings of the Euromicro Symposium on Digital Systems Design, page 180, same achievements keeps for the EC. Furthermore, WNAD Washington, DC, USA, 2003. IEEE Computer Society. strategy outperform the BS and BSh and achieves lowest [11] P.S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, ANL and EC (about 5% lower in average ). For this set of J. Hogberg, F. 
Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, February 2002. applications, the difference between BS and BSh is negligible [12] C.A.M. Marcon, E.I. Moreno, N.L.V. Calazans, and F.G. Moraes. with respect to the ANL and EC. The lowest ANL and EC Evaluation of algorithms for low energy mapping onto nocs. In Proc. are achieved by WNAD+Tree which is 18% lower compared IEEE International Symposium on Circuits and Systems ISCAS 2007, pages 389–392, 2007. with the worst case under BSh+GI strategy. [13] Srinivasan Murali, Martijn Coenen, Andrei Radulescu, Kees Goossens, and Giovanni De Micheli. Mapping and configuration methods for multi- V. C ONCLUSION use-case networks on chips. In ASP-DAC ’06: Proceedings of the 2006 An innovative method for multiple applications mapping on Asia and South Pacific Design Automation Conference, pages 146–151, Piscataway, NJ, USA, 2006. IEEE Press. the future many-core NoC is proposed. The two-step mapping [14] University of Catania. Noxim. http://www.noxim.org/. method first finds a region on the NoC for a given application [15] Jaswinder Pal Singh, Anoop Gupta, Moriyoshi Ohara, Evan Torrie, and and then maps all tasks of the application into the region. Steven Cameron Woo. The splash-2 programs: Characterization and methodological considerations. Computer Architecture, International Several strategies based on the MER technique, e.g. BS, Symposium on, 0:24, 1995. BSh and WNAD are introduced for the application mapping. [16] TPC. Tpc-h. http://www.tpc.org/tpch/. By using these strategies, the algorithm can efficiently find [17] S. R. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, the optimal objective MER to map the target application. Y. Hoskote, N. Borkar, and S. Borkar. An 80-tile sub-100-w teraflops Following the application mapping, a tree-model based algo- processor in 65-nm cmos. 
Solid-State Circuits, IEEE Journal of, rithm is proposed for the task mapping and compared against 43(1):29–41, 2008. [18] Bo Yang, Thomas Canhao Xu, Tero Santti, and Juha Plosila. Tree-model an existing GI algorithm. The experiment shows that in a based mapping for energy-efficient and low-latency network-on-chip. In common case, the MER with minimal aspect ratio is ideal for Design and Diagnostics of Electronic Circuits and Systems (DDECS), mapping a given application. Among the proposed strategies pages 189 –192, 14-16 2010.
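
Illustrative sketch. The evaluation observes that an objective MER close to a square gives the lowest ACD. The short Python sketch below illustrates that intuition under stated assumptions: A(R) is taken to be the long side divided by the short side, and the mean Manhattan distance over all distinct core pairs is used as a simple stand-in for the ACD. Neither is claimed to be the exact definition used in this paper, so the printed values are not expected to match the reported ACD figures. The function names and the candidate rectangles are hypothetical.

```python
from itertools import combinations, product


def aspect_ratio(width, height):
    """Assumed definition of A(R): long side / short side (1.0 is a square)."""
    return max(width, height) / min(width, height)


def mean_pairwise_distance(width, height):
    """Mean Manhattan distance over all distinct core pairs in a region.

    Used here only as a stand-in for the ACD; the paper's metric may differ.
    """
    cores = list(product(range(width), range(height)))
    pairs = list(combinations(cores, 2))
    total = sum(abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in pairs)
    return total / len(pairs)


def pick_region(candidates, cores_needed):
    """Return the feasible (width, height) with the smallest aspect ratio."""
    feasible = [(w, h) for w, h in candidates if w * h >= cores_needed]
    return min(feasible, key=lambda r: aspect_ratio(*r), default=None)


if __name__ == "__main__":
    # Hypothetical empty rectangles that could each host a 16-core application.
    candidates = [(16, 1), (8, 2), (4, 4)]
    for w, h in candidates:
        print(f"{w}x{h}: A(R)={aspect_ratio(w, h):.2f}, "
              f"mean distance={mean_pairwise_distance(w, h):.2f}")
    print("chosen:", pick_region(candidates, cores_needed=16))
```

In this toy setting the 4x4 candidate has both the smallest aspect ratio and the smallest mean distance, which matches the trend reported for the BS, BSh and WNAD strategies when a monolithic, near-square MER is available.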