SlideShare a Scribd company logo
1 of 4
Download to read offline
Analysis of Modeling Styles on Network-on-Chip
                    Simulation
                                 Lasse Lehtonen, Erno Salminen, Timo D. Hämäläinen
                           Tampere University of Technology Department of Computer Systems,
                                       P.O.Box 553, FIN-33101 Tampere, Finland
                                                   lasse.lehtonen@tut.fi


   Abstract—This paper analyses the effects of Network-on-Chip                        II. R ELATED W ORK
(NoC) models written in SystemC on simulation speed. Two
Register Transfer Level (RTL) models and Approximately Timed         Simulation speed of hardware models can be accelerated
(AT) and Loosely Timed (LT) Transaction Level (TL) models are     in many ways. There are multiple languages to choose from
compared against reference RTL VHDL 2D mesh model. Three          for describing hardware models, most used being VHDL,
different mesh sizes are evaluated using a commercial simulator
and OSCI SystemC reference kernel. Studied AT model achieved      Verilog, SystemVerilog and SystemC. According to [1] the
13-40x speedup with modest 10% estimation error.                  choise of language and tools have significant impact on
                                                                  simulation speed. Most commercial simulators were found
                     I. I NTRODUCTION                             to be optimized for one language only and having usually
   Today’s System-on-Chips (SoCs) consists of many Intellec-      more than 1.8x difference in simulation times for tool’s slower
tual Property (IP) components, such as Processing Elements        languages. In one extreme case a 10x difference in simulation
(PEs), memories and Network-on-Chip (NoC) components              time was measured between two commercial simulators for
executing complex applications.                                   the same model.
   As SoCs keep growing larger the design space grows even           Thorough evaluation of VHDL, Verilog, SystemVerilog and
more rapidly. Larger design space implies longer simulations      SystemC implemented with different abstraction levels and
and more simulation runs for Design Space Exploration (DSE)       data types was performed in [1] using three commercially
to find near optimum scheduling, configurations and mappings        available simulators and gcc compiler. Verilog was found
for IP blocks and the NoC connecting them.                        to be 2x faster than VHDL and SystemC 10x slower on
   Simulation performance is one of the key issues in verifica-    average for RTL models. SystemC TLM was 2.6x slower
tion and DSE. Fast and accurate estimates are required early      than SystemVerilog TLM Programmer’s View (PV) model.
in products design process to provide information of important    SystemVerilog was fastest to simulate in all abstraction levels
factors, such as performance, area and power consumption [3].     on average.
   Simulation performance is a sum of many different factors,        Speedup for simulation times can be gained through model
such as the choise of modeling language, simulator tool,          optimization, transformation and reduction [5]. Model opti-
abstraction level and the quality of the description.             mization is a straightforward process of removing unneces-
   One widely used method to improve simulation perfor-           sary or replacing inefficient code constructs with more effi-
mance is to raise the abstraction level. Register Transfer        cient ones for the used simulator without modifying model’s
Level (RTL) models have to describe every signal and their        structure. Model transformation means changing language
behaviour between clock edges to be synthesizeable which          constructs into functionally equivalent ones offering better
causes significant overhead when such a precision is not           performance in simulation time. Model reduction comprises
needed. Transaction Level (TL) modeling abstracts detailed        the process of removing all unrelevant parts of the model for
signaling from communication events using abstract function       a specific simulation run.
calls to annotate only the significant signaling events.              In [5] simulation performance was reported to improve
   Transaction Level Models (TLMs) can be modelled using          significantly by altering the source code. Simulation speedup
different abstraction levels. Untimed models offer best speedup   measured compilation times included was usually in range
but lack timing annotation totally. On the other hand cycle       from 1.1x to 10x depending on the used technique. Although
accurate models annotate time every clock cycle but have          these measurements were done about ten years ago, measure-
considerably smaller speedup in general.                          ments in [1] implies that commercial simulators don’t perform
   This paper presents a study regarding modeling styles for      all transformation optimizations automatically or effectively.
NoC models consentrating on the raise of abstraction level.       Hierarchically modelled designs were found in [1] to be on
Results show that the implemented Approximately Timed (AT)        average 1.5x slower than their flat counterparts and different
TLM 2.0 model offers a notable speedup (ranging from 13x          value types affecting the simulation time with up to 4x factor.
to 40x) with modest latency estimation error (under 10% on           TLM modeling techniques have been studied widely. For
average).                                                         example Proteo NoC architecture [10] used VHDL to construct




978-1-4244-8971-8/10$26.00 c 2010 IEEE
higher level model achieving an order of magnitude speedup       Router components instantiate one process and a priority
for a 16 node NoC.                                               queue per incoming link. Model was calibrated for 4 word
   Shirner et al. [9] created two TLM models for AMBA            packets and the error in average latencies was under 10%.
bus and compared them against synthesizeable Bus Functional         Second, Loosely Timed (LT) TLM (TLM LT) model was
Model version. More abstract TLM model was four orders of        made using same network interface models but abstracting all
magnitude faster with error up to 45% and the more accurate      routers using only one SystemC process. This implementation
TLM reached two order of magnitude speedup with an error         was left nearly untimed as the main interest was in simulation
of 35% in worst cases.                                           performance. A method to estimate latencies in one process
   Another research to speed up shared bus architecture ex-      for wormhole switched mesh topology was described in [4].
plorations in [7] used abstraction level between TLM and         Estimation error depended linearly on the network utilization.
Bus Cycle Accurate (BCA) to create accurate models offering      They measured 45% deviation in average latencies for satura-
a 1.55x speedup. OSCI (Open SystemC Initiative) TLM 2.0          tion load but on lower loads the estimation error was in more
based methodology to create BCA shared bus protocol models       acceptable ranges for DSE.
that retain the same level of accuracy as RTL models was
                                                                 B. Transaction Generator
introduced in [11]. Their BCA model was measured to be
between one and two orders of magnitude faster than RTL             Transaction Generator (TG) [8], a freely available SystemC
model.                                                           traffic generator for NoC benchmarking and DSE was used to
                                                                 generate traffic for NoC models. TG uses abstract application
              III. A NALYSIS E NVIRONMENT                        and platform models to mimic traffic patterns captured from
   Analysis is performed for 2-dimensional mesh topology         real data flow applications (Figure 4). It’s implemented using
(Figure 1). It uses 6 word fifos on links, wormhole switching     SystemC events and waits.
and XY routing with fixed priority arbitration. This paper           Network traffic created by TG is modelled in Transaction
focuses on using SystemC in three different abstraction levels   Level using a simple custom interface for RTL NoC models
to gain better performance in simulation speed compared to       and OSCI TLM 2.0 function calls with generic payload for
the reference synthesizeable RTL VHDL model.                     TLM models.




               Figure 1: Mesh 2D NoC topology


A. Network-on-Chip Models
   2D Mesh was implemented in synthesizeable RTL VHDL
and SystemC. Two SystemC RTL models were implemented
using VHDL coding style following the same structure as
the reference VHDL model. First RTL version used 4-state
logic and second 2-state logic. In addition, two SystemC TLM
versions were created using OSCI TLM 2.0 [6] function calls
                                                                 Figure 4: Transaction Generator uses abstract application and pro-
with generic payload.                                            cessing element models to mimic the computation of real SoCs to
                                                                 create accurate traffic


                                                                 C. Measurement
                                                                    All measurements were executed on a modern workstation
                                                                 with E8400 dual core processor, 3 GB RAM and 32-bit
    Figure 2: Conseptual difference between RTL and TLM.         Windows XP. Mixed-language simulation was run with a com-
                                                                 mercial simulator. SystemC-only simulations were run with
   First model (called TLM AT) was written in Approximately      both the commercial simulator and OSCI SystemC reference
Timed (AT) coding style using two timing points to annotate      kernel compiled with GNU gcc in Cygwin. Optimization level
the start and end of transactions (Figure 2). Two SystemC        -O3 was used with gcc and similar optimization were enabled
processes and queues were used to model network interfaces.      for the commercial simulator. No signal trace was gathered
4500                                                                                       4500                                                                                  4500
                                                                     gcc RTL 4-state
                                    4x4                              gcc RTL 2-state
                                                                                                                               6x6                                                                                   8x8
                          4000                                                                                       4000                                                                                  4000
                                                                     gcc TLM AT
                          3500                                       VHDL                                            3500                                                                                  3500
                                                                     RTL 4-state




                                                                                           Simulation duration [s]




                                                                                                                                                                                 Simulation duration [s]
Simulation duration [s]




                          3000                                       RTL 2-state                                     3000                                                                                  3000
                                                                                                                                                                                                                                               RTL
                                                                     TLM AT
                          2500                                       TLM LT                                          2500                                                                                  2500
                                                                     gcc TLM LT
                          2000                                                                                       2000                                                                                  2000


                          1500                                                                                       1500                                                                                  1500


                          1000                                                                                       1000                                                                                  1000


                           500                                                                                        500                                                                                   500
                                                                                                                                                                                                                                              TLM
                             0                                                                                          0                                                                                     0
                                 0.00      0.05        0.10         0.15         0.20                                       0.00     0.05         0.10         0.15       0.20                                    0.00     0.05        0.10          0.15      0.20
                                           Offered load [flits/agents/cycle]                                                          Offered load [flits/agents/cycle]                                                    Offered load [flits/agents/cycle]

Figure 3: Simulation times for 4x4, 6x6 and 8x8 meshes (4 word payload). Time increases with the amount of data sent. The difference
between RTL and TLMs is clearly seen. Gcc compiled OSCI reference kernel simulations are marked with gcc prefix and dashed lines.


and commercial simulator was run on console mode during                                                                                                     Due to the fact that TLM model processes activate only
simulations.                                                                                                                                             when transaction begins and ends a great increase in simula-
   Frequency of 50 MHz was used for NoC and processing                                                                                                   tion speed was achieved with bigger transfer sizes.
elements and simulations were run for 100 ms. A deterministic                                                                                               Simulations with gcc compiled OSCI reference SystemC
all-to-all traffic pattern was used generating the load.                                                                                                  kernel were on average 4.5x faster for 20 word transfers and
   Measured simulation times are for whole program so the                                                                                                3.5x faster for the commercial simulator when compared to 4
speedups shown are not only for the NoC models. Measure-                                                                                                 word transfers on same load conditions. Increasing size to 40
ments were performed more than twice to exclude variations                                                                                               words brought 8.5x speedup for gcc and 6.5x speedup for the
in workload affecting results. Table I lists the parameters that                                                                                         commercial simulator.
were varied on different load conditions.                                                                                                                   When compared to 6x6 VHDL RTL model with 40 word
                                          Table I: Summary of measured parameters                                                                        transfers the AT model offered on 30x-35x speedup and the
                                                                                                                                                         one process model 55x-60x speedup (figure 5) on average.
                 Simulator                      NoC Size             Tx Size              NoC Model
                 Commercial                   4x4, 6x6, 8x8         4, 20, 40           3 RTL, 2 TLM                                                     With 20 word transfers same speedups were 22x-28x and 40x-
                 OSCI                         4x4, 6x6, 8x8         4, 20, 40           2 RTL, 2 TLM                                                     45x.


                                                              IV. R ESULTS
A. Size of the NoC
   NoC sizes of 16, 36 and 64 routers were measured. Size
of the NoC affects nearly linearly the simulation time. With
the used traffic pattern the simulation speed for RTL models
decreases more than the increase in the number of mesh nodes
would imply. This is because in all-to-all traffic pattern on
bigger mesh means on average more hops to target and thus
more cycles to simulate for the same offered load. 6x6 mesh
simulated 2.6x and 8x8 5.4x slower than 4x4 mesh on average
as shown in figure 3.                                                                                                                                     Figure 5: TLM Speedups compared to RTL VHDL (40 word payload,
   TLM models scaled better. On average 6x6 mesh TLM                                                                                                     6x6 mesh)
models were 2.2x and 8x8 mesh models 4.6x slower than 4x4
mesh models.
B. Size of the Transfer
                                                                                                                                                         C. SystemC Logic Levels
  Transfer legths depends on the application and the speedup
grows with the transfer length for both the RTL and TLM                                                                                                    Difference in simulation times between 4-state and 2-state
models. By using 20 word transfers SystemC RTL simulation                                                                                                SystemC logic values was small. Gcc simulations were ap-
was approximately 2x faster on same load levels than with 4                                                                                              proximately 7% faster using 2-state values on all measure-
word payload. VHDL model was 2.5x faster. SystemC RTL                                                                                                    ments except cases with very low utilization. With commercial
was 2.3x faster when the size was increased to 40 words and                                                                                              simulator the difference was even smaller ranging from 5% to
VHDL 3.2x faster.                                                                                                                                        2% depending on network utilization.
D. SystemC Models                                                     TLM models offered a significant increase in simulation
   For RTL models simulated with the commercial simula-            performance. For Approximately Timed (AT) model speedup
tor only small difference was between VHDL and SystemC             ranged from 13x to 40x. Error on estimated latencies proved to
performance. Largest measured difference in times was 16%.         be sufficient for DSE on most simulation scenarios analysed.
SystemC gcc version using 4-state values simulated 10-30%             The more abstract Loosely Timed TLM model was approxi-
and 2-state version 13-50% faster than VHDL depending on           mately 2x faster than AT model to simulate but if implemented
the network size and utilization.                                  like in [4] the significant increase in latency estimation errors
   Figure 6 shows TLM speedups over RTL VHDL model                 brings doubts to the useability of this approach especially
measured with the commercial simulator. Speedup was signif-        under heavier loads.
icantly larger for bigger meshes on low utilization scenarios.        Based on the study we recommend using AT coding style
When mesh utilization got closer to saturation the difference in   over LT as the estimation error was small and LT didn’t offer
speedup between mesh sizes decreased. Approximately timed          significant speedup compared to AT.
model was 13x-15x and loosely timed model 20x-30x faster
                                                                   Table II: Summary of SystemC speedups over VHDL in near satu-
on near saturation load. With gcc compiled version TLM             ration load conditions
speedups followed the same trend but were on average 1.5x
                                                                    Scenario           4x4 Mesh          6x6 Mesh          8x8 Mesh
slower.
                                                                    4 word, RTL        1.1               1.4               1.5
                                                                    20 word, RTL       1.2               1.4               1.6
                                                                    40 words, RTL      1.1               1.3               1.5
                                                                    4 word, AT         13                13                14
                                                                    20 word, AT        20                28                28
                                                                    40 words, AT       25                37                40
                                                                    4 word, LT         23                23                26
                                                                    20 word, LT        30                50                61
                                                                    40 words, LT       33                60                81


                                                                                                R EFERENCES
                                                                    [1] W. Ecker, V. Esen, L. Schonberg, T. Steininger, M. Velten, and M. Hull,
                                                                        “Impact of Description Language, Abstraction Layer, and Value Rep-
                                                                        resentation on Simulation Performance,” Design, Automation Test in
                                                                        Europe Conference Exhibition, 2007. DATE ’07., pp. 1–6, apr. 2007.
                                                                    [2] P. Ezudheen, P. Chandran, J. Chandra, B. Simon, and D. Ravi, “Par-
                                                                        allelizing SystemC Kernel for Fast Hardware Simulation on SMP
Figure 6: TLM Speedups compared to RTL VHDL (4 word payload)            Machines,” Principles of Advanced and Distributed Simulation, 2009.
                                                                        PADS ’09, pp. 80–87, jun. 2009.
                                                                    [3] M. Gries, “Methods for Evaluating and Covering the Design Space
                                                                        during Early Design Development,” Integration, the VLSI Journal,
E. SystemC Implementations                                              vol. 38, no. 2, pp. 131–183, 2004.
                                                                    [4] A. Kohler and M. Radetzki, “A SystemC TLM2 model of communica-
   SystemC simulation kernel’s and TLM 2.0 library imple-               tion in wormhole switched Networks-On-Chip,” Forum on Specification
mentation naturally affects simulation speed. Both the OSCI             Design Languages, 2009. FDL 2009, pp. 1–4, sep. 2009.
reference kernel and the commercial one used in these simu-         [5] A. Morawiec and J. Mermet, “Techniques for Improving the HDL
                                                                        Simulation Performance,” 1999.
lations were single threaded. Alternatives such as the parallel     [6] Open SystemC Initiative OSCI, “SystemC Documentation.” [Online].
SystemC kernel presented in [2] could be used to speed up               Available: http://www.systemc.org
SystemC simulations.                                                [7] S. Pasricha, M. Ben-romdhane, and N. Dutt, “High Level Design
                                                                        Space Exploration of Shared Bus Communication Architectures,” CECS
   Gcc compiled SystemC RTL was approximately 1.5x faster               Technical Report, 2004.
than the commercial simulator with 4 word packet sizes.             [8] E. Salminen, C. Grecu, T. Hämälainen, and A. Ivanov, “Application
Commercial tool on the other hand had faster implementation             modelling and hardware description for network-on-chip benchmarking,”
                                                                        Computers Digital Techniques, IET, vol. 3, no. 5, pp. 539–550, sept.
of OSCI TLM 2.0 library and executed nearly 1.5x faster than            2009.
gcc compiled version. On larger packet sizes the difference         [9] G. Schirner and R. Domer, “Quantitative Analysis of Transaction Level
between used simulators was smaller as seen in Figure 5.                Models for the AMBA Bus,” Design, Automation and Test in Europe,
                                                                        2006. DATE ’06., vol. 1, pp. 1–6, mar. 2006.
                      V. C ONCLUSIONS                              [10] D. Siguenza-Tortosa and J. Nurmi, “VHDL-based simulation environ-
                                                                        ment for Proteo NoC,” High-Level Design Validation and Test Workshop,
   A simulation performance of SystemC 2D mesh NoC mod-                 2002., pp. 1–6, oct. 2002.
elled in three abstraction levels was compared against the         [11] H. van Moll, H. Corporaal, V. Reyes, and M. Boonen, “Fast and accurate
                                                                        protocol specific bus modeling using TLM 2.0,” Design, Automation Test
reference sythesizeable RTL VHDL model.                                 in Europe Conference Exhibition, 2009. DATE ’09., pp. 316–319, apr.
   Table II summarises the results under near saturation load.          2009.
RTL results are for gcc and TLM for the commercial simulator.
   The commercial simulator used was slower for RTL level
but faster for TLM. Gcc SystemC on RT level was measured
to be 1.5x faster than the VHDL model.

More Related Content

What's hot

Exploring NISC Architectures for Matrix Application
Exploring NISC Architectures for Matrix ApplicationExploring NISC Architectures for Matrix Application
Exploring NISC Architectures for Matrix ApplicationIDES Editor
 
Functional Verification of Large-integers Circuits using a Cosimulation-base...
Functional Verification of Large-integers Circuits using a  Cosimulation-base...Functional Verification of Large-integers Circuits using a  Cosimulation-base...
Functional Verification of Large-integers Circuits using a Cosimulation-base...IJECEIAES
 
Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...Maria Stylianou
 
Concurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsConcurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsCSCJournals
 
program partitioning and scheduling IN Advanced Computer Architecture
program partitioning and scheduling  IN Advanced Computer Architectureprogram partitioning and scheduling  IN Advanced Computer Architecture
program partitioning and scheduling IN Advanced Computer ArchitecturePankaj Kumar Jain
 
Master thesis Francesco Serafin
Master thesis Francesco SerafinMaster thesis Francesco Serafin
Master thesis Francesco SerafinRiccardo Rigon
 
Simulation of Wireless Communication Systems
Simulation of Wireless Communication SystemsSimulation of Wireless Communication Systems
Simulation of Wireless Communication SystemsBernd-Peter Paris
 
Fpga based efficient multiplier for image processing applications using recur...
Fpga based efficient multiplier for image processing applications using recur...Fpga based efficient multiplier for image processing applications using recur...
Fpga based efficient multiplier for image processing applications using recur...VLSICS Design
 
Parallel Computing 2007: Overview
Parallel Computing 2007: OverviewParallel Computing 2007: Overview
Parallel Computing 2007: OverviewGeoffrey Fox
 
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarksVauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarksMumtaz Hannah Vauhkonen
 
Performance measures
Performance measuresPerformance measures
Performance measuresDivya Tiwari
 
MATLAB and Simulink for Communications System Design (Design Conference 2013)
MATLAB and Simulink for Communications System Design (Design Conference 2013)MATLAB and Simulink for Communications System Design (Design Conference 2013)
MATLAB and Simulink for Communications System Design (Design Conference 2013)Analog Devices, Inc.
 
Advanced computer architecture unit 5
Advanced computer architecture  unit 5Advanced computer architecture  unit 5
Advanced computer architecture unit 5Kunal Bangar
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)ijceronline
 
Programming using MPI and OpenMP
Programming using MPI and OpenMPProgramming using MPI and OpenMP
Programming using MPI and OpenMPDivya Tiwari
 
ANALOG MODELING OF RECURSIVE ESTIMATOR DESIGN WITH FILTER DESIGN MODEL
ANALOG MODELING OF RECURSIVE ESTIMATOR DESIGN WITH FILTER DESIGN MODELANALOG MODELING OF RECURSIVE ESTIMATOR DESIGN WITH FILTER DESIGN MODEL
ANALOG MODELING OF RECURSIVE ESTIMATOR DESIGN WITH FILTER DESIGN MODELVLSICS Design
 
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...waqarnabi
 

What's hot (20)

Exploring NISC Architectures for Matrix Application
Exploring NISC Architectures for Matrix ApplicationExploring NISC Architectures for Matrix Application
Exploring NISC Architectures for Matrix Application
 
Model checking
Model checkingModel checking
Model checking
 
Functional Verification of Large-integers Circuits using a Cosimulation-base...
Functional Verification of Large-integers Circuits using a  Cosimulation-base...Functional Verification of Large-integers Circuits using a  Cosimulation-base...
Functional Verification of Large-integers Circuits using a Cosimulation-base...
 
Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...Performance Analysis of multithreaded applications based on Hardware Simulati...
Performance Analysis of multithreaded applications based on Hardware Simulati...
 
Concurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core ProcessorsConcurrent Matrix Multiplication on Multi-core Processors
Concurrent Matrix Multiplication on Multi-core Processors
 
program partitioning and scheduling IN Advanced Computer Architecture
program partitioning and scheduling  IN Advanced Computer Architectureprogram partitioning and scheduling  IN Advanced Computer Architecture
program partitioning and scheduling IN Advanced Computer Architecture
 
Master thesis Francesco Serafin
Master thesis Francesco SerafinMaster thesis Francesco Serafin
Master thesis Francesco Serafin
 
Simulation of Wireless Communication Systems
Simulation of Wireless Communication SystemsSimulation of Wireless Communication Systems
Simulation of Wireless Communication Systems
 
EEDC Programming Models
EEDC Programming ModelsEEDC Programming Models
EEDC Programming Models
 
Fpga based efficient multiplier for image processing applications using recur...
Fpga based efficient multiplier for image processing applications using recur...Fpga based efficient multiplier for image processing applications using recur...
Fpga based efficient multiplier for image processing applications using recur...
 
Machine Learning @NECST
Machine Learning @NECSTMachine Learning @NECST
Machine Learning @NECST
 
Parallel Computing 2007: Overview
Parallel Computing 2007: OverviewParallel Computing 2007: Overview
Parallel Computing 2007: Overview
 
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarksVauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
VauhkonenVohraMadaan-ProjectDeepLearningBenchMarks
 
Performance measures
Performance measuresPerformance measures
Performance measures
 
MATLAB and Simulink for Communications System Design (Design Conference 2013)
MATLAB and Simulink for Communications System Design (Design Conference 2013)MATLAB and Simulink for Communications System Design (Design Conference 2013)
MATLAB and Simulink for Communications System Design (Design Conference 2013)
 
Advanced computer architecture unit 5
Advanced computer architecture  unit 5Advanced computer architecture  unit 5
Advanced computer architecture unit 5
 
International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
Programming using MPI and OpenMP
Programming using MPI and OpenMPProgramming using MPI and OpenMP
Programming using MPI and OpenMP
 
ANALOG MODELING OF RECURSIVE ESTIMATOR DESIGN WITH FILTER DESIGN MODEL
ANALOG MODELING OF RECURSIVE ESTIMATOR DESIGN WITH FILTER DESIGN MODELANALOG MODELING OF RECURSIVE ESTIMATOR DESIGN WITH FILTER DESIGN MODEL
ANALOG MODELING OF RECURSIVE ESTIMATOR DESIGN WITH FILTER DESIGN MODEL
 
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
A Fast and Accurate Cost Model for FPGA Design Space Exploration in HPC Appli...
 

Viewers also liked (9)

Academia ativa
Academia ativaAcademia ativa
Academia ativa
 
Misión y Visión
Misión y VisiónMisión y Visión
Misión y Visión
 
O colegialjaneiro 60
O colegialjaneiro 60O colegialjaneiro 60
O colegialjaneiro 60
 
O colegial agosto certo
O colegial agosto certoO colegial agosto certo
O colegial agosto certo
 
Misión y Visión
Misión y VisiónMisión y Visión
Misión y Visión
 
O colegial agosto certo
O colegial agosto certoO colegial agosto certo
O colegial agosto certo
 
O colegial janeiro 60
O colegial janeiro 60O colegial janeiro 60
O colegial janeiro 60
 
Proyecto de la nación
Proyecto de la naciónProyecto de la nación
Proyecto de la nación
 
Apostila Visualg
Apostila VisualgApostila Visualg
Apostila Visualg
 

Similar to 60

Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...NECST Lab @ Politecnico di Milano
 
Presentation on Behavioral Synthesis & SystemC
Presentation on Behavioral Synthesis & SystemCPresentation on Behavioral Synthesis & SystemC
Presentation on Behavioral Synthesis & SystemCMukit Ahmed Chowdhury
 
Co emulation of scan-chain based designs
Co emulation of scan-chain based designsCo emulation of scan-chain based designs
Co emulation of scan-chain based designsijcsit
 
Pretzel: optimized Machine Learning framework for low-latency and high throu...
Pretzel: optimized Machine Learning framework for  low-latency and high throu...Pretzel: optimized Machine Learning framework for  low-latency and high throu...
Pretzel: optimized Machine Learning framework for low-latency and high throu...NECST Lab @ Politecnico di Milano
 
System on Chip Design and Modelling Dr. David J Greaves
System on Chip Design and Modelling   Dr. David J GreavesSystem on Chip Design and Modelling   Dr. David J Greaves
System on Chip Design and Modelling Dr. David J GreavesSatya Harish
 
RTI-CODES+ISSS-2012-Submission-1
RTI-CODES+ISSS-2012-Submission-1RTI-CODES+ISSS-2012-Submission-1
RTI-CODES+ISSS-2012-Submission-1Serge Amougou
 
Verilog Ams Used In Top Down Methodology For Wireless Integrated Circuits
Verilog Ams Used In Top Down Methodology For Wireless Integrated CircuitsVerilog Ams Used In Top Down Methodology For Wireless Integrated Circuits
Verilog Ams Used In Top Down Methodology For Wireless Integrated CircuitsRégis SANTONJA
 
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...ijceronline
 
Csit77404
Csit77404Csit77404
Csit77404csandit
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...NECST Lab @ Politecnico di Milano
 
A tlm based platform to specify and verify component-based real-time systems
A tlm based platform to specify and verify component-based real-time systemsA tlm based platform to specify and verify component-based real-time systems
A tlm based platform to specify and verify component-based real-time systemsijseajournal
 
The embedded systems Model
The embedded systems ModelThe embedded systems Model
The embedded systems ModelAJAL A J
 
Presentation date -20-nov-2012 for prof. chen
Presentation date -20-nov-2012 for prof. chenPresentation date -20-nov-2012 for prof. chen
Presentation date -20-nov-2012 for prof. chenTAIWAN
 
Art%3 a10.1186%2f1687 6180-2011-29
Art%3 a10.1186%2f1687 6180-2011-29Art%3 a10.1186%2f1687 6180-2011-29
Art%3 a10.1186%2f1687 6180-2011-29Ishtiaq Ahmad
 
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsMachine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsLightbend
 
DOUBLE PRECISION FLOATING POINT CORE IN VERILOG
DOUBLE PRECISION FLOATING POINT CORE IN VERILOGDOUBLE PRECISION FLOATING POINT CORE IN VERILOG
DOUBLE PRECISION FLOATING POINT CORE IN VERILOGIJCI JOURNAL
 
Developing Real-Time Systems on Application Processors
Developing Real-Time Systems on Application ProcessorsDeveloping Real-Time Systems on Application Processors
Developing Real-Time Systems on Application ProcessorsToradex
 

Similar to 60 (20)

Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...
 
Presentation on Behavioral Synthesis & SystemC
Presentation on Behavioral Synthesis & SystemCPresentation on Behavioral Synthesis & SystemC
Presentation on Behavioral Synthesis & SystemC
 
Co emulation of scan-chain based designs
Co emulation of scan-chain based designsCo emulation of scan-chain based designs
Co emulation of scan-chain based designs
 
Pretzel: optimized Machine Learning framework for low-latency and high throu...
Pretzel: optimized Machine Learning framework for  low-latency and high throu...Pretzel: optimized Machine Learning framework for  low-latency and high throu...
Pretzel: optimized Machine Learning framework for low-latency and high throu...
 
System on Chip Design and Modelling Dr. David J Greaves
System on Chip Design and Modelling   Dr. David J GreavesSystem on Chip Design and Modelling   Dr. David J Greaves
System on Chip Design and Modelling Dr. David J Greaves
 
RTI-CODES+ISSS-2012-Submission-1
RTI-CODES+ISSS-2012-Submission-1RTI-CODES+ISSS-2012-Submission-1
RTI-CODES+ISSS-2012-Submission-1
 
Modeling and Real-Time Simulation of Induction Motor Using RT-LAB
Modeling and Real-Time Simulation of Induction Motor Using RT-LABModeling and Real-Time Simulation of Induction Motor Using RT-LAB
Modeling and Real-Time Simulation of Induction Motor Using RT-LAB
 
Verilog Ams Used In Top Down Methodology For Wireless Integrated Circuits
Verilog Ams Used In Top Down Methodology For Wireless Integrated CircuitsVerilog Ams Used In Top Down Methodology For Wireless Integrated Circuits
Verilog Ams Used In Top Down Methodology For Wireless Integrated Circuits
 
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
Matlab Based High Level Synthesis Engine for Area And Power Efficient Arithme...
 
Csit77404
Csit77404Csit77404
Csit77404
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...
 
Lear unified env_paper-1
Lear unified env_paper-1Lear unified env_paper-1
Lear unified env_paper-1
 
A tlm based platform to specify and verify component-based real-time systems
A tlm based platform to specify and verify component-based real-time systemsA tlm based platform to specify and verify component-based real-time systems
A tlm based platform to specify and verify component-based real-time systems
 
The embedded systems Model
The embedded systems ModelThe embedded systems Model
The embedded systems Model
 
Presentation date -20-nov-2012 for prof. chen
Presentation date -20-nov-2012 for prof. chenPresentation date -20-nov-2012 for prof. chen
Presentation date -20-nov-2012 for prof. chen
 
Art%3 a10.1186%2f1687 6180-2011-29
Art%3 a10.1186%2f1687 6180-2011-29Art%3 a10.1186%2f1687 6180-2011-29
Art%3 a10.1186%2f1687 6180-2011-29
 
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data StreamsMachine Learning At Speed: Operationalizing ML For Real-Time Data Streams
Machine Learning At Speed: Operationalizing ML For Real-Time Data Streams
 
DOUBLE PRECISION FLOATING POINT CORE IN VERILOG
DOUBLE PRECISION FLOATING POINT CORE IN VERILOGDOUBLE PRECISION FLOATING POINT CORE IN VERILOG
DOUBLE PRECISION FLOATING POINT CORE IN VERILOG
 
10 3
10 310 3
10 3
 
Developing Real-Time Systems on Application Processors
Developing Real-Time Systems on Application ProcessorsDeveloping Real-Time Systems on Application Processors
Developing Real-Time Systems on Application Processors
 

More from srimoorthi (20)

94
9494
94
 
87
8787
87
 
84
8484
84
 
83
8383
83
 
82
8282
82
 
75
7575
75
 
73
7373
73
 
72
7272
72
 
70
7070
70
 
69
6969
69
 
68
6868
68
 
63
6363
63
 
62
6262
62
 
61
6161
61
 
59
5959
59
 
57
5757
57
 
56
5656
56
 
50
5050
50
 
55
5555
55
 
52
5252
52
 

60

  • 1. Analysis of Modeling Styles on Network-on-Chip Simulation Lasse Lehtonen, Erno Salminen, Timo D. Hämäläinen Tampere University of Technology Department of Computer Systems, P.O.Box 553, FIN-33101 Tampere, Finland lasse.lehtonen@tut.fi Abstract—This paper analyses the effects of Network-on-Chip II. R ELATED W ORK (NoC) models written in SystemC on simulation speed. Two Register Transfer Level (RTL) models and Approximately Timed Simulation speed of hardware models can be accelerated (AT) and Loosely Timed (LT) Transaction Level (TL) models are in many ways. There are multiple languages to choose from compared against reference RTL VHDL 2D mesh model. Three for describing hardware models, most used being VHDL, different mesh sizes are evaluated using a commercial simulator and OSCI SystemC reference kernel. Studied AT model achieved Verilog, SystemVerilog and SystemC. According to [1] the 13-40x speedup with modest 10% estimation error. choise of language and tools have significant impact on simulation speed. Most commercial simulators were found I. I NTRODUCTION to be optimized for one language only and having usually Today’s System-on-Chips (SoCs) consists of many Intellec- more than 1.8x difference in simulation times for tool’s slower tual Property (IP) components, such as Processing Elements languages. In one extreme case a 10x difference in simulation (PEs), memories and Network-on-Chip (NoC) components time was measured between two commercial simulators for executing complex applications. the same model. As SoCs keep growing larger the design space grows even Thorough evaluation of VHDL, Verilog, SystemVerilog and more rapidly. Larger design space implies longer simulations SystemC implemented with different abstraction levels and and more simulation runs for Design Space Exploration (DSE) data types was performed in [1] using three commercially to find near optimum scheduling, configurations and mappings available simulators and gcc compiler. Verilog was found for IP blocks and the NoC connecting them. to be 2x faster than VHDL and SystemC 10x slower on Simulation performance is one of the key issues in verifica- average for RTL models. SystemC TLM was 2.6x slower tion and DSE. Fast and accurate estimates are required early than SystemVerilog TLM Programmer’s View (PV) model. in products design process to provide information of important SystemVerilog was fastest to simulate in all abstraction levels factors, such as performance, area and power consumption [3]. on average. Simulation performance is a sum of many different factors, Speedup for simulation times can be gained through model such as the choise of modeling language, simulator tool, optimization, transformation and reduction [5]. Model opti- abstraction level and the quality of the description. mization is a straightforward process of removing unneces- One widely used method to improve simulation perfor- sary or replacing inefficient code constructs with more effi- mance is to raise the abstraction level. Register Transfer cient ones for the used simulator without modifying model’s Level (RTL) models have to describe every signal and their structure. Model transformation means changing language behaviour between clock edges to be synthesizeable which constructs into functionally equivalent ones offering better causes significant overhead when such a precision is not performance in simulation time. Model reduction comprises needed. Transaction Level (TL) modeling abstracts detailed the process of removing all unrelevant parts of the model for signaling from communication events using abstract function a specific simulation run. calls to annotate only the significant signaling events. In [5] simulation performance was reported to improve Transaction Level Models (TLMs) can be modelled using significantly by altering the source code. Simulation speedup different abstraction levels. Untimed models offer best speedup measured compilation times included was usually in range but lack timing annotation totally. On the other hand cycle from 1.1x to 10x depending on the used technique. Although accurate models annotate time every clock cycle but have these measurements were done about ten years ago, measure- considerably smaller speedup in general. ments in [1] implies that commercial simulators don’t perform This paper presents a study regarding modeling styles for all transformation optimizations automatically or effectively. NoC models consentrating on the raise of abstraction level. Hierarchically modelled designs were found in [1] to be on Results show that the implemented Approximately Timed (AT) average 1.5x slower than their flat counterparts and different TLM 2.0 model offers a notable speedup (ranging from 13x value types affecting the simulation time with up to 4x factor. to 40x) with modest latency estimation error (under 10% on TLM modeling techniques have been studied widely. For average). example Proteo NoC architecture [10] used VHDL to construct 978-1-4244-8971-8/10$26.00 c 2010 IEEE
  • 2. higher level model achieving an order of magnitude speedup Router components instantiate one process and a priority for a 16 node NoC. queue per incoming link. Model was calibrated for 4 word Shirner et al. [9] created two TLM models for AMBA packets and the error in average latencies was under 10%. bus and compared them against synthesizeable Bus Functional Second, Loosely Timed (LT) TLM (TLM LT) model was Model version. More abstract TLM model was four orders of made using same network interface models but abstracting all magnitude faster with error up to 45% and the more accurate routers using only one SystemC process. This implementation TLM reached two order of magnitude speedup with an error was left nearly untimed as the main interest was in simulation of 35% in worst cases. performance. A method to estimate latencies in one process Another research to speed up shared bus architecture ex- for wormhole switched mesh topology was described in [4]. plorations in [7] used abstraction level between TLM and Estimation error depended linearly on the network utilization. Bus Cycle Accurate (BCA) to create accurate models offering They measured 45% deviation in average latencies for satura- a 1.55x speedup. OSCI (Open SystemC Initiative) TLM 2.0 tion load but on lower loads the estimation error was in more based methodology to create BCA shared bus protocol models acceptable ranges for DSE. that retain the same level of accuracy as RTL models was B. Transaction Generator introduced in [11]. Their BCA model was measured to be between one and two orders of magnitude faster than RTL Transaction Generator (TG) [8], a freely available SystemC model. traffic generator for NoC benchmarking and DSE was used to generate traffic for NoC models. TG uses abstract application III. A NALYSIS E NVIRONMENT and platform models to mimic traffic patterns captured from Analysis is performed for 2-dimensional mesh topology real data flow applications (Figure 4). It’s implemented using (Figure 1). It uses 6 word fifos on links, wormhole switching SystemC events and waits. and XY routing with fixed priority arbitration. This paper Network traffic created by TG is modelled in Transaction focuses on using SystemC in three different abstraction levels Level using a simple custom interface for RTL NoC models to gain better performance in simulation speed compared to and OSCI TLM 2.0 function calls with generic payload for the reference synthesizeable RTL VHDL model. TLM models. Figure 1: Mesh 2D NoC topology A. Network-on-Chip Models 2D Mesh was implemented in synthesizeable RTL VHDL and SystemC. Two SystemC RTL models were implemented using VHDL coding style following the same structure as the reference VHDL model. First RTL version used 4-state logic and second 2-state logic. In addition, two SystemC TLM versions were created using OSCI TLM 2.0 [6] function calls Figure 4: Transaction Generator uses abstract application and pro- with generic payload. cessing element models to mimic the computation of real SoCs to create accurate traffic C. Measurement All measurements were executed on a modern workstation with E8400 dual core processor, 3 GB RAM and 32-bit Figure 2: Conseptual difference between RTL and TLM. Windows XP. Mixed-language simulation was run with a com- mercial simulator. SystemC-only simulations were run with First model (called TLM AT) was written in Approximately both the commercial simulator and OSCI SystemC reference Timed (AT) coding style using two timing points to annotate kernel compiled with GNU gcc in Cygwin. Optimization level the start and end of transactions (Figure 2). Two SystemC -O3 was used with gcc and similar optimization were enabled processes and queues were used to model network interfaces. for the commercial simulator. No signal trace was gathered
  • 3. 4500 4500 4500 gcc RTL 4-state 4x4 gcc RTL 2-state 6x6 8x8 4000 4000 4000 gcc TLM AT 3500 VHDL 3500 3500 RTL 4-state Simulation duration [s] Simulation duration [s] Simulation duration [s] 3000 RTL 2-state 3000 3000 RTL TLM AT 2500 TLM LT 2500 2500 gcc TLM LT 2000 2000 2000 1500 1500 1500 1000 1000 1000 500 500 500 TLM 0 0 0 0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20 0.00 0.05 0.10 0.15 0.20 Offered load [flits/agents/cycle] Offered load [flits/agents/cycle] Offered load [flits/agents/cycle] Figure 3: Simulation times for 4x4, 6x6 and 8x8 meshes (4 word payload). Time increases with the amount of data sent. The difference between RTL and TLMs is clearly seen. Gcc compiled OSCI reference kernel simulations are marked with gcc prefix and dashed lines. and commercial simulator was run on console mode during Due to the fact that TLM model processes activate only simulations. when transaction begins and ends a great increase in simula- Frequency of 50 MHz was used for NoC and processing tion speed was achieved with bigger transfer sizes. elements and simulations were run for 100 ms. A deterministic Simulations with gcc compiled OSCI reference SystemC all-to-all traffic pattern was used generating the load. kernel were on average 4.5x faster for 20 word transfers and Measured simulation times are for whole program so the 3.5x faster for the commercial simulator when compared to 4 speedups shown are not only for the NoC models. Measure- word transfers on same load conditions. Increasing size to 40 ments were performed more than twice to exclude variations words brought 8.5x speedup for gcc and 6.5x speedup for the in workload affecting results. Table I lists the parameters that commercial simulator. were varied on different load conditions. When compared to 6x6 VHDL RTL model with 40 word Table I: Summary of measured parameters transfers the AT model offered on 30x-35x speedup and the one process model 55x-60x speedup (figure 5) on average. Simulator NoC Size Tx Size NoC Model Commercial 4x4, 6x6, 8x8 4, 20, 40 3 RTL, 2 TLM With 20 word transfers same speedups were 22x-28x and 40x- OSCI 4x4, 6x6, 8x8 4, 20, 40 2 RTL, 2 TLM 45x. IV. R ESULTS A. Size of the NoC NoC sizes of 16, 36 and 64 routers were measured. Size of the NoC affects nearly linearly the simulation time. With the used traffic pattern the simulation speed for RTL models decreases more than the increase in the number of mesh nodes would imply. This is because in all-to-all traffic pattern on bigger mesh means on average more hops to target and thus more cycles to simulate for the same offered load. 6x6 mesh simulated 2.6x and 8x8 5.4x slower than 4x4 mesh on average as shown in figure 3. Figure 5: TLM Speedups compared to RTL VHDL (40 word payload, TLM models scaled better. On average 6x6 mesh TLM 6x6 mesh) models were 2.2x and 8x8 mesh models 4.6x slower than 4x4 mesh models. B. Size of the Transfer C. SystemC Logic Levels Transfer legths depends on the application and the speedup grows with the transfer length for both the RTL and TLM Difference in simulation times between 4-state and 2-state models. By using 20 word transfers SystemC RTL simulation SystemC logic values was small. Gcc simulations were ap- was approximately 2x faster on same load levels than with 4 proximately 7% faster using 2-state values on all measure- word payload. VHDL model was 2.5x faster. SystemC RTL ments except cases with very low utilization. With commercial was 2.3x faster when the size was increased to 40 words and simulator the difference was even smaller ranging from 5% to VHDL 3.2x faster. 2% depending on network utilization.
  • 4. D. SystemC Models TLM models offered a significant increase in simulation For RTL models simulated with the commercial simula- performance. For Approximately Timed (AT) model speedup tor only small difference was between VHDL and SystemC ranged from 13x to 40x. Error on estimated latencies proved to performance. Largest measured difference in times was 16%. be sufficient for DSE on most simulation scenarios analysed. SystemC gcc version using 4-state values simulated 10-30% The more abstract Loosely Timed TLM model was approxi- and 2-state version 13-50% faster than VHDL depending on mately 2x faster than AT model to simulate but if implemented the network size and utilization. like in [4] the significant increase in latency estimation errors Figure 6 shows TLM speedups over RTL VHDL model brings doubts to the useability of this approach especially measured with the commercial simulator. Speedup was signif- under heavier loads. icantly larger for bigger meshes on low utilization scenarios. Based on the study we recommend using AT coding style When mesh utilization got closer to saturation the difference in over LT as the estimation error was small and LT didn’t offer speedup between mesh sizes decreased. Approximately timed significant speedup compared to AT. model was 13x-15x and loosely timed model 20x-30x faster Table II: Summary of SystemC speedups over VHDL in near satu- on near saturation load. With gcc compiled version TLM ration load conditions speedups followed the same trend but were on average 1.5x Scenario 4x4 Mesh 6x6 Mesh 8x8 Mesh slower. 4 word, RTL 1.1 1.4 1.5 20 word, RTL 1.2 1.4 1.6 40 words, RTL 1.1 1.3 1.5 4 word, AT 13 13 14 20 word, AT 20 28 28 40 words, AT 25 37 40 4 word, LT 23 23 26 20 word, LT 30 50 61 40 words, LT 33 60 81 R EFERENCES [1] W. Ecker, V. Esen, L. Schonberg, T. Steininger, M. Velten, and M. Hull, “Impact of Description Language, Abstraction Layer, and Value Rep- resentation on Simulation Performance,” Design, Automation Test in Europe Conference Exhibition, 2007. DATE ’07., pp. 1–6, apr. 2007. [2] P. Ezudheen, P. Chandran, J. Chandra, B. Simon, and D. Ravi, “Par- allelizing SystemC Kernel for Fast Hardware Simulation on SMP Figure 6: TLM Speedups compared to RTL VHDL (4 word payload) Machines,” Principles of Advanced and Distributed Simulation, 2009. PADS ’09, pp. 80–87, jun. 2009. [3] M. Gries, “Methods for Evaluating and Covering the Design Space during Early Design Development,” Integration, the VLSI Journal, E. SystemC Implementations vol. 38, no. 2, pp. 131–183, 2004. [4] A. Kohler and M. Radetzki, “A SystemC TLM2 model of communica- SystemC simulation kernel’s and TLM 2.0 library imple- tion in wormhole switched Networks-On-Chip,” Forum on Specification mentation naturally affects simulation speed. Both the OSCI Design Languages, 2009. FDL 2009, pp. 1–4, sep. 2009. reference kernel and the commercial one used in these simu- [5] A. Morawiec and J. Mermet, “Techniques for Improving the HDL Simulation Performance,” 1999. lations were single threaded. Alternatives such as the parallel [6] Open SystemC Initiative OSCI, “SystemC Documentation.” [Online]. SystemC kernel presented in [2] could be used to speed up Available: http://www.systemc.org SystemC simulations. [7] S. Pasricha, M. Ben-romdhane, and N. Dutt, “High Level Design Space Exploration of Shared Bus Communication Architectures,” CECS Gcc compiled SystemC RTL was approximately 1.5x faster Technical Report, 2004. than the commercial simulator with 4 word packet sizes. [8] E. Salminen, C. Grecu, T. Hämälainen, and A. Ivanov, “Application Commercial tool on the other hand had faster implementation modelling and hardware description for network-on-chip benchmarking,” Computers Digital Techniques, IET, vol. 3, no. 5, pp. 539–550, sept. of OSCI TLM 2.0 library and executed nearly 1.5x faster than 2009. gcc compiled version. On larger packet sizes the difference [9] G. Schirner and R. Domer, “Quantitative Analysis of Transaction Level between used simulators was smaller as seen in Figure 5. Models for the AMBA Bus,” Design, Automation and Test in Europe, 2006. DATE ’06., vol. 1, pp. 1–6, mar. 2006. V. C ONCLUSIONS [10] D. Siguenza-Tortosa and J. Nurmi, “VHDL-based simulation environ- ment for Proteo NoC,” High-Level Design Validation and Test Workshop, A simulation performance of SystemC 2D mesh NoC mod- 2002., pp. 1–6, oct. 2002. elled in three abstraction levels was compared against the [11] H. van Moll, H. Corporaal, V. Reyes, and M. Boonen, “Fast and accurate protocol specific bus modeling using TLM 2.0,” Design, Automation Test reference sythesizeable RTL VHDL model. in Europe Conference Exhibition, 2009. DATE ’09., pp. 316–319, apr. Table II summarises the results under near saturation load. 2009. RTL results are for gcc and TLM for the commercial simulator. The commercial simulator used was slower for RTL level but faster for TLM. Gcc SystemC on RT level was measured to be 1.5x faster than the VHDL model.