Performance and Memory Profiling for Embedded System Design

Heiko Hubert, Benno Stabernack, Kai-Immo Wels
Image Processing Department,
Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut
Einsteinufer 37, 10587 Berlin, Germany
[huebert,stabernack,wels]@hhi.fraunhofer.de

Abstract- The design of embedded hardware/software systems is often subject to strict requirements concerning various aspects, including real-time performance, power consumption and die area. Especially for data-intensive applications, such as multimedia systems, the number of memory accesses is a dominant factor for these aspects. In order to meet the requirements and design a well-adapted system, the software parts need to be optimized and an adequate hardware architecture needs to be designed. For complex applications this design space exploration can be rather difficult and requires in-depth analysis of the application and its implementation alternatives. Tools are required which aid the designer in the design, optimization and scheduling of hardware and software. We present a profiling tool for fast and accurate performance and memory access analysis of embedded systems and show how it can be applied within the design flow. This concept has been proven in the design of a mixed hardware/software system for H.264/AVC video decoding.

Keywords- profiling, embedded hardware/software systems, design space exploration, scheduling

I. INTRODUCTION

The design of an embedded system often starts from a software description of the system in the C language. For example, the designer writes an executable specification based on a reference implementation of the application, e.g. from standardization organizations or the open-source community. This software code is often not optimized in any way, because it mainly serves the purpose of functional and conformance testing. Therefore it has to be transformed into an efficient system, including hardware and software components. The design of the system requires the following steps: system architecture design, hardware/software partitioning, software optimization, design of hardware accelerators and system scheduling. All these steps require detailed information about the performance of the different parts of the application. Besides the arithmetical demands of the application, memory accesses can have a huge influence on performance and power consumption. This is especially the case for data-intensive applications, such as multimedia systems, due to the huge amount of data to be transferred in these applications. The problem is aggravated if the available data bandwidth is not used efficiently.

In order to reduce the overall data traffic, those parts of the code which require a high amount of data transfers have to be identified and optimized. The above-mentioned applications contain up to 100,000 lines of source code. Therefore tools are required which help the designer identify the critical parts of the software. Several analysis tools exist, e.g. timing analysis is provided by gprof or VTune. Memory access analysis is part of the ATOMIUM [2] tool suite. However, all these tools provide only approximate results for either timing or memory accesses. A highly accurate memory analysis can be done with a hardware (HDL) simulator, if an HDL model of the processor is available. However, such an analysis implies a long simulation time.

In order to achieve a fast and accurate solution, we developed a specialized profiler, called Memtrace [3], for obtaining performance and memory access statistics. This paper describes the tool with all its features. We show how the provided profiling results can be used during the design and optimization of embedded hardware/software systems. As a case study, Memtrace is applied during the design of a mixed hardware/software system for H.264/AVC video decoding. Starting from a software implementation, it is shown how the software is optimized, an efficient hardware architecture is developed, and the system tasks are scheduled based on the profiling results.

II. MEMTRACE: A PERFORMANCE AND MEMORY PROFILER

A. Tool Architecture

Memtrace is a non-intrusive profiler, which analyzes the memory accesses and real-time performance of an application without the need for instrumentation code. The analysis is controlled by information about variables and functions in the user application, which is automatically extracted from the application. Furthermore, the user can specify the system parameters, e.g. the processor type and the memory system. During the analysis, Memtrace utilizes the instruction set simulator ARMulator [1] for executing the application. The ARMulator provides Memtrace with the information required for the analysis, e.g. the program counter, the clock cycle counter and the memory accesses. Memtrace creates detailed results on memory accesses and timing for each function and variable in the code.
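As an illustration of per-variable results, the following minimal sketch attributes a memory access to a variable by its address. It is not taken from Memtrace; the class `VariableMap`, the `(name, start address, size)` description of a variable and the example addresses are assumptions made for this example only.

```python
from bisect import bisect_right


class VariableMap:
    """Hypothetical lookup table that maps an access address to a variable."""

    def __init__(self, variables):
        # variables: iterable of (name, start_address, size_in_bytes)
        self.entries = sorted(variables, key=lambda v: v[1])
        self.starts = [start for _, start, _ in self.entries]
        self.accesses = {name: 0 for name, _, _ in self.entries}

    def find(self, address):
        # Locate the last variable that starts at or before the address.
        index = bisect_right(self.starts, address) - 1
        if index < 0:
            return None
        name, start, size = self.entries[index]
        return name if address < start + size else None

    def record(self, address):
        # Count the access for the variable that contains the address, if any.
        name = self.find(address)
        if name is not None:
            self.accesses[name] += 1
        return name


# Hypothetical variable locations, chosen only for this example.
variables = VariableMap([
    ("frame_buffer", 0x8000, 1024),
    ("coeff_table", 0x9000, 64),
])
for address in (0x8000, 0x83FF, 0x8400, 0x9010, 0x7000):
    variables.record(address)
print(variables.accesses)
```

Accesses that fall outside every listed area are simply not counted in this sketch; a real tool would need a policy for them, for example a catch-all entry.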




1-4244-0840-7/07/$20.00 ©2007 IEEE.
Authorized licensed use limited to: King Mongkuts Institute of Technology Ladkrabang. Downloaded on November 27, 2009 at 04:48 from IEEE Xplore. Restrictions apply.
[Figure 1: block diagram. Inputs: the executable of the application, the analysis specification (list of functions, stack location, variable location, result table format) and the system specification (processor, caches, memory timing). Tool: Memtrace frontend and the instruction set simulator (ARMulator) with Memtrace backend. Outputs: results of function analysis (clock cycles of func1 and func2 per iteration) and results of memory analysis (cache misses of var1 and var2 per iteration).]

Figure 1. Performance analysis tool: Memtrace profiles the performance and memory accesses of a user application.

B. Analysis Workflow

The performance analysis with Memtrace is carried out in three steps: the initialization, the performance analysis and the postprocessing of the results.

During initialization Memtrace extracts the names of all functions and variables of the application. During this process user variables and functions are separated from standard library functions, such as printf() or malloc(). This is achieved by comparing the symbol table of the executable with those of the user library and object files. The results are written to the analysis specification file. The specification file can be edited by the user, e.g. for adding user-defined memory areas, such as the stack and heap variables, for additional analysis. Furthermore, the user can define a so-called "split function", which instructs Memtrace to produce snapshot results each time the "split function" is called. This can be used, e.g., in video processing for generating separate profiling results for each processed frame. Additionally, the user can control whether the analysis results of a function, e.g. clock cycles, should include the results of a called function (accumulated) or only reflect the function's own results (self). Typically auxiliary functions, e.g. C library or simple arithmetic functions, are accumulated to the calling functions.

In the second step the performance analysis is carried out, based on the analysis specification and the system specification, as shown in Figure 1. The system specification includes the processor, cache and memory type definitions. The Memtrace backend connects to the instruction set simulator for the simulation of the user application and writes the analysis results of the functions and variables to files; see Section II.C for more details. If a "split function" has been specified, these files include tables for each call of the "split function". TABLE I shows exemplary results for function profiling. The output files serve as a database for the third step, where user-defined data is extracted from these tables.

TABLE I. 32-BIT EXEMPLARY RESULT TABLE FOR FUNCTIONS

f    ca  cyl  ls  ld  l8  st  s8  pm  cm  BI   BC   BD
f1   8   215  75  22  7   52  3   42  5   123  92   0
f2   2   295  39  35  3   14  9   17  9   55   153  87
f3   2   432  78  68  4   10  2   31  17  143  289  0

Abbreviations are: f: function; ca: calls; cyl: bus (clock) cycles; ls: all load/store accesses from the core; ld: all loads; l8: byte and half-word loads; st: all stores; s8: byte and half-word stores; pm: page misses; cm: cache misses; BI: bus idle cycles; BC: core bus cycles; BD: DMA bus cycles.

In the third step a postprocessing of the results can be performed. Memtrace allows the generation of user-defined tables, which contain specific results of the analysis, e.g. the load memory accesses for each function. Furthermore, the results of several functions can be accumulated in groups for comparing the results of entire application modules. The user-defined tables are written to files in a tab-separated format. Thus they can be further processed, e.g. by spreadsheet programs for creating diagrams.

C. Tool Backend - Interface to the ISS

Memtrace communicates with the Instruction Set Simulator (ISS) via its backend, as depicted in Figure 2. The backend is implemented as a dynamic link library (DLL), which connects to the ISS. Currently only the ARM instruction set simulator ARMulator is supported. The backend is automatically called by the ISS during simulation. During the startup phase, the backend creates a list of all functions and marks the user and split functions found in the analysis specification file. For each function a data structure is created, which contains the function's start address and variables for collecting the analysis results. Finally two pointers, called currentFunction and evaluatedFunction, are initialized. The first pointer indicates the currently executed function. If this function should not be evaluated, the second pointer indicates the calling function, to which the results of the current function should be added.

[Figure 2: block diagram; recoverable labels: System Bus, Memory & Bus Timing Model, Memories.]

Figure 2. Interface between Memtrace backend and the ISS

Each time the program counter changes, Memtrace checks whether the program execution has changed from one function to another. If so, the cycle count of the evaluatedFunction is recalculated and the call count of the currentFunction is incremented. Finally the pointers to the currentFunction and evaluatedFunction are updated. If currentFunction is a split function, the differential results from the last call of this function up to the current call are printed to the result files.
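The bookkeeping with the currentFunction and evaluatedFunction pointers described in Section II.C can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual backend: the class names, the `on_pc_change` and `finish` callbacks, the rule that a call is counted only when the program counter equals a function's start address, and all addresses and cycle values are hypothetical. Split-function snapshots are omitted.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class FunctionStats:
    name: str
    start: int             # start address of the function
    evaluate: bool = True  # False: results are added to the calling function
    calls: int = 0
    cycles: int = 0


class FunctionTracker:
    def __init__(self, functions):
        self.functions = sorted(functions, key=lambda f: f.start)
        self.starts = [f.start for f in self.functions]
        self.current = None    # plays the role of currentFunction
        self.evaluated = None  # plays the role of evaluatedFunction
        self.last_cycle = 0

    def _lookup(self, pc):
        # The function containing pc is the last one starting at or before pc.
        index = bisect_right(self.starts, pc) - 1
        return self.functions[max(index, 0)]

    def _charge(self, cycle):
        # Add the cycles elapsed since the last update to the evaluated function.
        if self.evaluated is not None:
            self.evaluated.cycles += cycle - self.last_cycle
        self.last_cycle = cycle

    def on_pc_change(self, pc, cycle):
        func = self._lookup(pc)
        if func is self.current:
            return  # still inside the same function
        self._charge(cycle)
        if pc == func.start:  # assumption: entry at the start address is a call
            func.calls += 1
        self.current = func
        if func.evaluate or self.evaluated is None:
            self.evaluated = func

    def finish(self, cycle):
        # At the end of the simulation, update the last evaluated function.
        self._charge(cycle)


# Hypothetical trace of (program counter, cycle counter) pairs.
main = FunctionStats("main", 0x100)
helper = FunctionStats("helper", 0x200, evaluate=False)
decode = FunctionStats("decode", 0x300)
tracker = FunctionTracker([main, helper, decode])
for pc, cycle in [(0x100, 0), (0x104, 2), (0x200, 5), (0x204, 9),
                  (0x108, 12), (0x300, 15), (0x10C, 20)]:
    tracker.on_pc_change(pc, cycle)
tracker.finish(22)
print(main.cycles, helper.cycles, decode.cycles)
```

In this trace the cycles spent in `helper` are charged to `main`, because `helper` is marked as not evaluated, while `decode` keeps its own cycles.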




For each access that occurs on the data bus (to the data cache or TCM), the memory access counters of the evaluatedFunction are incremented. Depending on the information provided by the ARMulator, it is decided whether a load or store access was performed, and which bit width (8/16 or 32 bit) was used. Furthermore, the ARMulator indicates if a cache miss occurred. Page hits and misses are calculated by comparing the address of the current memory access with that of the previous one and incorporating the page structure of the memory.

For each bus cycle (on the external memory bus) Memtrace checks if it was an idle cycle, a core access or a DMA access and increments the appropriate counter of the evaluatedFunction.

At the end of the simulation the results of the last evaluatedFunction are updated, and the results of the last call of the split function and the accumulated results are printed to the result files.

D. Memtrace Frontend

Memtrace comes with two frontends, a command-line interface and a graphical user interface (GUI). The command-line interface is well suited for use in batch files, for example for profiling a set of system configurations or input data. The GUI version allows easy and fast access to all features of the tool. It is especially helpful for the quick generation of result diagrams.

Figure 3. Memtrace GUI frontend

E. Portability to other Processor Architectures

The current version of Memtrace is only targeted to the ARM processor family, as it uses the ISS from ARM (ARMulator). However, the interface of the profiler, as described before, is rather simple and could be ported to other processor architectures if an instruction set simulator is available which allows debugging access to its memory busses. Our plans for future work include Memtrace backends for other processor architectures.

As long as other backends are not available, the ARM-based profiling results may function as a rough estimation for the results on other RISC processor architectures. Since all processors of the ARM family can be profiled, a wide variety of architectural features is covered, including variations of pipeline length, instruction bit-width, availability of DSP/SIMD instructions, MMUs, cache size and organization, tightly coupled memories, bus width and detailed memory timing options. For a profiling estimation of a non-ARM processor, an ARM processor with a similar feature set should be chosen. TABLE II lists common embedded processors which have similarities with ARM processors. They have a basic feature set in common, including a 32-bit Harvard architecture with caches, a 5- to 8-stage pipeline and a RISC instruction set. It has to be mentioned, however, that some of the processors provide specific features which may have a significant influence on the performance, for example the custom instruction extensions of ARC and Tensilica Xtensa processors.

TABLE II. 32-BIT EMBEDDED RISC PROCESSORS

Processor              Pipeline    Registers(a)  Instr./Data Cache(a)  Instr./Data TCM(a)  Special Features
ARM9E                  5 stage     16            128k/128k             yes/yes             coprocessor interf.
ARM11                  8 stage     16            64k/64k               yes/yes             SIMD, branch pred., 64-bit bus, coprocessor interf.
ARC600                 5 stage     32 (-60)      32k/32k               512k/16k            custom instr., extend. reg. file
ARC700                 7 stage     32 (-60)      64k/64k               512k/256k           custom instr., branch pred., extend. reg. file, 64-bit bus
Tensilica Xtensa7      5 stage     64 or >       32k/32k               256k/256k           custom instr., windowed regs., up to 128-bit bus
Tensilica Diamond232L  5 stage     32            16k/16k                                   windowed regs.
LatticeMico32          6 stage     32            32k/32k
Altera NIOS II         5-6 stage   32            64k/64k               yes/yes             direct-map. cache, custom instr.
Xilinx MicroBlaze v5   5 stage     32            64k/64k               yes/yes             direct-map. cache, coprocessor interf.
MIPS 4KE               5 stage     32            64k/64k               yes/yes             coprocessor interf.
openRISC OR1200        5 stage     32            64k/64k               (illegible)         direct-map. cache, custom instr.
LEON3                  7 stage     520           1M/1M                 yes/yes             windowed regs., coprocessor interf.

(a) Many features are customizable; the maximum value is given.

III. MEMTRACE WITHIN THE DESIGN FLOW

This chapter describes how the profiler can be applied during the design of embedded systems. Figure 4 shows a typical design flow for such hardware/software systems. Starting from a functionally verified system description in software, this software is profiled with an initial system specification, in order to measure the performance and see if the (real-time) requirements are met. If not, an iterative cycle of software and hardware partitioning, optimization and scheduling starts. In this process detailed profiling results are crucial for all steps in the design cycle.
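The check of whether the real-time requirements are met can be illustrated with a small calculation on per-frame cycle counts, as a split function placed on the frame loop would provide them. The sketch below is an assumption-laden example, not part of Memtrace: the function names, the cycle counts, the frame rates and the 200 MHz clock are hypothetical, and it assumes that every frame must finish within its own frame period (no buffering across frames).

```python
def required_clock_hz(cycles_per_frame, frames_per_second):
    """Clock rate needed if every frame were as expensive as the worst one."""
    return max(cycles_per_frame) * frames_per_second


def meets_realtime(cycles_per_frame, frames_per_second, clock_hz):
    """True if the worst-case frame fits into one frame period at clock_hz."""
    return required_clock_hz(cycles_per_frame, frames_per_second) <= clock_hz


# Hypothetical per-frame cycle counts taken from split-function snapshots.
frame_cycles = [6_200_000, 7_900_000, 7_100_000]

print(required_clock_hz(frame_cycles, 25))            # 197500000
print(meets_realtime(frame_cycles, 25, 200_000_000))  # True
print(meets_realtime(frame_cycles, 30, 200_000_000))  # False
```

With these example numbers the system would meet 25 frames per second at 200 MHz but not 30, which is the situation in which the iterative cycle of partitioning, optimization and scheduling would start.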




SIMD instructions can be applied, if such instructions are
                                                                                                       available in the processor. If the performance of the code is
                                                                                                       significantly influenced by memory accesses, as it is mainly
                                                                                                       the case in video applications, the number of accesses has to
                                                                                                       be reduced or they have to be accelerated. The profiler gives a
                                                                                                       detailed overview of the memory accesses and thereby allows
                                                                                                       their influence on performance to be identified. One
                                                                                                       optimization mechanism is the conversion of byte (8-bit) to
                                                                                                       word (32-bit) memory accesses. This can be applied if
                                                                                                       adjacent bytes in memory are required concurrently or within
                                                                                                       a short time period, for example pixel data of an image during
                                                                                                       image processing. A further mechanism is the usage of tightly
                                                                                                       coupled memories (TCMs) for storing frequently used data.
     [Figure 4 diagram; recoverable labels: HW/SW Partitioning,                                        For finding the most frequently accessed data area, the
      Scheduling, System]                                                                              memory access statistics of Memtrace can be used. In [1] these
                      Figure 4. Typical embedded system design flow                                    techniques are described in more detail.
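     The conversion of byte to word accesses can be illustrated with a minimal C sketch. It is not taken from the profiled decoder; the function names and the choice of a pixel sum as workload are assumptions made for illustration only. The first routine issues one 8-bit load per pixel, the second fetches four adjacent pixels with a single 32-bit load and unpacks them in registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: one 8-bit load per pixel. */
uint32_t sum_pixels_bytewise(const uint8_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Word-wise variant: one 32-bit load per four adjacent pixels.
 * memcpy keeps the access alignment-safe; compilers typically
 * turn it into a single word load when the target allows it. */
uint32_t sum_pixels_wordwise(const uint8_t *p, size_t n)
{
    assert(p != NULL || n == 0);
    uint32_t sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        uint32_t w;
        memcpy(&w, p + i, sizeof w);
        /* Unpack the four bytes; the sum does not depend on endianness. */
        sum += (w & 0xFFu) + ((w >> 8) & 0xFFu) + ((w >> 16) & 0xFFu) + (w >> 24);
    }
    for (; i < n; i++)          /* remaining 0..3 pixels */
        sum += p[i];
    return sum;
}
```

     Both routines return the same value; only the number and width of the memory accesses differ, which is the quantity a memory access profile would report.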
                                                                                                       C. Hardware/Software Profiling and Scheduling
     A. Hardware/Software Partitioning and                                                                  Besides the software profiling and optimization a system
         Design Space Exploration                                                                      simulation including the hardware accelerators needs to be
         For the definition of a starting point of a system architecture                               carried out in order to evaluate the overall performance.
     an initial design space exploration should be performed. These                                    Usually hardware components are developed in a hardware
     steps include a variation of the following parameters:                                            description language (HDL) and tested with an HDL simulator.
                                                                                                       This task requires long development and simulation times.
         * processor type                                                                              Therefore HDL modelling is not suitable for the early design
           *  cache size and organization                                                              cycles, where exhaustive testing of different design alternatives
                                                                                                       is important. Furthermore, if the system performance is data
         * tightly coupled memories                                                                    dependent, a large set of input data should also be tested to get
         * bus timing                                                                                  reliable profiling results. Therefore, a simulation and profiling
                                                                                                       environment is required, which allows short modification and
         * external memory system and timing (DRAM, SRAM)                                              simulation time.
         * hardware accelerators, DMA controller                                                           For this purpose, we used the instruction set simulator and
                                                                                                       extended it with simulators for the hardware components of the
         Memtrace can be run in batch mode and thus different                                          system. The ARMulator provides an extension interface, which
     system configurations can be tested and profiled. Thus the                                        allows the definition of a system bus and peripheral bus
     influence of the system architecture on the performance can be                                    components. It already comes with a bus simulator, which
     evaluated. This initial profiling also reveals the hot-spots of the                               reflects the industry standard AMBA bus and a timing model
     software. The most time consuming functions are good                                              for access times to memory mapped bus components, such as
     candidates for either software optimization or hardware                                           memories and peripheral modules, see Figure 5.
     acceleration. Computationally intensive functions in particular
     are well suited for hardware acceleration in a coprocessor. With
     support of a DMA controller even the burden of data transfers
     can be taken from the processor. Control-intensive functions
     are better suited for software implementation, as a hardware
     implementation would lead to a complex state machine, which
     requires long design time and often does not allow
     parallelization. In order to get a first idea of the influence of
     hardware acceleration, a speed-up factor (an educated guess)
     can be defined for each hardware candidate function. This
     factor is used by Memtrace to adjust the original profiling
     results.
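     How such a factor might be applied to profiling results is sketched below in C. The data layout and the rule of dividing the measured cycle count by the factor are assumptions for illustration; the text does not specify how Memtrace performs the adjustment internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-function profile record. */
typedef struct {
    const char *name;
    uint64_t    cycles;     /* cycles measured in the software profile */
    double      hw_factor;  /* assumed speed-up; 1.0 means "stays in software" */
} func_profile;

/* Estimated total cycle count after scaling every hardware
 * candidate by its assumed speed-up factor. */
double estimate_total_cycles(const func_profile *f, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++) {
        assert(f[i].hw_factor >= 1.0);
        total += (double)f[i].cycles / f[i].hw_factor;
    }
    return total;
}
```

     Comparing the estimate against the unmodified total gives a first impression of whether moving a function to hardware is worthwhile before any hardware model exists.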
     B. Software Profiling and Optimization
        After a partitioning into hardware and software is found, the
     software part can be optimized. Numerous techniques exist
     that can be applied for optimizing software, such as loop
     unrolling, loop-invariant code motion, common subexpression
     elimination or constant folding and propagation. For
                                                                                                          Figure 5. Environment for hardware/software cosimulation and profiling
     computationally intensive parts, arithmetic optimizations or



1) Coprocessors                                                                                 these results; for example, Figure 6 shows the bus usage for
         We supplemented this system with a simple template for                                        each function depending on the access time of the memory.
     coprocessors, including local registers and memories and a
     cycle-accurate timing. The functionality of the coprocessor can
     be defined as standard C code, thus the software function can
     be simulated as a hardware accelerator by copying the software
     code to the coprocessor template. The timing parameter can be
     used to define the delay of the coprocessor between activation
     and result availability, i.e. the execution time of the task, as it
     would be in real hardware. This value can be obtained either
     from a reference implementation found in the literature or by an
     educated guess of a hardware engineer. Furthermore, often
     multiple hardware implementations of a task with different
     execution times (and hardware costs) are possible. In the
     proposed profiling environment, simply by varying the timing
     parameter and viewing its influence on the overall                                                [Figure 6 plot; recoverable legend: Bus Idle (SRAM), Bus Accesses
     performance, a good trade-off between hardware cost and                                            (DRAM), Bus Idle (DRAM); x-axis: Functions]
     speed-up can be found quickly.
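     The effect of varying the timing parameter can be sketched with a small C model. This is not the coprocessor template itself; the structure, the function names and the simplifying assumptions (the CPU waits for the coprocessor, data transfers are ignored, a longer latency stands for cheaper hardware) are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of a software profile for one candidate function. */
typedef struct {
    uint64_t total_cycles;  /* whole application, pure software */
    uint64_t func_cycles;   /* cycles spent in the candidate function */
    uint64_t calls;         /* number of invocations of the function */
} sw_profile;

/* Total cycles if each call is replaced by a coprocessor run of
 * `latency` cycles (CPU assumed to wait; transfers ignored). */
uint64_t cycles_with_coprocessor(const sw_profile *p, uint64_t latency)
{
    assert(p->func_cycles <= p->total_cycles);
    return p->total_cycles - p->func_cycles + p->calls * latency;
}

/* Among the candidate latencies, return the largest one that still
 * meets the cycle budget, or 0 if none does. */
uint64_t slowest_latency_meeting_budget(const sw_profile *p,
                                        const uint64_t *lat, size_t n,
                                        uint64_t budget)
{
    uint64_t best = 0;
    for (size_t i = 0; i < n; i++)
        if (cycles_with_coprocessor(p, lat[i]) <= budget && lat[i] > best)
            best = lat[i];
    return best;
}
```

     Sweeping the latency in this way mirrors the batch-style exploration described above: the slowest implementation that still meets the real-time budget is a natural candidate for the cost/speed trade-off.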
        2) DMA Controller                                                                                   Figure 6. Bus usage for each function, depending on the memory type
         For data intensive applications data transfers have a
     tremendous influence on the overall performance. In order to                                        4) HDL Simulation
     efficiently outsource tasks into hardware accelerators also the                                       In a later design phase, when the hardware/software
     burden of data transfer has to be taken from the CPU. This job                                    partitioning is fixed and an appropriate system architecture is
     can be performed by a DMA controller. The Memtrace                                                found, the hardware components need to be developed in a
     hardware profiling environment includes a highly efficient                                        hardware description language and tested using an HDL
     DMA controller with the following features:                                                       simulator, such as ModelSim. Finally, the entire system needs
                                                                                                       to be verified including hardware and software components.
          * multi-channel (parameterizable number of channels)                                         For this purpose the instruction set simulator and the HDL
          * 1D and 2D transfers                                                                        simulator have to be connected. The codesign environment
          * activation FIFO (non-blocking transfer, autonomous)                                        PeaCE [4] allows the connection of the ModelSim simulator
          * internal memory for temporary storage between read                                         and the ARMulator.
               and write
          * burst transfer mode                                                                           IV. APPLICATION EXAMPLE: H.264/AVC VIDEO DECODER
         This enables the designer to determine the influence of                                                                FOR MOBILE TV TERMINALS
     different DMA modes in order to find an appropriate trade-off
     between DMA Controller complexity and required CPU                                                    The proposed design methodology has been applied to the
     activity.                                                                                         design of a video decoder as part of a mobile digital TV
                                                                                                       receiver. Starting from an executable specification of the video
        3) Scheduling                                                                                  decoder, namely the (unoptimized) reference software, at first a
         After the software and hardware tasks have been defined a                                     pure optimized software implementation and then an ASIC has
     scheduling of these tasks is required. For increasing the overall                                 been developed incorporating hardware accelerators and a
     performance a high degree of parallelization should be                                            customized processor.
     accomplished between hardware and software tasks. In order to
     find an appropriate scheduling for parallel tasks the following                                   A. DVB-H and H.264/AVC Video Compression
     information is required:                                                                              The receiver is compliant with DVB-H, which is a new
          * dependencies between tasks                                                                 standard for broadcasting of digital audio and video content to
                                                                                                       mobile devices. The content is encoded using highly efficient
          * the execution time of each task                                                            compression methods, namely AAC-HE for audio data and the
          * data transfer overhead                                                                     H.264/AVC [5] codec for video content. DVB-H focuses on
                                                                                                       high mobility and low power consumption of the receivers. The
         Especially for data intensive applications the overhead for                                   most demanding part of the receiver in terms of computational
     data transfers can have a huge influence on the performance. It                                   requirements is the H.264/AVC video decoder.
     might even happen that the speed-up of a hardware accelerator
     is cancelled out by the overhead for transferring data to and                                         The H.264/AVC video compression standard is similar to
     from the accelerator.                                                                             its predecessors; however, it adds various new coding features
                                                                                                       and refinements of existing mechanisms, which lead to a two- to
     The overhead for data transfers to the coprocessors is                                            threefold increase in coding efficiency compared to MPEG-2.
     dependent on the bus usage. Furthermore side effects on other                                     However, the computational demands and required data
     functions may occur, if bus congestion occurs or when cache                                       accesses have also increased significantly. Figure 7 shows the
     flushing is required in order to ensure cache coherency. In                                       block diagram of an H.264/AVC decoder.
     order to find these side-effects, detailed profiling of the system
     performance and the bus usage is required. Memtrace provides



                    Figure 7. Block diagram of an H.264/AVC decoder
         The bitstream parsing and entropy decoding interpret the
     encoded symbols and are highly control flow dominated. The
     symbols contain control information and data for the following
     components. The inter and intra prediction modes are used to
     predict image data from previous frames or neighboring blocks,
     respectively. Both methods require filtering operations,
     whereas the inter prediction is more computationally demanding.
                                                                                                           motion compensation for the chrominance pixels, which is
                                                                                                           mainly based on bilinear interpolation. Focusing on the read
                                                                                                           memory accesses, which are performed in
                                                                                                           motionCompChroma(), as given in the second column of
                                                                                                           Table III, it shows that more than 30% are byte or
                                                                                                           half-word accesses (third column). This is due to the fact that
                                                                                                           the pixel values have the size of one byte each.

     The residuals of the prediction are received as transformed and                                             Figure 8.     Profiling results for the H.264/AVC software decoder
     quantized coefficients. The applied transformation, which can
     be considered as a simplified discrete cosine transformation                                             Since the interpolation is applied iteratively on adjacent
     (DCT), is based on integer arithmetic and is computationally                                          pixels, the source code can be optimized by reading 4 adjacent
     demanding. The reconstructed image is post-processed by a                                             bytes at once. This leads to a reduction of the execution time
     deblocking filter for reducing blocking artifacts at block edges.                                     of the function by almost 30%. The speedup of the function
     The deblocking filter includes the calculation of the filter                                          leads to a reduction of the execution time for processing a
     strength, which is control flow dominated, and the actual 3- to                                       P-frame by about 5%.
     5-tap filtering, which requires many arithmetic operations.
     Each of these components allows various modes of operation,
     which are chosen by symbols in the bitstream. This involves a                                         TABLE III.  PROFILING RESULTS FOR THE MOTIONCOMPCHROMA() FUNCTION
     high degree of control flow in the decoder.                                                                                     Clock cycles    All loads    8/16-bit loads
         The H.264/AVC baseline decoder has been profiled with                                                before optimization     13,149,109      309,368       104,784
     Memtrace using a system specification typical for mobile                                                 after optimization       9,355,709      196,746        34,584
     embedded systems comprising an ARM946E-S processor core,
     a data and instruction cache (16kB each) and an external                                                 Further speed-up of the software could be achieved by
     DRAM as main memory. The execution time for each module                                               applying well-known software optimization techniques and
     of the decoder has been evaluated as depicted in Figure 8. The                                        those proposed in [3] to the functions identified by the
     results show, that the distribution over the modules differs
     significantly between I- and P-frames. Whereas in I-frames the                                        profiler. The resulting software decoder has been tested on an
     deblocking has the most influence on the overall performance,                                         Intel PXA270-based PDA within the DVB scenario. The
     in P-frames the motion compensation is the dominant part.                                             required processor clock frequency for H.264/AVC decoding
                                                                                                           is about 420 MHz (320x240 pixel resolution, 384 kbit/s).
     B. Design and Optimizations                                                                              Considering the dynamic power consumption of CMOS
         Based on the acquired profiling results several software and                                      circuits, given in equation (1), the rather high system frequency
     hardware architectural optimizations are applied. Our first                                           leads to high power consumption.
     target is a pure software version of the video decoder for the
     implementation of a DVB-H terminal on a PDA. In a second                                                   P_{dynamic} = \sum_{k=1}^{M} C_k f_k V_{DD}^2        (1)
     step an embedded hardware/software system is developed.                                                  For achieving lower power consumption, methods need to
        1) Software Implementation and Optimizations                                                       be applied, which allow the reduction of the system frequency,
        Following Amdahl's law, those parts of the software that                                           which in turn also allows a lower supply voltage (voltage
     take up most of the execution time should be considered for                                           scaling). Hardware accelerators can be used for this purpose.
     optimization first. Figure 8 shows that motion                                                        However, their influence on the capacitance has to be
     compensation, loop filter, inverse transformation and memory                                          considered and reduced by mechanisms such as clock gating.
     related functions are those candidates. Exploring the results of                                      Furthermore the memory architecture needs to be adapted
     the functions corresponding to the motion compensation, it                                            (reduced) to the specific application requirements.
     can be seen that the function motionCompChroma()
     requires the most execution time. This function performs the



2) Memory System                                                                                 control flow level. Therefore they are well suited for hardware
         Besides the processing power of the CPU the memory and                                         implementation as coprocessors, which can be controlled by
     bus architecture determine the overall performance of the                                          the main CPU. In order to ease the burden of providing the
     system. Namely the cache size and architecture, the speed and                                      coprocessors with data, a DMA controller can be applied
     usage of a tightly coupled (on-chip) memory (TCM), the width                                       allowing memory transfers concurrently to the processing of
     of the memory bus, the bandwidth of the off-chip memory and                                        the CPU. The coprocessors should be equipped with local
     a DMA controller are the most influencing factors. Adjusting                                       memory for storing input and output data for processing at least
     these factors requires a trade-off between hardware cost, power                                    one macroblock at a time preventing fragmented DMA
     consumption and performance. The H.264/AVC decoder has                                             transfers. As the video data is stored in the memory in a two
     been simulated with different cache sizes in order to find an                                      dimensional fashion, the DMA controller should feature 2-D
     appropriate size for the DVB-H terminal scenario (QVGA                                             memory transfers. The output of the video data to a display,
     image resolution). It has been evaluated how the required                                          which is required by a DVB-H terminal, further aggravates the
     decoding time changes when either the instruction cache size or                                    problem of the high volume of data transfers.
     the data cache size is increased, see Figure 9.
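     The 2-D memory transfers required of the DMA controller can be pictured with a short software model in C. The descriptor layout and names below are hypothetical and do not reproduce the Memtrace DMA controller interface; the sketch only shows how a rectangular block is gathered from a frame stored row by row into a compact local buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical descriptor for one 2-D block transfer. */
typedef struct {
    const uint8_t *src;         /* top-left byte of the block in the frame */
    uint8_t       *dst;         /* start of the local (coprocessor) buffer */
    size_t         width;       /* bytes per block row */
    size_t         height;      /* number of block rows */
    size_t         src_stride;  /* bytes between consecutive frame rows */
    size_t         dst_stride;  /* bytes between consecutive buffer rows */
} dma2d_desc;

/* Software model of the transfer: one contiguous copy per block row. */
void dma2d_copy(const dma2d_desc *d)
{
    assert(d->src_stride >= d->width && d->dst_stride >= d->width);
    for (size_t y = 0; y < d->height; y++)
        memcpy(d->dst + y * d->dst_stride,
               d->src + y * d->src_stride,
               d->width);
}
```

     With a single descriptor of this kind a whole macroblock can be moved in one request, whereas a 1-D controller would need a separate transfer for each row.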
                                                                                                           4) Hardware/Software Interconnection and Scheduling
     [Figure 9 plot; recoverable legend: I=4k:D=var, I=var:D=0k]                                           After the software optimization is performed and the
                                                                                                        hardware accelerators are developed, a scheduling of the entire
                                                                                                        system is required. The scheduling is static and controlled by
                                                                                                        the software. The hardware accelerators are introduced step by
                                                                                                        step to the system. Starting from the pure software
                                                                                                        implementation, at first the software functions are replaced by
                                                                                                        their hardware counterparts. This also requires the transfer of
                                                                                                        input data to and output data from the coprocessors. These data
                                                                                                        transfers are at first executed by load-store operations of the
                                                                                                        processor and in a next step replaced by DMA transfers. This
                                                                                                        might also require flushing the cache or cache lines, which
                                                                                                        may decrease the performance of other software functions. In a
         Figure 9. Influence of the instruction (I) and data (D) cache sizes on the                     final step the parallelization of the hardware task and software
                       execution time of the H.264/AVC decoder.                                         tasks takes place. All decisions taken in these steps are based on
                                                                                                        detailed profiling results.
         The results show that increasing the instruction cache size                                        The following example shows how the hardware
     from 4 kByte up to 32 kByte has a minor influence on the                                           accelerator for the deblocking is inserted into the software
     overall performance. However, adding a data cache of 4 kByte                                       decoder. The hardware accelerator only includes the filtering
     to the system decreases the decoding time to less than 20%.                                        process of the deblocking stage, filter strength calculation is
     Further increasing the data cache size does not yield a dramatic                                   performed in software, because it is rather control intensive and
     performance increase. Therefore a data and instruction cache                                       therefore more suitable for software implementation. The filter
     size of 4 kByte each is a good tradeoff between performance                                        processes the luminance and chrominance data for one
     and die area. The data cache increases the performance by                                          macroblock at a time. It requires the pixel data and filter
     decreasing the number of accesses to the external memory.                                          parameters as an input and provides filtered image data as an
     This is especially efficient for data areas with frequent accesses                                 output, this sums up to about 340 32-bit words of data transfer.
     to the same memory location, e.g. the stack. However for                                           Figure 10. shows the results for the pure software
     randomly accessed data areas, e.g. lookup tables, a fast on-chip                                   implementation, when using the filter accelerator with data
     memory (SRAM) is more appropriate. As the H.264/AVC                                                transfer managed by the processor, and when additionally using
     decoder requires about 1. 1 MByte of data memory (@ QVGA                                           the DMA controller. As can be seen, if data is transferred by
     video resolution), only small parts of the used data structures                                    the processor, the performance gain of the accelerator is
     (less than 3%0 with 32 kByte of SRAM) can be stored in the of                                      vanished by the data transfers, only in conjunction with the
     on-chip memory. In order to find a useful partitioning of data                                     DMA controller the coprocessor can be used efficiently.
     areas between on-chip and off-chip memory, it is required to
     profile the accesses to each data area of the decoder. Since a
     data cache is instantiated, accesses to these memories only                                             Million

     happen if cache misses occur. Therefore, the cache misses have
     been analyzed separately for each data area in the code
     including global variables, heap variables and the stack. Data
     areas with many cache misses are stored in on-chip memory.                                              10-
                                                                                                              14                                                       M Paaee Caclto
        3) Hardware/Software Partitioning
         In order to further increase the system efficiency and
     decrease power consumption and hardware costs, the CPU can
     be enhanced by coprocessors. Again, the hot spots in the
     software code should be considered, namely the loop filter, the                                                         SW                 HWwith CPU LD/ST        HWwith DMA
     motion compensation and the integer transformation. These are                                       Figure 10. Clock cycle comparison of different deblocking implementations
     the foremost candidates for hardware implementation. All these
     components are rather demanding on an arithmetical than on a
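The effect reported for Figure 10 can be made concrete with a small back-of-the-envelope calculation. The sketch below uses only numbers quoted in the paper: about 340 32-bit words transferred per macroblock, and the deblocking cycle counts of Table IV (3000-7000 cycles in software, 232 cycles in hardware, memory transfers excluded). It computes the break-even transfer cost per word, i.e. the cost above which processor-driven load/store transfers consume the whole gain of the accelerator. The function names and the additive cost model (filter cycles plus transfer cycles, no overlap) are illustrative assumptions, not part of the profiler or of the measurements.

```python
# Back-of-the-envelope model. Assumption: accelerator cycles and transfer
# cycles simply add up (no overlap). All names are illustrative.

WORDS_PER_MACROBLOCK = 340        # approx. 32-bit words moved per macroblock (text)
HW_FILTER_CYCLES = 232            # hardware deblocking, transfers excluded (Table IV)
SW_DEBLOCK_CYCLES = (3000, 7000)  # software deblocking range (Table IV)


def break_even_cycles_per_word(sw_cycles: int,
                               hw_cycles: int = HW_FILTER_CYCLES,
                               words: int = WORDS_PER_MACROBLOCK) -> float:
    """Transfer cost per word at which HW plus transfers equals the SW cycle count."""
    return (sw_cycles - hw_cycles) / words


def accelerated_cycles(cycles_per_word: float,
                       hw_cycles: int = HW_FILTER_CYCLES,
                       words: int = WORDS_PER_MACROBLOCK) -> float:
    """Total cycles per macroblock when transfers are not overlapped with filtering."""
    return hw_cycles + cycles_per_word * words


if __name__ == "__main__":
    for sw in SW_DEBLOCK_CYCLES:
        limit = break_even_cycles_per_word(sw)
        print(f"SW = {sw} cycles: break-even at {limit:.1f} cycles per word")
```

Under this model the break-even lies at roughly 8 to 20 cycles per transferred word. The per-word cost of the actual system is not stated in the text, and because the filter strength calculation stays in software the real margin is smaller still, so the sketch only illustrates why slow processor-driven transfers can plausibly cancel the gain.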



Authorized licensed use limited to: King Mongkuts Institute of Technology Ladkrabang. Downloaded on November 27, 2009 at 04:48 from IEEE Xplore. Restrictions apply.
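The partitioning of data areas between on-chip and off-chip memory described in the cache analysis (data areas with many cache misses go into the 32 kByte on-chip SRAM) can be sketched as a simple selection problem. The example below is a minimal sketch, not the procedure used by the authors: it assumes a per-area profile consisting of a size and a cache-miss count, and greedily picks the areas with the most misses per byte until the SRAM is full. The area names, sizes and miss counts are invented placeholders; only the 32 kByte capacity comes from the text.

```python
from dataclasses import dataclass

SRAM_BYTES = 32 * 1024  # on-chip SRAM capacity quoted in the text


@dataclass(frozen=True)
class DataArea:
    name: str          # hypothetical label for a global, heap or stack area
    size_bytes: int    # footprint of the area
    cache_misses: int  # data-cache misses attributed to the area by profiling


def select_on_chip(areas: list[DataArea], capacity: int = SRAM_BYTES) -> list[DataArea]:
    """Greedy heuristic: prefer the areas with the most cache misses per byte."""
    ranked = sorted(areas, key=lambda a: a.cache_misses / a.size_bytes, reverse=True)
    chosen, free = [], capacity
    for area in ranked:
        if area.size_bytes <= free:  # skip areas that no longer fit
            chosen.append(area)
            free -= area.size_bytes
    return chosen


# Invented profile data, for illustration only.
profile = [
    DataArea("lookup_tables", 8 * 1024, 400_000),
    DataArea("stack", 4 * 1024, 20_000),
    DataArea("reference_frame", 112 * 1024, 900_000),
    DataArea("coefficient_buffer", 16 * 1024, 300_000),
]

on_chip = select_on_chip(profile)
print([a.name for a in on_chip])
```

A greedy ranking is not optimal in general, since the underlying problem is a knapsack problem, but it shows how per-area miss counts from the profiler can be turned into a placement decision.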
C. Hardware/Software System Implementation
The profiling and implementation results of the previous sections lead to a mixed hardware/software implementation of the video decoder, which is given in Figure 11. An application processor is extended with a companion chip for acceleration of the video decoding. The companion chip contains the hardware accelerators for H.264/AVC decoding. Table IV shows a comparison of the required cycle times of the accelerators with their software counterparts.

TABLE IV. COMPARISON OF THE EXECUTION TIME IN HARDWARE AND SOFTWARE

Implementation    Deblocking          Pixel Interpolation    Inverse Transform
Software          3000-7000 cycles    100-700 cycles         320 cycles
Hardware          232 cycles          16-34 cycles           30 cycles

a. Memory transfers are not included in these cycle counts.

Furthermore, a so-called SIMD engine is available on the chip, which is a 32-bit RISC processor enhanced with special SIMD instructions. The 32-bit system bus connecting the processor core with the main memory and coprocessor components is augmented with a DMA controller, which supports the main processor by performing the memory transfers to the coprocessor units. A video output unit is implemented, directly driving a connected display or video DAC. To avoid a heavy load on the mentioned system bus due to transfers from a frame buffer to the video output interface, an extra frame buffer memory and the video output unit are placed on a separate video bus system. The data transfers between these bus systems are also performed by the DMA controller. The main control functionality of the decoder can either be run on the application processor or on the RISC core on the companion chip.

Figure 11. SOC architecture of the DVB-H/DMB companion chip

To fully evaluate the proposed concept, the complete SOC architecture has been implemented as an ASIC design using UMC's L180 1P6M GII logic technology, see Figure 12. The maximum clock frequency of the design is 120 MHz, whereas 50 MHz should be sufficient for the DVB-H scenario. An evaluation board for the chip is currently under development. It allows full functional verification and, furthermore, exhaustive performance testing and power measurements, separately for memory, core and IO supply voltages.

Figure 12. ASIC layout

V. CONCLUSIONS AND FUTURE WORK
The design of an efficient system for applications with high demands on the real-time performance requires the selection of an appropriate system architecture and the incorporated hardware and software components. For this decision a detailed knowledge of the computational demands of the application is mandatory. Furthermore, for data intensive applications the influence of memory accesses also has to be taken into account. We presented a profiling tool which provides this information and have shown how it can be integrated in the design flow. The tool aids the designer in taking the right decision during each step of the design, including the hardware/software partitioning, the optimization of the components and the system scheduling. We have applied this methodology for the development of a software solution and a hardware/software system for real-time video decoding.

Our future work includes the retargeting of the profiler backend to other processors. Many processor simulators already offer profiling capabilities, e.g. the LisaTek tool suite; however, their results are not as detailed as the Memtrace results. Furthermore, we plan to integrate power models for cache and memory accesses and instruction execution in order to allow power consumption estimation. These models will be based on existing power models of caches and memories and on measurement results of the presented ASIC design.

REFERENCES
[1] RealView ARMulator ISS User Guide, Version 1.4, Ref: DUI0207C, January 2004, http://www.arm.com
[2] J. Bormans, K. Denolf, S. Wuytack, L. Nachtergaele, and I. Bolsens, "Integrating system-level low power methodologies into a real-life design flow," The Ninth Int. Workshop Power and Timing Modeling, Optimization and Simulation, pp. 19-28, Oct. 1999, Kos Island, Greece
[3] H. Hubert, B. Stabernack, and H. Richter, "Tool-Aided Performance Analysis and Optimization of an H.264 Decoder for Embedded Systems," The Eighth IEEE International Symposium on Consumer Electronics (ISCE 2004), Reading, England, Sept. 2004
[4] S. Ha, C. Lee, Y. Yi, S. Kwon, and Y.-P. Joo, "Hardware-software Codesign of Multimedia Embedded Systems: the PeaCE Approach," 12th IEEE Int. Conf. on Embedded and Real-Time Computing Systems and Applications, Sydney, Australia, Vol. 1, pp. 207-214, Aug. 2006
[5] International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-G050, March 2003




 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 

Kürzlich hochgeladen (20)

08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 

Performance and memory profiling for embedded system design

Heiko Hubert, Benno Stabernack, Kai-Immo Wels
Image Processing Department, Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut
Einsteinufer 37, 10587 Berlin, Germany
[huebert,stabernack,wels]@hhi.fraunhofer.de

Abstract- The design of embedded hardware/software systems is often subject to strict requirements concerning various aspects, including real-time performance, power consumption and die area. Especially for data-intensive applications, such as multimedia systems, the number of memory accesses is a dominant factor for these aspects. In order to meet the requirements and design a well-adapted system, the software parts need to be optimized and an adequate hardware architecture needs to be designed. For complex applications this design space exploration can be rather difficult and requires in-depth analysis of the application and its implementation alternatives. Tools are required which aid the designer in the design, optimization and scheduling of hardware and software. We present a profiling tool for fast and accurate performance and memory access analysis of embedded systems and show how it can be applied within the design flow. This concept has been proven in the design of a mixed hardware/software system for H.264/AVC video decoding.

Keywords- profiling, embedded hardware/software systems, design space exploration, scheduling

I. INTRODUCTION

The design of an embedded system often starts from a software description of the system in the C language. For example, the designer writes an executable specification based on a reference implementation of the application, e.g. from standardization organizations or the open-source community. This software code is often not optimized in any way, because it mainly serves the purpose of functional and conformance testing. Therefore it has to be transformed into an efficient system, including hardware and software components. The design of the system requires the following steps: system architecture design, hardware/software partitioning, software optimization, design of hardware accelerators and system scheduling. All these steps require detailed information about the performance of the different parts of the application. Besides the arithmetic demands of the application, memory accesses can have a huge influence on performance and power consumption. This is especially the case for data-intensive applications, such as multimedia systems, due to the huge amount of data to be transferred in these applications. The problem becomes even worse if the available data bandwidth is not used efficiently.

In order to reduce the overall data traffic, those parts of the code which require a high number of data transfers have to be identified and optimized. The above-mentioned applications contain up to 100,000 lines of source code. Therefore tools are required which help the designer identify the critical parts of the software. Several analysis tools exist, e.g. timing analysis is provided by gprof or VTune. Memory access analysis is part of the ATOMIUM [2] tool suite. However, all these tools provide only approximate results for either timing or memory accesses. A highly accurate memory analysis can be done with a hardware (HDL) simulator, if an HDL model of the processor is available. However, such an analysis implies a long simulation time.

In order to achieve a fast and accurate solution, we developed a specialized profiler, called Memtrace [3], for obtaining performance and memory access statistics. This paper describes the tool with all its features. We show how the provided profiling results can be used during the design and optimization of embedded hardware/software systems. As a case study, Memtrace is applied during the design of a mixed hardware/software system for H.264/AVC video decoding. Starting from a software implementation, it is shown how the software is optimized, how an efficient hardware architecture is developed, and how the system tasks are scheduled based on the profiling results.

II. MEMTRACE: A PERFORMANCE AND MEMORY PROFILER

A. Tool Architecture

Memtrace is a non-intrusive profiler which analyzes the memory accesses and real-time performance of an application without the need for instrumentation code. The analysis is controlled by information about variables and functions in the user application, which is automatically extracted from the application. Furthermore, the user can specify the system parameters, e.g. the processor type and the memory system. During the analysis, Memtrace utilizes the instruction set simulator ARMulator [1] for executing the application. The ARMulator provides Memtrace with the information required for the analysis, e.g. the program counter, the clock cycle counter and the memory accesses. Memtrace creates detailed results on memory accesses and timing for each function and variable in the code.

1-4244-0840-7/07/$20.00 ©2007 IEEE.
Authorized licensed use limited to: King Mongkuts Institute of Technology Ladkrabang. Downloaded on November 27, 2009 at 04:48 from IEEE Xplore. Restrictions apply.
Figure 1. Performance analysis tool: Memtrace profiles the performance and memory accesses of a user application.

B. Analysis Workflow

The performance analysis with Memtrace is carried out in three steps: the initialization, the performance analysis and the postprocessing of the results.

During initialization Memtrace extracts the names of all functions and variables of the application. During this process user variables and functions are separated from standard library functions, such as printf() or malloc(). This is achieved by comparing the symbol table of the executable with the ones of the user library and object files. The results are written to the analysis specification file. The specification file can be edited by the user, e.g. for adding user-defined memory areas, such as the stack and heap variables, for additional analysis. Furthermore the user can define a so-called "split function", which instructs Memtrace to produce snapshot results each time the "split function" is called. This can be used e.g. in video processing for generating separate profiling results for each processed frame. Additionally the user can control whether the analysis results of a function, e.g. clock cycles, should include the results of called functions (accumulated) or should only reflect the function's own results (self). Typically auxiliary functions, e.g. C library or simple arithmetic functions, are accumulated to the calling functions.

In the second step the performance analysis is carried out, based on the analysis specification and the system specification, as shown in Figure 1. The system specification includes the processor, cache and memory type definitions. The Memtrace backend connects to the instruction set simulator for the simulation of the user application and writes the analysis results of the functions and variables to files, see Section II.C for more details. If a "split function" has been specified, these files include tables for each call of the "split function". TABLE I shows exemplary results for function profiling. The output files serve as a database for the third step, where user-defined data is extracted from these tables.

TABLE I. EXEMPLARY RESULT TABLE FOR FUNCTIONS

f    ca   cyl   ls   ld   l8   st   s8   pm   cm   BI    BC    BD
f1   8    215   75   22   7    52   3    42   5    123   92    0
f2   2    295   39   35   3    14   9    17   9    55    153   87
f3   2    432   78   68   4    10   2    31   17   143   289   0

Abbreviations are: f: function; ca: calls; cyl: bus (clock) cycles; ls: all load/store accesses from the core; ld: all loads; l8: byte and half-word loads; st: all stores; s8: byte and half-word stores; pm: page misses; cm: cache misses; BI: bus idle cycles; BC: core bus cycles; BD: DMA bus cycles.

In the third step a postprocessing of the results can be performed. Memtrace allows the generation of user-defined tables, which contain specific results of the analysis, e.g. the load memory accesses for each function. Furthermore the results of several functions can be accumulated in groups for comparing the results of entire application modules. The user-defined tables are written to files in a tab-separated format. Thus they can be further processed, e.g. by spreadsheet programs for creating diagrams.

C. Tool Backend - Interface to the ISS

Memtrace communicates with the Instruction Set Simulator (ISS) via its backend, as depicted in Figure 2. The backend is implemented as a dynamic link library (DLL), which connects to the ISS. Currently only the ARM instruction set simulator ARMulator is supported. The backend is automatically called by the ISS during simulation. During the startup phase, the backend creates a list of all functions and marks the user and split functions found in the analysis specification file. For each function a data structure is created, which contains the function's start address and variables for collecting the analysis results. Finally two pointers, called currentFunction and evaluatedFunction, are initialized. The first pointer indicates the currently executed function. If this function should not be evaluated, the second pointer indicates the calling function, to which the results of the current function should be added.

Figure 2. Interface between Memtrace backend and the ISS

Each time the program counter changes, Memtrace checks whether the program execution has changed from one function to another. If so, the cycle count of the evaluatedFunction is recalculated and the call count of the currentFunction is incremented. Finally the pointers currentFunction and evaluatedFunction are updated. If currentFunction is a split function, the differential results from the last call of this function up to the current call are printed to the result files.
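The pointer handling described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not Memtrace's actual source: the FunctionInfo and Tracker types, the on_pc_change() callback and the rule that a call is detected when the program counter equals a function's start address are hypothetical. Only the names currentFunction and evaluatedFunction are taken from the paper. The sketch keeps no call stack, so it only covers the simple case where a non-evaluated helper is called directly from an evaluated function.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-function record; the field names are illustrative. */
typedef struct FunctionInfo {
    uint32_t start;      /* first address of the function */
    uint32_t end;        /* one past the last address of the function */
    int      evaluated;  /* 1: report own results, 0: add results to the caller */
    uint64_t calls;      /* number of detected entries into the function */
    uint64_t cycles;     /* clock cycles charged to this function */
} FunctionInfo;

typedef struct Tracker {
    FunctionInfo *funcs;
    size_t        count;
    FunctionInfo *currentFunction;    /* function currently executing */
    FunctionInfo *evaluatedFunction;  /* function that receives the results */
    uint64_t      lastCycle;          /* cycle counter at the last function change */
} Tracker;

/* Linear search for the function containing pc; returns NULL if none matches. */
static FunctionInfo *find_function(Tracker *t, uint32_t pc)
{
    for (size_t i = 0; i < t->count; i++) {
        if (pc >= t->funcs[i].start && pc < t->funcs[i].end)
            return &t->funcs[i];
    }
    return NULL;
}

/* Assumed to be invoked by the simulator whenever the program counter changes. */
void on_pc_change(Tracker *t, uint32_t pc, uint64_t cycleCounter)
{
    FunctionInfo *f = find_function(t, pc);
    if (f == NULL || f == t->currentFunction)
        return;  /* still inside the same function, or an unknown address */

    /* Charge the cycles since the last change to the evaluated function. */
    if (t->evaluatedFunction != NULL)
        t->evaluatedFunction->cycles += cycleCounter - t->lastCycle;
    t->lastCycle = cycleCounter;

    /* Assumption: entering at the start address is a call, anything else a return. */
    if (pc == f->start)
        f->calls++;

    t->currentFunction = f;
    if (f->evaluated)
        t->evaluatedFunction = f;  /* otherwise results keep going to the caller */
}
```

In this sketch a helper marked as not evaluated never becomes the evaluatedFunction, so the cycles spent inside it are added to its caller, which corresponds to the "accumulated" mode described in Section II.B.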
For each access that occurs on the data bus (to the data cache or TCM), the memory access counters of the evaluatedFunction are incremented. Depending on the information provided by the ARMulator, it is decided whether a load or store access was performed, and which bit width (8/16 or 32 bit) was used. Furthermore the ARMulator indicates if a cache miss occurred. Page hits and misses are calculated by comparing the address of the current memory access with that of the previous one and incorporating the page structure of the memory.

For each bus cycle (on the external memory bus) Memtrace checks if it was an idle cycle, a core access or a DMA access and increments the appropriate counter of the evaluatedFunction.

At the end of the simulation the results of the last evaluatedFunction are updated, and the results of the last call of the split function and the accumulated results are printed to the result files.

D. Memtrace Frontend

Memtrace comes with two frontends, a command-line interface and a graphical user interface (GUI). The command-line interface is well suited for use in batch files, for example for profiling a set of system configurations or input data. The GUI version allows easy and fast access to all features of the tool. The GUI version is especially helpful for the quick generation of result diagrams.

Figure 3. Memtrace GUI frontend

E. Portability to other Processor Architectures

The current version of Memtrace is only targeted to the ARM processor family, as it uses the ISS from ARM (ARMulator). However, the interface of the profiler, as described before, is rather simple and could be ported to other processor architectures if an instruction set simulator is available which allows debugging access to its memory busses. Our plans for future work include Memtrace backends for other processor architectures.

As long as other backends are not available, the ARM-based profiling results may serve as a rough estimate for the results on other RISC processor architectures. Since all processors of the ARM family can be profiled, a wide variety of architectural features is covered, including variations of pipeline length, instruction bit width, availability of DSP/SIMD instructions, MMUs, cache size and organization, tightly coupled memories, bus width and detailed memory timing options. For a profiling estimate of a non-ARM processor, an ARM processor with a similar feature set should be chosen. TABLE II lists common embedded processors which have similarities with ARM processors. They have a basic feature set in common, including a 32-bit Harvard architecture with caches, a 5- to 8-stage pipeline and a RISC instruction set. It has to be mentioned, however, that some of the processors provide specific features which may have a significant influence on the performance, for example the custom instruction extensions of ARC and Tensilica Xtensa processors.

TABLE II. 32-BIT EMBEDDED RISC PROCESSORS

Processor                Pipeline     Registers (a)  Instr./Data cache (a)  Instr./Data TCM (a)  Special features
ARM9E                    5 stages     16             128k/128k              yes/yes              coprocessor interface
ARM11                    8 stages     16             64k/64k                yes/yes              SIMD, branch prediction, 64-bit bus, coprocessor interface
ARC600                   5 stages     32 (-60)       32k/32k                512k/16k             custom instructions, extended register file
ARC700                   7 stages     32 (-60)       64k/64k                512k/256k            custom instructions, branch prediction, extended register file, 64-bit bus
Tensilica Xtensa7        5 stages     64 or more     32k/32k                256k/256k            custom instructions, windowed registers, up to 128-bit bus
Tensilica Diamond 232L   5 stages     32             16k/16k                -                    windowed registers
LatticeMico32            6 stages     32             32k/32k                -                    -
Altera NIOS II           5-6 stages   32             64k/64k                yes/yes              direct-mapped cache, custom instructions
Xilinx MicroBlaze v5     5 stages     32             64k/64k                yes/yes              direct-mapped cache, coprocessor interface
MIPS 4KE                 5 stages     32             64k/64k                yes/yes              coprocessor interface
openRISC OR1200          5 stages     32             64k/64k                -                    direct-mapped cache, custom instructions
LEON3                    7 stages     520            1M/1M                  yes/yes              windowed registers, coprocessor interface

(a) Many features are customizable; the maximum value is given.

III. MEMTRACE WITHIN THE DESIGN FLOW

This chapter describes how the profiler can be applied during the design of embedded systems. Figure 4 shows a typical design flow for such hardware/software systems. Starting from a functionally verified system description in software, this software is profiled with an initial system specification, in order to measure the performance and see whether the (real-time) requirements are met. If not, an iterative cycle of software and hardware partitioning, optimization and scheduling starts. In this process detailed profiling results are crucial for all steps in the design cycle.
Figure 4. Typical embedded system design flow

A. Hardware/Software Partitioning and Design Space Exploration

To define a starting point for the system architecture, an initial design space exploration should be performed. This includes a variation of the following parameters:

* processor type
* cache size and organization
* tightly coupled memories
* bus timing
* external memory system and timing (DRAM, SRAM)
* hardware accelerators, DMA controller

Memtrace can be run in batch mode, so that different system configurations can be tested and profiled. Thus the influence of the system architecture on the performance can be evaluated. This initial profiling also reveals the hot-spots of the software. The most time-consuming functions are good candidates for either software optimization or hardware acceleration. Computationally intensive functions in particular are well suited for hardware acceleration in a coprocessor. With the support of a DMA controller even the burden of data transfers can be taken from the processor. Control-intensive functions are better suited for software implementation, as a hardware implementation would lead to a complex state machine, which requires a long design time and often does not allow parallelization. In order to get a first idea of the influence of hardware acceleration, a factor (based on an educated guess) can be defined for each hardware candidate function. This factor is used by Memtrace to adjust the original profiling results.

B. Software Profiling and Optimization

After a partitioning into hardware and software is found, the software part can be optimized. Numerous techniques exist that can be applied for optimizing software, such as loop unrolling, loop invariant code motion, common subexpression elimination or constant folding and propagation. For computationally intensive parts, arithmetic optimizations or SIMD instructions can be applied, if such instructions are available in the processor. If the performance of the code is significantly influenced by memory accesses, as is mainly the case in video applications, the number of accesses has to be reduced or they have to be accelerated. The profiler gives a detailed overview of the memory accesses and thereby allows identifying their influence. One optimization mechanism is the conversion of byte (8-bit) to word (32-bit) memory accesses. This can be applied if adjacent bytes in memory are required concurrently or within a short time period, for example pixel data of an image during image processing. A further mechanism is the usage of tightly coupled memories (TCMs) for storing frequently used data. For finding the most frequently accessed data area, the memory access statistics of Memtrace can be used. In [1] these techniques are described in more detail.

C. Hardware/Software Profiling and Scheduling

Besides the software profiling and optimization, a system simulation including the hardware accelerators needs to be carried out in order to evaluate the overall performance. Usually hardware components are developed in a hardware description language (HDL) and tested with an HDL simulator. This task requires long development and simulation times. Therefore HDL modelling is not suitable for the early design cycles, where exhaustive testing of different design alternatives is important. Furthermore, if the system performance is data dependent, a large set of input data should also be tested to get reliable profiling results. Therefore, a simulation and profiling environment is required which allows short modification and simulation times.

For this purpose, we used the instruction set simulator and extended it with simulators for the hardware components of the system. The ARMulator provides an extension interface, which allows the definition of a system bus and peripheral bus components. It already comes with a bus simulator, which reflects the industry standard AMBA bus, and a timing model for access times to memory-mapped bus components, such as memories and peripheral modules, see Figure 5.

Figure 5. Environment for hardware/software cosimulation and profiling
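The conversion of byte accesses to word accesses mentioned under Software Profiling and Optimization can be illustrated with a small C sketch. The function names sum4_bytewise() and sum4_wordwise() are hypothetical and not taken from any decoder source. The example only assumes that four adjacent 8-bit pixels are needed at the same time. Whether the compiler actually emits a single 32-bit load for the memcpy() call depends on the target and optimization level, so the effect should be checked with the profiler.

```c
#include <stdint.h>
#include <string.h>

/* Baseline: four separate byte loads, one per pixel. */
uint32_t sum4_bytewise(const uint8_t *p)
{
    return (uint32_t)p[0] + p[1] + p[2] + p[3];
}

/* Word-wise variant: fetch the four adjacent pixels with one 32-bit access
 * and extract the bytes with shifts and masks in registers.
 * memcpy() is used so the code stays valid for unaligned pointers. */
uint32_t sum4_wordwise(const uint8_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    /* The sum of the four bytes does not depend on the byte order. */
    return (w & 0xFFu) + ((w >> 8) & 0xFFu) + ((w >> 16) & 0xFFu) + (w >> 24);
}
```

Both functions return the same value. The second one replaces four byte loads with one word load, which is the kind of change that shows up as a lower byte and half-word load count in the profiling tables.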
1) Coprocessors

We supplemented this system with a simple template for coprocessors, including local registers and memories and a cycle-accurate timing. The functionality of the coprocessor can be defined as standard C code, thus the software function can be simulated as a hardware accelerator by copying the software code to the coprocessor template. The timing parameter can be used to define the delay of the coprocessor between activation and result availability, i.e. the execution time of the task, as it would be in real hardware. This value can be obtained either from reference implementations found in the literature or from an educated guess of a hardware engineer. Furthermore, multiple hardware implementations of a task with different execution times (and hardware costs) are often possible. In the proposed profiling environment, a good trade-off between hardware cost and speed-up can be found quickly, simply by varying the timing parameter and viewing its influence on the overall performance.

2) DMA Controller

For data-intensive applications, data transfers have a tremendous influence on the overall performance. In order to efficiently outsource tasks to hardware accelerators, the burden of data transfer also has to be taken from the CPU. This job can be performed by a DMA controller. The Memtrace hardware profiling environment includes a highly efficient DMA controller with the following features:

* multi-channel (parameterizable number of channels)
* 1D and 2D transfers
* activation FIFO (non-blocking transfer, autonomous)
* internal memory for temporary storage between read and write
* burst transfer mode

Thus the designer can determine the influence of different DMA modes in order to find an appropriate trade-off between DMA controller complexity and required CPU activity.

3) Scheduling

After the software and hardware tasks have been defined, a scheduling of these tasks is required. For increasing the overall performance, a high degree of parallelization should be accomplished between hardware and software tasks. In order to find an appropriate scheduling for parallel tasks, the following information is required:

* dependencies between tasks
* the execution time of each task
* data transfer overhead

Especially for data-intensive applications, the overhead for data transfers can have a huge influence on the performance. It might even happen that the speed-up of a hardware accelerator is cancelled out by the overhead for transferring data to and from the accelerator.

The overhead for data transfers to the coprocessors depends on the bus usage. Furthermore, side effects on other functions may occur if the bus becomes congested or when cache flushing is required in order to ensure cache coherency. In order to find these side effects, detailed profiling of the system performance and the bus usage is required. Memtrace provides these results. For example, Figure 6 shows the bus usage for each function depending on the access time of the memory.

Figure 6. Bus usage for each function, depending on the memory type

4) HDL Simulation

In a later design phase, when the hardware/software partitioning is fixed and an appropriate system architecture is found, the hardware components need to be developed in a hardware description language and tested using an HDL simulator, such as Modelsim. Finally, the entire system needs to be verified including hardware and software components. For this purpose the instruction set simulator and the HDL simulator have to be connected. The codesign environment PeaCE [4] allows the connection of the Modelsim simulator and the ARMulator.

IV. APPLICATION EXAMPLE: H.264/AVC VIDEO DECODER FOR MOBILE TV TERMINALS

The proposed design methodology has been applied to the design of a video decoder as part of a mobile digital TV receiver. Starting from an executable specification of the video decoder, namely the (unoptimized) reference software, first a pure optimized software implementation and then an ASIC incorporating hardware accelerators and a customized processor have been developed.

A. DVB-H and H.264/AVC Video Compression

The receiver is compliant with DVB-H, which is a new standard for broadcasting digital audio and video content to mobile devices. The content is encoded using highly efficient compression methods, namely AAC-HE for audio data and the H.264/AVC [5] codec for video content. DVB-H focuses on high mobility and low power consumption of the receivers. The most demanding part of the receiver in terms of computational requirements is the H.264/AVC video decoder.

The H.264/AVC video compression standard is similar to its predecessors, however it adds various new coding features and refinements of existing mechanisms, which lead to a 2 to 3 times increased coding efficiency compared to MPEG-2. However, the computational demands and required data accesses have also increased significantly. Figure 7 depicts the block diagram of an H.264/AVC decoder.
Figure 7. Block diagram of an H.264/AVC decoder

The bitstream parsing and entropy decoding interpret the encoded symbols and are highly control-flow dominated. The symbols contain control information and data for the subsequent components. The inter and intra prediction modes are used to predict image data from previous frames or neighboring blocks, respectively. Both methods require filtering operations, with inter prediction being the more computationally demanding. The residuals of the prediction are received as transformed and quantized coefficients. The applied transformation, which can be considered a simplified discrete cosine transformation (DCT), is based on integer arithmetic and is computationally demanding. The reconstructed image is post-processed by a deblocking filter that reduces blocking artifacts at block edges. The deblocking filter includes the calculation of the filter strength, which is control-flow dominated, and the actual 3- to 5-tap filtering, which requires many arithmetic operations. Each of these components allows various modes of operation, which are chosen by symbols in the bitstream. This involves a high degree of control flow in the decoder.

The H.264/AVC baseline decoder has been profiled with Memtrace using a system specification typical for mobile embedded systems, comprising an ARM946E-S processor core, a data and an instruction cache (16 kByte each) and an external DRAM as main memory. The execution time of each module of the decoder has been evaluated as depicted in Figure 8. The results show that the distribution over the modules differs significantly between I- and P-frames. Whereas in I-frames the deblocking has the most influence on the overall performance, in P-frames the motion compensation is the dominant part.

Figure 8. Profiling results for the H.264/AVC software decoder

B. Design and Optimizations

Based on the acquired profiling results, several software and hardware architectural optimizations are applied. Our first target is a pure software version of the video decoder for the implementation of a DVB-H terminal on a PDA. In a second step an embedded hardware/software system is developed.

1) Software Implementation and Optimizations

Following Amdahl's law, those parts of the software that take up most of the execution time should be considered for optimization first. Figure 8 shows that motion compensation, loop filter, inverse transformation and memory-related functions are those candidates. The results for the functions belonging to the motion compensation show that the function motionCompChroma() requires the most execution time. This function performs the motion compensation for the chrominance pixels, which is mainly based on bilinear interpolation. Of the read memory accesses performed in motionCompChroma(), given in the second column of TABLE III, more than 30% are byte or halfword accesses (third column). This is because the pixel values have a size of one byte each.

Since the interpolation is applied iteratively on adjacent pixels, the source code can be optimized by reading 4 adjacent bytes at once. This reduces the execution time of the function by almost 30%. The speed-up of the function reduces the execution time for processing a P-frame by about 5%.

TABLE III. PROFILING RESULTS FOR THE MOTIONCOMPCHROMA() FUNCTION

                       Clock Cycles    All Load    Load 8/16
before optimization    13,149,109      309,368     104,784
after optimization      9,355,709      196,746      34,584

Further speed-up of the software could be achieved by applying well-known software optimization techniques and those proposed in [3] to the functions identified by the profiler. The resulting software decoder has been tested on an Intel PXA270-based PDA within the DVB scenario. The required processor clock frequency for H.264/AVC decoding is about 420 MHz (320x240 pixel resolution, 384 kBit/s).

Considering the dynamic power consumption of CMOS circuits, given in equation 1, the rather high system frequency leads to high power consumption.

$$P_{dynamic} = \sum_{k=1}^{M} C_k \cdot f_k \cdot V_{DD}^2 \qquad (1)$$

To achieve lower power consumption, methods need to be applied that allow the system frequency to be reduced, which in turn also allows a lower supply voltage (voltage scaling). Hardware accelerators can be used for this purpose. However, their influence on the capacitance has to be considered and reduced by mechanisms such as clock gating. Furthermore, the memory architecture needs to be adapted (reduced) to the specific application requirements.
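The percentages quoted for motionCompChroma() can be checked directly against the counters in TABLE III. The short Python sketch below does only this arithmetic. The counter values are copied from the table, while the dictionary keys and helper names are chosen here for illustration.

```python
# Counter values copied from TABLE III (motionCompChroma() profiling results).
before = {"cycles": 13_149_109, "loads": 309_368, "loads_8_16": 104_784}
after = {"cycles": 9_355_709, "loads": 196_746, "loads_8_16": 34_584}


def reduction(key):
    """Relative reduction of one counter after the optimization (0..1)."""
    return (before[key] - after[key]) / before[key]


def narrow_share(row):
    """Fraction of all loads that are byte or halfword accesses."""
    return row["loads_8_16"] / row["loads"]


for key in before:
    print(f"{key:>10}: reduced by {reduction(key):.1%}")
print(f"8/16-bit share of loads: {narrow_share(before):.1%} "
      f"-> {narrow_share(after):.1%}")
```

With these inputs the clock cycles drop by roughly 29%, in line with the "almost 30%" stated in the text, and the byte and halfword loads make up roughly 34% of all loads before the optimization, in line with "more than 30%". The narrow loads themselves fall by about two thirds, which is what reading 4 adjacent bytes at once would be expected to achieve.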
2) Memory System

Besides the processing power of the CPU, the memory and bus architecture determine the overall performance of the system. The most influential factors are the cache size and architecture, the speed and usage of a tightly coupled (on-chip) memory (TCM), the width of the memory bus, the bandwidth of the off-chip memory and a DMA controller. Adjusting these factors requires a trade-off between hardware cost, power consumption and performance. The H.264/AVC decoder has been simulated with different cache sizes in order to find an appropriate size for the DVB-H terminal scenario (QVGA image resolution). It has been evaluated how the required decoding time changes when either the instruction cache size or the data cache size is increased, see Figure 9.

Figure 9. Influence of the instruction (I) and data (D) cache sizes on the execution time of the H.264/AVC decoder. (Series: I = 4 kByte with D varied; I varied with D = 0 kByte.)

The results show that increasing the instruction cache size from 4 kByte up to 32 kByte has a minor influence on the overall performance. However, adding a data cache of 4 kByte to the system decreases the decoding time to less than 20% of the time required without a data cache. Further increasing the data cache size does not yield a dramatic performance increase. Therefore a data and instruction cache size of 4 kByte each is a good trade-off between performance and die area. The data cache increases the performance by decreasing the number of accesses to the external memory. This is especially efficient for data areas with frequent accesses to the same memory location, e.g. the stack. However, for randomly accessed data areas, e.g. lookup tables, a fast on-chip memory (SRAM) is more appropriate. As the H.264/AVC decoder requires about 1.1 MByte of data memory (at QVGA video resolution), only small parts of the used data structures (less than 3% with 32 kByte of SRAM) can be stored in the on-chip memory. Finding a useful partitioning of data areas between on-chip and off-chip memory requires profiling the accesses to each data area of the decoder. Since a data cache is instantiated, accesses to these memories only happen if cache misses occur. Therefore, the cache misses have been analyzed separately for each data area in the code, including global variables, heap variables and the stack. Data areas with many cache misses are stored in on-chip memory.

3) Hardware/Software Partitioning

In order to further increase the system efficiency and decrease power consumption and hardware costs, the CPU can be enhanced by coprocessors. Again, the hot spots in the software code should be considered, namely the loop filter, the motion compensation and the integer transformation. These are the foremost candidates for hardware implementation. All these components are demanding on the arithmetic level rather than on the control-flow level. Therefore they are well suited for hardware implementation as coprocessors, which can be controlled by the main CPU. In order to ease the burden of providing the coprocessors with data, a DMA controller can be applied, allowing memory transfers concurrent with the processing of the CPU. The coprocessors should be equipped with local memory that stores the input and output data for processing at least one macroblock at a time, which prevents fragmented DMA transfers. As the video data is stored in memory in a two-dimensional fashion, the DMA controller should feature 2D memory transfers. The output of the video data to a display, which is required by a DVB-H terminal, further aggravates the problem of the high amount of data transfers.

4) Hardware/Software Interconnection and Scheduling

After the software optimization has been performed and the hardware accelerators have been developed, the entire system has to be scheduled. The scheduling is static and controlled by the software. The hardware accelerators are introduced to the system step by step. Starting from the pure software implementation, the software functions are first replaced by their hardware counterparts. This also requires the transfer of input data to and output data from the coprocessors. These data transfers are at first executed by load-store operations of the processor and in a next step replaced by DMA transfers. This might also require flushing the cache or individual cache lines, which may decrease the performance of other software functions. In a final step the hardware tasks and software tasks are parallelized. All decisions taken in these steps are based on detailed profiling results.

The following example shows how the hardware accelerator for the deblocking is inserted into the software decoder. The hardware accelerator only includes the filtering process of the deblocking stage. The filter strength calculation is performed in software, because it is rather control intensive and therefore more suitable for software implementation. The filter processes the luminance and chrominance data for one macroblock at a time. It requires the pixel data and filter parameters as input and provides filtered image data as output, which sums up to about 340 32-bit words of data transfer. Figure 10 shows the results for the pure software implementation, for the filter accelerator with data transfers managed by the processor, and for the filter accelerator with data transfers handled by the DMA controller. As can be seen, if the data is transferred by the processor, the performance gain of the accelerator is cancelled out by the data transfers. Only in conjunction with the DMA controller can the coprocessor be used efficiently.

Figure 10. Clock cycle comparison of different deblocking implementations. (Bars: SW, HW with CPU LD/ST, HW with DMA; vertical axis in million clock cycles.)
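The effect described for the deblocking accelerator can be made concrete with a rough per-macroblock cycle model. The Python sketch below is an illustration only, not a reproduction of Figure 10. The 340-word transfer volume comes from the text and the hardware and software cycle counts come from TABLE IV, but the per-word copy cost and the DMA setup cost are hypothetical values chosen for this example. The model also assumes that DMA transfers overlap completely with other CPU work.

```python
WORDS_PER_MB = 340               # from the text: about 340 32-bit words
HW_FILTER_CYCLES = 232           # from TABLE IV: deblocking in hardware
SW_FILTER_CYCLES = (3000, 7000)  # from TABLE IV: deblocking in software

# Hypothetical costs, NOT taken from the paper:
CPU_CYCLES_PER_WORD = 12  # assumed load plus store over an uncached bus
DMA_SETUP_CYCLES = 60     # assumed CPU cost of programming the DMA channels


def hw_with_cpu_ldst():
    """CPU copies every word itself, then waits for the hardware filter."""
    return WORDS_PER_MB * CPU_CYCLES_PER_WORD + HW_FILTER_CYCLES


def hw_with_dma():
    """CPU only programs the DMA controller; transfers are assumed to overlap."""
    return DMA_SETUP_CYCLES + HW_FILTER_CYCLES


print("SW         :", SW_FILTER_CYCLES)
print("HW + LD/ST :", hw_with_cpu_ldst())  # prints 4312
print("HW + DMA   :", hw_with_dma())       # prints 292
```

Under these assumed costs the accelerator with processor-managed transfers lands inside the software range of 3000 to 7000 cycles, so its gain largely disappears, while the DMA variant stays well below it. The exact crossover depends entirely on the assumed per-word cost, which in practice would be taken from profiling.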
C. Hardware/Software System Implementation

The profiling and implementation results of the previous sections lead to a mixed hardware/software implementation of the video decoder, which is shown in Figure 11. An application processor is extended with a companion chip for acceleration of the video decoding. The companion chip contains the hardware accelerators for H.264/AVC decoding. TABLE IV compares the cycle counts required by the accelerators with those of their software counterparts.

TABLE IV. COMPARISON OF THE EXECUTION TIME IN HARDWARE AND SOFTWARE

Implementation    Deblocking          Pixel Interpolation    Inverse Transform
Software          3000-7000 cycles    100-700 cycles         320 cycles
Hardware          232 cycles          16-34 cycles           30 cycles

a. Memory transfers are not included in these cycle counts.

Furthermore, a so-called SIMD engine is available on the chip, which is a 32-bit RISC processor enhanced with special SIMD instructions. The 32-bit system bus connecting the processor core with the main memory and the coprocessor components is augmented with a DMA controller, which supports the main processor by performing the memory transfers to the coprocessor units. A video output unit is implemented that directly drives a connected display or video DAC. To avoid a heavy load on the system bus due to transfers from a frame buffer to the video output interface, an extra frame buffer memory and the video output unit are attached to a separate video bus system. The data transfers between these bus systems are also performed by the DMA controller. The main control functionality of the decoder can run either on the application processor or on the RISC core on the companion chip.

Figure 11. SOC architecture of the DVB-H/DMB companion chip

To fully evaluate the proposed concept, the complete SOC architecture has been implemented as an ASIC design using UMC's L180 1P6M GII logic technology, see Figure 12. The maximum clock frequency of the design is 120 MHz, whereas 50 MHz should be sufficient for the DVB-H scenario. An evaluation board for the chip is currently under development. It allows full functional verification as well as exhaustive performance testing and power measurements, separately for the memory, core and IO supply voltages.

Figure 12. ASIC layout

V. CONCLUSIONS AND FUTURE WORK

The design of an efficient system for applications with high demands on real-time performance requires the selection of an appropriate system architecture and of the incorporated hardware and software components. For this decision, detailed knowledge of the computational demands of the application is mandatory. Furthermore, for data-intensive applications the influence of memory accesses also has to be taken into account. We have presented a profiling tool which provides this information and have shown how it can be integrated into the design flow. The tool aids the designer in taking the right decision during each step of the design, including the hardware/software partitioning, the optimization of the components and the system scheduling. We have applied this methodology to the development of a software solution and a hardware/software system for real-time video decoding.

Our future work includes the retargeting of the profiler backend to other processors. Many processor simulators already offer profiling capabilities, e.g. the LisaTek tool suite; however, their results are not as detailed as the Memtrace results. Furthermore, we plan to integrate power models for cache and memory accesses and for instruction execution in order to allow power consumption estimation. These models will be based on existing power models of caches and memories and on measurement results of the presented ASIC design.

REFERENCES

[1] RealView ARMulator ISS User Guide Version 1.4, Ref: DUI0207C, January 2004, http://www.arm.com
[2] J. Bormans, K. Denolf, S. Wuytack, L. Nachtergaele, and I. Bolsens, "Integrating system-level low power methodologies into a real-life design flow," The Ninth Int. Workshop Power and Timing Modeling, Optimization and Simulation, pp. 19-28, Oct. 1999, Kos Island, Greece
[3] H. Hubert, B. Stabernack, and H. Richter, "Tool-Aided Performance Analysis and Optimization of an H.264 Decoder for Embedded Systems," The Eighth IEEE International Symposium on Consumer Electronics (ISCE 2004), Reading, England, Sept. 2004
[4] S. Ha, C. Lee, Y. Yi, S. Kwon, and Y.-P. Joo, "Hardware-software Codesign of Multimedia Embedded Systems: the PeaCE Approach," 12th IEEE Int. Conf. on Embedded and Real-Time Computing Systems and Applications, Sydney, Australia, Vol. 1, pp. 207-214, Aug. 2006
[5] International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-G050, March 2003