Performance and Memory Profiling for Embedded System Design
Heiko Hübert, Benno Stabernack, Kai-Immo Wels
Image Processing Department,
Fraunhofer Institute for Telecommunications, Heinrich-Hertz-Institut
Einsteinufer 37, 10587 Berlin, Germany
{huebert, stabernack, wels}@hhi.fraunhofer.de
Abstract— The design of embedded hardware/software systems is often subject to strict requirements concerning various aspects, including real-time performance, power consumption and die area. Especially for data intensive applications, such as multimedia systems, the number of memory accesses is a dominant factor for these aspects. In order to meet the requirements and design a well-adapted system, the software parts need to be optimized and an adequate hardware architecture needs to be designed. For complex applications this design space exploration can be rather difficult and requires in-depth analysis of the application and its implementation alternatives. Tools are required which aid the designer in the design, optimization and scheduling of hardware and software. We present a profiling tool for fast and accurate performance and memory access analysis of embedded systems and show how it can be applied within the design flow. This concept has been proven in the design of a mixed hardware/software system for H.264/AVC video decoding.

Keywords— profiling, embedded hardware/software systems, design space exploration, scheduling

I. INTRODUCTION

The design of an embedded system often starts from a software description of the system in C language. For example, the designer writes an executable specification based on a reference implementation of the application, e.g. from standardization organizations or the open-source community. This software code is often not optimized in any manner, because it mainly serves the purpose of functional and conformance testing. Therefore it has to be transformed into an efficient system, including hardware and software components. The design of the system requires the following steps: system architecture design, hardware/software partitioning, software optimization, design of hardware accelerators and system scheduling. All these steps require detailed information about the performance of the different parts of the application. Besides the arithmetical demands of the application, memory accesses can have a huge influence on performance and power consumption. This is especially the case for data intensive applications, such as multimedia systems, due to the huge amount of data to be transferred in these applications. This problem is increased even further if the given data bandwidth is not used efficiently.

In order to reduce the overall data traffic, those parts of the code which require a high amount of data transfers have to be identified and optimized. The above mentioned applications contain up to 100,000 lines of source code. Therefore tools are required which help the designer identify the critical parts of the software. Several analysis tools exist, e.g. timing analysis is provided by gprof or VTune. Memory access analysis is part of the ATOMIUM [2] tool suite. However, all these tools provide only approximate results for either timing or memory accesses. A highly accurate memory analysis can be done with a hardware (HDL) simulator, if an HDL model of the processor is available. However, such an analysis implies a long simulation time.

In order to achieve a fast and accurate solution, we developed a specialized profiler, called Memtrace [3], for obtaining performance and memory access statistics. This paper describes the tool with all its features. We show how the provided profiling results can be used during the design and optimization of embedded hardware/software systems. As a case study, Memtrace is applied during the design of a mixed hardware/software system for H.264/AVC video decoding. Starting from a software implementation, it is shown how the software is optimized, an efficient hardware architecture is developed, and the system tasks are scheduled based on the profiling results.

II. MEMTRACE: A PERFORMANCE AND MEMORY PROFILER

A. Tool Architecture

Memtrace is a non-intrusive profiler, which analyzes the memory accesses and real-time performance of an application without the need of instrumentation code. The analysis is controlled by information about variables and functions in the user application, which is automatically extracted from the application. Furthermore, the user can specify the system parameters, e.g. the processor type and the memory system. During the analysis, Memtrace utilizes the instruction set simulator ARMulator [1] for executing the application. The ARMulator provides Memtrace with the information required for the analysis, e.g. the program counter, the clock cycle counter and the memory accesses. Memtrace creates detailed results on memory accesses and timing for each function and variable in the code.
1-4244-0840-7/07/$20.00 ©2007 IEEE.
Authorized licensed use limited to: King Mongkuts Institute of Technology Ladkrabang. Downloaded on November 27, 2009 at 04:48 from IEEE Xplore. Restrictions apply.
[Figure: the application executable, the list of functions, the analysis specification (stack and variable locations) and the system specification (processor, caches, memory timing) are fed to the Memtrace frontend and backend around the instruction set simulator; example plots show per-function clock cycles and per-variable cache misses.]
Figure 1. Performance analysis tool: Memtrace profiles the performance and memory accesses of a user application.
B. Analysis Workflow

The performance analysis with Memtrace is carried out in three steps: the initialization, the performance analysis and the postprocessing of the results.

During initialization Memtrace extracts the names of all functions and variables of the application. During this process user variables and functions are separated from standard library functions, such as printf() or malloc(). This is achieved by comparing the symbol table of the executable with the ones of the user library and object files. The results are written to the analysis specification file. The specification file can be edited by the user, e.g. for adding user-defined memory areas, such as the stack and heap variables, for additional analysis. Furthermore the user can define a so-called "split function", which instructs Memtrace to produce snapshot results each time the "split function" is called. This can be used e.g. in video processing for generating separate profiling results for each processed frame. Additionally the user can control whether the analysis results, e.g. clock cycles, of a function should include the results of a called function (accumulated) or whether they should only reflect the function's own results (self). Typically auxiliary functions, e.g. C library or simple arithmetic functions, are accumulated to the calling functions.

In the second step the performance analysis is carried out, based on the analysis specification and the system specification, as shown in Figure 1. The system specification includes the processor, cache and memory type definitions. The Memtrace backend connects to the instruction set simulator for the simulation of the user application and writes the analysis results of the functions and variables to files; see chapter II.C for more details. If a "split function" has been specified, these files include tables for each call of the "split function". TABLE I. shows exemplary results for function profiling. The output files serve as a database for the third step, where user-defined data is extracted from these tables.

TABLE I. EXEMPLARY RESULT TABLE FOR FUNCTIONS

f    ca   cyl   ls   ld   l8   st   s8   pm   cm   BI    BC    BD
f1    8   215   75   22    7   52    3   42    5   123    92     0
f2    2   295   39   35    3   14    9   17    9    55   153    87
f3    2   432   78   68    4   10    2   31   17   143   289     0

Abbreviations are: f: function; ca: calls; cyl: bus (clock) cycles; ls: all load/store accesses from the core; ld: all loads; l8: byte and half-word loads; st: all stores; s8: byte and half-word stores; pm: page misses; cm: cache misses; BI: bus idle cycles; BC: core bus cycles; BD: DMA bus cycles.

In the third step a postprocessing of the results can be performed. Memtrace allows the generation of user-defined tables, which contain specific results of the analysis, e.g. the load memory accesses for each function. Furthermore the results of several functions can be accumulated in groups for comparing the results of entire application modules. The user-defined tables are written to files in a tab-separated format. Thus they can be further processed, e.g. by spreadsheet programs for creating diagrams.

C. Tool Backend Interface to the ISS

Memtrace communicates with the Instruction Set Simulator (ISS) via its backend, as depicted in Figure 2. The backend is implemented as a dynamic link library (DLL), which connects to the ISS. Currently only the ARM instruction set simulator ARMulator is supported. The backend is automatically called by the ISS during simulation. During the startup phase, the backend creates a list of all functions and marks the user and split functions found in the analysis specification file. For each function a data structure is created, which contains the function's start address and variables for collecting the analysis results. Finally two pointers, called currentFunction and evaluatedFunction, are initialized. The first pointer indicates the currently executed function and, if this function should not be evaluated, the second pointer indicates the calling function, to which the results of the current function should be added.

[Figure: the Memtrace backend DLL attached to the instruction set simulator, the system bus, the memory and bus timing model and the memories.]
Figure 2. Interface between the Memtrace backend and the ISS

Each time the program counter changes, Memtrace checks whether the program execution has changed from one function to another. If so, the cycle count of the evaluatedFunction is recalculated and the call count of the currentFunction is incremented. Finally the pointers to the currentFunction and evaluatedFunction are updated. If currentFunction is a split function, the differential results from the last call of this function up to the current call are printed to the result files.
For each access that occurs on the data bus (to the data cache or TCM), the memory access counters of the evaluatedFunction are incremented. Depending on the information provided by the ARMulator, it is decided whether a load or store access was performed and which bit width (8/16 or 32 bit) was used. Furthermore the ARMulator indicates if a cache miss occurred. Page hits and misses are calculated by comparing the address of the current with the previous memory access and incorporating the page structure of the memory.

For each bus cycle (on the external memory bus) Memtrace checks whether it was an idle cycle, a core access or a DMA access and increments the appropriate counter of the evaluatedFunction.

At the end of the simulation the results of the last evaluatedFunction are updated, and the results of the last call of the split function and the accumulated results are printed to the result files.

D. Memtrace Frontend

Memtrace comes with two frontends, a commandline interface and a graphical user interface (GUI). The commandline interface is very well suited for the usage in batch files, for example for performing a profiling run for a set of system configurations or input data. The GUI version allows an easy and fast access to all features of the tool. Especially for the quick generation of result diagrams the GUI version is very helpful.

[Figure: screenshot of the Memtrace graphical user interface.]
Figure 3. Memtrace GUI frontend

E. Portability to other Processor Architectures

The current version of Memtrace is only targeted to the ARM processor family, as it uses the ISS from ARM (ARMulator). However the interface of the profiler, as described before, is rather simple and could be ported to other processor architectures if an instruction set simulator is available which allows debugging access to its memory busses. Our plans for future work include Memtrace backends for other processor architectures.

As long as other backends are not available, the ARM-based profiling results may function as a rough estimation for the results on other RISC processor architectures. Since all processors of the ARM family can be profiled, a wide variety of architectural features is covered, including variations of pipeline length, instruction bit-width, availability of DSP/SIMD instructions, MMUs, cache size and organization, tightly coupled memories, bus width and detailed memory timing options. For a profiling estimation of a non-ARM processor an ARM processor with a similar feature set should be chosen. In TABLE II. a list of common embedded processors is given which have similarities with ARM processors. They have a basic feature set in common, including a 32-bit Harvard architecture with caches, a 5- to 8-stage pipeline and a RISC instruction set. It has to be mentioned, though, that some of the processors provide specific features which may have a significant influence on the performance, for example the custom instruction extensions of the ARC and Tensilica Xtensa processors.

TABLE II. 32-BIT EMBEDDED RISC PROCESSORS

Processor               Pipeline   Registers(a)  Instr./Data Cache, TCM(a)  Special Features
ARM9E                   5 stage    16            128k/128k, yes/yes         coprocessor interf.
ARM11                   8 stage    16            64k/64k, yes/yes           SIMD, branch pred., 64-bit bus, coprocessor interf.
ARC600                  5 stage    32 (~60)      32k/32k, 512k/16k          custom instr., extend. reg. file
ARC700                  7 stage    32 (~60)      64k/64k, 512k/256k         custom instr., branch pred., extend. reg. file, 64-bit bus
Tensilica Xtensa 7      5 stage+   64            32k/32k, 256k/256k         custom instr., windowed regs., up to 128-bit bus
Tensilica Diamond 232L  5 stage    32            16k/16k                    windowed regs.
LatticeMico32           6 stage    32            32k/32k                    -
Altera NIOS II          5-6 stage  32            64k/64k, yes/yes           direct-map. cache, custom instr.
Xilinx MicroBlaze v5    5 stage    32            64k/64k, yes/yes           direct-map. cache, coprocessor interf.
MIPS 4KE                5 stage    32            64k/64k, yes/yes           coprocessor interf.
openRISC OR1200         5 stage    32            64k/64k                    direct-map. cache, custom instr.
LEON3                   7 stage    520           1M/1M, yes/yes             windowed regs., coprocessor interf.

(a) Many features are customizable; given is the maximum value.

III. MEMTRACE WITHIN THE DESIGN FLOW

This chapter describes how the profiler can be applied during the design of embedded systems. Figure 4. shows a typical design flow for such hardware/software systems. Starting from a functionally verified system description in software, this software is profiled with an initial system specification, in order to measure the performance and see if the (real-time) requirements are met. If not, an iterative cycle of software and hardware partitioning, optimization and scheduling starts. In this process detailed profiling results are crucial for all steps in the design cycle.
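The page hit and miss classification described in section II — comparing the address of each access with the previous one under the memory's page structure — can be sketched as follows. This is a minimal illustration, assuming a power-of-two DRAM page (row) size of 4 KB as an example parameter, not a value from the tool:

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BITS 12u  /* assumed 4 KB DRAM page/row size */

typedef struct {
    uint32_t last_page;
    int      have_last;
    long     page_hits;
    long     page_misses;
} PageStats;

/* Classify one memory access: an access to the same DRAM page as the
 * previous access counts as a page hit, anything else (including the
 * very first access) as a page miss. */
void record_access(PageStats *s, uint32_t addr)
{
    uint32_t page = addr >> PAGE_BITS;
    if (s->have_last && page == s->last_page)
        s->page_hits++;
    else
        s->page_misses++;
    s->last_page = page;
    s->have_last = 1;
}
```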
[Figure: design flow loop — the system specification and software are profiled; if the requirements are not met, an iterative cycle of HW/SW partitioning, software optimization, hardware accelerator design and scheduling follows, resulting in the final system.]
Figure 4. Typical embedded system design flow

A. Hardware/Software Partitioning and Design Space Exploration

For the definition of a starting point of a system architecture an initial design space exploration should be performed. These steps include a variation of the following parameters:

* processor type
* cache size and organization
* tightly coupled memories
* bus timing
* external memory system and timing (DRAM, SRAM)
* hardware accelerators, DMA controller

Memtrace can be run in batch mode, and thus different system configurations can be tested and profiled. Thus the influence of the system architecture on the performance can be evaluated. This initial profiling also reveals the hot-spots of the software. The most time consuming functions are good candidates for either software optimization or hardware acceleration. Especially computation-intensive functions are well-suited for hardware acceleration in a coprocessor. With support of a DMA controller even the burden of data transfers can be taken from the processor. Control-intensive functions are better suited for software implementation, as a hardware implementation would lead to a complex state machine, which requires a long design time and often doesn't allow parallelization. In order to get a first idea of the influence of hardware acceleration, an estimated speed-up factor (a well-educated guess) can be defined for each hardware candidate function. This factor is used by Memtrace in order to manipulate the original profiling results.

B. Software Profiling and Optimization

After a partitioning in hardware and software is found, the software part can be optimized. Numerous techniques exist that can be applied for optimizing software, such as loop unrolling, loop-invariant code motion, common subexpression elimination or constant folding and propagation. For computation-intensive parts arithmetic optimizations or SIMD instructions can be applied, if such instructions are available in the processor. If the performance of the code is significantly influenced by memory accesses, as is mainly the case in video applications, the number of accesses has to be reduced or the accesses have to be accelerated. The profiler gives a detailed overview of the memory accesses and therewith allows identifying their influence. One optimization mechanism is the conversion of byte (8-bit) to word (32-bit) memory accesses. This can be applied if adjacent bytes in memory are required concurrently or within a short time period, for example pixel data of an image during image processing. A further mechanism is the usage of tightly coupled memories (TCMs) for storing frequently used data. For finding the most frequently accessed data areas, the memory access statistics of Memtrace can be used. In [1] these techniques are described in more detail.

C. Hardware/Software Profiling and Scheduling

Besides the software profiling and optimization, a system simulation including the hardware accelerators needs to be carried out in order to evaluate the overall performance. Usually hardware components are developed in a hardware description language (HDL) and tested with an HDL simulator. This task requires long development and simulation times. Therefore HDL modelling is not suitable for the early design cycles, where exhaustive testing of different design alternatives is important. Furthermore, if the system performance is data dependent, a huge set of input data should also be tested to get reliable profiling results. Therefore a simulation and profiling environment is required which allows short modification and simulation times.

For this purpose, we used the instruction set simulator and extended it with simulators for the hardware components of the system. The ARMulator provides an extension interface, which allows the definition of a system bus and peripheral bus components. It already comes with a bus simulator, which reflects the industry standard AMBA bus, and a timing model for access times to memory-mapped bus components, such as memories and peripheral modules, see Figure 5.

[Figure: cosimulation environment — the instruction set simulator with the Memtrace backend extended with coprocessor and DMA controller simulators on the system bus.]
Figure 5. Environment for hardware/software cosimulation and profiling
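The byte-to-word access conversion mentioned in the software optimization discussion above can be sketched as follows. This is an illustrative example with hypothetical helper names, not code from the decoder; real code must additionally guarantee the 4-byte alignment assumed here:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Four separate byte loads (the unoptimized access pattern). */
uint32_t sum4_bytewise(const uint8_t *p)
{
    return (uint32_t)p[0] + p[1] + p[2] + p[3];
}

/* One 32-bit load, unpacked in registers (the optimized pattern).
 * memcpy() expresses the word load portably; the sum is independent
 * of byte order. p is assumed to be 4-byte aligned. */
uint32_t sum4_wordwise(const uint8_t *p)
{
    uint32_t w;
    memcpy(&w, p, sizeof w);
    return (w & 0xffu) + ((w >> 8) & 0xffu)
         + ((w >> 16) & 0xffu) + (w >> 24);
}
```

Both functions compute the same result, but the second issues a single data bus access instead of four, which is exactly the reduction the profiler makes visible.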
1) Coprocessors
We supplemented this system with a simple template for coprocessors, including local registers and memories and a cycle-accurate timing. The functionality of the coprocessor can be defined as standard C code; thus a software function can be simulated as a hardware accelerator by copying the software code to the coprocessor template. The timing parameter can be used to define the delay of the coprocessor between activation and result availability, i.e. the execution time of the task as it would be in real hardware. This value can either be obtained from reference implementations found in literature or by an educated guess of a hardware engineer. Furthermore, often multiple hardware implementations of a task with different execution times (and hardware costs) are possible. In the proposed profiling environment, simply by varying the timing parameter and viewing its influence on the overall performance, a good trade-off between hardware cost and speed-up can be found quickly.

2) DMA Controller
For data intensive applications data transfers have a tremendous influence on the overall performance. In order to efficiently outsource tasks into hardware accelerators, the burden of data transfers also has to be taken from the CPU. This job can be performed by a DMA controller. The Memtrace hardware profiling environment includes a highly efficient DMA controller with the following features:

* multi-channel (parameterizable number of channels)
* 1D and 2D transfers
* activation FIFO (non-blocking transfer, autonomous)
* internal memory for temporary storage between read and write
* burst transfer mode

Thus the designer is enabled to determine the influence of different DMA modes in order to find an appropriate trade-off between DMA controller complexity and required CPU activity.

3) Scheduling
After the software and hardware tasks have been defined, a scheduling of these tasks is required. For increasing the overall performance a high degree of parallelization should be accomplished between hardware and software tasks. In order to find an appropriate scheduling for parallel tasks the following information is required:

* dependencies between tasks
* the execution time of each task
* data transfer overhead

Especially for data intensive applications the overhead for data transfers can have a huge influence on the performance. It might even happen that the speed-up of a hardware accelerator vanishes due to the overhead for transferring data to and from the accelerator.

The overhead for data transfers to the coprocessors is dependent on the bus usage. Furthermore, side effects on other functions may occur if bus congestion occurs or when cache flushing is required in order to ensure cache coherency. In order to find these side-effects, detailed profiling of the system performance and the bus usage is required. Memtrace provides these results; for example Figure 6. shows the bus usage for each function depending on the access time of the memory.

[Figure: per-function bus accesses and bus idle cycles for SRAM and DRAM memory.]
Figure 6. Bus usage for each function, depending on the memory type

4) HDL Simulation
In a later design phase, when the hardware/software partitioning is fixed and an appropriate system architecture is found, the hardware components need to be developed in a hardware description language and tested using an HDL simulator, such as Modelsim. Finally, the entire system needs to be verified including hardware and software components. For this purpose the instruction set simulator and the HDL simulator have to be connected. The codesign environment PeaCE [4] allows the connection of the Modelsim simulator and the ARMulator.

IV. APPLICATION EXAMPLE: H.264/AVC VIDEO DECODER FOR MOBILE TV TERMINALS

The proposed design methodology has been applied to the design of a video decoder as part of a mobile digital TV receiver. Starting from an executable specification of the video decoder, namely the (unoptimized) reference software, at first a pure, optimized software implementation and then an ASIC incorporating hardware accelerators and a customized processor have been developed.

A. DVB-H and H.264/AVC Video Compression

The receiver is compliant to DVB-H, which is a new standard for broadcasting digital audio and video content to mobile devices. The content is encoded using highly efficient compression methods, namely AAC-HE for audio data and the H.264/AVC [5] codec for video content. DVB-H focuses on high mobility and low power consumption of the receivers. The most demanding part of the receiver in terms of computational requirements is the H.264/AVC video decoder.

The H.264/AVC video compression standard is similar to its predecessors; however, it adds various new coding features and refinements of existing mechanisms, which lead to a 2 to 3 times increased coding efficiency compared to MPEG-2. However, the computational demands and required data accesses have also increased significantly. In Figure 7. the block diagram of an H.264/AVC decoder is depicted.
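The coprocessor template described above reduces a hardware task to a single activation-to-result delay. A minimal model of that behavior, with hypothetical names (Coproc, coproc_start(), coproc_wait()) and not taken from the actual template:

```c
#include <assert.h>

/* Illustrative latency model: a job started at cycle `now` is ready
 * at `now + latency`; the CPU either overlaps other work or stalls
 * until the result is available. */
typedef struct {
    long latency;   /* assumed activation-to-result delay in cycles */
    long ready_at;  /* cycle at which the current job completes */
} Coproc;

void coproc_start(Coproc *c, long now)
{
    c->ready_at = now + c->latency;
}

/* Returns the cycle at which the CPU can use the result. */
long coproc_wait(const Coproc *c, long now)
{
    return now >= c->ready_at ? now : c->ready_at;
}
```

Sweeping `latency` over the candidate implementations and re-running the profile reproduces the hardware-cost versus speed-up trade-off discussed in the coprocessor section.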
[Figure: decoder pipeline with bitstream parsing, entropy decoding, inverse transformation, inter/intra prediction with reference frame buffer, and deblocking filter.]
Figure 7. Block diagram of an H.264/AVC decoder

The bitstream parsing and entropy decoding interpret the encoded symbols and are highly control flow dominated. The symbols contain control information and data for the following components. The inter and intra prediction modes are used to predict image data from previous frames or neighboring blocks, respectively. Both methods require filtering operations, whereas the inter prediction is more computationally demanding. The residuals of the prediction are received as transformed and quantized coefficients. The applied transformation, which can be considered as a simplified discrete cosine transformation (DCT), is based on integer arithmetic and is computationally demanding. The reconstructed image is post-processed by a deblocking filter for reducing blocking artifacts at block edges. The deblocking filter includes the calculation of the filter strength, which is control flow dominated, and the actual 3- to 5-tap filtering, which requires many arithmetic operations. Each of these components allows various modes of operation, which are chosen by symbols in the bitstream. This involves a high degree of control flow in the decoder.

The H.264/AVC baseline decoder has been profiled with Memtrace using a system specification typical for mobile embedded systems, comprising an ARM946E-S processor core, a data and instruction cache (16 kB each) and an external DRAM as main memory. The execution time for each module of the decoder has been evaluated as depicted in Figure 8. The results show that the distribution over the modules differs significantly between I- and P-frames. Whereas in I-frames the deblocking has the most influence on the overall performance, in P-frames the motion compensation is the dominant part.

[Figure: execution time distribution over the decoder modules for I- and P-frames.]
Figure 8. Profiling results for the H.264/AVC software decoder

B. Design and Optimizations

Based on the acquired profiling results several software and hardware architectural optimizations are applied. Our first target is a pure software version of the video decoder for the implementation of a DVB-H terminal on a PDA. In a second step an embedded hardware/software system is developed.

1) Software Implementation and Optimizations
Following Amdahl's law, those parts of the software should be considered for optimization first which take up the most of the execution time. Figure 8. shows that motion compensation, loopfilter, inverse transformation and memory related functions are those candidates. Exploring the results of the functions corresponding to the motion compensation, it can be seen that the function motionCompChroma() requires the most execution time. This function performs the motion compensation for the chrominance pixels, which is mainly based on bilinear interpolation. Focusing on the read memory accesses performed in motionCompChroma(), as given in the second column of TABLE III., it shows that more than 30% are byte or half-word accesses (third column). This is due to the fact that the pixel values have the size of one byte each.

Since the interpolation is applied iteratively on adjacent pixels, the source code can be optimized by reading 4 adjacent bytes at once. This leads to a reduction of the execution time of the function by almost 30%. The speedup of the function leads to a reduction of the execution time for processing a P-frame by about 5%.

TABLE III. PROFILING RESULTS FOR THE MOTIONCOMPCHROMA() FUNCTION

                      Clock Cycles   All Loads   Loads 8/16
before optimization     13,149,109     309,368     104,784
after optimization       9,355,709     196,746      34,584

Further speed-up of the software could be achieved by applying well-known software optimization techniques and those proposed in [3] to the functions identified by the profiler. The resulting software decoder has been tested on an Intel PXA270-based PDA within the DVB scenario. The required processor clock frequency for H.264/AVC decoding is about 420 MHz (320x240 pixel resolution, 384 kBit/s). Considering the dynamic power consumption of CMOS circuits, given in equation (1), the rather high system frequency leads to a high power consumption.

P_dynamic = Σ_{k=1}^{M} C_k · f_k · V_DD²        (1)

For achieving a lower power consumption, methods need to be applied which allow the reduction of the system frequency, which in turn also allows a lower supply voltage (voltage scaling). Hardware accelerators can be used for this purpose. However, their influence on the capacitance has to be considered and reduced by mechanisms like clock gating. Furthermore the memory architecture needs to be adapted (reduced) to the specific application requirements.
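Equation (1) above sums C_k · f_k · V_DD² over the circuit parts. The following numeric sketch, with assumed example capacitances and frequencies (not measurements from the paper), shows that halving the clock while lowering V_DD from an assumed 1.2 V to 1.0 V cuts dynamic power by well over a factor of two:

```c
#include <assert.h>

/* Dynamic power according to equation (1):
 * P = sum over k of C_k * f_k * Vdd^2. */
double p_dynamic(const double *cap, const double *freq, int m, double vdd)
{
    double p = 0.0;
    for (int k = 0; k < m; k++)
        p += cap[k] * freq[k] * vdd * vdd;
    return p;
}
```

This quadratic dependence on the supply voltage is why frequency reduction combined with voltage scaling, as discussed above, pays off more than frequency reduction alone.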
2) Memory System
Besides the processing power of the CPU, the memory and bus architecture determine the overall performance of the system. Namely, the cache sizes and architectures, the speed and usage of a tightly coupled (on-chip) memory (TCM), the width of the memory bus, the bandwidth of the off-chip memory and a DMA controller are the most influential factors. Adjusting these factors requires a trade-off between hardware cost, power consumption and performance. The H.264/AVC decoder has been simulated with different cache sizes in order to find an appropriate size for the DVB-H terminal scenario (QVGA image resolution). It has been evaluated how the required decoding time changes when either the instruction cache size or the data cache size is increased, see Figure 9.

Figure 9. Influence of the instruction (I) and data (D) cache sizes on the execution time of the H.264/AVC decoder (two series: I = 4 kByte with varying D-cache, and varying I-cache with no D-cache).

The results show that increasing the instruction cache size from 4 kByte up to 32 kByte has a minor influence on the overall performance. However, adding a data cache of 4 kByte to the system decreases the decoding time to less than 20% of its previous value. Further increasing the data cache size does not yield a dramatic performance increase. Therefore, a data and instruction cache size of 4 kByte each is a good trade-off between performance and die area. The data cache increases the performance by decreasing the number of accesses to the external memory. This is especially efficient for data areas with frequent accesses to the same memory location, e.g. the stack. However, for randomly accessed data areas, e.g. lookup tables, a fast on-chip memory (SRAM) is more appropriate. As the H.264/AVC decoder requires about 1.1 MByte of data memory (at QVGA video resolution), only small parts of the used data structures (less than 3% with 32 kByte of SRAM) can be stored in the on-chip memory. In order to find a useful partitioning of data areas between on-chip and off-chip memory, the accesses to each data area of the decoder have to be profiled. Since a data cache is instantiated, accesses to these memories only happen if cache misses occur. Therefore, the cache misses have been analyzed separately for each data area in the code, including global variables, heap variables and the stack. Data areas with many cache misses are stored in on-chip memory.

3) Hardware/Software Partitioning
In order to further increase the system efficiency and decrease power consumption and hardware costs, the CPU can be enhanced by coprocessors. Again, the hot spots in the software code should be considered, namely the loop filter, the motion compensation and the integer transformation. These are the foremost candidates for hardware implementation. All these components are demanding rather on an arithmetical than on a control-flow level. Therefore they are well suited for hardware implementation as coprocessors, which can be controlled by the main CPU. In order to ease the burden of providing the coprocessors with data, a DMA controller can be applied, allowing memory transfers concurrently with the processing of the CPU. The coprocessors should be equipped with local memory for storing input and output data for processing at least one macroblock at a time, preventing fragmented DMA transfers. As the video data is stored in the memory in a two-dimensional fashion, the DMA controller should feature 2-D memory transfers. The output of the video data to a display, which is required by a DVB-H terminal, further increases the problem of the high amount of data transfers.

4) Hardware/Software Interconnection and Scheduling
After the software optimizations are performed and the hardware accelerators are developed, a scheduling of the entire system is required. The scheduling is static and controlled by the software. The hardware accelerators are introduced step by step into the system. Starting from the pure software implementation, at first the software functions are replaced by their hardware counterparts. This also requires the transfer of input data to and output data from the coprocessors. These data transfers are at first executed by load-store operations of the processor and in a next step replaced by DMA transfers. This might also require flushing the cache or cache lines, which may decrease the performance of other software functions. In a final step the parallelization of the hardware tasks and software tasks takes place. All decisions taken in these steps are based on detailed profiling results.

The following example shows how the hardware accelerator for the deblocking is inserted into the software decoder. The hardware accelerator only includes the filtering process of the deblocking stage; the filter strength calculation is performed in software, because it is rather control intensive and therefore more suitable for software implementation. The filter processes the luminance and chrominance data for one macroblock at a time. It requires the pixel data and filter parameters as input and provides filtered image data as output; this sums up to about 340 32-bit words of data transfer. Figure 10 shows the results for the pure software implementation, when using the filter accelerator with data transfers managed by the processor, and when additionally using the DMA controller. As can be seen, if data is transferred by the processor, the performance gain of the accelerator is cancelled out by the data transfers; only in conjunction with the DMA controller can the coprocessor be used efficiently.

Figure 10. Clock cycle comparison (in millions) of the deblocking implementations: pure software (SW), hardware with CPU load/store transfers (HW with CPU LD/ST), and hardware with DMA transfers (HW with DMA).
C. Hardware/Software System Implementation
The profiling and implementation results of the previous chapters lead to a mixed hardware/software implementation of the video decoder, which is given in Figure 11. An application processor is extended with a companion chip for acceleration of the video decoding. The companion chip contains the hardware accelerators for H.264/AVC decoding. Table IV shows a comparison of the required cycle counts of the accelerators with their software counterparts.

TABLE IV. COMPARISON OF THE EXECUTION TIME IN HARDWARE AND SOFTWARE

Implementation | Deblocking       | Pixel Interpolation | Inverse Transform
Software       | 3000-7000 cycles | 100-700 cycles      | 320 cycles
Hardware       | 232 cycles       | 16-34 cycles        | 30 cycles

a. Memory transfers are not included in these cycle counts.

Furthermore, a so-called SIMD engine is available on the chip, which is a 32-bit RISC processor enhanced with special SIMD instructions. The 32-bit system bus connecting the processor core with the main memory and coprocessor components is augmented with a DMA controller, which supports the main processor by performing the memory transfers to the coprocessor units. A video output unit is implemented, directly driving a connected display or video DAC. To avoid a heavy load on the mentioned system bus due to transfers from a frame buffer to the video output interface, an extra frame buffer memory and the video output unit are provided by a separate video bus system. The data transfers between these bus systems are also performed by the DMA controller. The main control functionality of the decoder can either be run on the application processor or on the RISC core on the companion chip.

Figure 11. SOC architecture of the DVB-H/DMB companion chip

To fully evaluate the proposed concept, the complete SOC architecture has been implemented as an ASIC design using UMC's L180 1P6M GII logic technology, see Figure 12. The maximum clock frequency of the design is 120 MHz, whereas 50 MHz should be sufficient for the DVB-H scenario. An evaluation board for the chip is currently under development. It allows the fully functional verification and furthermore exhaustive performance testing and power measurements, separately for memory, core and IO supply voltages.

Figure 12. ASIC layout

V. CONCLUSIONS AND FUTURE WORK
The design of an efficient system for applications with high demands on the real-time performance requires the selection of an appropriate system architecture and the incorporated hardware and software components. For this decision a detailed knowledge of the computational demands of the application is mandatory. Furthermore, for data intensive applications the influence of memory accesses also has to be taken into account. We presented a profiling tool which provides this information and have shown how it can be integrated in the design flow. The tool aids the designer in taking the right decision during each step of the design, including the hardware/software partitioning, the optimization of the components and the system scheduling. We have applied this methodology for the development of a software solution and a hardware/software system for real-time video decoding.

Our future work includes the retargeting of the profiler backend to other processors. Many processor simulators already offer profiling capabilities, e.g. the LisaTek tool suite; however, their results are not as detailed as the Memtrace results. Furthermore, we plan to integrate power models for cache and memory accesses and instruction execution in order to allow power consumption estimation. These models will be based on existing power models of caches and memories and on measurement results of the presented ASIC design.

REFERENCES
[1] RealView ARMulator ISS User Guide, Version 1.4, Ref: DUI0207C, January 2004, http://www.arm.com
[2] J. Bormans, K. Denolf, S. Wuytack, L. Nachtergaele, and I. Bolsens, "Integrating system-level low power methodologies into a real-life design flow," The Ninth Int. Workshop on Power and Timing Modeling, Optimization and Simulation, pp. 19-28, Oct. 1999, Kos Island, Greece.
[3] H. Hubert, B. Stabernack, and H. Richter, "Tool-Aided Performance Analysis and Optimization of an H.264 Decoder for Embedded Systems," The Eighth IEEE International Symposium on Consumer Electronics (ISCE 2004), Reading, England, Sept. 2004.
[4] S. Ha, C. Lee, Y. Yi, S. Kwon, and Y.-P. Joo, "Hardware-software Codesign of Multimedia Embedded Systems: the PeaCE Approach," 12th IEEE Int. Conf. on Embedded and Real-Time Computing Systems and Applications, Sydney, Australia, Vol. 1, pp. 207-214, Aug. 2006.
[5] International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC), Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG, JVT-G050, March 2003.