5. Why a DSP?
It’s easy: we want an architecture optimized for Digital
Signal Processing
Some versions are further optimized for some specific
applications
- e.g. very low power consumption for mobile phones
6. What is the difference between a DSP and a
general-purpose processor? (1/4)
Memory architecture and bus
The first processors (in the 1940s) had a Harvard
architecture: separate memories for program and data
But it's complex -> soon replaced by the Von Neumann
architecture: program and data share a single memory
(an instruction has two fields: operation and data)
Problem: the processor cannot access instructions and
data simultaneously
To improve performance: Harvard architecture again!
In particular
- separate memories and buses for program and data
- possibly, another separate bus for the DMA
7. What is the difference between a DSP and a
general-purpose processor? (2/4)
A DSP is often used to realize a linear filter
The convolution integral
is actually a sum:
$y_n = \sum_i x_{n-i} h_i$
- if the number of terms in the sum is finite: FIR filter (finite impulse
response),
- otherwise: IIR filter (infinite impulse response),
- which can be implemented using two finite sums (the second one starts at i = 1):
$y_n = \sum_i x_{n-i} b_i + \sum_{i \ge 1} y_{n-i} a_i$
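The two filter forms above can be sketched in a few lines. This is only an illustration, not how a DSP implements them: the function names `fir_filter` and `iir_filter`, the zero initial conditions and the convention that `a[0]` multiplies y[n-1] are assumptions made for this example.

```python
def fir_filter(x, h):
    """FIR: y[n] = sum over i of x[n-i] * h[i], taking x[k] = 0 for k < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for i, h_i in enumerate(h):
            if n - i >= 0:
                acc += x[n - i] * h_i
        y.append(acc)
    return y


def iir_filter(x, b, a):
    """IIR: y[n] = sum_i x[n-i]*b[i] + sum_{i>=1} y[n-i]*a[i-1].

    Assumption of this sketch: a[0] is the coefficient of y[n-1].
    """
    y = []
    for n in range(len(x)):
        acc = 0.0
        # Feed-forward part: a finite sum over past inputs.
        for i, b_i in enumerate(b):
            if n - i >= 0:
                acc += x[n - i] * b_i
        # Feedback part: a finite sum over past outputs.
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                acc += y[n - i] * a_i
        y.append(acc)
    return y


impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
print(fir_filter(impulse, [0.5, 0.25]))   # [0.5, 0.25, 0.0, 0.0, 0.0]
print(iir_filter(impulse, [1.0], [0.5]))  # [1.0, 0.5, 0.25, 0.125, 0.0625]
```

The FIR response to an impulse dies out after as many samples as there are coefficients, while the IIR response keeps halving forever even though both sums are finite.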
8. What is the difference between a DSP and a
general-purpose processor? (3/4)
A common operation in a FIR or IIR filter is A=BC+D: we
need
- a hardware multiplier (introduced in DSPs in the 1970s)
- a multiply and accumulate in only one clock cycle: MAC
instruction.
In practice, the MAC sits inside a loop: we also need a
zero-overhead loop:
- H/W for address generation (the access to memory is
not random)
- loop management
- auto-increment; circular addressing
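To make the role of the address-generation hardware concrete, here is a minimal software model of a FIR step built from MAC operations on a circular delay line. The class name `CircularFir` and its layout are assumptions of this sketch. On a DSP, the index update with wrap-around and the loop counter are handled by dedicated hardware at no cycle cost; here they are ordinary statements.

```python
class CircularFir:
    """FIR filter whose delay line is a circular buffer (software model)."""

    def __init__(self, taps):
        self.taps = list(taps)
        self.delay = [0.0] * len(self.taps)
        self.head = 0  # index of the newest sample in the delay line

    def step(self, sample):
        n = len(self.taps)
        # Circular addressing: move the head back by one, wrapping around,
        # and overwrite the oldest sample with the new one.
        self.head = (self.head - 1) % n
        self.delay[self.head] = sample
        acc = 0.0
        idx = self.head
        for coeff in self.taps:
            acc += coeff * self.delay[idx]  # one multiply-accumulate (MAC)
            idx = (idx + 1) % n             # auto-increment with wrap
        return acc


fir = CircularFir([1.0, 2.0, 3.0])
print([fir.step(s) for s in [1.0, 0.0, 0.0, 0.0]])  # [1.0, 2.0, 3.0, 0.0]
```

No sample is ever moved in memory: only the head index changes, which is why circular addressing pairs so well with a single-cycle MAC.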
Other possible H/W:
- H/W saturation
- Instructions to perform a division quickly
- Bit reversal for FFT
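Two of the hardware features listed above are easy to model in software. The sketch below assumes 16-bit signed data; the helper names `saturating_add16`, `wrapping_add16` and `bit_reverse` are made up for this illustration. It contrasts saturation with ordinary wrap-around, and shows the bit-reversed index order used by radix-2 FFTs.

```python
INT16_MAX = 32767
INT16_MIN = -32768


def saturating_add16(a, b):
    """Add two 16-bit signed values, clamping the result on overflow."""
    return max(INT16_MIN, min(INT16_MAX, a + b))


def wrapping_add16(a, b):
    """Add two 16-bit signed values with two's-complement wrap-around."""
    s = (a + b) & 0xFFFF
    return s - 0x10000 if s & 0x8000 else s


def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


print(saturating_add16(30000, 10000))          # 32767
print(wrapping_add16(30000, 10000))            # -25536
print([bit_reverse(i, 3) for i in range(8)])   # [0, 4, 2, 6, 1, 5, 3, 7]
```

With wrap-around, a loud signal that overflows flips sign; with saturation it merely clips, which is far less damaging for audio or control signals.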
9. What is the difference between a DSP and a
general-purpose processor? (4/4)
Other possible features:
Often, data are 16- or 8-bit wide (e.g., audio or images)
- a 32-bit ALU can be split into two 16-bit ALUs or four
8-bit ALUs,
-> 2 or 4 operations in parallel
several ALUs which work in parallel
fixed-point ALUs, or 16-bit ALUs, to reduce power
consumption and cost
optimized versions:
- cost: for consumer applications
- power: for mobile applications
- for specific applications, e.g. electric motor control
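The idea of splitting a 32-bit ALU into two 16-bit ALUs can be imitated in software by stopping the carry between the two halves of a word. The following is a sketch of that idea only, with hypothetical helper names (`pack16x2`, `unpack16x2`, `add16x2`); a real splittable ALU simply disables the carry chain at the lane boundary.

```python
def pack16x2(hi, lo):
    """Pack two 16-bit values into one 32-bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def unpack16x2(word):
    """Return the (high, low) 16-bit lanes of a 32-bit word."""
    return (word >> 16) & 0xFFFF, word & 0xFFFF


def add16x2(a, b):
    """Add two packed words lane by lane, with no carry between lanes."""
    # Add the low 15 bits of each lane: the sum cannot spill into the next lane.
    low = (a & 0x7FFF7FFF) + (b & 0x7FFF7FFF)
    # Restore the top bit of each lane with XOR (addition without carry-out).
    return (low ^ ((a ^ b) & 0x80008000)) & 0xFFFFFFFF


print(unpack16x2(add16x2(pack16x2(1000, 2000), pack16x2(3000, 4000))))  # (4000, 6000)
print(unpack16x2(add16x2(pack16x2(1, 0xFFFF), pack16x2(2, 1))))         # (3, 0)
```

In the second call the low lane overflows and wraps to 0 without disturbing the high lane, which is exactly the behaviour wanted from two independent 16-bit adders.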
12. Note: several of these features, which originated in
DSPs, have been adopted by general-purpose processors
E.g.: the cache in the Pentium processor is Harvard-like
(separate instruction and data caches)
13. Another example: several units working in parallel, and
splittable ALUs (see the MMX extensions) in the Pentium 4
processor
14. Pipeline…
Example of a 4-stage pipeline (TI ‘C30)
each instruction takes 4 clock cycles to execute, but (normally)
it can be issued just 1 cycle after the previous one (data are
needed only 3 cycles later)
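The gain from pipelining can be put into numbers with a small back-of-the-envelope model. It assumes an ideal pipeline that issues one instruction per cycle unless stalled; the function names and the `stall_cycles` parameter are inventions of this sketch.

```python
def pipelined_cycles(n_instructions, stages=4, stall_cycles=0):
    """Cycles for n instructions on an ideal pipeline.

    The first instruction needs `stages` cycles; each following one
    completes one cycle later, plus any stall cycles.
    """
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1) + stall_cycles


def unpipelined_cycles(n_instructions, stages=4):
    """Cycles if each instruction had to finish before the next starts."""
    return n_instructions * stages


print(pipelined_cycles(100))    # 103
print(unpipelined_cycles(100))  # 400
```

For long instruction sequences the throughput approaches one instruction per cycle, almost 4 times better than without the pipeline.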
15. Pipeline: branch (e.g. on the ‘C30)
Standard branch: the pipeline is flushed to correctly handle
the PC -> 4 cycles
Delayed branch: the pipeline is not flushed, and the 3
following instructions are fetched and executed before the PC
is modified -> only 1 cycle needed!
        BRD   label   ; delayed branch
        MPYF          ; executed
        ADDF          ; executed
        SUBF          ; executed
        AND           ; not executed
        …
        …
label   MPYF          ; fetched after SUBF
        …
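The execution order of the listing above can be reproduced with a tiny functional model of a delayed branch. This is only a sketch: it tracks which instructions execute, not pipeline timing, and the program representation (a list of `(label, opcode, target)` tuples) and the function name `trace` are assumptions made for the example. The elided lines of the listing are simply left out.

```python
def trace(program, delay_slots=3):
    """Return the opcodes executed, in order, honouring delayed branches.

    program: list of (label, opcode, target) tuples; label and target
    may be None. "BRD" is treated as an unconditional delayed branch.
    """
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    pc = 0
    executed = []
    pending = None  # (delay slots still to run, destination index)
    while pc < len(program):
        _, opcode, target = program[pc]
        executed.append(opcode)
        pc += 1
        if pending is not None:
            remaining, destination = pending
            remaining -= 1
            if remaining == 0:
                pc = destination  # the PC is modified only now
                pending = None
            else:
                pending = (remaining, destination)
        elif opcode == "BRD":
            pending = (delay_slots, labels[target])
    return executed


program = [
    (None, "BRD", "label"),
    (None, "MPYF", None),
    (None, "ADDF", None),
    (None, "SUBF", None),
    (None, "AND", None),
    ("label", "MPYF", None),
]
print(trace(program))  # ['BRD', 'MPYF', 'ADDF', 'SUBF', 'MPYF']
```

The three instructions after `BRD` run before the jump takes effect, and `AND` is skipped, matching the comments in the listing.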
16. Two architectures
In order to exploit the instruction level parallelism (ILP): two
possible architectures
- Superscalar: the parallelism is dynamically managed by
the hardware
- Very Long Instruction Word (VLIW): the parallelism is
statically managed by the compiler
What is the problem?
Dependences in data or control can generate conflicts
- on data (an instruction needs the result of a previous
instruction, but the result is not ready yet), or
- on control (conditional jump, but the condition is not ready
yet)
-> pipeline stall
17. Superscalar
The analysis of independent instructions is done dynamically
by the hardware (which is complex!)
The instructions can be executed out of order;
then, the completion of the instructions (commit) is done in
order, to correctly update the state of the CPU
18. VLIW
Very Long Instruction Word (VLIW): the parallelism is
statically managed by the compiler
The analysis of independent instructions is done statically
during the compilation phase;
- the instructions which can be executed in parallel are
packed into long instructions and sent to the various
functional units in order
Convenient solution for DSP programs (fixed-length
loops, few conditional operations); less convenient for
general-purpose applications
Simpler hardware! But a specific compilation for each
platform is needed
Deterministic behaviour -> exact computation of
execution times
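What the compiler does for a VLIW machine can be hinted at with a very simplified packer. This is a sketch under strong assumptions, not a real scheduler: instructions are `(destination, sources)` pairs, every functional unit can run any instruction, each long instruction takes one cycle, and instructions are never reordered. The function name `pack_bundles` and the register names are made up for the example.

```python
def pack_bundles(instructions, width=4):
    """Greedily pack instructions, in order, into long instruction words.

    An instruction starts a new bundle when the current one is full or
    when it reads or writes a register written earlier in that bundle.
    """
    bundles = []
    current = []
    written = set()  # registers written by the current bundle
    for dest, sources in instructions:
        conflict = dest in written or any(s in written for s in sources)
        if conflict or len(current) == width:
            bundles.append(current)
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles


program = [
    ("r1", ("a", "b")),
    ("r2", ("c", "d")),
    ("r3", ("r1", "r2")),  # needs r1 and r2: must wait for the next bundle
    ("r4", ("e", "f")),    # independent: shares the bundle with r3
    ("r5", ("r3", "r4")),  # needs r3 and r4: third bundle
]
bundles = pack_bundles(program)
print([len(b) for b in bundles])  # [2, 2, 1]
print(len(bundles))               # 3
```

Because the packing is fixed at compile time, the cycle count (here 3, under the one-cycle-per-bundle assumption) is known before the program ever runs, which is the determinism mentioned above.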