2. Basic Processor Structure
• Here we see a very simple processor structure - such as
might be found in a small 8-bit microprocessor.
12 DEC 01 ECSE 6620 - Jason Stripinis
3. Basic Processor Functions
• ALU
– Arithmetic Logic Unit - this circuit takes two operands on the
inputs (labeled A and B) and produces a result on the output
(labeled Y).
– The operations will usually include, as a minimum:
• add, subtract
• and, or, not
• shift right, shift left
• ALUs in more complex processors support many more
operations.
4. Basic Processor Functions
• Register File
– A set of storage locations (registers) for storing temporary results.
Early machines had just one register (accumulator). Modern RISC
processors will have at least 32 registers.
• Instruction Register
– The instruction currently being executed by the processor is stored
here.
• Control Unit
– The control unit decodes the instruction in the instruction register
and sets signals which control the operation of most other units of
the processor. For example, the operation code (opcode) in the
instruction will be used to determine the settings of control signals
for the ALU which determine which operation (+,-,^,v,~,shift,etc)
it performs.
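As an illustration of this decode step, here is a minimal Python sketch. The 8-bit instruction format (3-bit opcode field, 5-bit operand field) and the opcode assignments are invented for the example, not taken from any real processor:

```python
# Hypothetical 8-bit instruction: [3-bit opcode][5-bit operand field].
# The control unit extracts the opcode and uses it to select an ALU
# operation, just as control signals would select hardware datapaths.
ALU_OPS = {
    0b000: ("add", lambda a, b: a + b),
    0b001: ("sub", lambda a, b: a - b),
    0b010: ("and", lambda a, b: a & b),
    0b011: ("or",  lambda a, b: a | b),
    0b100: ("not", lambda a, b: ~a & 0xFF),
    0b101: ("shl", lambda a, b: (a << 1) & 0xFF),
    0b110: ("shr", lambda a, b: a >> 1),
}

def decode(instruction):
    """Split an 8-bit instruction into its opcode and operand fields."""
    opcode = (instruction >> 5) & 0b111   # top 3 bits select the ALU op
    operand = instruction & 0b11111       # low 5 bits, e.g. a register number
    return opcode, operand

def execute(instruction, a, b):
    """Run the selected ALU operation on operands A and B."""
    opcode, _ = decode(instruction)
    name, op = ALU_OPS[opcode]
    return op(a, b) & 0xFF                # mask to an 8-bit result

print(execute(0b00000000, 3, 4))   # opcode 000 = add -> 7
print(execute(0b00100000, 10, 4))  # opcode 001 = sub -> 6
```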
5. Basic Processor Functions
• Clock
– The vast majority of processors are synchronous, that is, they use a
clock signal to determine when to capture the next data word and
perform an operation on it. In a globally synchronous processor, a
common clock needs to be routed (connected) to every unit in the
processor.
• Program counter
– The program counter holds the memory address of the next
instruction to be executed. It is updated every instruction cycle to
point to the next instruction in the program. Branch instructions
change the program counter by something other than a simple increment.
6. Basic Processor Functions
• Memory Address Register
– This register is loaded with the address of the next data word to be
fetched from or stored into main memory.
• Address Bus
– Transfers addresses to memory and memory-mapped peripherals.
It is driven by the processor acting as a bus master.
• Data Bus
– Carries data to and from the processor, memory and peripherals. It
will be driven by the data source, i.e. processor, memory, etc.
• Multiplexed Bus
– To limit device pin counts and bus complexity, some processors
multiplex address and data onto the same bus, with an adverse effect
on performance.
7. DSP Implementations
• DSP Algorithm
– Series of mathematical operations that are applied to process a
sequence of digital signals sampled from the real (analog) world
• Application examples
– Filtering
– FFT
– Noise cancellation
– Spectral Processing
8. Why is a special architecture good for
digital signal processing?
• DSPs are tailored to run DSP algorithms efficiently.
• Special functions to handle DSP algorithm demands:
– Unique data access patterns
• Streams of data requiring high bandwidth
• Low data repetition but high code repetition
– Math operation focus (“number cruncher”)
– Real-time constraints
– Power and size constraints
– Cost requirement
– Attention to numeric effects (limited fixed point error)
9. DSP Functional Characteristics
• Typically require a few specific operations
• Consider an FIR filter:
y[n] = h[0]x[n] + h[1]x[n-1] + ... + h[N-1]x[n-(N-1)]
This requires:
– additions & multiplications
– delays
– array handling
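All three requirements show up in a direct-form FIR sketch. This is plain Python for illustration only; on a real DSP each tap would be a single MAC with hardware address generation:

```python
def fir_filter(x, h):
    """Direct-form FIR: y[n] = sum over k of h[k] * x[n-k]."""
    delay = [0.0] * len(h)                # delays: past input samples
    y = []
    for sample in x:
        delay = [sample] + delay[:-1]     # shift the delay line (array handling)
        acc = 0.0
        for k in range(len(h)):           # one multiply-accumulate per tap
            acc += h[k] * delay[k]
        y.append(acc)
    return y

# 3-tap moving average: the output settles to the input level
print(fir_filter([3.0, 3.0, 3.0, 3.0], [1/3, 1/3, 1/3]))
```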
10. DSP Typical Operations
• Additions & Multiplications
– fetch two operands
– perform the addition or multiplication (or both)
– store the result
• Delays
– store the result for later use
• Array Handling
– fetch values from consecutive memory locations
– copy data from register to register
11. DSP Typical Operations
• To perform these basic operations, most DSPs:
– have a parallel multiply and add
– have multiple memory accesses (to fetch two operands and store the
result)
– have sufficient registers to hold data temporarily
– have efficient address generation for array handling
– have special features such as delays or circular addressing
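Circular addressing can be sketched in software as a delay line addressed modulo its length. The modulo wrap here stands in for what DSP address generators do in hardware at no cost; the class and method names are invented for the example:

```python
class CircularBuffer:
    """Fixed-size delay line: data is never copied, only a pointer moves."""
    def __init__(self, size):
        self.data = [0.0] * size
        self.ptr = 0                       # in hardware: an address register

    def push(self, sample):
        self.data[self.ptr] = sample       # overwrite the oldest sample
        self.ptr = (self.ptr + 1) % len(self.data)   # modulo (circular) wrap

    def tap(self, k):
        """Return the sample from k steps ago (k = 0 is the newest)."""
        return self.data[(self.ptr - 1 - k) % len(self.data)]

buf = CircularBuffer(4)
for s in [1.0, 2.0, 3.0, 4.0, 5.0]:       # fifth push overwrites the first
    buf.push(s)
print(buf.tap(0), buf.tap(3))              # newest = 5.0, oldest kept = 2.0
```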
12. DSP Arithmetic Logic Unit
• Most DSP operations require additions and multiplications
together. So DSP processors usually have parallel
hardware adders and multipliers which can be used with a
single instruction:
13. Register Structure
• Delays require that intermediate values be held for later
use.
• For example, when keeping a running total - the total can
be kept within the processor to avoid wasting repeated
reads from and writes to memory.
• For this reason DSP processors have lots of registers which
can be used to hold intermediate values.
• Registers may be fixed-point or floating-point.
14. Memory Addressing
• Array handling requires that data can be fetched efficiently
from consecutive memory locations.
• For this reason DSP processors have address registers
which are used to hold addresses and can be used to
generate the next needed address efficiently.
• Usually, the next needed address can be generated during
the data fetch or store operation, and with no overhead.
15. Memory Addressing
• Example DSP address generation operations:
– *rP (register indirect): read the data pointed to by the address in
register rP
– *rP++ (postincrement): having read the data, postincrement the
address pointer to point to the next value in the array
– *rP-- (postdecrement): having read the data, postdecrement the
address pointer to point to the previous value in the array
– *rP++rI (register postincrement): having read the data,
postincrement the address pointer by the amount held in register rI,
to point rI values further down the array
– *rP++rIr (bit reversed): having read the data, postincrement the
address pointer to point to the next value in the array, as if the
address bits were in bit-reversed order
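The bit-reversed mode exists to support the FFT, whose butterfly stages consume data in bit-reversed index order. A small Python sketch of the index calculation (the function name is ours, not a DSP mnemonic):

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of index i, e.g. 001 -> 100."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)   # shift the reversed result left,
        i >>= 1                  # pulling bits off i from the bottom
    return r

# The access order a radix-2 FFT needs for an 8-point array:
order = [bit_reverse(i, 3) for i in range(8)]
print(order)   # [0, 4, 2, 6, 1, 5, 3, 7]
```

A DSP with bit-reversed addressing generates this sequence in the address unit for free, during the data fetch itself.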
16. Memory Architectures for DSP
• For arithmetic the DSP needs to fetch two operands in a
single instruction cycle.
• Since we also need to store the result and to read the
instruction itself, more than two memory accesses per
instruction cycle are needed.
• Even the simplest DSP operation - an addition involving
two operands and a store of the result to memory - requires
four memory accesses (three to fetch the two operands and
the instruction, plus a fourth to write the result).
17. Memory Architectures for DSP
• DSP processors usually support multiple memory accesses
in the same instruction cycle.
• It is not possible to access two different memory addresses
simultaneously over a single memory bus.
• There are two common methods to achieve multiple
memory accesses per instruction cycle:
• Harvard architecture
• modified von Neumann architecture
18. Memory Architectures for DSP
(Harvard Architecture)
• The Harvard architecture has two separate physical
memory buses, allowing two simultaneous memory
accesses.
• The true Harvard architecture dedicates one bus for
fetching instructions, with the other available to fetch
operands.
• This is inadequate for DSP operations, which usually
involve at least two operands. So DSP Harvard
architectures usually permit the 'program' bus to also be
used to fetch operands.
19. Memory Architectures for DSP
(Harvard Architecture)
• Note that it is often necessary to fetch three things - the
instruction plus two operands - and the Harvard
architecture is inadequate to support this.
• So DSP Harvard architectures often also include a cache
memory which can be used to store instructions which will
be reused, leaving both Harvard buses free for fetching
operands.
• The Harvard architecture plus cache - is sometimes called
an extended Harvard architecture or Super Harvard
ARChitecture (SHARC).
20. Memory Architectures for DSP
(Harvard Architecture)
• The Harvard architecture requires two memory buses. This
makes it expensive to bring off the chip - for example, a
DSP using 32-bit words and with a 32-bit address space
requires at least 64 pins for each memory bus - a total of
128 pins if the Harvard architecture is brought off the chip.
This results in very large chips, which are difficult to
design into a circuit.
21. Memory Architectures for DSP
(von Neumann Architecture)
• The von Neumann architecture uses only a single memory
bus. This is relatively cheap, requiring fewer pins than the
Harvard architecture, and simple to use because the
programmer can place instructions or data anywhere
throughout the available memory.
• But it does not permit multiple simultaneous memory accesses.
• The modified von Neumann architecture allows multiple
memory accesses per instruction cycle by running the
memory clock faster than the instruction cycle.
22. Memory Architectures for DSP
(von Neumann Architecture)
• Each instruction cycle is divided into multiple 'machine
states', and a memory access can be made in each machine
state, permitting multiple memory accesses per
instruction cycle.
• The modified von Neumann architecture permits all the
memory accesses needed to support addition or
multiplication: fetch of the instruction; fetch of the two
operands; and storage of the result.
23. Why use a special architecture for
digital signal processing?
The Answers
– Unique data access patterns → bit-reversed addressing (FFT)
– Streams of data requiring high bandwidth → multiple-access
memory architecture
– Low data repetition but high code repetition → eliminate the
data cache (save $$)
– Math operation focus → MAC instruction; vector processing unit
– Real-time constraints → zero-overhead loops
– Power and size constraints → limited additional functional
units (unlike a GPP)
– Cost requirement → on-board peripherals (SOC)
– Attention to numeric effects (limited fixed-point error) →
ALU with 16-bit operands and a 32-bit result
24. DSP Generations
• 1st Generation (1979-1982)
– Transition from experimental signal processors
• 2nd Generation (1985-1986)
– Move from co-processor to stand-alone processor
• 3rd Generation (1987-1989)
– Major hardware improvements to speed
• 4th Generation (1990-1996)
– More on-chip integration (ADC, DAC, memory, multi-processor)
• 5th Generation (1997-)
25. DSP Generations
1st Generation (1979-1982)
• Primarily targeted at digital filtering
• Specialized co-processor for signal processing
• NMOS (n-channel metal-oxide semiconductor) fabrication
• 16-bit fixed point
• fast multiplier (and adder)
• Harvard architecture
• Specialized Instruction set
26. DSP Generations
1st Generation (1979-1982)
• Example = Texas Instruments TMS32010
– 16-bit fixed point
– Harvard architecture
– two Address registers
– one A register (adder)
– one P register (multiplier)
– one T register (data shift on delay line)
– No zero-overhead loop
– Specialized Instruction set
– MAC Time 400 ns (<100 ns today)
– 50 ms per 1024-FFT
27. DSP Generations
1st Generation (1979-1982)
• Example = Texas Instruments TMS32010
28. DSP Generations
2nd Generation (1985-1986)
• Move from co-processor to stand-alone processor
• CMOS (complementary metal-oxide semiconductor) fabrication
• Double the speed of first generation
• Advances in memory architecture (more internal RAM)
• better pipelining of functional units
• address generators (bit-reversing)
• Zero-overhead loop HW
• Limited floating point in SW
29. DSP Generations
2nd Generation (1985-1986)
• Example = Texas Instruments TMS32020 (1985)
– 16-bit fixed point
– Harvard architecture
– Improved TMS32010
– RPTS allows a pipelined instruction to be performed in a single cycle
– Specialized Instruction set
– MAC Time 200 ns
– 10 ms per 1024-FFT
30. DSP Generations
3rd Generation (1987-1989)
• Increased floating point support
– 32-bit floating point hardware DSPs released
– Floating point emulation on fixed point processors
– IEEE754 support
• Hardware enhancements (large speed increase)
– dense CMOS fabrication
– on chip DMA
– instruction caches
– increased clock rates (first cores above 10 MHz)
• Increased complexity of SW
31. DSP Generations
3rd Generation (1987-1989)
• Example = Motorola DSP56001 (1988)
– 24-bit data, instructions
– 24-bit fixed point
– 3 memory spaces (P, X, Y)
– parallel moves
– circular addressing
– MAC Time 75 ns (21 ns today)
– ~3 ms per 1024-FFT
• Other Examples:
– AT&T DSP16A
– Analog Devices ADSP-2100
– TI TMS320C50
32. DSP Generations
4th Generation (1990-1996)
• Hardware integration
– ADC
– DAC
– more memory
– multiple DSPs on one chip
• Decreasing power consumption
– 5.0 VDC → 3.3 VDC → 3.0 VDC → 2.7 VDC
• GPPs start to get DSP functions
– SIMD
– Leads to Intel introducing MMX (MultiMedia eXtensions) for x86
33. DSP Generations
4th Generation (1990-1996)
• Example = TI TMS320C541 (1995)
– Enhanced architecture
– Low voltage (3.3 VDC)
– More on-chip memory
– Application specific functional units
– MAC Time 20 ns (10 ns today)
– ~1 ms per 1024-FFT
• Example = TI TMS320C80
– multiple processors per chip
34. The GPP Option
• High-performance general-purpose processors for PCs and
workstations are increasingly suitable for some DSP
applications.
• E.g., Intel MMX Pentium, Motorola/IBM PowerPC 604e
• These processors achieve excellent to outstanding floating
and/or fixed-point DSP performance via:
– Very high clock rates (200-500 MHz)
– Superscalar architectures
– Single-cycle multiplication and arithmetic operations
– Good memory bandwidth
– Branch prediction
– In some cases, single-instruction, multiple-data (SIMD) ops
35. DSP Generations
5th Generation (1997-)
• Not the classic DSP architectures
– SIMD (Single Instruction Multiple Data stream) instructions
– VLIW (Very Long Instruction Words) allows RISC processing
• High parallelism
• Increased clock speeds
• Functional units are no longer application-specific (no dedicated MAC FU)
• Low voltage (2.5 VDC or less, even 1.2 VDC cores)
• MAC Time 3 ns (but can be power hungry)
• GPPs start to get DSP functions
– Intel introduces MMX (MultiMedia eXtensions) for x86 in 1997
• Increased integration
– MCU and DSP cores on same chip
– MCU functions/ports added to DSPs
36. DSP Generations
5th Generation (1997-)
• SIMD (Single Instruction Multiple Data) instructions
– Enhance throughput by allowing parallelism
– Requires multiple functional units and wider buses
– May support multiple data widths (different functional groups)
– Example = DSP16000
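The idea behind packed SIMD arithmetic can be sketched in Python: one 32-bit value is treated as two independent 16-bit lanes, and a single "instruction" operates on both lanes at once with no carry between them. This illustrates the concept only; it is not any particular DSP's instruction set:

```python
LANE = 0xFFFF   # mask for one 16-bit lane

def pack2(hi, lo):
    """Pack two 16-bit values into one 32-bit register image."""
    return ((hi & LANE) << 16) | (lo & LANE)

def simd_add16(a, b):
    """Lane-wise 16-bit add with wraparound; no carry crosses lanes."""
    lo = ((a & LANE) + (b & LANE)) & LANE
    hi = (((a >> 16) & LANE) + ((b >> 16) & LANE)) & LANE
    return (hi << 16) | lo

r = simd_add16(pack2(1000, 40000), pack2(2000, 30000))
# high lane: 1000 + 2000 = 3000
# low lane: 40000 + 30000 wraps mod 65536 to 4464
print((r >> 16) & LANE, r & LANE)
```

Note how the low lane wraps without disturbing the high lane; that isolation between lanes is exactly what the extra functional units and wider buses buy.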
37. DSP Generations
5th Generation (1997-)
• VLIW (Very Long Instruction Words)
– Instruction-Level Parallelism (ILP) can be a major
performance gain
• A superscalar implementation requires a larger die and
more power to dynamically schedule instructions
– VLIW can be used to statically schedule instructions at
compile time (or even by hand!)
– VLIW instruction words have fixed "slots" for instructions
that map to the functional units available.
38. DSP Generations
5th Generation (1997-)
• VLIW Advantages
– huge theoretical payoff
• less than 1 ns per MAC!
• less than 75 ns per 1024-FFT
• VLIW Drawbacks
– Can be very difficult to program and debug
– High power consumption if the VLIW slots are not filled
– Code size dramatically increases requiring more program memory
39. DSP Generations
5th Generation (1997-)
• VLIW Example = TI TMS320C6201
– 32-bit functional units:
• Lx = ALU
• Sx = branching and shifting
• Mx = multiplier
• Dx = data store
40. DSP Generational Development
• DSP processor performance has increased by a factor of
about 400x over the past 20 years
[Bar chart: MAC time (ns), falling steeply from the 1st through the 5th generation]
• DSP architectures will be increasingly specialized for
applications, especially communications applications
• General-purpose processors will become viable for many
DSP applications