Mpmc

Microprocessors
and
Microcontrollers
Third Year BE Computers

Pawar Virendra D.
Mo. No.:9423582261

MPMC © Pawar Virendra D.

Syllabus

EC4813 : Microprocessors and Microcontrollers
Prerequisites:
Understanding of Microprocessors, Peripheral Chips, Analogue Sensors, Conversion,
Interfacing Techniques.
Aim: This course covers the design of hardware and software code using a modern
microcontroller. It emphasizes assembly language programming of the microcontroller,
including device drivers, exception and interrupt handling, and interfacing with
higher-level languages.
Objectives:
1. To exhibit knowledge of the architecture of microcontrollers and apply program
control structures to microcontrollers;
2. To develop the ability to use assembly language to program a microcontroller and
demonstrate the capability to program the microcontroller to communicate with external
circuitry using parallel ports;
3. To demonstrate the capability to program the microcontroller to communicate with
external circuitry using serial ports and timer ports.
Unit 1: Introduction to Pentium Microprocessor (7 Hrs) Pentium Microprocessor:
History, Features & Architecture, Pin Description, Functional Description, Real Mode,
RISC, Super Scalar, Pipelining, Instruction Pairing, Branch Prediction, Inst Data Cache,
FPU
Unit 2: Bus Cycles and Memory Organization (7 Hrs) Bus Cycles & Memory
Organisation: Init & Configuration, Bus Operations-RST, Mem/IO
Organisation, Data Transfer Mechanism, 8/16/32 bit Data Bus I, Programmers Model,
Register Set, Instru Set, Data Types, Instructions
Unit 3: Protected Mode (6 Hrs) Protected Mode: Intro Segmentation, Supp Registers,
Rel Int Desc, Mem Man thru Segmentation, Logical to linear translation, protection by
segmentation, Privilege Level protection, related instructions, inter-privilege level
transfer of control, paging-support registers, descriptors, linear-physical add trans, TLB,
page level protection, virtual memory
Unit 4: Multitasking, Interrupts, Exceptions and I/O (6 Hrs) Multitasking,
Interrupts, Exception I/O: Multi Tasking Support Reg, Rel Des, Task Switch I/O per
BitMap, Virtual Mode, Add Gen, Priv Level, Inst & Reg, Enter/Leaving V86 M, Interrupt
Structure Real/Prot V86 Mode, I/O Handling, comparison of 3 modes.
Unit 5: 8051 Microcontroller (7 Hrs) Family Architecture, Data/Programme
Memory, Reg set Reg Bank SFR, Ext Data/Mem Programme Mem, Interrupt Structure,
Timer Prog, Serial Port Prog, Misc Features, Min System
Unit 6: PIC Micro-Controller (7 Hrs) PIC Micro-Controller: Overview, Features,
Pin Out, Capture/Compare/Pulse width modulation Mode, Block Dia Prog Model,
Reset/Clocking, Mem Org, Prog/Data, Flash Eprom, Add Mode/Inst Set Prog, I/O, Interrupt,
Timer, ADC
Outcomes: Upon completion of the course, the student should be able to:


1. Describe and use the functional blocks utilized in a basic microcontroller based
system.
2. Describe the programmer's model of the CPU's instruction set and various addressing
modes.
3. Proficiently use the various instruction sets and functional groups when programming.
4. Integrate structured programming techniques and sub-routines into microcontroller
based hardware topologies.
5. Develop I/O port, ADC hardware, and software interfacing techniques.
6. Describe the use of sensors, interfacing, and signal conditioning when utilizing the
microcontroller in control and monitor applications.
Text Books:
1. Antonakos J., "The Pentium Microprocessor", Pearson Education, 2004, 2nd Edition.
2. Deshmukh A., "Microcontrollers - Theory and Applications", Tata McGraw-Hill, 2004.
Reference Books:
1. Mazidi M., Gillispie J., "The 8051 Microcontroller and Embedded Systems", Pearson
Education, 2002, ISBN 81-7808-574-7.
2. Intel Pentium Data Sheets.
3. Ayala K., "The 8051 Microcontroller", Penram International, 1996, ISBN 81-900828-4-1.
4. Intel 8-bit Microcontroller manual.
5. Microchip manual for PIC 16CXX and 16FXX.


INTRODUCTION

16-bit Processors and Segmentation (1978)
The IA-32 architecture family was preceded by 16-bit processors, the 8086 and 8088.
The 8086 has 16-bit registers and a 16-bit external data bus, with 20-bit addressing giving
a 1-MByte address space. The 8088 is similar to the 8086 except it has an 8-bit external
data bus. The 8086/8088 introduced segmentation to the IA-32 architecture. With
segmentation, a 16-bit segment register contains a pointer to a memory segment of up to
64 KBytes. Using four segment registers at a time, 8086/8088 processors are able to
address up to 256 KBytes without switching between segments. The 20-bit addresses that
can be formed using a segment register and an additional 16-bit pointer provide a total
address range of 1 MByte.
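
The address arithmetic described above can be sketched in a few lines of Python. This is a minimal illustration, not Intel's wording: the function name `physical_address` is hypothetical, and the shift-by-four rule (segment value times 16 plus the 16-bit offset, wrapped to 20 bits) is the standard 8086 real-mode calculation, which the text above implies but does not spell out.

```python
SEGMENT_SIZE = 1 << 16      # 64 KBytes reachable through one segment register
ADDRESS_SPACE = 1 << 20     # 1 MByte total with 20-bit addresses


def physical_address(segment: int, offset: int) -> int:
    """Combine a 16-bit segment and a 16-bit offset into a 20-bit address."""
    if not (0 <= segment <= 0xFFFF and 0 <= offset <= 0xFFFF):
        raise ValueError("segment and offset must be 16-bit values")
    # The segment value is shifted left by 4 bits (multiplied by 16),
    # the offset is added, and the result wraps at the 1-MByte boundary.
    return ((segment << 4) + offset) & (ADDRESS_SPACE - 1)


if __name__ == "__main__":
    print(hex(physical_address(0x1234, 0x0010)))
    # Four segment registers, each covering 64 KBytes, give 256 KBytes
    # that can be addressed without reloading a segment register.
    print(4 * SEGMENT_SIZE // 1024, "KBytes via four segment registers")
```

Because many segment:offset pairs map to the same physical address, the same byte of memory can be reached through different segment values.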

The Intel® 286 Processor (1982)
The Intel 286 processor introduced protected mode operation into the IA-32 architecture.
Protected mode uses the segment register content as selectors or pointers into descriptor
tables. Descriptors provide 24-bit base addresses with a physical memory size of up to 16
MBytes, support for virtual memory management on a segment swapping basis, and a
number of protection mechanisms. These mechanisms include:
• Segment limit checking
• Read-only and execute-only segment options
• Four privilege levels

The Intel386™ Processor (1985)
The Intel386 processor was the first 32-bit processor in the IA-32 architecture family. It
introduced 32-bit registers for use both to hold operands and for addressing. The lower
half of each 32-bit Intel386 register retains the properties of the 16-bit registers of earlier
generations, permitting backward compatibility. The processor also provides a virtual-
8086 mode that allows for even greater efficiency when executing programs created for
8086/8088 processors.
In addition, the Intel386 processor has support for:
• A 32-bit address bus that supports up to 4-GBytes of physical memory
• A segmented-memory model and a flat memory model
• Paging, with a fixed 4-KByte page size providing a method for virtual memory
management
• Support for parallel stages

The Intel486™ Processor (1989)
The Intel486™ processor added more parallel execution capability by expanding the
Intel386 processor’s instruction decode and execution units into five pipelined stages.
Each stage operates in parallel with the others on up to five instructions in different
stages of execution.
In addition, the processor added:
• An 8-KByte on-chip first-level cache that increased the percent of instructions that
could execute at the scalar rate of one per clock


• An integrated x87 FPU
• Power saving and system management capabilities

The Intel® Pentium® Processor (1993)
The introduction of the Intel Pentium processor added a second execution pipeline to
achieve superscalar performance (two pipelines, known as u and v, together can execute
two instructions per clock). The on-chip first-level cache doubled, with 8 KBytes devoted
to code and another 8 KBytes devoted to data. The data cache uses the MESI protocol to
support more efficient write-back cache in addition to the write-through cache previously
used by the Intel486 processor. Branch prediction with an on-chip branch table was
added to increase performance in looping constructs.
In addition, the processor added:
• Extensions to make the virtual-8086 mode more efficient and allow for 4-MByte as well
as 4-KByte pages
• Internal data paths of 128 and 256 bits that add speed to internal data transfers
• A burstable external data bus, increased to 64 bits
• An APIC to support systems with multiple processors
• A dual processor mode to support glueless two processor systems

PROCESSOR FEATURES OVERVIEW
The Pentium processor supports the features of previous Intel Architecture processors and
provides significant enhancements including the following:
• Superscalar Architecture
• Dynamic Branch Prediction
• Pipelined Floating-Point Unit
• Improved Instruction Execution Time
• Separate Code and Data Caches.
• Writeback MESI Protocol in the Data Cache
• 64-Bit Data Bus
• Bus Cycle Pipelining
• Address Parity
• Internal Parity Checking
• Functional Redundancy Checking and Lock Step operation
• Execution Tracing
• Performance Monitoring
• IEEE 1149.1 Boundary Scan
• System Management Mode
• Virtual Mode Extensions
• Upgradable with a Pentium OverDrive processor
• Dual processing support
• Advanced SL Power Management Features
• Fractional Bus Operation
• On-Chip Local APIC Device
• Support for the Intel 82498/82493 and 82497/82492 cache chipset products
• Split line accesses to the code cache

COMPONENT INTRODUCTION
The application instruction set of the Pentium processor family includes the complete
instruction set of existing Intel Architecture processors to ensure backward compatibility,
with extensions to accommodate the additional functionality of the Pentium processor.
All application software written for the Intel386™ and Intel486™ microprocessors will
run on the Pentium processor without modification. The on-chip memory management
unit (MMU) is completely compatible with the Intel386 and Intel486 CPUs.

The two instruction pipelines and the floating-point unit on the Pentium processor are
capable of independent operation. Each pipeline issues frequently used instructions in a
single clock. Together, the dual pipes can issue two integer instructions in one clock, or
one floating-point instruction (under certain circumstances, 2 floating-point instructions)
in one clock. Branch prediction is implemented in the Pentium processor. To support this,
the Pentium processor implements two prefetch buffers, one to prefetch code in a linear
fashion, and one that prefetches code according to the Branch Target Buffer (BTB) so the
needed code is almost always prefetched before it is needed for execution.

The Pentium processor includes separate code and data caches integrated on chip to meet
its performance goals. The caches on the Pentium processor are each 8 Kbytes in size
and 2-way set-associative. Each cache has a dedicated Translation Lookaside Buffer
(TLB) to translate linear addresses to physical addresses. The Pentium processor data
cache is configurable to be writeback or writethrough on a line-by-line basis and follows
the MESI protocol. The data cache tags are triple ported to support two data transfers and
an inquire cycle in the same clock. The code cache is an inherently write protected cache.
The code cache tags of the Pentium processor are also triple ported to support snooping
and split-line accesses.
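
As a rough illustration of what "8 Kbytes, 2-way set-associative" implies, the sketch below derives the number of sets and splits an address into tag, set index and byte offset. The 32-byte line size is an assumption taken from the line-size prefetch buffers described elsewhere in these notes, and the names `SETS` and `split_address` are hypothetical.

```python
CACHE_BYTES = 8 * 1024   # each on-chip cache is 8 Kbytes (from the text)
WAYS = 2                 # 2-way set-associative (from the text)
LINE_BYTES = 32          # assumption: 32-byte cache line

SETS = CACHE_BYTES // (WAYS * LINE_BYTES)     # lines per way
OFFSET_BITS = LINE_BYTES.bit_length() - 1     # bits selecting a byte in a line
INDEX_BITS = SETS.bit_length() - 1            # bits selecting a set


def split_address(addr: int) -> tuple[int, int, int]:
    """Split a 32-bit address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    print("sets:", SETS, "offset bits:", OFFSET_BITS, "index bits:", INDEX_BITS)
    tag, index, offset = split_address(0x12345678)
    print(hex(tag), index, offset)
```

Under these assumptions each cache has 128 sets of two lines; an address selects one set, and the tags of both lines in that set are compared against the address tag.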

The Pentium processor has a 64-bit data bus. Burst read and burst writeback cycles are
supported by the Pentium processor. In addition, bus cycle pipelining has been added to
allow two bus cycles to be in progress simultaneously. The Pentium processor Memory
Management Unit contains optional extensions to the architecture which allow 4 MB
page sizes.

The Pentium processor has added significant data integrity and error detection capability.
Data parity checking is still supported on a byte-by-byte basis. Address parity checking,
and internal parity checking features have been added along with a new exception, the
machine check exception.

The Pentium processor has implemented functional redundancy checking to provide
maximum error detection of the processor and the interface to the processor. When
functional redundancy checking is used, a second processor, the “checker” is used to
execute in lock step with the “master” processor. The checker samples the master’s
outputs and compares those values with the values it computes internally, and asserts an
error signal if a mismatch occurs. The Pentium processor with MMX technology does not
support functional redundancy checking.

As more and more functions are integrated on chip, the complexity of board level testing
is increased. To address this, the Pentium processor has increased test and debug
capability by implementing IEEE Boundary Scan (Standard 1149.1). System
management mode has been implemented along with some extensions to the SMM
architecture.

Enhancements to the Virtual 8086 mode have been made to increase performance by
reducing the number of times it is necessary to trap to a Virtual 8086 monitor.

The block diagram of the Pentium processor shows the two instruction pipelines, the “u”
pipe and the “v” pipe. The u-pipe can execute all integer and floating-point instructions.
The v-pipe can execute simple integer instructions and the FXCH floating-point instruction.


The separate code and data caches are shown. The data cache has two ports, one for each
of the two pipes (the tags are triple ported to allow simultaneous inquire cycles). The data
cache has a dedicated TLB to translate linear addresses to the physical addresses used by the
data cache.
The code cache, branch target buffer and prefetch buffers are responsible for getting raw
instructions into the execution units of the Pentium processor. Instructions are fetched
from the code cache or from the external bus. Branch addresses are remembered by the
branch target buffer. The code cache TLB translates linear addresses to physical
addresses used by the code cache.
The decode unit contains two parallel decoders which decode and issue up to the next
two sequential instructions into the execution pipeline. The control ROM contains the
microcode which controls the sequence of operations performed by the processor. The
control unit has direct control over both pipelines.

The Pentium processor contains a pipelined floating-point unit that provides a significant
floating-point performance advantage over previous generations of Intel Architecture-
based processors.

The Pentium processor includes features to support multi-processor systems, namely an
on chip Advanced Programmable Interrupt Controller (APIC). This APIC
implementation supports multiprocessor interrupt management (with symmetric interrupt
distribution across all processors), multiple I/O subsystem support, 8259A compatibility,
and inter-processor interrupt support.

The dual processor configuration allows two Pentium processors to share a single L2
cache for a low-cost symmetric multi-processor system. The two processors appear to the
system as a single Pentium processor. Multiprocessor operating systems properly
schedule computing tasks between the two processors. This scheduling of tasks is
transparent to software applications and the end-user. Logic built into the processors
supports a “glueless” interface for easy system design. Through a private bus, the two
Pentium processors arbitrate for the external bus and maintain cache coherency. The
Pentium processor can also be used in a conventional multi-processor system in which
one L2 cache is dedicated to each processor.

The Pentium processor is produced on Intel’s advanced silicon technology. The Pentium
processor also includes SL enhanced power management features. When the clock to the
Pentium processor is stopped, power dissipation is virtually eliminated. The low VCC
operating voltages and SL enhanced power management features make the Pentium
processor a good choice for energy-efficient desktop designs.


PIN DESCRIPTION
Symbol Type Name and Function
A31-A3 I/O As outputs, the address lines of the processor along with the
byte enables define the physical area of memory or I/O
accessed. The external system drives the inquire address to the
processor on A31-A5.
D63-D0 I/O These are the 64 data lines for the processor. Lines D7-D0
define the least significant byte of the data bus; lines D63-D56
define the most significant byte of the data bus. When the CPU
is driving the data lines, they are driven during the T2, T12, or
T2P clocks for that cycle. During reads, the CPU samples the
data bus when BRDY# is returned.
ADS# O The address status indicates that a new valid bus cycle is
currently being driven by the Pentium processor.
BE7#-BE5# O, BE4#-BE0# I/O The byte enable pins are used to determine which bytes must
be written to external memory, or which bytes were requested
by the CPU for the current cycle. The byte enables are driven
in the same clock as the address lines (A31-A3).
BOFF# I The backoff input is used to abort all outstanding bus cycles
that have not yet completed. In response to BOFF#, the
Pentium processor will float all pins normally floated during
bus hold in the next clock. The processor remains in bus hold
until BOFF# is negated, at which time the Pentium processor
restarts the aborted bus cycle(s) in their entirety.
BRDY# I The burst ready input indicates that the external system has
presented valid data on the data pins in response to a read or
that the external system has accepted the Pentium processor
data in response to a write request. This signal is sampled in the
T2, T12 and T2P bus states.
CACHE# O For Pentium processor initiated cycles the cache pin indicates
internal cacheability of the cycle (if a read), and indicates a
burst write back cycle (if a write). If this pin is driven inactive
during a read cycle, the Pentium processor will not cache the
returned data, regardless of the state of the KEN# pin. This pin
is also used to determine the cycle length (number of transfers
in the cycle).
CPUTYP I CPU type distinguishes the Primary processor from the Dual
processor. In a single processor environment, or when the
Pentium processor is acting as the Primary processor in a dual
processing system, CPUTYP should be strapped to VSS. The
Dual processor should have CPUTYP strapped to VCC. For the
Pentium OverDrive processor, CPUTYP will be used to
determine whether the bootup handshake protocol will be used
(in a dual socket system) or not (in a single socket system).
FLUSH# I When asserted, the cache flush input forces the Pentium
processor to write back all modified lines in the data cache
and invalidate its internal caches. A Flush Acknowledge
special cycle will be generated by the Pentium processor
indicating completion of the write back and invalidation.
If FLUSH# is sampled low when RESET transitions from high
to low, tristate test mode is entered. If two Pentium processors
are operating in dual processing mode and FLUSH# is asserted,
the Dual processor will perform a flush first (without a flush
acknowledge cycle), then the Primary processor will perform a
flush followed by a flush acknowledge cycle.
NOTE:
If the FLUSH# signal is asserted in dual processing mode, it
must be deasserted at least one clock prior to BRDY# of the
FLUSH Acknowledge cycle to avoid DP arbitration problems.
FRCMC# I The functional redundancy checking master/checker mode
input is used to determine whether the Pentium processor is
configured in master mode or checker mode. When configured
as a master, the Pentium processor drives its output pins as
required by the bus protocol. When configured as a checker,
the Pentium processor tristates all outputs (except IERR# and
TDO) and samples the output pins. The configuration as a
master/checker is set after RESET and may not be changed
other than by a subsequent RESET.
HOLD I In response to the bus hold request, the Pentium processor
will float most of its output and input/output pins and assert
HLDA after completing all outstanding bus cycles. The
Pentium processor will maintain its bus in this state until
HOLD is de-asserted. HOLD is not recognized during LOCK
cycles. The Pentium processor will recognize HOLD during
reset.
HLDA O The bus hold acknowledge pin goes active in response to a
hold request driven to the processor on the HOLD pin. It
indicates that the Pentium processor has floated most of the
output pins and relinquished the bus to another local bus
master. When leaving bus hold, HLDA will be driven inactive
and the Pentium processor will resume driving the bus. If the
Pentium processor has a bus cycle pending, it will be driven in
the same clock that HLDA is de-asserted.
INIT I The Pentium processor initialization input pin
forces the Pentium processor to begin execution in a known
state. The processor state after INIT is the same as the state
after RESET except that the internal caches, write buffers, and
floating point registers retain the values they had prior to INIT.
INIT may NOT be used in lieu of RESET after power-up. If
INIT is sampled high when RESET transitions from high to
low, the Pentium processor will perform built-in self test prior
to the start of program execution.


INV I The invalidation input determines the final cache line state (S
or I) in case of an inquire cycle hit. It is sampled together with
the address for the inquire cycle in the clock EADS# is
sampled active.
KEN# I The cache enable pin is used to determine whether the current
cycle is cacheable or not and is consequently used to determine
cycle length. When the Pentium processor generates a cycle
that can be cached (CACHE# asserted) and KEN# is active, the
cycle will be transformed into a burst line fill cycle.
LOCK# O The bus lock pin indicates that the current bus cycle is locked.
The Pentium processor will not allow a bus hold when
LOCK# is asserted (but AHOLD and BOFF# are allowed).
LOCK# goes active in the first clock of the first locked bus
cycle and goes inactive after the BRDY# is returned for the last
locked bus cycle. LOCK# is guaranteed to be de-asserted for at
least one clock between back-to-back locked cycles.
NA# I An active next address input indicates that the external
memory system is ready to accept a new bus cycle although all
data transfers for the current cycle have not yet completed. The
Pentium processor will issue ADS# for a pending cycle two
clocks after NA# is asserted. The Pentium processor supports
up to 2 outstanding bus cycles.
RESET I RESET forces the Pentium processor to begin execution at a
known state. All the Pentium processor internal caches will be
invalidated upon the RESET. Modified lines in the data cache
are not written back. FLUSH#, FRCMC# and INIT are
sampled when RESET transitions from high to low to
determine if tristate test mode or checker mode will be entered,
or if BIST will be run.
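
The A31-A3 and byte enable entries in the table above work together: the address lines select an aligned 8-byte quadword, and the byte enables select bytes within it. The sketch below illustrates this for an access that fits inside one quadword. It is a simplified model; the function name `bus_signals` is hypothetical, and it assumes BE0# corresponds to the least significant byte lane (D7-D0) and BE7# to the most significant (D63-D56).

```python
def bus_signals(addr: int, size: int) -> tuple[int, int]:
    """Return (value on A31-A3, BE7#-BE0# pattern) for one bus access.

    The BE# pins are active low, so a 0 bit in the returned pattern
    marks a byte lane that takes part in the transfer.
    """
    lane = addr & 0x7                      # first byte lane within the quadword
    if size < 1 or lane + size > 8:
        raise ValueError("access must fit inside one aligned 8-byte quadword")
    enabled = ((1 << size) - 1) << lane    # 1 = lane carries data
    be_n = ~enabled & 0xFF                 # invert for the active-low pins
    return addr >> 3, be_n


if __name__ == "__main__":
    quadword, be_n = bus_signals(0x1002, 2)   # 16-bit access at byte address 0x1002
    print(hex(quadword), format(be_n, "08b"))
```

The three low-order address bits never appear on the pins; external logic reconstructs them from the byte enable pattern when it needs them.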


REAL MODE

RISC

A Complex Instruction Set Computer (CISC) provides a large and powerful range of
instructions, which is less flexible to implement. For example, the 8086 microprocessor
family has these instructions:

JA Jump if Above
JAE Jump if Above or Equal
JB Jump if Below

By contrast, the Reduced Instruction Set Computer (RISC) concept is to identify the
sub-components and use those. As these are much simpler, they can be implemented directly
in silicon, so will run at the maximum possible speed. Nothing is 'translated'.

Most modern CISC processors, such as the Pentium, use a fast RISC core with an
interpreter sitting between the core and the instructions. So when you are running
Windows 95 on a PC, it is not that much different from trying to get Windows 95 running on a
software PC emulator. Just imagine the power hidden inside the Pentium.

This is not to say that CISC processors cannot have a large number of registers, some do.
However, for its use, a typical RISC processor requires more registers to give it additional
flexibility. Gone are the days when you had two general purpose registers and an
'accumulator'.

One thing RISC does offer, though, is register independence.

The 8086 offers you fourteen registers, but with caveats:
The first four (A, B, C, and D) are Data registers (a.k.a. scratch-pad registers). They are
16-bit and can be accessed as two 8-bit registers; thus register A is really AH (A, high-order
byte) and AL (A, low-order byte). These can be used as general purpose registers, but they can
also have dedicated functions: Accumulator, Base, Count, and Data.
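
The way a 16-bit data register doubles as two 8-bit registers can be modelled as below. This is only an illustrative sketch; the class name `Reg16` and its `high`/`low` properties are hypothetical and stand in for pairs such as AH/AL.

```python
class Reg16:
    """A 16-bit data register viewed as two 8-bit halves (e.g. AX = AH:AL)."""

    def __init__(self, value: int = 0):
        self.value = value & 0xFFFF

    @property
    def high(self) -> int:
        # High-order byte, e.g. AH
        return self.value >> 8

    @high.setter
    def high(self, byte: int) -> None:
        self.value = ((byte & 0xFF) << 8) | (self.value & 0x00FF)

    @property
    def low(self) -> int:
        # Low-order byte, e.g. AL
        return self.value & 0xFF

    @low.setter
    def low(self, byte: int) -> None:
        self.value = (self.value & 0xFF00) | (byte & 0xFF)


if __name__ == "__main__":
    ax = Reg16(0x12AB)
    print(hex(ax.high), hex(ax.low))
    ax.low = 0xFF          # writing AL leaves AH untouched
    print(hex(ax.value))
```

Writing one half changes the 16-bit value seen through the full register, which is one of the caveats a programmer has to keep in mind.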

The advantages of RISC over CISC today are:

• RISC processors are much simpler to build, which in turn results in the following
advantages:
o easier to build, i.e. you can use already existing production facilities
o much less expensive; just compare the price of an XScale with that of a
Pentium III at 1 GHz
o less power consumption, which again gives two advantages:
    - much longer use of battery-driven devices
    - no need for cooling of the device, which again gives two advantages:
        - smaller design of the whole device
        - no noise

• RISC processors are much simpler to program, which helps not only the assembler
programmer but the compiler designer, too. You'll hardly find any compiler which uses
all the functions of a Pentium III optimally.

SUPER SCALAR

A superscalar CPU architecture implements a form of parallelism called instruction
level parallelism within a single processor. It therefore allows faster CPU throughput than
would otherwise be possible at a given clock rate. A superscalar processor executes more
than one instruction during a clock cycle by simultaneously dispatching multiple
instructions to redundant functional units on the processor. Each functional unit is not a
separate CPU core but an execution resource within a single CPU such as an arithmetic
logic unit, a bit shifter, or a multiplier.

While a superscalar CPU is typically also pipelined, pipelining and superscalar
architecture are considered different performance enhancement techniques.

The superscalar technique is traditionally associated with several identifying
characteristics (within a given CPU core):

• Instructions are issued from a sequential instruction stream
• CPU hardware dynamically checks for data dependencies between instructions at
run time (versus software checking at compile time)
• The CPU accepts multiple instructions per clock cycle

The simplest processors are scalar processors. Each instruction executed by a scalar
processor typically manipulates one or two data items at a time. By contrast, each
instruction executed by a vector processor operates simultaneously on many data items.
An analogy is the difference between scalar and vector arithmetic. A superscalar
processor is sort of a mixture of the two. Each instruction processes one data item, but
there are multiple redundant functional units within each CPU thus multiple instructions
can be processing separate data items concurrently.

Superscalar CPU design emphasizes improving the instruction dispatcher accuracy, and
allowing it to keep the multiple functional units in use at all times. This has become
increasingly important as the number of units has increased. While early superscalar
CPUs would have two ALUs and a single FPU, a modern design such as the PowerPC
970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective
at keeping all of these units fed with instructions, the performance of the system will
suffer.


A superscalar processor usually sustains an execution rate in excess of one instruction per
machine cycle. But merely processing multiple instructions concurrently does not make
an architecture superscalar, since pipelined, multiprocessor or multi-core architectures
also achieve that, but with different methods.

In a superscalar CPU the dispatcher reads instructions from memory and decides which
ones can be run in parallel, dispatching them to redundant functional units contained
inside a single CPU. Therefore a superscalar processor can be envisioned as having multiple
parallel pipelines, each of which is processing instructions simultaneously from a single
instruction thread.

Existing binary executable programs have varying degrees of intrinsic parallelism. In
some cases instructions are not dependent on each other and can be executed
simultaneously. In other cases they are inter-dependent: one instruction impacts either
resources or results of the other. The instructions a = b + c; d = e + f can be run in
parallel because none of the results depend on other calculations. However, the
instructions a = b + c; b = e + f might not be runnable in parallel, depending on the
order in which the instructions complete while they move through the units.
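
The two examples above can be checked mechanically. The sketch below classifies the register dependencies between two instructions; the `Instr` tuple and the `hazards` function are hypothetical names, and the RAW/WAR/WAW labels are the usual textbook terms for read-after-write, write-after-read and write-after-write dependencies.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str                 # the variable or register written
    srcs: tuple[str, ...]     # the variables or registers read


def hazards(first: Instr, second: Instr) -> set[str]:
    """Return the dependencies that `second` has on the earlier `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")      # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")      # second overwrites what first still has to read
    if first.dest == second.dest:
        found.add("WAW")      # both write the same destination
    return found


if __name__ == "__main__":
    # a = b + c; d = e + f  -> independent
    print(hazards(Instr("a", ("b", "c")), Instr("d", ("e", "f"))))
    # a = b + c; b = e + f  -> the second writes b, which the first reads
    print(hazards(Instr("a", ("b", "c")), Instr("b", ("e", "f"))))
```

An empty result means the pair can safely run in parallel; the second example reports a write-after-read dependency, which is why its outcome depends on the order in which the two instructions complete.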

When the number of simultaneously issued instructions increases, the cost of dependency
checking increases extremely rapidly. This is exacerbated by the need to check
dependencies at run time and at the CPU's clock rate. This cost includes additional logic
gates required to implement the checks.


PIPELINE AND INSTRUCTION FLOW
The integer instructions traverse a five-stage pipeline in the Pentium processor.

The pipeline stages are as follows:
PF Prefetch
D1 Instruction Decode
D2 Address Generate
EX Execute - ALU and Cache Access
WB Writeback
The Pentium processor is a superscalar machine, built around two general purpose integer
pipelines and a pipelined floating-point unit capable of executing two instructions in
parallel. Both pipelines operate in parallel allowing integer instructions to execute in a
single clock in each pipeline. The figure “Pentium® Processor Pipeline Execution” depicts
instruction flow in the Pentium processor.
The pipelines in the Pentium processor are called the “u” and “v” pipes and the process
of issuing two instructions in parallel is termed “pairing.” The u-pipe can execute any
instruction in the Intel architecture, while the v-pipe can execute “simple” instructions as
defined in the “Instruction Pairing Rules” section of this chapter. When instructions are
paired, the instruction issued to the v-pipe is always the next sequential instruction after
the one issued to the u-pipe.
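
To visualise how instructions move through the five stages, the sketch below builds a clock-by-clock occupancy table. It is an idealized model under stated assumptions: no stalls occur, and with `issue_width=2` every consecutive pair of instructions is assumed to satisfy the pairing rules. The names `schedule` and `STAGES` are hypothetical.

```python
STAGES = ["PF", "D1", "D2", "EX", "WB"]


def schedule(n_instructions: int, issue_width: int = 1) -> list[list[str]]:
    """Return one row per clock giving the stage of each instruction.

    '--' means the instruction has not entered the pipeline yet or has
    already completed. issue_width=1 models a single pipe; issue_width=2
    models the u-pipe and v-pipe issuing a pair every clock.
    """
    if n_instructions < 1 or issue_width < 1:
        raise ValueError("need at least one instruction and one pipe")
    total_clocks = (n_instructions - 1) // issue_width + len(STAGES)
    table = []
    for clock in range(total_clocks):
        row = []
        for i in range(n_instructions):
            stage = clock - i // issue_width   # instruction i enters PF at clock i // width
            row.append(STAGES[stage] if 0 <= stage < len(STAGES) else "--")
        table.append(row)
    return table


if __name__ == "__main__":
    for width in (1, 2):
        print(f"issue width {width}:")
        for clock, row in enumerate(schedule(4, width)):
            print(f"  clock {clock}: " + " ".join(row))
```

With four instructions, the single-pipe case needs eight clocks while the paired case needs six, which shows where the superscalar gain comes from once the pipeline is full.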

Pentium® Processor Pipeline Execution

The Pentium processor pipeline has been optimized to achieve higher throughput
compared to previous generations of Intel Architecture processors.
The first stage of the pipeline is the Prefetch (PF) stage in which instructions are
prefetched from the on-chip instruction cache or memory. Because the Pentium processor
has separate caches for instructions and data, prefetches do not conflict with data
references for access to the cache. If the requested line is not in the code cache, a memory
reference is made. In the PF stage of the Pentium processor, two independent pairs of
line-size (32-byte) prefetch buffers operate in conjunction with the Branch Target
Buffer. This allows one prefetch buffer to prefetch instructions sequentially, while the
other prefetches according to the branch target buffer predictions. The pipeline stage after
the PF stage in the Pentium processor is Decode 1 (D1) in which two parallel decoders
attempt to decode and issue the next two sequential instructions. The decoders determine
whether one or two instructions can be issued contingent upon the “Instruction Pairing
Rules.” The Pentium processor requires an extra D1 clock to decode instruction
prefixes. Prefixes are issued to the u-pipe at the rate of one per clock without pairing.
After all prefixes have been issued, the base instruction will then be issued and paired
according to the pairing rules.
The D1 stage is followed by Decode2 (D2) in which addresses of memory resident
operands are calculated. In the Intel486 processor, instructions containing both a
displacement and an immediate, or instructions containing a base and index addressing
mode, require an additional D2 clock to decode. The Pentium processor removes both of
these restrictions and is able to issue instructions in these categories in a single clock.
The Pentium processor uses the Execute (EX) stage of the pipeline for both ALU
operations and for data cache access; therefore those instructions specifying both an ALU
operation and a data cache access will require more than one clock in this stage. In EX all
u-pipe instructions and all v-pipe instructions except conditional branches are verified for
correct branch prediction. Microcode is designed to utilize both pipelines and thus those
instructions requiring microcode execute faster.
The final stage is Writeback (WB) where instructions are enabled to modify processor
state and complete execution. In this stage, v-pipe conditional branches are verified for
correct branch prediction. During their progression through the pipeline, instructions may
be stalled due to certain conditions. Both the u-pipe and v-pipe instructions enter and
leave the D1 and D2 stages in unison. When an instruction in one pipe is stalled, then
the instruction in the other pipe is also stalled at the same pipeline stage. Thus both the u-
pipe and the v-pipe instructions enter the EX stage in unison. Once in EX if the u-pipe
instruction is stalled, then the v-pipe instruction (if any) is also stalled. If the v-pipe
instruction is stalled then the instruction paired with it in the u-pipe is not allowed to
advance. No successive instructions are allowed to enter the EX stage of either pipeline
until the instructions in both pipelines have advanced to WB.

INSTRUCTION PREFETCH
In the Pentium processor PF stage, two independent pairs of line-size (32-byte) prefetch
buffers operate in conjunction with the branch target buffer. Only one prefetch buffer
actively requests prefetches at any given time. Prefetches are requested sequentially until
a branch instruction is fetched. When a branch instruction is fetched, the branch target
buffer (BTB) predicts whether the branch will be taken or not. If the branch is predicted
not taken, prefetch requests continue linearly. On a predicted taken branch the other
prefetch buffer is enabled and begins to prefetch as though the branch was taken. If a
branch is discovered mis-predicted, the instruction pipelines are flushed and prefetching
activity starts over.

Integer Instruction Pairing Rules
The Pentium processor can issue one or two instructions every clock. In order to issue
two instructions simultaneously they must satisfy the following conditions:
• Both instructions in the pair must be “simple” as defined below.
• There must be no read-after-write or write-after-write register dependencies between
them.
• Neither instruction may contain both a displacement and an immediate.
• Instructions with prefixes can only occur in the u-pipe.
Simple instructions are entirely hardwired; they do not require any microcode control
and, in general, execute in one clock. The exceptions are the ALU mem, reg and ALU
reg, mem instructions, which are three and two clock operations respectively.
Instruction prefixes are treated as separate 1-byte instructions. Sequencing hardware is
used to allow them to function as simple instructions.
The following integer instructions are considered simple and may be paired:
1. mov reg, reg/mem/imm
2. mov mem, reg/imm
3. alu reg, reg/mem/imm
4. alu mem, reg/imm
5. inc reg/mem
6. dec reg/mem
7. push reg/mem
8. pop reg
9. lea reg,mem
10. jmp/call/jcc near
11. nop
12. test reg, reg/mem
13. test acc, imm

In addition, conditional and unconditional branches may be paired only if they occur as
the second instruction in the pair. They may not be paired with the next sequential
instruction. Also, SHIFT/ROT by 1 and SHIFT by imm may pair as the first instruction
in a pair. The register dependencies that prohibit instruction pairing include implicit
dependencies via registers or flags not explicitly encoded in the instruction. For example,
an ALU instruction in the u-pipe (which sets the flags) may not be paired with an ADC or
an SBB instruction in the v-pipe. There are two exceptions to this rule. The first is the
commonly occurring sequence of compare and branch which may be paired. The second
exception is pairs of pushes or pops. Although these instructions have an implicit
dependency on the stack pointer, special hardware is included to allow these common
operations to proceed in parallel. Although in general two paired instructions may
proceed in parallel independently, there is an exception for paired “read-modify-write”
instructions. Read-modify-write instructions are ALU operations with an operand in
memory. When two of these instructions are paired there is a sequencing delay of two
clocks in addition to the three clocks required to execute the individual instructions.
Although instructions may execute in parallel their behavior as seen by the programmer
is exactly the same as if they were executed sequentially.
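
The pairing conditions listed above can be sketched as a small checker. This is only an illustrative model: the Instr fields and the can_pair name are hypothetical, only general-purpose registers are tracked, and the flag and stack-pointer special cases (compare and branch, paired pushes or pops) are deliberately left out.

```python
from dataclasses import dataclass, field

# Instruction classes taken from the "simple" list in the text.
SIMPLE = {"mov", "alu", "inc", "dec", "push", "pop", "lea",
          "jmp", "call", "jcc", "nop", "test"}
BRANCHES = {"jmp", "call", "jcc"}


@dataclass
class Instr:
    """Hypothetical instruction record used only by this sketch."""
    op: str                                     # instruction class, e.g. "alu"
    reads: set = field(default_factory=set)     # general registers read
    writes: set = field(default_factory=set)    # general registers written
    disp_and_imm: bool = False                  # has displacement AND immediate
    prefixed: bool = False                      # carries an instruction prefix


def can_pair(u, v):
    """Return True if u (u-pipe) and v (v-pipe) may issue together."""
    if u.op not in SIMPLE or v.op not in SIMPLE:
        return False                 # both must be simple
    if u.disp_and_imm or v.disp_and_imm:
        return False                 # displacement + immediate forbids pairing
    if v.prefixed:
        return False                 # prefixed instructions only in the u-pipe
    if u.op in BRANCHES:
        return False                 # branches pair only as the second instruction
    if u.writes & v.reads:
        return False                 # read-after-write dependency
    if u.writes & v.writes:
        return False                 # write-after-write dependency
    return True
```

A write-after-read relationship (v overwrites a register that u only reads) is not rejected here, which matches the list above: only read-after-write and write-after-write dependencies are named as blocking.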


BRANCH PREDICTION
Branch Target Buffer (BTB)

The Pentium processor uses a Branch Target Buffer (BTB) to predict the outcome of
branch instructions which minimizes pipeline stalls due to prefetch delays.
The Pentium processor accesses the BTB with the address of the instruction in the D1
stage. It contains a Branch prediction state machine with four states: (1) strongly not
taken, (2) weakly not taken, (3) weakly taken, and (4) strongly taken. In the event of a
correct prediction, a branch will execute without pipeline stalls or flushes. Branches
which miss the BTB are assumed to be not taken. Conditional and unconditional near
branches and near calls execute in 1 clock and may be executed in parallel with other
integer instructions. A mispredicted branch (whether a BTB hit or miss) or a correctly
predicted branch with the wrong target address will cause the pipelines to be flushed and
the correct target to be fetched. Incorrectly predicted unconditional branches will incur an
additional three clock delay, incorrectly predicted conditional branches in the u-pipe will
incur an additional three clock delay, and incorrectly predicted conditional branches in
the v-pipe will incur an additional four clock delay.
Figure: Branch prediction state machine. H: History; P: Prediction; T: Taken; NT: Not Taken.
States shown: H: 11, P: T; H: 10, P: T; H: 01, P: T; H: 00, P: NT.
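
A two-bit history predictor of this kind can be sketched as follows. The prediction rule follows the state labels in the figure (only history 00 predicts not taken). The transitions, a plain saturating counter, are an assumption of this sketch, since the text does not spell out the arrows. The function names are illustrative.

```python
def predict(history):
    """Predict taken for every history value except 00 (per the figure labels)."""
    return history != 0


def update(history, taken):
    """Assumed transition rule: saturating 2-bit counter (0..3)."""
    if taken:
        return min(history + 1, 3)
    return max(history - 1, 0)


def count_mispredictions(outcomes, history=0):
    """Run the predictor over a list of actual outcomes (True = taken)."""
    misses = 0
    for taken in outcomes:
        if predict(history) != taken:
            misses += 1
        history = update(history, taken)
    return misses


# A loop branch taken nine times and then falling through once.
loop_outcomes = [True] * 9 + [False]
print(count_mispredictions(loop_outcomes))
```

Under these assumptions a loop branch is mispredicted only on its first encounter and on loop exit, which is why the steady-state iterations of a loop run without flushes.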
The benefits of branch prediction are illustrated in the following example. Consider the
following loop from a benchmark program for computing prime numbers:
    for (k = i + prime; k <= SIZE; k += prime)
        flags[k] = FALSE;
A popular compiler generates the following assembly code
(prime is allocated to ecx, k is allocated to edx, and al contains the value FALSE):
    inner_loop:
        mov byte ptr flags[edx], al
        add edx, ecx
        cmp edx, SIZE
        jle inner_loop
Each iteration of this loop will execute in 6 clocks on the Intel486 CPU. On the Pentium
processor, the mov is paired with the add; the cmp with the jle. With branch
prediction, each loop iteration executes in 2 clocks.


CACHE

ON-CHIP CACHES
The Pentium processor implements two internal caches for a total integrated cache size of
16 Kbytes: an 8 Kbyte data cache and a separate 8 Kbyte code cache. These caches are
transparent to application software to maintain compatibility with previous Intel
Architecture generations. The data cache fully supports the MESI
(modified/exclusive/shared/invalid) writeback cache consistency protocol. The code cache
is inherently write protected to prevent code from being inadvertently corrupted, and as a
consequence supports a subset of the MESI protocol, the S (shared) and I (invalid) states.
The caches have been designed for maximum flexibility and performance. The data cache is
configurable as writeback or writethrough on a line-by-line basis. Memory areas can be
defined as non-cacheable by software and external hardware. Cache writeback and
invalidations can be initiated by hardware or software. Protocols for cache consistency
and line replacement are implemented in hardware, easing system design.
On the Pentium processor, each of the caches is 8 Kbytes in size and each is organized as
a 2-way set associative cache. There are 128 sets in each cache, each set containing 2
lines (each line has its own tag address). Each cache line is 32 bytes wide. Replacement
in both the data and instruction caches is handled by the LRU mechanism, which requires
one bit per set in each of the caches.
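
The geometry above (8 Kbytes, 2 ways, 32-byte lines, 128 sets) implies how a 32-bit physical address divides into tag, set index and byte offset. The sketch below assumes the conventional field order (offset in the low bits, then set index, then tag). The constant and function names are illustrative.

```python
CACHE_BYTES = 8 * 1024
WAYS = 2
LINE_BYTES = 32
SETS = CACHE_BYTES // (WAYS * LINE_BYTES)      # 8192 / (2 * 32) = 128

OFFSET_BITS = LINE_BYTES.bit_length() - 1      # log2(32)  = 5
INDEX_BITS = SETS.bit_length() - 1             # log2(128) = 7


def split_address(addr):
    """Split a 32-bit physical address into (tag, set index, byte offset)."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


tag, index, offset = split_address(0xFFFFFFF0)
print(hex(tag), index, offset)
```

With 5 offset bits and 7 index bits, the remaining 20 bits of a 32-bit address form the tag stored with each line.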

Cache Structure
The instruction and data caches can be accessed simultaneously. The instruction cache
can provide up to 32 bytes of raw opcodes and the data cache can provide data for two
data references, all in the same clock. This capability is implemented partially through
the tag structure. The tags in the data cache are triple ported. One of the ports is
dedicated to snooping while the other two are used to look up two independent addresses
corresponding to data references from each of the pipelines. The instruction cache
tags of the Pentium processor are also triple ported. Again, one port is dedicated to
support snooping and the other two ports facilitate split line accesses (simultaneously
accessing the upper half of one line and the lower half of the next line). Each of the
caches is parity protected. The operating modes of the caches are controlled by the CD
(cache disable) and NW (not writethrough) bits in CR0.

TLB (Translation Lookaside Buffers)
Each of the caches is accessed with physical addresses and each cache has its own TLB
(translation lookaside buffer) to translate linear addresses to physical addresses. The
TLBs associated with the instruction cache are single ported whereas the data cache
TLBs are fully dual ported to be able to translate two independent linear addresses for
two data references simultaneously.


The goal of an effective memory system is that the effective access time that the
processor sees is very close to the access time of the cache. Most accesses that the
processor makes are satisfied within the cache level. The achievement of this
goal depends on many factors: the architecture of the processor, the behavioral properties
of the programs being executed, and the size and organization of the cache. Caches work
on the basis of the locality of program behavior. There are three principles involved:

1. Spatial Locality - Given an access to a particular location in memory, there is a
high probability that other accesses will be made to either that or neighboring
locations within the lifetime of the program.
2. Temporal Locality - This is complementary to spatial locality. Given a sequence
of references to n locations, there is a high probability that references following
this sequence will be made into the sequence. Elements of the sequence will again
be referenced during the lifetime of the program.
3. Sequentiality - Given that a reference has been made to a particular location s, it is
likely that within the next several references a reference to the location of s + 1
will be made. Sequentiality is a restricted type of spatial locality and can be
regarded as a subset of it.

Some common terms
Processor references that are found in the cache are called cache hits. References not
found in the cache are called cache misses. On a cache miss, the cache control
mechanism must fetch the missing data from memory and place it in the cache. Usually
the cache fetches a spatially local block, called the line, from memory. The physical
word is the basic unit of access in the memory.
The processor-cache interface can be characterized by a number of parameters. Those
that directly affect processor performance include:

1. Access time for a reference found in the cache (a hit) - property of the cache size
and organization.
2. Access time for a reference not found in the cache (a miss) - property of the
memory organization.
3. Time to initially compute a real address given a virtual address (not-in-TLB-time)
- property of the address translation facility, which, though strictly speaking, is
not part of the cache, resembles the cache in most aspects and is discussed in this
chapter.
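
The hit and miss access times listed above combine into the effective access time seen by the processor. A minimal sketch follows, using the standard weighted-average formula and assuming miss_time is the total time to satisfy a reference that misses. The numbers in the usage line are illustrative only, not Pentium figures.

```python
def effective_access_time(hit_time, miss_time, hit_rate):
    """Weighted average of hit and miss access times.

    hit_rate is the fraction of references found in the cache (0.0 to 1.0).
    """
    return hit_rate * hit_time + (1.0 - hit_rate) * miss_time


# Illustrative values: 1-clock hit, 10-clock miss, 90% hit rate.
print(effective_access_time(1, 10, 0.9))
```

The closer the hit rate is to 1, the closer the result is to the cache access time, which is the goal stated at the start of this section.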

Data Cache Consistency Protocol (MESI Protocol)
The Pentium processor Cache Consistency Protocol is a set of rules by which states are
assigned to cached entries (lines). The rules apply for memory read/write cycles only. I/O
and special cycles are not run through the data cache. Every line in the Pentium processor
data cache is assigned a state dependent on both Pentium processor generated activities
and activities generated by other bus masters (snooping). The Pentium processor Data
Cache Protocol consists of four states that define whether a line is valid (HIT/MISS), if it
is available in other caches, and if it has been MODIFIED. The four states are the M
(Modified), E (Exclusive), S (Shared) and the I (Invalid) states and the protocol is
referred to as the MESI protocol. A definition of the states is given below:

M - Modified: An M-state line is available in ONLY one cache and it is also MODIFIED
(different from main memory). An M-state line can be accessed (read/written
to) without sending a cycle out on the bus.
E - Exclusive: An E-state line is also available in ONLY one cache in the system, but the
line is not MODIFIED (i.e., it is the same as main memory). An E-state line can be
accessed (read/written to) without generating a bus cycle. A write to an E-state line will
cause the line to become MODIFIED.
S - Shared: This state indicates that the line is potentially shared with other caches (i.e.
the same line may exist in more than one cache). A read to an S-state line will not
generate bus activity, but a write to a SHARED line will generate a write through cycle
on the bus. The write through cycle may invalidate this line in other caches. A write to an
S-state line will update the cache.
I - Invalid: This state indicates that the line is not available in the cache. A read to this
line will be a MISS and may cause the Pentium processor to execute a LINE FILL (fetch
the whole line into the cache from main memory). A write to an INVALID line will
cause the Pentium processor to execute a write-through cycle on the bus.
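
The state definitions above can be summarized as a transition function for accesses made by the local processor. This is a simplified sketch: the function name and return convention are hypothetical, the state entered after a line fill is taken as E, and a written S-state line is kept in S. On the real processor the next state in those two cases can also depend on external pin sampling, which is ignored here.

```python
def local_access(state, op):
    """Return (new_state, bus_cycle) for a processor read or write.

    state is one of "M", "E", "S", "I"; op is "read" or "write".
    bus_cycle is None when the access completes inside the cache.
    """
    if op == "read":
        if state == "I":
            return "E", "line fill"          # miss: assumed to fill into E
        return state, None                   # hit in M, E or S: no bus activity
    if op == "write":
        if state in ("M", "E"):
            return "M", None                 # hit: line becomes (or stays) modified
        if state == "S":
            return "S", "write-through"      # cache updated and bus written
        return "I", "write-through"          # miss: written to the bus only
    raise ValueError("op must be 'read' or 'write'")


for s in "MESI":
    print(s, local_access(s, "write"))
```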

Inquire Cycles (Snooping)
The purpose of inquire cycles is to check whether the address being presented is
contained within the caches in the Pentium processor.



Cache Organization
Within the cache, there are three basic types of organization:

1. Direct Mapped
2. Fully Associative
3. Set Associative

In fully associative mapping, when a request is made to the cache, the requested address
is compared in a directory against all entries in the directory. If the requested address is
found (a directory hit), the corresponding location in the cache is fetched and returned to
the processor; otherwise, a miss occurs.


Fully Associative Cache

In a direct mapped cache, lower order line address bits are used to access the directory.
Since multiple line addresses map into the same location in the cache directory, the upper
line address bits (tag bits) must be compared with the directory address to ensure a hit. If
a comparison is not valid, the result is a cache miss, or simply a miss. The address given
to the cache by the processor actually is subdivided into several pieces, each of which has
a different role in accessing data.


Direct Mapped Cache

The set associative cache operates in a fashion somewhat similar to the direct-mapped
cache. Bits from the line address are used to address a cache directory. However, now
there are multiple choices: two, four, or more complete line addresses may be present in
the directory. Each of these line addresses corresponds to a location in a sub-cache. The
collection of these sub-caches forms the total cache array. In a set associative cache, as in
the direct-mapped cache, all of these sub-arrays can be accessed simultaneously, together
with the cache directory. If any of the entries in the cache directory matches the reference
address, there is a hit, and the corresponding sub-cache array is selected and gated out to
the processor.
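
A 2-way set associative lookup can be sketched as below. The class is a toy model that tracks tags only (no data), with one LRU bit per set selecting the way to replace. The class and attribute names are hypothetical.

```python
class TwoWayCache:
    """Toy 2-way set associative cache directory with one LRU bit per set."""

    def __init__(self, num_sets, line_bytes):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        self.tags = [[None, None] for _ in range(num_sets)]
        self.lru = [0] * num_sets            # way to replace next in each set

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line = addr // self.line_bytes       # line address
        index = line % self.num_sets         # low line-address bits pick the set
        tag = line // self.num_sets          # upper bits are compared as the tag
        ways = self.tags[index]
        for way in (0, 1):
            if ways[way] == tag:
                self.lru[index] = 1 - way    # the other way is now least recent
                return True
        victim = self.lru[index]
        ways[victim] = tag                   # replace the least recently used way
        self.lru[index] = 1 - victim
        return False


cache = TwoWayCache(num_sets=128, line_bytes=32)
# 0x0000, 0x1000 and 0x2000 all map to set 0 with different tags.
print([cache.access(a) for a in (0x0000, 0x1000, 0x0000, 0x2000, 0x0000, 0x1000)])
```

Two lines that collide on the same set can coexist, which a direct-mapped cache cannot offer. A third colliding line evicts whichever of the two was used least recently.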

Set Associative Cache


Cache Calculation
A memory address is divided into three fields: Tag | Line/Set | Byte/Block.

Example: main memory = 16 Kbytes, cache = 512 bytes, line size = 16 bytes,
number of sets (ways) = 2.

Line size = 16 bytes = 2^4, so Byte/Block = 4 bits
Total number of address lines to address main memory: 16 Kbytes = 2^14, so 14 bits
Cache size = 512 bytes = 2^9
Sets (ways) = 2, so size of each set = 512 / 2 = 2^8 bytes
Lines per set = 2^8 / 2^4 = 2^4, so Line/Set = 4 bits
Total number of lines in main memory = 2^14 / 2^4 = 2^10
Tag size = (total number of lines in main memory) / (total number of lines in a cache set)
         = 2^10 / 2^4 = 2^6, so Tag = 6 bits

Check: 2^14 (Total) = 2^6 (Tag) * 2^4 (Line/Set) * 2^4 (Byte/Block)
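
The field widths worked out above can be checked with a short helper. It assumes every size is a power of two. The function name and argument names are illustrative.

```python
def cache_field_bits(memory_bytes, cache_bytes, ways, line_bytes):
    """Return (tag_bits, line_bits, byte_bits) for a set associative cache.

    All arguments are assumed to be powers of two.
    """
    byte_bits = line_bytes.bit_length() - 1              # log2(line size)
    lines_per_way = cache_bytes // ways // line_bytes
    line_bits = lines_per_way.bit_length() - 1           # log2(lines per way)
    address_bits = memory_bytes.bit_length() - 1         # log2(memory size)
    tag_bits = address_bits - line_bits - byte_bits
    return tag_bits, line_bits, byte_bits


# The worked example: 16 Kbyte memory, 512-byte cache, 2 ways, 16-byte lines.
print(cache_field_bits(16 * 1024, 512, 2, 16))
# The Pentium on-chip caches: 4 Gbyte address space, 8 Kbytes, 2 ways, 32-byte lines.
print(cache_field_bits(2 ** 32, 8 * 1024, 2, 32))
```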


THE X87 FPU
FLOATING-POINT UNIT
The floating-point unit (FPU) of the Pentium processor is integrated with the integer unit
on the first five stages of the U pipeline. The fifth stage, WB, becomes X1. The FPU is
heavily pipelined. It is designed to be able to accept one floating-point operation every
clock. It can receive up to two floating-point instructions every clock, one of which must
be an exchange instruction.

Floating-Point Pipeline Stages
The Pentium processor FPU has 8 pipeline stages, the first five of which it shares with
the integer unit. Integer instructions pass through only the first 5 stages. Integer
instructions use the fifth (X1) stage as a WB (write-back) stage. The 8 FP pipeline stages,
and the activities that are performed in them are summarized below:

PF Prefetch;
D1 Instruction Decode;
D2 Address generation;
EX Memory and register read; conversion of FP data to external memory format and
memory write;
X1 Floating-Point Execute stage one; conversion of external memory format to internal
FP data format and write operand to FP register file; bypass 1 (bypass 1 described in the
“Bypasses” section).
X2 Floating-Point Execute stage two;
WF Perform rounding and write floating-point result to register file; bypass 2 (bypass 2
described in the “Bypasses” section).
ER Error Reporting/Update Status Word.

FPU Bypasses

The Pentium processor stack architecture instruction set requires that all instructions have
one source operand on the top of the stack. Since most instructions also have their
destination as the top of the stack, most instructions see a “top of stack bottleneck.” New
source operands must be brought to the top of the stack before we can issue an arithmetic
instruction on them. This calls for extra usage of the exchange instruction, which allows
the programmer to bring an available operand to the top of the stack.

The following section describes the floating-point register file bypasses that exist on the
Pentium processor. The register file has two write ports and two read ports. The read
ports are used to read data out of the register file in the E stage. One write port is used to
write data into the register file in the X1 stage, and the other in the WF stage. A bypass
allows data that is about to be written into the register file to be available as an operand
that is to be read from the register file by any succeeding floating-point instruction. A
bypass is specified by a pair of ports (a write port and a read port) that get circumvented.
Using the bypass, data is made available even before actually writing it to the register
file.


The following bypasses are implemented:
1. Bypass the X1 stage register file write port and the E stage register file read port.
2. Bypass the WF stage register file write port and the E stage register file read port.
With bypass 1, the result of a floating-point load (that writes to the register file in the X1
stage) can bypass the X1 stage write and be sent directly to the operand fetch stage or E
stage of the next instruction. With bypass 2, the result of any arithmetic operation can
bypass the WF stage write to the register file, and be sent directly to the desired execution
unit as an operand for the next instruction.

PROGRAMMING WITH THE x87 FPU

The x87 Floating-Point Unit (FPU) provides high-performance floating-point processing
capabilities for use in graphics processing, scientific, engineering, and business
applications. It supports the floating-point, integer, and packed BCD integer data types
and the floating-point processing algorithms and exception handling architecture defined
in the IEEE Standard 754 for Binary Floating-Point Arithmetic.

X87 FPU EXECUTION ENVIRONMENT

The x87 FPU represents a separate execution environment within the IA-32 architecture. This
execution environment consists of eight data registers (called the x87 FPU data registers)
and the following special-purpose registers:
• Status register
• Control register
• Tag word register
• Last instruction pointer register
• Last data (operand) pointer register
• Opcode register
These registers are described in the following sections.

x87 FPU Data Registers

The x87 FPU data registers consist of eight 80-bit registers. Values are stored in these
registers in the double extended-precision floating-point format. When floating-point,
integer, or packed BCD integer values are loaded from memory into any of the x87 FPU
data registers, the values are automatically converted into double extended precision
floating-point format (if they are not already in that format). When computation results
are subsequently transferred back into memory from any of the x87 FPU registers, the
results can be left in the double extended-precision floating-point format or converted
back into a shorter floating-point format, an integer format, or the packed BCD integer
format.


x87 FPU Execution Environment

The x87 FPU instructions treat the eight x87 FPU data registers as a register stack. All
addressing of the data registers is relative to the register on the top of the stack. The
register number of the current top-of-stack register is stored in the TOP (stack TOP) field
in the x87 FPU status word. Load operations decrement TOP by one and load a value into
the new top-of-stack register, and store operations store the value from the current TOP
register in memory and then increment TOP by one. (For the x87 FPU, a load operation is
equivalent to a push and a store operation is equivalent to a pop.) Note that load and store
operations are also available that do not push and pop the stack.

x87 FPU Data Register Stack


If a load operation is performed when TOP is at 0, register wraparound occurs and the
new value of TOP is set to 7. The floating-point stack-overflow exception indicates when
wraparound might cause an unsaved value to be overwritten.
Many floating-point instructions have several addressing modes that permit the
programmer to implicitly operate on the top of the stack, or to explicitly operate on
specific registers relative to the TOP. Assemblers support these register addressing
modes, using the expression ST(0), or simply ST, to represent the current stack top and
ST(i) to specify the ith register from TOP in the stack (0 ≤ i ≤ 7). For example, if TOP
contains 011B (register 3 is the top of the stack), the following instruction would add the
contents of two registers in the stack (registers 3 and 5):

FADD ST, ST(2);
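
The mapping from a stack-relative name ST(i) to a physical register number is modulo-8 arithmetic on TOP. A minimal sketch, with illustrative function names:

```python
def physical_register(top, i):
    """Physical register number addressed by ST(i) for a given TOP."""
    return (top + i) % 8


def top_after_load(top):
    """A load (push) decrements TOP, wrapping from 0 to 7."""
    return (top - 1) % 8


def top_after_store_pop(top):
    """A store that pops increments TOP, wrapping from 7 to 0."""
    return (top + 1) % 8


# With TOP = 011B, FADD ST, ST(2) operates on physical registers 3 and 5.
print(physical_register(3, 0), physical_register(3, 2))
```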

The figure shows an example of how the stack structure of the x87 FPU registers and
instructions is typically used to perform a series of computations. Here, a
two-dimensional dot product is computed, as follows:
1. The first instruction (FLD value1) decrements the stack register pointer (TOP) and
loads the value 5.6 from memory into ST(0). The result of this operation is shown in
snapshot (a).
2. The second instruction multiplies the value in ST(0) by the value 2.4 from memory and
stores the result in ST(0), shown in snap-shot (b).
3. The third instruction decrements TOP and loads the value 3.8 in ST(0).
4. The fourth instruction multiplies the value in ST(0) by the value 10.3 from memory
and stores the result in ST(0), shown in snap-shot (c).
5. The fifth instruction adds the value in ST(0) and the value in ST(1) and stores the
result in ST(0), shown in snap-shot (d).
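
The five steps above can be traced with a toy model of the register stack. The class and method names are hypothetical stand-ins for FLD, FMUL with a memory operand, and FADD with a stack operand. Python floats are 64-bit doubles rather than the 80-bit double extended-precision format, and the starting value of TOP is arbitrary.

```python
class X87Stack:
    """Toy model of the eight-register x87 data register stack."""

    def __init__(self):
        self.regs = [0.0] * 8
        self.top = 0                          # arbitrary starting TOP

    def st(self, i=0):
        """Read ST(i), addressed relative to TOP."""
        return self.regs[(self.top + i) % 8]

    def fld(self, value):
        """Load: decrement TOP, then write the new ST(0)."""
        self.top = (self.top - 1) % 8
        self.regs[self.top] = value

    def fmul_mem(self, value):
        """Multiply ST(0) by a memory operand, result in ST(0)."""
        self.regs[self.top] *= value

    def fadd_st(self, i):
        """Add ST(i) to ST(0), result in ST(0)."""
        self.regs[self.top] += self.st(i)


fpu = X87Stack()
fpu.fld(5.6)          # step 1
fpu.fmul_mem(2.4)     # step 2
fpu.fld(3.8)          # step 3
fpu.fmul_mem(10.3)    # step 4
fpu.fadd_st(1)        # step 5
result = fpu.st(0)
print(result)
```

After the last step ST(0) holds the dot product 5.6 * 2.4 + 3.8 * 10.3 (up to floating-point rounding), and ST(1) still holds the first partial product.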

Example x87 FPU Dot Product Computation


MICROPROCESSOR INITIALIZATION AND
CONFIGURATION
Before normal operation of the Pentium processor can begin, the Pentium processor must
be initialized by driving the RESET pin active. The RESET pin forces the Pentium
processor to begin execution in a known state. Several features are optionally invoked at
the falling edge of RESET: Built-in-Self-Test (BIST), Functional Redundancy Checking
and Tristate Test Mode.
In addition to the standard RESET pin, the Pentium processor has implemented an
initialization pin (INIT) that allows the processor to begin execution in a known state
without disrupting the contents of the internal caches or the floating-point state.

POWER UP SPECIFICATIONS
During power up, RESET must be asserted while VCC is approaching nominal operating
voltage to prevent internal bus contention which could negatively affect the reliability of
the processor. It is recommended that CLK begin toggling within 150 ms after VCC
reaches its proper operating level. This recommendation is only to ensure long term
reliability of the device.
In order for RESET to be recognized, the CLK input needs to be toggling. RESET must
remain asserted for 1 millisecond after VCC and CLK have reached their AC/DC
specifications.

TEST AND CONFIGURATION FEATURES (BIST, FRC,
TRISTATE TEST MODE)

The INIT, FLUSH#, and FRCMC# inputs are sampled when RESET transitions from
high to low to determine if BIST will be run, or if tristate test mode or checker mode will
be entered (respectively). If RESET is driven synchronously, these signals must be at
their valid level and meet setup and hold times on the clock before the falling edge of
RESET. If RESET is asserted asynchronously, these signals must be at their valid level
two clocks before and after RESET transitions from high to low.

Built In Self-Test
Self-test is initiated by driving the INIT pin high when RESET transitions from high
to low. No bus cycles are run by the Pentium processor during self test. The duration of
self test is approximately 2^19 core clocks. Approximately 70% of the devices in the
Pentium processor are tested by BIST. The Pentium processor BIST consists of two parts:
hardware self-test and microcode self-test. During the hardware portion of BIST, the
microcode ROM and all large PLAs are tested. All possible input combinations of the
microcode ROM and PLAs are tested. The constant ROMs, BTB, TLBs, and all caches
are tested by the microcode portion of BIST. The array tests (caches, TLBs and BTB)
have two passes. On the first pass, data patterns are written to arrays, read back and
checked for mismatches. The second pass writes the complement of the initial data
pattern, reads it back, and checks for mismatches. The constant ROMs are tested by using
the microcode to add various constants and check the result against a stored value.


Upon completion of BIST, the cumulative result of all tests is stored in the
EAX register. If EAX contains 0h, then all checks passed; any non-zero result indicates a
faulty unit.

Tristate Test Mode
When the FLUSH# pin is sampled low when RESET transitions from high to low, the
Pentium processor enters tristate test mode. The Pentium processor floats all of its output
pins and bidirectional pins including pins which are never floated during normal
operation (except TDO). Tristate test mode can be initiated in order to facilitate testing by
external circuitry to test board interconnects. The Pentium processor remains in tristate
test mode until the RESET pin is asserted again.

Functional Redundancy Checking
The functional redundancy checking master/checker configuration input is sampled when
RESET is high to determine whether the Pentium processor is configured in master mode
(FRCMC# high) or checker mode (FRCMC# low). The final master/checker
configuration of the Pentium processor is determined the clock before the falling edge of
RESET. When configured as a master, the Pentium processor drives its output pins as
required by the bus protocol. When configured as a checker, the Pentium processor
tristates all outputs (except IERR#, PICD0, PICD1 and TDO) and samples the output
pins (that would normally be driven in master mode). If the sampled value differs from
the value computed internally, the Pentium processor asserts IERR# to indicate an error.

INITIALIZATION WITH RESET, INIT AND BIST
Two pins, RESET and INIT, are used to reset the Pentium processor in different manners. A
“cold” or “power on” RESET refers to the assertion of RESET while power is initially being
applied to the Pentium processor. A “warm” RESET refers to the assertion of RESET or INIT
while VCC and CLK remain within specified operating limits.
Table 3-1 shows the effect of asserting RESET and/or INIT.

Toggling either the RESET pin or the INIT pin individually forces the Pentium processor
to begin execution at address FFFFFFF0h. The internal instruction cache and data cache
are invalidated when RESET is asserted (modified lines in the data cache are NOT
written back). The instruction cache and data cache are not altered when the INIT pin is
asserted without RESET. In both cases, the branch target buffer (BTB) and translation
lookaside buffers (TLBs) are invalidated. After RESET (with or without BIST) or INIT,
the Pentium processor will start executing instructions at location FFFFFFF0h. When the
first Intersegment Jump or Call instruction is executed, address lines A20-A31 will be
driven low for CS-relative memory cycles and the Pentium processor will only execute
instructions in the lower one Mbyte of physical memory. This allows the system designer
to use a ROM at the top of physical memory to initialize the system. RESET is internally
hardwired and forces the Pentium processor to terminate all execution and bus cycle
activity within 2 clocks. No instruction or bus activity will occur as long as RESET is
active. INIT is implemented as an edge triggered interrupt and will be recognized when
an instruction boundary is reached. As soon as the Pentium processor completes the INIT
sequence, instruction execution and bus cycle activity will continue at address
FFFFFFF0h even if the INIT pin is not deasserted. At the conclusion of RESET (with or
without self-test) or INIT, the DX register will contain a component identifier. The upper
byte will contain 05h and the lower byte will contain a stepping identifier.
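
Start-up code could interpret the register values described above roughly as follows. This is a sketch: the function names are hypothetical and the DX value in the usage line is made up for illustration.

```python
def bist_passed(eax):
    """BIST reports success by leaving 0h in EAX."""
    return eax == 0


def decode_reset_dx(dx):
    """Split DX after RESET/INIT into (upper byte, stepping identifier)."""
    upper = (dx >> 8) & 0xFF       # expected to be 05h on the Pentium processor
    stepping = dx & 0xFF
    return upper, stepping


upper, stepping = decode_reset_dx(0x0517)   # 0x0517 is an illustrative value
print(hex(upper), hex(stepping), bist_passed(0))
```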


BUS CYCLES
The Pentium processor bus is designed to support a 528-Mbyte/sec data transfer rate at 66
MHz. All data transfers occur as a result of one or more bus cycles.
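
The quoted peak rate follows from the bus width and clock. A small check, assuming one full-width transfer on the 64-bit data bus in every clock. The function name is illustrative.

```python
def peak_bandwidth_mbytes(bus_bits, clock_mhz):
    """Peak transfer rate in Mbytes/sec with one transfer per clock."""
    return (bus_bits // 8) * clock_mhz


print(peak_bandwidth_mbytes(64, 66))   # 8 bytes per clock at 66 MHz
```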

PHYSICAL MEMORY AND I/O INTERFACE
Pentium processor memory is accessible in 8-, 16-, 32-, and 64-bit quantities. Pentium
processor I/O is accessible in 8-, 16-, and 32-bit quantities. The Pentium processor can
directly address up to 4 Gbytes of physical memory, and up to 64 Kbytes of I/O.
In hardware, memory space is organized as a sequence of 64-bit quantities. Each 64-bit
location has eight individually addressable bytes at consecutive memory addresses.

Memory Organization

The I/O space is organized as a sequence of 32-bit quantities. Each 32-bit quantity has
four individually addressable bytes at consecutive memory addresses. See Figure for a
conceptual diagram of the I/O space.

I/O Space Organization


Sixty-four-bit memories are organized as arrays of physical quadwords (8-byte words).
Physical quadwords begin at addresses evenly divisible by 8. The quadwords are
addressable by physical address lines A31-A3.
Thirty-two-bit memories are organized as arrays of physical dwords (4-byte words).
Physical dwords begin at addresses evenly divisible by 4. The dwords are addressable by
physical address lines A31-A3 and A2. A2 can be decoded from the byte enables.
Sixteen-bit memories are organized as arrays of physical words (2-byte words). Physical
words begin at addresses evenly divisible by 2.

DATA TRANSFER MECHANISM
All data transfers occur as a result of one or more bus cycles. Logical data operands of
byte, word, dword, and quadword lengths may be transferred. Data may be accessed at
any byte boundary, but two cycles may be required for misaligned data transfers. The
Pentium processor considers a 2-byte or 4-byte operand that crosses a 4-byte boundary to
be misaligned. In addition, an 8-byte operand that crosses an 8-byte boundary is
misaligned.

The Pentium processor address signals are split into two components. High-order address
bits are provided by the address lines A31-A3. The byte enables BE7#-BE0# form the
low-order address and select the appropriate bytes of the 8-byte data bus.

For both memory and I/O accesses, the byte enable outputs indicate which of the
associated data bus bytes are driven valid for write cycles and on which bytes data is
expected back for read cycles. Non-contiguous byte enable patterns will never occur.
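
The relationship between a byte address, the operand size and the byte enables can be sketched as follows. The mask convention (bit n set means BEn# is asserted) and the function names are choices made for this sketch. The real pins are active low.

```python
def byte_enables(addr, size):
    """Return an 8-bit mask of asserted byte enables for one bus cycle.

    Bit n set means BEn# is asserted. Raises ValueError if the operand
    does not fit in the addressed quadword and so needs a second cycle.
    """
    start = addr & 7                       # byte position within the quadword
    if start + size > 8:
        raise ValueError("operand crosses a quadword boundary")
    return ((1 << size) - 1) << start      # contiguous run of enabled bytes


def low_address_bits(mask):
    """Recover A2-A0 as the position of the lowest asserted byte enable.

    mask must be non-zero.
    """
    return (mask & -mask).bit_length() - 1


mask = byte_enables(0x1002, 2)             # a word at byte offset 2
print(bin(mask), low_address_bits(mask))
```

Because the enabled bytes always form one contiguous run, the lowest asserted byte enable is enough to reconstruct A2-A0 for narrower memories.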

Generating A2-A0 from BE7-0#

Interfacing With 8-, 16-, 32-, and 64-Bit Memories
In 64-bit physical memories, each 8-byte quadword begins at a byte address
that is a multiple of eight. A31-A3 are used as an 8-byte quadword select and BE7#-BE0#
select individual bytes within the quadword.


Pentium® Processor with 64-Bit Memory

The Figure shows the Pentium processor data bus interface to 32-, 16- and 8-bit wide
memories. External byte swapping logic is needed on the data lines so that data is
supplied to and received from the Pentium processor on the correct data pins (see Table).
For memory widths smaller than 64 bits, byte assembly logic is needed to return all bytes
of data requested by the Pentium processor in one cycle.

Addressing 32-, 16- and 8-Bit Memories


Data Bus Interface to 32-, 16- and 8-Bit Memories

Operand alignment and size dictate when two cycles are required for a data transfer.
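
The misalignment rule given under DATA TRANSFER MECHANISM can be written as a short check. The sketch below is a minimal illustration in Python; the function names are hypothetical and only the boundary rule stated in the text is modelled.

```python
def is_misaligned(address: int, size: int) -> bool:
    """Apply the rule from the text: 2- and 4-byte operands must not cross
    a 4-byte boundary, 8-byte operands must not cross an 8-byte boundary."""
    if size not in (1, 2, 4, 8):
        raise ValueError("operand size must be 1, 2, 4 or 8 bytes")
    if size == 1:
        return False                    # a single byte cannot cross a boundary
    boundary = 8 if size == 8 else 4
    first_block = address // boundary
    last_block = (address + size - 1) // boundary
    return first_block != last_block


def bus_cycles_needed(address: int, size: int) -> int:
    """One cycle for an aligned operand, two for a misaligned one."""
    return 2 if is_misaligned(address, size) else 1


for addr, size in [(0x1001, 2), (0x1003, 2), (0x1004, 4), (0x1004, 8)]:
    print(hex(addr), size, bus_cycles_needed(addr, size))
```

Note that a word at offset 1 inside a dword is not misaligned by this rule, because it does not cross the 4-byte boundary.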


BUS STATE DEFINITION
This section describes the Pentium processor bus states in detail. See Figure for the bus
state diagram.
Ti: This is the bus idle state. In this state, no bus cycles are being run. The Pentium
processor may or may not be driving the address and status pins, depending on the state
of the HLDA, AHOLD, and BOFF# inputs. An asserted BOFF# or RESET will always
force the state machine back to this state. HLDA will only be driven in this state.
T1: This is the first clock of a bus cycle. Valid address and status are driven out and
ADS# is asserted. There is one outstanding bus cycle.
T2: This is the second and subsequent clock of the first outstanding bus cycle. In state T2,
data is driven out (if the cycle is a write), or data is expected (if the cycle is a read), and
the BRDY# pin is sampled. There is one outstanding bus cycle.
T12: This state indicates there are two outstanding bus cycles, and that the Pentium
processor is starting the second bus cycle at the same time that data is being transferred
for the first. In T12, the Pentium processor drives the address and status and asserts
ADS# for the second outstanding bus cycle, while data is transferred and BRDY# is
sampled for the first outstanding cycle.
T2P: This state indicates there are two outstanding bus cycles, and that both are in their
second and subsequent clocks. In T2P, data is being transferred and BRDY# is sampled
for the first outstanding cycle. The address, status and ADS# for the second outstanding
cycle were driven sometime in the past (in state T12).
TD: This state indicates there is one outstanding bus cycle, that its address, status and
ADS# have already been driven sometime in the past (in state T12), and that the data and
BRDY# pins are not being sampled because the data bus requires one dead clock to turn
around between consecutive reads and writes, or writes and reads. The Pentium processor
enters TD if in the previous clock there were two outstanding cycles, the last BRDY# was
returned, and a dead clock is needed. The timing diagrams in the next section give
examples when a dead clock is needed.
Table gives a brief summary of bus activity during each bus state. Figure shows the
Pentium processor bus state diagram.

Pentium® Processor Bus Activity


Pentium® Processor Bus Control State Machine


BUS CYCLES
The Pentium processor requests data transfer cycles, bus cycles, and bus operations.
A data transfer cycle is one data item, up to 8 bytes in width, being returned to the
Pentium processor or accepted from the Pentium processor with BRDY# asserted. A bus
cycle begins with the Pentium processor driving an address and status and asserting
ADS#, and ends when the last BRDY# is returned. A bus cycle may have 1 or 4 data
transfers. A burst cycle is a bus cycle with 4 data transfers. A bus operation is a sequence
of bus cycles to carry out a specific function, such as a locked read-modify-write or an
interrupt acknowledge.

Single-Transfer Cycle
The Pentium processor supports a number of different types of bus cycles. The simplest
type of bus cycle is a single-transfer non-cacheable 64-bit cycle, either with or without
wait states. Non-pipelined read and write cycles with 0 wait states are shown in Figure.

Non Pipelined Read or Write


The Pentium processor initiates a cycle by asserting the address status signal (ADS#) in
the first clock. The clock in which ADS# is asserted is by definition the first clock in the
bus cycle. The ADS# output indicates that a valid bus cycle definition and address is
available on the cycle definition pins and the address bus. The CACHE# output is
deasserted (high) to indicate that the cycle will be a single transfer cycle.

For a zero wait state transfer, BRDY# is returned by the external system in the second
clock of the bus cycle. BRDY# indicates that the external system has presented valid data
on the data pins in response to a read or the external system has accepted data in response
to a write. The Pentium processor samples the BRDY# input in the second and
subsequent clocks of a bus cycle.

If the system is not ready to drive or accept data, wait states can be added to these cycles
by not returning BRDY# to the processor at the end of the second clock. Cycles of this
type, with one and two wait states added, are shown in Figure. Note that BRDY# must be
driven inactive at the end of the second clock.
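
As a rough illustration of the timing just described, the following Python sketch lists the bus states of a non-pipelined single-transfer cycle. It is a simplification under stated assumptions: only states T1 and T2 are modelled, and the function name is hypothetical.

```python
def single_transfer_states(wait_states: int) -> list:
    """Bus states for one non-pipelined single-transfer cycle.

    T1: ADS# is asserted and address/status are driven.
    T2: BRDY# is sampled in every T2 clock; each clock in which it is not
        returned adds one wait state, and the last T2 is the clock in
        which BRDY# is finally returned.
    """
    if wait_states < 0:
        raise ValueError("wait_states cannot be negative")
    return ["T1"] + ["T2"] * (1 + wait_states)


for n in (0, 1, 2):
    states = single_transfer_states(n)
    print(n, "wait state(s):", " ".join(states), "=", len(states), "clocks")
```

A zero wait state transfer therefore takes two clocks, and every wait state adds one more T2 clock.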

Burst Cycles
For bus cycles that require more than a single data transfer (cacheable cycles and
writeback cycles), the Pentium processor uses the burst data transfer. In burst transfers, a
new data item can be sampled or driven by the Pentium processor in consecutive clocks.
In addition the addresses of the data items in burst cycles all fall within the same 32-byte
aligned area (corresponding to an internal Pentium processor cache line).

The implementation of burst cycles is via the BRDY# pin. While running a bus cycle of
more than one data transfer, the Pentium processor requires that the memory system
perform a burst transfer and follow the burst order (see Table). Given the first address in
the burst sequence, the address of subsequent transfers must be calculated by external
hardware. This requirement exists because the Pentium processor address and byte
enables are asserted for the first transfer and are not re-driven for each transfer. The burst
sequence is optimized for two-bank memory subsystems and is shown in Table.

Pentium Processor Burst Order
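
External hardware has to derive the remaining three addresses of a burst from the first one. The Python sketch below assumes the interleaved order in which each later address is the first address XORed with 8h, 10h and 18h; this assumption is consistent with the 0, 8h, 10h, 18h writeback sequence given in these notes, but the function name and the formulation are illustrative only.

```python
def burst_order(first_address: int) -> list:
    """Return the four quadword addresses of one burst (assumed XOR order)."""
    line_base = first_address & ~0x1F   # start of the 32-byte aligned line
    start = first_address & 0x18        # which quadword of the line is first
    return [line_base | (start ^ step) for step in (0x00, 0x08, 0x10, 0x18)]


for first in (0x100, 0x108, 0x110, 0x118):
    print([hex(a) for a in burst_order(first)])
```

All four addresses stay inside the same 32-byte aligned area, as required for a cache line.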


BURST READ CYCLES
When initiating any read, the Pentium processor will present the address and byte enables
for the data item requested. When the cycle is converted into a cache linefill, the first data
item returned should correspond to the address sent out by the Pentium processor;
however, the byte enables should be ignored, and valid data must be returned on all 64
data lines. In addition, the address of the subsequent transfers in the burst sequence must
be calculated by external hardware since the address and byte enables are not re-driven
for each transfer.

Figure shows a cacheable burst read cycle. Note that in this case the initial cycle
generated by the Pentium processor might have been satisfied by a single data transfer,
but was transformed into a multiple-transfer cache fill by KEN# being returned active on
the clock that the first BRDY# is returned. In this case KEN# has such an effect because
the cycle is internally cacheable in the Pentium processor (CACHE# pin is driven active).
KEN# is only sampled once during a cycle to determine cacheability.

Basic Burst Read Cycle


BURST WRITE CYCLES
Figure shows the timing diagram of a basic burst write cycle. KEN# is ignored in burst
write cycle. If the CACHE# pin is active (low) during a write cycle, it indicates that the
cycle will be a burst writeback cycle. Burst write cycles are always writebacks of
modified lines in the data cache. Writeback cycles have several causes:

1. Writeback due to replacement of a modified line in the data cache.
2. Writeback due to an inquire cycle that hits a modified line in the data cache.
3. Writeback due to an internal snoop that hits a modified line in the data cache.
4. Writebacks caused by asserting the FLUSH# pin.
5. Writebacks caused by executing the WBINVD instruction.

The only write cycles that are burstable by the Pentium processor are writeback cycles.
All other write cycles will be 64 bits or less, single transfer bus cycles.

Basic Burst Write Cycle

For writeback cycles, the lower five bits of the first burst address are always zero;
therefore, the burst order becomes 0, 8h, 10h, and 18h. Again, note that the address of the
subsequent transfers in the burst sequence must be calculated by external hardware since
the Pentium processor does not drive the address and byte enables for each transfer.


Locked Operations
The Pentium processor architecture provides a facility to perform atomic accesses of
memory. For example, a programmer can change the contents of a memory-based
variable and be assured that the variable was not accessed by another bus master between
the read of the variable and the update of that variable. This functionality is provided for
select instructions using a LOCK prefix, and also for instructions which implicitly
perform locked read modify write cycles such as the XCHG (exchange) instruction when
one of its operands is memory based. Locked cycles are also generated when a segment
descriptor or page table entry is updated and during interrupt acknowledge cycles.

In hardware, the LOCK functionality is implemented through the LOCK# pin, which
indicates to the outside world that the Pentium processor is performing a read-modify-
write sequence of cycles, and that the Pentium processor should be allowed atomic
access for the location that was accessed with the first locked cycle. Locked operations
begin with a read cycle and end with a write cycle. Note that the data width read is not
necessarily the data width written. For example, for descriptor access bit updates the
Pentium processor fetches eight bytes and writes one byte.

A locked operation is a combination of one or multiple read cycles followed by one or
multiple write cycles. Programmer generated locked cycles and locked page table /
directory accesses are treated differently and are described in the following sections.

Snooping (Inquire)
When operating in an MP system, IA-32 processors (beginning with the Intel486
processor) have the ability to snoop other processors’ accesses to system memory and
to their internal caches. They use this snooping ability to keep their internal caches
consistent both with system memory and with the caches in other processors on the bus.
For example, in the Pentium and P6 family processors, if through snooping one processor
detects that another processor intends to write to a memory location that it currently has
cached in shared state, the snooping processor will invalidate its cache line forcing it to
perform a cache line fill the next time it accesses the same memory location.
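
The invalidate-on-snoop behaviour in the example above can be modelled with a toy cache. This Python sketch is a deliberately reduced illustration: it tracks only a shared ("S") and an invalid ("I") state, and the class and method names are hypothetical.

```python
class SnoopingCache:
    """Toy cache that tracks only shared ("S") and invalid ("I") lines."""

    def __init__(self):
        self.lines = {}                       # line address -> "S" or "I"

    def read(self, line_address: int) -> str:
        """Return "hit" if the line is valid, else fill it and say so."""
        if self.lines.get(line_address) == "S":
            return "hit"
        self.lines[line_address] = "S"        # cache line fill from memory
        return "linefill"

    def snoop_write(self, line_address: int) -> None:
        """Another processor intends to write: invalidate a shared copy."""
        if self.lines.get(line_address) == "S":
            self.lines[line_address] = "I"


cache = SnoopingCache()
print(cache.read(0x40))    # first access fills the line
print(cache.read(0x40))    # second access hits
cache.snoop_write(0x40)    # write by another processor is snooped
print(cache.read(0x40))    # the line must be filled again
```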



REGISTER SET

Alternate General Purpose Register Names


• I/O ports — The IA-32 architecture supports the transfer of data to and from
input/output (I/O) ports.
• Control registers — The five control registers (CR0 through CR4) determine the
operating mode of the processor and the characteristics of the currently executing task.
• Memory management registers — The GDTR, IDTR, task register, and LDTR
specify the locations of data structures used in protected mode memory management.
• Debug registers — The debug registers (DR0 through DR7) control and allow
monitoring of the processor’s debugging operations.

BASIC PROGRAM EXECUTION REGISTERS
The processor provides 16 basic program execution registers for use in general system
and application programming (see Figure ). These registers can be grouped as follows:
• General-purpose registers. These eight registers are available for storing operands and
pointers.
• Segment registers. These registers hold up to six segment selectors.
• EFLAGS (program status and control) register. The EFLAGS register reports on the
status of the program being executed and allows limited (application-program level)
control of the processor.
• EIP (instruction pointer) register. The EIP register contains a 32-bit pointer to the
next instruction to be executed.
The general-purpose registers have the following special uses:
• EAX — Accumulator for operands and results data
• EBX — Pointer to data in the DS segment
• ECX — Counter for string and loop operations
• EDX — I/O pointer
• ESI — Pointer to data in the segment pointed to by the DS register; source pointer for
string operations
• EDI — Pointer to data (or destination) in the segment pointed to by the ES register;
destination pointer for string operations
• ESP — Stack pointer (in the SS segment)
• EBP — Pointer to data on the stack (in the SS segment)
As shown in Figure 3-5, the lower 16 bits of the general-purpose registers map directly to
the register set found in the 8086 and Intel 286 processors and can be referenced with the
names AX, BX, CX, DX, BP, SI, DI, and SP. Each of the lower two bytes of the EAX,
EBX, ECX, and EDX registers can be referenced by the names AH, BH, CH, and DH
(high bytes) and AL, BL, CL, and DL (low bytes).
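
The way the 16-bit and 8-bit names overlay a 32-bit register can be illustrated with simple masking. The Python sketch below uses EAX as the example; the helper names are hypothetical.

```python
def sub_registers(eax: int) -> dict:
    """Extract the AX, AH and AL views of a 32-bit EAX value."""
    eax &= 0xFFFFFFFF
    ax = eax & 0xFFFF                   # lower 16 bits
    return {"AX": ax, "AH": (ax >> 8) & 0xFF, "AL": ax & 0xFF}


def write_al(eax: int, value: int) -> int:
    """Writing AL replaces only bits 7-0; the other bits of EAX are kept."""
    return (eax & 0xFFFFFF00) | (value & 0xFF)


views = sub_registers(0x12345678)
print({name: hex(v) for name, v in views.items()})
print(hex(write_al(0x12345678, 0xAB)))
```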

DATA TYPES
This section introduces the data types defined for the IA-32 architecture.

FUNDAMENTAL DATA TYPES
The fundamental data types of IA-32 architecture are bytes, words, doublewords,
quadwords, and double quadwords (see Figure). A byte is eight bits, a word is 2 bytes
(16 bits), a doubleword is 4 bytes (32 bits), a quadword is 8 bytes (64 bits), and a double
quadword is 16 bytes (128 bits). A subset of the IA-32 architecture instructions operates
on these fundamental data types without any additional operand typing.

Figure shows the byte order of each of the fundamental data types when referenced as
operands in memory. The low byte (bits 0 through 7) of each data type occupies the
lowest address in memory and that address is also the address of the operand.
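
The byte order described here (low byte at the lowest address) is what is usually called little-endian. A small Python sketch makes the layout visible; the function name is hypothetical.

```python
def memory_image(value: int, size: int) -> list:
    """Bytes of an operand in increasing address order (low byte first)."""
    return list(value.to_bytes(size, byteorder="little"))


image = memory_image(0x12345678, 4)     # a doubleword operand
print([hex(b) for b in image])          # index 0 is the operand's address

# Reading the bytes back in the same order restores the value.
restored = int.from_bytes(bytes(image), byteorder="little")
print(hex(restored))
```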

Bytes, Words, Doublewords, Quadwords, and Double Quadwords in Memory


Alignment of Words, Doublewords, Quadwords, and Double Quadwords

Words, doublewords, and quadwords do not need to be aligned in memory on natural
boundaries. The natural boundaries for words, doublewords, and quadwords are
even-numbered addresses, addresses evenly divisible by four, and addresses evenly divisible
by eight, respectively. However, to improve the performance of programs, data structures
(especially stacks) should be aligned on natural boundaries whenever possible. The
reason for this is that the processor requires two memory accesses to make an unaligned
memory access; aligned accesses require only one memory access. A word or
doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses
an 8-byte boundary is considered unaligned and requires two separate memory bus cycles
for access.
Some instructions that operate on double quadwords require memory operands to be
aligned on a natural boundary. These instructions generate a general-protection exception
(#GP) if an unaligned operand is specified. A natural boundary for a double quadword is
any address evenly divisible by 16. Other instructions that operate on double quadwords
permit unaligned access (without generating a general-protection exception). However,
additional memory bus cycles are required to access unaligned data from memory.

NUMERIC DATA TYPES
Although bytes, words, and doublewords are the fundamental data types of the IA-32
architecture, some instructions support additional interpretations of these data types to
allow operations to be performed on numeric data types (signed and unsigned integers,
and floating-point numbers). See Figure.


Numeric Data Types
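
The same bit pattern can be read as an unsigned or as a signed (two's complement) integer. The Python sketch below shows the two interpretations for byte and word operands; the function name is hypothetical.

```python
def as_signed(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1            # keep only the operand's bits
    sign_bit = 1 << (bits - 1)
    return value - (1 << bits) if value & sign_bit else value


for pattern, bits in [(0x7F, 8), (0xFF, 8), (0x8000, 16)]:
    print(hex(pattern), "unsigned:", pattern, "signed:", as_signed(pattern, bits))
```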

OPERAND ADDRESSING
IA-32 machine-instructions act on zero or more operands. Some operands are specified
explicitly and others are implicit. The data for a source operand can be located in:
• the instruction itself (an immediate operand)
• a register
• a memory location
• an I/O port

When an instruction returns data to a destination operand, it can be returned to:
• a register
• a memory location
• an I/O port

Immediate Operands
Some instructions use data encoded in the instruction itself as a source operand. These
operands are called immediate operands (or simply immediates). For example, the
following ADD instruction adds an immediate value of 14 to the contents of the EAX
register:
ADD EAX, 14


All arithmetic instructions (except the DIV and IDIV instructions) allow the source
operand to be an immediate value. The maximum value allowed for an immediate
operand varies among instructions, but can never be greater than the maximum value of
an unsigned doubleword integer (2^32 - 1).

Register Operands
Source and destination operands can be any of the following registers, depending on the
instruction being executed:
• 32-bit general-purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, or EBP)
• 16-bit general-purpose registers (AX, BX, CX, DX, SI, DI, SP, or BP)
• 8-bit general-purpose registers (AH, BH, CH, DH, AL, BL, CL, or DL)
• segment registers (CS, DS, SS, ES, FS, and GS)
• EFLAGS register
• x87 FPU registers (ST0 through ST7, status word, control word, tag word, data operand
pointer, and instruction pointer)

Some instructions (such as the DIV and MUL instructions) use quadword operands contained
in a pair of 32-bit registers. Register pairs are represented with a colon separating them. For
example, in the register pair EDX:EAX, EDX contains the high order bits and EAX
contains the low order bits of a quadword operand. Several instructions (such as the
PUSHFD and POPFD instructions) are provided to load and store the contents of the
EFLAGS register or to set or clear individual flags in this register. Other
instructions (such as the Jcc instructions) use the state of the status flags in the EFLAGS
register as condition codes for branching or other decision making operations.
The processor contains a selection of system registers that are used to control memory
management, interrupt and exception handling, task management, processor
management, and debugging activities. Some of these system registers are accessible by
an application program, the operating system, or the executive through a set of system
instructions. When accessing a system register with a system instruction, the register is
generally an implied operand of the instruction.
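
The EDX:EAX register pair mentioned in this section can be illustrated as follows. The Python sketch splits a quadword into its two halves and shows where the result of an unsigned 32-bit MUL ends up; the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def split_quadword(value: int) -> tuple:
    """Return (EDX, EAX): the high and low doublewords of a quadword."""
    value &= 0xFFFFFFFFFFFFFFFF
    return (value >> 32) & MASK32, value & MASK32


def join_quadword(edx: int, eax: int) -> int:
    """Rebuild the quadword held in the pair EDX:EAX."""
    return ((edx & MASK32) << 32) | (eax & MASK32)


def mul32(eax: int, operand: int) -> tuple:
    """Unsigned 32-bit multiply; the 64-bit product is returned as EDX:EAX."""
    return split_quadword((eax & MASK32) * (operand & MASK32))


edx, eax = mul32(0xFFFFFFFF, 2)
print(hex(edx), hex(eax))
```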

Memory Operands
Source and destination operands in memory are referenced by means of a segment
selector and an offset (see Figure). Segment selectors specify the segment containing the
operand. Offsets specify the linear or effective address of the operand. Offsets can be 32
bits (represented by the notation m16:32) or 16 bits (represented by the notation m16:16).

Memory Operand Address

Specifying a Segment Selector
The segment selector can be specified either implicitly or explicitly. The most common
method of specifying a segment selector is to load it in a segment register and then allow
the processor to select the register implicitly, depending on the type of operation being
performed. The processor automatically chooses a segment according to the rules given
in Table. When storing data in memory or loading data from memory, the DS segment
default can be overridden to allow other segments to be accessed. Within an assembler,
the segment override is generally handled with a colon “:” operator. For example, the
following MOV instruction moves a value from register EAX into the segment pointed to
by the ES register. The offset into the segment is contained in the EBX register:
MOV ES:[EBX], EAX

Default Segment Selection Rules

At the machine level, a segment override is specified with a segment-override prefix,
which is a byte placed at the beginning of an instruction. The following default segment
selections cannot be overridden:
• Instruction fetches must be made from the code segment.
• Destination strings in string instructions must be stored in the data segment pointed to
by the ES register.
• Push and pop operations must always reference the SS segment.
Some instructions require a segment selector to be specified explicitly. In these cases, the
16-bit segment selector can be located in a memory location or in a 16-bit register. For
example, the following MOV instruction moves a segment selector located in register BX
into segment register DS:
MOV DS, BX
Segment selectors can also be specified explicitly as part of a 48-bit far pointer in
memory. Here, the first doubleword in memory contains the offset and the next word
contains the segment selector.
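
The memory layout of a 48-bit far pointer (32-bit offset first, then the 16-bit selector) can be sketched with Python's struct module. The function names and the example values are hypothetical.

```python
import struct

# "<" = little-endian without padding, "I" = 32-bit offset, "H" = 16-bit selector
FAR_POINTER = struct.Struct("<IH")


def pack_far_pointer(selector: int, offset: int) -> bytes:
    """Lay out a far pointer as it sits in memory: offset, then selector."""
    return FAR_POINTER.pack(offset, selector)


def unpack_far_pointer(raw: bytes) -> tuple:
    """Return (selector, offset) from the six bytes of a far pointer."""
    offset, selector = FAR_POINTER.unpack(raw)
    return selector, offset


raw = pack_far_pointer(0x0023, 0x00401000)
print(raw.hex(), len(raw))
print([hex(v) for v in unpack_far_pointer(raw)])
```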

Specifying an Offset
The offset part of a memory address can be specified directly as a static value (called a
displacement) or through an address computation made up of one or more of the
following components:
• Displacement — An 8-, 16-, or 32-bit value.
• Base — The value in a general-purpose register.
• Index — The value in a general-purpose register.
• Scale factor — A value of 2, 4, or 8 that is multiplied by the index value.


The offset which results from adding these components is called an effective address.
Each of these components can have either a positive or negative (2s complement) value,
with the exception of the scaling factor. Figure 3-11 shows all the possible ways that
these components can be combined to create an effective address in the selected segment.

Offset (or Effective Address) Computation

The uses of general-purpose registers as base or index components are restricted in the
following manner:
• The ESP register cannot be used as an index register.
• When the ESP or EBP register is used as the base, the SS segment is the default
segment. In all other cases, the DS segment is the default segment.
The base, index, and displacement components can be used in any combination, and any
of these components can be null. A scale factor may be used only when an index also is
used. Each possible combination is useful for data structures commonly used by
programmers in high-level languages and assembly language. The following addressing
modes suggest uses for common combinations of address components.
• Displacement A displacement alone represents a direct (uncomputed) offset to the
operand. Because the displacement is encoded in the instruction, this form of an address
is sometimes called an absolute or static address. It is commonly used to access a
statically allocated scalar operand.
• Base A base alone represents an indirect offset to the operand. Since the value in the
base register can change, it can be used for dynamic storage of variables and data
structures.
• Base + Displacement A base register and a displacement can be used together for
two distinct purposes:
• As an index into an array when the element size is not 2, 4, or 8 bytes: the
displacement component encodes the static offset to the beginning of the array. The base
register holds the results of a calculation to determine the offset to a specific element
within the array.
• To access a field of a record: the base register holds the address of the beginning of the
record, while the displacement is a static offset to the field.
An important special case of this combination is access to parameters in a procedure
activation record. A procedure activation record is the stack frame created when a
procedure is entered. Here, the EBP register is the best choice for the base register,
because it automatically selects the stack segment. This is a compact encoding for this
common function.
• (Index ∗ Scale) + Displacement This address mode offers an efficient way to index
into a static array when the element size is 2, 4, or 8 bytes. The displacement locates the
beginning of the array, the index register holds the subscript of the desired array element,
and the processor automatically converts the subscript into an index by applying the
scaling factor.
• Base + Index + Displacement Using two registers together supports either a
two-dimensional array (the displacement holds the address of the beginning of the array)
or one of several instances of an array of records (the displacement is an offset to a field
within the record).
• Base + (Index ∗ Scale) + Displacement Using all the addressing components
together allows efficient indexing of a two-dimensional array when the elements of the
array are 2, 4, or 8 bytes in size.
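
The offset computation described in this section can be summarised in a short Python sketch. It adds the components modulo 2^32 and applies the default segment rule for ESP and EBP. The function names are hypothetical, and a scale of 1 is used here to mean "no scaling".

```python
def effective_address(base=0, index=0, scale=1, displacement=0) -> int:
    """Offset = base + (index * scale) + displacement, wrapped to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale factor must be 1 (none), 2, 4 or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF


def default_segment(base_register=None) -> str:
    """SS when ESP or EBP is the base register, DS in all other cases."""
    return "SS" if base_register in ("ESP", "EBP") else "DS"


# Element 3 of an array of doublewords that starts 8 bytes into a record
# whose address is held in the base register.
print(hex(effective_address(base=0x1000, index=3, scale=4, displacement=8)))

# A local variable below the frame pointer uses a negative displacement.
print(default_segment("EBP"), hex(effective_address(base=0x1000, displacement=-4)))
```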

I/O Port Addressing
The processor supports an I/O address space that contains up to 65,536 8-bit I/O ports.
Ports that are 16-bit and 32-bit may also be defined in the I/O address space. An I/O port
can be addressed with either an immediate operand or a value in the DX register.
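
The two ways of naming a port can be told apart with a simple range check. The Python sketch below relies on one assumption that is not stated in these notes: the immediate form of the IN and OUT instructions encodes the port number in a single byte, so it reaches only ports 0 to 255. The function name is hypothetical.

```python
def port_operand_form(port: int) -> str:
    """Say which operand forms can address the given I/O port."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("the I/O address space has 65,536 ports")
    # Assumption: an immediate port number is one byte wide.
    return "immediate or DX" if port <= 0xFF else "DX only"


for port in (0x60, 0x3F8):
    print(hex(port), port_operand_form(port))
```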

