2. Memory Hierarchy
Computer memory is organized as a hierarchy. This is done to cope with the speed of the processor and hence increase performance.
Closest to the processor are the processor registers. Then comes the cache memory, followed by main memory.
3. SRAM and DRAM
Both are random access memories and are volatile, i.e. a constant power supply is required to avoid data loss.
DRAM :- made up of a capacitor and a transistor. The transistor acts as a switch and data is stored as charge on the capacitor. Requires periodic charge refreshing to maintain the data. Lower cost per bit, less expensive. Used for large memories.
SRAM :- made up of 4 transistors, cross-connected in an arrangement that produces a stable logic state. Greater cost per bit, more expensive. Used for small memories.
4. Principles of Locality
Since programs access only a small portion of their address space at any given instant, two forms of locality are exploited to increase performance :-
A) Temporal Locality :- locality in time, i.e. if an item is referenced, it will tend to be referenced again soon.
B) Spatial Locality :- locality in space, i.e. if an item is referenced, its neighboring items will tend to be referenced soon.
5. Mapping Functions
There are three main types of memory
mapping functions :-
1) Direct Mapped
2) Fully Associative
3) Set Associative
For the examples that follow, let us assume a 1GB main memory, a 128KB cache memory and a cache line size of 32B.
6. Direct Mapping
TAG (s – r) | LINE or SLOT (r) | OFFSET (w)
• Each memory block is mapped to a single cache line. For the purpose of cache access, each main memory address can be viewed as consisting of three fields.
• No two blocks that map to the same line have the same Tag field.
• The cache is checked by indexing with the Line field and comparing the stored Tag with the Tag field of the address.
7. For the given example, we have –
1GB main memory = 2^30 bytes, so a 30-bit address
Cache size = 128KB = 2^17 bytes
Block size = 32B = 2^5 bytes
No. of cache lines = 2^17 / 2^5 = 2^12, thus 12 bits are required to locate the 2^12 lines.
Also, the block size is 2^5 bytes, thus 5 offset bits are required to locate an individual byte.
Thus Tag bits = 30 – 12 – 5 = 13 bits
TAG (13) | LINE (12) | OFFSET (5)
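A minimal sketch of this address split in C, using the field widths derived above (5 offset bits, 12 line bits, 13 tag bits); the address value itself is hypothetical.

```c
/* Splitting a 30-bit address into tag / line / offset for the
 * direct-mapped example (128 KB cache, 32 B lines). */
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 5            /* 32 B line  -> 5 offset bits */
#define LINE_BITS   12           /* 4096 lines -> 12 index bits */

int main(void) {
    uint32_t addr = 0x12345678 & 0x3FFFFFFF;   /* hypothetical 30-bit address */

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t line   = (addr >> OFFSET_BITS) & ((1u << LINE_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + LINE_BITS);

    printf("tag=%u line=%u offset=%u\n", tag, line, offset);
    return 0;
}
```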
8. Summary
Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
No. of blocks in main memory = 2^(s+w) / 2^w = 2^s
Number of lines in cache = m = 2^r
Size of tag = (s – r) bits
Mapping Function
The j-th block of main memory maps to the i-th cache line:
i = j modulo m (m = no. of cache lines)
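A minimal sketch of the mapping function for the running example (m = 4096 lines); the block numbers are illustrative and chosen to show that blocks 4096 apart collide on the same line, which is the source of the conflict misses discussed on the next slide.

```c
/* Direct-mapping function i = j mod m for m = 4096 lines. */
#include <stdio.h>

int main(void) {
    unsigned m = 4096;                     /* number of cache lines */
    unsigned blocks[] = { 7, 4103, 8199 }; /* 4103 = 7 + 4096, 8199 = 7 + 2*4096 */

    for (int k = 0; k < 3; k++)
        printf("block %u -> line %u\n", blocks[k], blocks[k] % m);
    /* all three map to line 7 -> alternating accesses cause conflict misses */
    return 0;
}
```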
9. Pros and Cons
Simple
Inexpensive
Fixed location for a given block
If a program repeatedly accesses 2 blocks that map to the same line, cache misses (conflict misses) are very high
10. Fully Associative Mapping
A main memory block can load into any line of the cache
The memory address is interpreted as tag and word (offset)
The Tag uniquely identifies a block of memory
Every line's tag is examined for a match
Cache searching gets expensive, and power consumption rises due to the parallel comparators
TAG (s) | OFFSET (w)
12. For the given example, we have –
1GB main memory = 2^30 bytes, so a 30-bit address
Cache size = 128KB = 2^17 bytes
Block size = 32B = 2^5 bytes
Here, the block size is 2^5 bytes, thus 5 offset bits are required to locate an individual byte.
Thus Tag bits = 30 – 5 = 25 bits
TAG (25) | OFFSET (5)
13. Fully Associative Mapping
Summary
Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
No. of blocks in main memory = 2^(s+w) / 2^w = 2^s
Number of lines in cache = total number of blocks in the cache
Size of tag = s bits
14. Pros and Cons
There is flexibility as to which block to
replace when a new block is read into
the cache
The complex circuitry required for
parallel Tag comparison is however a
major disadvantage.
15. Set Associative Mapping
The cache is divided into a number of sets
Each set contains a number of lines
A given block maps to any line in a given set, e.g. block B can be in any line of set i
With 2 lines per set, we have 2-way set associative mapping: a given block can be in one of the 2 lines of exactly one set
TAG (s – d) | SET (d) | OFFSET (w)
17. For the given example, we have –
1GB main memory = 2^30 bytes, so a 30-bit address
Cache size = 128KB = 2^17 bytes
Block size = 32B = 2^5 bytes
Let it be a 2-way set associative cache.
No. of sets = 2^17 / (2 * 2^5) = 2^11, thus 11 bits are required to locate the 2^11 sets, each set containing 2 lines.
Also, the block size is 2^5 bytes, thus 5 offset bits are required to locate an individual byte.
Thus Tag bits = 30 – 11 – 5 = 14 bits
TAG (14) | SET (11) | OFFSET (5)
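A minimal sketch of the same address split for the 2-way set-associative example (5 offset bits, 11 set bits, 14 tag bits); the address value is hypothetical.

```c
/* Address split for the 2-way set-associative example
 * (2048 sets, 32 B lines, 30-bit address -> 14 / 11 / 5 bits). */
#include <stdio.h>
#include <stdint.h>

#define OFFSET_BITS 5
#define SET_BITS    11

int main(void) {
    uint32_t addr = 0x0ABCDEF0 & 0x3FFFFFFF;   /* hypothetical 30-bit address */

    uint32_t offset = addr & ((1u << OFFSET_BITS) - 1);
    uint32_t set    = (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + SET_BITS);

    /* the block may reside in either of the 2 lines of set 'set';
       both stored tags in that set are compared in parallel against 'tag' */
    printf("tag=%u set=%u offset=%u\n", tag, set, offset);
    return 0;
}
```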
18. Set Associative Mapping
Summary
Address length = (s + w) bits
Number of addressable units = 2^(s+w) words or bytes
Block size = line size = 2^w words or bytes
Number of blocks in main memory = 2^s
Number of lines per set = k
Number of sets = v = 2^d
Number of lines in cache = k * v = k * 2^d
Size of tag = (s – d) bits
Mapping Function
The j-th block of main memory maps to the i-th set:
i = j modulo v (v = no. of sets)
Within the set, the block can be mapped to any cache line.
19. Pros and Cons
After simulating the hit ratio for direct mapped and (2, 4, 8-way) set associative caches, we observe a significant difference in performance at least up to a cache size of 64KB, set associative being the better one.
Beyond that, however, the complexity of the cache increases in proportion to the associativity, and both mappings give approximately similar hit ratios.
20. N-way Set Associative Cache Vs. Direct Mapped Cache:
N comparators vs. 1
Extra MUX delay for the data
Data is available only after hit/miss is resolved
In a direct mapped cache, the cache block is available before hit/miss is resolved
Number of misses: DM > SA > FA
Access latency (time to perform a read or write operation, i.e. from the instant the address is presented to memory to the instant the data have been stored or made available): DM < SA < FA
21. Types of Misses
Compulsory Misses :-
When a program is started, the cache is completely empty, and hence the first access to a block will always be a miss, as the block has to be brought into the cache from memory at least once.
Also called first reference misses.
Can't be avoided easily.
22. Capacity Misses
The cache cannot hold all the blocks needed during the execution of a program.
Thus this miss occurs due to blocks being discarded and later retrieved.
They occur because the cache is limited in size.
For a fully associative cache, this is the major cause of misses.
23. Conflict Misses
They occur because multiple distinct memory locations map to the same cache location.
Thus in DM or SA caches, they occur because blocks are discarded and later retrieved.
In DM this is a repeated phenomenon, as two blocks that map to the same cache line can be accessed alternately, thereby decreasing the hit ratio.
This phenomenon is called thrashing.
24. Solutions to reduce misses
Capacity Misses :-
◦ Increase cache size
◦ Re-structure the program
Conflict Misses :-
◦ Increase cache size
◦ Increase associativity
25. Coherence Misses
Occur when another processor updates memory, which in turn invalidates the data block present in this processor's cache.
26. Replacement Algorithms
For a direct mapped cache, since each block maps to only one line, we have no choice but to replace that line itself.
Hence there isn't any replacement policy for DM.
For SA and FA, a few replacement policies :-
◦ Optimal
◦ Random
◦ Arrival (FIFO)
◦ Frequency (LFU)
◦ Recently Used (LRU)
27. Optimal
This is the ideal benchmarking
replacement strategy.
All other policies are compared to it.
This is not implemented, but used just
for comparison purposes.
28. Random
Block to be replaced is randomly
picked
Minimum hardware complexity – just a
pseudo random number generator
required.
Access time is not affected by the
replacement circuit.
Not suitable for high performance
systems
29. Arrival - FIFO
For an N-way set associative cache
Implementation 1
Use an N-bit register per cache line to store arrival time information
On a cache miss – the registers of all cache lines in the set are compared to choose the victim cache line
Implementation 2
Maintain a FIFO queue
A register with log2(N) bits per cache line
On a cache miss – the cache line whose register holds the value 0 will be the victim
Decrement all other registers in the set by 1 and set the victim's register to N-1
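A minimal sketch of Implementation 2 for one set, assuming a 4-way set for illustration; the register contents are hypothetical.

```c
/* FIFO replacement for one N-way set: each line keeps a log2(N)-bit register;
 * the line holding 0 is the victim, every other register is decremented, and
 * the victim's register is set to N-1. */
#include <stdio.h>

#define N 4                     /* assumed 4-way set for illustration */

static int fifo_victim(unsigned age[N]) {
    int victim = 0;
    for (int i = 0; i < N; i++)
        if (age[i] == 0) victim = i;        /* line with register value 0 is oldest */

    for (int i = 0; i < N; i++)
        if (i != victim) age[i]--;          /* everyone else moves one step closer */
    age[victim] = N - 1;                    /* newly filled line is youngest */
    return victim;
}

int main(void) {
    unsigned age[N] = { 3, 2, 1, 0 };       /* line 3 arrived first */
    printf("victim on miss: line %d\n", fifo_victim(age));   /* -> 3 */
    return 0;
}
```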
30. FIFO : Advantages & Disadvantages
Advantages
Low hardware complexity
Better cache hit performance than Random replacement
The cache access time is not affected by the replacement strategy (not in the critical path)
Disadvantages
Cache hit performance is poor compared to LRU and frequency-based replacement schemes
Not suitable for high performance systems
Replacement circuit complexity increases with increase in associativity
31. Frequency – Least Frequently Used
Requires a register per cache line to save the number of references (frequency count)
If a cache access is a hit, the frequency count of the corresponding register is increased by 1
On a cache miss, the victim cache line is the cache line with the minimum frequency count in the set
The register corresponding to the victim cache line is reset to 0
LFU cannot differentiate between blocks referenced heavily in the past and blocks being referenced now
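A minimal sketch of LFU victim selection for one set, assuming a 4-way set and illustrative counter values.

```c
/* LFU for one N-way set: a frequency counter per line, incremented on a hit;
 * on a miss the line with the smallest count is evicted and its counter reset. */
#include <stdio.h>

#define N 4                                  /* assumed 4-way set */

static int lfu_victim(unsigned freq[N]) {
    int victim = 0;
    for (int i = 1; i < N; i++)
        if (freq[i] < freq[victim]) victim = i;   /* minimum frequency count */
    freq[victim] = 0;                             /* reset for the new block */
    return victim;
}

int main(void) {
    unsigned freq[N] = { 12, 3, 7, 5 };
    printf("victim on miss: line %d\n", lfu_victim(freq));   /* -> 1 */
    return 0;
}
```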
32. Least Frequently Used –
Dynamic Aging (LFU-DA)
When any frequency count register in the set reaches its maximum value, all the frequency count registers in that set are shifted one position right (divided by 2)
The rest is the same as LFU
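A minimal sketch of the LFU-DA aging step, assuming a 4-way set and 8-bit counters; the counter width and values are illustrative.

```c
/* LFU-DA aging: when any counter in the set saturates, every counter in that
 * set is shifted right by one bit (divided by 2). */
#include <stdio.h>

#define N       4                 /* assumed 4-way set               */
#define MAX_CNT 255               /* assumed 8-bit frequency counters */

static void age_if_saturated(unsigned freq[N]) {
    for (int i = 0; i < N; i++)
        if (freq[i] >= MAX_CNT) {             /* one counter hit its maximum */
            for (int j = 0; j < N; j++)
                freq[j] >>= 1;                /* halve every counter in the set */
            return;
        }
}

int main(void) {
    unsigned freq[N] = { 255, 40, 7, 120 };
    age_if_saturated(freq);
    printf("%u %u %u %u\n", freq[0], freq[1], freq[2], freq[3]);  /* 127 20 3 60 */
    return 0;
}
```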
33. LFU : Advantages & Disadvantages
Advantages
For small and medium caches, LFU works better than FIFO and Random replacement
Suitable for high performance systems whose memory access pattern follows frequency order
Disadvantages
The register must be updated on every cache access
This affects the critical path
The replacement circuit becomes more complicated as associativity increases
34. Least Recently Used Policy
Most widely used replacement
strategy
Replaces the least recently used
cache line
Implemented by two techniques :-
◦ Square Matrix Implementation
◦ Counter Implementation
35. Square Matrix Implementation
N^2 bits per set (D flip-flops) to store the LRU information
The cache line corresponding to the row with all zeros is the victim cache line for replacement
On a cache hit, all the bits in the corresponding row are set to 1 and all the bits in the corresponding column are set to 0
On a cache miss, a priority encoder selects the cache line corresponding to the all-zero row for replacement
Used when associativity is low
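A minimal sketch of the square-matrix scheme for one set, assuming a 4-way set; the access sequence is illustrative.

```c
/* Square-matrix LRU: an N x N bit matrix per set. On an access to line i,
 * row i is set to all 1s and then column i to all 0s; the all-zero row
 * identifies the LRU (victim) line. */
#include <stdio.h>
#include <string.h>

#define N 4                                  /* assumed 4-way set */

static unsigned char M[N][N];                /* the N^2 LRU bits  */

static void touch(int i) {                   /* line i was just referenced */
    for (int j = 0; j < N; j++) M[i][j] = 1; /* set row i      */
    for (int j = 0; j < N; j++) M[j][i] = 0; /* clear column i */
}

static int lru_victim(void) {                /* row of all zeros = LRU line */
    for (int i = 0; i < N; i++) {
        int zeros = 1;
        for (int j = 0; j < N; j++) zeros &= (M[i][j] == 0);
        if (zeros) return i;
    }
    return 0;
}

int main(void) {
    memset(M, 0, sizeof M);
    touch(0); touch(2); touch(1); touch(3);
    printf("LRU victim: line %d\n", lru_victim());   /* line 0, touched longest ago */
    return 0;
}
```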
37. Counter Implementation
N registers with log2(N) bits each for N-way set associativity; thus N*log2(N) bits are used.
One register per cache line
The cache line corresponding to counter value 0 is the victim cache line for replacement
On a hit, every cache line with a counter greater than the hit cache line's counter is decremented by 1, and the hit cache line's counter is set to N-1
On a miss, the cache line whose counter value is 0 is replaced; its counter is set to N-1 and all other counters in the set are decremented by 1
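A minimal sketch of the counter scheme for one set, assuming a 4-way set; the initial counter values are illustrative.

```c
/* Counter-based LRU: one log2(N)-bit counter per line. On a hit, counters
 * larger than the hit line's old value are decremented and the hit line gets
 * N-1; the line whose counter is 0 is the replacement victim. */
#include <stdio.h>

#define N 4                                    /* assumed 4-way set */

static void lru_touch(unsigned cnt[N], int hit) {
    unsigned old = cnt[hit];
    for (int i = 0; i < N; i++)
        if (cnt[i] > old) cnt[i]--;            /* lines more recent than 'hit' age by 1 */
    cnt[hit] = N - 1;                          /* hit line becomes most recent */
}

static int lru_victim(const unsigned cnt[N]) {
    for (int i = 0; i < N; i++)
        if (cnt[i] == 0) return i;             /* counter 0 marks the LRU line */
    return 0;
}

int main(void) {
    unsigned cnt[N] = { 3, 2, 1, 0 };          /* line 0 most recent, line 3 LRU */
    lru_touch(cnt, 3);                         /* hit on line 3 */
    printf("victim after hit: line %d\n", lru_victim(cnt));   /* now line 2 */
    return 0;
}
```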
38. Look Policy
Look Through : access the cache; if the data is not found, access the lower level
Look Aside : send the request to the cache and its lower level at the same time
39. Write Policy
Need for a Write Policy :-
A block in the cache might have been updated, but the corresponding update in main memory might not have been done
Multiple CPUs have individual caches, so a write can invalidate the data in another processor's cache
I/O may be able to read/write directly into main memory
40. Write Through
In this technique, all write operations are made to main memory as well as to the cache, ensuring MM is always valid.
Any other processor-cache module may monitor traffic to MM to maintain consistency.
DISADVANTAGE
It generates memory traffic and may create a bottleneck.
Bottleneck : a delay in the transmission of data due to limited bandwidth, so information is not relayed at the speed it is processed.
41. Pseudo Write Through
Also called Write Buffer
The processor writes data into the cache and the write buffer
The memory controller writes the contents of the buffer to memory
FIFO (typical number of entries: 4)
After the write is complete, the buffer entry is flushed
42. Write Back
In this technique, updates are made only in the cache.
When an update is made, a dirty bit (or use bit) associated with the line is set.
When a block is replaced, it is written back into main memory iff its dirty bit is set.
Thus it minimizes memory writes.
DISADVANTAGE
Portions of MM may still be invalid, hence I/O should be allowed access only through the cache.
This makes for complex circuitry and a potential bottleneck.
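A minimal sketch of the write-back rule described above; the structures and names are illustrative, not taken from the slides.

```c
/* Write-back: a write only updates the cache and sets the line's dirty bit;
 * main memory is written only when a dirty line is evicted. */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

struct line { uint32_t tag; int valid, dirty; uint8_t data[32]; };

static void write_byte(struct line *l, int offset, uint8_t v) {
    l->data[offset] = v;
    l->dirty = 1;                        /* the line now differs from memory */
}

static void evict(struct line *l, uint8_t *memory, uint32_t block_addr) {
    if (l->valid && l->dirty)            /* write back only if the dirty bit is set */
        memcpy(memory + block_addr, l->data, 32);
    l->valid = l->dirty = 0;
}

int main(void) {
    static uint8_t memory[1024];
    struct line l = { .tag = 0, .valid = 1, .dirty = 0 };
    write_byte(&l, 3, 0xAB);             /* update stays in the cache only   */
    evict(&l, memory, 0);                /* eviction pushes the block to MM  */
    printf("memory[3] = 0x%02X\n", memory[3]);
    return 0;
}
```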
43. Cache Coherency
This is required only in the case of multiprocessors, where each CPU has its own cache.
Why is it needed ?
Whatever the write policy, if data is modified in one cache, the copies held in other caches for the same data become invalid.
Hence we need to maintain cache coherency to obtain correct results.
44. Approaches towards Cache Coherency
1) Bus watching write through :
The cache controller monitors writes into shared memory locations that also reside in its cache memory.
If any such write is made, the controller invalidates the cache entry.
This approach depends on the use of a write through policy.
45. 2) Hardware Transparency :-
Additional hardware ensures that all updates to main memory via one cache are reflected in all caches.
3) Non-Cacheable Memory :-
Only a portion of main memory is shared by more than one processor, and this portion is designated as non-cacheable.
Here, all accesses to shared memory are cache misses, as it is never copied into the cache.
46. Cache Optimization
Reducing the miss penalty
1. Multi level caches
2. Critical word first
3. Priority to Read miss over writes
4. Merging write buffers
5. Victim caches
47. Multilevel Cache
The inclusion of an on-chip cache left open the question of whether an additional external cache is still desirable.
The answer is yes! The reasons are :
◦ If there is no L2 cache and the processor makes a request for a memory location not in the L1 cache, it accesses DRAM or ROM. Due to the relatively slower bus speed, performance degrades.
◦ Whereas, if an L2 SRAM cache is included, the frequently missed information can be quickly retrieved. Also, SRAM is fast enough to match the bus speed, giving zero-wait-state transactions.
48. The L2 cache does not use the system bus as the path for transfer between L2 and the processor, but a separate data path, to reduce the burden on the bus.
A series of simulations has shown that the L2 cache is most efficient when it is at least double the size of L1, as otherwise its contents will largely duplicate those of L1.
Due to the continued shrinkage of processor components, many processors can accommodate the L2 cache on chip, giving rise to the opportunity to include an L3 cache.
The only disadvantage of a multilevel cache is that it complicates the design.
49. Cache Performance
Average memory access time = Hit time_L1 + Miss rate_L1 × (Hit time_L2 + Miss rate_L2 × Miss penalty_L2)
Average memory stalls per instruction = Misses per instruction_L1 × Hit time_L2 + Misses per instruction_L2 × Miss penalty_L2
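A worked example of the two-level access-time formula above, with assumed figures (L1 hit 1 cycle, 5% L1 miss rate, L2 hit 10 cycles, 20% L2 local miss rate, 100-cycle L2 miss penalty); the numbers are illustrative only.

```c
/* Two-level AMAT = HitTime_L1 + MissRate_L1 * (HitTime_L2 + MissRate_L2 * MissPenalty_L2) */
#include <stdio.h>

int main(void) {
    double hit_l1 = 1.0,  miss_rate_l1 = 0.05;
    double hit_l2 = 10.0, miss_rate_l2 = 0.20, penalty_l2 = 100.0;

    double amat = hit_l1 + miss_rate_l1 * (hit_l2 + miss_rate_l2 * penalty_l2);

    printf("average memory access time = %.2f cycles\n", amat);  /* 2.50 cycles */
    return 0;
}
```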
50. Unified Vs Split Cache
Earlier, the same cache was used for data as well as instructions, i.e. a Unified Cache.
Now we have separate caches for data and instructions, i.e. a Split Cache.
Thus, if the processor attempts to fetch an instruction from main memory, it first consults the instruction L1 cache, and similarly for data.
51. Advantages of Unified Cache
It balances load between data and
instructions automatically.
That is, if execution involves more
instruction fetches, the cache will tend
to fill up with instructions, and if
execution involves more of data
fetches, the cache tends to fill up with
data.
Only one cache needs to be designed and implemented.
52. Advantages of Split Cache
Useful for parallel instruction execution and pre-fetching of predicted future instructions
Eliminates contention between the instruction fetch/decode unit and the execution unit, thereby supporting pipelining: the processor fetches instructions ahead of time and fills the buffer, or pipeline.
E.g. superscalar machines such as the Pentium and PowerPC
53. Critical Word First
This policy involves sending the requested word first and then transferring the rest, thus getting the needed data to the processor in the first cycle.
Assume that 1 block = 16 bytes and 1 cycle transfers 4 bytes; thus at least 4 cycles are required to transfer the block.
If the processor demands the 2nd word, there is no need to wait for the entire block to be transferred: we can first send that word and then complete the block with the remaining bytes.
54. Priority to Read Miss over Writes
Write Buffer:
Using write buffers creates RAW conflicts with reads on cache misses.
If we simply wait for the write buffer to empty, the read miss penalty increases by about 50%.
Check the contents of the write buffer on a read miss; if there are no conflicts and the memory system is available, let the read miss continue. If there is a conflict, flush the buffer before the read.
Write Back?
On a read miss replacing a dirty block:
Normal: write the dirty block to memory, and then do the read.
Instead: copy the dirty block to a write buffer, do the read first, and write the block back to memory afterwards.
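A minimal sketch of the write-buffer check on a read miss; the buffer layout, depth and addresses are illustrative assumptions.

```c
/* On a read miss, check whether the missed address is still sitting in the
 * write buffer (a RAW hazard); only then must the buffered write be handled
 * before the read goes to memory. */
#include <stdio.h>
#include <stdint.h>

#define WB_ENTRIES 4                        /* typical depth mentioned above */

static uint32_t wb_addr[WB_ENTRIES];
static int      wb_valid[WB_ENTRIES];

static int wb_conflict(uint32_t addr) {     /* 1 if a pending write matches 'addr' */
    for (int i = 0; i < WB_ENTRIES; i++)
        if (wb_valid[i] && wb_addr[i] == addr) return 1;
    return 0;
}

int main(void) {
    wb_addr[0] = 0x100; wb_valid[0] = 1;    /* one write still pending */

    uint32_t miss_addr = 0x100;
    if (wb_conflict(miss_addr))
        printf("conflict: handle the buffered write before the read\n");
    else
        printf("no conflict: the read miss may proceed past the pending writes\n");
    return 0;
}
```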
55. Victim Cache
How to combine the fast hit time of DM with reduced conflict misses?
Add a small fully associative buffer (cache) to hold data discarded from the cache: the Victim Cache.
A small fully associative cache is used for collecting spilled-out data.
Blocks that are discarded because of a miss (victims) are stored in the victim cache, which is checked on a cache miss.
If found, swap the data block between the victim cache and the main cache.
56. Replacement always happens with the LRU block of the victim cache. The block that we want to transfer back is made MRU.
Then the block evicted from the main cache comes to the victim cache and is made MRU.
The block which was transferred to the main cache is now made LRU in the victim cache.
If there is a miss in the victim cache as well, then MM is referred.
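A minimal sketch of the victim-cache lookup on a main-cache miss; the names, sizes and tag values are illustrative assumptions.

```c
/* Victim cache: on a main-cache miss, probe the small fully associative
 * victim cache; on a hit there, swap the block with the one being evicted
 * from the main cache, otherwise go to main memory. */
#include <stdio.h>
#include <stdint.h>

#define VC_LINES 4                       /* assumed tiny fully associative victim cache */

static uint32_t vc_tag[VC_LINES];
static int      vc_valid[VC_LINES];

/* returns the victim-cache slot holding 'tag', or -1 on a victim-cache miss */
static int vc_lookup(uint32_t tag) {
    for (int i = 0; i < VC_LINES; i++)
        if (vc_valid[i] && vc_tag[i] == tag) return i;
    return -1;
}

int main(void) {
    vc_tag[2] = 0x1234; vc_valid[2] = 1;

    uint32_t wanted = 0x1234, evicted_from_cache = 0x9999;
    int slot = vc_lookup(wanted);
    if (slot >= 0) {
        /* hit in the victim cache: swap the two blocks (tags stand in for data) */
        vc_tag[slot] = evicted_from_cache;
        printf("swap: block 0x%X moves back to the main cache\n", wanted);
    } else {
        printf("miss in the victim cache too: fetch the block from main memory\n");
    }
    return 0;
}
```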
57. Cache Optimization
Reducing hit time
1. Small and simple caches
2. Way prediction cache
3. Trace cache
4. Avoid Address translation during
indexing of the cache