4th National Conference on Emerging Trends in Engineering Technologies, ETET-2015
20th & 21st February 2015
Jyothy Institute of Technology, Department of ECE. Page 184
EC-29
DIFFERENT APPROACHES IN ENERGY
EFFICIENT CACHE MEMORY
ARCHITECTURE
Dhritiman Halder
Dept. of ECE, REVA ITM
Yealahanka, Bangalore-64
ABSTRACT - Many high-performance microprocessors employ the cache write-through policy for performance improvement while at the same time achieving good tolerance to soft errors in on-chip caches. However, the write-through policy also incurs a large energy overhead due to the increased accesses to caches at the lower level (e.g., L2 caches) during write operations. In this paper, a new cache architecture, referred to as a way-tagged cache, is introduced to improve the energy efficiency of write-through caches. By maintaining the way tags of the L2 cache in the L1 cache during read operations, the proposed technique enables the L2 cache to work in an equivalent direct-mapping manner during write hits, which account for the majority of L2 cache accesses. This leads to significant energy reduction without performance degradation.
Index Terms - Cache, low power, write-through policy.
I. INTRODUCTION
MULTI-LEVEL on-chip cache systems have been widely adopted in high-performance microprocessors. To keep data consistent throughout the memory hierarchy, write-through and write-back policies are commonly employed. Under the write-back policy, a modified cache block is copied back to its corresponding lower-level cache only when the block is about to be replaced. Under the write-through policy, in contrast, all copies of a cache block are updated immediately after the cache block is modified at the current cache, even though the block might not be evicted. As a result, the write-through policy maintains identical data copies at all levels of the cache hierarchy throughout most of their lifetime. This feature is important
as CMOS technology is scaled into the
nanometer range, where soft errors have emerged
as a major reliability issue in on-chip cache
systems. It has been reported that single-event
multi-bit upsets are getting worse in on-chip
memories. Currently, this problem has been
addressed at different levels of the design
abstraction. At the architecture level, an effective
solution is to keep data consistent among
different levels of the memory hierarchy to
prevent the system from collapsing due to soft errors. Benefiting from immediate updates, the cache
write-through policy is inherently tolerant to soft
errors because the data at all related levels of the
cache hierarchy are always kept consistent. Due
to this feature, many high-performance
microprocessor designs have adopted the write-
through policy. While enabling better tolerance
to soft errors, the write-through policy also incurs
large energy overhead. This is because under the
write-through policy, caches at the lower level
experience more accesses during write
operations. Consider a two-level (i.e., Level-1
and Level-2) cache system for example. If the L1
data cache implements the write-back policy, a
write hit in the L1 cache does not need to access
the L2 cache. In contrast, if the L1 cache is write-
through, then both L1 and L2 caches need to be
accessed for every write operation. Obviously, the
write-through policy incurs more write accesses
in the L2 cache, which in turn increases the
energy consumption of the cache system. Power
dissipation is now considered as one of the
critical issues in cache design. Studies have
shown that on-chip caches can consume about
50% of the total power in high-performance
microprocessors.
In this paper, a new cache architecture, referred to as a way-tagged cache, is proposed to improve the energy efficiency of write-through cache systems with minimal area overhead and no performance degradation. Consider a two-level cache hierarchy where the L1 data cache is write-through and the L2 cache is inclusive for high performance. It is observed that all the data residing in the L1 cache have copies in the L2 cache. In addition, the locations of these copies in the L2 cache will not change until they are evicted from the L2 cache. Thus, a tag can be attached to each way in the L2 cache and this tag information sent to the L1 cache when the data is loaded into the L1 cache. By doing so, the exact locations (i.e., ways) in the L2 cache of the copies of all the data in the L1 cache are known.
During the subsequent accesses when there is a
write hit in the L1 cache (which also initiates a
write access to the L2 cache under the write-
through policy), the L2 cache can be accessed in
an equivalent direct-mapping manner because the
way tag of the data copy in the L2 cache is
available. As this operation accounts for the
majority of L2 cache accesses in most
applications, the energy consumption of L2 cache
can be reduced significantly.
II. RELATED WORKS
The basic idea of the horizontal cache
partitioning approach is to partition the cache
data memory into several segments. Each
segment can be powered individually. Cache sub-
banking, proposed in, is one horizontal cache
partition technique which partitions the data array
of a cache into several banks (called cache sub-
banks). Each cache sub-bank can be accessed
(powered up) individually. Only the cache sub-
bank where the requested data is located
consumes power in each cache access. A basic structure for cache sub-banking is presented in the figure below.
Cache sub-banking saves power by eliminating
unnecessary accesses. The amount of power
saving depends on the number of cache sub-
banks. More cache sub-banks save more power.
One advantage of cache sub-banking over block
buffering is that the effective cache hit time of a
sub-bank cache can be as fast as a conventional
performance-driven cache since the sub-bank
selection logic is usually very simple and can be
easily hidden in the cache index decoding logic.
With the advantage of maintaining the cache
performance, cache sub-banking could be very
attractive to computer architects in designing
energy-efficient high-performance
microprocessors. [2]
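The sub-bank selection described above can be sketched in Python. The cache geometry used here (a 32 KB data array split into four sub-banks with 32-byte lines) is an illustrative assumption, not a parameter from the cited work; the point is that the bank number falls directly out of the line-index bits, so the selection logic folds into the existing index decoder.

```python
# Sketch of cache sub-bank selection. Only the selected sub-bank would be
# powered for an access; the other sub-banks stay idle.

LINE_SIZE = 32                        # bytes per cache line (assumed)
NUM_LINES = 1024                      # 32 KB / 32 B (assumed)
NUM_SUBBANKS = 4                      # assumed
LINES_PER_BANK = NUM_LINES // NUM_SUBBANKS

def subbank_select(address: int) -> tuple[int, int]:
    """Return (sub-bank number, line index within that sub-bank)."""
    line_index = (address // LINE_SIZE) % NUM_LINES
    # The sub-bank comes from the upper bits of the line index, so the
    # selection is just part of decoding the index field.
    bank = line_index // LINES_PER_BANK
    local_index = line_index % LINES_PER_BANK
    return bank, local_index
```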
Bit line segmentation offers a solution for further
power savings. The internal organization of each column in the data or tag array is modified as shown in the figure below.
Here, every column of bit cells sharing one (or more) pair of bit lines is split into independent segments as shown. An additional pair of lines runs across the segments. The bit lines within each segment can be connected to or isolated from these common lines. The metal layer used for clock distribution can implement this line, since the clock does not need to be routed across the bit-cell array. Before a readout, all
segments are connected to the common lines,
which are precharged as usual. In the meantime,
the address decoder identifies the segment
targeted by the row address issued to the array
and isolates all but the targeted segment from the
common bit line. This reduces the effective
capacitive loading (due to the diffusion
capacitances of the pass transistors) on the
common line. This reduction is somewhat offset
by the additional capacitance of the common line
that spans a single segment and the diffusion
capacitances of the isolating switches. The
common line is then sensed. Because of the
reduced loading on the common line, the energy
discharged due to readout or spent in a write are
small. Thus, smaller drivers, precharging
transistors and sense amps can be used. [3]
The figure above depicts the architecture of our base
cache. The memory address is split into a line-
offset field, an index field, and a tag field. For
our base cache, those fields are 5, 6 and 21 bits,
respectively, assuming a 32-bit address. Being
four-way set-associative, the cache contains four
tag arrays and four data arrays. During an access,
the cache decodes the address’ index field to
simultaneously read out the appropriate tag from
each of the four tag arrays, while decoding the
index field to simultaneously read out the
appropriate data from the four data arrays. The
cache feeds the decoded lines through two
inverters to strengthen their signals. The read tags
and data items pass through sense amplifiers. The
cache simultaneously compares the four tags with
the address’ tag field. If one tag matches, a
multiplexor routes the corresponding data to the
cache output. [4]
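The field split described above (5-bit line offset, 6-bit index, 21-bit tag of a 32-bit address) can be sketched as follows. The parallel probe of the four tag arrays is modeled here as a simple scan, since hardware parallelism has no direct software equivalent; the function names are illustrative.

```python
# Address decomposition for the 4-way set-associative base cache:
# 5-bit offset, 6-bit index, 21-bit tag (5 + 6 + 21 = 32 bits).

OFFSET_BITS = 5
INDEX_BITS = 6
TAG_BITS = 21

def split_address(addr: int) -> tuple[int, int, int]:
    """Return the (tag, index, offset) fields of a 32-bit address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = (addr >> (OFFSET_BITS + INDEX_BITS)) & ((1 << TAG_BITS) - 1)
    return tag, index, offset

def lookup(tags_by_way, addr):
    """Compare the address tag against all four ways of the indexed set
    (done in parallel in hardware); return the hit way, or None."""
    tag, index, _ = split_address(addr)
    for way in range(4):
        if tags_by_way[way][index] == tag:
            return way
    return None
```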
The energy consumption of a set-associative cache tends to be higher than that of a direct-mapped
cache, because all the ways in a set are accessed
in parallel although at most only one way has the
desired data. To solve the energy issue the
phased cache divides the cache-access process
into the following two phases as shown below.
First, all the tags in the set are examined in
parallel, and no data accesses occur during this
phase. Next, if there is a hit, then a data access is
performed for the hit way. The way-predicting
cache speculatively chooses one way before
starting the normal cache-access process, and
then accesses the predicted way as shown below.
Fig-a
If the prediction is correct, the cache access has
been completed successfully. Otherwise, the
cache then searches the other remaining ways as
shown below:
Fig-b
On a prediction-hit, shown in Figure (a), the way-
predicting cache consumes only energy for
activating the predicted way. In addition, the
cache access can be completed in one cycle. On
prediction-misses (or cache misses), however, the
cache-access time of the way-predicting cache
increases due to the successive process of two
phases as shown in Figure (b). Since all the
remaining ways are activated in the same manner
as a conventional set-associative cache, the way-
predicting cache cannot reduce energy consumption in this scenario. The performance/energy efficiency of the way-predicting cache largely depends on the accuracy of the way prediction.
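The trade-offs among the conventional, phased, and way-predicting organizations can be summarized with a toy energy and latency model. The per-way energy values below are illustrative assumptions, not measured numbers; the model only captures the relative behavior described above.

```python
# Toy model of the three access schemes for an n-way cache.
# Returns (relative energy, cycles) per access.

N_WAYS = 4
E_TAG, E_DATA = 1.0, 4.0   # assumed relative per-way access energies

def conventional():
    # All tag and data ways probed in parallel, one cycle.
    return N_WAYS * (E_TAG + E_DATA), 1

def phased(hit=True):
    # Phase 1: all tags; phase 2: one data way, only on a hit. Two cycles.
    return N_WAYS * E_TAG + (E_DATA if hit else 0), 2

def way_predicting(correct=True):
    if correct:
        # Only the predicted way's tag and data are activated, one cycle.
        return E_TAG + E_DATA, 1
    # Misprediction: predicted way first, then all remaining ways are
    # activated as in a conventional cache, with an extra cycle.
    return (E_TAG + E_DATA) + (N_WAYS - 1) * (E_TAG + E_DATA), 2
```

Note that on a misprediction the model charges the full conventional energy plus the latency of the phased scheme, matching the worst case discussed in the text.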
In this approach, an MRU (Most Recently Used) algorithm is introduced. The MRU information for each set, a two-bit flag, is used to speculatively choose one way from the corresponding set. These two-bit flags are stored in a table accessed by the set-index address. Reading the MRU information before starting the cache access might make the cache access time longer. However, this can be hidden by calculating the set-index address at an earlier pipeline stage. In addition, way prediction helps reduce cache access time by eliminating the delay for way selection. We therefore assume that the cache access time on a prediction hit of the way-predicting cache is the same as that of a conventional set-associative cache. [5]
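A minimal sketch of the MRU predictor described above, assuming a hypothetical 64-set cache: the table simply records, per set, the way that last hit, and that two-bit value is the prediction for the next access to the same set.

```python
# MRU-based way prediction: one two-bit flag per set, held in a small
# table indexed by the set-index address. Set count is an assumption.

NUM_SETS = 64

mru_table = [0] * NUM_SETS   # 2-bit MRU way number per set

def predict_way(set_index: int) -> int:
    """Speculatively choose the way to activate for this set."""
    return mru_table[set_index]

def update_mru(set_index: int, hit_way: int) -> None:
    """After the access resolves, remember the way that actually hit."""
    mru_table[set_index] = hit_way & 0b11   # keep it a 2-bit value
```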
Another approach uses a two-phase associative
cache: access all tags to determine the correct
way in the first phase, and then only access a
single data item from the matching way in the
second phase. Although this approach has been
proposed to reduce primary cache energy, it is
more suited for secondary cache designs due to
the performance penalty of an extra cycle in
cache access time. A higher performance
alternative to phased primary cache is to use
CAM (content-addressable memory) to hold tags.
CAM tags have been used in a number of low-
power processors including the StrongARM and
XScale. Although they add roughly 10% to total
cache area, CAMs perform tag checks for all
ways and read out only the matching data in one
cycle. Moreover, a 32-way associative cache
with CAM tags has roughly the same hit energy
as a two-way set associative cache with RAM
tags, but has a higher hit rate. Even so, a CAM
tag lookup still adds considerable energy
overhead to the simple RAM fetch of one
instruction word. Way-prediction can also reduce
the cost of tag accesses by using a way-
prediction table and only accessing the tag and
data from the predicted way.
Correct prediction avoids the cost of reading tags
and data from incorrect ways, but a misprediction
requires an extra cycle to perform tag
comparisons from all ways. This scheme has
been used in commercial high-performance
designs to add associativity to off-chip secondary
caches; to on-chip primary instruction caches to
reduce cache hit latencies in superscalar
processors; and has been proposed to reduce the
access energy in low-power microprocessors.
Since way prediction is a speculative technique, it
still requires that we fetch one tag and compare it
against the current PC to check if the prediction
was correct. Though it has never been examined,
way-prediction can also be applied to CAM-
tagged caches. However, because of the
speculative nature of way-prediction, a tag still
needs to be read out and compared. Also, on a
mispredict, the entire access needs to be restarted;
there is no work that can be salvaged. Thus, twice
the number of words are read out of the cache.
An alternative to way prediction is way
memoization. Way memoization stores tag
lookup results (links) within the instruction cache
in a manner similar to some way prediction
schemes. However, way memoization also
associates a valid bit with each link. These valid
bits indicate, prior to instruction access, whether
the link is correct. This is in contrast to way
prediction where the access needs to be verified
afterward. This is the crucial difference between
the two schemes, and allows way-memoization to
work better in CAM-tagged caches. If the link is
valid, we simply follow the link to fetch the next
instruction and no tag checks are performed.
Otherwise, we fall back on a regular tag search to
find the location of the next instruction and
update the link for future use. The main
complexity in our technique is caused by the need
to invalidate all links to a line when that line is
evicted. The coherence of all the links is
maintained through an invalidation scheme. Way
memoization is orthogonal to and can be used in
conjunction with other cache energy reduction
techniques such as sub-banking, block buffering,
and the filter cache. Another approach to remove
instruction cache tag lookup energy is the L-
cache, however, it is only applicable to loops and
requires compiler support.
The way-memoizing instruction cache keeps
links within the cache. These links allow
instruction fetch to bypass the tag-array and read
out words directly from the instruction array.
Valid bits indicate whether the cache should use
the direct access method or fall back to the
normal access method. These valid bits are the
key to maintaining the coherence of the way-
memoizing cache. When we encounter a valid
link, we follow the link to obtain the cache
address of the next instruction and thereby
completely avoid tag checks. However, when we
encounter an invalid link, we fall back to a
regular tag search to find the target instruction
and update the link. Future instruction fetches
reuse the valid link. Way-memoization can be
applied to a conventional cache, a phased cache,
or a CAM-tag cache. On a correct way
prediction, the way-predicting cache performs
one tag lookup and reads one word, whereas the
way-memoizing cache does no tag lookup, and
only reads out one word. On a way
misprediction, the way-predicting cache is as
power-hungry as the conventional cache, and as
slow as the phased cache. Thus it can be worse
than the normal non-predicting caches. The way-
memoizing cache, however, merely becomes one
of the three normal non-predicting caches in the
worst case. However, the most important
difference is that the way-memoization technique
can be applied to CAM-tagged caches. [6]
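The link-following behavior of way memoization can be sketched as follows. The `Line` structure and the `full_tag_search` callback are hypothetical names used only for illustration; the key property is that a valid link skips the tag check entirely, while an invalid link triggers one regular tag search whose result is memoized for future fetches.

```python
# Way memoization sketch: each cache line stores a link (way, index) to
# the next fetch target, plus a valid bit that says whether the link can
# be followed without any tag check.

class Line:
    def __init__(self):
        self.link = None          # (way, index) of the next fetch target
        self.link_valid = False

def fetch_next(cache, cur_line, full_tag_search):
    """Return (way, index) of the next instruction's cache line."""
    if cur_line.link_valid:
        return cur_line.link              # no tag check performed
    way, index = full_tag_search()        # regular (expensive) tag search
    cur_line.link = (way, index)          # memoize the result
    cur_line.link_valid = True
    return way, index
```

Invalidating all links to a line on eviction (the coherence mechanism mentioned above) would clear `link_valid` on every line pointing at the victim.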
There is a new way memoization technique
which eliminates redundant tag and way accesses
to reduce the power consumption. The basic idea
is to keep a small number of Most Recently Used
(MRU) addresses in a Memory Address Buffer
(MAB) and to omit redundant tag and way
accesses when there is a MAB-hit.
The MAB is accessed in parallel with the adder
used for address generation. The technique does
not increase the delay of the circuit. Furthermore,
this approach does not require modifying the
cache architecture. This is considered an
important advantage in industry because it makes
it possible to use the processor core with
previously designed caches or IPs provided by
other vendors.
The base address and the displacement for load
and store operations usually take a small number
of distinct values. Therefore, we can improve the
hit rate of the MAB by keeping only a small
number of most recently used tags. Assume the
bit width of tag memory, the number of sets in
the cache, and the size of cache lines are 18, 512,
and 32 bytes, respectively. The width of the set-index and offset fields will then be 9 and 5 bits, respectively. Since most displacement values (according to our experiments, more than 99%) are less than 2^14, we can easily calculate
tag values without address generation. This can
be done by checking the upper 18 bits of the base
address, the sign-extension of the displacement,
and the carry bit of a 14-bit adder which adds the
low 14 bits of the base address and the
displacement. Therefore, the delay of the added
circuit is the sum of the delay of the 14-bit adder
and the delay of accessing the set-index table.
Our experiment shows this delay is smaller than
the delay of the 32-bit adder used to calculate the
address.
Therefore, our technique does not have any delay
penalty. Note that if the displacement value is greater than or equal to 2^14 or less than -2^14, there will be a MAB miss, but the chance of this happening is less than 1%.
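The tag computation described above can be checked with a short sketch: with an 18-bit tag, 9-bit set index, and 5-bit offset, the tag is bits [31:14] of the effective address, and for displacements in (-2^14, 2^14) it follows from the base's upper 18 bits, the displacement's sign extension, and the carry out of a 14-bit add of the low halves, with no full 32-bit add.

```python
# Tag calculation without full address generation (18-bit tag, 9-bit
# set index, 5-bit offset => the tag is bits [31:14] of base + disp).

MASK14 = (1 << 14) - 1
MASK32 = (1 << 32) - 1

def tag_without_full_add(base: int, disp: int) -> int:
    assert -(1 << 14) < disp < (1 << 14)
    low_sum = (base & MASK14) + (disp & MASK14)   # the 14-bit adder
    carry = low_sum >> 14                         # its carry-out bit
    # The sign extension of disp contributes -1 to the upper bits when
    # the displacement is negative, and 0 otherwise.
    upper = (base >> 14) + (-1 if disp < 0 else 0) + carry
    return upper & ((1 << 18) - 1)

def tag_reference(base: int, disp: int) -> int:
    """Full 32-bit add, for comparison."""
    return ((base + disp) & MASK32) >> 14
```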
To eliminate redundant tag and way accesses for
inter-cache-line flows, we can use a MAB. Unlike
the MAB used for D-cache, the inputs of the
MAB used for I-cache can be one of the
following three types: 1) an address stored in a
link register, 2) a base address (i.e. the current
program counter address) and a displacement
value (i.e., a branch offset), and 3) the current
program counter address and its stride. In the
case of inter-cache-line sequential flow, the
current program counter address and the stride of
the program counter are chosen as inputs of the
MAB. The stride is treated as the displacement
value. If the current operation is a "branch (or jump) to the link target", the address in the link register is selected as the input of the MAB as shown in the figure below. Otherwise, the base
address and the displacement are used as done
for the data cache. [7]
A new cache architecture called the location cache has also been proposed. The figure below illustrates its structure.
The location cache is a small virtually-indexed
direct-mapped cache. It caches the location information (the way number within a set that a memory reference falls into). This cache works in
parallel with the TLB and the L1 cache. On an L1
cache miss, the physical address translated by the
TLB and the way information of the reference are
both presented to the L2 cache. The L2 cache is
then accessed as a direct-mapped cache. If there is a miss in the location cache, the L2 cache is accessed as a conventional set-associative cache. As opposed to way-prediction
information, the cached location is not a
prediction. Thus when there is a hit, both time
and power will be saved. Even if there is a miss,
we do not see any extra delay penalty as seen in
way- prediction caches. Caching the position,
unlike caching the data itself, will not cause
coherence problems in multi-processor systems.
Although the snooping mechanism may modify
the data stored in the L2 cache, the location will
not change. Also, even if a cache line is replaced
in the L2 cache, the way information stored in the
location cache will not generate a fault. One interesting issue arises here: for which references should locations be cached? The location cache should catch the references that turn out to be L1 misses. A recency-based strategy is not
suitable because the recent accesses to the L2
caches are very likely to be cached in the L1
caches. The equation below defines the optimal
coverage of the location cache.
Opt. coverage = L2 Coverage - L1 Coverage
As the indexing rules of L1 and L2 caches are
different, this optimal coverage is not reachable.
Fortunately, the memory locations are usually
referenced in sequences or strides. Whenever a
reference to the L2 cache is generated, we
calculate the location of the next cache line and
feed it into the location cache. The proposed
cache system works in the following way. The
location cache is accessed in parallel with the L1
caches. If the L1 cache sees a hit, then the result from the location cache is discarded. If there is a
miss in the L1 cache, and there is a hit in the
location cache, the L2 cache is accessed as a
direct-mapped cache. If both the L1 cache and
the location cache see a miss, then the L2 cache
is accessed as a traditional L2 cache. The tags of the L2 cache are duplicated; we call the duplicated tag arrays of the L2 cache the location tag arrays.
When the L2 cache is accessed, the location tag
arrays are accessed to generate the location
information for the next memory reference. The
generated location information is then sent to and
stored in the location cache.
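The access flow just described can be sketched as a small decision function. The 8-way L2 geometry and the function name are illustrative assumptions; the function returns which L2 ways would have to be activated for a given combination of L1 and location-cache outcomes.

```python
# Location-cache access flow: on an L1 hit the location-cache result is
# discarded and the L2 is not touched; on an L1 miss with a location-cache
# hit the L2 is accessed direct-mapped (one way); otherwise the L2 falls
# back to a conventional set-associative access.

def l2_access_ways(l1_hit: bool, loc_hit: bool, loc_way: int, n_ways: int = 8):
    """Return the list of L2 ways that must be activated."""
    if l1_hit:
        return []                       # L2 not accessed at all
    if loc_hit:
        return [loc_way]                # direct-mapped L2 access
    return list(range(n_ways))          # conventional set-associative access
```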
The L1 cache is a 16KB 4-way set-associative
cache, with a cache line size of 64-bytes,
implemented with a 0.13μm technology. The
results were produced using the CACTI3.2
simulator. We chose the access delay of a 16KB
direct-mapped cache as the baseline, which is the
best-case delay when a way-prediction
mechanism is implemented in the L1 cache. We
normalized the baseline delay to 1. It is observed
that a location cache with up to 1024 entries has
shorter access latency than the L1 cache. Though
the organization of the location cache is similar
to that of a direct-mapped cache, there is a small
change in the indexing rule. The block offset is 7
bit as the cache line size for the simulated L2
cache is 128 bytes. Thus the width of the tag is
smaller for the location cache, compared with a
regular cache.
Compared to a regular cache design, the
modification is minor. Note that we need to
double the tags (or the number of ports to the tag)
because when the original tags are compared to
validate the accesses, a spare set of tags is
compared to generate the future location
information. This idea is similar to the phased
cache. The difference is that we overlap the tag
comparison for future references with existing
cache reference and use the location cache to
store such location information. The simulated
cache geometry parameters were optimized for
the set-associative cache. The simulation results
show that the access latency for a direct-mapped
hit is 40% faster than a set-associative hit.
Although the extra hardware employed by the
location cache design does not introduce extra
delay on the memory reference critical path, it
does introduce extra power consumption. The
extra power consumption comes from the small
location cache and the duplicated tag arrays. The power consumption for the tag access of a direct-mapped hit is normalized to one. Compared with the L2 cache power consumption, the location cache consumes only a small amount of power. However, as the location cache is triggered much more often than the L2 cache, its power consumption cannot be ignored. The total chip area of the
proposed location cache system (with duplicated
tag and a location cache of 1024 entries) is only
1.39% larger than that of the original cache
system. [8]
The r-a cache is formed by using the tag array of
a set-associative cache with the data array of a
direct-mapped cache, as shown in Figure 1.
For an n-way r-a cache, there is a single data
bank, and n tag banks. The tag array is accessed
using the conventional set-associative index,
probing all the n-ways of the set in parallel, just
as in a normal set-associative cache. The data
array index uses the conventional set-associative
index concatenated with a way number to locate
a block in the set. The way number is log2(n) bits
wide. For the first probe, it may come from either
the conventional set-associative tag field’s lower-
order bits (for the direct-mapped blocks), or the
way-prediction mechanism (for the displaced
blocks). If there is a second probe (due to a
misprediction), then the matching way number is
provided by the tag array. The r-a cache
simultaneously accesses the tag and data arrays
for the first probe, at either the direct-mapped
location or a set-associative position provided by
the way-prediction mechanism. If the first probe,
called probe0, hits, then the access is complete
and the data is returned to the processor. If
probe0 fails to locate the block due to a
misprediction (i.e., either the block is in a set-
associative position when probe0 assumed direct-
mapped access or the block is in a set-associative
position different than the one supplied by way-
prediction), probe0 obtains the correct way-
number from the tag array if the block is in the
cache, and a second probe, called probe1, is done
using the correct way-number.
Probe1 probes only the data array, and not the tag
array. If the block is not in the cache, probe0
signals an overall miss and probe1 is not
necessary. Thus there are three possible paths
through the cache for a given address: (1) probe0
is predicted to be a direct mapped access, (2)
probe0 is predicted to be a set-associative access
and the prediction mechanism provides the
predicted way-number, and (3) probe0 is
mispredicted but obtains the correct way-number
from the tag array, and the data array is probed
using the correct way-number in probe1.
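The indexing scheme described above can be sketched as follows; the 4-way geometry is an illustrative assumption. The single data bank is indexed by the set index concatenated with a log2(n)-bit way number, and probe0's way number comes either from the tag field's low-order bits (direct-mapped guess) or from the way predictor.

```python
# r-a cache indexing: one data bank, indexed by {set_index, way_number}.

import math

N_WAYS = 4
WAY_BITS = int(math.log2(N_WAYS))     # log2(n) bits for the way number

def data_array_index(set_index: int, way: int) -> int:
    """Concatenate the set index with the way number."""
    return (set_index << WAY_BITS) | (way & (N_WAYS - 1))

def probe0_way(tag: int, predicted_way: int, displaced: bool) -> int:
    """Way number for the first probe: the tag field's low-order bits for
    direct-mapped blocks, the predictor's output for displaced blocks."""
    return predicted_way if displaced else tag & (N_WAYS - 1)
```

On a misprediction, probe1 would call `data_array_index` again with the way number reported by the tag array, touching only the data bank.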
On an overall miss, the block is placed in the
direct-mapped position if it is non-conflicting,
and a set-associative position (LRU, random,
etc.) otherwise. Way Prediction: The r-a cache
employs hardware way-prediction to obtain the
way-number for the blocks that are displaced to
set-associative positions before address
computation is complete. The strict timing
constraint of performing the prediction in parallel
with effective address computation requires that
the prediction mechanism use information that is
available in the pipeline earlier than the address
compute stage. The equivalent of way-prediction
for i-caches is often combined with branch
prediction but because D-caches do not interact
with branch prediction, those techniques cannot
be used directly. An alternative to prediction is to
obtain the correct way-number of the displaced
block using the address, which delays initiating
cache access to the displaced block, as is the case
for statically probed schemes such as column-
associative and group-associative caches. We
examine two handles that can be used to perform
way prediction: instruction PC and approximate
data address formed by XORing the register
value with the instruction offset (proposed in, and used in), which may be faster than performing a full add. These two handles represent the two extremes of the trade-off between prediction accuracy and early availability in the pipeline.
PC is available much earlier than the XOR approximation, but the XOR approximation is more accurate because it is hard for the PC to distinguish among different data addresses touched by the same instruction. Other handles such as instruction fields (e.g., operand register numbers) do not have significantly more information content from a prediction standpoint, and the PSA paper recommends the XOR scheme for its high accuracy. In an out-of-order processor pipeline (figure above), the instruction PC of a memory operation is available much earlier than the source register.
Therefore, way-prediction can be done in parallel with the pipeline front-end processing of memory instructions so that the predicted way-number and the probe0 way# mux select input are ready well before the data address is computed. The XOR scheme, on the other hand, needs to squeeze in an XOR operation on a value often obtained late from a register-forwarding path, followed by a prediction table lookup to produce the predicted way-number and the probe0 way# mux select, all within the time the pipeline computes the real address using a full add. Note that the prediction table must have more entries or be more associative than the cache itself to avoid conflicts among the XORed approximate data addresses, and therefore will probably have a significant access time, exacerbating the timing problem.
III. WAY-TAGGED CACHE
A way-tagged cache that exploits the way information in the L2 cache to improve energy efficiency is introduced. In a conventional set-associative cache system, when the L1 data cache loads/writes data from/into the L2 cache, all ways in the L2 cache are activated simultaneously for performance considerations, at the cost of energy overhead.
The above figure illustrates the architecture of the
two-level cache. Only the L1 data cache and L2
unified cache are shown as the L1 instruction
cache only reads from the L2 cache. Under the
write-through policy, the L2 cache always
maintains the most recent copy of the data. Thus,
whenever data is updated in the L1 cache, the
L2 cache is updated with the same data as well.
This results in an increase in the write accesses to
the L2 cache and consequently more energy
consumption. The locations (i.e., way tags) of L1
data copies in the L2 cache will not change until
the data are evicted from the L2 cache. The
proposed way-tagged cache exploits this fact to
reduce the number of ways accessed during L2
cache accesses. When the L1 data cache loads data from the L2 cache, the way tag of the data in
the L2 cache is also sent to the L1 cache and
stored in a new set of way-tag arrays. These way
tags provide the key information for the
subsequent write accesses to the L2 cache.
In general, both write and read accesses in the L1
cache may need to access the L2 cache. These
accesses lead to different operations in the
proposed way-tagged cache, as summarized in
Table I.
Under the write-through policy, all write
operations of the L1 cache need to access the L2
cache. In the case of a write hit in the L1 cache,
only one way in the L2 cache will be activated
because the way tag information of the L2 cache
is available, i.e., from the way-tag arrays we can
obtain the L2 way of the accessed data. For
a write miss in the L1 cache, however, the requested data is
not stored in the L1 cache. As a result, its
corresponding L2 way information is not
available in the way-tag arrays. Therefore, all
ways in the L2 cache need to be activated
simultaneously. Since write hit/miss is not known
a priori, the way-tag arrays need to be accessed
simultaneously with all L1 write operations in
order to avoid performance degradation. The
way-tag arrays are very small, and the involved
energy overhead can be easily compensated for.
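A hedged sketch of the write-path decision just described (names illustrative, 4-way L2 assumed):

```python
# Illustrative sketch of which L2 ways are enabled on an L1 write.
# A write hit in the L1 cache enables a single L2 way, taken from the
# way-tag arrays; a write miss must enable all ways, since the L2 way
# of the data is unknown. A 4-way L2 is assumed.

L2_WAYS = 4

def l2_ways_to_enable(l1_write_hit, way_tag):
    """Return the list of L2 way numbers to activate for this write."""
    if l1_write_hit:
        return [way_tag]               # direct-mapped-style access: one way
    return list(range(L2_WAYS))        # way unknown: activate all ways
```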
The above figure shows the system diagram of
the proposed way-tagged cache. We introduce
several new components: way-tag arrays, way-tag
buffer, way decoder, and way register, all shown
in the dotted line. The way tags of each cache
line in the L2 cache are maintained in the way-
tag arrays, located with the L1 data cache. Note
that write buffers are commonly employed in
write-through caches (and even in many write-
back caches) to improve the performance. With a
write buffer, the data to be written into the L1
cache is also sent to the write buffer. The
operations stored in the write buffer are then sent
to the L2 cache in sequence. This avoids write
stalls when the processor waits for write
operations to be completed in the L2 cache. In the
proposed technique, we also need to send the way
tags stored in the way-tag arrays to the L2 cache
along with the operations in the write buffer.
Thus, a small way-tag buffer is introduced to
buffer the way tags read from the way-tag arrays.
A way decoder is employed to decode way tags
and generate the enable signals for the L2 cache,
which activate only the desired ways in the L2
cache. Each way in the L2 cache is encoded into
a way tag. A way register stores way tags and
provides this information to the way-tag arrays.
For L1 read
operations, neither read hits nor misses need to
access the way-tag arrays. This is because read
hits do not need to access the L2 cache; while for
read misses, the corresponding way tag
information is not available in the way-tag arrays.
As a result, all ways in the L2 cache are activated
simultaneously under read misses.
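The buffered write path described above, with the way-tag buffer feeding the way decoder, can be sketched as follows; the queue structure and names are illustrative assumptions.

```python
# Sketch (illustrative names) of the buffered write path: each entry in
# the write buffer is paired with its way tag from the way-tag buffer,
# and the way decoder turns that tag into one-hot enable signals for
# the L2 ways. A 4-way L2 is assumed.
from collections import deque

L2_WAYS = 4
write_buffer = deque()     # pending write operations, drained in order
way_tag_buffer = deque()   # way tags read from the way-tag arrays

def enqueue_write(addr, data, way_tag):
    """Buffer an L1 write and its way tag for the L2 cache."""
    write_buffer.append((addr, data))
    way_tag_buffer.append(way_tag)

def way_decoder(way_tag):
    """Decode a way tag into one-hot enable signals, one bit per L2 way."""
    return [1 if w == way_tag else 0 for w in range(L2_WAYS)]

def drain_one_write():
    """Send the oldest buffered write to the L2 cache with its enables."""
    addr, data = write_buffer.popleft()
    enables = way_decoder(way_tag_buffer.popleft())
    return addr, data, enables
```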
The energy consumption per read and per write
of the conventional set-associative L2 cache and
of the proposed L2 cache is compared below:
This cache configuration, used in Pentium-4, will
be used as a baseline system for comparison with
the proposed technique under different cache
configurations.
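A back-of-the-envelope sketch of where the savings come from; the per-way energy value is an assumed illustration, not a result from the paper.

```python
# Illustrative arithmetic (assumed numbers, not measured results):
# per-access L2 energy scales roughly with the number of ways probed,
# so enabling one way instead of all four cuts the way-probe energy.

L2_WAYS = 4
E_WAY = 1.0          # assumed energy units to probe one L2 way

def l2_access_energy(ways_enabled):
    """Energy for one L2 access, proportional to ways probed."""
    return E_WAY * ways_enabled

conventional_write = l2_access_energy(L2_WAYS)   # all ways probed
way_tagged_write_hit = l2_access_energy(1)       # single way probed

savings = 1 - way_tagged_write_hit / conventional_write
```

With these assumed numbers a write hit probes 75% less way energy; since write hits account for the majority of L2 accesses under write-through, the overall reduction is substantial.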
IV. CONCLUSION
This paper presents a new energy-efficient cache
technique for high-performance microprocessors
employing the write-through policy. The
proposed technique attaches a tag to each way in
the L2 cache. This way tag is sent to the way-tag
arrays in the L1 cache when the data is loaded
from the L2 cache to the L1 cache. Utilizing the
way tags stored in the way-tag arrays, the L2
cache can be accessed as a direct-mapping cache
during the subsequent write hits, thereby
reducing cache energy consumption. Simulation
results demonstrate a significant reduction in
cache energy consumption with minimal area
overhead and no performance degradation.
Furthermore, the idea of way tagging can be
applied to many existing low-power cache
techniques such as the phased access cache to
further reduce cache energy consumption. Future
work is being directed towards extending this
technique to other levels of cache hierarchy and
reducing the energy consumption of other cache
operations.
REFERENCES
[1] J. Dai and L. Wang, "An energy-efficient L2 cache architecture using way tag information under write-through policy," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 1, Jan. 2013.
[2] C. Su and A. Despain, "Cache design tradeoffs for power and performance optimization: A case study," in Proc. Int. Symp. Low Power Electron. Design, 1997, pp. 63–68.
[3] K. Ghose and M. B. Kamble, "Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation," in Proc. Int. Symp. Low Power Electron. Design, 1999, pp. 70–75.
[4] C. Zhang, F. Vahid, and W. Najjar, "A highly-configurable cache architecture for embedded systems," in Proc. Int. Symp. Comput. Arch., 2003, pp. 136–146.
[5] K. Inoue, T. Ishihara, and K. Murakami, "Way-predicting set-associative cache for high performance and low energy consumption," in Proc. Int. Symp. Low Power Electron. Design, 1999, pp. 273–275.
[6] A. Ma, M. Zhang, and K. Asanović, "Way memoization to reduce fetch energy in instruction caches," in Proc. ISCA Workshop Complexity Effective Design, 2001, pp. 1–9.
[7] T. Ishihara and F. Fallah, "A way memoization technique for reducing power consumption of caches in application specific integrated processors," in Proc. Design Autom. Test Euro. Conf., 2005, pp. 358–363.
[8] R. Min, W. Jone, and Y. Hu, "Location cache: A low-power L2 cache system," in Proc. Int. Symp. Low Power Electron. Design, 2004, pp. 120–125.
[9] T. N. Vijaykumar, "Reactive-associative caches," in Proc. Int. Conf. Parallel Arch. Compiler Tech., 2001, pp. 49–61.
[10] V. Vasudevan Nair, "Way-tagged L2 cache architecture in conjunction with energy efficient datum storage," ECE Department, Anna University Chennai, Sri Eshwar College of Engineering, Coimbatore, India.