4th National Conference on Emerging Trends in Engineering Technologies, ETET-2015
20th & 21st February 2015
Jyothy Institute of Technology, Department of ECE
EC-29
DIFFERENT APPROACHES IN ENERGY EFFICIENT CACHE MEMORY ARCHITECTURE
Dhritiman Halder
Dept. of ECE, REVA ITM
Yelahanka, Bangalore-64
ABSTRACT - Many high-performance microprocessors employ the cache write-through policy to improve performance while at the same time achieving good tolerance to soft errors in on-chip caches. However, the write-through policy also incurs a large energy overhead due to the increased accesses to caches at the lower level (e.g., L2 caches) during write operations. In this paper, a new cache architecture, referred to as a way-tagged cache, is introduced to improve the energy efficiency of write-through caches. By maintaining the way tags of the L2 cache in the L1 cache during read operations, the proposed technique enables the L2 cache to work in an equivalent direct-mapping manner during write hits, which account for the majority of L2 cache accesses. This leads to significant energy reduction without performance degradation.

Index Terms - Cache, low power, write-through policy.
I. INTRODUCTION
MULTI-LEVEL on-chip cache systems have
been widely adopted in high-performance
microprocessors. To keep data consistent throughout the memory hierarchy, write-through and write-back policies are commonly employed.
Under the write-back policy, a modified cache
block is copied back to its corresponding lower
level cache only when the block is about to be
replaced. Under the write-through policy, by contrast, all copies of a cache block are updated immediately after the cache block is modified at the current cache, even though the block might not be evicted. As a result, the write-through policy maintains identical data copies at all levels of the cache hierarchy throughout most of their lifetime of execution. This feature is important
as CMOS technology is scaled into the
nanometer range, where soft errors have emerged
as a major reliability issue in on-chip cache
systems. It has been reported that single-event
multi-bit upsets are getting worse in on-chip
memories. Currently, this problem has been
addressed at different levels of the design
abstraction. At the architecture level, an effective
solution is to keep data consistent among
different levels of the memory hierarchy to
prevent the system from collapsing due to soft errors. Benefiting from immediate updates, the cache write-through policy is inherently tolerant to soft
errors because the data at all related levels of the
cache hierarchy are always kept consistent. Due
to this feature, many high-performance
microprocessor designs have adopted the write-
through policy. While enabling better tolerance
to soft errors, the write-through policy also incurs
large energy overhead. This is because under the
write-through policy, caches at the lower level
experience more accesses during write
operations. Consider a two-level (i.e., Level-1
and Level-2) cache system for example. If the L1
data cache implements the write-back policy, a
write hit in the L1 cache does not need to access
the L2 cache. In contrast, if the L1 cache is write-
through, then both L1 and L2 caches need to be
accessed for every write operation. Obviously, the write-through policy incurs more write accesses to the L2 cache, which in turn increases the energy consumption of the cache system. Power dissipation is now considered one of the
critical issues in cache design. Studies have
shown that on-chip caches can consume about
50% of the total power in high-performance
microprocessors.
In this paper, a new cache architecture, referred to as a way-tagged cache, is proposed to improve the energy efficiency of write-through cache systems with minimal area overhead and no performance degradation. Consider a two-level cache hierarchy, where the L1 data cache is write-through and the L2 cache is inclusive for high performance. It is observed that all the data residing in the L1 cache have copies in the L2 cache. In addition, the locations of these copies in the L2 cache do not change until they are evicted from the L2 cache. Thus, a tag can be attached to each way in the L2 cache, and this tag information can be sent to the L1 cache when the data is loaded into the L1 cache. By doing so, the exact locations (i.e., ways) of the L2 copies of all the data in the L1 cache are known.
During the subsequent accesses when there is a
write hit in the L1 cache (which also initiates a
write access to the L2 cache under the write-
through policy), the L2 cache can be accessed in
an equivalent direct-mapping manner because the
way tag of the data copy in the L2 cache is
available. As this operation accounts for the
majority of L2 cache accesses in most
applications, the energy consumption of L2 cache
can be reduced significantly.
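As a rough illustration of where the savings come from, the following counting sketch (with an assumed L2 associativity and an assumed write-hit ratio, neither taken from the paper's experiments) compares how many L2 ways are activated per write under a conventional write-through L2 cache and under the way-tagged scheme.

```python
def l2_ways_activated(accesses, assoc=8, way_tagged=True):
    """Count L2 way activations for a stream of L2 write accesses.

    accesses: list of booleans, True = L1 write hit (way tag known),
              False = L1 write miss (way tag unknown).
    assoc:    L2 associativity (8 is an assumed value, not from the paper).
    """
    total = 0
    for l1_write_hit in accesses:
        if way_tagged and l1_write_hit:
            total += 1        # only the tagged way is activated
        else:
            total += assoc    # all ways activated, as in a conventional cache
    return total

# Example: assume 90% of L2 write accesses come from L1 write hits.
stream = [True] * 90 + [False] * 10
print(l2_ways_activated(stream, way_tagged=False))  # 800 activations
print(l2_ways_activated(stream, way_tagged=True))   # 170 activations
```

With these assumed numbers the activation count drops from 800 to 170, which illustrates the intuition behind the claimed energy reduction on write hits.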
II. RELATED WORKS
The basic idea of the horizontal cache
partitioning approach is to partition the cache
data memory into several segments. Each
segment can be powered individually. Cache sub-banking, proposed in [2], is one horizontal cache partitioning technique that partitions the data array of a cache into several banks (called cache sub-banks). Each cache sub-bank can be accessed
(powered up) individually. Only the cache sub-
bank where the requested data is located
consumes power in each cache access. A basic
structure for cache sub-banking is presented in the figure below.
Cache sub-banking saves power by eliminating
unnecessary accesses. The amount of power
saving depends on the number of cache sub-
banks. More cache sub-banks save more power.
One advantage of cache sub-banking over block
buffering is that the effective cache hit time of a
sub-bank cache can be as fast as a conventional
performance-driven cache since the sub-bank
selection logic is usually very simple and can be
easily hidden in the cache index decoding logic.
With the advantage of maintaining the cache
performance, cache sub-banking could be very
attractive to computer architects in designing
energy-efficient high-performance
microprocessors. [2]
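As a concrete sketch of the selection step, the snippet below shows how a single sub-bank can be derived from the address so that only that bank is powered up; the sub-bank count, line size, and word-interleaved striping are illustrative assumptions, not details from [2].

```python
def select_subbank(addr, line_size=32, num_subbanks=4):
    """Return the index of the only data sub-bank that needs to be powered
    up for this access; all other sub-banks stay idle.

    line_size and num_subbanks are assumed, illustrative parameters. A cache
    line is assumed to be striped across the sub-banks word by word, so the
    word offset within the line picks the sub-bank.
    """
    word = (addr % line_size) // 4          # 4-byte words within the line
    return word % num_subbanks

# Only sub-bank select_subbank(addr) is driven for this access; the other
# three consume no dynamic read energy.
print(select_subbank(0x1004))  # -> 1
```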
Bit line segmentation offers a solution for further
power savings. The internal organization of each
column in the data or tag array gets modified as
shown in Figure below.
Here, every column of bit cells sharing one (or more) pair of bit lines is split into independent segments as shown. An additional pair of lines is run across the segments. The bit lines within
each segment can be connected or isolated from
these common lines as shown. The metal layer
used for clock distribution can implement this
line, since the clock does not need to be routed
across the bit cell array. Before a readout, all
segments are connected to the common lines,
which are precharged as usual. In the meantime,
4th
National Conference on Emerging Trends in Engineering Technologies, ETET-2015
20th
& 21st
February 2015
Jyothy Institute of Technology Department of ECE P a g e | 186
the address decoder identifies the segment
targeted by the row address issued to the array
and isolates all but the targeted segment from the
common bit line. This reduces the effective
capacitive loading (due to the diffusion
capacitances of the pass transistors) on the
common line. This reduction is somewhat offset
by the additional capacitance of the common line
that spans a single segment and the diffusion
capacitances of the isolating switches. The
common line is then sensed. Because of the reduced loading on the common line, the energy discharged during a readout or spent in a write is small. Thus, smaller drivers, precharging transistors, and sense amplifiers can be used. [3]
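The saving can be approximated with a simple switched-capacitance model; the sketch below uses assumed, illustrative capacitance values and is not calibrated to the results in [3].

```python
def bitline_energy(c_per_cell_fF, cells, segments, c_common_per_seg_fF, vdd=1.2):
    """Rough dynamic energy metric (proportional to C_eff * Vdd^2, in fJ)
    for one bit-line transition.

    Unsegmented: all 'cells' load the bit line.
    Segmented:   only cells/segments cells plus the common line (modeled as
                 one extra capacitance contribution per segment) switch.
    All capacitance values are assumed, illustrative numbers.
    """
    c_unseg = c_per_cell_fF * cells
    c_seg = c_per_cell_fF * (cells // segments) + c_common_per_seg_fF * segments
    return c_unseg * vdd ** 2, c_seg * vdd ** 2

full, segmented = bitline_energy(c_per_cell_fF=2.0, cells=256,
                                 segments=8, c_common_per_seg_fF=4.0)
print(full, segmented)   # e.g. 737.3 fJ vs. 138.2 fJ per bit line
```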
Figure above depicts the architecture of our base
cache. The memory address is split into a line-
offset field, an index field, and a tag field. For
our base cache, those fields are 5, 6 and 21 bits,
respectively, assuming a 32-bit address. Being
four-way set-associative, the cache contains four
tag arrays and four data arrays. During an access,
the cache decodes the address’ index field to
simultaneously read out the appropriate tag from
each of the four tag arrays, while decoding the
index field to simultaneously read out the
appropriate data from the four data arrays. The
cache feeds the decoded lines through two
inverters to strengthen their signals. The read tags
and data items pass through sense amplifiers. The
cache simultaneously compares the four tags with
the address’ tag field. If one tag matches, a
multiplexor routes the corresponding data to the
cache output. [4]
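The field split of the base cache is straightforward to express; the helper below assumes the same 32-bit address with a 5-bit line offset, 6-bit index, and 21-bit tag described above.

```python
def split_address(addr, offset_bits=5, index_bits=6, tag_bits=21):
    """Split a 32-bit address into (tag, index, line offset) for the base
    cache described above: 5-bit offset, 6-bit index, 21-bit tag."""
    assert offset_bits + index_bits + tag_bits == 32
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

tag, index, offset = split_address(0x1234ABCD)
print(hex(tag), index, offset)   # the index selects the set in all four ways
```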
The energy consumption of a set-associative cache tends to be higher than that of a direct-mapped cache, because all the ways in a set are accessed in parallel although at most one way has the desired data. To address this energy issue, the phased cache divides the cache-access process into the following two phases, as shown below.
First, all the tags in the set are examined in
parallel, and no data accesses occur during this
phase. Next, if there is a hit, then a data access is
performed for the hit way. The way-predicting
cache speculatively chooses one way before
starting the normal cache-access process, and
then accesses the predicted way as shown below.
Fig-a
If the prediction is correct, the cache access has
been completed successfully. Otherwise, the
cache then searches the other remaining ways as
shown below:
Fig-b
On a prediction-hit, shown in Figure (a), the way-
predicting cache consumes only energy for
activating the predicted way. In addition, the
cache access can be completed in one cycle. On
prediction-misses (or cache misses), however, the
cache-access time of the way-predicting cache
increases due to the successive process of two
phases, as shown in Figure (b). Since all the remaining ways are activated in the same manner as in a conventional set-associative cache, the way-predicting cache cannot reduce energy consumption in this scenario. The performance/energy efficiency of the way-predicting cache largely depends on the accuracy of the way prediction.
In this approach, an MRU (most recently used) algorithm has been introduced. The MRU information for each set, which is a two-bit flag, is used to speculatively choose one way from the corresponding set. These two-bit flags are stored in a table accessed by the set-index address. Reading the MRU information before starting the cache access might make the cache access time longer. However, it can be hidden by calculating the set-index address at an earlier pipeline stage. In addition, way prediction helps reduce cache access time by eliminating the delay for way selection. So, we assume that the cache-access time of the way-predicting cache on a prediction hit is the same as that of a conventional set-associative cache. [5]
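A minimal sketch of this MRU-based way-predicting access is shown below; the 4-way organization, the table sizes, and the way-activation counting are illustrative assumptions, and timing is not modeled.

```python
class WayPredictingCache:
    """Sketch of an MRU-based way-predicting set-associative cache.

    mru[set] holds the index of the most recently used way (a two-bit flag
    for the assumed 4-way cache); that way is speculatively accessed first.
    On a mispredict, the remaining ways are accessed in a second phase, as
    in a conventional set-associative cache."""

    def __init__(self, num_sets=64, ways=4):
        self.ways = ways
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.mru = [0] * num_sets

    def access(self, set_idx, tag):
        """Return (hit, ways_activated); cycle counts are not modeled."""
        predicted = self.mru[set_idx]
        if self.tags[set_idx][predicted] == tag:
            return True, 1                      # prediction hit: one way activated
        # Mispredict: the remaining ways are activated as a second phase.
        for w in range(self.ways):
            if w != predicted and self.tags[set_idx][w] == tag:
                self.mru[set_idx] = w           # update the MRU flag
                return True, self.ways
        return False, self.ways                 # cache miss
```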
Another approach uses a two-phase associative
cache: access all tags to determine the correct
way in the first phase, and then only access a
single data item from the matching way in the
second phase. Although this approach has been
proposed to reduce primary cache energy, it is
more suited for secondary cache designs due to
the performance penalty of an extra cycle in
cache access time. A higher performance
alternative to phased primary cache is to use
CAM (content-addressable memory) to hold tags.
CAM tags have been used in a number of low-
power processors including the StrongARM and
XScale. Although they add roughly 10% to total
cache area, CAMs perform tag checks for all
ways and read out only the matching data in one
cycle. Moreover, a 32-way associative cache
with CAM tags has roughly the same hit energy
as a two-way set associative cache with RAM
tags, but has a higher hit rate. Even so, a CAM
tag lookup still adds considerable energy
overhead to the simple RAM fetch of one
instruction word. Way-prediction can also reduce
the cost of tag accesses by using a way-
prediction table and only accessing the tag and
data from the predicted way.
Correct prediction avoids the cost of reading tags
and data from incorrect ways, but a misprediction
requires an extra cycle to perform tag
comparisons from all ways. This scheme has
been used in commercial high-performance
designs to add associativity to off-chip secondary
caches; to on-chip primary instruction caches to
reduce cache hit latencies in superscalar
processors; and has been proposed to reduce the
access energy in low-power microprocessors.
Since way prediction is a speculative technique, it
still requires that we fetch one tag and compare it
against the current PC to check if the prediction
was correct. Though it has never been examined,
way-prediction can also be applied to CAM-
tagged caches. However, because of the
speculative nature of way-prediction, a tag still
needs to be read out and compared. Also, on a
mispredict, the entire access needs to be restarted;
there is no work that can be salvaged. Thus, twice
the number of words are read out of the cache.
An alternative to way prediction is way memoization. Way memoization stores tag
lookup results (links) within the instruction cache
in a manner similar to some way prediction
schemes. However, way memoization also
associates a valid bit with each link. These valid
bits indicate, prior to instruction access, whether
the link is correct. This is in contrast to way
prediction where the access needs to be verified
afterward. This is the crucial difference between
the two schemes, and allows way-memoization to
work better in CAM-tagged caches. If the link is
valid, we simply follow the link to fetch the next
instruction and no tag checks are performed.
Otherwise, we fall back on a regular tag search to
find the location of the next instruction and
update the link for future use. The main
complexity in our technique is caused by the need
to invalidate all links to a line when that line is
evicted. The coherence of all the links is
maintained through an invalidation scheme. Way
memoization is orthogonal to and can be used in
conjunction with other cache energy reduction
techniques such as sub-banking, block buffering,
and the filter cache. Another approach to remove instruction cache tag lookup energy is the L-cache; however, it is only applicable to loops and requires compiler support.
The way-memoizing instruction cache keeps
links within the cache. These links allow
instruction fetch to bypass the tag-array and read
out words directly from the instruction array.
Valid bits indicate whether the cache should use
the direct access method or fall back to the
normal access method. These valid bits are the
key to maintaining the coherence of the way-
memoizing cache. When we encounter a valid
link, we follow the link to obtain the cache
address of the next instruction and thereby
completely avoid tag checks. However, when we
encounter an invalid link, we fall back to a
regular tag search to find the target instruction
and update the link. Future instruction fetches
reuse the valid link. Way-memoization can be
applied to a conventional cache, a phased cache,
or a CAM-tag cache. On a correct way
prediction, the way-predicting cache performs
one tag lookup and reads one word, whereas the
way-memoizing cache does no tag lookup, and
only reads out one word. On a way
misprediction, the way-predicting cache is as
power-hungry as the conventional cache, and as
slow as the phased cache. Thus it can be worse
than the normal non-predicting caches. The way-
memoizing cache, however, merely becomes one
of the three normal non-predicting caches in the
worst case. However, the most important difference is that the way-memoization technique can be applied to CAM-tagged caches. [6]
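The link-plus-valid-bit mechanism can be sketched as follows; the structures are simplified to line granularity and the class and method names are illustrative, not taken from [6].

```python
class WayMemoizingICache:
    """Sketch of way memoization: each fetched line keeps a link (the way of
    the next line) plus a valid bit. A valid link bypasses all tag checks; an
    invalid one falls back to a regular tag search, then records the link.
    Links to an evicted line must be invalidated to stay coherent."""

    def __init__(self, num_sets=128, ways=4):
        self.ways = ways
        self.tags = [[None] * ways for _ in range(num_sets)]
        self.link = {}          # (set, way) -> (next_set, next_way)
        self.link_valid = {}    # (set, way) -> bool

    def fetch(self, cur, nxt_set, nxt_tag):
        """cur identifies the current line; nxt_set/nxt_tag describe the next
        fetch address. Returns (way_of_next_line, tag_checks_performed)."""
        if self.link_valid.get(cur, False):
            return self.link[cur][1], 0          # follow the link, no tag check
        for w in range(self.ways):               # regular tag search
            if self.tags[nxt_set][w] == nxt_tag:
                self.link[cur] = (nxt_set, w)    # memoize for future fetches
                self.link_valid[cur] = True
                return w, self.ways
        return None, self.ways                   # instruction cache miss

    def invalidate_links_to(self, victim):
        """On eviction of 'victim' = (set, way), clear every stale link."""
        for src, dst in self.link.items():
            if dst == victim:
                self.link_valid[src] = False
```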
There is a new way memoization technique
which eliminates redundant tag and way accesses
to reduce the power consumption. The basic idea
is to keep a small number of Most Recently Used
(MRU) addresses in a Memory Address Buffer
(MAB) and to omit redundant tag and way
accesses when there is a MAB hit.
The MAB is accessed in parallel with the adder
used for address generation. The technique does
not increase the delay of the circuit. Furthermore,
this approach does not require modifying the
cache architecture. This is considered an
important advantage in industry because it makes
it possible to use the processor core with
previously designed caches or IPs provided by
other vendors.
The base address and the displacement for load
and store operations usually take a small number
of distinct values. Therefore, we can improve the
hit rate of the MAB by keeping only a small
number of most recently used tags. Assume the
bit width of tag memory, the number of sets in
the cache, and the size of cache lines are 18, 512,
and 32 bytes, respectively. The width of the
setindex and offset fields will be 9 and 5 bits,
respectively. Since most (according to our
experiments, more than 99% of) displacement
values are less than 214
, we can easily calculate
tag values without address generation. This can
be done by checking the upper 18 bits of the base
address, the sign-extension of the displacement,
and the carry bit of a 14-bit adder which adds the
low 14 bits of the base address and the
displacement. Therefore, the delay of the added
circuit is the sum of the delay of the 14-bit adder
and the delay of accessing the set-index table.
Our experiment shows this delay is smaller than
the delay of the 32-bit adder used to calculate the
address.
Therefore, our technique does not have any delay
penalty. Note that if the displacement value is greater than or equal to 2^14 or less than -2^14, there will be a MAB miss, but the chance of this happening is less than 1%.
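A sketch of this tag shortcut is shown below for the 18/9/5-bit address split given above; the Python framing and the function name are my own, but the arithmetic follows the description (upper 18 bits of the base address, the sign extension of the displacement, and the carry out of a 14-bit add).

```python
def fast_tag(base, disp):
    """Compute the 18-bit tag of (base + disp) without a full 32-bit add,
    for the 18-bit tag | 9-bit set-index | 5-bit offset split above.
    Applicable only when -2**14 <= disp < 2**14; otherwise the MAB signals
    a miss and the normal address-generation path is used."""
    if not -(1 << 14) <= disp < (1 << 14):
        return None                                   # MAB miss path
    low_sum = (base & 0x3FFF) + (disp & 0x3FFF)       # 14-bit add of the low bits
    carry = low_sum >> 14                             # carry out of the 14-bit adder
    sign_ext = 0x3FFFF if disp < 0 else 0             # disp sign-extended into tag bits
    return ((base >> 14) + sign_ext + carry) & 0x3FFFF

base = 0x12345678
for disp in (-4, 100, 1 << 15):
    assert disp >= 1 << 14 or fast_tag(base, disp) == ((base + disp) >> 14) & 0x3FFFF
```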
To eliminate redundant tag and way accesses for inter-cache-line flows, we can use a MAB. Unlike
the MAB used for D-cache, the inputs of the
MAB used for I-cache can be one of the
following three types: 1) an address stored in a
link register, 2) a base address (i.e. the current
program counter address) and a displacement
value (i.e., a branch offset), and 3) the current
program counter address and its stride. In the
case of inter-cache-line sequential flow, the current program counter address and the stride of the program counter are chosen as the inputs of the MAB. The stride is treated as the displacement value. If the current operation is a "branch (or jump) to the link target", the address in the link register is selected as the input of the MAB, as shown in the figure below. Otherwise, the base address and the displacement are used, as done for the data cache. [7]
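The input selection for the I-cache MAB amounts to a small multiplexer; the sketch below uses illustrative flow-type names for the three cases described above.

```python
def select_mab_inputs(flow, pc, stride, branch_offset, link_register):
    """Choose the (base, displacement) pair fed to the I-cache MAB.

    flow is one of "sequential", "link", or "branch" (illustrative names):
      sequential inter-cache-line flow -> current PC and its stride,
      branch/jump to the link target   -> address held in the link register,
      other branches/jumps             -> current PC and the branch offset.
    """
    if flow == "sequential":
        return pc, stride
    if flow == "link":
        return link_register, 0
    return pc, branch_offset
```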
Another approach is a new cache architecture called the location cache; the figure below illustrates its structure.
The location cache is a small virtually-indexed
direct-mapped cache. It caches the location information (the way number, within a set, that a memory reference falls into). This cache works in
parallel with the TLB and the L1 cache. On an L1
cache miss, the physical address translated by the
TLB and the way information of the reference are
both presented to the L2 cache. The L2 cache is
then accessed as a direct-mapped cache. If there is a miss in the location cache, the L2 cache is accessed as a conventional set-associative cache. As opposed to way-prediction
information, the cached location is not a
prediction. Thus when there is a hit, both time
and power will be saved. Even if there is a miss,
we do not see any extra delay penalty as seen in
way- prediction caches. Caching the position,
unlike caching the data itself, will not cause
coherence problems in multi-processor systems.
Although the snooping mechanism may modify
the data stored in the L2 cache, the location will
not change. Also, even if a cache line is replaced
in the L2 cache, the way information stored in the
location cache will not generate a fault. One interesting issue arises here: for which references should locations be cached? The location cache should catch the references that turn out to be L1 misses. A recency-based strategy is not suitable because the recent accesses to the L2
caches are very likely to be cached in the L1
caches. The equation below defines the optimal
coverage of the location cache.
Opt. coverage = L2 Coverage - L1 Coverage
As the indexing rules of L1 and L2 caches are
different, this optimal coverage is not reachable.
Fortunately, the memory locations are usually
referenced in sequences or strides. Whenever a
reference to the L2 cache is generated, we
calculate the location of the next cache line and
feed it into the location cache. The proposed
cache system works in the following way. The
location cache is accessed in parallel with the L1
caches. If the L1 cache sees a hit, then the result from the location cache is discarded. If there is a miss in the L1 cache and a hit in the location cache, the L2 cache is accessed as a direct-mapped cache. If both the L1 cache and the location cache see a miss, then the L2 cache is accessed as a traditional set-associative L2 cache. The tags of the L2 cache are duplicated. We call the duplicated tag arrays of the L2 cache location tag arrays.
When the L2 cache is accessed, the location tag
arrays are accessed to generate the location
information for the next memory reference. The
generated location information is then sent to and
stored in the location cache.
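Putting these steps together, the access flow can be sketched as below; the cache objects and their methods are assumed placeholders standing in for the L1 cache, the location cache, and the two L2 access modes, not a real simulator API.

```python
def memory_access(addr, l1, loc_cache, l2):
    """Sketch of the location-cache access flow (objects and methods are
    assumed placeholders).

    The L1 cache and the location cache are probed in parallel; the location
    cache result is only consumed on an L1 miss."""
    l1_hit = l1.lookup(addr)
    way = loc_cache.lookup(addr)          # accessed in parallel with the L1
    if l1_hit:
        return "L1 hit"                   # location-cache result discarded
    if way is not None:
        data = l2.read_direct(addr, way)  # L2 accessed as a direct-mapped cache
    else:
        data = l2.read_set_assoc(addr)    # conventional set-associative access
    # The duplicated (location) tag arrays are probed for the *next* line,
    # and the resulting way number is pushed into the location cache.
    loc_cache.insert(addr + l2.line_size, l2.locate_way(addr + l2.line_size))
    return data
```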
The L1 cache is a 16 KB 4-way set-associative cache with a cache line size of 64 bytes, implemented in a 0.13 μm technology. The
results were produced using the CACTI3.2
simulator. We chose the access delay of a 16KB
direct-mapped cache as the baseline, which is the
best-case delay when a way-prediction
mechanism is implemented in the L1 cache. We
normalized the baseline delay to 1. It is observed that a location cache with up to 1024 entries has a shorter access latency than the L1 cache. Though
the organization of the location cache is similar
to that of a direct-mapped cache, there is a small
change in the indexing rule. The block offset is 7
bit as the cache line size for the simulated L2
cache is 128 bytes. Thus the width of the tag is
smaller for the location cache, compared with a
regular cache.
Compared to a regular cache design, the
modification is minor. Note that we need to
double the tags (or the number of ports to the tag)
because when the original tags are compared to
validate the accesses, a spare set of tags is compared to generate the future location information. This idea is similar to the phased
cache. The difference is that we overlap the tag
comparison for future references with existing
cache reference and use the location cache to
store such location information. The simulated
cache geometry parameters were optimized for
the set-associative cache. The simulation results
show that the access latency for a direct-mapped
hit is 40% faster than a set-associative hit.
Although the extra hardware employed by the
location cache design does not introduce extra
delay on the memory reference critical path, it
does introduce extra power consumption. The
extra power consumption comes from the small
location cache and the duplicated tag arrays. The power consumption for the tag access of a direct-mapped hit is normalized to one. Compared with the L2 cache power consumption, the location cache consumes only a small amount of power. However, as the location cache is triggered much more often than the L2 cache, its power consumption cannot be ignored. The total chip area of the
proposed location cache system (with duplicated
tag and a location cache of 1024 entries) is only
1.39% larger than that of the original cache
system. [8]
The r-a cache is formed by using the tag array of
a set-associative cache with the data array of a
direct-mapped cache, as shown in Figure 1.
For an n-way r-a cache, there is a single data
bank, and n tag banks. The tag array is accessed
using the conventional set-associative index,
probing all the n-ways of the set in parallel, just
as in a normal set-associative cache. The data
array index uses the conventional set-associative
index concatenated with a way number to locate
a block in the set. The way number is log2(n) bits
wide. For the first probe, it may come from either
the conventional set-associative tag field’s lower-
order bits (for the direct-mapped blocks), or the
way-prediction mechanism (for the displaced
blocks). If there is a second probe (due to a
misprediction), then the matching way number is
provided by the tag array. The r-a cache
simultaneously accesses the tag and data arrays
for the first probe, at either the direct-mapped
location or a set-associative position provided by
the way-prediction mechanism. If the first probe,
called probe0, hits, then the access is complete
and the data is returned to the processor. If
probe0 fails to locate the block due to a
misprediction (i.e., either the block is in a set-
associative position when probe0 assumed direct-
mapped access or the block is in a set-associative
position different than the one supplied by way-
prediction), probe0 obtains the correct way-
number from the tag array if the block is in the
cache, and a second probe, called probe1, is done
using the correct way-number.
Probe1 probes only the data array, and not the tag
array. If the block is not in the cache, probe0
signals an overall miss and probe1 is not
necessary. Thus there are three possible paths through the cache for a given address: (1) probe0 is predicted to be a direct-mapped access, (2)
probe0 is predicted to be a set-associative access
and the prediction mechanism provides the
predicted way-number, and (3) probe0 is
mispredicted but obtains the correct way-number
from the tag array, and the data array is probed
using the correct way-number in probe1.
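The probe0/probe1 control flow reads naturally as a short routine; in the sketch below the predictor and array objects are assumed placeholders, and the 32-byte line size is an illustrative assumption.

```python
def ra_cache_access(addr, tag_array, data_array, predictor, set_bits, way_bits):
    """Sketch of a reactive-associative (r-a) cache access; objects are
    assumed placeholders. probe0 reads either the direct-mapped position or
    a predicted way; probe1 (data array only) is issued only on a mispredict."""
    set_idx = (addr >> 5) & ((1 << set_bits) - 1)         # 32-byte lines assumed
    tag = addr >> (5 + set_bits)
    # probe0 way number: way prediction, or low-order tag bits for
    # direct-mapped (non-displaced) blocks.
    way0 = predictor.predict(addr)
    if way0 is None:
        way0 = tag & ((1 << way_bits) - 1)
    hit_way = tag_array.match(set_idx, tag)               # all n tag banks probed
    data = data_array.read(set_idx, way0)                 # probed in parallel
    if hit_way == way0:
        return data                                       # probe0 hit, access done
    if hit_way is None:
        return None                                       # overall miss
    return data_array.read(set_idx, hit_way)              # probe1: data array only
```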
On an overall miss, the block is placed in the
direct-mapped position if it is non-conflicting,
and a set-associative position (LRU, random,
etc.) otherwise. Way Prediction: The r-a cache
employs hardware way-prediction to obtain the
way-number for the blocks that are displaced to
set-associative positions before address
computation is complete. The strict timing
constraint of performing the prediction in parallel
with effective address computation requires that
the prediction mechanism use information that is
available in the pipeline earlier than the address
compute stage. The equivalent of way-prediction
for i-caches is often combined with branch
prediction but because D-caches do not interact
with branch prediction, those techniques cannot
be used directly. An alternative to prediction is to
obtain the correct way-number of the displaced
block using the address, which delays initiating
cache access to the displaced block, as is the case
for statically probed schemes such as column-
associative and group-associative caches. We
examine two handles that can be used to perform way prediction: the instruction PC and an approximate data address formed by XORing the register value with the instruction offset (proposed and used in prior work), which may be faster than performing a full add. These two handles represent the two extremes of the trade-off between prediction accuracy and early availability in the pipeline. The PC is available much earlier than the XOR approximation, but the XOR approximation is more accurate because it is hard for the PC to distinguish among different data addresses touched by the same instruction. Other handles such as instruction fields (e.g., operand register numbers) do not have significantly more information content from a prediction standpoint, and the PSA paper recommends the XOR scheme for its high accuracy. In an out-of-order processor pipeline (figure above), the instruction PC of a memory operation is available much earlier than the source register. Therefore, way prediction can be done in parallel with the pipeline front-end processing of memory instructions, so that the predicted way number and the probe0 way-number mux select input are ready well before the data address is computed. The XOR scheme, on the other hand, needs to squeeze in an XOR operation on a value often obtained late from a register-forwarding path, followed by a prediction-table lookup, to produce the predicted way number and the probe0 way-number mux select, all within the time the pipeline computes the real address using a full add. Note that the prediction table must have more entries, or be more associative, than the cache itself to avoid conflicts among the XORed approximate data addresses, and therefore will probably have a significant access time, exacerbating the timing problem.

III. WAY-TAGGED CACHE

A way-tagged cache that exploits the way information of the L2 cache to improve energy efficiency is introduced here. In a conventional set-associative cache system, when the L1 data cache loads/writes data from/into the L2 cache, all ways in the L2 cache are activated simultaneously for performance considerations, at the cost of energy overhead.
The above figure illustrates the architecture of the
two-level cache. Only the L1 data cache and L2
unified cache are shown as the L1 instruction
cache only reads from the L2 cache. Under the
write-through policy, the L2 cache always
maintains the most recent copy of the data. Thus, whenever data is updated in the L1 cache, the L2 cache is updated with the same data as well.
This results in an increase in the write accesses to
the L2 cache and consequently more energy
consumption. The locations (i.e., way tags) of L1
data copies in the L2 cache will not change until
the data are evicted from the L2 cache. The
proposed way-tagged cache exploits this fact to
reduce the number of ways accessed during L2
cache accesses. When the L1 data cache loads a
data from the L2 cache, the way tag of the data in
the L2 cache is also sent to the L1 cache and
stored in a new set of way-tag arrays. These way
tags provide the key information for the
subsequent write accesses to the L2 cache.
In general, both write and read accesses in the L1
cache may need to access the L2 cache. These
accesses lead to different operations in the
proposed way-tagged cache, as summarized in
Table I.
Under the write-through policy, all write
operations of the L1 cache need to access the L2
cache. In the case of a write hit in the L1 cache,
only one way in the L2 cache will be activated
because the way tag information of the L2 cache
is available, i.e., from the way-tag arrays we can
obtain the L2 way of the accessed data. While for
a write miss in the L1 cache, the requested data is
not stored in the L1 cache. As a result, its
corresponding L2 way information is not
available in the way-tag arrays. Therefore, all
ways in the L2 cache need to be activated
simultaneously. Since write hit/miss is not known
a priori, the way-tag arrays need to be accessed
simultaneously with all L1 write operations in
order to avoid performance degradation. The
way-tag arrays are very small and the involved
energy overhead.
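A sketch of this write path is given below; the array sizes and the interface are assumed for illustration, and only the decision of which L2 ways to enable is modeled.

```python
class WayTaggedL1:
    """Sketch of the L1 side of the way-tagged cache (assumed 2-way L1 and
    8-way L2). way_tag[set][way] mirrors, for each L1 line, the L2 way that
    holds its copy; it is filled when data (and its way tag) arrive from the
    L2 cache on an L1 read miss."""

    def __init__(self, l1_sets=128, l1_ways=2, l2_ways=8):
        self.l2_ways = l2_ways
        self.tags = [[None] * l1_ways for _ in range(l1_sets)]
        self.way_tag = [[None] * l1_ways for _ in range(l1_sets)]

    def fill(self, set_idx, way, tag, l2_way):
        """Called on an L1 read miss: store the line's tag and its L2 way."""
        self.tags[set_idx][way] = tag
        self.way_tag[set_idx][way] = l2_way

    def l2_write_enables(self, set_idx, tag):
        """Which L2 ways must be enabled for this write-through access:
        a single way on an L1 write hit, all ways on an L1 write miss."""
        for way, t in enumerate(self.tags[set_idx]):
            if t == tag:
                return [self.way_tag[set_idx][way]]      # write hit: one L2 way
        return list(range(self.l2_ways))                 # write miss: all L2 ways
```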
The above figure shows the system diagram of the proposed way-tagged cache. We introduce
several new components: way-tag arrays, way-tag
buffer, way decoder, and way register, all shown
in the dotted line. The way tags of each cache
line in the L2 cache are maintained in the way-
tag arrays, located with the L1 data cache. Note
that write buffers are commonly employed in
write-through caches (and even in many write-
back caches) to improve the performance. With a
write buffer, the data to be written into the L1
cache is also sent to the write buffer. The
operations stored in the write buffer are then sent
to the L2 cache in sequence. This avoids write
stalls when the processor waits for write
operations to be completed in the L2 cache. In the
proposed technique, we also need to send the way
tags stored in the way-tag arrays to the L2 cache
along with the operations in the write buffer.
Thus, a small way-tag buffer is introduced to
buffer the way tags read from the way-tag arrays.
A way decoder is employed to decode way tags
and generate the enable signals for the L2 cache,
which activate only the desired ways in the L2
cache. Each way in the L2 cache is encoded into a way tag. A way register stores the way tags and provides this information to the way-tag arrays. For L1 read operations, neither read hits nor misses need to access the way-tag arrays. This is because read hits do not need to access the L2 cache, while for read misses, the corresponding way-tag information is not available in the way-tag arrays. As a result, all ways in the L2 cache are activated simultaneously under read misses.
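The way decoder itself is essentially a one-hot decoder gated by a hit/miss indication; a minimal sketch (with an assumed 8-way L2) follows.

```python
def way_decoder(way_tag, l2_ways=8, tag_known=True):
    """Generate the per-way enable signals for the L2 cache.

    On a write hit the buffered way tag is decoded one-hot, so only the
    matching way is activated; otherwise (write miss or read miss) every
    way is enabled, as in a conventional set-associative access."""
    if not tag_known:
        return [1] * l2_ways
    return [1 if w == way_tag else 0 for w in range(l2_ways)]

print(way_decoder(3))                   # [0, 0, 0, 1, 0, 0, 0, 0]
print(way_decoder(3, tag_known=False))  # all ways enabled
```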
The energy consumption per read and write for the conventional set-associative L2 cache and the proposed L2 cache is shown below. This cache configuration, used in the Pentium 4, will be used as the baseline system for comparison with the proposed technique under different cache configurations.
IV. CONCLUSION
This paper presents a new energy-efficient cache
technique for high-performance microprocessors
employing the write-through policy. The
proposed technique attaches a tag to each way in
the L2 cache. This way tag is sent to the way-tag
arrays in the L1 cache when the data is loaded
from the L2 cache to the L1 cache. Utilizing the way tags stored in the way-tag arrays, the L2 cache can be accessed in an equivalent direct-mapping manner during subsequent write hits, thereby reducing cache energy consumption. Simulation results demonstrate a significant reduction in cache energy consumption with minimal area overhead and no performance degradation.
Furthermore, the idea of way tagging can be
applied to many existing low-power cache
techniques such as the phased access cache to
further reduce cache energy consumption. Future
work is being directed towards extending this
technique to other levels of cache hierarchy and
reducing the energy consumption of other cache
operations.
REFERENCES
[1] J. Dai and L. Wang, "An energy-efficient L2 cache architecture using way tag information under write-through policy," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 1, Jan. 2013.
[2] C. Su and A. Despain, "Cache design tradeoffs for power and performance optimization: A case study," in Proc. Int. Symp. Low Power Electron. Design, 1997, pp. 63-68.
[3] K. Ghose and M. B. Kamble, "Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation," in Proc. Int. Symp. Low Power Electron. Design, 1999, pp. 70-75.
[4] C. Zhang, F. Vahid, and W. Najjar, "A highly-configurable cache architecture for embedded systems," in Proc. Int. Symp. Comput. Arch., 2003, pp. 136-146.
[5] K. Inoue, T. Ishihara, and K. Murakami, "Way-predicting set-associative cache for high performance and low energy consumption," in Proc. Int. Symp. Low Power Electron. Design, 1999, pp. 273-275.
[6] A. Ma, M. Zhang, and K. Asanović, "Way memoization to reduce fetch energy in instruction caches," in Proc. ISCA Workshop Complexity-Effective Design, 2001, pp. 1-9.
[7] T. Ishihara and F. Fallah, "A way memorization technique for reducing power consumption of caches in application specific integrated processors," in Proc. Design Autom. Test Eur. Conf., 2005, pp. 358-363.
[8] R. Min, W. Jone, and Y. Hu, "Location cache: A low-power L2 cache system," in Proc. Int. Symp. Low Power Electron. Design, 2004, pp. 120-125.
[9] T. N. Vijaykumar, "Reactive-associative caches," in Proc. Int. Conf. Parallel Arch. Compiler Tech., 2011, p. 4961.
[10] V. V. Nair, "Way-tagged L2 cache architecture in conjunction with energy efficient datum storage," ECE Department, Anna University Chennai, Sri Eshwar College of Engineering, Coimbatore, India.
4th
National Conference on Emerging Trends in Engineering Technologies, ETET-2015
20th
& 21st
February 2015
Jyothy Institute of Technology Department of ECE P a g e | 196
4th
National Conference on Emerging Trends in Engineering Technologies, ETET-2015
20th
& 21st
February 2015
Jyothy Institute of Technology Department of ECE P a g e | 197

More Related Content

What's hot

Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Dhiraj Chaudhary
 
POWER GATING STRUCTURE FOR REVERSIBLE PROGRAMMABLE LOGIC ARRAY
POWER GATING STRUCTURE FOR REVERSIBLE PROGRAMMABLE LOGIC ARRAYPOWER GATING STRUCTURE FOR REVERSIBLE PROGRAMMABLE LOGIC ARRAY
POWER GATING STRUCTURE FOR REVERSIBLE PROGRAMMABLE LOGIC ARRAYecij
 
Investigations on Implementation of Ternary Content Addressable Memory Archit...
Investigations on Implementation of Ternary Content Addressable Memory Archit...Investigations on Implementation of Ternary Content Addressable Memory Archit...
Investigations on Implementation of Ternary Content Addressable Memory Archit...IRJET Journal
 
Memory consistency models
Memory consistency modelsMemory consistency models
Memory consistency modelspalani kumar
 
MPSoC Platform Design and Simulation for Power %0A Performance Estimation
MPSoC Platform Design and  Simulation for Power %0A Performance EstimationMPSoC Platform Design and  Simulation for Power %0A Performance Estimation
MPSoC Platform Design and Simulation for Power %0A Performance EstimationZhengjie Lu
 
Iaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd Iaetsd
 
IRJET- Flexible DSP Accelerator Architecture using Carry Lookahead Tree
IRJET- Flexible DSP Accelerator Architecture using Carry Lookahead TreeIRJET- Flexible DSP Accelerator Architecture using Carry Lookahead Tree
IRJET- Flexible DSP Accelerator Architecture using Carry Lookahead TreeIRJET Journal
 
FPGA IMPLEMENTATION OF PRIORITYARBITER BASED ROUTER DESIGN FOR NOC SYSTEMS
FPGA IMPLEMENTATION OF PRIORITYARBITER BASED ROUTER DESIGN FOR NOC SYSTEMSFPGA IMPLEMENTATION OF PRIORITYARBITER BASED ROUTER DESIGN FOR NOC SYSTEMS
FPGA IMPLEMENTATION OF PRIORITYARBITER BASED ROUTER DESIGN FOR NOC SYSTEMSIAEME Publication
 
System designing and modelling using fpga
System designing and modelling using fpgaSystem designing and modelling using fpga
System designing and modelling using fpgaIAEME Publication
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architectureinside-BigData.com
 
A Simplied Bit-Line Technique for Memory Optimization
A Simplied Bit-Line Technique for Memory OptimizationA Simplied Bit-Line Technique for Memory Optimization
A Simplied Bit-Line Technique for Memory Optimizationijsrd.com
 

What's hot (16)

Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
Coarse Grained Hybrid Reconfigurable Architecture with NoC Router for Variabl...
 
shashank_hpca1995_00386533
shashank_hpca1995_00386533shashank_hpca1995_00386533
shashank_hpca1995_00386533
 
POWER GATING STRUCTURE FOR REVERSIBLE PROGRAMMABLE LOGIC ARRAY
POWER GATING STRUCTURE FOR REVERSIBLE PROGRAMMABLE LOGIC ARRAYPOWER GATING STRUCTURE FOR REVERSIBLE PROGRAMMABLE LOGIC ARRAY
POWER GATING STRUCTURE FOR REVERSIBLE PROGRAMMABLE LOGIC ARRAY
 
Investigations on Implementation of Ternary Content Addressable Memory Archit...
Investigations on Implementation of Ternary Content Addressable Memory Archit...Investigations on Implementation of Ternary Content Addressable Memory Archit...
Investigations on Implementation of Ternary Content Addressable Memory Archit...
 
IEEExeonmem
IEEExeonmemIEEExeonmem
IEEExeonmem
 
Hbdfpga fpl07
Hbdfpga fpl07Hbdfpga fpl07
Hbdfpga fpl07
 
Memory consistency models
Memory consistency modelsMemory consistency models
Memory consistency models
 
Implementation of MAC using Modified Booth Algorithm
Implementation of MAC using Modified Booth AlgorithmImplementation of MAC using Modified Booth Algorithm
Implementation of MAC using Modified Booth Algorithm
 
V3I8-0460
V3I8-0460V3I8-0460
V3I8-0460
 
MPSoC Platform Design and Simulation for Power %0A Performance Estimation
MPSoC Platform Design and  Simulation for Power %0A Performance EstimationMPSoC Platform Design and  Simulation for Power %0A Performance Estimation
MPSoC Platform Design and Simulation for Power %0A Performance Estimation
 
Iaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg asIaetsd multioperand redundant adders on fpg as
Iaetsd multioperand redundant adders on fpg as
 
IRJET- Flexible DSP Accelerator Architecture using Carry Lookahead Tree
IRJET- Flexible DSP Accelerator Architecture using Carry Lookahead TreeIRJET- Flexible DSP Accelerator Architecture using Carry Lookahead Tree
IRJET- Flexible DSP Accelerator Architecture using Carry Lookahead Tree
 
FPGA IMPLEMENTATION OF PRIORITYARBITER BASED ROUTER DESIGN FOR NOC SYSTEMS
FPGA IMPLEMENTATION OF PRIORITYARBITER BASED ROUTER DESIGN FOR NOC SYSTEMSFPGA IMPLEMENTATION OF PRIORITYARBITER BASED ROUTER DESIGN FOR NOC SYSTEMS
FPGA IMPLEMENTATION OF PRIORITYARBITER BASED ROUTER DESIGN FOR NOC SYSTEMS
 
System designing and modelling using fpga
System designing and modelling using fpgaSystem designing and modelling using fpga
System designing and modelling using fpga
 
Assisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated ArchitectureAssisting User’s Transition to Titan’s Accelerated Architecture
Assisting User’s Transition to Titan’s Accelerated Architecture
 
A Simplied Bit-Line Technique for Memory Optimization
A Simplied Bit-Line Technique for Memory OptimizationA Simplied Bit-Line Technique for Memory Optimization
A Simplied Bit-Line Technique for Memory Optimization
 

Viewers also liked (14)

Cecelski, The Fire of Freedom
Cecelski, The Fire of FreedomCecelski, The Fire of Freedom
Cecelski, The Fire of Freedom
 
MSI 2012 Graphics Cards presentation v1.01eu
MSI 2012 Graphics Cards presentation v1.01euMSI 2012 Graphics Cards presentation v1.01eu
MSI 2012 Graphics Cards presentation v1.01eu
 
K. Kathleen O'Neill LoR RA 2012
K. Kathleen O'Neill LoR RA 2012K. Kathleen O'Neill LoR RA 2012
K. Kathleen O'Neill LoR RA 2012
 
Mol·lusc
Mol·luscMol·lusc
Mol·lusc
 
Sasso nello stagno 2
Sasso nello stagno 2Sasso nello stagno 2
Sasso nello stagno 2
 
Star persona - Anissia and Grace
Star persona - Anissia and Grace Star persona - Anissia and Grace
Star persona - Anissia and Grace
 
Fire Drill Certificate
Fire Drill CertificateFire Drill Certificate
Fire Drill Certificate
 
DPS training
DPS trainingDPS training
DPS training
 
Kelly, Becoming Ecological
Kelly, Becoming EcologicalKelly, Becoming Ecological
Kelly, Becoming Ecological
 
Turk Hava Yollari istanbul Paris Ucak Bileti Fiyatlari
Turk Hava Yollari istanbul Paris Ucak Bileti FiyatlariTurk Hava Yollari istanbul Paris Ucak Bileti Fiyatlari
Turk Hava Yollari istanbul Paris Ucak Bileti Fiyatlari
 
ήπειρος
ήπειροςήπειρος
ήπειρος
 
Iso 14001
Iso 14001Iso 14001
Iso 14001
 
Modern Chinese Calligraphy
Modern Chinese CalligraphyModern Chinese Calligraphy
Modern Chinese Calligraphy
 
Dec 2014 ca-cpt question paper mastermind institute
Dec  2014 ca-cpt question paper mastermind instituteDec  2014 ca-cpt question paper mastermind institute
Dec 2014 ca-cpt question paper mastermind institute
 

Similar to Different Approaches in Energy Efficient Cache Memory

AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...Vijay Prime
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...ijesajournal
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...ijesajournal
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...eSAT Publishing House
 
IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...
IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...
IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...IRJET Journal
 
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...Ilango Jeyasubramanian
 
Power minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed CachesPower minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed CachesIJTET Journal
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORVLSICS Design
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORVLSICS Design
 
Design and Analysis of A 32-bit Pipelined MIPS Risc Processor
Design and Analysis of A 32-bit Pipelined MIPS Risc ProcessorDesign and Analysis of A 32-bit Pipelined MIPS Risc Processor
Design and Analysis of A 32-bit Pipelined MIPS Risc ProcessorVLSICS Design
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingeSAT Journals
 
Robust Fault Tolerance in Content Addressable Memory Interface
Robust Fault Tolerance in Content Addressable Memory InterfaceRobust Fault Tolerance in Content Addressable Memory Interface
Robust Fault Tolerance in Content Addressable Memory InterfaceIOSRJVSP
 
Data cache design itanium 2
Data cache design itanium 2Data cache design itanium 2
Data cache design itanium 2Léia de Sousa
 
2017 18 ieee vlsi titles,IEEE 2017-18 BULK NS2 PROJECTS TITLES,IEEE 2017-18...
2017 18 ieee vlsi titles,IEEE 2017-18  BULK  NS2 PROJECTS TITLES,IEEE 2017-18...2017 18 ieee vlsi titles,IEEE 2017-18  BULK  NS2 PROJECTS TITLES,IEEE 2017-18...
2017 18 ieee vlsi titles,IEEE 2017-18 BULK NS2 PROJECTS TITLES,IEEE 2017-18...Nexgen Technology
 
IRJET- Reduction of Dark Silicon through Efficient Power Reduction Designing ...
IRJET- Reduction of Dark Silicon through Efficient Power Reduction Designing ...IRJET- Reduction of Dark Silicon through Efficient Power Reduction Designing ...
IRJET- Reduction of Dark Silicon through Efficient Power Reduction Designing ...IRJET Journal
 
Postponed Optimized Report Recovery under Lt Based Cloud Memory
Postponed Optimized Report Recovery under Lt Based Cloud MemoryPostponed Optimized Report Recovery under Lt Based Cloud Memory
Postponed Optimized Report Recovery under Lt Based Cloud MemoryIJARIIT
 

Similar to Different Approaches in Energy Efficient Cache Memory (20)

AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER  WR...
AN ENERGY EFFICIENT L2 CACHE ARCHITECTURE USING WAY TAG INFORMATION UNDER WR...
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
 
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...Dominant block guided optimal cache size estimation to maximize ipc of embedd...
Dominant block guided optimal cache size estimation to maximize ipc of embedd...
 
Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...Architecture and implementation issues of multi core processors and caching –...
Architecture and implementation issues of multi core processors and caching –...
 
IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...
IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...
IRJET-A Review on Trends in Multicore Processor Based on Cache and Power Diss...
 
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
DESIGNED DYNAMIC SEGMENTED LRU AND MODIFIED MOESI PROTOCOL FOR RING CONNECTED...
 
Power minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed CachesPower minimization of systems using Performance Enhancement Guaranteed Caches
Power minimization of systems using Performance Enhancement Guaranteed Caches
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
 
Design and Analysis of A 32-bit Pipelined MIPS Risc Processor
Design and Analysis of A 32-bit Pipelined MIPS Risc ProcessorDesign and Analysis of A 32-bit Pipelined MIPS Risc Processor
Design and Analysis of A 32-bit Pipelined MIPS Risc Processor
 
An octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passingAn octa core processor with shared memory and message-passing
An octa core processor with shared memory and message-passing
 
Cache memory
Cache memoryCache memory
Cache memory
 
Bg36347351
Bg36347351Bg36347351
Bg36347351
 
Robust Fault Tolerance in Content Addressable Memory Interface
Robust Fault Tolerance in Content Addressable Memory InterfaceRobust Fault Tolerance in Content Addressable Memory Interface
Robust Fault Tolerance in Content Addressable Memory Interface
 
Data cache design itanium 2
Data cache design itanium 2Data cache design itanium 2
Data cache design itanium 2
 
2017 18 ieee vlsi titles,IEEE 2017-18 BULK NS2 PROJECTS TITLES,IEEE 2017-18...
2017 18 ieee vlsi titles,IEEE 2017-18  BULK  NS2 PROJECTS TITLES,IEEE 2017-18...2017 18 ieee vlsi titles,IEEE 2017-18  BULK  NS2 PROJECTS TITLES,IEEE 2017-18...
2017 18 ieee vlsi titles,IEEE 2017-18 BULK NS2 PROJECTS TITLES,IEEE 2017-18...
 
IRJET- Reduction of Dark Silicon through Efficient Power Reduction Designing ...
IRJET- Reduction of Dark Silicon through Efficient Power Reduction Designing ...IRJET- Reduction of Dark Silicon through Efficient Power Reduction Designing ...
IRJET- Reduction of Dark Silicon through Efficient Power Reduction Designing ...
 
Postponed Optimized Report Recovery under Lt Based Cloud Memory
Postponed Optimized Report Recovery under Lt Based Cloud MemoryPostponed Optimized Report Recovery under Lt Based Cloud Memory
Postponed Optimized Report Recovery under Lt Based Cloud Memory
 
An efficient multi-level cache system for geometrically interconnected many-...
An efficient multi-level cache system for geometrically  interconnected many-...An efficient multi-level cache system for geometrically  interconnected many-...
An efficient multi-level cache system for geometrically interconnected many-...
 
Aqeel
AqeelAqeel
Aqeel
 

Different Approaches in Energy Efficient Cache Memory

  • 1. 4th National Conference on Emerging Trends in Engineering Technologies, ETET-2015 20th & 21st February 2015 Jyothy Institute of Technology Department of ECE P a g e | 184 EC-29 DIFFERENT APPROACHES IN ENERGY EFFICIENT CACHEMEMORY ARCHITECTURE Dhritiman Halder Dept. of ECE, REVA ITM Yealahanka, Bangalore-64 ABSTRACT - Many high-performance microprocessors employ cache write-through policy for performance improvement and at the same time achieving good tolerance to soft errors in on-chip caches. However, write- through policy also incurs large energy overhead due to the increased accesses to caches at the lower level (e.g., L2 caches) during write operations. In this project, new cache architecture referred to as way-tagged cache to improve the energy efficiency of write-through caches is introduced. By maintaining the way tags of L2 cache in the L1 cache during read operations, the proposed technique enables L2 cache to work in an equivalent direct-mapping manner during write hits, which account for the majority of L2 cache accesses. This leads to significant energy reduction without performance degradation. Index Terms-Cache, low power, write-through policy. I.INTRODUCTION MULTI-LEVEL on-chip cache systems have been widely adopted in high-performance microprocessors. To keep data consistence throughout the memory hierarchy, write-through and write-back policies are commonly employed. Under the write-back policy, a modified cache block is copied back to its corresponding lower level cache only when the block is about to be replaced. While under the write-through policy, all copies of a cache block are updated immediately after the cache block is modified at the current cache, even though the block might not be evicted. As a result, the write-through policy maintains identical data copies at all levels of the cache hierarchy throughout most of their life time of execution. This feature is important as CMOS technology is scaled into the nanometer range, where soft errors have emerged as a major reliability issue in on-chip cache systems. It has been reported that single-event multi-bit upsets are getting worse in on-chip memories. Currently, this problem has been addressed at different levels of the design abstraction. At the architecture level, an effective solution is to keep data consistent among different levels of the memory hierarchy to prevent the system from collapse due to soft errors. Benefited from immediate update, cache write-through policy is inherently tolerant to soft errors because the data at all related levels of the cache hierarchy are always kept consistent. Due to this feature, many high-performance microprocessor designs have adopted the write- through policy. While enabling better tolerance to soft errors, the write-through policy also incurs large energy overhead. This is because under the write-through policy, caches at the lower level experience more accesses during write operations. Consider a two-level (i.e., Level-1 and Level-2) cache system for example. If the L1 data cache implements the write-back policy, a write hit in the L1 cache does not need to access the L2 cache. In contrast, if the L1 cache is write- through, then both L1 and L2 caches need to be accessed for every write operation.Obviously, the write-through policy incurs more write accesses in the L2 cache, which in turn increases the energy consumption of the cache system. Power dissipation is now considered as one of the critical issues in cache design. 
shown that on-chip caches can consume about 50% of the total power in high-performance microprocessors.
In this paper, a new cache architecture, referred to as the way-tagged cache, is proposed to improve the energy efficiency of write-through cache systems with minimal area overhead and no performance degradation. Consider a two-level cache hierarchy in which the L1 data cache is write-through and the L2 cache is inclusive for high performance. All of the data residing in the L1 cache have copies in the L2 cache, and the locations of these copies in the L2 cache do not change until they are evicted from the L2 cache. A tag can therefore be attached to each way in the L2 cache and sent to the L1 cache when the data is loaded into the L1 cache. By doing so, the exact locations (i.e., ways) in the L2 cache of all data held in the L1 cache are known. During subsequent accesses, when there is a write hit in the L1 cache (which also initiates a write access to the L2 cache under the write-through policy), the L2 cache can be accessed in an equivalent direct-mapping manner because the way tag of the data copy in the L2 cache is available. As this operation accounts for the majority of L2 cache accesses in most applications, the energy consumption of the L2 cache can be reduced significantly.

II. RELATED WORKS

The basic idea of the horizontal cache partitioning approach is to partition the cache data memory into several segments, each of which can be powered individually. Cache sub-banking is one horizontal cache partitioning technique that partitions the data array of a cache into several banks (called cache sub-banks). Each cache sub-bank can be accessed (powered up) individually, so only the sub-bank where the requested data is located consumes power in each cache access. A basic structure for cache sub-banking is presented in the figure below. Cache sub-banking saves power by eliminating unnecessary accesses; the amount of power saving depends on the number of cache sub-banks, and more sub-banks save more power. One advantage of cache sub-banking over block buffering is that the effective cache hit time of a sub-banked cache can be as fast as that of a conventional performance-driven cache, since the sub-bank selection logic is usually very simple and can easily be hidden in the cache index decoding logic. Because it maintains cache performance, cache sub-banking is attractive to computer architects designing energy-efficient high-performance microprocessors. [2]
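To make the sub-banking idea concrete, the following Python sketch counts how many sub-banks are powered up per data-array access. The bank count, line size, and unit-energy cost are illustrative assumptions, not parameters from the cited design.

```python
# Minimal illustrative model of cache sub-banking (assumptions: 4 sub-banks,
# 32-byte lines, one energy unit per activated sub-bank).

NUM_SUBBANKS = 4
LINE_SIZE = 32                                  # bytes per cache line

def subbank_of(address):
    """Select the sub-bank from the word offset within the cache line."""
    word_offset = (address % LINE_SIZE) // 4    # 4-byte words
    return word_offset % NUM_SUBBANKS

def data_array_energy(sub_banked):
    """Number of sub-banks powered up for one data-array access (energy proxy)."""
    return 1 if sub_banked else NUM_SUBBANKS    # one bank vs. the whole array

if __name__ == "__main__":
    addr = 0x1234
    print("selected sub-bank:       ", subbank_of(addr))
    print("conventional access cost:", data_array_energy(sub_banked=False))
    print("sub-banked access cost:  ", data_array_energy(sub_banked=True))
```

The energy saving scales with the number of sub-banks, since only one of them is activated per access.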
Bit-line segmentation offers a further power saving. The internal organization of each column in the data or tag array is modified as shown in the figure below: every column of bit cells sharing one (or more) pair of bit lines is split into independent segments, and an additional pair of common lines is run across the segments. The bit lines within each segment can be connected to or isolated from these common lines. The metal layer used for clock distribution can implement the common line, since the clock does not need to be routed across the bit-cell array. Before a readout, all segments are connected to the common lines, which are precharged as usual. In the meantime, the address decoder identifies the segment targeted by the row address issued to the array and isolates all but the targeted segment from the common bit line. This reduces the effective capacitive loading (due to the diffusion capacitances of the pass transistors) on the common line; the reduction is somewhat offset by the additional capacitance of the common line that spans a single segment and the diffusion capacitances of the isolating switches. The common line is then sensed. Because of the reduced loading on the common line, the energy discharged during a readout or spent in a write is small, so smaller drivers, precharge transistors, and sense amplifiers can be used. [3]

The figure above depicts the architecture of the base cache. The memory address is split into a line-offset field, an index field, and a tag field; for the base cache, those fields are 5, 6, and 21 bits, respectively, assuming a 32-bit address. Being four-way set-associative, the cache contains four tag arrays and four data arrays. During an access, the cache decodes the address's index field to simultaneously read out the appropriate tag from each of the four tag arrays, while decoding the index field to simultaneously read out the appropriate data from the four data arrays. The cache feeds the decoded lines through two inverters to strengthen their signals, and the read tags and data items pass through sense amplifiers. The cache simultaneously compares the four tags with the address's tag field; if one tag matches, a multiplexer routes the corresponding data to the cache output. [4]
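The field split and the parallel four-way lookup described above can be sketched as follows. The 5/6/21-bit field widths come from the text; the array contents and the software-style sequential loop are illustrative stand-ins for hardware that reads and compares all four ways in parallel.

```python
# Sketch of the base four-way set-associative lookup (field widths from the
# text; cache contents here are invented purely for illustration).

OFFSET_BITS, INDEX_BITS, TAG_BITS = 5, 6, 21
NUM_WAYS = 4
NUM_SETS = 1 << INDEX_BITS

def split_address(addr):
    """Decompose a 32-bit address into tag, set index, and line offset."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    index = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

# tag_arrays[way][set] -> stored tag (or None); data_arrays[way][set] -> line data
tag_arrays = [[None] * NUM_SETS for _ in range(NUM_WAYS)]
data_arrays = [[None] * NUM_SETS for _ in range(NUM_WAYS)]

def lookup(addr):
    """In hardware all four tag and data arrays are read in parallel; the
    comparators then select at most one way for the output multiplexer."""
    tag, index, _ = split_address(addr)
    for way in range(NUM_WAYS):
        if tag_arrays[way][index] == tag:
            return data_arrays[way][index]   # hit: route this way to the output
    return None                              # cache miss
```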
The energy consumption of a set-associative cache tends to be higher than that of a direct-mapped cache, because all the ways in a set are accessed in parallel although at most one way holds the desired data. To address this, the phased cache divides the cache-access process into two phases, as shown below. First, all the tags in the set are examined in parallel, and no data accesses occur during this phase. Next, if there is a hit, a data access is performed for the hit way. The way-predicting cache speculatively chooses one way before starting the normal cache-access process and then accesses the predicted way, as shown below.

Fig-a

If the prediction is correct, the cache access completes successfully. Otherwise, the cache then searches the remaining ways, as shown below:

Fig-b

On a prediction hit, shown in Fig-a, the way-predicting cache consumes only the energy needed to activate the predicted way, and the cache access can be completed in one cycle. On prediction misses (or cache misses), however, the cache-access time of the way-predicting cache increases due to the successive two-phase process shown in Fig-b. Since all the remaining ways are activated in the same manner as in a conventional set-associative cache, the way-predicting cache cannot reduce energy consumption in this scenario. The performance/energy efficiency of the way-predicting cache therefore largely depends on the accuracy of the way prediction. In this approach an MRU algorithm is used: the MRU information for each set, a two-bit flag, is used to speculatively choose one way from the corresponding set. These two-bit flags are stored in a table accessed by the set-index address. Reading the MRU information before starting the cache access might lengthen the cache access time, but this can be hidden by calculating the set-index address at an earlier pipeline stage. In addition, way prediction helps reduce cache access time by eliminating the delay for way selection, so the cache-access time on a prediction hit of the way-predicting cache is assumed to be the same as that of a conventional set-associative cache. [5]
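A minimal sketch of MRU-based way prediction is given below. The two-bit MRU flag per set and the two-phase fallback follow the description above; the 4-way/64-set geometry and the one-energy-unit-per-activated-way accounting are assumptions for illustration only.

```python
# Illustrative sketch of MRU-based way prediction for a 4-way cache.

NUM_WAYS, NUM_SETS = 4, 64
mru_table = [0] * NUM_SETS                          # two-bit MRU flag per set
tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]  # tag arrays

def access(set_index, tag):
    """Return (hit_way, ways_activated) for one cache access."""
    predicted = mru_table[set_index]
    ways_activated = 1                               # activate predicted way only
    if tags[set_index][predicted] == tag:
        hit_way = predicted                          # prediction hit: one cycle
    else:
        ways_activated += NUM_WAYS - 1               # second phase: remaining ways
        hit_way = next((w for w in range(NUM_WAYS)
                        if tags[set_index][w] == tag), None)
    if hit_way is not None:
        mru_table[set_index] = hit_way               # update the MRU flag
    return hit_way, ways_activated
```

On a prediction hit only one way is activated; on a prediction miss the energy approaches that of a conventional set-associative access, matching the behaviour described above.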
Another approach uses a two-phase associative cache: all tags are accessed to determine the correct way in the first phase, and only a single data item from the matching way is accessed in the second phase. Although this approach has been proposed to reduce primary cache energy, it is better suited to secondary cache designs because of the performance penalty of an extra cycle in cache access time. A higher-performance alternative to a phased primary cache is to use CAM (content-addressable memory) to hold tags. CAM tags have been used in a number of low-power processors, including the StrongARM and XScale. Although they add roughly 10% to total cache area, CAMs perform tag checks for all ways and read out only the matching data in one cycle. Moreover, a 32-way associative cache with CAM tags has roughly the same hit energy as a two-way set-associative cache with RAM tags, but a higher hit rate. Even so, a CAM tag lookup still adds considerable energy overhead to the simple RAM fetch of one instruction word. Way prediction can also reduce the cost of tag accesses by using a way-prediction table and accessing only the tag and data from the predicted way. Correct prediction avoids the cost of reading tags and data from incorrect ways, but a misprediction requires an extra cycle to perform tag comparisons for all ways. This scheme has been used in commercial high-performance designs to add associativity to off-chip secondary caches and to on-chip primary instruction caches to reduce cache hit latencies in superscalar processors, and it has been proposed to reduce the access energy in low-power microprocessors. Since way prediction is a speculative technique, it still requires fetching one tag and comparing it against the current PC to check whether the prediction was correct. Though it has not been examined before, way prediction can also be applied to CAM-tagged caches; however, because of its speculative nature, a tag still needs to be read out and compared, and on a misprediction the entire access must be restarted with no work salvaged, so twice the number of words are read out of the cache.

An alternative to way prediction is way memoization. Way memoization stores tag lookup results (links) within the instruction cache in a manner similar to some way-prediction schemes, but it also associates a valid bit with each link. These valid bits indicate, prior to the instruction access, whether the link is correct; this is in contrast to way prediction, where the access needs to be verified afterward. This is the crucial difference between the two schemes, and it allows way memoization to work better in CAM-tagged caches. If the link is valid, the link is simply followed to fetch the next instruction and no tag checks are performed. Otherwise, the cache falls back on a regular tag search to find the location of the next instruction and updates the link for future use. The main complexity of the technique is the need to invalidate all links to a line when that line is evicted; the coherence of all the links is maintained through an invalidation scheme. Way memoization is orthogonal to, and can be used in conjunction with, other cache energy reduction techniques such as sub-banking, block buffering, and the filter cache. Another approach to removing instruction cache tag lookup energy is the L-cache, but it is only applicable to loops and requires compiler support. The way-memoizing instruction cache keeps links within the cache. These links allow instruction fetch to bypass the tag array and read out words directly from the instruction array. Valid bits indicate whether the cache should use the direct access method or fall back to the normal access method, and they are the key to maintaining the coherence of the way-memoizing cache. When a valid link is encountered, it is followed to obtain the cache address of the next instruction, completely avoiding tag checks. When an invalid link is encountered, the cache falls back to a regular tag search to find the target instruction and updates the link; future instruction fetches reuse the valid link. Way memoization can be applied to a conventional cache, a phased cache, or a CAM-tag cache. On a correct way prediction, the way-predicting cache performs one tag lookup and reads one word, whereas the way-memoizing cache does no tag lookup and only reads out one word.
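The sketch below models the link-and-valid-bit behaviour just described. The 2-way geometry, the dictionary used as a link table, and the tag-check counts are assumptions made only to keep the sketch self-contained; in the actual scheme the links are stored inside the instruction cache itself.

```python
# Minimal sketch of way memoization for an instruction cache.

NUM_SETS, NUM_WAYS = 64, 2
tags = [[None] * NUM_WAYS for _ in range(NUM_SETS)]   # tag arrays
links = {}   # pc -> (set, way); presence in the dict models the valid bit

def set_index(pc):
    return (pc >> 2) % NUM_SETS

def fetch(pc):
    """Return (set, way, tag_checks) for the instruction at pc."""
    if pc in links:                       # valid link: no tag check at all
        s, w = links[pc]
        return s, w, 0
    s = set_index(pc)                     # invalid link: regular tag search
    tag = pc >> 8
    for w in range(NUM_WAYS):
        if tags[s][w] == tag:
            links[pc] = (s, w)            # memoize the result for future fetches
            return s, w, NUM_WAYS         # all ways' tags were checked
    return s, None, NUM_WAYS              # miss: refill handled elsewhere

def invalidate_links_to(s, w):
    """On eviction of line (s, w), all links pointing to it must be invalidated."""
    for pc in [p for p, loc in links.items() if loc == (s, w)]:
        del links[pc]
```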
On a way misprediction, the way-predicting cache is as power-hungry as the conventional cache and as slow as the phased cache, so it can be worse than the normal non-predicting caches. The way-memoizing cache, however, merely becomes one of the three normal non-predicting caches in the worst case. The most important difference is that the way-memoization technique can be applied to CAM-tagged caches. [6]

Another way-memoization technique eliminates redundant tag and way accesses to reduce power consumption. The basic idea is to keep a small number of most recently used (MRU) addresses in a memory address buffer (MAB) and to omit redundant tag and way accesses when there is a MAB hit.
The MAB is accessed in parallel with the adder used for address generation, so the technique does not increase the delay of the circuit. Furthermore, this approach does not require modifying the cache architecture, which is an important advantage in industry because it allows the processor core to be used with previously designed caches or IPs provided by other vendors. The base address and the displacement of load and store operations usually take a small number of distinct values, so the hit rate of the MAB can be kept high while storing only a small number of most recently used tags. Assume the tag width, the number of sets, and the cache line size are 18 bits, 512, and 32 bytes, respectively; the set-index and offset fields are then 9 and 5 bits wide. Since most (according to the reported experiments, more than 99% of) displacement values are less than 2^14, tag values can easily be calculated without full address generation. This is done by checking the upper 18 bits of the base address, the sign extension of the displacement, and the carry bit of a 14-bit adder that adds the low 14 bits of the base address and the displacement. The delay of the added circuit is therefore the sum of the delay of the 14-bit adder and the delay of accessing the set-index table, which is smaller than the delay of the 32-bit adder used to calculate the address, so the technique has no delay penalty. Note that if the displacement value is greater than or equal to 2^14 or less than -2^14, there will be a MAB miss, but the chance of this happening is less than 1%. To eliminate redundant tag and way accesses for inter-cache-line flows, a MAB can also be used for the instruction cache. Unlike the MAB used for the D-cache, the input of the MAB used for the I-cache can be one of the following three types: 1) an address stored in a link register, 2) a base address (i.e., the current program counter address) and a displacement value (i.e., a branch offset), or 3) the current program counter address and its stride. In the case of inter-cache-line sequential flow, the current program counter address and the stride of the program counter are chosen as inputs of the MAB, with the stride treated as the displacement value. If the current operation is a branch (or jump) to the link target, the address in the link register is selected as the input of the MAB, as shown in the figure below. Otherwise, the base address and the displacement are used, as for the data cache. [7]
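The tag calculation used for the MAB hit check can be reproduced arithmetically. The sketch below is an illustrative software model of the 14-bit-adder trick, not the cited hardware; the helper name and the self-check are assumptions.

```python
# Tag of (base + displacement) without a full 32-bit add, assuming an 18-bit
# tag, 9-bit set index, and 5-bit offset (32-byte lines, 512 sets).

TAG_BITS, INDEX_BITS, OFFSET_BITS = 18, 9, 5
LOW_BITS = INDEX_BITS + OFFSET_BITS            # the 14 bits below the tag field
LOW_MASK = (1 << LOW_BITS) - 1

def fast_tag(base, disp):
    """Valid only when |disp| < 2**14; otherwise the MAB reports a miss."""
    assert -(1 << LOW_BITS) < disp < (1 << LOW_BITS)
    base_hi = base >> LOW_BITS                 # upper 18 bits of the base address
    disp_hi = disp >> LOW_BITS                 # sign extension: 0 or -1
    carry = ((base & LOW_MASK) + (disp & LOW_MASK)) >> LOW_BITS  # 14-bit adder
    return (base_hi + disp_hi + carry) & ((1 << TAG_BITS) - 1)

if __name__ == "__main__":
    # Cross-check against the full 32-bit address computation.
    for base, disp in [(0x12345678, 100), (0x12345678, -100), (0xFFFFC000, 0x3FFF)]:
        full_tag = ((base + disp) & 0xFFFFFFFF) >> LOW_BITS
        assert fast_tag(base, disp) == full_tag
```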
Another approach is a new cache architecture called the location cache; the figure below illustrates its structure. The location cache is a small, virtually indexed, direct-mapped cache that holds location information, i.e., the way number within a set that a memory reference falls into. It works in parallel with the TLB and the L1 cache. On an L1 cache miss, the physical address translated by the TLB and the way information of the reference are both presented to the L2 cache, which is then accessed as a direct-mapped cache. If there is a miss in the location cache, the L2 cache is accessed as a conventional set-associative cache. As opposed to way-prediction information, the cached location is not a prediction: when there is a hit, both time and power are saved, and even on a miss there is no extra delay penalty as seen in way-prediction caches. Caching the position, unlike caching the data itself, does not cause coherence problems in multiprocessor systems. Although the snooping mechanism may modify the data stored in the L2 cache, the location does not change; and even if a cache line is replaced in the L2 cache, the way information stored in the location cache will not generate a fault. One interesting issue arises here: for which references should locations be cached? The location cache should catch the references that turn out to be L1 misses. A recency-based strategy is not suitable because recent accesses to the L2 cache are very likely to be cached in the L1 cache. The equation below defines the optimal coverage of the location cache:

optimal coverage = L2 coverage - L1 coverage

As the indexing rules of the L1 and L2 caches are different, this optimal coverage is not reachable. Fortunately, memory locations are usually referenced in sequences or strides, so whenever a reference to the L2 cache is generated, the location of the next cache line is calculated and fed into the location cache. The proposed cache system works in the following way. The location cache is accessed in parallel with the L1 cache. If the L1 cache sees a hit, the result from the location cache is discarded. If there is a miss in the L1 cache and a hit in the location cache, the L2 cache is accessed as a direct-mapped cache. If both the L1 cache and the location cache miss, the L2 cache is accessed as a conventional set-associative cache.
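The decision flow just described can be sketched as follows. The tiny dictionary-based caches, the 128-byte line constant, and the returned mode strings are assumptions made only to keep the sketch self-contained; only the flow itself (parallel probe, direct-mapped L2 access on a location hit, next-line prefill of the location cache) follows the text.

```python
# Minimal, self-contained sketch of the location-cache access flow.

L2_LINE = 128                                   # bytes, as in the text

l1_data = {}          # line number -> data          (stands in for the L1 cache)
l2_ways = {}          # line number -> (way, data)   (stands in for the L2 cache)
location_cache = {}   # line number -> way number

def line(addr):
    return addr // L2_LINE

def access(addr):
    loc_way = location_cache.get(line(addr))    # probed in parallel with L1
    if line(addr) in l1_data:                   # L1 hit: location result discarded
        return l1_data[line(addr)], "L1 hit"
    entry = l2_ways.get(line(addr))
    if entry is not None:
        way, data = entry
        mode = "L2 direct-mapped" if loc_way == way else "L2 set-associative"
    else:
        data, mode = None, "L2 miss"
    # The duplicated (location) tag arrays generate the way of the next line,
    # which is fed back into the location cache for future references.
    nxt = l2_ways.get(line(addr) + 1)
    if nxt is not None:
        location_cache[line(addr) + 1] = nxt[0]
    return data, mode
```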
The tags of the L2 cache are duplicated; the duplicated tag arrays of the L2 cache are called location tag arrays. When the L2 cache is accessed, the location tag arrays are accessed to generate the location information for the next memory reference, and the generated location information is then sent to and stored in the location cache. In the reported evaluation, the L1 cache is a 16 KB four-way set-associative cache with a 64-byte cache line, implemented in a 0.13 μm technology, and the results were produced using the CACTI 3.2 simulator. The access delay of a 16 KB direct-mapped cache is chosen as the baseline, which is the best-case delay when a way-prediction mechanism is implemented in the L1 cache, and the baseline delay is normalized to 1. It is observed that a location cache with up to 1024 entries has a shorter access latency than the L1 cache. Though the organization of the location cache is similar to that of a direct-mapped cache, there is a small change in the indexing rule: the block offset is 7 bits, as the cache line size of the simulated L2 cache is 128 bytes, and thus the tag is narrower for the location cache than for a regular cache.
Compared to a regular cache design, the modification is minor. Note that the tags (or the number of ports to the tag) need to be doubled, because while the original tags are compared to validate the current access, a spare set of tags is compared to generate the future location information. This idea is similar to the phased cache; the difference is that the tag comparison for future references is overlapped with the existing cache reference, and the location cache stores the resulting location information. The simulated cache geometry parameters were optimized for the set-associative cache, and the simulation results show that a direct-mapped hit is 40% faster than a set-associative hit. Although the extra hardware employed by the location cache design does not introduce extra delay on the memory-reference critical path, it does introduce extra power consumption, which comes from the small location cache and the duplicated tag arrays. The power consumption for the tag access of a direct-mapped hit is normalized to one; compared with the L2 cache, the location cache consumes a small amount of power. However, as the location cache is triggered much more often than the L2 cache, its power consumption cannot be ignored. The total chip area of the proposed location cache system (with duplicated tags and a location cache of 1024 entries) is only 1.39% larger than that of the original cache system. [8]

The reactive-associative (r-a) cache is formed by using the tag array of a set-associative cache with the data array of a direct-mapped cache, as shown in Figure 1. For an n-way r-a cache, there is a single data bank and n tag banks. The tag array is accessed using the conventional set-associative index, probing all n ways of the set in parallel, just as in a normal set-associative cache. The data array index uses the conventional set-associative index concatenated with a way number to locate a block in the set; the way number is log2(n) bits wide. For the first probe, it may come either from the lower-order bits of the conventional set-associative tag field (for the direct-mapped blocks) or from the way-prediction mechanism (for the displaced blocks). If there is a second probe (due to a misprediction), the matching way number is provided by the tag array. The r-a cache simultaneously accesses the tag and data arrays for the first probe, at either the direct-mapped location or a set-associative position provided by the way-prediction mechanism. If the first probe, called probe0, hits, the access is complete and the data is returned to the processor. If probe0 fails to locate the block due to a misprediction (i.e., the block is in a set-associative position when probe0 assumed a direct-mapped access, or the block is in a set-associative position different from the one supplied by way prediction), probe0 obtains the correct way number from the tag array if the block is in the cache, and a second probe, called probe1, is done using that way number.
Probe1 probes only the data array, not the tag array. If the block is not in the cache, probe0 signals an overall miss and probe1 is not necessary. Thus there are three possible paths through the cache for a given address: (1) probe0 is predicted to be a direct-mapped access; (2) probe0 is predicted to be a set-associative access and the prediction mechanism provides the predicted way number; or (3) probe0 is mispredicted but obtains the correct way number from the tag array, and the data array is probed using the correct way number in probe1. On an overall miss, the block is placed in the direct-mapped position if it is non-conflicting, and in a set-associative position (LRU, random, etc.) otherwise.

Way prediction: the r-a cache employs hardware way prediction to obtain the way number of blocks that are displaced to set-associative positions before address computation is complete. The strict timing constraint of performing the prediction in parallel with effective-address computation requires that the prediction mechanism use information that is available in the pipeline earlier than the address-compute stage. The equivalent of way prediction for I-caches is often combined with branch prediction, but because D-caches do not interact with branch prediction, those techniques cannot be used directly. An alternative to prediction is to obtain the correct way number of the displaced block using the address, which delays initiating the cache access to the displaced block, as is the case for statically probed schemes such as column-associative and group-associative caches. Two handles can be used to perform way prediction: the instruction PC, and an approximate data address formed by XORing the register value with the instruction offset (proposed and used in prior work), which may be faster than performing a full add. These two handles represent the two extremes of the trade-off between prediction accuracy and early availability in the pipeline. The PC is available much earlier than the XOR approximation, but the XOR approximation is more accurate because it is hard for the PC to distinguish among different data addresses touched by the same instruction. Other handles, such as instruction fields (e.g., operand register numbers), do not have significantly more information content from a prediction standpoint, and the PSA paper recommends the XOR scheme for its high accuracy. In an out-of-order processor pipeline (figure above), the instruction PC of a memory operation is available much earlier than the source register. Therefore, way prediction can be done in parallel with the pipeline front-end processing of memory instructions, so that the predicted way number and the probe0 way# mux select input are ready well before the data address is computed. The XOR scheme, on the other hand, needs to squeeze an XOR operation on a value often obtained late from a register-forwarding path, followed by a prediction-table lookup, into the time the pipeline takes to compute the real address using a full add. Note that the prediction table must have more entries or be more associative than the cache itself to avoid conflicts among the XORed approximate data addresses, and therefore will probably have a significant access time, exacerbating the timing problem. [9]
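The three access paths through the r-a cache can be sketched as follows. The 4-way geometry, the table layout, and the use of the tag's low-order bits as the direct-mapped way are illustrative assumptions; only the probe0/probe1 flow follows the description above.

```python
# Sketch of reactive-associative (r-a) cache probing: probe0 at the
# direct-mapped or predicted way, probe1 (data array only) after a misprediction.

NUM_WAYS, NUM_SETS = 4, 128
tag_array = [[None] * NUM_WAYS for _ in range(NUM_SETS)]   # n tag banks
data_array = [[None] * NUM_WAYS for _ in range(NUM_SETS)]  # single data bank

def ra_access(set_idx, tag, predicted_way=None):
    """Return (data, probes_used) for one access."""
    # probe0: the data array is indexed with the set index plus a way number,
    # taken from the way predictor or from the tag's low-order bits.
    way0 = predicted_way if predicted_way is not None else tag % NUM_WAYS
    # The tag array probes all n ways of the set in parallel.
    matching = next((w for w in range(NUM_WAYS)
                     if tag_array[set_idx][w] == tag), None)
    if matching == way0:
        return data_array[set_idx][way0], 1        # probe0 hit
    if matching is None:
        return None, 1                             # overall miss: probe1 not needed
    # Misprediction: probe1 re-reads only the data array at the correct way.
    return data_array[set_idx][matching], 2
```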
III. WAY-TAGGED CACHE

A way-tagged cache that exploits the way information in the L2 cache to improve energy efficiency is now introduced. In a conventional set-associative cache system, when the L1 data cache loads/writes data from/into the L2 cache, all ways in the L2 cache are activated simultaneously for performance reasons, at the cost of energy overhead. The figure above illustrates the architecture of the two-level cache; only the L1 data cache and the unified L2 cache are shown, as the L1 instruction cache only reads from the L2 cache. Under the write-through policy, the L2 cache always maintains the most recent copy of the data: whenever a data item is updated in the L1 cache, the L2 cache is updated with the same data as well. This results in an increase in the write accesses to the L2 cache and consequently more energy consumption. The locations (i.e., way tags) of L1 data copies in the L2 cache do not change until the data are evicted from the L2 cache. The proposed way-tagged cache exploits this fact to reduce the number of ways accessed during L2 cache accesses. When the L1 data cache loads a data item from the L2 cache, the way tag of the data in the L2 cache is also sent to the L1 cache and stored in a new set of way-tag arrays. These way tags provide the key information for the subsequent write accesses to the L2 cache.

In general, both write and read accesses in the L1 cache may need to access the L2 cache. These accesses lead to different operations in the proposed way-tagged cache, as summarized in Table I. Under the write-through policy, all write operations of the L1 cache need to access the L2 cache. In the case of a write hit in the L1 cache, only one way in the L2 cache is activated, because the way-tag information of the L2 cache is available, i.e., the L2 way of the accessed data can be obtained from the way-tag arrays. For a write miss in the L1 cache, the requested data is not stored in the L1 cache, so its corresponding L2 way information is not available in the way-tag arrays; therefore, all ways in the L2 cache need to be activated simultaneously. Since a write hit/miss is not known a priori, the way-tag arrays need to be accessed simultaneously with all L1 write operations in order to avoid performance degradation. The way-tag arrays are very small, and the involved energy overhead can be easily compensated for.
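The access rules summarized above can be condensed into a short sketch. The dictionary standing in for the way-tag arrays, the 4-way L2 geometry, and the use of the number of enabled ways as an energy proxy are assumptions for illustration; the decision logic itself follows the operations described in the text.

```python
# Minimal sketch of the way-tagged L2 access decision under write-through.

L2_WAYS = 4
way_tag_array = {}     # L1 line address -> way of its copy in the L2 cache

def on_l1_fill(line_addr, l2_way):
    """When the L1 cache loads a line from the L2 cache, the L2 way tag is
    sent along with the data and stored in the way-tag arrays."""
    way_tag_array[line_addr] = l2_way

def l2_ways_enabled(op, l1_hit, line_addr):
    """Which L2 ways the way decoder enables for one L1 access."""
    if op == "read":
        return [] if l1_hit else list(range(L2_WAYS))   # read miss: all ways
    # Write-through: every L1 write also accesses the L2 cache.
    if l1_hit and line_addr in way_tag_array:
        return [way_tag_array[line_addr]]               # write hit: one way only
    return list(range(L2_WAYS))                         # write miss: all ways

if __name__ == "__main__":
    on_l1_fill(0x40, l2_way=2)
    print(l2_ways_enabled("write", l1_hit=True, line_addr=0x40))   # [2]
    print(l2_ways_enabled("write", l1_hit=False, line_addr=0x80))  # [0, 1, 2, 3]
```

Since write hits dominate L2 accesses under write-through, most L2 accesses enable a single way, which is the source of the energy saving.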
The figure above shows the system diagram of the proposed way-tagged cache. Several new components are introduced: way-tag arrays, a way-tag buffer, a way decoder, and a way register, all shown within the dotted line. The way tags of each cache line in the L2 cache are maintained in the way-tag arrays, located with the L1 data cache. Note that write buffers are commonly employed in write-through caches (and even in many write-back caches) to improve performance. With a write buffer, the data to be written into the L1 cache is also sent to the write buffer, and the operations stored in the write buffer are then sent to the L2 cache in sequence; this avoids write stalls while the processor waits for write operations to be completed in the L2 cache. In the proposed technique, the way tags stored in the way-tag arrays also need to be sent to the L2 cache along with the operations in the write buffer. Thus, a small way-tag buffer is introduced to buffer the way tags read from the way-tag arrays. A way decoder is employed to decode way tags and generate the enable signals for the L2 cache, which activate only the desired ways. Each way in the L2 cache is encoded into a way tag, and a way register stores these way tags and provides them to the way-tag arrays. For L1 read operations, neither read hits nor misses need to access the way-tag arrays: read hits do not need to access the L2 cache, while for read misses the corresponding way-tag information is not available in the way-tag arrays. As a result, all ways in the L2 cache are activated simultaneously under read misses. The energy consumption per read and write of the conventional set-associative L2 cache and the proposed L2 cache is shown below. This cache configuration, used in the Pentium 4, will be used as a baseline system for comparison with the proposed technique under different cache configurations.

IV. CONCLUSION

This paper presents a new energy-efficient cache technique for high-performance microprocessors employing the write-through policy. The proposed technique attaches a tag to each way in the L2 cache; this way tag is sent to the way-tag arrays in the L1 cache when the data is loaded from the L2 cache into the L1 cache. Utilizing the way tags stored in the way-tag arrays, the L2 cache can be accessed as a direct-mapped cache during subsequent write hits, thereby reducing cache energy consumption. Simulation results demonstrate a significant reduction in cache energy consumption with minimal area overhead and no performance degradation. Furthermore, the idea of way tagging can be applied to many existing low-power cache techniques, such as the phased-access cache, to further reduce cache energy consumption. Future work is directed towards extending this technique to other levels of the cache hierarchy and reducing the energy consumption of other cache operations.

REFERENCES

[1] J. Dai and L. Wang, "An energy-efficient L2 cache architecture using way tag information under write-through policy," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 1, January 2013.
[2] C. Su and A. Despain, "Cache design tradeoffs for power and performance optimization: A case study," in Proc. Int. Symp. Low Power Electron. Design, 1997, pp. 63–68.
[3] K. Ghose and M. B. Kamble, "Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation," in Proc. Int. Symp. Low Power Electron. Design, 1999, pp. 70–75.
[4] C. Zhang, F. Vahid, and W. Najjar, "A highly-configurable cache architecture for embedded systems," in Proc. Int. Symp. Comput. Arch., 2003, pp. 136–146.
[5] K. Inoue, T. Ishihara, and K. Murakami, "Way-predicting set-associative cache for high performance and low energy consumption," in Proc. Int. Symp. Low Power Electron. Design, 1999, pp. 273–275.
[6] A. Ma, M. Zhang, and K. Asanović, "Way memoization to reduce fetch energy in instruction caches," in Proc. ISCA Workshop Complexity Effective Design, 2001, pp. 1–9.
[7] T. Ishihara and F. Fallah, "A way memoization technique for reducing power consumption of caches in application specific integrated processors," in Proc. Design Autom. Test Euro. Conf., 2005, pp. 358–363.
[8] R. Min, W. Jone, and Y. Hu, "Location cache: A low-power L2 cache system," in Proc. Int. Symp. Low Power Electron. Design, 2004, pp. 120–125.
[9] T. N. Vijaykumar, "Reactive-associative caches," in Proc. Int. Conf. Parallel Arch. Compiler Tech., 2001, pp. 49–61.
[10] V. Vasudevan Nair, "Way-tagged L2 cache architecture in conjunction with energy efficient datum storage," ECE Department, Anna University Chennai, Sri Eshwar College of Engineering, Coimbatore, India.