Memory Requirements for Convolutional Neural Network Hardware Accelerators

1. Memory Requirements for Convolutional Neural Network Hardware Accelerators
Kevin Siu, Dylan Malone Stuart, Mostafa Mahmoud, and Andreas Moshovos
University of Toronto
2018 IEEE International Symposium on Workload Characterization (IISWC)
Presented by Sepideh Shirkhanzadeh
2. Why Do We Need Hardware Accelerators?
Convolutional neural networks (CNNs) have been highly successful in image processing and image classification.
Hardware architectures have been designed to accelerate the computations in CNNs.
The key design concerns are memory, bandwidth, and performance.
The main challenge in designing an efficient memory system is sizing the on-chip memory so as to minimize off-chip access costs.
3. Types of Memory Systems
There are three types of memory systems:
1. a centralized on-chip global memory
2. specialized partitioned memories
3. storage partitioned into space for weights and activations
The hierarchy can be fixed or flexible.
Benefit of a flexible hierarchy: optimal energy for each layer of each network.
Disadvantage: extracting the configuration is a time-consuming process.
4. Basics of Convolutional Neural Networks
The input activations I are a block of size X * Y * C.
There are K filters Fk, each of size R * S * C.
The output activations O are a block of size P * Q * K, where
P = (X - R)/m + 1
Q = (Y - S)/m + 1
and m is the stride length (sketched below).
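A minimal sketch of these output-dimension formulas; the 224 x 224 input and 3 x 3 filter shapes are illustrative assumptions, not values from the paper:

```python
# Output plane size P x Q for an X*Y*C input and R*S*C filters at stride m.
def conv_output_dims(X, Y, R, S, m):
    P = (X - R) // m + 1
    Q = (Y - S) // m + 1
    return P, Q

print(conv_output_dims(X=224, Y=224, R=3, S=3, m=1))  # (222, 222)
```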
5. Convolutional Computations
• Filter F0 is multiplied element-wise with the upper-leftmost values of the input to produce O(0, 0, 0).
• In the next step, the filter is shifted by stride m across the input to produce O(0, 1, 0).
• This process is repeated over the entire input block to compute the output activation plane of size P x Q.
• To compute the other planes, we apply the same process using filters F1 to FK-1 (see the sketch below).
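A minimal NumPy sketch of this computation, assuming the input I is stored as an (X, Y, C) array and the filters are stacked as an (K, R, S, C) array F; this is illustrative, not an optimized implementation:

```python
import numpy as np

def conv_layer(I, F, m=1):
    X, Y, C = I.shape
    K, R, S, _ = F.shape
    P, Q = (X - R) // m + 1, (Y - S) // m + 1
    O = np.zeros((P, Q, K))
    for k in range(K):               # one output plane per filter
        for p in range(P):
            for q in range(Q):       # slide the filter by stride m
                window = I[p*m:p*m+R, q*m:q*m+S, :]
                O[p, q, k] = np.sum(window * F[k])  # element-wise MAC
    return O
```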
6. Characterization of the On-Chip Memory Storage Requirements
The computation of each output window is independent of the others, which permits very wide parallelism across the computation.
The same input activation and filter values are accessed multiple times throughout the computation; this is the opportunity for data reuse.
Each calculation in a convolutional layer is independent, so the order of computation does not affect the final outcome, but it does affect when the operands are accessed from memory.
Data reuse and access locality therefore become important.
7. Characterization of the On-Chip Memory Storage Requirements: Computation Orders & Data Reuse
8. Computation Orders
Order 1 (Input-Major Order)
• Each input window of size R x S x C is multiplied with each of the K filters, producing a 1 x 1 column of output activations of length K.
• The weights are re-accessed P * Q times.
Order 2 (Filter-Major Order)
• One filter is convolved across the entire input activation.
• The input activation values are re-accessed K times.
Both orders are sketched below.
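A sketch contrasting the two loop orders; the small shapes are assumptions, and the body is the same multiply-accumulate either way, so only the loop nesting changes which operands are re-accessed:

```python
import numpy as np

X = Y = 8; C = 4; K = 16; R = S = 3; m = 1   # assumed toy shapes
P, Q = (X - R) // m + 1, (Y - S) // m + 1
I = np.random.rand(X, Y, C)
F = np.random.rand(K, R, S, C)
O = np.zeros((P, Q, K))

def mac(p, q, k):
    return np.sum(I[p*m:p*m+R, q*m:q*m+S, :] * F[k])

# Order 1 (input-major): each window meets all K filters before moving on.
# Every filter is re-accessed once per window, i.e. P*Q times in total.
for p in range(P):
    for q in range(Q):
        for k in range(K):
            O[p, q, k] = mac(p, q, k)

# Order 2 (filter-major): one filter sweeps the whole input.
# Every input window is re-accessed once per filter, i.e. K times.
for k in range(K):
    for p in range(P):
        for q in range(Q):
            O[p, q, k] = mac(p, q, k)
```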
10. 1. Everything On-Chip:
• We have an activation memory (AM) and a weight memory (WM) on-chip.
• The WM is sized such that the weights from all layers fit on-chip simultaneously.
• The input and output activations for any layer, one layer at a time, also fit in the on-chip AM.
• "Zero" off-chip bandwidth, because everything fits on-chip.
• The weights are loaded only once.
• For each inference, the only off-chip traffic is the input image and the final output.
• Generally infeasible (a sizing sketch follows).
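A minimal sketch of the Scheme 1 sizing rule, assuming 16-bit values and three made-up VGG-like layers at stride 1; the shapes are illustrative only:

```python
# Scheme 1: WM = weights of ALL layers; AM = input + output activations
# of the single largest layer. Layer tuples are (X, Y, C, K, R, S), assumed.
layers = [(224, 224, 3, 64, 3, 3),
          (224, 224, 64, 64, 3, 3),
          (112, 112, 64, 128, 3, 3)]
BYTES = 2  # 16-bit values assumed

wm = sum(K * R * S * C for X, Y, C, K, R, S in layers) * BYTES
am = max(X * Y * C + (X - R + 1) * (Y - S + 1) * K   # stride 1 assumed
         for X, Y, C, K, R, S in layers) * BYTES
print(f"WM = {wm / 2**20:.2f} MiB, AM = {am / 2**20:.2f} MiB")
```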
11. 2. Working Set of Activations + All Filters (Off-Chip Activations):
• The WM is sized such that all weights for all layers fit on-chip.
• The AM is sized to hold one "row" of input windows, namely a block of size X * S * C.
• Each input activation needs to be read only once from off-chip.
• While that row is being computed, the next set of X * m * C activations is loaded in parallel, so that by the end the next "row" of activation windows is on-chip.
• The total off-chip traffic is the sum of the input activations and output activations for each layer (sketched below).
[Figure: weights on-chip; one row of input windows of size X * S * C buffered in the AM]
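A minimal sketch of the Scheme 2 row buffer, under assumed shapes (X = 56, S = 3, C = 64, stride m = 1) and 16-bit activations; none of these numbers come from the paper:

```python
# Scheme 2: one "row" of input windows stays on-chip while the next
# X*m*C slice streams in. All shapes here are assumptions.
X, S, C, m, BYTES = 56, 3, 64, 1, 2

row_buffer = X * S * C * BYTES   # X * S * C block of input activations
refill     = X * m * C * BYTES   # next slice, loaded in parallel
print(f"row buffer = {row_buffer / 2**10:.0f} KiB, "
      f"refill per step = {refill / 2**10:.0f} KiB")
```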
12. 3. Working Set of Filters + All Activations (Off-Chip Weights):
• The WM holds only the number of filters needed to satisfy the parallel computation.
• Each filter needs to be fetched from off-chip only once per layer.
• The AM is sized to hold both the input and the output activations of a layer, i.e. AM = the maximum layer footprint.
• The off-chip traffic for this scheme is simply the size of the weights across all layers of the network (sketched below).
[Figure: activations on-chip]
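A minimal sketch of the Scheme 3 sizing rule; the per-layer counts (input activations, output activations, weights, in values) are illustrative, not taken from the paper:

```python
# Scheme 3: AM holds the worst layer's input + output activations;
# off-chip traffic is just the weights, fetched once per layer.
layers = [  # (input acts, output acts, weights) per layer -- assumed
    (150528, 3211264, 1728),
    (3211264, 3211264, 36864),
    (802816, 1605632, 73728),
]
BYTES = 2

am = max(i + o for i, o, w in layers) * BYTES   # AM = max layer footprint
offchip = sum(w for i, o, w in layers) * BYTES  # traffic = all weights
print(f"AM = {am / 2**20:.2f} MiB, off-chip = {offchip / 2**10:.0f} KiB")
```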
13. 4. Working Set of Filters + Working Set of Activations (Both Off-Chip):
• The WM holds one set of filters, as required by the on-chip execution engine.
• The AM is sized to store only one row of activations.
• To minimize off-chip bandwidth, we can either re-fetch the activation values K times (as in Order 2) or re-fetch the weight values Q times (as in Order 1).
• We always opt for the order that is most favorable to the metric under study (sketched below).
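A sketch of this order choice for a single layer; the layer numbers are assumptions chosen only to show the trade-off:

```python
# Scheme 4: with only working sets on-chip, re-fetch whichever side is cheaper.
K, Q = 256, 56                 # filters and output rows (assumed)
acts_bytes    = 802816 * 2     # this layer's input activations (assumed)
weights_bytes = 589824 * 2     # this layer's weights (assumed)

order2 = K * acts_bytes        # Order 2: activations re-fetched K times
order1 = Q * weights_bytes     # Order 1: weights re-fetched Q times
best = min(order1, order2)     # pick the order favorable to bandwidth
print(f"Order 1: {order1 / 2**20:.0f} MiB, Order 2: {order2 / 2**20:.0f} MiB, "
      f"best: {best / 2**20:.0f} MiB")
```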
17. Total Storage for CNNs – Scheme 1
MobileNet = 10.5 MB.
This is certainly expensive for the mobile devices this network targets, given that it sacrifices accuracy compared to networks such as ResNet.
18. On-Chip Activation Memory Requirements – Scheme 2
Assuming that all the weights are stored in on-chip memory, only a single "row" of windows of the activations needs to be stored on-chip.
[Chart: on-chip activation memory per network for the computational-imaging networks]
19. Weight Memory Requirements – Scheme 3
• Assuming that all the activations are stored in on-chip memory.
• We define the working set of filters as the number of filters that are computed in parallel and kept on-chip at the same time.
[Chart: weight memory per network for the image-classification networks]
20. Overall Review
• Scheme 2, which buffers all weights on-chip, is impractical for the classification networks.
• Scheme 3, which buffers all activations per layer on-chip and either all or a subset of the filters, is practical for the classification models; VGG-19 and DPNet are outliers.
• Scheme 4, processing only 64, 16, or 1 filters concurrently, has vastly lower on-chip storage requirements, but it has much higher off-chip bandwidth requirements.
22. Computational Intensity
• Computational intensity is typically much larger in the early layers of convolutional neural networks.
• The input dimensions are much larger in early layers, so each filter is reused many times over the input.
• Lower computational intensity (less reuse) implies larger bandwidth requirements in later layers (sketched below).
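A quick sketch of the reuse argument: each weight participates in one multiply-accumulate per output position, so filter reuse is P * Q, which is far larger for an early layer than a late one (the layer shapes are illustrative assumptions):

```python
# Filter reuse: each weight is used once per output position, i.e. P*Q times.
def filter_reuse(X, Y, R, S, m=1):
    P = (X - R) // m + 1
    Q = (Y - S) // m + 1
    return P * Q

print(filter_reuse(224, 224, 3, 3))  # early layer: 49284 uses per weight
print(filter_reuse(14, 14, 3, 3))    # late layer:  144 uses per weight
```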
23. Peak Bandwidth and Memory Requirements
• Under Scheme 2 (all weights on-chip), the image-classification networks have very low bandwidth.
• Under Scheme 3 (all activations on-chip), the super-resolution networks have very low bandwidth.
• Under Scheme 4, with only one working set on-chip at a time, memory is reduced but bandwidth is higher.
24. Related Work
• Yang et al. show how to optimize CNN loop blocking to minimize total memory energy expenditure.
• DaDianNao used large on-chip eDRAM to store all activations and weights.
• SCNN sizes its activation RAMs to capture the capacity requirements of nearly all of the layers in the networks.
• The TPU uses a multi-megabyte on-chip AM and 64 KB double buffers for the weights.