This seminar focuses on increasing the utilization of rarely used silicon, called dark silicon, to build an energy-efficient architecture for Android.
GreenDroid achieves this by filling the dark silicon with specialised cores.
3. CONTENTS
• Utilization wall and dark silicon
• C-core
• GreenDroid and its architecture
• C-core energy efficiency
• Conclusion
4. MOORE’S LAW
“The number of transistors in a chip doubles with every new technology node”
5. The Scaling Promise of Moore’s Law
• 8 years ago (90 nm): 1 core @ 3.8 GHz
• Today (22 nm), transistors are 4x faster and 16x more plentiful
• Scaling promise: 16 cores @ 15.2 GHz (64x)
• Today's chips: 6 cores @ 3.6 GHz (5.7x)
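The 64x and 5.7x figures on this slide can be checked with a minimal sketch. It assumes that aggregate performance is simply cores times clock frequency, which is a simplification not stated on the slide; the function name `throughput` is hypothetical.

```python
def throughput(cores, ghz):
    """Aggregate throughput proxy: cores times clock rate (assumed, simplified metric)."""
    return cores * ghz


# Figures taken from the slide.
baseline = throughput(1, 3.8)     # 90 nm chip, 8 years ago
promised = throughput(16, 15.2)   # 16x more transistors, each 4x faster
actual = throughput(6, 3.6)       # 22 nm chip today

promised_gain = promised / baseline
actual_gain = actual / baseline

print(f"promised gain: {promised_gain:.1f}x")
print(f"actual gain:   {actual_gain:.1f}x")
```

The gap between the two ratios is what the following slides attribute to the utilization wall.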
6. UTILIZATION WALL
• With each successive generation, the percentage of a chip that can actively switch drops exponentially due to power constraints
• A direct consequence of this is dark silicon, which limits the utilization of application processors
7. Utilization Wall: Dark Implications for Multicore
Spectrum of tradeoffs between # cores and frequency, e.g. going from 65 nm to 32 nm:
• 65 nm: 4 cores @ 3 GHz
• 32 nm: 4 cores @ 2x3 GHz (12 cores dark)
• 32 nm: 2x4 cores @ 3 GHz (8 cores dark) (Industry’s Choice)
• 32 nm: 4x4 cores @ .9 GHz (GPUs of future?)
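The dark-core counts on this slide follow from simple bookkeeping. The sketch below assumes that the move from 65 nm to 32 nm lets four times as many cores fit in the same area (an assumption that matches the 12 and 8 dark cores quoted above); the names `dark_cores` and `AREA_SCALING` are hypothetical.

```python
CORES_AT_65NM = 4
AREA_SCALING = 4  # assumption: 65 nm -> 32 nm fits 4x as many cores in the same area


def dark_cores(cores_that_fit, active_cores):
    """Cores that are present on the die but must stay switched off."""
    return cores_that_fit - active_cores


cores_that_fit = CORES_AT_65NM * AREA_SCALING  # 16 cores fit at 32 nm

# Active-core counts for the design points listed on the slide.
design_points = {
    "4 cores at double frequency": 4,
    "2x4 cores at the same frequency": 8,
    "4x4 cores at reduced frequency": 16,
}

for label, active in design_points.items():
    print(f"{label}: {dark_cores(cores_that_fit, active)} cores dark")
```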
8. WHAT DO WE DO WITH DARK SILICON?
Goal: Leverage dark silicon for a more efficient architecture
Approach:
1. Fill dark silicon with specialised cores to save energy on common apps.
2. Provide focused reconfigurability to adapt to evolving workloads.
9. CONSERVATION CORES
• Specialized cores for reducing energy
• Hot code runs on c-cores, and cold code runs on the host CPU
• C-cores use up to 18x less energy
• Shared D-cache -> coherent memory
• Fully automated toolchain
• No “deep” analysis required
• C-cores automatically generated from hot program regions
• HW generation/SW integration
[Figure: host CPU (general purpose) with I-cache running cold code, and a c-core running hot code, both connected to a shared D-cache]
10. ANDROID
• Google’s OS + application environment for mobile devices
• Java applications run on the Dalvik virtual machine
• Apps share a set of libraries (libc, OpenGL, SQLite, etc.)
[Figure: Android software stack. Labels: Applications, Libraries, Dalvik, Cache, Hardware, Linux kernel]
11. Applying C-cores to Android
• Android is well suited for c-cores
• Core set of commonly used applications
• Libraries are hot code
• Dalvik virtual machine is hot code
• Libraries, Dalvik, kernel and application hotspots -> c-cores
[Figure: Android software stack with c-cores. Labels: Applications, Libraries, Dalvik, Cache, Hardware, Linux kernel, C-cores]
12. WHAT IS GREENDROID?
• A mobile application processor
• 45-nm multicore research prototype
• Targets the Android mobile-phone software stack
• Can execute general-purpose mobile programs with 11 times less energy
• Saves energy by using specialised cores called conservation cores (c-cores)
• C-cores cover approximately 95 percent of the execution time
17. C-CORE ENERGY EFFICIENCY
• C-cores avoid the overheads of executing instructions.
• The c-cores’ datapath is specialised.
• Energy drops from 91 pJ per instruction to just 8 pJ per instruction.
[Figure: energy breakdown per instruction]
Baseline CPU (91 pJ/instr): D-cache 6%, I-cache 23%, fetch/decode 19%, register file 14%, datapath 38%
C-cores (8 pJ/instr): D-cache 6%, datapath 3%, energy saved 91%
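The two energy breakdowns are consistent with each other, which can be checked with a short sketch. It assumes the c-core percentages are expressed as fractions of the baseline CPU's 91 pJ per instruction, as the "energy saved 91%" slice suggests; the variable names are illustrative only.

```python
BASELINE_PJ_PER_INSTR = 91.0

# Baseline CPU breakdown from the slide (fractions of 91 pJ).
baseline_share = {
    "d_cache": 0.06,
    "i_cache": 0.23,
    "fetch_decode": 0.19,
    "register_file": 0.14,
    "datapath": 0.38,
}

# What a c-core still spends, as fractions of the baseline energy (assumed reading of the chart).
ccore_share = {
    "d_cache": 0.06,
    "datapath": 0.03,
}

remaining_fraction = sum(ccore_share.values())
ccore_pj_per_instr = BASELINE_PJ_PER_INSTR * remaining_fraction
energy_saved = 1.0 - remaining_fraction

print(f"c-core energy: {ccore_pj_per_instr:.1f} pJ/instr")
print(f"energy saved:  {energy_saved:.0%}")
```

The remaining 9 percent of 91 pJ comes to roughly 8 pJ per instruction, matching the figure quoted for c-cores.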
18. CONCLUSION
• Over the next 5 to 10 years, the amount of dark silicon will increase exponentially.
• The c-cores technique converts dark silicon into energy savings.
• C-cores reduce processor energy consumption by 91 percent for hot code.
The hot code is converted into Verilog, and the resulting specialised core is injected into the design.
We turn on only the cores we need, when we need them.
In the execution model, execution jumps from c-core to c-core; each loop runs on specialised hardware that has been targeted at just that loop.
We trade area, which is dark anyway, for energy efficiency.
A c-core sends all its memory accesses through the data cache that it shares with the host CPU.
Code that is not executed often runs on the host CPU; when we reach a hotspot, we jump over to the specialised piece of hardware. We don’t have to transfer any data, because the data is already in the shared data cache, which allows us to jump back and forth very quickly and efficiently. We generate c-cores using a fully automated toolchain. The toolchain generates synthesizable Verilog and at the same time integrates the c-cores into the software, by inserting function stubs into the application that call the c-cores at run time.
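The stub mechanism can be pictured with a small software-only sketch. This is not the toolchain's actual output: every name here (`AVAILABLE_CCORES`, `sum_array`, and so on) is hypothetical, and the "c-core" is just a Python function standing in for specialised hardware.

```python
def sum_array_host(values):
    """Software version of the hot function, run on the general-purpose host CPU."""
    total = 0
    for v in values:
        total += v
    return total


def sum_array_ccore(values):
    """Stand-in for the specialised hardware; it must compute the same result."""
    return sum(values)


# Hypothetical registry of the c-cores present on this chip.
AVAILABLE_CCORES = {"sum_array": sum_array_ccore}


def sum_array(values):
    """Function stub: use the c-core if one exists, otherwise fall back to the host CPU."""
    ccore = AVAILABLE_CCORES.get("sum_array")
    if ccore is not None:
        return ccore(values)       # hot path on specialised hardware
    return sum_array_host(values)  # fallback on the host CPU


data = [1, 2, 3, 4]
with_ccore = sum_array(data)

AVAILABLE_CCORES.clear()  # simulate a chip without this c-core
without_ccore = sum_array(data)

print(with_ccore, without_ccore)
```

Because both paths operate on the same data and return the same result, the caller does not need to know which one ran, which mirrors the role the shared data cache plays in the notes above.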
This simple transformation gets you about 18x less energy for the code it targets, without even trying to parallelise the code.
The diagram shows the Android software stack running on typical hardware.
Applications are written in Java and compiled to run on the Dalvik virtual machine (DVM).
The applications also call into a set of libraries, including libc, OpenGL, etc.
This software model makes Android a great fit for c-cores.
This is because Android runs a core set of commonly used applications, e.g. the web browser, email and various media players.
These applications rely on the DVM and the libraries, making this part of the software stack particularly hot code.
We can also target specific hotspots in certain applications and in the Linux kernel.
We can convert all of these hotspots into conservation cores for great energy savings.
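One way to picture hotspot selection is a greedy pass over a profile, as in the sketch below. The profile values are made up purely for illustration and are not measurements from Android; the function `select_hotspots` and the 90 percent target are likewise assumptions, not part of the GreenDroid toolchain.

```python
def select_hotspots(profile, coverage_target):
    """Pick the hottest regions first until the target share of execution time is covered."""
    selected = []
    covered = 0.0
    ranked = sorted(profile.items(), key=lambda item: item[1], reverse=True)
    for region, share in ranked:
        if covered >= coverage_target:
            break
        selected.append(region)
        covered += share
    return selected, covered


# Hypothetical execution-time shares (illustrative numbers only).
profile = {
    "dalvik_interpreter": 0.40,
    "libc": 0.25,
    "sqlite": 0.15,
    "browser_render": 0.12,
    "misc_app_code": 0.08,
}

hot_regions, covered = select_hotspots(profile, coverage_target=0.90)
print(hot_regions)
print(f"covered: {covered:.0%}")
```

Regions left unselected would simply keep running on the host CPU as cold code.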
Another reason is the relatively short replacement cycle of handsets: most Android phones are used for only 2-3 years.
We can continually develop new c-cores as more applications appear and become popular.
At the same time, the c-core interface allows us to remove c-cores at any time without affecting the system, because the application can fall back to the general-purpose host CPU.
It has a specially built structure that can analyze a current Android phone and determine which apps, and which CPU circuits the phone is using the most. Then it can dream up a processor design that best takes advantage of those usage habits, creating a CPU that’s both faster and more energy efficient.
So we have been applying this c-core technique to the Android environment, extracting the hotspots from Android and building a chip.
The figure on the right is the output of a layout tool; it shows 9 different c-cores clustered around the data cache, with a processor on the left.
On the left is the breakdown of energy for a very efficient processor, and on the right is the breakdown for one of the c-cores.
The main benefit is that we get rid of all the overheads of executing an instruction. We don’t have an instruction cache, so there is no fetching and decoding of instructions. There is no big register file to write operands to, and even most of the datapath is eliminated.
All that is left is the data cache and a little sliver of the datapath where the actual computation takes place.