May 10, 2014 R.Innocente 1
Reconfigurable Computing
Roberto Innocente
inno@sissa.it
May 10, 2014 R.Innocente 2
Flexibility
ASIC (Application-Specific Integrated Circuit)
Very inflexible, designed to solve just one problem.
Energy, space and time efficient.
GPP (General-Purpose Processor)
Very flexible, can solve any problem.
Energy, space and time inefficient.
Reconfigurable hardware (the "?" in between)
Flexible, yet efficient enough in energy, time and space.
May 10, 2014 R.Innocente 3
History
May 10, 2014 R.Innocente 4
Gerald Estrin/1
Gerald Estrin is credited with the idea, in the 1960s, of the first reconfigurable computer: the fixed-plus-variable (F+V) structure computer.
Gerald Estrin. Organization of computer systems: the fixed plus variable structure computer. ACM, 1960.
May 10, 2014 R.Innocente 5
Gerald Estrin/2
He envisioned that important performance gains could be achieved when many computations are executed on appropriate problem-oriented configurations.
F+V is made of:
- a high-speed general-purpose computer (the F part): initially an IBM 7090
- high-speed special structures of various sizes (the V part), problem specific: trigonometric functions, logarithms, exponentials, n-th powers, complex arithmetic, …
V is built on a motherboard with 36 module positions, which can undergo:
- Function reconfiguration: physically changing some modules
- Routing reconfiguration: changing part of the back wiring
The Rammig machine (1977): investigation of a reconfigurable machine that needs no manual or mechanical intervention
May 10, 2014 R.Innocente 6
Today's reconfigurable hardware
was born out of the wish to replace assorted logic ICs (Integrated Circuits), and later to rapidly prototype large ASICs (Application-Specific ICs) or implement SoCs (Systems on Chip).
The following slides assume readers who work in scientific computing and are not electrical engineers.
May 10, 2014 R.Innocente 7
Basic digital circuits
Figure: AND, OR, inverter, MUX, D-type flip-flop, shift register
Usually 0 = 0 V, 1 = some positive voltage
May 10, 2014 R.Innocente 8
SSI 74xx IC
May 10, 2014 R.Innocente 9
PLD
Drawbacks of standard discrete logic circuits:
- 14-pin packages holding 4 or 6 logic functions each
- you often had to route across the whole PCB to reach a free OR gate or inverter
- even if you needed only a few gates, you still had to place a whole IC with 4 or 6 of them
Hence the idea of the PLD (Programmable Logic Device):
- SPLD (Simple: PAL/PLA)
- CPLD (Complex)
in which a simple interconnection network is configured by blowing some internal fuses (fuse technology) to implement combinational logic.
May 10, 2014 R.Innocente 10
Disjunctive normal form (aka sum of products)
Every Boolean function of some Boolean variables can be represented as a sum of minterms (a minterm is the product of all the variables, each either plain or complemented).
With 3 Boolean variables a, b, c:
$a\bar{b}c$ and $\bar{a}b\bar{c}$ are 2 of the $2^3 = 8$ minterms
$f(a,b,c) = a\bar{b}c + \bar{a}b\bar{c}$
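The sum-of-minterms idea can be sketched in a few lines of Python. This is only an illustration: the helper names (`minterm`, `all_minterms`, `sum_of_products`) are made up for this example, and `~x` is used here to write the complement of `x`. The function `f` is the 3-variable example of this slide, read as a(not b)c + (not a)b(not c).

```python
from itertools import product

def minterm(names, bits):
    """Product of all variables: plain where the bit is 1, complemented (~) where it is 0."""
    return "".join(n if b else "~" + n for n, b in zip(names, bits))

def all_minterms(names):
    """All 2**n minterms of n variables."""
    return [minterm(names, bits) for bits in product((0, 1), repeat=len(names))]

def sum_of_products(func, names):
    """Minterms of exactly those input combinations for which func is true."""
    return [minterm(names, bits)
            for bits in product((0, 1), repeat=len(names))
            if func(*bits)]

def f(a, b, c):
    # example function: a(not b)c + (not a)b(not c)
    return (a and not b and c) or (not a and b and not c)

print(len(all_minterms("abc")))               # number of minterms of 3 variables
print(" + ".join(sum_of_products(f, "abc")))  # disjunctive normal form of f
```

Any other Boolean function can be passed in place of `f`; the list returned by `sum_of_products` is its disjunctive normal form.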
May 10, 2014 R.Innocente 11
PLA (Programmable Logic Array)
$f_1 = p_1 + p_2 + p_3 = x_1 x_2 + x_1 \bar{x}_3 + \bar{x}_1 \bar{x}_2 x_3 + x_1 x_3$
May 10, 2014 R.Innocente 12
FPGA
CPLDs too showed their limits, so in 1985/1990 Xilinx introduced a more flexible design, the
FPGA (Field Programmable Gate Array),
in which the interconnection network is much more flexible and onto which sequential circuits can also be easily mapped.
May 10, 2014 R.Innocente 13
FPGA idea
1985 Xilinx – Ross Freeman (inventor of
FPGA): “What if we could develop the
equivalent of a circuit board full of
standard logic parts (like TTL and PAL
devices) on a single high density
programmable logic chip ?”
- post fabrication programmability by end
users
- fabless semiconductor company
May 10, 2014 R.Innocente 14
Today
May 10, 2014 R.Innocente 15
FPGA market
Dominated by 2 players :
- Altera
- Xilinx
Their combined share has grown from 67% in 2010 to about 90% of the market today
(USD 4.5 billion in revenue in 2012)
Figure from sourcetech411 (2010)
May 10, 2014 R.Innocente 16
An important question: are FPGAs green ?
FPGA: ~ 20 W
  Virtex-7 2000T (one of the top FPGAs)
  Xilinx showed 3600 copies of its 8-bit PicoBlaze processor running on a Virtex-7, consuming 20 W
CPU: ~ 100 W
  Core i7-4770K Haswell (22 nm), 3.5 GHz, 4 cores: 84 W
  Core i7-3930K Sandy Bridge-E (32 nm), 3.2 GHz, 6 cores: 130 W
  Xeon E7458 Dunnington (45 nm), 2.4 GHz: 90 W
  Xeon E7460 Dunnington (45 nm), 2.66 GHz: 130 W
GPU: ~ 220 W
  Nvidia Tesla M2090: 225 W
  Nvidia Tesla K20X: 235 W
This is only a partial answer: we need to be able to estimate FPGA performance to get a more useful index.
May 10, 2014 R.Innocente 17
FPGA architecture
From RF and Wireless World
Sea of gates : logic blocks are like islands in a sea of interconnections
May 10, 2014 R.Innocente 18
Virtex family
Year  Family     Process  Clock     Size
1998  Virtex     250 nm   100 MHz   25k-60k cells
2000  Virtex-E   180 nm   300 MHz   1k-70k cells
2000  Virtex-II  150 nm   420 MHz   up to 168 multipliers, up to 93k 4-LUTs
2005  Virtex-4   90 nm    500 MHz   up to 200k cells
2007  Virtex-5   65 nm    550 MHz   up to 330k cells
      Virtex-6   40 nm              288-2k DSPs, up to 500k 6-LUTs
2010  Virtex-7   28 nm    ~500 MHz  up to 2000k cells
2014  Virtex-US  20 nm              up to 4400k cells
From L. Zhuo
Transistor counts:
Largest FPGAs: up to ~7 billion transistors
Intel 2014 15-core Xeon Ivy Bridge-EX: ~4.3 billion transistors
Nvidia 2012 GK110 Kepler: ~7 billion transistors
May 10, 2014 R.Innocente 19
FPGA/CPU evolution
May 10, 2014 R.Innocente 20
Virtex-7 is not monolithic
2.5D technology: 4 FPGA tiles on a silicon interposer that provides 10k interconnections between them
May 10, 2014 R.Innocente 21
Enabling technologies
May 10, 2014 R.Innocente 22
Programming technology/1
Antifuse SRAM
OTP(One time programmable)
Disordered except at very low range
Pass transistor in switch block
May 10, 2014 R.Innocente 23
Programming technology/2
Antifuse
- pros: cheap, small
- cons: requires special processing, one-time programming
SRAM
- pros: can be built with a standard semiconductor process, can be easily reprogrammed
- cons: large area required (6 transistors per cell)
May 10, 2014 R.Innocente 24
Confware
The configuration of an FPGA (which is compiled into a stream of bits) is neither hardware nor software.
Someone coined the neologism
confware:
the configuration of reconfigurable hardware.
May 10, 2014 R.Innocente 25
How do you configure an FPGA?
The SRAM cells form one long shift register, loaded serially by clocking in the confware
Virtex-7 2000T = 440 Mbits of SRAM cells
(simplified: large FPGAs can also load the confware in parallel)
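The serial loading of the confware can be illustrated with a toy shift register in Python. This is a sketch only: `load_confware` and `serial_load_seconds` are hypothetical helper names, and the 100 MHz configuration clock used in the last line is an assumed value chosen for illustration, not a figure from the slides or from a datasheet.

```python
def load_confware(bitstream, n_cells):
    """Shift the configuration in serially: one bit per configuration clock."""
    cells = [0] * n_cells
    clocks = 0
    for bit in bitstream:
        # every cell hands its bit to the next one, the new bit enters at the front
        cells = [bit] + cells[:-1]
        clocks += 1
    return cells, clocks

def serial_load_seconds(n_bits, config_clock_hz):
    """Time to clock in n_bits at one bit per clock."""
    return n_bits / config_clock_hz

cells, clocks = load_confware([1, 0, 1, 1], n_cells=4)
print(cells, clocks)                       # the first bit shifted in ends up in the last cell
print(serial_load_seconds(440e6, 100e6))   # 440 Mbit at an ASSUMED 100 MHz clock, in seconds
```

With a purely serial load the configuration time grows linearly with the number of SRAM cells, which is one reason large devices also offer parallel loading.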
May 10, 2014 R.Innocente 26
Logic Blocks/Logic Cells
May 10, 2014 R.Innocente 27
Fine/coarse grain logic blocks
From:
- a single transistor (Crosspoint: went bankrupt)
- a logic gate
To:
- a complete processor (FPNA: Field Programmable Node Array)
NB. FPNA also stands for Field Programmable Neural Array
May 10, 2014 R.Innocente 28
CLB (Configurable Logic Blocks)
Homogeneous:
- Logic cells: 4-input LUT (LookUp Table) + flip-flop
Heterogeneous (modern development):
- Logic cells
- DSP (Digital Signal Processing) blocks
- Memory blocks
- I/O blocks
The differentiation is needed so that operations like multiplication and addition can be mapped efficiently.
The heterogeneous architecture is prevalent now. The blocks are usually configured by SRAM bits loaded through serial ports, as already pointed out.
May 10, 2014 R.Innocente 29
Standard Logic Cell
4-input LUT: 16 bits of SRAM for configuration
D-type flip-flop
2:1 mux: 1 bit of SRAM for configuration
May 10, 2014 R.Innocente 30
standard LUT (Look Up Table)
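Dec  Bin   Out
0    0000  0
1    0001  1
2    0010  0
3    0011  0
4    0100  1
5    0101  0
6    0110  1
7    0111  1
..   ..    ..
- 16 x 1 memory, addressed by the 4 inputs (Bit 0 .. Bit 3)
- any Boolean function of 4 inputs, for example the one given by the rows shown above:
$f = \bar{x}_3 \bar{x}_2 \bar{x}_1 x_0 + \bar{x}_3 x_2 \bar{x}_1 \bar{x}_0 + \bar{x}_3 x_2 x_1 \bar{x}_0 + \bar{x}_3 x_2 x_1 x_0$
NB. LUT rhymes with nut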
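A LUT is simply a small memory addressed by its inputs, which a few lines of Python can mimic. The sketch below is illustrative only: `lut_read` is a made-up helper, and since the table on the slide shows only addresses 0 to 7, the remaining eight entries are assumed to be 0 here.

```python
def lut_read(lut, x3, x2, x1, x0):
    """Read a 16 x 1 memory: the four inputs form the address."""
    return lut[(x3 << 3) | (x2 << 2) | (x1 << 1) | x0]

# "Out" column for addresses 0..7 as in the table; addresses 8..15 are not
# shown on the slide and are ASSUMED to be 0 in this sketch.
lut = [0, 1, 0, 0, 1, 0, 1, 1] + [0] * 8

def f(x3, x2, x1, x0):
    """The same function written as a sum of four minterms."""
    n3, n2, n1, n0 = 1 - x3, 1 - x2, 1 - x1, 1 - x0
    return ((n3 & n2 & n1 & x0) | (n3 & x2 & n1 & n0) |
            (n3 & x2 & x1 & n0) | (n3 & x2 & x1 & x0))

inputs = [((a >> 3) & 1, (a >> 2) & 1, (a >> 1) & 1, a & 1) for a in range(16)]
print(all(lut_read(lut, *x) == f(*x) for x in inputs))  # LUT and formula agree
```

Changing the 16 stored bits changes the function; the "hardware" (the memory and its address decoder) stays the same, which is the essence of a configurable logic cell.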
May 10, 2014 R.Innocente 31
Uses of Logic Cell
2^4 = 16 x 1 bit memory Any boolean function of 4
inputs
4:1 multiplexer
May 10, 2014 R.Innocente 32
Virtex-7 Logic Block basics
May 10, 2014 R.Innocente 33
Virtex-7 Logic slice
From Xilinx
4 x 32=128 bit shift reg
May 10, 2014 R.Innocente 34
Virtex7 CLB slice
- one 6-input LUT
- two 5-input LUTs sharing the same inputs
- two arbitrary Boolean functions of 3 and 2 inputs or fewer
May 10, 2014 R.Innocente 35
Altera ALM
May 10, 2014 R.Innocente 36
Interconnection network
May 10, 2014 R.Innocente 37
Interconnection network
Hierarchical routing
Island-type routing (predominant)
The interconnection network can consume 80% of the area of an FPGA!
Nearest neighbours
May 10, 2014 R.Innocente 38
Programmable switch
May 10, 2014 R.Innocente 39
SRAM routing: coarse/fine grain
5 bit SRAM 1 bit SRAM
May 10, 2014 R.Innocente 40
Details of island type routing
May 10, 2014 R.Innocente 41
Disjoint/Wilton switch blocks
Disjoint: a wire can only leave on a wire with the same number, which creates routing domains
Wilton: a wire can change domain in at least one direction
May 10, 2014 R.Innocente 42
Channel segments distribution
May 10, 2014 R.Innocente 43
Columnar architecture
Xilinx 7 series FPGA
May 10, 2014 R.Innocente 44
DSP blocks &
floating point
May 10, 2014 R.Innocente 45
FPGAs floating point in 1994
B. Fagin and C. Renard. Field Programmable Gate Arrays and
Floating Point Arithmetic. IEEE Transactions on VLSI Systems,
2(3), September 1994.
Fagin & Renard report that floating point operators can be implemented, but that doing so is impractical: no FPGA in existence at the time could hold even a single multiplier circuit!
May 10, 2014 R.Innocente 46
FPGA fp in 1995
Shirazi et al., along the same line as Fagin & Renard, propose 2 custom fp formats, 16 and 18 bits in total, and provide add, sub, mul and div operators for them
N. Shirazi, A. Walters, and P. Athanas. Quantitative Analysis of
Floating Point Arithmetic on FPGA Based Custom Computing
Machines. In Proceedings of the IEEE Symposium on FPGAs for
Custom Computing Machines, April 1995.
May 10, 2014 R.Innocente 47
FPGA fp in 2002
Belanovic & Leeser present a library of variable-width parameterized floating point operators (a superset of the IEEE formats)
Pavle Belanovic and Miriam Leeser. A Library of Parameterized Floating-point Modules and Their Use. 2002.
May 10, 2014 R.Innocente 48
What allowed the breakthrough ?
The addition by the major vendors of hardware multipliers (called DSP blocks) to their FPGAs, from 2000 on:
- first Xilinx, on Virtex-II
- soon after Altera, on Stratix
Over the last decade this also sparked the interest of the HPC community:
Cray XD1, SGI RASC, Convey HC-1
HPRC = High Performance Reconfigurable Computing
May 10, 2014 R.Innocente 49
FPGA MAC operation
May 10, 2014 R.Innocente 50
Virtex-7 DSP48 high level
From Xilinx
1 bit 2 bit
May 10, 2014 R.Innocente 51
DSP48E1 details
May 10, 2014 R.Innocente 52
Altera Stratix V DSP block
4 multiplications + 3 additions = 7 flop
May 10, 2014 R.Innocente 53
Data Flow
Graphs (DFG)
May 10, 2014 R.Innocente 54
Data flow
A representation of a program as a DG(Directed Graph) in
which the nodes are the operations and the edges represent
the data dependencies from one operation to the next
May 10, 2014 R.Innocente 55
Control flow/Data Flow
dis2=b**2-4*a*c
If dis2 < 0 complex!
dis=sqrt(dis2)
u1=-b/(2*a)
u2=dis/(2*a)
x1=u1+u2
x2=u1-u2
$x = \frac{-b}{2a} \pm \frac{\sqrt{b^2 - 4ac}}{2a}$
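The data dependencies of the quadratic-root example can be written down as a small graph and levelled. The Python sketch below works at statement granularity (one node per assignment, which is a simplification of a real operation-level DFG); the `deps` dictionary and the `level` helper are illustrative names, not part of the slides.

```python
# node -> nodes it depends on; a, b, c are inputs and have no entry
deps = {
    "dis2": ["a", "b", "c"],   # dis2 = b**2 - 4*a*c
    "dis":  ["dis2"],          # dis  = sqrt(dis2)
    "u1":   ["a", "b"],        # u1   = -b/(2*a)
    "u2":   ["dis", "a"],      # u2   = dis/(2*a)
    "x1":   ["u1", "u2"],      # x1   = u1 + u2
    "x2":   ["u1", "u2"],      # x2   = u1 - u2
}

def level(node):
    """Earliest step at which a node can be computed (inputs are at level 0)."""
    if node not in deps:
        return 0
    return 1 + max(level(d) for d in deps[node])

levels = {node: level(node) for node in deps}
by_level = {}
for node, lv in levels.items():
    by_level.setdefault(lv, []).append(node)
for lv in sorted(by_level):
    print(lv, by_level[lv])   # nodes on the same level can run in parallel
```

Nodes that share a level have no dependency on each other, so on an FPGA they can be computed by separate hardware at the same time: six statements fit in four levels here.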
May 10, 2014 R.Innocente 56
A scalar product
Fortran :
acc=0.0
do i=1,4
acc=acc+a(i)*b(i)
enddo
C :
acc=0.0;
for(i=0;i<4;i++){
acc=acc+a[i]*b[i];
}
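The same scalar product can be laid out in time (one multiply-accumulate unit reused for every element) or in space (one multiplier per element followed by an adder tree). The Python sketch below only counts idealized steps, assuming every multiplier and adder takes one step; the function names are made up for this illustration.

```python
def dot_sequential(a, b):
    """One multiply-accumulate unit reused n times: n steps, little hardware."""
    acc = 0.0
    steps = 0
    for x, y in zip(a, b):
        acc += x * y
        steps += 1
    return acc, steps

def dot_parallel(a, b):
    """n multipliers working at once, then an adder tree: 1 + ceil(log2 n) steps."""
    level = [x * y for x, y in zip(a, b)]   # all products in a single step
    steps = 1
    while len(level) > 1:
        pairs = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                  # odd element passes through unchanged
            pairs.append(level[-1])
        level = pairs
        steps += 1
    return level[0], steps

a = [1.0, 2.0, 3.0, 4.0]
b = [5.0, 6.0, 7.0, 8.0]
print(dot_sequential(a, b))   # result and number of steps with 1 MAC unit
print(dot_parallel(a, b))     # result and number of steps with 4 multipliers + 3 adders
```

The parallel version spends more area (4 multipliers and 3 adders instead of one MAC unit) to finish in fewer steps; with pipelining, which this sketch ignores, it can also accept a new pair of vectors every step.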
May 10, 2014 R.Innocente 57
Time/Space tradeoffs
May 10, 2014 R.Innocente 58
Systolic array matrix mult
A(n,n) x B(n,n) requires:
- 2n-1 steps for the last elements to enter the array
- n-1 steps to compute the last element c(n,n)
- n steps to move the result out
Total: 4n-2 steps
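The step count can be checked with a small timing model in Python. This is a sketch under stated assumptions, not a register-level simulation: it assumes an n x n array in which each cell (i, j) accumulates c(i,j), with the rows of A and the columns of B entering skewed by one step per row/column, so that cell (i, j) sees the pair a(i,k), b(k,j) at step i + j + k (counting from 0). The function name is made up for this example.

```python
def systolic_matmul(A, B):
    """Timing model of an output-stationary systolic array (assumed layout, see text)."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    steps = 0
    pending = n ** 3                       # multiply-accumulates still to do
    while pending:
        for i in range(n):
            for j in range(n):
                k = steps - i - j          # operand pair reaching cell (i, j) now
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
                    pending -= 1
        steps += 1
    return C, steps

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
n = len(A)
C, compute_steps = systolic_matmul(A, B)
total_steps = compute_steps + n            # plus n steps to move the result out
print(C)
print(compute_steps, total_steps)
```

In this model the computation ends after (2n-1) + (n-1) = 3n-2 steps, and draining the result adds n more, which gives the 4n-2 of the slide.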
May 10, 2014 R.Innocente 59
Codesign
The implementation of algorithms on FPGAs requires a mix
of hw and sw design :
Codesign = hw design + sw design
May 10, 2014 R.Innocente 60
How to program FPGAs?
Mainly with an HDL (Hardware Description Language):
- Verilog (initially developed by Gateway Design Automation, now a standard)
- VHDL (born out of a standards committee)
Translators from OpenCL, ImpulseC, SystemC, C, Handel-C and others are also available. Is this a good idea?
The problem is that those languages were not designed to describe hardware, so the translation usually ends up as an FSM (finite state machine) with one state per statement, which then steps through its states.
This is not how a skilled designer would program the FPGA.
Figure: FSM (finite state machine). The input and the current state feed the next-state logic; the state register (D/Q, clocked by clk) holds the state; the output logic produces Out.
May 10, 2014 R.Innocente 61
FPGA will win
For many years FPGAs were just prototyping vehicles for ASICs
– Now they are replacing many ASICs & ASSPs
– Watch for the same Trojan effect with FPGAs in HPC
May 10, 2014 R.Innocente 62
FPGA lingo
May 10, 2014 R.Innocente 63
Core
In FPGA lingo a core is a function ready to be instantiated in your design as a “black box”. It can be supplied as HDL or as a schematic.
It supports design re-use.
May 10, 2014 R.Innocente 64
Soft/hard cores
On FPGAs, functional modules can be implemented:
- using standard FPGA resources (logic blocks, DSPs, memory blocks): soft cores
- as an ASIC on the FPGA die: hard cores
When the manufacturer puts a processor on the FPGA as a hard core, it sells the device as a SoC (System on Chip):
dual ARM on the Zynq-7000 chip, PowerPC on Xilinx Virtex-II Pro/Virtex-4 FX/Virtex-5 FXT FPGAs
May 10, 2014 R.Innocente 65
IP/open cores
The "soft" attribute is implied.
Hardware designs written in an HDL (possibly using vendor libraries):
- open source cores: http://opencores.org/
  the OpenRISC 1000 architecture from the OpenCores community, the Lattice Semiconductor LM32, the LEON3 from Aeroflex Gaisler and the OpenSPARC family from Oracle
- proprietary: IP (Intellectual Property) cores
  floating point operators, FFT, matrix computations
May 10, 2014 R.Innocente 66
Commercial offers
May 10, 2014 R.Innocente 67
Picocomputing
SC6 1U: up to 16 FPGAs
SC6 4U: up to 48 FPGAs
EX-600, EX-800
From PICOCOMPUTING
May 10, 2014 R.Innocente 68
Bittware Terabox
16 Altera Stratix V FPGAs
From Bittware
May 10, 2014 R.Innocente 69
DINIGROUP Cluster of 4 Virtex7
From
DINIGROUP
May 10, 2014 R.Innocente 70
Dinigroup Cluster 40 Kintex-7
From DINIGROUP
May 10, 2014 R.Innocente 71
Maxeler MPC-X
Daresbury Lab UK :
The dataflow supercomputer
will feature Maxeler developed
MPC-X nodes capable of an
equivalent 8.52TFLOPs per
1U and 8.97 GFLOPs/Watt.
May 10, 2014 R.Innocente 72
Convey HC-2 , HC-2ex
May 10, 2014 R.Innocente 73
Cray XT5h
“Cray introduces an hybrid supercomputer that can integrate multiple processor architectures into a single system and accelerate high performance computing (HPC) workflows. The Cray XT5h delivers higher sustained performance, by applying alternative processor architectures across selected applications within an HPC workflow. The Cray XT5h supports a variety of processor technologies, including scalar processors based on AMD Opteron™ dual and quad-core technologies, vector processors, and FPGA accelerators.”
May 10, 2014 R.Innocente 74
CHREC
Center for High Performance
Reconfigurable Computing
UF/BYU/GWU/VTECH
May 10, 2014 R.Innocente 75
CHREC Novo-G 384 FPGAs
“Novo-G is the most powerful reconfigurable supercomputer in the known world. This unique machine features 192 top-end, 40nm FPGAs (Altera Stratix-IV E530) and 192 top-end, 65nm FPGAs (Stratix-III E260).”
http://www.chrec.org/
(pronounced like "shreck")
May 10, 2014 R.Innocente 76
CHREC/2
BLAST, like Smith-Waterman, computes the local alignment of 2 sequences:
- Novo-BLAST, the Novo-G/CHREC implementation: faster, same sensitivity
IPC (Isotope Pattern Calculator) of a protein identification algorithm:
- speed-up of 52-366 on a single FPGA, 1259 on 4 FPGAs, 3340 on a node (16 FPGAs)
May 10, 2014 R.Innocente 77
References for
Applications
May 10, 2014 R.Innocente 78
Linear Algebra for RC
Juan Gonzalez and Rafael C. Núñez
LAPACKrc: Fast linear algebra kernels/solvers for FPGA accelerators (JP 2009)
DOD funded
May 10, 2014 R.Innocente 79
DCT, FFT on FPGAs
U. Meyer-Baese, Digital Signal Processing with Field Programmable Gate Arrays, 3rd edition (2007), Springer-Verlag
May 10, 2014 R.Innocente 80
MD on FPGA
Many papers report on porting Molecular Dynamics algorithms to FPGAs, with substantially positive conclusions from experiments on 1-2 FPGAs. In recent years, however, they have faced an embarrassing comparison with ANTON (Shaw et al.).
We can't forget that ANTON is a really huge machine consuming over 100 kW!
And it is made of 512 dedicated ASICs running at 1 GHz!
Comparing it with a few FPGAs consuming 40-60 W is improper.
FPGA-Accelerated Molecular Dynamics (2013), M. A. Khan, M. Chiu, M. C. Herbordt
May 10, 2014 R.Innocente 81
Neural networks on FPGAs
FPGA Implementations of Neural Networks
Editors: Omondi, Rajapakse (2006)
An ANN (Artificial Neural Network) in integer arithmetic performs 40x better than on a GPP (on an old FPGA, 3 generations old)
May 10, 2014 R.Innocente 82
Altera Arria 10
May 10, 2014 R.Innocente 83
Arria10
May 10, 2014 R.Innocente 84
Arria 10 variable precision DSP block
Figure from Altera: DSP block diagram with inputs A, B, C, D
A + C*D = 2 flop
May 10, 2014 R.Innocente 85
Arria10 estimated sp fp performance
- 2 flops per cycle per DSP block
- 1688 single precision floating point DSP blocks (GX660)
1688 * 2 = 3376 flops per cycle
3376 * 0.5 GHz ~ 1.7 Teraflops in single precision
May 10, 2014 R.Innocente 86
Hard single prec FP on FPGA ?!?
For people who can live with single precision this looks like a very attractive new feature.
But many think it wastes too many generic resources, and claim that what was missing were simpler blocks!
May 10, 2014 R.Innocente 87
Back of the envelope
performance estimation
May 10, 2014 R.Innocente 88
Back of the envelope performance estimation
Given the number of
- LUTs
- FFs
- DSPs
offered by an FPGA, and the resources each operator uses, estimate the maximum number of operators that can be implemented on the FPGA
Today FPGA clocks are ~500 MHz = 0.5 GHz (the unavoidable price for flexibility)
2000 flops per cycle = 1 Teraflops
May 10, 2014 R.Innocente 89
Xilinx Virtex-7 family
Virtex-7 slices: 4 x 6-input LUTs, 8 FFs
Virtex-7 DSPs: 48-bit pre-adder, 25x18 multiplier, 48-bit accumulator
1 Virtex-7 LUT ~ 1.6 standard LUTs
May 10, 2014 R.Innocente 90
Custom precision 17/24 bits floating
op   dsp  lut+f  lut  f    #     tot dsp  tot lut  tot f
*    2    103    90   112  1080  2160     208440   232200
*    1    113    97   104  0     0        0        0
*    0    377    336  376  0     0        0        0
+    0    369    301  393  1510  0        1011700  1150620
Tot                        2590  2160     1220140  1382820

Virtex-7 V2000T available resources
slices  LUT x slice  FF x slice  dsp   6-input LUT  ff
305400  4            8           2160  1221600      2443200
standard LUTs = 1.6 x 1221600 = 1954560
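The totals in the table can be reproduced with a short Python calculation. The reading of the columns is an assumption made for this sketch: "lut+f" is taken to count LUT/flip-flop pairs, so that an operator uses (lut+f + lut) LUTs and (lut+f + f) flip-flops; this reading does reproduce the slide's totals. The dictionary layout and the `totals` helper are illustrative.

```python
# operator -> (dsp, lut+f pairs, lut only, f only, number of instances placed)
operators = {
    "mul": (2, 103, 90, 112, 1080),
    "add": (0, 369, 301, 393, 1510),
}

# Virtex-7 V2000T: 305400 slices with 4 LUTs and 8 FFs each, 2160 DSPs
available = {"dsp": 2160, "lut": 305400 * 4, "ff": 305400 * 8}

def totals(ops):
    """Sum the resources used by all placed operators."""
    used = {"ops": 0, "dsp": 0, "lut": 0, "ff": 0}
    for dsp, pair, lut, ff, count in ops.values():
        used["ops"] += count
        used["dsp"] += dsp * count
        used["lut"] += (pair + lut) * count   # ASSUMED reading of the columns
        used["ff"] += (pair + ff) * count
    return used

used = totals(operators)
print(used)
print({k: used[k] <= available[k] for k in available})  # does the mix fit?
```

In this mix the DSPs and the LUTs are both nearly exhausted while about half of the flip-flops stay unused, so multiplier count is bounded by DSPs and adder count by LUTs.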
May 10, 2014 R.Innocente 91
IEEE single precision – 32 bits
op   dsp  lut+f  lut  f    #     tot dsp  tot lut  tot f
*    3    120    103  105  700   2100     156100   157500
*    2    160    128  160  0     0        0        0
*    1    331    283  331  0     0        0        0
*    0    665    629  669  0     0        0        0
+    2    293    225  327  25    50       12950    15500
+    0    500    407  541  1160  0        1052120  1207560
Tot                        1885  2150     1221170  1380560

Virtex-7 V2000T available resources
slices  LUT x slice  FF x slice  dsp   6-input LUT  ff
305400  4            8           2160  1221600      2443200
standard LUTs = 1.6 x 1221600 = 1954560
May 10, 2014 R.Innocente 92
IEEE double precision – 64 bits
op   dsp  lut+f  lut   f     #    tot dsp  tot lut  tot f
*    11   325    279   421   196  2156     118384   146216
*    10   371    299   456   0    0        0        0
*    9    439    356   510   0    0        0        0
*    0    2361   2317  2418  0    0        0        0
+    3    895    705   945   1    3        1600     1840
+    0    989    794   1029  617  0        1100111  1245106
Tot                          814  2159     1220095  1393162

Virtex-7 V2000T available resources
slices  LUT x slice  FF x slice  dsp   6-input LUT  ff
305400  4            8           2160  1221600      2443200
standard LUTs = 1.6 x 1221600 = 1954560
May 10, 2014 R.Innocente 93
Virtex UltraScale XCVU440 20nm -sampling out
IEEE double precision – 64 bits
op   dsp  lut+f  lut   f     #     tot dsp  tot lut  tot f
*    11   325    279   421   261   2871     157644   194706
*    10   371    299   456   0     0        0        0
*    9    439    356   510   0     0        0        0
*    0    2361   2317  2418  0     0        0        0
+    3    895    705   945   3     9        4800     5520
+    0    989    794   1029  1321  0        2355343  2665778
Tot                          1585  2880     2517787  2866004

Virtex UltraScale available resources
slices  LUT x slice  FF x slice  dsp   6-input LUT  ff
314820  8            16          2880  2518560      5037120
standard LUTs = 1.6 x 2518560 = 4029696
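The operator counts of the four tables translate into a peak performance estimate once a clock is fixed. The Python sketch below uses the ~0.5 GHz clock quoted earlier and assumes that every operator is fully pipelined and delivers one result per cycle; that assumption, the dictionary labels and the helper name are illustrative, and real designs would fall short of this peak.

```python
CLOCK_GHZ = 0.5   # ~500 MHz, the FPGA clock quoted in the estimation slide

# total number of operators (multipliers + adders) from the "Tot" rows
operators_per_chip = {
    "Virtex-7 2000T, custom 17/24 bit": 2590,
    "Virtex-7 2000T, IEEE single": 1885,
    "Virtex-7 2000T, IEEE double": 814,
    "Virtex UltraScale XCVU440, IEEE double": 1585,
}

def peak_gflops(n_operators, clock_ghz=CLOCK_GHZ):
    """One flop per operator per cycle (ASSUMES fully pipelined operators)."""
    return n_operators * clock_ghz

estimates = {name: peak_gflops(n) for name, n in operators_per_chip.items()}
for name, gflops in estimates.items():
    print(f"{name}: {gflops:.1f} Gflops")
```

The UltraScale double precision figure comes out just under 800 Gflops, which is the value used for the Virtex-US in the power comparison that follows.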
May 10, 2014 R.Innocente 94
Relative power dissipation/1
TDP / peak nominal double precision fp performance:
Intel Q6600 2.4 GHz              105 W /   38 Gflops = 2763 mW/Gflops
Intel Haswell i7-4770K 3.5 GHz    84 W /  112 Gflops =  750 mW/Gflops
Intel Ivy Bridge 3770K 3.5 GHz    77 W /  112 Gflops =  687 mW/Gflops
Nvidia Tesla M2090               225 W /  666 Gflops =  337 mW/Gflops
Nvidia Tesla K20X                235 W / 1310 Gflops =  179 mW/Gflops
Xilinx Virtex-US                  20 W /  800 Gflops =   25 mW/Gflops
Compared with the FPGA: GPUs need ~10x, CPUs ~30x more power per Gflops
FPGA computing = green computing
May 10, 2014 R.Innocente 95
Relative power dissipation/2
Bar chart: mW/Gflops (axis from 0 to 3000) for Intel Q6600 2.4 GHz, Intel i7-4770K, Intel i7-3770K, Tesla M2090, Tesla K20X and Virtex-7
May 10, 2014 R.Innocente 96
Gflops per Watt
Peak nominal double precision fp performance / TDP:
Intel Q6600 2.4 GHz               38 Gflops / 105 W =  0.36 Gflops/W
Intel Haswell i7-4770K 3.5 GHz   112 Gflops /  84 W =  1.33 Gflops/W
Intel Ivy Bridge 3770K 3.5 GHz   112 Gflops /  77 W =  1.45 Gflops/W
Nvidia Tesla M2090               666 Gflops / 225 W =  2.96 Gflops/W
Nvidia Tesla K20X               1310 Gflops / 235 W =  5.57 Gflops/W
Xilinx Virtex-US                 800 Gflops /  20 W = 40    Gflops/W
Compared with GPUs the FPGA delivers ~10x, compared with CPUs ~30x more Gflops per watt
FPGA computing = green computing
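Both efficiency indexes are simple ratios of the TDP and peak figures listed above, as the following Python sketch shows. The helper names are made up for this example; the FPGA entry relies on the estimated 800 Gflops and ~20 W, so its ratios are estimates as well.

```python
# (device, TDP in W, peak double precision Gflops) as listed on the slides
devices = [
    ("Intel Q6600 2.4 GHz", 105, 38),
    ("Intel Haswell i7-4770K", 84, 112),
    ("Intel Ivy Bridge 3770K", 77, 112),
    ("Nvidia Tesla M2090", 225, 666),
    ("Nvidia Tesla K20X", 235, 1310),
    ("Xilinx Virtex-US (estimate)", 20, 800),
]

def mw_per_gflops(tdp_w, gflops):
    return 1000.0 * tdp_w / gflops

def gflops_per_watt(tdp_w, gflops):
    return gflops / tdp_w

for name, tdp, gflops in devices:
    print(f"{name:28s} {mw_per_gflops(tdp, gflops):7.0f} mW/Gflops"
          f" {gflops_per_watt(tdp, gflops):6.2f} Gflops/W")

fpga = mw_per_gflops(20, 800)
print(mw_per_gflops(235, 1310) / fpga)   # best GPU in the list vs FPGA
print(mw_per_gflops(84, 112) / fpga)     # Haswell CPU vs FPGA
```

The last two lines give the ratio behind the "~10x" and "~30x" annotations: roughly 7x for the best GPU in the list and 30x for the Haswell CPU.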
May 10, 2014 R.Innocente 97
Top green500 list
Rank  Power (kW)  Mflops/W  Year  Cores   Name               Manufacturer     Country
1     28          4,503     2013  2720    TSUBAME-KFC        NEC              Japan
2     53          3,632     2013  5120    Wilkes             Dell             United Kingdom
3     79          3,518     2013  4864    HA-PACS TCA        Cray Inc.        Japan
4     1,754       3,186     2012  115984  Piz Daint          Cray Inc.        Switzerland
5     81          3,131     2013  5720    romeo              Bull SA          France
6     923         3,069     2013  74358   TSUBAME 2.5        NEC/HP           Japan
7     54          2,702     2013  3080    -                  IBM              United States
8     270         2,629     2013  15840   -                  IBM              Germany
9     56          2,629     2013  3264    -                  IBM              United States
10    71          2,359     2010  4620    CSIRO GPU Cluster  Xenon Systems    Australia
11    179         2,351     2012  38400   SANAM              Adtech           Saudi Arabia
12    82          2,299     2011  16384   -                  IBM              United States
13    82          2,299     2012  16384   Cetus              IBM              United States
14    82          2,299     2012  16384   -                  IBM              Poland
15    82          2,299     2013  16384   -                  IBM              United States
16    82          2,299     2012  16384   Vesta              IBM              United States
17    82          2,299     2012  16384   -                  IBM              United States
18    237         2,243     2013  10920   HPCC               Hewlett-Packard  United States

Computer, by rank:
1   LX 1U-4GPU/104Re-1G Cluster, Intel Xeon E5-2620v2 6C 2.100GHz, Infiniband FDR, NVIDIA K20x
2   Dell T620 Cluster, Intel Xeon E5-2630v2 6C 2.600GHz, Infiniband FDR, NVIDIA K20
3   Cray 3623G4-SM Cluster, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband QDR, NVIDIA K20x
4   Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x
5   Bull R421-E3 Cluster, Intel Xeon E5-2650v2 8C 2.600GHz, Infiniband FDR, NVIDIA K20x
6   Cluster Platform SL390s G7, Xeon X5670 6C 2.930GHz, Infiniband QDR, NVIDIA K20x
7   iDataPlex DX360M4, Intel Xeon E5-2650v2 8C 2.600GHz, Infiniband FDR14, NVIDIA K20x
8   iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x
9   iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x
10  Nitro G16 3GPU, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, Nvidia K20m
11  Adtech, ASUS ESC4000/FDR G2, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, AMD FirePro S10000
12  BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
13  BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
14  BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
15  BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
16  BlueGene/Q, Power BQC 16C 1.60GHz, Custom
17  BlueGene/Q, Power BQC 16C 1.60GHz, Custom
18  Cluster Platform SL250s Gen8, Xeon E5-2665 8C 2.400GHz, Infiniband FDR, Nvidia K20m
May 10, 2014 R.Innocente 98
Power/Energy efficiency
May 10, 2014 R.Innocente 99
Power Dissipation
A chip is made of millions of CMOS FETs. When an input switches, a small capacitance has to be charged:
$E_d = \frac{1}{2} C V^2$
Doing this f times a second gives, together with some constant static dissipation:
$P_T = k C V^2 f + P_s$
However, when the frequency is raised a lot the chip becomes unstable unless the voltage is raised as well (leakage). The behaviour versus f is therefore in fact superlinear.
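The two formulas are easy to play with numerically. In the Python sketch below every number (capacitance, voltages, frequencies, static power, the factor k) is a hypothetical value chosen only to make the arithmetic visible; none of them describes a real chip.

```python
def switching_energy(c, v):
    """Energy to charge a capacitance c to voltage v: E_d = 1/2 C V^2."""
    return 0.5 * c * v ** 2

def total_power(k, c, v, f, p_static):
    """P_T = k C V^2 f + P_s (dynamic part plus constant static dissipation)."""
    return k * c * v ** 2 * f + p_static

# HYPOTHETICAL values: 1 nF of switched capacitance, 1 V, 1 GHz, 0.2 W static
base = total_power(k=1.0, c=1e-9, v=1.0, f=1e9, p_static=0.2)
faster = total_power(k=1.0, c=1e-9, v=1.0, f=2e9, p_static=0.2)        # 2x frequency
faster_hot = total_power(k=1.0, c=1e-9, v=1.2, f=2e9, p_static=0.2)    # 2x frequency, +20% voltage

print(base, faster, faster_hot)
print((faster_hot - 0.2) / (base - 0.2))   # growth of the dynamic part
```

Doubling f at constant V doubles the dynamic power, but if the voltage has to rise by 20% to keep the chip stable, the dynamic part grows by 2 x 1.2^2 = 2.88, which is the superlinear behaviour mentioned above.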
May 10, 2014 R.Innocente 100
Dennard scaling(1974)
Each generation shrinks linear dimensions by a factor S = 1.4:
- $S^2$ = 2x more transistors
- S = 1.4x lower capacitance
- scale Vdd by S => $S^2$ = 2x lower energy
- S = 1.4x faster transistors
Performance scales as $S^3$ = 2.8 while power density stays constant across generations
May 10, 2014 R.Innocente 101
Fred Pollack(Intel) famous graph(1999)
Power density increases!
In 2004/2005 we hit the power wall => frequency increases stopped
F. Pollack, “New microarchitecture challenges in the coming generations of CMOS process technology”
May 10, 2014 R.Innocente 102
End of Dennard scaling
- $S^2$ = 2x more transistors
- S = 1.4x lower capacitance
- S = 1.4x faster transistors
In submicron technology the voltage can no longer be scaled down (rigidity in voltage scaling).
Power increases by $S^2$ = 2
May 10, 2014 R.Innocente 103
MOS subthreshold current
Scaling down the geometry, you scale down the drain voltage to avoid high electric fields and to reduce the energy required to switch. You also have to scale down the threshold voltage to sustain the 30% decrease in gate delay. The small voltage swing that remains cannot turn the transistor off completely. Subthreshold leakage, which was ignored in the past, can consume up to ½ of the total power on modern VLSI chips.
May 10, 2014 R.Innocente 104
Subthreshold leakage
May 10, 2014 R.Innocente 105
$V_T$ design tradeoff
Plot: log $I_{DS}$ versus $V_{GS}$
- Low $V_T$ for high ON current: $I_{Dsat} \propto (V_{DD} - V_T)^2$
- High $V_T$ for low OFF current
Phenomenology:
a 60-200 mV swing of $V_{GS}$ decreases $I_{DS}$ by one order of magnitude. Today's $V_T$ of 0.5-0.2 V doesn't allow the $V_{GS}$ swing needed to shut off the transistor.
Low $V_T$ => high $I_{DS}$, good for the ON condition
High $V_T$ => low leakage, good for the OFF condition
May 10, 2014 R.Innocente 106
Multicore scaling
65 nm: 4 cores
45 nm: 8 cores
32 nm: 16 cores
Every generation brings 2x cores, at the same or slightly increasing frequency.
May 10, 2014 R.Innocente 107
Multicore scaling at constant frequency
- $S^2$ = 2x more transistors
- S = 1.4x lower capacitance
=> S = 1.4x lower utilization
We hit the utilization wall => dark silicon
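The three scaling regimes (Dennard scaling, the end of Dennard scaling, and multicore scaling at constant frequency) can be compared with one small model of full-chip dynamic power, proportional to (number of transistors) x C x V^2 x f on a chip of constant area. The Python sketch below is a simplification that ignores static power; the function name and its flags are made up for this example.

```python
S = 2 ** 0.5   # linear scaling factor per generation, about 1.4

def full_chip_power(s, scale_voltage, scale_frequency):
    """Relative power of a same-area chip after one generation, all transistors switching."""
    transistors = s ** 2                       # 2x more transistors
    capacitance = 1 / s                        # 1.4x lower capacitance each
    voltage = 1 / s if scale_voltage else 1.0  # Vdd scaled only in the Dennard regime
    frequency = s if scale_frequency else 1.0
    return transistors * capacitance * voltage ** 2 * frequency

dennard = full_chip_power(S, scale_voltage=True, scale_frequency=True)
post_dennard = full_chip_power(S, scale_voltage=False, scale_frequency=True)
constant_f = full_chip_power(S, scale_voltage=False, scale_frequency=False)
usable_fraction = 1 / constant_f               # share of the chip a fixed power budget can feed

print(dennard)          # ~1: power density stays constant
print(post_dennard)     # ~2: power increases by S^2
print(constant_f)       # ~1.4: even at constant frequency the power grows by S
print(usable_fraction)  # ~0.71: utilization drops by S each generation
```

With a fixed power budget, only about 71% of the transistors of each new generation can switch at full frequency, and the loss compounds from one generation to the next: this is the utilization wall.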
May 10, 2014 R.Innocente 108
End of multicore scaling
65 nm: 4 cores
45 nm: 5.7 cores
32 nm: 8 cores
Every generation brings 1.4x cores, at the same or slightly increasing frequency.
The rest is dark or dim silicon (“uncore”)
May 10, 2014 R.Innocente 109
Dark silicon and the end of multicore scaling
Doug Burger (Microsoft) at HiPEAC 2013:
- until 2004: each semiconductor generation gave us transistors that were smaller, faster and less power-hungry
- from 2004 to now: we still got smaller transistors, but we could not run them faster (power wall)
- in the future: we will still get smaller transistors, but we will not be able to use all of them at once (dark silicon) or at maximum speed.
May 10, 2014 R.Innocente 110
Scaling the utilization wall
G. Venkatesh, ASPLOS '10:
“while the area budget continues to increase exponentially, the power budget has become a first-order design constraint in current processors. In this regime, utilizing transistors to design specialized cores that optimize energy-per-computation becomes an effective approach to improve the system performance.”
“The Utilization Wall: With each successive process generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints.” [Venkatesh, ASPLOS '10]
Single-chip heterogeneous computer (E. Chung):
greater energy efficiency by combining GPPs with unconventional cores (U-cores): GPU, FPGA, DSP, ASICs ...
May 10, 2014 R.Innocente 111
3D FinFET promise
Below 20nm the roadmap is
to use 3D FinFETs :
- Faster : +37%
- Dynamic Power: -50%
- Static Power: -90%
KAIST demonstrated a 3nm
FinFET in lab
May 10, 2014 R.Innocente 112
The trouble with multicore
A famous article by David Patterson (of “Computer Architecture: A Quantitative Approach” fame) in IEEE Spectrum, 2010:
“Chipmakers are busy designing microprocessors that most
programmers can’t program”
“... the semiconductor industry threw the equivalent of a Hail
Mary pass when it switched from making microprocessors run
faster to putting more of them on a chip - doing so without any
clear notion of how such devices would in general be
programmed. The hope is that someone will be able to figure out
how to do that, but at the moment, the ball is still in the air.”
May 10, 2014 R.Innocente 113
Verilog
May 10, 2014 R.Innocente 114
Using Verilog
You write a functional specification, usually split into modules, that documents the exact behaviour of the system.
Functional design: HDL (Verilog) -> logic synthesis -> gate netlist
Physical design: gate netlist -> place & route -> FPGA / ASIC
Simulated annealing is used in place & route!
NB. place and route of a large design can take 1 day on a fast CPU!
May 10, 2014 R.Innocente 115
Verilog/1
Basic module:
// comments are written this way
module name(input x0, x1, input [3:0] y, output out);
  // x0 and x1 are wires, y is a 4-wire bus
  // out is an output wire
  // (ports declared like this are wires by default)
  // combinational logic uses assign, for example:
  assign out = x0 & x1 & y[0];
endmodule
May 10, 2014 R.Innocente 116
Verilog/2
Combinational circuit:
// computes y = (not a) b c + a (not b) (not c), and z = not c
module dummy(input a, b, c, output y, z);
  // ports declared this way are already wires
  assign y = (~a & b & c) | (a & ~b & ~c);
  assign z = ~c;
endmodule
This is not C!
a, b, c, y, z are wires, and y and z change whenever a, b or c changes. To avoid this drama in complex circuits we use synchronous logic
(everything advances in steps between docking stations = flip-flops)
May 10, 2014 R.Innocente 117
Verilog/3
May 10, 2014 R.Innocente 118
Verilog/4
A sequential circuit:
// a flip-flop described in Verilog
module ff(input d, clk, output reg q, qbar);
  always @(posedge clk)
    begin
      q <= d;
      qbar <= ~d;
    end
endmodule
At a rising edge of the wire clk, copy the signal d to q and the inverse of d to qbar
May 10, 2014 R.Innocente 119
Verilog/5
May 10, 2014 R.Innocente 120
Verilog/6
A more complicated sequential circuit:
// in Verilog: a flip-flop with asynchronous clear/reset
module ff(input d, clk, clr, output reg q, qbar);
  always @(posedge clk, posedge clr)
    if (clr)
      begin
        q <= 0;
        qbar <= 1;
      end
    else
      begin
        q <= d;
        qbar <= ~d;
      end
endmodule
At a rising edge of the wire clr, set q=0 (and qbar=1); at a rising edge of clk, copy the signal d to q and the inverse of d to qbar
May 10, 2014 R.Innocente 121
Verilog/7
May 10, 2014 R.Innocente 122
BORPH : Berkeley Operating system for
ReProgrammable Hardware
PETALINUX : Xilinx linux for Zynq et al.
May 10, 2014 R.Innocente 123
BORPH: Berkeley Operating system for ReProgrammable Hardware
- Idea of a HW unix process: it has a pid and can be killed like a normal unix process, but is in fact a HW instance on the FPGA
- ioreg Virtual File System interface
May 10, 2014 R.Innocente 124
Xilinx Petalinux
The PetaLinux Software Development Kit (SDK) is a development tool that contains everything necessary to build, develop, test and deploy embedded Linux systems on Zynq-7000, ZedBoard and Kintex-7 boards.
PetaLinux consists of: pre-configured bootable binary images, a fully customizable Linux for the Xilinx device, and the PetaLinux SDK, which includes tools and utilities to automate complex tasks across configuration, build and deployment.
PetaLinux is offered under two separate licenses: a no-charge evaluation license or commercial licenses.
May 10, 2014 R.Innocente 125
END

FPGA/Reconfigurable computing (HPRC)

  • 1. May 10, 2014 R.Innocente 1 Reconfigurable ComputingReconfigurable Computing Roberto Innocente inno@sissa.it
  • 2. May 10, 2014 R.Innocente 2 Flexibility ASIC Application Specific Integrated Circuit Very inflexible,designed to solve just 1 problem. Energy, space and time efficient GPP General Purpose Processor Very flexible, can solve any problem. Energy, space and time inefficient ? Reconfigurable Hardware Flexible, But enough energy, time and space efficient +-
  • 3. May 10, 2014 R.Innocente 3 History
  • 4. May 10, 2014 R.Innocente 4 Gerald Estrin/1 is credited with the idea, in the '60, of the first reconfigurable (F+V) FIX+Variable computer Gerald Estrin. ACM 1960. Organization of computer systems: the fixed plus variable structure computer.
  • 5. May 10, 2014 R.Innocente 5 Gerald Estrin/2 He envisioned that important gains in performance could be achieved when many computations are executed on appropriate problem oriented configurations. F+V is made of : - high speed general computer(the F part) : initially an ibm7090 - various size high speed special structures (the V part) problem specific: trigonometric functions, logarithms, exponential, n-th powers, complex arithmetic, … V is made of a 36 module positions motherboard which can undergo : - Function reconfiguration: physically changing some modules - Routing reconfiguration : changing part of the back wiring The Rammig machine (1977) : investigation of a reconfigurable machine with no manual or mechanical intervention
  • 6. May 10, 2014 R.Innocente 6 Today reconfigurable hardware Is born out of the will to replace different logic IC(Integrated Circuits), and successively to rapidly prototype large ASICs(Application Specific ICs) or implement SoCs (Sytem On Chip). In the following slides readers are supposed to be involved in scientific computing and not EE engineers.
  • 7. May 10, 2014 R.Innocente 7 Basic digital circuits AND INVERTER Shift RegD Type FFMUX Usually 0=0V, 1=some positive voltage OR
  • 8. May 10, 2014 R.Innocente 8 SSI 74xx IC
  • 9. May 10, 2014 R.Innocente 9 PLD Inconvenience of standard discrete logic circuits : - 14 pin packages of 4/6 logic functions - often you had to traverse the PCB to find a free OR or inverter - if you needed only a few, you had in any case to put an IC with 4/6 Therefore came the idea of PLD (Programmable Logic Device) : - SPLD (Simple : PAL/PLA) - CPLD (Complex) In which a simple interconnection network could be configured melting some internal fuses (fuse technology) to implement combinatorial logic.
  • 10. May 10, 2014 R.Innocente 10 disjunctive normal form (aka Sum of products ) Each boolean function of some boolean variables can be represented as a sum of minterms (product of all variables or their complement) . With 3 boolean vars : a,b,c are 2 of the 23 = 8 minterms f (a ,b , c)=a ̄b c+̄a b ̄c ābc,̄ab̄c
  • 11. May 10, 2014 R.Innocente 11 PLA (Programmable Logic Array) f1= p1+ p2 + p3=x1x2 + x1 ̄x3+ ̄x1 ̄x2 x3+ x1 x3
  • 12. May 10, 2014 R.Innocente 12 FPGA CPLDs also showed their limits, so in 1985/1990 Xilinx introduced a more flexible design, the FPGA (Field Programmable Gate Array), in which the interconnection network is much more flexible and onto which sequential circuits can also be easily mapped.
  • 13. May 10, 2014 R.Innocente 13 FPGA idea 1985 Xilinx – Ross Freeman (inventor of FPGA): “What if we could develop the equivalent of a circuit board full of standard logic parts (like TTL and PAL devices) on a single high density programmable logic chip ?” - post fabrication programmability by end users - fabless semiconductor company
  • 14. May 10, 2014 R.Innocente 14 Today
  • 15. May 10, 2014 R.Innocente 15 FPGA market Dominated by 2 players: - Altera - Xilinx. Up from 67% in 2010, together they now hold 90% of the market (4.5 billion USD revenue in 2012). From sourcetech411 (2010)
  • 16. May 10, 2014 R.Innocente 16 An important question: are FPGAs green?
    FPGA: ~20 W. Virtex-7 2000T (one of the top FPGAs): Xilinx showed 3600 copies of its 8-bit PicoBlaze processor running on a Virtex-7, consuming 20 W
    CPU: ~100 W. Core i7-4770K Haswell (22 nm), 3.5 GHz, 4 cores: 84 W; Core i7-3930K Sandy Bridge-E (32 nm), 3.2 GHz, 6 cores: 130 W; Xeon E7458 Dunnington (45 nm), 2.4 GHz: 90 W; Xeon E7460 Dunnington (45 nm), 2.66 GHz: 130 W
    GPU: ~220 W. Nvidia Tesla M2090: 225 W; Nvidia Tesla K20X: 235 W
    This is only a partial answer: we need to be able to estimate FPGA performance to give a more useful index.
  • 17. May 10, 2014 R.Innocente 17 FPGA architecture From RF and Wireless World Sea of gates : logic blocks are like islands in a sea of interconnections
  • 18. May 10, 2014 R.Innocente 18 Virtex family (from L. Zhuo)
    Year | Family | Process | Clock | Size
    1998 | Virtex | 250 nm | 100 MHz | 25k-60k cells
    2000 | Virtex-E | 180 nm | 300 MHz | 1k-70k cells
    2000 | Virtex-II | 150 nm | 420 MHz | up to 168 multipliers, up to 93k 4-LUTs
    2005 | Virtex-4 | 90 nm | 500 MHz | up to 200k cells
    2007 | Virtex-5 | 65 nm | 550 MHz | up to 330k cells
    - | Virtex-6 | 40 nm | - | 288-2k DSP, up to 500k 6-LUTs
    2010 | Virtex-7 | 28 nm | ~500 MHz | up to 2000k cells
    2014 | Virtex-US | 20 nm | - | up to 4400k cells
    Transistor counts: FPGA up to ~7 billion; Intel 2014 15-core Xeon Ivy Bridge-EX ~4.3 billion; Nvidia 2012 GK110 Kepler ~7 billion
  • 19. May 10, 2014 R.Innocente 19 FPGA/CPU evolution
  • 20. May 10, 2014 R.Innocente 20 Virtex-7 is not monolithic. 2.5D technology: 4 FPGA tiles on a silicon interposer that provides 10k interconnections between layers
  • 21. May 10, 2014 R.Innocente 21 Enabling technologies
  • 22. May 10, 2014 R.Innocente 22 Programming technology/1 Figure labels: Antifuse, OTP (one-time programmable), "disordered except at very low range"; SRAM, pass transistor in switch block
  • 23. May 10, 2014 R.Innocente 23 Programming technology/2 Antifuse - pros: cheap, small - cons: requires special processing, one-time programming. SRAM - pros: can be fabricated with a standard semiconductor process, can be easily reprogrammed - cons: large area required (6 transistors)
  • 24. May 10, 2014 R.Innocente 24 Confware The configuration of an FPGA (which gets compiled to a stream of bits) is neither hardware nor software. Someone coined the neologism confware: the configuration of reconfigurable hardware.
  • 25. May 10, 2014 R.Innocente 25 How do you configure an FPGA? The SRAM cells form a long shift register, loaded serially by clocking in the confware. Virtex-7 2000T = 440 Mbits of SRAM cells (simplified: large FPGAs can also load the confware in parallel)
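A toy model of serial configuration, assuming the SRAM cells behave as one plain shift register. The function name `load_serial` and the 100 MHz configuration clock are hypothetical, chosen only to give a feel for the load time of 440 Mbits.

```python
def load_serial(bitstream, n_cells):
    """Clock a bitstream into a chain of n_cells SRAM cells, one bit per clock."""
    cells = [0] * n_cells
    for bit in bitstream:
        # on each clock every cell takes its neighbour's value; the new bit enters at cell 0
        cells = [bit] + cells[:-1]
    return cells

print(load_serial([1, 0, 1, 1], 4))  # the first bit shifted in ends up in the last cell

CONFIG_BITS = 440_000_000          # from the slide: Virtex-7 2000T
ASSUMED_CLOCK_HZ = 100_000_000     # assumption, not a vendor figure
load_seconds = CONFIG_BITS / ASSUMED_CLOCK_HZ
print(f"purely serial load at the assumed clock: {load_seconds:.1f} s")
```

Under these assumptions a purely serial load takes seconds, which suggests why large devices also offer parallel loading.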
  • 26. May 10, 2014 R.Innocente 26 Logic Blocks/Logic Cells
  • 27. May 10, 2014 R.Innocente 27 Fine/coarse grain logic blocks From: - a single transistor (Crosspoint: went bankrupt) - a logic gate To: - a complete processor (FPNA: field programmable node array). NB: FPNA also stands for field programmable neural array
  • 28. May 10, 2014 R.Innocente 28 CLB (Configurable Logic Blocks) Homogeneous: - Logic cells: 4-input LUT (LookUp Table) + flip-flop. Heterogeneous (modern development): - Logic cells - DSP (Digital Signal Processing) - Memory blocks - I/O blocks. The differentiation is needed so that operations like multiplication/addition can be mapped efficiently. The heterogeneous architecture is prevalent now. The blocks are configured by SRAM bits, usually loaded through serial ports as already pointed out.
  • 29. May 10, 2014 R.Innocente 29 Standard Logic Cell 4 input LUT D type FlipFlop 16 bits of SRAM for conf 1 bit SRAM conf 2:1 Mux
  • 30. May 10, 2014 R.Innocente 30 Standard LUT (Look Up Table) - a 16 x 1 memory - any boolean function of 4 inputs (Bit 0, Bit 1, Bit 2, Bit 3)
    Dec | Bin | Out
    0 | 0000 | 0
    1 | 0001 | 1
    2 | 0010 | 0
    3 | 0011 | 0
    4 | 0100 | 1
    5 | 0101 | 0
    6 | 0110 | 1
    7 | 0111 | 1
    .. | .. | ..
    $f = \bar{x}_3\bar{x}_2\bar{x}_1x_0 + \bar{x}_3x_2\bar{x}_1\bar{x}_0 + \bar{x}_3x_2x_1\bar{x}_0 + \bar{x}_3x_2x_1x_0$
    NB. LUT rhymes with nut
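A small sketch of a 4-input LUT as a 16 x 1 memory addressed by its inputs. The first 8 entries come from the slide's table; entries 8 to 15 are elided on the slide and are taken as 0 here, which is what the slide's formula implies (every term contains the complement of x3). The names `make_lut` and `SLIDE_BITS` are illustrative.

```python
def make_lut(truth_bits):
    """Build a 4-input LUT: truth_bits[i] is the output for input value i (x3 x2 x1 x0)."""
    assert len(truth_bits) == 16
    def lut(x3, x2, x1, x0):
        address = (x3 << 3) | (x2 << 2) | (x1 << 1) | x0  # inputs form the memory address
        return truth_bits[address]
    return lut

# rows 0..7 from the slide; rows 8..15 assumed 0 (consistent with the formula)
SLIDE_BITS = [0, 1, 0, 0, 1, 0, 1, 1] + [0] * 8
lut = make_lut(SLIDE_BITS)

def f(x3, x2, x1, x0):
    """The same function written as the slide's sum of products."""
    n = lambda v: 1 - v  # complement of a 0/1 value
    return ((n(x3) & n(x2) & n(x1) & x0) | (n(x3) & x2 & n(x1) & n(x0))
            | (n(x3) & x2 & x1 & n(x0)) | (n(x3) & x2 & x1 & x0))

print(lut(0, 1, 1, 0), f(0, 1, 1, 0))
```

Changing the 16 stored bits changes the function, with no change to the circuit: this is the sense in which the logic cell is configured by SRAM.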
  • 31. May 10, 2014 R.Innocente 31 Uses of Logic Cell 2^4 = 16 x 1 bit memory Any boolean function of 4 inputs 4:1 multiplexer
  • 32. May 10, 2014 R.Innocente 32 Virtex-7 Logic Block basics
  • 33. May 10, 2014 R.Innocente 33 Virtex-7 Logic slice From Xilinx 4 x 32=128 bit shift reg
  • 34. May 10, 2014 R.Innocente 34 Virtex-7 CLB slice. Each LUT can be used as: - one 6-input LUT - two 5-input LUTs sharing the same inputs - two arbitrary boolean functions of 3 inputs and of 2 inputs or fewer
  • 35. May 10, 2014 R.Innocente 35 Altera ALM
  • 36. May 10, 2014 R.Innocente 36 Interconnection network
  • 37. May 10, 2014 R.Innocente 37 Interconnection network Hierarchical routing Island type routing(predominant) Interconnection network can consume 80% of the area of an FPGA ! Nearest neighbours
  • 38. May 10, 2014 R.Innocente 38 Programmable switch
  • 39. May 10, 2014 R.Innocente 39 SRAM routing: coarse/fine grain 5 bit SRAM 1 bit SRAM
  • 40. May 10, 2014 R.Innocente 40 Details of island type routing
  • 41. May 10, 2014 R.Innocente 41 Disjoint/Wilton switch blocks Disjoint: a wire can only exit on a wire with the same number, which creates routing domains. Wilton: a wire can change domain in at least one direction
  • 42. May 10, 2014 R.Innocente 42 Channel segments distribution
  • 43. May 10, 2014 R.Innocente 43 Columnar architecture 7 series Xilinx fpga Columnar architecture
  • 44. May 10, 2014 R.Innocente 44 DSP blocks & floating point
  • 45. May 10, 2014 R.Innocente 45 FPGAs floating point in 1994 B. Fagin and C. Renard. Field Programmable Gate Arrays and Floating Point Arithmetic. IEEE Transactions on VLSI Systems, 2(3), September 1994. Fagin & Renard report that you can implement floating point operators but it is impractical : no FPGA in existence could contain a single multiplier circuit !!
  • 46. May 10, 2014 R.Innocente 46 FPGA fp in 1995 Shirazi et al., along the same lines as Fagin & Renard, propose 2 custom fp formats of 16 and 18 bits in total, and provide add, sub, mul and div operators for them. N. Shirazi, A. Walters, and P. Athanas. Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines. In Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1995.
  • 47. May 10, 2014 R.Innocente 47 FPGA fp in 2002 Belanovic & Leeser present a library of variable width parameterized floating point operators (superset of the ieee formats) A Library of Parameterized Floating-point Modules and Their Use Pavle Belanovic and Miriam Leeser, 2002
  • 48. May 10, 2014 R.Innocente 48 What allowed the breakthrough? The addition, by the major vendors, of hardware multipliers (called DSP blocks) to their FPGAs from 2000 on: - first Xilinx on Virtex-II - soon after Altera on Stratix. In the last decade this also sparked the interest of the HPC community: Cray XD1, SGI RASC, Convey HC-1. HPRC = High Performance Reconfigurable Computing
  • 49. May 10, 2014 R.Innocente 49 FPGA MAC operation
  • 50. May 10, 2014 R.Innocente 50 Virtex-7 DSP48 high level From Xilinx 1 bit 2 bit
  • 51. May 10, 2014 R.Innocente 51 DSP48E1 details
  • 52. May 10, 2014 R.Innocente 52 Altera Stratix V DSP block 4 (*) + 3(+) = 7 flop
  • 53. May 10, 2014 R.Innocente 53 Data Flow Graphs (DFG)
  • 54. May 10, 2014 R.Innocente 54 Data flow A representation of a program as a DG(Directed Graph) in which the nodes are the operations and the edges represent the data dependencies from one operation to the next
  • 55. May 10, 2014 R.Innocente 55 Control flow/Data flow for $x = -\frac{b}{2a} \pm \frac{\sqrt{b^2 - 4ac}}{2a}$
    dis2=b**2-4*a*c
    If dis2 < 0 complex!
    dis=sqrt(dis2)
    u1=-b/(2*a)
    u2=dis/(2*a)
    x1=u1+u2
    x2=u1-u2
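The quadratic-root example can be written down as a data flow graph and scheduled level by level; operations on the same level have no mutual dependencies and could run in parallel in hardware. This is only a sketch: the node names (`b2`, `ac4`, `two_a`) and the choice to treat `4*a*c` as a single node are simplifications introduced here.

```python
# node -> nodes it depends on (the inputs a, b, c depend on nothing)
DFG = {
    "a": [], "b": [], "c": [],
    "b2": ["b"],              # b**2
    "ac4": ["a", "c"],        # 4*a*c, treated as one operation
    "two_a": ["a"],           # 2*a
    "dis2": ["b2", "ac4"],    # b**2 - 4*a*c
    "dis": ["dis2"],          # sqrt(dis2)
    "u1": ["b", "two_a"],     # -b/(2*a)
    "u2": ["dis", "two_a"],   # dis/(2*a)
    "x1": ["u1", "u2"],       # u1 + u2
    "x2": ["u1", "u2"],       # u1 - u2
}

def levels(dfg):
    """As-soon-as-possible level: inputs are level 0, an operation is 1 + its deepest input."""
    memo = {}
    def level(node):
        if node not in memo:
            deps = dfg[node]
            memo[node] = 0 if not deps else 1 + max(level(d) for d in deps)
        return memo[node]
    return {node: level(node) for node in dfg}

lv = levels(DFG)
by_level = {}
for node, l in lv.items():
    if l > 0:
        by_level.setdefault(l, []).append(node)
critical_path = max(lv.values())
print(by_level, critical_path)
```

With these assumptions the 9 operations fit in 5 levels, so a spatial implementation needs 5 operation delays rather than 9 sequential steps.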
  • 56. May 10, 2014 R.Innocente 56 A scalar product
    Fortran :
      acc=0.0
      do i=1,4
        acc=acc+a(i)*b(i)
      enddo
    C :
      acc=0.0;
      for(i=0;i<4;i++){
        acc=acc+a[i]*b[i];
      }
  • 57. May 10, 2014 R.Innocente 57 Time/Space tradeoffs
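One way to make the time/space tradeoff concrete for the 4-element scalar product: a single multiply-accumulate unit reused over time versus a fully unrolled design with one multiplier per element and an adder tree. This is a rough sketch under the assumption that every multiplier and adder takes one cycle and nothing is pipelined; the function names are illustrative.

```python
import math

def sequential_mac(n):
    """One multiplier and one adder, reused n times (small area, n cycles)."""
    return {"multipliers": 1, "adders": 1, "cycles": n}

def parallel_tree(n):
    """n multipliers in parallel, then a binary adder tree (large area, few cycles)."""
    return {"multipliers": n, "adders": n - 1, "cycles": 1 + math.ceil(math.log2(n))}

for n in (4, 64):
    print(n, sequential_mac(n), parallel_tree(n))
```

The unrolled version trades roughly n times more arithmetic units for a latency that grows only logarithmically with n.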
  • 58. May 10, 2014 R.Innocente 58 Systolic array matrix multiplication A(n,n) x B(n,n) requires: 2n-1 steps for the last elements to enter the array, plus n-1 steps to compute the last element c(n,n), plus n steps to move the result out, i.e. (2n-1) + (n-1) + n = 4n-2 steps
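The step count can be checked with a timing model of an n x n systolic array in which each processing element PE(i,j) accumulates c(i,j). The skewing rule used here (PE(i,j) sees a[i][k] and b[k][j] at step i+j+k, counting from 0) is a common textbook arrangement and an assumption about the slide's figure; the code models the timing, not the register-level data movement.

```python
def systolic_matmul(A, B):
    """Multiply two n x n matrices following the schedule of a skewed systolic array."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    steps = 0
    for t in range(3 * n - 2):          # (2n-1) steps to enter + (n-1) steps to finish
        for i in range(n):
            for j in range(n):
                k = t - i - j           # operand pair reaching PE(i,j) at step t
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]
        steps += 1
    return C, steps

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C, compute_steps = systolic_matmul(A, B)
total_steps = compute_steps + len(A)    # plus n steps to shift the result out
print(C, compute_steps, total_steps)
```

For n = 2 the model needs 4 compute steps and 6 steps in total, in agreement with 4n-2.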
  • 59. May 10, 2014 R.Innocente 59 Codesign The implementation of algorithms on FPGAs requires a mix of hw and sw design : Codesign = hw design + sw design
  • 60. May 10, 2014 R.Innocente 60 How to program FPGAs? Mainly with an HDL (Hardware Description Language): - Verilog (initially developed by Gateway Design Automation, now a standard) - VHDL (produced by a standards committee). But translators from OpenCL, ImpulseC, SystemC, C, Handel-C ... are also available. Is this a good idea? The problem is that those languages were not designed to describe hardware, and the translation usually ends up as an FSM (finite state machine) with one state per statement, which then steps through the states. This is not how a skilled designer would program the FPGA. Figure: FSM (finite state machine) with input, next-state logic, state register (D/Q, clk), output logic, out
  • 61. May 10, 2014 R.Innocente 61 FPGA will win For many years FPGAs were just prototyping vehicles for ASICs – Now they are replacing many ASICS & ASSPs – Watch for the same Trojan effect with FPGAs in HPC
  • 62. May 10, 2014 R.Innocente 62 FPGA lingo
  • 63. May 10, 2014 R.Innocente 63 Core In FPGA lingo a core is a function ready to be instantiated in your design as a “black box”. It can be supplied as HDL or as a schematic. It supports design reuse.
  • 64. May 10, 2014 R.Innocente 64 Soft/hard cores On FPGAs functional modules can be implemented: - using standard FPGA resources (logic blocks, DSPs, memory blocks): softcores - as an ASIC on the FPGA: hardcores. When the manufacturer puts a processor on the FPGA as a hardcore, it sells the device as a SoC (System on Chip): dual ARM on the Zynq-7000 chip, PowerPC on Altera FPGA
  • 65. May 10, 2014 R.Innocente 65 IP/open cores The soft attribute is implied. Hardware designs in an HDL (possibly using vendor libraries): - open source cores: http://opencores.org/ the OpenRISC 1000 architecture from the OpenCores community, the Lattice Semiconductor LM32, the LEON3 from Aeroflex Gaisler and the OpenSPARC family from Oracle - proprietary: IP (Intellectual Property) cores: floating point operators, FFT, matrix computations
  • 66. May 10, 2014 R.Innocente 66 Commercial offers
  • 67. May 10, 2014 R.Innocente 67 Picocomputing SC6 1U Upto 16 FPGA SC6 4U upto 48 EX-600EX-800 From PICOCOMPUTING
  • 68. May 10, 2014 R.Innocente 68 Bittware Terabox 16 altera stratix-V From Bittware
  • 69. May 10, 2014 R.Innocente 69 DINIGROUP Cluster of 4 Virtex7 From DINIGROUP
  • 70. May 10, 2014 R.Innocente 70 Dinigroup Cluster 40 Kintex-7 From DINIGROUP
  • 71. May 10, 2014 R.Innocente 71 Maxeler MPC-X Daresbury Lab UK : The dataflow supercomputer will feature Maxeler developed MPC-X nodes capable of an equivalent 8.52TFLOPs per 1U and 8.97 GFLOPs/Watt.
  • 72. May 10, 2014 R.Innocente 72 Convey HC-2 , HC-2ex
  • 73. May 10, 2014 R.Innocente 73 Cray XT5h “Cray introduces an hybrid supercomputer that can integrate multiple processor architectures into a single system and accelerate high performance computing (HPC) workflows. The Cray XT5h delivers higher sustained performance, by applying alternative processor architectures across selected applications within an HPC workflow. The Cray XT5h supports a variety of processor technologies, including scalar processors based on AMD Opteron™ dual and quad-core technologies, vector processors, and FPGA accelerators.”
  • 74. May 10, 2014 R.Innocente 74 CHREC Center for High Performance Reconfigurable Computing UF/BYU/GWU/VTECH
  • 75. May 10, 2014 R.Innocente 75 CHREC Novo-G 384 FPGAs “Novo-G is the most powerful reconfigurable supercomputer in the known world. This unique machine features 192 top-end, 40nm FPGAs (Altera Stratix-IV E530) and 192 top-end, 65nm FPGAs (Stratix-III E260). “ http://www.chrec.org/ (pronounce it as shreck)
  • 76. May 10, 2014 R.Innocente 76 CHREC/2 BLAST, like Smith-Waterman, computes the local alignment of 2 sequences: - Novo-BLAST, the Novo-G/CHREC implementation: faster, same sensitivity. IPC (Isotope Pattern Calculator) of the Protein Identification Algorithm: - speedup of 52-366 on a single FPGA, 1259 on 4 FPGAs, 3340 on a node (16 FPGAs)
  • 77. May 10, 2014 R.Innocente 77 References for Applications
  • 78. May 10, 2014 R.Innocente 78 Linear Algebra for RC Juan Gonzalez and Rafael C. Núñez LAPACKrc: Fast linear algebra kernels/solvers for FPGA accelerators(JP 2009) DOD funded
  • 79. May 10, 2014 R.Innocente 79 DCT, FFT on FPGAs Digital Signal Processing with Field Programmable Gate Arrays, 3rd edition (2007), U. Meyer-Baese, Springer Verlag
  • 80. May 10, 2014 R.Innocente 80 MD on FPGA There are many papers about porting Molecular Dynamics algorithms to FPGAs, with substantially positive conclusions from experiments on 1-2 FPGAs. But in recent years they have faced an embarrassing comparison with ANTON (Shaw et al.). We can't forget that ANTON is a really huge machine consuming over 100 kW, made of 512 dedicated ASICs at 1 GHz! Comparing it with a few FPGAs consuming 40/60 W is improper. FPGA-Accelerated Molecular Dynamics (2013), M. A. Khan, M. Chiu, M. C. Herbordt
  • 81. May 10, 2014 R.Innocente 81 Neural networks on FPGAs Editors: Omondi, Rajapakse (2006), FPGA Implementations of Neural Networks. An ANN (Artificial Neural Network) in integer arithmetic performs 40x better than on a GPP (on an old FPGA, 3 generations old)
  • 82. May 10, 2014 R.Innocente 82 Altera Arria 10
  • 83. May 10, 2014 R.Innocente 83 Arria10
  • 84. May 10, 2014 R.Innocente 84 Arria 10 variable precision DSP block Altera A B C D A+C*D = 2 flop
  • 85. May 10, 2014 R.Innocente 85 Arria 10 estimated single precision fp performance - 2 flops per cycle per DSP block - 1688 single precision fp DSP blocks (GX660): 1688 * 2 = 3376 flops per cycle; 3376 * 0.5 GHz ~ 1.7 Teraflops in single precision
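The estimate is simply operations per cycle times clock frequency. A minimal sketch using the slide's numbers (the function name `peak_gflops` is an illustrative choice):

```python
def peak_gflops(flops_per_cycle, clock_ghz):
    """Peak throughput in Gflops = flops issued per cycle x clock in GHz."""
    return flops_per_cycle * clock_ghz

DSP_BLOCKS = 1688        # Arria 10 GX660, from the slide
FLOPS_PER_DSP = 2        # one multiply + one add per cycle
CLOCK_GHZ = 0.5          # clock assumed on the slide

arria10_gflops = peak_gflops(DSP_BLOCKS * FLOPS_PER_DSP, CLOCK_GHZ)
print(f"{arria10_gflops:.0f} Gflops ~ {arria10_gflops / 1000:.1f} Teraflops")
```

This is a peak figure: it assumes every DSP block does useful work on every cycle.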
  • 86. May 10, 2014 R.Innocente 86 Hard single prec FP on FPGA ?!? For people that can live with single precision this seems a very attractive new feature. But many think that it is too much a waste of generic resources and claim that what was missing were simpler blocks !
  • 87. May 10, 2014 R.Innocente 87 Back of the envelope performance estimation
  • 88. May 10, 2014 R.Innocente 88 Back of the envelope performance estimation Given the number of - LUTs - FFs - DSPs offered by an FPGA, and the resources used by each operator, estimate the maximum number of operators that can be implemented on the FPGA. Today FPGA clocks are ~500 MHz = 0.5 GHz (the unavoidable price of flexibility), so 2000 flops per cycle = 1 Teraflops
  • 89. May 10, 2014 R.Innocente 89 Xilinx Virtex-7 family Virtex-7 slices : 4 x 6-input LUTs, 8 FFs Virtex-7 DSPs : 48 bits pre-adder, 25x18 multiplier, 48 bits accumulator Virtex LUT ~ 1.6 standard LUT
  • 90. May 10, 2014 R.Innocente 90 Custom precision 17/24 bits floating point
    op | dsp | lut+f | lut | f | # | tot dsp | tot lut | tot f
    * | 2 | 103 | 90 | 112 | 1080 | 2160 | 208440 | 232200
    * | 1 | 113 | 97 | 104 | 0 | 0 | 0 | 0
    * | 0 | 377 | 336 | 376 | 0 | 0 | 0 | 0
    + | 0 | 369 | 301 | 393 | 1510 | 0 | 1011700 | 1150620
    Tot | | | | | 2590 | 2160 | 1220140 | 1382820
    Virtex-7 V2000T available resources: slices 305400; LUTs per slice 4; FFs per slice 8; dsp 2160; 6-input LUTs 1221600; ff 2443200; at 1.6 standard LUTs per 6-input LUT: 1954560 standard LUTs
  • 91. May 10, 2014 R.Innocente 91 IEEE single precision - 32 bits
    op | dsp | lut+f | lut | f | # | tot dsp | tot lut | tot f
    * | 3 | 120 | 103 | 105 | 700 | 2100 | 156100 | 157500
    * | 2 | 160 | 128 | 160 | 0 | 0 | 0 | 0
    * | 1 | 331 | 283 | 331 | 0 | 0 | 0 | 0
    * | - | 665 | 629 | 669 | 0 | 0 | 0 | 0
    + | 2 | 293 | 225 | 327 | 25 | 50 | 12950 | 15500
    + | 0 | 500 | 407 | 541 | 1160 | 0 | 1052120 | 1207560
    Tot | | | | | 1885 | 2150 | 1221170 | 1380560
    Virtex-7 V2000T available resources: slices 305400; LUTs per slice 4; FFs per slice 8; dsp 2160; 6-input LUTs 1221600; ff 2443200; at 1.6 standard LUTs per 6-input LUT: 1954560 standard LUTs
  • 92. May 10, 2014 R.Innocente 92 IEEE double precision - 64 bits
    op | dsp | lut+f | lut | f | # | tot dsp | tot lut | tot f
    * | 11 | 325 | 279 | 421 | 196 | 2156 | 118384 | 146216
    * | 10 | 371 | 299 | 456 | 0 | 0 | 0 | 0
    * | 9 | 439 | 356 | 510 | 0 | 0 | 0 | 0
    * | - | 2361 | 2317 | 2418 | 0 | 0 | 0 | 0
    + | 3 | 895 | 705 | 945 | 1 | 3 | 1600 | 1840
    + | 0 | 989 | 794 | 1029 | 617 | 0 | 1100111 | 1245106
    Tot | | | | | 814 | 2159 | 1220095 | 1393162
    Virtex-7 V2000T available resources: slices 305400; LUTs per slice 4; FFs per slice 8; dsp 2160; 6-input LUTs 1221600; ff 2443200; at 1.6 standard LUTs per 6-input LUT: 1954560 standard LUTs
  • 93. May 10, 2014 R.Innocente 93 Virtex UltraScale XCVU440 20 nm (sampling) - IEEE double precision - 64 bits
    op | dsp | lut+f | lut | f | # | tot dsp | tot lut | tot f
    * | 11 | 325 | 279 | 421 | 261 | 2871 | 157644 | 194706
    * | 10 | 371 | 299 | 456 | 0 | 0 | 0 | 0
    * | 9 | 439 | 356 | 510 | 0 | 0 | 0 | 0
    * | - | 2361 | 2317 | 2418 | 0 | 0 | 0 | 0
    + | 3 | 895 | 705 | 945 | 3 | 9 | 4800 | 5520
    + | 0 | 989 | 794 | 1029 | 1321 | 0 | 2355343 | 2665778
    Tot | | | | | 1585 | 2880 | 2517787 | 2866004
    Virtex UltraScale available resources: slices 314820; LUTs per slice 8; FFs per slice 16; dsp 2880; 6-input LUTs 2518560; ff 5037120; at 1.6 standard LUTs per 6-input LUT: 4029696 standard LUTs
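The totals in these tables can be reproduced with a few lines. The sketch below uses the Virtex-7 double precision rows and assumes, as the arithmetic of the tables suggests, that an operator's LUT cost is `lut+f` plus `lut` and its FF cost is `lut+f` plus `f`. The 0.5 GHz clock is the one assumed in the back-of-the-envelope slide; the names `ROWS`, `AVAILABLE` and `totals` are illustrative.

```python
# (operator, dsp, lut+f, lut, f, instances): non-empty rows of the Virtex-7 double precision table
ROWS = [
    ("mul", 11, 325, 279, 421, 196),
    ("add", 3, 895, 705, 945, 1),
    ("add", 0, 989, 794, 1029, 617),
]
AVAILABLE = {"dsp": 2160, "lut": 1221600, "ff": 2443200}  # Virtex-7 V2000T

def totals(rows):
    """Sum the resources used by all operator instances."""
    return {
        "ops": sum(n for _, _, _, _, _, n in rows),
        "dsp": sum(dsp * n for _, dsp, _, _, _, n in rows),
        "lut": sum((pair + lut) * n for _, _, pair, lut, _, n in rows),  # inferred cost rule
        "ff": sum((pair + ff) * n for _, _, pair, _, ff, n in rows),     # inferred cost rule
    }

t = totals(ROWS)
fits = all(t[key] <= AVAILABLE[key] for key in AVAILABLE)
peak = t["ops"] * 0.5   # Gflops, one flop per operator per cycle at 0.5 GHz
print(t, fits, peak)
```

The same calculation with the UltraScale rows gives 1585 operators, i.e. roughly 800 Gflops at the same clock.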
  • 94. May 10, 2014 R.Innocente 94 Relative power dissipation/1 TDP / peak nominal double precision fp performance:
    Intel Q6600 2.4 GHz: 105 W / 38 Gflops = 2763 mW/Gflops
    Intel Haswell i7-4770K 3.5 GHz: 84 W / 112 Gflops = 750 mW/Gflops
    Intel Ivy Bridge 3770K 3.5 GHz: 77 W / 112 Gflops = 687 mW/Gflops
    Nvidia Tesla M2090: 225 W / 666 Gflops = 337 mW/Gflops
    Nvidia Tesla K20X: 235 W / 1310 Gflops = 179 mW/Gflops
    Xilinx Virtex-US: 20 W / 800 Gflops = 25 mW/Gflops
    FPGA computing = green computing (slide annotations: ~10x, ~30x)
  • 95. May 10, 2014 R.Innocente 95 Relative power dissipation/2 [Bar chart of mW/Gflops, scale 0-3000, for: Intel Q6600 2.4 GHz, Intel i7-4770K, Intel i7-3770K, Tesla M2090, Tesla K20X, Virtex-7]
  • 96. May 10, 2014 R.Innocente 96 Gflops per Watt Peak nominal double precision fp performance / TDP:
    Intel Q6600 2.4 GHz: 38 Gflops / 105 W = 0.36 Gflops/W
    Intel Haswell i7-4770K 3.5 GHz: 112 Gflops / 84 W = 1.33 Gflops/W
    Intel Ivy Bridge 3770K 3.5 GHz: 112 Gflops / 77 W = 1.45 Gflops/W
    Nvidia Tesla M2090: 666 Gflops / 225 W = 2.96 Gflops/W
    Nvidia Tesla K20X: 1310 Gflops / 235 W = 5.57 Gflops/W
    Xilinx Virtex-US: 800 Gflops / 20 W = 40 Gflops/W
    FPGA computing = green computing (slide annotations: ~10x, ~30x)
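Both efficiency indices follow directly from the peak Gflops and TDP figures quoted on the slides. A short sketch (the helper names are illustrative; the FPGA figure is the estimate derived earlier in the deck, not a measurement):

```python
# (device, peak double precision Gflops, TDP in W), as quoted on the slides
DEVICES = [
    ("Intel Q6600 2.4 GHz", 38, 105),
    ("Intel Haswell i7-4770K", 112, 84),
    ("Intel Ivy Bridge 3770K", 112, 77),
    ("Nvidia Tesla M2090", 666, 225),
    ("Nvidia Tesla K20X", 1310, 235),
    ("Xilinx Virtex-US (estimate)", 800, 20),
]

def gflops_per_watt(gflops, watts):
    return gflops / watts

def mw_per_gflops(gflops, watts):
    return 1000 * watts / gflops

for name, gflops, watts in DEVICES:
    print(f"{name:30s} {gflops_per_watt(gflops, watts):6.2f} Gflops/W "
          f"{mw_per_gflops(gflops, watts):7.0f} mW/Gflops")
```

On these figures the FPGA estimate is about 7x the K20X and about 30x the Haswell CPU in Gflops per Watt.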
  • 97. May 10, 2014 R.Innocente 97 Top green500 list
    Rank | Total power | Mflops/Watt | Year | Total cores | Name | Manufacturer | Country | Computer
    1 | 28 | 4,503 | 2013 | 2720 | TSUBAME-KFC | NEC | Japan | LX 1U-4GPU/104Re-1G Cluster, Intel Xeon E5-2620v2 6C 2.100GHz, Infiniband FDR, NVIDIA K20x
    2 | 53 | 3,632 | 2013 | 5120 | Wilkes | Dell | United Kingdom | Dell T620 Cluster, Intel Xeon E5-2630v2 6C 2.600GHz, Infiniband FDR, NVIDIA K20
    3 | 79 | 3,518 | 2013 | 4864 | HA-PACS TCA | Cray Inc. | Japan | Cray 3623G4-SM Cluster, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband QDR, NVIDIA K20x
    4 | 1,754 | 3,186 | 2012 | 115984 | Piz Daint | Cray Inc. | Switzerland | Cray XC30, Xeon E5-2670 8C 2.600GHz, Aries interconnect, NVIDIA K20x
    5 | 81 | 3,131 | 2013 | 5720 | romeo | Bull SA | France | Bull R421-E3 Cluster, Intel Xeon E5-2650v2 8C 2.600GHz, Infiniband FDR, NVIDIA K20x
    6 | 923 | 3,069 | 2013 | 74358 | TSUBAME 2.5 | NEC/HP | Japan | Cluster Platform SL390s G7, Xeon X5670 6C 2.930GHz, Infiniband QDR, NVIDIA K20x
    7 | 54 | 2,702 | 2013 | 3080 | - | IBM | United States | iDataPlex DX360M4, Intel Xeon E5-2650v2 8C 2.600GHz, Infiniband FDR14, NVIDIA K20x
    8 | 270 | 2,629 | 2013 | 15840 | - | IBM | Germany | iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x
    9 | 56 | 2,629 | 2013 | 3264 | - | IBM | United States | iDataPlex DX360M4, Intel Xeon E5-2680v2 10C 2.800GHz, Infiniband, NVIDIA K20x
    10 | 71 | 2,359 | 2010 | 4620 | CSIRO GPU Cluster | Xenon Systems | Australia | Nitro G16 3GPU, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, Nvidia K20m
    11 | 179 | 2,351 | 2012 | 38400 | SANAM | Adtech | Saudi Arabia | Adtech, ASUS ESC4000/FDR G2, Xeon E5-2650 8C 2.000GHz, Infiniband FDR, AMD FirePro S10000
    12 | 82 | 2,299 | 2011 | 16384 | - | IBM | United States | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom
    13 | 82 | 2,299 | 2012 | 16384 | Cetus | IBM | United States | BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
    14 | 82 | 2,299 | 2012 | 16384 | - | IBM | Poland | BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
    15 | 82 | 2,299 | 2013 | 16384 | - | IBM | United States | BlueGene/Q, Power BQC 16C 1.600GHz, Custom Interconnect
    16 | 82 | 2,299 | 2012 | 16384 | Vesta | IBM | United States | BlueGene/Q, Power BQC 16C 1.60GHz, Custom
    17 | 82 | 2,299 | 2012 | 16384 | - | IBM | United States | BlueGene/Q, Power BQC 16C 1.60GHz, Custom
    18 | 237 | 2,243 | 2013 | 10920 | HPCC | Hewlett-Packard | United States | Cluster Platform SL250s Gen8, Xeon E5-2665 8C 2.400GHz, Infiniband FDR, Nvidia K20m
  • 98. May 10, 2014 R.Innocente 98 Power/Energy efficiency
  • 99. May 10, 2014 R.Innocente 99 Power Dissipation A chip is made of millions of CMOS FETs. When an input switches, you need to charge the small capacitance: $E_d = \frac{1}{2} C V^2$. Doing this f times a second gives, together with some constant static dissipation: $P_T = k C V^2 f + P_s$. However, when the frequency is raised a lot the chip becomes unstable unless the voltage is also increased (leakage). Therefore the behaviour vs f is in fact superlinear.
  • 100. May 10, 2014 R.Innocente 100 Dennard scaling (1974) Per generation, with scale factor S = 1.4: $S^2$ = 2x more transistors; S = 1.4x faster transistors; S = 1.4x lower capacitance; scaling Vdd by S => $S^2$ = 2x lower energy. Performance scales as $S^3$ = 2.8x while power density stays constant across generations (plot axis labels: 1, S, $S^2$, $S^3$)
  • 101. May 10, 2014 R.Innocente 101 Fred Pollack(Intel) famous graph(1999) Power density increases !!! In 2004/2005 we hit the power wall => stop frequency increases “New microarchitecture challenges in the coming generations of CMOS process technology” F.Pollack
  • 102. May 10, 2014 R.Innocente 102 End of Dennard scaling $S^2$ = 2x more transistors; S = 1.4x faster transistors; S = 1.4x lower capacitance. In submicron technology the voltage can no longer be scaled down, so power increases by $S^2$ = 2x (plot axis labels: 1, S, $S^2$, $S^3$)
  • 103. May 10, 2014 R.Innocente 103 MOS subthreshold current Scaling down geometry you scale down drain voltage to avoid high electric fields and to decrease energy required to switch. You have to scale down also the threshold voltage to sustain the 30% decrease of gate delay. The small voltage swing that remains is not able to completely turn off the transistor. Subthreshold leakage that was ignored in the past can on modern VLSI chips consume up to ½ of the total power.
  • 104. May 10, 2014 R.Innocente 104 Subthreshold leakage
  • 105. May 10, 2014 R.Innocente 105 VT design tradeoff (plot: log IDS vs VGS) - Low VT for high ON current: $I_{Dsat} \propto (V_{DD} - V_T)^2$ - High VT for low OFF current. Phenomenology: a 60-200 mV swing of VGS changes IDS by one order of magnitude. Today's VT of 0.2-0.5 V doesn't leave the VGS swing needed to shut off the transistor. Low VT => high IDS, good for the ON condition; high VT => low leakage, good for the OFF condition
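A rough illustration of why a low threshold voltage leaks: if the subthreshold current drops by one decade for every `swing` millivolts of VGS, then the current at VGS = 0 is only 10^(VT/swing) times smaller than at threshold. This pure exponential model and the 100 mV/decade swing are simplifying assumptions (the slide quotes 60-200 mV); the function name is illustrative.

```python
def off_suppression(vt_mv, swing_mv_per_decade):
    """Factor by which the current at VGS = 0 is below the current at VGS = VT."""
    return 10 ** (vt_mv / swing_mv_per_decade)

SWING = 100  # mV per decade, assumed, within the slide's 60-200 mV range
for vt in (500, 300, 200):
    print(f"VT = {vt} mV: leakage suppressed by about {off_suppression(vt, SWING):.0e}")
```

Under these assumptions, moving VT from 0.5 V to 0.2 V raises the off current by three orders of magnitude.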
  • 106. May 10, 2014 R.Innocente 106 Multicore scaling 65 nm 45 nm 32 nm 4-core 8-core 16-core Every generation 2x cores, at same or slightly increasing frequency.
  • 107. May 10, 2014 R.Innocente 107 Multicore scaling at constant frequency $S^2$ = 2x more transistors; S = 1.4x lower capacitance => S = 1.4x lower utilization. We hit the utilization wall => dark silicon (plot axis labels: 1, S, $S^2$)
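The scaling arguments on the Dennard and multicore slides can be checked numerically by treating chip power as proportional to (number of transistors) x C x V^2 x f and asking how it changes per generation. This is a sketch of the slides' reasoning with S = sqrt(2); the function and parameter names are illustrative.

```python
import math

def power_growth(S, voltage_scales, frequency_scales):
    """Per-generation growth of chip power at fixed area, from N * C * V^2 * f."""
    transistors = S ** 2                                   # 2x more transistors
    capacitance = 1 / S                                    # 1.4x lower capacitance
    voltage_sq = (1 / S) ** 2 if voltage_scales else 1.0   # Vdd scaled by S, or stuck
    frequency = S if frequency_scales else 1.0             # 1.4x faster, or held constant
    return transistors * capacitance * voltage_sq * frequency

S = math.sqrt(2)
dennard = power_growth(S, voltage_scales=True, frequency_scales=True)
post_dennard = power_growth(S, voltage_scales=False, frequency_scales=True)
constant_f = power_growth(S, voltage_scales=False, frequency_scales=False)
print(dennard, post_dennard, constant_f)
```

The three results are 1 (constant power density), $S^2$ = 2 (the power wall) and S = 1.4: even at constant frequency, a fixed power budget lets only 1/S of the new transistors switch, which is the utilization wall.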
  • 108. May 10, 2014 R.Innocente 108 End of multicore scaling 65 nm: 4 cores; 45 nm: 5.7 cores; 32 nm: 8 cores. Every generation 1.4x cores, at the same or slightly increasing frequency. The rest is dark or dim silicon (“uncore”)
  • 109. May 10, 2014 R.Innocente 109 Dark silicon and the end of multicore scaling Doug Burger (Microsoft) at HiPEAC 2013: - until 2004: each semiconductor generation gave us transistors that were smaller, faster and less power hungry - from 2004 to now: we still got smaller transistors, but we could not run them faster (power wall) - in the future: we will still get smaller transistors, but we will not be able to use all of them together (dark silicon) or at maximum speed.
  • 110. May 10, 2014 R.Innocente 110 Scaling the utilization wall G. Venkatesh, ASPLOS '10: “while the area budget continues to increase exponentially, the power budget has become a first-order design constraint in current processors. In this regime, utilizing transistors to design specialized cores that optimize energy-per-computation becomes an effective approach to improve the system performance.” The Utilization Wall: with each successive process generation, the percentage of a chip that can switch at full frequency drops exponentially due to power constraints. [Venkatesh, ASPLOS '10] Single chip heterogeneous computer (E. Chung): greater energy efficiency by combining a GPP with unconventional cores (U-cores): GPU, FPGA, DSP, ASICs ...
  • 111. May 10, 2014 R.Innocente 111 3D FinFET promise Below 20nm the roadmap is to use 3D FinFETs : - Faster : +37% - Dynamic Power: -50% - Static Power: -90% KAIST demonstrated a 3nm FinFET in lab
  • 112. May 10, 2014 R.Innocente 112 The trouble with multicore A famous article of David Patterson (of “Computer architecture: a quantitative approach” fame) on IEEE Spectrum, 2010 : “Chipmakers are busy designing microprocessors that most programmers can’t program” “... the semiconductor industry threw the equivalent of a Hail Mary pass when it switched from making microprocessors run faster to putting more of them on a chip - doing so without any clear notion of how such devices would in general be programmed. The hope is that someone will be able to figure out how to do that, but at the moment, the ball is still in the air.”
  • 113. May 10, 2014 R.Innocente 113 Verilog
  • 114. May 10, 2014 R.Innocente 114 Using Verilog You write a functional specification, (usually) split into modules, that documents the exact behaviour of the system. Flow: functional design: HDL (Verilog) -> logic synthesis -> gate netlist; physical design: place & route (simulated annealing is used here!) -> FPGA or ASIC. NB: place and route of a large design can take a day on a fast CPU!
  • 115. May 10, 2014 R.Innocente 115 Verilog/1 Basic module :
    // comments look like this
    module name(input x0, x1, input [3:0] y, output out);
      // x0, x1 are wires, y is a 4-wire bus
      // out is an output wire
      // (ports declared this way are already wires: no separate wire declaration is needed)
      // combinational logic uses assign
    endmodule
  • 116. May 10, 2014 R.Innocente 116 Verilog/2 Combinational circuit :
    // computes y = (not a) b c + a (not b) (not c)
    module dummy(input a, b, c, output y, z);
      assign y = ~a & b & c | a & ~b & ~c;
      assign z = ~c;
    endmodule
    This is not C! a, b, c, y, z are wires, and y, z change whenever a, b or c change. To avoid this drama in complex circuits we use synchronous logic (every signal is held at docking stations = flip-flops, and advances one step per clock)
  • 117. May 10, 2014 R.Innocente 117 Verilog/3
  • 118. May 10, 2014 R.Innocente 118 Verilog/4 A sequential circuit :
    // a flip-flop described in Verilog
    module ff(input d, clk, output reg q, qbar);
      always @(posedge clk)
        begin
          q <= d;
          qbar <= ~d;
        end
    endmodule
    At a rising edge of the wire clk, copy d to q and the inverse of d to qbar
  • 119. May 10, 2014 R.Innocente 119 Verilog/5
  • 120. May 10, 2014 R.Innocente 120 Verilog/6 A more complicated sequential circuit :
    // flip-flop with asynchronous clear/reset, in Verilog
    module ff(input d, clk, clr, output reg q, qbar);
      always @(posedge clk, posedge clr)
        if (clr) begin
          q <= 0;
          qbar <= 1;
        end else begin
          q <= d;
          qbar <= ~d;
        end
    endmodule
    At a rising edge of the wire clr set q=0 (and qbar=1); at a rising edge of clk copy d to q and the inverse of d to qbar
  • 121. May 10, 2014 R.Innocente 121 Verilog/7
  • 122. May 10, 2014 R.Innocente 122 BORPH : Berkeley Operating system for ReProgrammable Hardware PETALINUX : Xilinx linux for Zynq et al.
  • 123. May 10, 2014 R.Innocente 123 BORPH: Berkeley Operating system for ReProgrammable Hardware - Idea of a HW unix process: it has a pid and can be killed like a normal unix process, but is in fact a HW instance on the FPGA - ioreg Virtual File System interface
  • 124. May 10, 2014 R.Innocente 124 Xilinx PetaLinux The PetaLinux Software Development Kit (SDK) is a development tool that contains everything necessary to build, develop, test and deploy Embedded Linux systems on: Zynq-7000, ZedBoard, Kintex-7 boards. PetaLinux consists of: pre-configured binary bootable images, fully customizable Linux for the Xilinx device, and the PetaLinux SDK, which includes tools and utilities to automate complex tasks across configuration, build, and deployment. PetaLinux is offered under two separate licenses: a no-charge Evaluation license or Commercial licenses
  • 125. May 10, 2014 R.Innocente 125 END