DOME 64-bit μDataCenter

Ronald P. Luijten – Data Motion Architect
lui@zurich.ibm.com
IBM Research - Zurich
9 April 2017

COMPUTE is FREE – DATA is NOT

DOME
Public-private partnership (PPP): ASTRON, IBM, Dutch government

SKA (Square Kilometer Array) to measure Big Bang
Picture source: NZZ, March 2014
Timeline of the universe:
0: Big Bang
10^-32 s: Inflation
10^-6 s: Protons created
0.01 s: Start of nucleosynthesis through fusion
3 min: End of nucleosynthesis
380’000 years
13.8 billion years: Modern Universe

SKA: What is it?
Top 500: Sum = 123 PFlops. 2 GFlops/watt.
100x Flops of Sum! ~ 7GWh
~3000 dishes, 3 GHz to 10 GHz
~0.5M antennae, 0.5 GHz to 1.7 GHz
~0.5M antennae, 0.07 GHz to 0.45 GHz
1. 10^9 samples/second * 0.5M antennae: 0.5 * 10^15 samples/sec.
2. 3.5 * 10^9 samples/second * 0.5M antennae: 1.7 * 10^15 samples/sec.
3. 2 * 10^10 samples/second * 3K antennae: 6 * 10^13 samples/sec.
Sum = 2 * 10^15 samples/second @ 86400 seconds/day:
170 * 10^18 (Exa) samples/day. Assume 10-12x reduction @ antenna:
14 Exabytes/day (minimum).
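The arithmetic on this slide can be retraced with a few lines of Python. This is only a sketch: the group names are placeholders for the three numbered items, and treating one sample as one byte is an assumption made here so that samples/day can be read as bytes/day, as the slide does implicitly.

```python
# Back-of-the-envelope SKA data rate, using the figures quoted on the slide.
SECONDS_PER_DAY = 86_400

# (samples per second per receiver, number of receivers) for items 1 to 3.
# The keys are placeholder names, not terms from the slide.
receiver_groups = {
    "group_1": (1e9, 0.5e6),
    "group_2": (3.5e9, 0.5e6),
    "group_3": (2e10, 3e3),
}


def samples_per_second(groups):
    """Sum the raw sample rate over all receiver groups."""
    return sum(rate * count for rate, count in groups.values())


def exa_per_day(rate, reduction):
    """Convert samples/second to Exa-samples/day after an N-fold reduction.

    Assumption: one sample is counted as one byte.
    """
    return rate * SECONDS_PER_DAY / reduction / 1e18


exact_rate = samples_per_second(receiver_groups)  # about 2.3e15, unrounded
slide_rate = 2e15                                 # rounded value used on the slide

print(f"raw rate: {exact_rate:.2e} samples/s")
print(f"per day, no reduction: {exa_per_day(slide_rate, 1):.0f} Exa-samples")
print(f"per day, 12x reduction: {exa_per_day(slide_rate, 12):.1f} Exa")
```

With the rounded rate and a 12x reduction this gives about 14 Exa per day, the "minimum" figure on the slide; a 10x reduction gives about 17.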

© 2016 IBM Corporation
~ 10 Pb/s
86’400 sec/day
14 ExaByte/day
??
~ 1 PB/Day.
330 disks/day
120’000 disks/yr
??
Top-500 Supercomputing (11/2013): 0.3 Watt/Gflop/s
Today’s industry focus is 1 Eflop @ 20MW (2018)
(0.02 Watt/Gflop/s)
Most recent data from SKA:
CSP….max. power 7.5MW
SDP….max. power 1 MW
Latest need for SKA – 4 Exaflop (SKA1 - Mid)
1.2GW…80MW
Too easy (for us)
Too hard
Moore’s law: factor 80-1200
SDP, CSP
multiple breakthroughs needed
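The power range and the "factor 80-1200" can be reproduced from the efficiency figures quoted on this slide. The sketch below assumes that the factor compares the projected power against the 1 MW SDP budget; the slide does not state this explicitly, and the function and variable names are illustrative only.

```python
def power_watts(flops, watt_per_gflops):
    """Power needed to sustain `flops` at a given efficiency (W per Gflop/s)."""
    return flops / 1e9 * watt_per_gflops


SKA_FLOPS = 4e18                    # 4 Exaflop (SKA1-Mid), from the slide
TOP500_2013 = 0.3                   # W per Gflop/s, Top-500 in 11/2013
EXASCALE_TARGET = 20e6 / (1e18 / 1e9)  # 1 Eflop @ 20 MW, i.e. 0.02 W per Gflop/s
SDP_BUDGET_W = 1e6                  # SDP max. power, from the slide

high = power_watts(SKA_FLOPS, TOP500_2013)     # about 1.2 GW
low = power_watts(SKA_FLOPS, EXASCALE_TARGET)  # about 80 MW

print(f"{high / 1e9:.1f} GW ... {low / 1e6:.0f} MW")
# Assumption: the gap is measured against the 1 MW SDP budget.
print(f"gap: factor {low / SDP_BUDGET_W:.0f} to {high / SDP_BUDGET_W:.0f}")
```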

Dome Project:
System Analysis
Research Streams:
- Data & Streaming
- Sustainable (Green) Computing
- Nanophotonics
Covering: Computing, Transport, Storage
…are mapped to research projects:
- Algorithms & Machines
- Nanophotonics
- Real-Time Communications
- New Algorithms
- Microservers
- Accelerators
- Access Patterns
…plus an open user platform:
- Student projects
- Events
- Research Collaboration
33M€ 5-year Research Project: 76 IBM PY (32 in NL); 50 ASTRON PY

Major SKA elements & DOME
Stations: Beamforming at stations
Central Signal Processor (CSP): Interferometry, correlation of station beams
Science Data Processor (SDP): Reconstruction of sky image
Archive
Algorithms and Machines (P1)
Access Patterns (P2)
Nanophotonics (P3)
Microservers (P4)
Accelerators (P5)
New Algorithms (P6)
Real-Time Communications (P7)

Definition
µDataCenter:
• Ultra-compact self-contained DataCenter using MicroServers
• 64 bit, Server-class computing (ECC on DRAM and caches)
• Ethernet networking
• Storage
• Hot-water cooling, air cooling with 4x less density
• High performance
• Best-of-Breed energy-efficiency
• Competitive cost
• Commodity and standards based
• ‘Appliance’
Allows deployment in space-constrained locations
Edge DataCenter for IoT
The integration of compute, storage, networking, power &
cooling into an ultra-compact form factor

The Economist, Technology Quarterly, 12 March 2016
Moore’s law: the reality

On-chip communication trends
Local vs global chip wiring (interconnect)
S. Borkar, Intel, 2013

Global chip wiring vs compute energy
Source: fig. 3 of Borkar 2013; fig. 9 for computation only
Values as extracted; the last column is the plotted ratio in %:
130nm  1.4  0.7  1.2  0.3  42.86
90nm   1  0.5  1  0.25  50
45nm   100  0.35  100  0.175  0.58  0.145  82.86
32nm   60  0.22  62.86  0.11  0.49  0.1225  111.36
22nm   45  0.146  41.71  0.073  0.43  0.1075  147.26
14nm   30  0.097  27.71  0.0485  0.4  0.1  206.19
[Figure: relative global chip interconnect versus computation energy in %, plotted for 90nm, 45nm, 32nm, 22nm and 14nm; y-axis from 0 to 250]
Computation energy includes local wiring
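The percentage in the last column can be reproduced from two other values in each row. The sketch below is a reconstruction, not the author's spreadsheet: which value is "compute" and which is "interconnect" is an assumption, chosen because their quotient matches the plotted ratio for every node.

```python
# (compute energy, global interconnect energy) per technology node.
# Assumption: these are the two row values whose quotient gives the last
# column of the table; units are those of the source figures.
energies = {
    "130nm": (0.7, 0.3),
    "90nm": (0.5, 0.25),
    "45nm": (0.175, 0.145),
    "32nm": (0.11, 0.1225),
    "22nm": (0.073, 0.1075),
    "14nm": (0.0485, 0.1),
}


def interconnect_vs_compute_percent(compute, interconnect):
    """Global interconnect energy relative to computation energy, in %."""
    return 100.0 * interconnect / compute


ratios = {
    node: interconnect_vs_compute_percent(compute, wire)
    for node, (compute, wire) in energies.items()
}

for node, pct in ratios.items():
    print(f"{node:>6}: {pct:6.2f} %")
```

Under this reading, global wiring costs half as much energy as computation at 90nm and roughly twice as much at 14nm.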

CMOS scaling eras
K. Rupp et al, 2015
Era of Dennard (constant energy density) scaling
Non-Dennard scaling
Communication-energy-dominated scaling

Learnings
Communication-energy-dominated scaling
Rethink:
- data motion
- system partitioning
- memory hierarchy
- packaging & cooling

Definition
µServer:
The integration of an entire server node motherboard* into a single microchip, except DRAM, NOR boot flash and power conversion logic.
This does NOT imply low performance!
Size: 305mm x 245mm motherboard vs. 139mm x 62mm µServer
* no graphics

Definition
µSwitch:
The integration of a Top-of-Rack switch* into an ultra-compact form factor
64 ports @ 10GbE
Size: 139mm x 62mm
* no PHYs

Indirect Hot-water cooling
Standard 240 pin DDR3 DIMM board: 133 mm x 30 mm
Node board: 139 mm x 61.5 mm
SoC (lid removed)
Dual-use Cu: cooling and power distribution
DIMM connector replaced with high speed SPD08
Cooling plate over circuit board
Integrated heat-pipes

What we get
32-way carrier “BB2”
(8 nodes populated in this picture)
12V power supply
Cooling rails

View from above
Server nodes
Power node
Storage node
10 GbE Switch
QSFP cages
Water In/Out
Cooling Rails

DOME compute node board diagram
T4240 SoC
DRAM: 3 channels, each 16GB, 72bit, 1866 MT/s
PSoC
1Gbit SPI flash
Power converter: 1V / 40A, 12V / 2.5A
Interfaces: USB, JTAG, Serial, I2C, 4 x 10 GbE, PCIe x8, 2 x SATA

DOME compute node board diagram
The PSoC collapses 7 functions into a small chip to save area, power and cost:
1. On/off & power-up sequencing of voltage domains
2. Monitor power supply voltages / current
3. Provide µServer boot configuration (I2C)
4. JTAG debug + HW counter performance access
5. Serial port forward over USB (Linux console)
6. Temperature monitoring and protection
7. Management interface and control (version management; MAC address assignment etc.)

DOME Compute Node Options
Node board size: 139 mm x 61.5mm (T4240 SoC and LS2088 SoC variants)
Node        ISA       DRAM        I/O
T4240ZMS    ppc64     24 GB       4x 10GbE
28nm Bulk   24 core   3 channel   PCI x8
43W TDP     1.8GHz    DDR3        2 SATA
            e6500     72bit ECC   USB, µSD
LS2088ZMS   ARMv8     32 GB       6x 10GbE
28nm Bulk   8 core    2 channel   PCI x4,x2,x1,x1
35W TDP     2+GHz     DDR4        2 SATA
            A72       72bit ECC   USB, µSD

DOME Accelerator Node
PCI- and/or Network-Attached FPGA module
FPGA
Xilinx® Kintex® UltraScale™
Five device options
- XCKU025 (downgrade)
- XCKU035 (downgrade)
- XCKU040 (downgrade)
- XCKU060 (default)
- XCKU095 (upgrade)
Memory (DDR4)
16 GB total (default)
- x2 banks of 8GB x72
- 2400 MT/s, w/ ECC
32 GB total (option)
- x2 banks of 16GB x72
- 2400 MT/s, w/ ECC
Flash
1 Gb x 16 (default)
- Multi-boot support
- Encryption support
2 Gb x16 (option)
Reconfigurable accelerator module (FPGA)
Connected through the Ethernet network without any host interaction.
Up to 1024 cards fit into a single 19” by 42U rack.
FPGA: Xilinx® Kintex® UltraScale™ with two independent DDR4 memory channels (8–16GB each)
Top edge extension connector with 128 Gbps of bandwidth over 8 lanes
Daughter card and I/O connectors for plugging in an I/O mezzanine
6 x 10 GbE, PCIe3 x8, 2 x SATA3
Status: In bringup

Industry I/O interface board
IoT Daughter-Board to FPGA module
- attached to the Mezzanine connector
- provides various relevant IoT interface standards
- seamlessly embedded in the DOME IoT Edge Compute platform
Standard DOME interface (Ethernet, PCI, SATA)
IoT Daughter-Card interfaces:
Interface                      Count     Notes
USB 2.0 host                   2
Optocoupler in                 4         100Mbps, Avago
Optocoupler out                4         Ditto
LVDS                           7 pairs   For ADC, etc.
CAN                            2
Output Level Shifter           18        Programmable output level, 1Mbps
Input Level Shifter            12        Programmable input level, 10Mbps
Isolated USB Low-Speed host    1
MIPI PHY                       1+1
Serial (RS232, RS485, etc.)    2         Or 4 without handshake
Note: In addition to the above, the IoT daughter card can be connected to the FPGA via 8 lanes of PCIe3

32-way carrier board
[Figure: 32-way carrier board, 520mm x 200mm, showing storage, switch, power and compute nodes and the cooling (only left rail shown)]
32-way Carrier:
– compute node (32x):
32 ppc or 32 ARM or (16 ppc + 16 ARM)
– 64 port Ethernet switch
– 32x 10 GbE to compute nodes
– 8x 40GbE external links
Expect ~1TFlop/s linpack w/ T4240 nodes
2 Carriers in 2U rack unit:
– 64 Compute nodes with total 1536 cores
– 1536 Gbyte DRAM
– 16x 40GbE
– 64 TB storage
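The 2U totals follow directly from the per-node figures. The sketch below recomputes them for T4240 nodes; the class and function names are hypothetical, and the per-node values (24 cores, 24 GB) are taken from the compute node options slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    cores: int     # cores as counted on the slides (24 for the T4240)
    dram_gb: int   # DRAM per node in GB


@dataclass(frozen=True)
class Carrier:
    nodes: int = 32          # compute nodes per 32-way carrier
    uplinks_40gbe: int = 8   # external 40GbE links per carrier


def rack_unit_totals(node, carrier, carriers=2):
    """Aggregate resources for `carriers` carriers in one 2U rack unit."""
    n = carrier.nodes * carriers
    return {
        "nodes": n,
        "cores": n * node.cores,
        "dram_gb": n * node.dram_gb,
        "uplinks_40gbe": carrier.uplinks_40gbe * carriers,
    }


t4240 = Node(cores=24, dram_gb=24)
totals = rack_unit_totals(t4240, Carrier())
print(totals)
```

The result matches the slide: 64 nodes, 1536 cores, 1536 GB of DRAM and 16 external 40GbE links per 2U.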

32-way carrier structure
[Diagram: switch with 8x 40G external links; four rows of general purpose nodes (labelled 1 to 8, 9 to 16, 17 to 23 and 24 to 32), each row with a storage node; two power nodes; one management node]
Legend:
N = General Purpose Node (Compute, accelerator)
S = Storage node (8x mSATA)
P = Power node (DRAM + I/O supplies)
M = Management node (T4240 w/ IPMI)
Links: 10 GbE, 1 GbE management, management bus, SATA, supply bus

32-way carrier network topology
T4240 module
32-way carrier
FM6000 switch
32x 10 GbE internal connectivity from switch
8x 40GbE external connectivity (QSFP+)
Green links optionally connect to another 32-way carrier
Short electrical links on carrier board (copper backplane standard 10GBASE-KR)
MAC-to-MAC Ethernet links eliminate PHY chips:
- 128 PHYs on server nodes
- 32 PHYs on switch node
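The port and PHY counts on this slide can be cross-checked with simple bookkeeping. In the sketch below, counting each 40GbE uplink as four 10G switch ports is an assumption made here, and the PHY counts are derived as ports per node times nodes, which is one plausible reading of the slide.

```python
NODES = 32
PORTS_10GBE_PER_NODE = 4   # each T4240 node offers 4x 10 GbE
INTERNAL_LINKS = 32        # 10 GbE links from switch to nodes
UPLINKS_40GBE = 8
LANES_PER_40GBE = 4        # assumption: one 40GbE link uses four 10G lanes

# PHY chips avoided by connecting MAC to MAC over the backplane.
phys_saved_nodes = NODES * PORTS_10GBE_PER_NODE
phys_saved_switch = INTERNAL_LINKS

# Switch port budget and bandwidth balance.
switch_ports_used = INTERNAL_LINKS + UPLINKS_40GBE * LANES_PER_40GBE
internal_gbps = INTERNAL_LINKS * 10
external_gbps = UPLINKS_40GBE * 40
oversubscription = internal_gbps / external_gbps

print(f"PHYs saved: {phys_saved_nodes} on nodes, {phys_saved_switch} on switch")
print(f"switch ports used: {switch_ports_used}")
print(f"internal {internal_gbps} Gb/s vs external {external_gbps} Gb/s")
```

Under these assumptions the 64-port switch is fully used and internal and external bandwidth are balanced at 320 Gb/s each.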

Currently in bringup (April 2017)
Water-cooled bringup (from right to left):
- SATA carrier (MM node)
- USB hub module
- Power node
- T4240 management node
- Storage node
- 8 T4240 server nodes
- Switch node

Performance Measurement Results
CPU                       Freescale T4240              Intel Xeon E3-1230L v3
Cores / threads           12 cores; 24 threads         4 cores; 8 threads
Process                   28nm Bulk                    22nm FinFET
Test Environment
System                    T4240RDB-PB                  Supermicro X10SAE
Core clock                1.666 GHz                    1.8 GHz; Turbo disabled
DRAM                      1.866 GT/s, 6GB, 3 channels  1.666 GT/s, 8 GB, 2 channels
OS                        Fedora 20, Kernel 3.12.19
Compiler                  GCC 4.7.2                    GCC 4.8.2
gcc options               -O3 -mcpu=powerpc64          -O3 -march=native -mtune=native
CPU2006 Benchmark
CINT-base, 1 thread       6.86                         20.7
CINT-base, all threads    109.34 (24 threads)          77.6 (8 threads)
Coremark, all threads     188K (24 threads)            65K (8 threads)
40% more performance @ 70% of node-level energy consumption: 2x more operations per Joule
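The "2x more operations per Joule" claim can be retraced from the all-thread CINT-base scores. The sketch below assumes that "70% of node level energy consumption" means the T4240 node draws 70% of the power of the Xeon node under load; the variable names are illustrative.

```python
# All-thread CPU2006 CINT-base scores from the table above.
CINT_T4240 = 109.34
CINT_XEON = 77.6

# Assumption: the T4240 node draws 70% of the Xeon node's power under load.
RELATIVE_NODE_POWER = 0.70

perf_ratio = CINT_T4240 / CINT_XEON                      # throughput advantage
ops_per_joule_ratio = perf_ratio / RELATIVE_NODE_POWER   # efficiency advantage

# CoreMark, all threads, for comparison (values in thousands).
coremark_ratio = 188 / 65

print(f"performance: {100 * (perf_ratio - 1):.0f}% more")
print(f"operations per Joule: {ops_per_joule_ratio:.2f}x")
print(f"CoreMark throughput: {coremark_ratio:.2f}x")
```

The single-thread scores point the other way (6.86 vs 20.7), so the advantage comes from running many threads at low power rather than from per-thread speed.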

Key Features DOME µDataCenter
2x Operations per Joule compared to energy-efficient Xeon E3-1230Lv3 (SpecBench)
20x denser with watercooling (5x with aircooling)
No moving parts (drives, fans)
Highest system memory bandwidth density: 159GB/s/Liter (peak)
Value:
• Density + Energy-Efficiency + commodity components + standards
• minimal component count
– SoC, PSoC, System partitioning
• Packaging, power and cooling
• Connector definition

Product version being finished
The “edge-of-IOT” microDataCenter is being productized – 64 servers in 2U
Market introduction planned summer 2017
rendering of two BB2 carriers in 2U rack unit

µDataCenter plans
• Finish ARMv8 server board
• Finish FPGA board
• Obtain funding to build GPU board + Xeon-D board
• Bring µDataCenter to market
• Product launch: Summer 2017
• H2020 proposal for next step in packaging integration:
– use high performance SoC die based on ARMv8
– package with 3D packaged DRAM
– chip carrier technology in size of DOME node cards, but thicker
[Figure: ZRL prototype with 3D packaged DRAM]

Application Areas
• Managing unstructured data for Industry 4.0
• Smarter Cities: Carbon Emissions, Traffic Flow & Noise
• Computational Musicology
• Processing petabytes of data from the Big Bang
• Industry 4.0 • Internet of Things • Aerospace • Vehicles
CeBIT ‘16 live demo

Trends, Conclusions
Making it small really works to improve energy-efficiency
- SoC removes many chip crossings
- short distance
- Save power in unexpected places (PHY, DRAM)
- PSoC eliminates many components
- Water cooling reduces power consumption even further
The future scaling roadmap is in ultra-dense packaging
Big Data changing workloads
IOT distributed DataCenters

SKA: http://www.skatelescope.org
DOME: http://www.dome-exascale.nl
µServer: http://www.zurich.ibm.com/microserver
T4240 system: http://swissdutch.ch:6999
Wikipedia: https://en.wikipedia.org/wiki/Microserver
Twitter: https://twitter.com/ronaldgadget
Videos:
Impossible µServer: http://t.co/4vEkEVEazO
Innovators Dilemma: http://youtu.be/imweQe8NgnI
DOME T4240 Fedora: http://youtu.be/D6da5DqcyQk
Links

“Energy-Efficient Microserver Based on a 12-Core 1.8GHz 188K-CoreMark 28nm
Bulk CMOS 64b SoC for Big-Data Applications with 159GB/s/L Memory Bandwidth
System Density”, R.Luijten et al., ISSCC15, San Francisco, Feb 2015
“The DOME embedded 64 bit microserver demonstrator”, R. Luijten and A. Doering,
ICICDT 2013, Pavia, Italy, May 2013
“Quantitative Analysis of the Berkeley Dwarfs' Parallelism and Data Movement
Properties”, Victoria Caparros Cabezas, Phillip Stanley-Marbell, ACM CF 2011, May
2011
“Performance, Power, and Thermal Analysis of Low-Power Processors for Scale-
Out Systems”, Phillip Stanley-Marbell, Victoria Caparros Cabezas, IEEE HPPAC 2011,
May 2011
“Pinned to the Walls—Impact of Packaging and Application Properties on the
Memory and Power Walls”, Phillip Stanley-Marbell, Victoria Caparros Cabezas,
Ronald P. Luijten, IEEE ISLPED 2011, Aug 2011.
Literature

Acknowledgements
This work is the result of the efforts of many people
• Andreas Doering, IBM ZRL
• Matteo Cossale, IBM ZRL
• Stephan Paredes, IBM ZRL
• Francois Abel, IBM ZRL
• Beat Weiss, IBM ZRL
• Peter v. Ackeren, NXP
• Ed Swarthout, NXP Austin
• Dac Pham, (formerly NXP Austin)
• Yvonne Chan, IBM Toronto
• Alessandro Curioni, IBM ZRL
• Ton Engbersen, IBM ZRL
• James Nigel, FSL
• Boris Bialek, IBM Toronto
• Marco de Vos, Astron NL
• And many more remain unnamed….
Companies: NXP Austin, Belgium & Germany; IBM worldwide; Transfer – NL
Dutch government, for the DOME grant

Questions???
PS. I like lightweight things
µServer website: www.swissdutch.ch
