Deview 2013 rise of the wimpy machines - john mao

Rise of the (Wimpy) Machines
Datacenter Efficiency with ARM-based Servers
John Mao
Director of Strategy, Calxeda

What is the name of the computer system in
this movie that tried to end the human race?

Skynet

Origins of Wimpy Core Computing
•  FAWN: A Fast Array of Wimpy Nodes
–  Project from CMU led by Prof. David Andersen, started in 2008 (active through 2012)
–  Measure and compare the performance-per-Joule energy advantages over traditional servers
–  Original focus on large distributed key-value store applications and use-cases (e.g. Amazon Dynamo, LinkedIn’s Voldemort, Facebook’s memcached)

[Publication] http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf
[Website] http://www.cs.cmu.edu/~fawnproj/

FAWN: A Fast Array of Wimpy Nodes
•  Why FAWN? Motivated by key trends:
–  Increasing CPU-I/O gap
–  CPU power consumption grows super-linearly with speed
–  Dynamic power scaling on traditional systems is surprisingly inefficient


[Photo: FAWN hardware generations 1G through 5G. Photo credit: http://www.cs.cmu.edu/~fawnproj/]

•  Multiple generations of hardware used:
–  1G (2008)
•  Single-core 500MHz AMD Geode LX processor
•  256MB DDR SDRAM (400MHz)
•  100Mbps Ethernet

–  5G (2012)
•  Intel Atom D510 – 1.66GHz dual-core w/HT
•  2-4GB DDR2 (667MHz)
•  100Mbps Ethernet

Key Findings from FAWN Project

“The FAWN cluster achieves 364 queries per Joule, two orders of magnitude better than traditional disk-based clusters.”

[Source] http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf
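
The "queries per Joule" metric quoted above can be illustrated with a minimal sketch. Since one Watt is one Joule per second, dividing throughput (queries per second) by average power draw (Watts) gives queries per Joule. The function name and the sample numbers below are hypothetical and are not taken from the FAWN paper.

```python
def queries_per_joule(queries_per_second, watts):
    """Energy efficiency: (queries / s) / (Joules / s) = queries / Joule."""
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return queries_per_second / watts


# Hypothetical node: 1000 queries/s at an average draw of 4 W.
hypothetical_efficiency = queries_per_joule(1000, 4)
print(f"{hypothetical_efficiency:.0f} queries per Joule")
```

A slow node can still win on this metric if its power draw falls faster than its throughput, which is the comparison the FAWN project set out to measure.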

So what about ARM®?

ARM is a good “wimpy” processor & CPU
architecture for the datacenter because:
1.  Focus on low power: origins in embedded
systems and mobile devices
2.  Datacenter focused roadmap: 32-bit CPUs
today, 64-bit CPUs in 1-2 years; increasing
performance (with same energy efficiency)
3.  Business model: ability to integrate for specific
markets and applications
4.  Emerging software ecosystem: while not x86,
ARM has a growing ecosystem

Focus on Low Power
•  History in targeting energy-sensitive markets:
–  Netbooks, Smartbooks, Tablets, Thin Clients
–  Smartphones, Feature phones
–  Set-top Box, Digital TV, Blu-Ray players, Gaming
consoles
–  Automotive Infotainment, Navigation
–  Wireless base-stations, VoIP phones and
equipment

•  Design Goals
–  Performance, Power, Easy Synthesis

Focus on Low Power
In 2005, about 98% of all mobile phones sold used at least one ARM processor.

As of 2009, due to low power consumption, the ARM architecture is the most widely used 32-bit RISC architecture in mobile devices and embedded systems.

[Source] http://en.wikipedia.org/wiki/ARM_architecture

Focus on Low Power
Translating ARM energy-efficiency into the
modern datacenter with Cortex-A9:
Workload (on 24 nodes & SSDs)    Total System* Power (Today!)    ~Power per ECX-1000 Node (with disk @Wall)
Linux at Rest                    130 W                           5.4 W
phpbench                         155 W                           6.5 W
Coremark (4 threads per SOC)     169 W                           7.0 W
Website @ 70% Utilization        172 W                           7.2 W
LINPACK                          191 W                           7.9 W
STREAM                           205 W                           8.5 W

*All measurements done on a 24-node system @1.1GHz, with 24 SSDs and 96 GB DRAM in the Calxeda Lab.

For specific workloads, ECX-1000 can enable a complete
24-node cluster at a power level similar to that of a 2-socket x86 server.
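
The per-node figures in the power measurements above appear to be the total system power divided across the 24 nodes. The slide does not state this explicitly, so the sketch below treats it as an assumption and checks that each quoted per-node value is within 0.1 W of total / 24. The variable and function names are illustrative only.

```python
NODES = 24  # node count from the slide footnote

# (workload, total system power in W, per-node power in W), as quoted on the slide
MEASUREMENTS = [
    ("Linux at Rest", 130, 5.4),
    ("phpbench", 155, 6.5),
    ("Coremark (4 threads per SOC)", 169, 7.0),
    ("Website @ 70% Utilization", 172, 7.2),
    ("LINPACK", 191, 7.9),
    ("STREAM", 205, 8.5),
]


def per_node_watts(total_watts, nodes=NODES):
    """Assumed relation: per-node power is the wall power split evenly."""
    return total_watts / nodes


for workload, total, quoted in MEASUREMENTS:
    computed = per_node_watts(total)
    print(f"{workload:30s} {total:4d} W total -> {computed:.2f} W/node (slide: ~{quoted} W)")
```

Because the total includes SSDs, DRAM, and shared power delivery, the per-node number is a system-level share and not the SOC's own power draw.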

Online Review: Calxeda’s ARM Server Tested

AnandTech chartered review
comparing Boston Viridis’
24-node Calxeda ECX-1000
(Cortex-A9) cluster against
Intel E5-2650L system.
(March 2013)

http://www.anandtech.com/show/6757/calxedas-arm-server-tested

Calxeda Provides Better Web Throughput

Boston Viridis outperforms
Xeon E5-2650L by 30% with
more than 15 users.

Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.

Calxeda Provides Lower Response Times

Boston Viridis outperforms
Xeon E5-2650L by 60% with
more than 15 users.

Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.

Calxeda Provides Highest Performance/Watt

Boston Viridis provides 80%
more throughput per Watt
than Xeon E5.
•  10-36% less raw power

Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
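
As a rough plausibility check, the throughput and power figures quoted on these slides can be combined into a throughput-per-Watt ratio. This is only a sketch: it assumes the 30% throughput gain and the 10-36% power reduction can be paired directly, which the slides do not claim, since the figures may come from different user loads. The function name is hypothetical.

```python
def perf_per_watt_ratio(throughput_gain, power_reduction):
    """Relative throughput per Watt versus the baseline system.

    throughput_gain: fractional gain, e.g. 0.30 for 30% more throughput
    power_reduction: fractional saving, e.g. 0.10 for 10% less power
    """
    if not 0 <= power_reduction < 1:
        raise ValueError("power_reduction must be in [0, 1)")
    return (1 + throughput_gain) / (1 - power_reduction)


low = perf_per_watt_ratio(0.30, 0.10)   # smallest quoted power saving
high = perf_per_watt_ratio(0.30, 0.36)  # largest quoted power saving
print(f"throughput per Watt: {low:.2f}x to {high:.2f}x the baseline")
```

Under these assumptions the range brackets the quoted 80% improvement (1.8x), so the slide figures are at least mutually consistent.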

Online Review: Calxeda’s ARM Server Tested
Reviewer’s Key Takeaways:
–  For scale-out workloads, Calxeda’s ARM-based scale-out
hardware architecture is very promising.
–  Microbenchmarks show Calxeda ECX-1000 ~10% behind
Intel Atom N2800 @1.86 GHz
–  “Real World” Application Benchmarking shows 70%+ higher
performance-per-watt than Intel Xeon E5 at mid to high user load
–  “Calxeda really did it: each server needs about 8.3W (200W/24),
measured at the wall…about 6W (at 1.4GHz) per server node…”
–  “So on the one hand, no, the current Calxeda servers are no
Intel Xeon killers (yet). However, we feel that Calxeda's
ECX-1000 server node is revolutionary technology.”

ARM® Cortex-A15

•  Based on ARMv7-A architecture
–  Ensures software application compatibility
with other Cortex-A processors

•  LPAE support up to 1TB physical memory
•  Full hardware virtualization support
•  From ARM: delivers 2X performance over
Cortex-A9 processor with similar power
•  big.LITTLE configuration support for
mobile devices
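
The 1TB figure for LPAE follows from the width of the physical address. LPAE extends ARMv7-A physical addresses from 32 to 40 bits, and the short sketch below works through the arithmetic. The helper name is illustrative.

```python
def addressable_bytes(address_bits):
    """Bytes reachable with a byte-addressed physical address of this width."""
    return 1 << address_bits


GIB = 1024 ** 3
TIB = 1024 ** 4

classic = addressable_bytes(32)  # 32-bit physical addressing
lpae = addressable_bytes(40)     # 40-bit physical addressing with LPAE

print(f"32-bit: {classic // GIB} GiB")
print(f"40-bit: {lpae // TIB} TiB ({lpae // classic}x larger)")
```

Each application still sees a 32-bit virtual address space, so the extra physical memory mainly benefits workloads that run many processes or virtual machines per node.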

Datacenter Focused Roadmap
“Triple Play”: 3 Generations of Pin-Compatible SOCs

•  Highbank: ECX-1000 (4 Core, ARM® Cortex A9)
–  Power Efficient Solution for Storage and Web Hosting
•  Midway: ECX-2000 (4 Core, ARM® Cortex A15)
–  Performance/$ for Cloud and Analytics
•  Sarita (ARM® Cortex A57)
–  Compatible 64-bit On-Ramp for Early Access and Ecosystem Enablement
•  Lago (ARM® Cortex A57)
–  Flagship 64-bit Product for a Broader Application Set
–  3rd Generation Calxeda Fabric and I/O

Timeline: 2013, 2014, 2015

[Source] Calxeda public SOC roadmap (June 2013)

“Midway”: Calxeda ECX-2000
Compared to Calxeda’s Cortex-A9 SOC
(ECX-1000), the “Midway” SOC delivers:
–  1.5X more single-thread performance
–  2X more floating point performance
–  3X STREAM (memory b/w) performance
–  4X+ more physical memory support (16GB+)
–  Same performance-per-Watt
Plan to update Anandtech benchmark report

But, ARM doesn’t make/sell SOCs?

ARM® Business Model

•  ARM does not make or sell SOCs.
•  Instead, ARM licenses IP and technology
to partners (like Calxeda) who design and
build System-on-Chips (SOCs) for various
industries and markets.
•  Calxeda is focused exclusively on bringing
ARM-based technology to the datacenter.
–  Calxeda provides its own IP (e.g. Fabric) as
additional value for servers.

EnergyCore® architecture at a glance
A complete building block for hyper-efficient computing

EnergyCore
Management Engine
Advanced system, power
and fabric management for
energy-proportional
computing

I/O Controllers
Standard drivers, standard
interfaces. No surprises.

Processor Complex
Multi-core ARM®
processors integrated
with high bandwidth
memory controllers

EnergyCore
Fabric Switch
Integrated high-performance
fabric provides inter-node
connectivity with industry
standard networking

EnergyCore® Fabric (F1/F2)
Integrated 80Gb (8x10Gb cross-bar)
Fabric Switch:
•  Up to 5 external links:
–  Dynamic bandwidth: 1Gb to 10 Gb
per link
–  < 200 ns latency,
node to node

•  3 internal links (to the SOC):
–  2x 10Gb Ethernet ports to the OS
–  1x 10Gb Ethernet port to Mgmt
–  Transparent to OS and software

•  Topology agnostic

à Eliminates Top-of-Rack-Switch ports & cabling
à Enables extreme density; lowers cost and power

Target Workloads
•  Data-Intensive Applications:
–  Storage (scale-out, distributed storage)
•  e.g. Ceph, Gluster, etc.

–  Analytics (NoSQL, MapReduce, distributed
databases)
•  e.g. Hadoop, Cassandra, etc.

•  Distributed, State-less Applications
–  Web Front End
–  Caching Servers
–  Content Distribution Networks (CDN)

Use-Case: Storage via Ceph
•  Official Ceph “Dumpling”+ release now supports
Calxeda-based platforms
•  Initial benchmarks complete (with x86 comparison)
–  Even without optimizations, performance is promising

•  Identified optimization areas (under investigation):
–  Potentially use NEON instructions for CRC32
–  Implement zero-copy on OSDs
–  Transition reads/writes to bufferlists
–  Optimize client side too – librados/librbd
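
To show why checksumming is a candidate for SIMD work, here is a minimal table-driven CRC-32 sketch. It uses the standard CRC-32 (IEEE 802.3) polynomial purely for illustration. The slide does not say which CRC variant or code path Ceph uses, and this is not Ceph's implementation. The function names are hypothetical.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 (IEEE 802.3) polynomial


def build_table():
    """Precompute the CRC of every possible byte value."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


TABLE = build_table()


def crc32_bytewise(data, crc=0):
    """One table lookup per input byte; pass a previous result to continue."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


payload = b"object data written to a storage daemon"
assert crc32_bytewise(payload) == zlib.crc32(payload)
print(hex(crc32_bytewise(payload)))
```

The loop handles one byte per iteration and each step depends on the previous one, which is why a vectorized or hardware-assisted implementation can matter on a storage node that checksums every read and write.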

Use-Case: Storage via Ceph

With the same number of HDDs,
a Calxeda-based system delivers
50% more performance than
traditional x86 servers.

The AAEON CRS-200S-2R Advantage
An ARM-based, lower cost, higher performance server platform for scale-out storage

Calxeda’s ARM-based SOCs:
•  Energy Efficient
•  More cores per HDD
•  Lower system power
•  High Bandwidth Fabric
•  Multi-10Gb links for
data-intensive apps

Compared to traditional x86-based,
2U rack mount servers, the AAEON
CRS-200S-2R server platform is:

✓  35% Lower TCO*
✓  66% Less Rack Space
✓  50% Higher performance

Summary
•  Even 64-bit ARM processors are not ideal for
every single workload.
•  However, scale-out, data-intensive workloads
can leverage ARM’s energy-efficiency to provide a
significantly better TCO.
•  For the server market (especially with scale-out
apps), replacing the CPU core is not enough.
–  Look for SOCs that optimize “between the nodes” in a
cluster (e.g. fabric interconnects will help dramatically)

•  Interested in joining the “ARM revolution”?
–  Contact us! – John Mao, john.mao@calxeda.com
