Calxeda's ARM-based servers provide significant efficiency advantages over traditional x86 servers for scale-out workloads. The FAWN project at CMU demonstrated that a cluster of low-power ARM nodes could achieve over 360 queries per joule for key-value store applications, two orders of magnitude better than traditional servers. Calxeda's ECX-1000 servers based on Cortex-A9 ARM processors showed 70% higher performance per watt than Intel Xeon servers for web application workloads. Upcoming servers using Cortex-A15 and Cortex-A57 ARM processors are expected to provide even better performance. These efficiency gains make ARM servers well-suited for distributed applications like storage, analytics and
Deview 2013 rise of the wimpy machines - john mao
1. Rise of the (Wimpy) Machines
Datacenter Efficiency with ARM-based Servers
John Mao
Director of Strategy, Calxeda
2. What is the name of the computer system in
this movie that tried to end the human race?
Skynet
3.
4. Origins of Wimpy Core Computing
• FAWN: A Fast Array of Wimpy Nodes
– Project from CMU led by Prof. David Andersen, started in 2008 (active through 2012)
– Measure and compare performance per Joule of energy advantages over traditional servers
– Original focus on large distributed key-value store applications and use-cases (e.g. Amazon Dynamo, LinkedIn’s Voldemort, Facebook’s memcached)
[Publication] http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf
[Website] http://www.cs.cmu.edu/~fawnproj/
5. FAWN: A Fast Array of Wimpy Nodes
• Why FAWN? Motivated by key trends:
– Increasing CPU-I/O Gap
– CPU power consumption grows super-linearly with speed
– Dynamic power scaling on traditional systems is surprisingly inefficient
6. FAWN: A Fast Array of Wimpy Nodes
[Photo: FAWN hardware generations 1G through 5G]
[Photo Credit] http://www.cs.cmu.edu/~fawnproj/
7. FAWN: A Fast Array of Wimpy Nodes
• Multiple generations of hardware used:
– 1G (2008)
• Single-core 500MHz AMD Geode LX processor
• 256MB DDR SDRAM (400MHz)
• 100Mbps Ethernet
– 5G (2012)
• Intel Atom D510 – 1.66GHz dual-core w/HT
• 2-4GB DDR2 (667MHz)
• 100Mbps Ethernet
8. Key Findings from FAWN Project
“The FAWN cluster achieves 364 queries per Joule, two orders of magnitude better than traditional disk-based clusters.”
[Source] http://www.sigops.org/sosp/sosp09/papers/andersen-sosp09.pdf
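The metric in this quote is throughput divided by power draw, since one watt is one joule per second. A minimal sketch of that arithmetic follows; the function name and the sample inputs are illustrative assumptions, not figures from the FAWN paper.

```python
def queries_per_joule(queries_per_second: float, watts: float) -> float:
    """Energy efficiency: queries served per joule consumed.

    1 W = 1 J/s, so (queries/s) / (J/s) = queries/J.
    """
    if watts <= 0:
        raise ValueError("power draw must be positive")
    return queries_per_second / watts


# Hypothetical inputs, chosen only to illustrate the units.
example = queries_per_joule(queries_per_second=1000.0, watts=10.0)
print(f"{example:.1f} queries/J")
```

Comparing two clusters on this metric requires measuring both under the same workload and at the same point (for example, at the wall).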
9. So what about ARM®?
ARM is a good “wimpy” processor & CPU
architecture for the datacenter because:
1. Focus on low power: origins in embedded
systems and mobile devices
2. Datacenter focused roadmap: 32-bit CPUs
today, 64-bit CPUs in 1-2 years; increasing
performance (with same energy efficiency)
3. Business model: ability to integrate for specific
markets and applications
4. Emerging software ecosystem: while not x86,
ARM has growing ecosystem
10. Focus on Low Power
• History in targeting energy-sensitive markets:
– Netbooks, Smartbooks, Tablets, Thin Clients
– Smartphones, Feature phones
– Set-top Box, Digital TV, Blu-Ray players, Gaming
consoles
– Automotive Infotainment, Navigation
– Wireless base-stations, VoIP phones and
equipment
• Design Goals
– Performance, Power, Easy Synthesis
11. Focus on Low Power
In 2005, about 98% of all mobile phones sold used at least one ARM processor. As of 2009, due to low power consumption, the ARM architecture is the most widely used 32-bit RISC architecture in mobile devices and embedded systems.
[Source] http://en.wikipedia.org/wiki/ARM_architecture
12. Focus on Low Power
Translating ARM energy-efficiency into the
modern datacenter with Cortex-A9:
Workload (on 24 nodes & SSDs)    Total System* Power (Today!)    ~Power per ECX-1000 Node (with disk @Wall)
Linux at Rest                    130 W                           5.4 W
phpbench                         155 W                           6.5 W
Coremark (4 threads per SOC)     169 W                           7.0 W
Website @ 70% Utilization        172 W                           7.2 W
LINPACK                          191 W                           7.9 W
STREAM                           205 W                           8.5 W
*All measurements done on a 24-node system @1.1GHz, with 24 SSDs and 96 GB DRAM in the Calxeda Lab.
For specific workloads, ECX-1000 can enable a complete
24-node cluster at similar power level as a 2 socket x86.
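The per-node figures on this slide appear to be the total system power divided across the 24 nodes. A short sketch of that division, using the wattages listed on the slide (the function and variable names are illustrative):

```python
NODES = 24  # node count stated in the slide footnote

# Total system power at the wall, in watts, as listed on the slide.
total_system_watts = {
    "Linux at Rest": 130,
    "phpbench": 155,
    "Coremark (4 threads per SOC)": 169,
    "Website @ 70% Utilization": 172,
    "LINPACK": 191,
    "STREAM": 205,
}


def per_node_watts(total_watts: float, nodes: int = NODES) -> float:
    """Average power attributable to one node, including its share of the chassis."""
    return total_watts / nodes


per_node = {name: per_node_watts(w) for name, w in total_system_watts.items()}
for name, watts in per_node.items():
    print(f"{name:30s} {watts:5.2f} W per node")
```

The computed values agree with the slide's approximate per-node column to within about 0.1 W, which is consistent with that column being labeled "~".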
14. Online Review: Calxeda’s ARM Server Tested
AnandTech chartered review
comparing Boston Viridis’
24-node Calxeda ECX-1000
(Cortex-A9) cluster against an
Intel E5-2650L system.
(March 2013)
http://www.anandtech.com/show/6757/calxedas-arm-server-tested
15. Calxeda Provides Better Web Throughput
Boston Viridis outperforms
Xeon E5-2650L by 30% with
more than 15 users.
Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
16. Calxeda Provides Lower Response Times
Boston Viridis outperforms
Xeon E5-2650L by 60% with
more than 15 users.
Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
17. Calxeda Provides Highest Performance/Watt
Boston Viridis provides 80%
more throughput per Watt
than Xeon E5.
• 10-36% less raw power
Test is phpBB running on Apache2 with variable numbers of users (concurrency) generating traffic.
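Throughput per watt is a ratio, so a relative gain follows from the relative throughput and the relative power draw at the same load point. A sketch of that relationship; the function name and the sample inputs are hypothetical and are not measurements from the review.

```python
def relative_perf_per_watt(throughput_ratio: float, power_ratio: float) -> float:
    """Ratio of (throughput/W) for system A over system B.

    throughput_ratio: A's throughput divided by B's throughput.
    power_ratio:      A's power draw divided by B's power draw.
    """
    if power_ratio <= 0:
        raise ValueError("power ratio must be positive")
    return throughput_ratio / power_ratio


# Hypothetical example: 20% more throughput while drawing 25% less power.
gain = relative_perf_per_watt(throughput_ratio=1.20, power_ratio=0.75)
print(f"{(gain - 1) * 100:.0f}% more throughput per watt")
```

Both ratios have to be taken at the same concurrency level; mixing a throughput figure from one load point with a power figure from another gives a misleading result.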
18. Online Review: Calxeda’s ARM Server Tested
Reviewer’s Key Takeaways:
– For scale-out workloads, Calxeda’s ARM-based scale-out
hardware architecture is very promising.
– Microbenchmarks show Calxeda ECX-1000 ~10% behind
Intel Atom N2800 @1.86 GHz
– “Real World” Application Benchmarking shows 70%+ higher
performance-per-watt than Intel Xeon E5 at mid to high user load
– “Calxeda really did it: each server needs about 8.3W (200W/24),
measured at the wall…about 6W (at 1.4GHz) per server node…”
– “So on the one hand, no, the current Calxeda servers are no
Intel Xeon killers (yet). However, we feel that Calxeda's
ECX-1000 server node is revolutionary technology.”
19. ARM® Cortex-A15
• Based on ARMv7-A architecture
– Ensures software application compatibility
with other Cortex-A processors
• LPAE support up to 1TB physical memory
• Full hardware virtualization support
• From ARM: delivers 2X performance over
Cortex-A9 processor with similar power
• big.LITTLE configuration support for
mobile devices
20. Datacenter Focused Roadmap
[Roadmap figure: timeline axis 2013, 2014, 2015]
• Highbank: ECX-1000 (4 Core, ARM® Cortex A9)
– Power Efficient Solution for Storage and Web Hosting
• Midway: ECX-2000 (4 Core, ARM® Cortex A15)
– Performance/$ for Cloud and Analytics
• Sarita (ARM® Cortex A57)
– Compatible 64-bit On-Ramp for Early Access and Ecosystem Enablement
• Lago (ARM® Cortex A57)
– Flagship 64-bit Product for a Broader Application Set
– 3rd Generation Calxeda Fabric and I/O
• “Triple Play”: 3 Generations of Pin-Compatible SOCs
[Source] Calxeda public SOC roadmap (June 2013)
21. “Midway”: Calxeda ECX-2000
Compared to Calxeda’s Cortex-A9 SOC
(ECX-1000), the “Midway” SOC delivers:
– 1.5X more single-thread performance
– 2X more floating point performance
– 3X STREAM (memory b/w) performance
– 4X+ more physical memory support (16GB+)
– Same performance-per-Watt
Plan to update Anandtech benchmark report
23. ARM® Business Model
• ARM does not make or sell SOCs.
• Instead, ARM licenses IP and technology
to partners (like Calxeda) who design and
build System-on-Chips (SOCs) for various
industries and markets.
• Calxeda is focused exclusively on bringing
ARM-based technology to the datacenter.
– Calxeda provides own IP (e.g. Fabric) as
additional value for servers.
24. EnergyCore® architecture at a glance
A complete building block for hyper-efficient computing
• EnergyCore Management Engine
– Advanced system, power and fabric management for energy-proportional computing
• I/O Controllers
– Standard drivers, standard interfaces. No surprises.
• Processor Complex
– Multi-core ARM® processors integrated with high bandwidth memory controllers
• EnergyCore Fabric Switch
– Integrated high-performance fabric provides inter-node connectivity with industry standard networking
25. EnergyCore® Fabric (F1/F2)
Integrated 80Gb (8x10Gb cross-bar)
Fabric Switch:
• Up to 5 external links:
– Dynamic bandwidth: 1Gb to 10 Gb
per link
– < 200 Nano-Seconds latency,
node to node
• 3 internal links (to the SOC):
– 2x 10Gb Ethernet ports to the OS
– 1x 10Gb Ethernet port to Mgmt
– Transparent to OS and software
• Topology agnostic
→ Eliminates Top-of-Rack-Switch ports & cabling
→ Enables extreme density; lowers cost and power
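The link counts on this slide can be checked against the headline 80Gb figure. The sketch below assumes that the eight crossbar ports are the five external links plus the three internal links, each running at up to 10Gb; that mapping is a reading of the slide, not a confirmed Calxeda specification, and the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FabricSwitch:
    """Port budget of one fabric switch, using the counts given on the slide."""

    external_links: int = 5      # node-to-node links, 1Gb to 10Gb each
    internal_links: int = 3      # 2x Ethernet to the OS + 1x to management
    max_gbps_per_link: int = 10  # upper bound per link

    @property
    def total_ports(self) -> int:
        return self.external_links + self.internal_links

    @property
    def aggregate_gbps(self) -> int:
        # Assumes every port runs at its maximum rate.
        return self.total_ports * self.max_gbps_per_link


switch = FabricSwitch()
print(f"{switch.total_ports} ports, {switch.aggregate_gbps} Gb aggregate")
```

Because external links can be scaled down to 1Gb, the aggregate is an upper bound, not a fixed operating point.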
27. Target Workloads
• Data-Intensive Applications:
– Storage (scale-out, distributed storage)
• e.g. Ceph, Gluster, etc.
– Analytics (NoSQL, MapReduce, distributed
databases)
• e.g. Hadoop, Cassandra, etc.
• Distributed, State-less Applications
– Web Front End
– Caching Servers
– Content Distribution Networks (CDN)
28. Use-Case: Storage via Ceph
• Official Ceph “Dumpling”+ release now supports
Calxeda-based platforms
• Initial benchmarks complete (with x86 comparison)
– Even without optimizations, performance is promising
• Identified optimization areas (under investigation):
– Potentially use NEON instructions for CRC32
– Implement zero-copy on OSDs
– Transition reads/writes to bufferlists
– Optimize client side too – librados/librbd
29. Use-Case: Storage via Ceph
With the same number of HDDs,
the Calxeda-based system delivers
50% more performance than
traditional x86 servers.
30. The AAEON CRS-200S-2R Advantage
An ARM-based, lower cost, higher performance server platform for scale-out storage
Calxeda’s ARM-based SOCs:
• Energy Efficient
• More cores per HDD
• Lower system power
• High Bandwidth Fabric
• Multi-10Gb links for
data-intensive apps
Compared to traditional x86-based,
2U rack mount servers, the AAEON
CRS-200S-2R server platform is:
✓ 35% Lower TCO*
✓ 66% Less Rack Space
✓ 50% Higher performance
31. Summary
• Even 64-bit ARM processors are not ideal for
every single workload.
• However, scale-out, data-intensive workloads
can leverage ARM’s energy-efficiency to provide a
significantly better TCO.
• For the server market (especially with scale-out
apps), replacing the CPU core is not enough.
– Look for SOCs that optimize “between the nodes” in a
cluster (e.g. fabric interconnects will help dramatically)
• Interested in joining the “ARM revolution”?
– Contact us! – John Mao, john.mao@calxeda.com