The ever-increasing energy requirements of supercomputers and server farms are driving the scientific and industrial communities to give deeper consideration to the energy efficiency of computing equipment. This contribution addresses the issue by proposing a cluster of ARM processors for high-performance computing. The cluster is composed of five BeagleBoard-xM boards, with one board managing the cluster and the others executing the actual processing. The software platform is based on the Ångström GNU/Linux distribution and is equipped with a distributed file system to ease sharing data and code among the nodes of the cluster, as well as with tools for managing tasks and monitoring the status of each node. The computational capabilities of the cluster have been assessed through High-Performance Linpack and a cluster-wide speaker diarization algorithm, while power consumption has been measured using a clamp meter. Experimental results obtained in the speaker diarization task show that the energy efficiency of the BeagleBoard-xM cluster is comparable to that of a laptop computer equipped with an Intel Core2 Duo T8300 running at 2.4 GHz. Furthermore, once the bottleneck due to the Ethernet interface is removed, the BeagleBoard-xM cluster achieves superior energy efficiency.
Low Power High-Performance Computing on the BeagleBoard Platform
E. Principi, V. Colagiacomo, S. Squartini, and F. Piazza
A3Lab, Department of Information Engineering
Università Politecnica delle Marche
5th European DSP Education and Research Conference
13th and 14th September, 2012, Amsterdam, Netherlands
Outline
1 Introduction
2 Purpose of this work
3 The BeagleCluster
Hardware Platform
Software Platform
4 Experiments
High-Performance Linpack
Matrix Multiplication
Speaker Diarization
Analysis of power consumption
5 Conclusions and Future Developments
Introduction
High-performance computing clusters are employed in computationally intensive tasks (e.g., weather prediction, astronomical modelling).
Usually, they are evaluated only in terms of Floating Point Operations Per Second (FLOPS) (e.g., the Top500 list).
The costs of energy and infrastructure exceed the costs of the computational devices, and this gap is expected to grow by 2014 [Belady, 2007].

A new metric: FLOPS/Watt
Tendency in the industry
• Use of processors traditionally employed in the mobile world.
• Canonical built a 42-core ARM cluster for compiling the Ubuntu distribution.
• Calxeda developed the EnergyCore ECX-1000 series of server-on-a-chip devices based on the ARM Cortex-A9.
• Hewlett-Packard Redstone servers:
  • Four rack chassis replace 2800 conventional servers
  • Energy saving: 90%
  • Space saving: 94%
  • Currently employed in the TryStack free cloud service (http://trystack.org)
Purpose of this work
Develop
Develop an energy-efficient cluster computer composed of inexpensive off-the-shelf hardware and open software, and propose it to the scientific community.

Evaluate
Evaluate the cluster both through conventional benchmarks and through a real-time constrained speech processing application.

Measure
Measure the power consumption of the cluster, assess its energy efficiency, and compare it with a laptop PC.
Hardware Platform

Cluster description
The BeagleCluster is composed of five BeagleBoard-xM boards.

BeagleBoard-xM specifications:
Processor: TI DM3730
ARM subsystem: Cortex-A8 @ 1 GHz
DSP subsystem: C64x+ @ 800 MHz
Graphics accelerator: PowerVR SGX @ 200 MHz
RAM: 512 MB DDR @ 200 MHz
Network interface: Ethernet 10/100
Cluster description (cont.)
• Asymmetric topology: one head node, four worker nodes.
• Nodes are connected to a Hewlett-Packard ProCurve 1410-8G switch through the BeagleBoard-xM 100 Mbit Ethernet interface.
• Nodes are powered by a Lambda AC-DC power supply.
Software Platform
• Operating system: Ångström GNU/Linux distribution (worker nodes do not have a GUI).
• Tool-chain: CodeSourcery.
• Network File System (NFS): data and code are shared throughout the cluster.
• Cluster Command and Control (C3): a suite of tools for managing the cluster (e.g., terminating processes, rebooting worker nodes, pushing drive images).
• Message Passing Interface (Argonne National Laboratory MPICH2): an application programming interface that allows the exchange of messages and data among processes running on the nodes of a cluster; a minimal usage sketch follows below.
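To make the MPI layer concrete, here is a minimal MPICH2-compatible C sketch of the head/worker pattern. It is illustrative only: the payload size and message flow are our assumptions, not the actual BeagleCluster code.

#include <mpi.h>
#include <stdio.h>

#define PAYLOAD 256                         /* placeholder message size */

int main(int argc, char *argv[])
{
    int rank, size;
    double buf[PAYLOAD] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

    if (rank == 0) {
        /* head node: send a buffer to every worker */
        for (int dst = 1; dst < size; dst++)
            MPI_Send(buf, PAYLOAD, MPI_DOUBLE, dst, 0, MPI_COMM_WORLD);
    } else {
        /* worker node: receive the buffer from the head */
        MPI_Recv(buf, PAYLOAD, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("worker %d received data\n", rank);
    }

    MPI_Finalize();
    return 0;
}

Such a program would be built with mpicc and launched with mpiexec -n 5, one process per board.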
Software Platform (cont.)
• Ganglia: offers a web interface used to monitor the cluster activity and to detect abnormal functioning.
High-Performance Linpack (HPL)
• HPL is the de facto standard benchmark for floating point performance measurement.
• It is employed in the Top500 and Green500 lists.
• HPL solves a dense system of linear equations using double precision arithmetic.
• Parallelism is obtained by means of MPI.
• Computation is based on BLAS (Vesperix ATLAS-ARM).
High-Performance Linpack (HPL) (cont.)
MFLOPS: 258.6
MFLOPS/W: 13.26

For comparison, the 500th entry of the Green500 list (June 2012), a Cray XT5 SixCore with Opteron Six Core 6C 2.6 GHz CPUs and XT4 internal interconnect, achieves 32.05 MFLOPS/W.

Note
Arithmetic operations are performed in double precision in the Vector Floating Point (VFP) unit: the NEON unit cannot be employed, since NEON on the Cortex-A8 handles only single precision floating point.
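As a sanity check (our own arithmetic, not a figure from the slides), the two results together imply an average power draw during the benchmark of

$$P \approx \frac{258.6\ \text{MFLOPS}}{13.26\ \text{MFLOPS/W}} \approx 19.5\ \text{W},$$

close to the 20.32 W measured on the cluster during the diarization experiments reported later.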
Matrix Multiplication
• This benchmark shows the performance improvement that can be obtained using NEON-optimized code.
• The benchmark multiplies an m × n matrix A with an n × p matrix B.
• It operates by dividing the rows of matrix A into groups and processing each group on a different node.
• Communication among nodes is based on MPI.

Platform | Execution time
BeagleCluster | 42.13 s
BeagleCluster w/ NEON | 5.18 s

NEON-optimized code significantly reduces the execution time ⇒ HPL performance can be improved by properly exploiting NEON (a kernel sketch follows below).
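To make this concrete, a minimal single-precision NEON matrix-multiplication kernel might look as follows. This is a sketch under our own assumptions (row-major storage, p a multiple of 4); the function and variable names are illustrative, not the benchmark's actual code.

#include <arm_neon.h>

/* C = A * B for an m x n matrix A and an n x p matrix B, row-major,
 * with p a multiple of 4. Four columns of C are accumulated per NEON
 * multiply-accumulate. Illustrative sketch only. */
void matmul_neon(const float *A, const float *B, float *C,
                 int m, int n, int p)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j += 4) {
            float32x4_t acc = vdupq_n_f32(0.0f);          /* 4 partial sums */
            for (int k = 0; k < n; k++) {
                float32x4_t b = vld1q_f32(&B[k * p + j]); /* B[k][j..j+3]   */
                acc = vmlaq_n_f32(acc, b, A[i * n + k]);  /* += A[i][k] * b */
            }
            vst1q_f32(&C[i * p + j], acc);                /* C[i][j..j+3]   */
        }
    }
}

Single precision is used because the Cortex-A8 NEON unit does not support double precision; in the cluster benchmark, each node would run such a kernel on its assigned group of rows of A.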
Speaker Diarization
• A speaker diarization algorithm detects “who speaks now”.
• The algorithm addressed here is based on the real-time implementation described in [Colagiacomo, et al. 2010].
• The calculation of the cross-correlation between the channel i signal $x_i(t)$ and the channel j signal $x_j(t)$ is the most computationally demanding part:

$$C_{ij}(t) = \max_{\tau}\left\{\mathrm{IFFT}\left[\mathrm{FFT}\big(x_i(t)\,x_j(t-\tau)\big) \bullet \mathrm{FFT}\big(w(t)\big)\right]\right\}$$

Here, $t$ is the time index, $\tau$ is the correlation lag, $w(t)$ is the Hamming window, and $\bullet$ denotes the element-wise product.
Speaker Diarization (cont.)
• Cluster-wide parallelism has been obtained by assigning the feature extraction stage of each channel to one of the worker nodes.
• The server process on the head node dispatches audio frames to the worker nodes through the MPI_Bcast instruction and performs the final classification (see the sketch after this list).
• Performance has been evaluated in terms of the Real-Time Factor (RTF), where an RTF of at most 1 means processing keeps up with real time:

$$\mathrm{RTF} = \frac{\text{Total execution time}}{\text{Speech segment duration}}$$
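A possible shape for the frame dispatch is sketched below; the frame length and function name are our assumptions, not the paper's code.

#include <mpi.h>

#define FRAME_LEN 1024            /* assumed frame length in samples */

/* One step of the dispatch loop. MPI_Bcast is called with the same
 * arguments on every rank: rank 0 (the head node) provides the data,
 * while the other ranks receive it into the same buffer. */
static void dispatch_frame(float frame[FRAME_LEN], int rank)
{
    /* on rank 0, `frame` has already been filled from the audio input */
    MPI_Bcast(frame, FRAME_LEN, MPI_FLOAT, 0, MPI_COMM_WORLD);
    /* on ranks > 0, `frame` now holds the broadcast audio, ready for
       the worker's feature extraction stage */
}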
Speaker Diarization (cont.)
• Audio data: the four lapel microphone signals of meeting ES2009b from the AMI corpus.
• Comparison with an Asus F9SG laptop (Intel Core2 Duo T8300 CPU running at 2.4 GHz, 2 GB of RAM).
• Power consumption is measured with the laptop's LCD monitor switched off.
Speaker Diarization (cont.)
Single-board implementation results
• Real-time execution is achieved through the NEON instruction set and by reducing the number of cross-correlations: the maximum of $C_{ij}(t)$ is searched incrementing $\tau$ by $\Delta\tau > 1$ (see the sketch below).

∆τ | Laptop RTF | BeagleBoard-xM RTF
1 | 2.47 | 12.73
16 | 0.25 | 1.02
32 | 0.18 | 0.63
64 | 0.14 | 0.44
128 | 0.12 | 0.36

The choice of ∆τ is critical both for the laptop and for the BeagleBoard-xM.
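In time-domain form, the decimated lag search amounts to the loop below. This is a sketch under our own assumptions; the actual implementation evaluates the correlations via FFT, as in the formula above, and every name here is illustrative.

/* Search the maximum cross-correlation between two frames, visiting
 * only lags spaced delta_tau apart; a larger delta_tau means fewer
 * correlations and a lower RTF, at the cost of lag resolution. */
float max_xcorr(const float *xi, const float *xj, int frame_len,
                int max_lag, int delta_tau)
{
    float best = -1e30f;
    for (int tau = -max_lag; tau <= max_lag; tau += delta_tau) {
        float c = 0.0f;
        for (int t = 0; t < frame_len; t++) {
            int u = t - tau;
            if (u >= 0 && u < frame_len)
                c += xi[t] * xj[u];        /* x_i(t) * x_j(t - tau) */
        }
        if (c > best)
            best = c;
    }
    return best;   /* estimate of max over tau of C_ij for this frame */
}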
Speaker Diarization (cont.)
Cluster-wide implementation results

∆τ | Single-board RTF | Five-node RTF
1 | 12.73 | 4.71
16 | 1.02 | 1.69
32 | 0.63 | 1.63
64 | 0.44 | 1.56
128 | 0.36 | 1.55

• The MPI version is almost 3 times as fast as the single-board one when ∆τ = 1.
• As ∆τ increases, the performance of the MPI implementation decreases: the communication overhead becomes the new bottleneck.
Speaker Diarization (cont.)
Cluster-wide implementation
• That the communication overhead is the bottleneck has been verified on a four-node cluster.
• Nodes read audio data directly from the local file system.
• One of the worker nodes performs both the feature extraction and the classification tasks.

∆τ | Five-node RTF | Four-node RTF (w/ local data)
1 | 4.71 | 3.35
16 | 1.69 | 0.33
32 | 1.63 | 0.23
64 | 1.56 | 0.18
128 | 1.55 | 0.16

By reducing the communication overhead, real-time execution can be achieved with ∆τ = 16.
Analysis of power consumption
BeagleCluster: 20.32 W
Laptop: 32.36 W

Energy ratio:

$$E_r = \frac{\mathrm{RTF}_{\text{cluster}} \cdot P_{\text{cluster}}}{\mathrm{RTF}_{\text{laptop}} \cdot P_{\text{laptop}}} \cong 1.2$$

The communication overhead limits the energy efficiency of the BeagleCluster.
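The measured RTFs at ∆τ = 1 reproduce this figure (our own arithmetic; the slides do not state explicitly which operating point the ratio refers to):

$$E_r = \frac{4.71 \cdot 20.32\ \text{W}}{2.47 \cdot 32.36\ \text{W}} \approx \frac{95.7}{79.9} \approx 1.2$$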
Energy ratio of the four-node cluster:

$$E_r \cong 0.69$$

With the communication overhead reduced, the BeagleCluster is more energy efficient than the laptop PC.
Conclusions
• A cluster computer based on the BeagleBoard-xM platform has been described.
• The cluster relies on open software for executing parallel tasks, managing the cluster, and monitoring the status of the nodes.
• High-Performance Linpack has been used to measure the number of floating point operations per second.
• The performance improvement that can be achieved using NEON-optimized code has been shown by means of a matrix multiplication benchmark.
• Processing time and power consumption have been measured by means of a cluster-wide speaker diarization algorithm to evaluate the real-time capabilities and the energy efficiency of the cluster.
Conclusions (cont.)
• Results showed that, using the 100 Mbit Ethernet interface, the BeagleCluster consumes 1.2 times the energy spent by the laptop PC.
• Removing the communication bottleneck, the BeagleCluster achieves a superior energy efficiency.
• The cost of the five-node cluster is 655 €. Compared to the laptop PC, whose cost is 1100 €, the BeagleCluster is roughly 450 € cheaper.
Future developments
• The software platform will be expanded with a resource manager and a scheduler to enable the execution of batch jobs.
• The energy efficiency will be assessed in a high-availability scenario, for example using the cluster for hosting websites.
• The use of more efficient hardware platforms (e.g., PandaBoards) and of the DM3730 DSP will be considered.
Thank you for your attention!
Emanuele Principi (e.principi@univpm.it)
Vito Colagiacomo (s1037562@studenti.univpm.it)
Stefano Squartini (s.squartini@univpm.it)
Francesco Piazza (f.piazza@univpm.it)
Appendix: clamp meter used for the power measurements
Manufacturer: AMPROBE
Model: LH41A
Measuring range: 0-40 A, DC or AC peak
Resolution: 1 mA in the 4 A range; 10 mA in the 40 A range
Accuracy: ±1.3% + 5 digits
Frequency range: DC (in DC mode); 40 Hz to 400 Hz (in AC mode)
References

H. W. Meuer, “The TOP500 Project: Looking Back Over 15 Years of Supercomputing Experience,” Informatik-Spektrum, vol. 31, no. 3, pp. 203–222, 2008. [Online]. Available: http://www.top500.org
C. L. Belady, “In the Data Center, Power and Cooling Cost More Than the IT Equipment It Supports,” Electronics Cooling Magazine, vol. 13, no. 1, May 2007.
W.-c. Feng and K. Cameron, “The Green500 List: Encouraging Sustainable Supercomputing,” IEEE Computer, vol. 40, no. 12, pp. 50–55, Dec. 2007. [Online]. Available: http://www.green500.org
I. Ahmad and S. Ranka, Eds., Handbook of Energy-Aware and Green Computing, 1st ed., ser. Information Science. Boca Raton, US: CRC Press, Jan. 2012.
S. Andrade, J. Dourado, and C. Maciel, “Low-power cluster using OMAP3530,” in Proc. of EDERC, Nice, France, Dec. 2010, pp. 220–224.
K. Fürlinger, C. Klausecker, and D. Kranzlmüller, “Towards energy efficient parallel computing on consumer electronic devices,” in Proc. of ICT-GLOW. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 1–9.
M. Brim, R. Flanery, A. Geist, B. Luethke, and S. L. Scott, “Cluster Command and Control (C3) Tool Suite,” Parallel and Distributed Computing Practices, vol. 4, no. 4, Dec. 2001.
Argonne National Laboratory, “MPICH2,” http://www.mcs.anl.gov/research/projects/mpich2/.
M. L. Massie, B. N. Chun, and D. E. Culler, “The Ganglia distributed monitoring system: design, implementation, and experience,” Parallel Computing, vol. 30, no. 7, pp. 817–840, 2004.
M. Moattar and M. Homayounpour, “A review on speaker diarization systems and approaches,” Speech Communication, vol. 54, no. 10, pp. 1065–1103, 2012.
E. Principi, R. Rotili, M. Wöllmer, F. Eyben, S. Squartini, and B. Schuller, “Real-Time Activity Detection in a Multi-Talker Reverberated Environment,” Cognitive Computation, pp. 1–12, 2012.
V. Colagiacomo, E. Principi, S. Cifani, and S. Squartini, “Real-Time Speaker Diarization on TI OMAP3530,” in Proc. of EDERC, Nice, France, Dec. 1st-2nd 2010.
InfiniBand Trade Association, “InfiniBand Architecture Specification Release 1.2.1,” Jan. 2008.
N. J. Boden, D. Cohen, R. E. Felderman, A. Kulawik, C. Seitz, J. N. Seizovic, and W. Su, “Myrinet: A Gigabit-per-second Local Area Network,” IEEE Micro, vol. 15, no. 1, pp. 29–36, Feb. 1995.