The ever-increasing energy requirements of supercomputers and server farms are driving the scientific and industrial communities to give deeper consideration to the energy efficiency of computing equipment. This contribution addresses the issue by proposing a cluster of ARM processors for high-performance computing. The cluster is composed of five BeagleBoard-xM boards, with one board managing the cluster and the others executing the actual processing. The software platform is based on the Ångström GNU/Linux distribution and is equipped with a distributed file system to ease sharing data and code among the nodes of the cluster, as well as with tools for managing tasks and monitoring the status of each node. The computational capabilities of the cluster have been assessed through High-Performance Linpack and a cluster-wide speaker diarization algorithm, while power consumption has been measured using a clamp meter. Experimental results obtained in the speaker diarization task show that the energy efficiency of the BeagleBoard-xM cluster is comparable to that of a laptop computer equipped with an Intel Core2 Duo T8300 running at 2.4 GHz. Furthermore, once the bottleneck due to the Ethernet interface is removed, the BeagleBoard-xM cluster achieves superior energy efficiency.
Low Power High-Performance Computing on the BeagleBoard Platform
E. Principi, V. Colagiacomo, S. Squartini, and F. Piazza
A3Lab, Department of Information Engineering
Università Politecnica delle Marche
5th European DSP Education and Research Conference
13th and 14th September, 2012, Amsterdam, Netherlands
Outline
1 Introduction
2 Purpose of this work
3 The BeagleCluster
Hardware Platform
Software Platform
4 Experiments
High-Performance Linpack
Matrix Multiplication
Speaker Diarization
Analysis of power consumption
5 Conclusions and Future Developments
Introduction
High-performance computing clusters are employed in computationally intensive tasks (e.g., weather prediction, astronomical modelling).
Usually, they are evaluated only in terms of Floating Point Operations Per Second (FLOPS) (e.g., the Top500 list).
The costs of energy and infrastructure exceed the costs of the computational devices, and this gap is expected to grow by 2014 [Belady, 2007].

A new metric: FLOPS/Watt
Tendency in the industry
• Use of processors traditionally employed in the mobile world.
• Canonical built a 42-core ARM cluster for compiling the Ubuntu distribution.
• Calxeda developed the EnergyCore ECX-1000 series of server-on-a-chip devices based on the ARM Cortex-A9.
• Hewlett-Packard Redstone servers:
  • Four rack chassis replace 2800 conventional servers
  • Energy saving: 90%
  • Space saving: 94%
  • Currently employed in the TryStack free cloud service (http://trystack.org)
Purpose of this work
Develop
Develop an energy-efficient cluster computer composed of inexpensive off-the-shelf hardware and open software, and propose it to the scientific community.

Evaluate
Evaluate the cluster both through conventional benchmarks and through a real-time constrained speech processing application.

Measure
Measure the power consumption of the cluster, assess its energy efficiency, and compare it with a laptop PC.
Hardware Platform

Cluster description
The BeagleCluster is composed of five BeagleBoard-xM boards.

BeagleBoard-xM specifications:
Processor: TI DM3730
ARM subsystem: Cortex-A8 @ 1 GHz
DSP subsystem: C64x+ @ 800 MHz
Graphics accelerator: PowerVR SGX @ 200 MHz
RAM: 512 MB DDR @ 200 MHz
Network interface: Ethernet 10/100
Cluster description (cont.)
• Asymmetric topology: one head node, four worker nodes.
• Nodes are connected to a Hewlett-Packard ProCurve 1410-8G switch through the BeagleBoard-xM 100 Mbit Ethernet interface.
• Nodes are powered by a Lambda AC-DC power supply.
Software Platform
• Operating system: Ångström GNU/Linux distribution (worker nodes do not have a GUI).
• Tool-chain: CodeSourcery.
• Network File System (NFS): data and code are shared throughout the cluster.
• Cluster Command and Control (C3): a suite of tools for managing the cluster (e.g., terminating processes, rebooting worker nodes, pushing drive images).
• Message Passing Interface (Argonne National Laboratory MPICH2): an application programming interface that allows the exchange of messages and data among processes running on the nodes of a cluster; a minimal usage sketch follows below.
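To make the MPI layer concrete, here is a minimal MPICH2-compatible C sketch of the head/worker pattern. It is illustrative only: the payload size and message flow are our assumptions, not the actual BeagleCluster code.

#include <mpi.h>
#include <stdio.h>

#define PAYLOAD 256                         /* placeholder message size */

int main(int argc, char *argv[])
{
    int rank, size;
    double buf[PAYLOAD] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id   */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* number of processes */

    if (rank == 0) {
        /* head node: send a buffer to every worker */
        for (int dst = 1; dst < size; dst++)
            MPI_Send(buf, PAYLOAD, MPI_DOUBLE, dst, 0, MPI_COMM_WORLD);
    } else {
        /* worker node: receive the buffer from the head */
        MPI_Recv(buf, PAYLOAD, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("worker %d received data\n", rank);
    }

    MPI_Finalize();
    return 0;
}

Such a program would be built with mpicc and launched with mpiexec -n 5, one process per board.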
Software Platform (cont.)
• Ganglia: offers a web interface used to monitor the cluster activity and to detect abnormal functioning.
High-Performance Linpack (HPL)
• HPL is the de facto standard benchmark for floating point performance measurement.
• It is employed in the Top500 and Green500 lists.
• HPL solves a dense system of linear equations using double precision arithmetic.
• Parallelism is obtained by means of MPI.
• Computation is based on BLAS (Vesperix ATLAS-ARM).
High-Performance Linpack (HPL) (cont.)
MFLOPS: 258.6
MFLOPS/W: 13.26

For comparison, the 500th entry of the Green500 list (June 2012), a Cray XT5 SixCore with Opteron Six Core 6C 2.6 GHz CPUs and XT4 internal interconnect, achieves 32.05 MFLOPS/W.

Note
Arithmetic operations are performed in double precision in the Vector Floating Point (VFP) unit: the NEON unit cannot be employed, since NEON on the Cortex-A8 handles only single precision floating point.
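As a sanity check (our own arithmetic, not a figure from the slides), the two results together imply an average power draw during the benchmark of

$$P \approx \frac{258.6\ \text{MFLOPS}}{13.26\ \text{MFLOPS/W}} \approx 19.5\ \text{W},$$

close to the 20.32 W measured on the cluster during the diarization experiments reported later.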
Matrix Multiplication
• This benchmark shows the performance improvement that can be obtained using NEON-optimized code.
• The benchmark multiplies an m × n matrix A with an n × p matrix B.
• It operates by dividing the rows of matrix A into groups and processing each group on a different node.
• Communication among nodes is based on MPI.

Platform | Execution time
BeagleCluster | 42.13 s
BeagleCluster w/ NEON | 5.18 s

NEON-optimized code significantly reduces the execution time ⇒ HPL performance can be improved by properly exploiting NEON (a kernel sketch follows below).
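To make this concrete, a minimal single-precision NEON matrix-multiplication kernel might look as follows. This is a sketch under our own assumptions (row-major storage, p a multiple of 4); the function and variable names are illustrative, not the benchmark's actual code.

#include <arm_neon.h>

/* C = A * B for an m x n matrix A and an n x p matrix B, row-major,
 * with p a multiple of 4. Four columns of C are accumulated per NEON
 * multiply-accumulate. Illustrative sketch only. */
void matmul_neon(const float *A, const float *B, float *C,
                 int m, int n, int p)
{
    for (int i = 0; i < m; i++) {
        for (int j = 0; j < p; j += 4) {
            float32x4_t acc = vdupq_n_f32(0.0f);          /* 4 partial sums */
            for (int k = 0; k < n; k++) {
                float32x4_t b = vld1q_f32(&B[k * p + j]); /* B[k][j..j+3]   */
                acc = vmlaq_n_f32(acc, b, A[i * n + k]);  /* += A[i][k] * b */
            }
            vst1q_f32(&C[i * p + j], acc);                /* C[i][j..j+3]   */
        }
    }
}

Single precision is used because the Cortex-A8 NEON unit does not support double precision; in the cluster benchmark, each node would run such a kernel on its assigned group of rows of A.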
Speaker Diarization
• A speaker diarization algorithm detects “who speaks now”.
• The algorithm addressed here is based on the real-time implementation described in [Colagiacomo, et al. 2010].
• The calculation of the cross-correlation between the channel i signal $x_i(t)$ and the channel j signal $x_j(t)$ is the most computationally demanding part:

$$C_{ij}(t) = \max_{\tau}\left\{\mathrm{IFFT}\left[\mathrm{FFT}\big(x_i(t)\,x_j(t-\tau)\big) \bullet \mathrm{FFT}\big(w(t)\big)\right]\right\}$$

Here, $t$ is the time index, $\tau$ is the correlation lag, $w(t)$ is the Hamming window, and $\bullet$ denotes the element-wise product.
Speaker Diarization (cont.)
• Cluster-wide parallelism has been obtained by assigning the feature extraction stage of each channel to one of the worker nodes.
• The server process on the head node dispatches audio frames to the worker nodes through the MPI_Bcast instruction and performs the final classification (see the sketch after this list).
• Performance has been evaluated in terms of the Real-Time Factor (RTF), where an RTF of at most 1 means processing keeps up with real time:

$$\mathrm{RTF} = \frac{\text{Total execution time}}{\text{Speech segment duration}}$$
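A possible shape for the frame dispatch is sketched below; the frame length and function name are our assumptions, not the paper's code.

#include <mpi.h>

#define FRAME_LEN 1024            /* assumed frame length in samples */

/* One step of the dispatch loop. MPI_Bcast is called with the same
 * arguments on every rank: rank 0 (the head node) provides the data,
 * while the other ranks receive it into the same buffer. */
static void dispatch_frame(float frame[FRAME_LEN], int rank)
{
    /* on rank 0, `frame` has already been filled from the audio input */
    MPI_Bcast(frame, FRAME_LEN, MPI_FLOAT, 0, MPI_COMM_WORLD);
    /* on ranks > 0, `frame` now holds the broadcast audio, ready for
       the worker's feature extraction stage */
}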
Speaker Diarization (cont.)
• Audio data: the four lapel microphone signals of meeting ES2009b from the AMI corpus.
• Comparison with an Asus F9SG laptop (Intel Core2 Duo T8300 CPU running at 2.4 GHz, 2 GB of RAM).
• Power consumption is measured with the laptop's LCD monitor switched off.
Speaker Diarization (cont.)
Single-board implementation results
• Real-time execution is achieved through the NEON instruction set and by reducing the number of cross-correlations: the maximum of $C_{ij}(t)$ is searched incrementing $\tau$ by $\Delta\tau > 1$ (see the sketch below).

∆τ | Laptop RTF | BeagleBoard-xM RTF
1 | 2.47 | 12.73
16 | 0.25 | 1.02
32 | 0.18 | 0.63
64 | 0.14 | 0.44
128 | 0.12 | 0.36

The choice of ∆τ is critical both for the laptop and for the BeagleBoard-xM.
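In time-domain form, the decimated lag search amounts to the loop below. This is a sketch under our own assumptions; the actual implementation evaluates the correlations via FFT, as in the formula above, and every name here is illustrative.

/* Search the maximum cross-correlation between two frames, visiting
 * only lags spaced delta_tau apart; a larger delta_tau means fewer
 * correlations and a lower RTF, at the cost of lag resolution. */
float max_xcorr(const float *xi, const float *xj, int frame_len,
                int max_lag, int delta_tau)
{
    float best = -1e30f;
    for (int tau = -max_lag; tau <= max_lag; tau += delta_tau) {
        float c = 0.0f;
        for (int t = 0; t < frame_len; t++) {
            int u = t - tau;
            if (u >= 0 && u < frame_len)
                c += xi[t] * xj[u];        /* x_i(t) * x_j(t - tau) */
        }
        if (c > best)
            best = c;
    }
    return best;   /* estimate of max over tau of C_ij for this frame */
}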
Speaker Diarization (cont.)
Cluster-wide implementation results

∆τ | Single-board RTF | Five-node RTF
1 | 12.73 | 4.71
16 | 1.02 | 1.69
32 | 0.63 | 1.63
64 | 0.44 | 1.56
128 | 0.36 | 1.55

• The MPI version is almost 3 times as fast as the single-board one when ∆τ = 1.
• As ∆τ increases, the performance of the MPI implementation decreases: the communication overhead becomes the new bottleneck.
Speaker Diarization (cont.)
Cluster-wide implementation
• That the communication overhead is the bottleneck has been verified on a four-node cluster.
• Nodes read audio data directly from the local file system.
• One of the worker nodes performs both the feature extraction and the classification tasks.

∆τ | Five-node RTF | Four-node RTF (w/ local data)
1 | 4.71 | 3.35
16 | 1.69 | 0.33
32 | 1.63 | 0.23
64 | 1.56 | 0.18
128 | 1.55 | 0.16

By reducing the communication overhead, real-time execution can be achieved with ∆τ = 16.
Analysis of power consumption
BeagleCluster: 20.32 W
Laptop: 32.36 W

Energy ratio:

$$E_r = \frac{\mathrm{RTF}_{\text{cluster}} \cdot P_{\text{cluster}}}{\mathrm{RTF}_{\text{laptop}} \cdot P_{\text{laptop}}} \cong 1.2$$

The communication overhead limits the energy efficiency of the BeagleCluster.
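The measured RTFs at ∆τ = 1 reproduce this figure (our own arithmetic; the slides do not state explicitly which operating point the ratio refers to):

$$E_r = \frac{4.71 \cdot 20.32\ \text{W}}{2.47 \cdot 32.36\ \text{W}} \approx \frac{95.7}{79.9} \approx 1.2$$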
Energy ratio of the four-node cluster:

$$E_r \cong 0.69$$

With the communication overhead reduced, the BeagleCluster is more energy efficient than the laptop PC.
Conclusions
• A cluster computer based on the BeagleBoard-xM platform has been described.
• The cluster relies on open software for executing parallel tasks, managing the cluster, and monitoring the status of the nodes.
• High-Performance Linpack has been used to measure the number of floating point operations per second.
• The performance improvement that can be achieved using NEON-optimized code has been shown by means of a matrix multiplication benchmark.
• Processing time and power consumption have been measured by means of a cluster-wide speaker diarization algorithm to evaluate the real-time capabilities and the energy efficiency of the cluster.
Conclusions (cont.)
• Results showed that, using the 100 Mbit Ethernet interface, the BeagleCluster consumes 1.2 times the energy spent by the laptop PC.
• Removing the communication bottleneck, the BeagleCluster achieves a superior energy efficiency.
• The cost of the five-node cluster is 655 €. Compared to the laptop PC, whose cost is 1100 €, the BeagleCluster is roughly 450 € cheaper.
Future developments
• The software platform will be expanded with a resource manager and a scheduler to enable the execution of batch jobs.
• The energy efficiency will be assessed in a high-availability scenario, for example using the cluster for hosting websites.
• The use of more efficient hardware platforms (e.g., PandaBoards) and of the DM3730 DSP will be considered.
Thank you for your attention!
Emanuele Principi (e.principi@univpm.it)
Vito Colagiacomo (s1037562@studenti.univpm.it)
Stefano Squartini (s.squartini@univpm.it)
Francesco Piazza (f.piazza@univpm.it)
Appendix: clamp meter used for the power measurements
Manufacturer: AMPROBE
Model: LH41A
Measuring range: 0-40 A, DC or AC peak
Resolution: 1 mA in the 4 A range; 10 mA in the 40 A range
Accuracy: ±1.3% + 5 digits
Frequency range: DC (in DC mode); 40 Hz to 400 Hz (in AC mode)
References

H. W. Meuer, “The TOP500 Project: Looking Back Over 15 Years of Supercomputing Experience,” Informatik-Spektrum, vol. 31, no. 3, pp. 203–222, 2008. [Online]. Available: http://www.top500.org
C. L. Belady, “In the Data Center, Power and Cooling Cost More Than the IT Equipment It Supports,” Electronics Cooling Magazine, vol. 13, no. 1, May 2007.
W.-c. Feng and K. Cameron, “The Green500 List: Encouraging Sustainable Supercomputing,” IEEE Computer, vol. 40, no. 12, pp. 50–55, Dec. 2007. [Online]. Available: http://www.green500.org
I. Ahmad and S. Ranka, Eds., Handbook of Energy-Aware and Green Computing, 1st ed., ser. Information Science. Boca Raton, US: CRC Press, Jan. 2012.
S. Andrade, J. Dourado, and C. Maciel, “Low-power cluster using OMAP3530,” in Proc. of EDERC, Nice, France, Dec. 2010, pp. 220–224.
K. Fürlinger, C. Klausecker, and D. Kranzlmüller, “Towards energy efficient parallel computing on consumer electronic devices,” in Proc. of ICT-GLOW. Berlin, Heidelberg: Springer-Verlag, 2011, pp. 1–9.
M. Brim, R. Flanery, A. Geist, B. Luethke, and S. L. Scott, “Cluster Command and Control (C3) Tool Suite,” Parallel and Distributed Computing Practices, vol. 4, no. 4, Dec. 2001.
Argonne National Laboratory, “MPICH2,” http://www.mcs.anl.gov/research/projects/mpich2/.
M. L. Massie, B. N. Chun, and D. E. Culler, “The Ganglia distributed monitoring system: design, implementation, and experience,” Parallel Computing, vol. 30, no. 7, pp. 817–840, 2004.
M. Moattar and M. Homayounpour, “A review on speaker diarization systems and approaches,” Speech Communication, vol. 54, no. 10, pp. 1065–1103, 2012.
E. Principi, R. Rotili, M. Wöllmer, F. Eyben, S. Squartini, and B. Schuller, “Real-Time Activity Detection in a Multi-Talker Reverberated Environment,” Cognitive Computation, pp. 1–12, 2012.
V. Colagiacomo, E. Principi, S. Cifani, and S. Squartini, “Real-Time Speaker Diarization on TI OMAP3530,” in Proc. of EDERC, Nice, France, Dec. 1st-2nd 2010.
InfiniBand Trade Association, “InfiniBand Architecture Specification Release 1.2.1,” Jan. 2008.
N. J. Boden, D. Cohen, R. E. Felderman, A. Kulawik, C. Seitz, J. N. Seizovic, and W. Su, “Myrinet: A Gigabit-per-second Local Area Network,” IEEE Micro, vol. 15, no. 1, pp. 29–36, Feb. 1995.