Voltaire fca en_nov10

© 2010 Voltaire Inc.
November 19, 2010
Voltaire Fabric Collective Accelerator™ (FCA)
Ghislain de Jacquelot – ghislaindj@voltaire.com

MPI Collectives
► Collective Operations = Group Communication (All to All, One to All, All to One)
► Synchronous by nature = consume many "Wait" cycles on large clusters
► Popular examples:
• Reduce
• Allreduce
• Barrier
• Bcast
• Gather
• Allgather
[Figure: Collective Operations % of MPI Job Runtime (0 to 100%) for ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-Adapco STAR-CD, and Dacapo]
Your cluster might be spending half its time on idle collective cycles
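As a rough illustration of what the collectives listed above compute, the following sketch models each one on a plain Python list holding one value per rank. The function names are hypothetical and only mimic the semantics; this is not the MPI API and involves no network communication.

```python
from functools import reduce as fold
import operator


def model_reduce(values, op=operator.add, root=0):
    """All to One: only the root rank ends up with the combined value."""
    combined = fold(op, values)
    return [combined if rank == root else None for rank in range(len(values))]


def model_allreduce(values, op=operator.add):
    """All to All: every rank ends up with the combined value."""
    combined = fold(op, values)
    return [combined] * len(values)


def model_bcast(values, root=0):
    """One to All: every rank ends up with the root's value."""
    return [values[root]] * len(values)


def model_gather(values, root=0):
    """All to One: the root collects every rank's value, in rank order."""
    return [list(values) if rank == root else None for rank in range(len(values))]


def model_allgather(values):
    """All to All: every rank collects every rank's value."""
    return [list(values) for _ in values]
```

In each case no rank can finish before the slowest participant has contributed, which is the source of the "Wait" cycles mentioned above.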

The Challenge:
Collective Operations Scalability
► Grouping algorithms are unaware of the topology and inefficient
► Network congestion due to "All-to-All" communication
► Slow nodes & OS involvement impair scalability and predictability
► The more powerful servers get (GPUs, more cores), the poorer collectives scale in the fabric
[Figure: Expected vs. Actual]

The Voltaire InfiniBand Fabric:
Equipped for the Challenge
[Diagram: Grid Director 4036 switches + Unified Fabric Manager]
► Grid Director Switches: Fabric Processing Power
► Unified Fabric Manager (UFM): Topology Aware Orchestrator
Fabric computing in use to address the collective challenge

Introducing:
Voltaire Fabric Collective Accelerator
[Diagram: Grid Director 4036 switches + Unified Fabric Manager]
Breakthrough performance with no additional hardware
► Grid Director Switches: Fabric Processing Power
• Collective operations offloaded to switch CPUs
► FCA Agent:
• Inter-core processing localized & optimized
► Unified Fabric Manager (UFM): Topology Aware Orchestrator
► FCA Manager:
• Topology-based collective tree
• Separate virtual network
• IB multicast for result distribution
• Integration with job schedulers

Efficient Collectives with FCA
[Diagram: hosts with cores 1 to 8 attached to tiers of 36-port Grid Director 4036 switches; aggregation labels 36, 648, 11664]
1. Pre-config
2. Inter-core processing
3. 1st tier offload (diagram label: 648)
4. 2nd tier offload, result at root (diagram label: 11664)
5. Result distribution (single message)
6. Allreduce on 100K cores in 25 usec
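The staged flow on this slide can be sketched as a small in-process model: partial results are combined per node, then per first-tier switch, then at the root, and the single result is handed back to every core. This is a toy sketch under stated assumptions (a SUM reduction, evenly sized groups, hypothetical function and parameter names), not FCA's implementation.

```python
def hierarchical_allreduce(core_values, cores_per_node, nodes_per_switch):
    """Toy model of a staged SUM allreduce.

    core_values holds one contribution per core, ordered node by node.
    Returns (per-core results, number of partial results sent upward).
    """
    def chunks(items, size):
        return [items[i:i + size] for i in range(0, len(items), size)]

    # Inter-core processing: one partial sum per node.
    node_sums = [sum(group) for group in chunks(core_values, cores_per_node)]
    # First-tier offload: one partial sum per leaf switch.
    leaf_sums = [sum(group) for group in chunks(node_sums, nodes_per_switch)]
    # Second-tier offload: the final result is formed at the root.
    result = sum(leaf_sums)
    # Result distribution: every core receives the same value.
    messages_up = len(node_sums) + len(leaf_sums)
    return [result] * len(core_values), messages_up
```

With 32 cores, 8 cores per node, and 2 nodes per switch, only 4 + 2 = 6 partial results travel upward in this model, rather than every core exchanging messages with its peers.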

FCA Benefits:
Slashing Job Runtime
► Slashing Runtime
► Eliminating Runtime Variation
• OS jitter – eliminated in switches
• Traffic congestion – significantly lower number of messages
• Cross-application interference – collectives offloaded on a private virtual network
[Figure: IMB Allreduce, 2048 cores: completion time distribution (usec, 0 to 4000), server-based collectives vs. FCA-based collectives]
FCA: <30 usec; Open MPI: >3000 usec

FCA Benefits:
Unprecedented Scalability on HPC Clusters
[Figure: log-scale plot (1 to 10000) over 0 to 1200; series: ompi-Allreduce-bynode, ompi-Barrier-bynode, FCA-Allreduce, FCA-Barrier]
► Extreme performance improvement on raw collectives
► Scale according to number of switch hops, not number of nodes – O(log18)
► As process count increases
• % of time spent in MPI increases
• % of time spent in collectives increases
Enabling capability computing on HPC clusters
> 100X > 50%
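The "switch hops, not nodes" claim can be made concrete with a little arithmetic. The sketch below assumes a fat tree of 36-port switches in which each added tier multiplies the endpoint capacity by 18 (half the ports face downward); that assumption reproduces the 36, 648, and 11664 labels from the "Efficient Collectives with FCA" diagram. The function names are hypothetical.

```python
def fat_tree_capacity(tiers, ports=36):
    """Endpoints reachable with the given number of switch tiers (assumed model)."""
    if tiers < 1:
        raise ValueError("tiers must be at least 1")
    return ports * (ports // 2) ** (tiers - 1)


def tiers_needed(endpoints, ports=36):
    """Smallest tier count whose capacity covers the requested endpoints."""
    tiers = 1
    while fat_tree_capacity(tiers, ports) < endpoints:
        tiers += 1
    return tiers
```

Under this model, growing from 648 to 11664 endpoints adds a single tier, so the depth of the collective tree grows with the base-18 logarithm of the endpoint count.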

Additional Benefits
► Simple, fully integrated
• No changes to application required
► Tolerance to higher oversubscription (blocking) ratio
• Same performance at lower cost
► Enables use of non-blocking collectives
• Part of future MPI implementations
• FCA guarantees no computation power penalty

FCA
What is the alternative/competitive solution?
Compared: FCA vs. NIC-based offload
• Topology aware
• Network congestion elimination
• Fabric switches offload computation
• Result distribution based on IB multicast
• Support for non-blocking collectives
• OS "noise" reduction
• Expected MPI job runtime improvement: FCA 30-40%, NIC-based offload 1-2%
A Fabric Wide Challenge requires a Fabric Wide Solution

Benchmarks 1/4

FCA Impact on Fluent
Rating: Higher is Better!
[Figures: FLUENT rating at 88 ranks, InfiniBand vs. InfiniBand + FCA, for aircraft_2m, eddy_417k, sedan_4m, and truck_111m]
Setup: 11 x HP DL160; Intel Xeon 5550; Parallel FLUENT 12.1.4 (1998); CentOS 5.4; Open MPI 1.4.1

Benchmarks 2/4

System Configuration
Newest installation:
► Nodes type: NEC HPC 1812Rb-2
• CPU: 2 x Intel X5550, MEM: 6 x 2GB, IB: 1 x Infinihost DDR onboard
► System Configuration: 186 nodes
• 24 nodes per switch (DDR), 12 QDR links to tier2 switches (non-blocking)
► OS: CentOS 5.4
► Open MPI: 1.4.1
► FCA: 1.0_RC3 rev 2760
► UFM: 2.3 RC7
► Switch: 3.0.629
[Diagram labels: 24 x DDR, 4 x QDR]

IMB (Pallas) Benchmark Results
[Figure: Collective latency (usec, log scale 10 to 10000) vs. number of ranks (16 ranks per node, 0 to 2500); series: ompi-Allreduce, ompi-Reduce, ompi-Barrier, FCA-Allreduce, FCA-Reduce, FCA-Barrier]
Up to 100X Faster
[Figure: Collective run time reduction (%), FCA vs Open MPI, vs. number of ranks (0 to 2500); series: Allreduce, Reduce, Barrier]
Up to 99.5% Runtime Reduction
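The two callouts on this slide use different units, a speedup factor and a percentage runtime reduction. A minimal sketch of the conversion between them, with hypothetical helper names, may help when comparing the charts:

```python
def reduction_from_speedup(speedup):
    """Fraction of runtime removed when an operation becomes `speedup` times faster."""
    return 1.0 - 1.0 / speedup


def speedup_from_reduction(reduction):
    """Speedup factor implied by removing the given fraction of runtime."""
    return 1.0 / (1.0 - reduction)
```

By this arithmetic a 100X speedup corresponds to a 99% runtime reduction, and a 99.5% reduction corresponds to roughly a 200X speedup.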

OpenFOAM - I
► OpenFOAM
• Open source CFD solver produced by a commercial company, OpenCFD
• Used by many leading automotive companies
[Figure: OpenFOAM CFD aerodynamic benchmark (64 cores), runtime in seconds (0 to 5000): Open MPI 1.4.1 vs. Open MPI 1.4.1 + FCA]

Benchmarks 3/4

System Configuration
► Nodes type: NEC HPC
• CPU: Nehalem X5560 2.8 GHz, 4 cores * 2 sockets, IB: 1 x Infinihost DDR HCA
► System Configuration: 700 nodes
• 30 nodes per switch (DDR), 6 QDR links to tier2 switches (oversubscribed)
► OS: Scientific Linux 5.3
► Open MPI: 1.4.1
► FCA: 1.1
► UFM: 2.3
► Switch: 3.0.629
[Diagram labels: 30 x DDR, 3 x QDR]
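The "non-blocking" and "oversubscribed" labels in the two system configurations can be checked with a short calculation. The sketch below assumes 4x InfiniBand links at their signaling rates (DDR 20 Gb/s, QDR 40 Gb/s); the link width is an assumption, since the slides do not state it, and the function name is hypothetical.

```python
DDR_4X_GBPS = 20  # assumed 4x DDR signaling rate
QDR_4X_GBPS = 40  # assumed 4x QDR signaling rate


def oversubscription(down_links, down_gbps, up_links, up_gbps):
    """Ratio of host-facing bandwidth to uplink bandwidth on one leaf switch."""
    return (down_links * down_gbps) / (up_links * up_gbps)


# 186-node system: 24 DDR nodes per switch, 12 QDR uplinks.
first_system = oversubscription(24, DDR_4X_GBPS, 12, QDR_4X_GBPS)
# 700-node system: 30 DDR nodes per switch, 6 QDR uplinks.
second_system = oversubscription(30, DDR_4X_GBPS, 6, QDR_4X_GBPS)
```

Under these assumptions the first system comes out at 1:1 (non-blocking) and the second at 2.5:1, which is the kind of oversubscribed fabric the "Additional Benefits" slide says FCA tolerates.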

OpenFOAM - II
► ERCOFTAC UFR 2-02
• http://qnet-ercoftac.cfms.org.uk/index.php?title=Flow_past_cylinder
• Used in many areas of engineering, including civil and environmental
• Run with OpenFOAM (pimpleFoam solver)
[Figure: ERCOFTAC UFR 2-02: Flow past a square cylinder (256 cores), scale 0 to 4000: Open MPI 1.4.1 vs. FCA]

Molecular Dynamics: LS1-Mardyn
► The case is 50000 molecules, single Lennard-Jones, with a homogeneous distribution of molecules at the beginning of simulation time.
► "agglo" uses a custom reduce operator (not supported by FCA), while "split" uses a standard one
>95% Improvement

Benchmarks 4/4

Setup
► 80 x BL460 Blades each with two Intel(R) Xeon(R) CPU X5670 @ 2.93 GHz
► Voltaire QDR InfiniBand
► Platform MPI 8.0
► Fluent version 12.1
► Star-CD version 4.12
192 cores per enclosure

Fluent 192 Cores
Rating: Higher is Better
[Figures: FLUENT rating at 192 cores, PMPI vs. PMPI + FCA, for truck_poly_14m, truck_14m, and truck_111m]

Star-CD A-Class benchmark 192 cores
Runtime – Lower is Better

Logistics & Roadmap

FCA Ordering & Packaging
SWL-00347 FCA Add-on License for 1 node
SWL-00344 UFM-FCA Bundle License for 1 node
► Switch CPU software shipping automatically on all switches starting from version 3.0
• Recommended to upgrade to latest version
► FCA Add-on package includes:
• FCA Manager - add-on to UFM
• OMA - host add-on for Open MPI (not required for other MPIs once supported)
► Bundle includes the above as well as UFM itself
► FCA license is installed on the UFM server

FCA Roadmap
► FCA v1.1 (Available Q2 2010)
• Collective Operations
◦ MPI_Reduce, MPI_Allreduce (MAX & SUM)
◦ MPI_Bcast
◦ Integer & floating point (32/64), up to 8 elements (128 byte)
◦ MPI_Barrier
• Topologies
◦ Fat Tree
◦ HyperScale
◦ Torus
• MPI
◦ Open MPI
◦ SDK available for MPI integration
► FCA v2.0 (Available Q4 2010)
• Allgather
• Support for all well known arithmetic functions for Reduce/Allreduce (Min, XOR, etc)
• Increased Message size for Bcast, Reduce & Allreduce

FCA SDK – Integration with Additional MPIs
► Easy to use software development kit
► Integration to be performed by MPI vendor
► Package includes:
• Documentation
• High level & flow presentation
• Software packages
◦ Dynamically linked library – binary only
◦ Header files
◦ Sample application

Coming Soon:
Platform MPI (formerly HP MPI) Support
► Platform MPI version 8.x - Q3 2010
► Initial benchmarking expected end of Q2 2010
► Other MPI vendors evaluating the technology as well
• Leveraging Voltaire SDK

Voltaire Fabric Collective Accelerator
Summary
► Fabric computing offload
• Combination of SW & HW in a single solution
• Offloading blocking computational tasks
• Algorithms leveraging the topology for computation (trees)
► Extreme MPI performance & scalability
• Capability computing on commodity clusters
• Two orders of magnitude, hundred-times faster collective runtime
• Scale by number of hops, not number of nodes
• Variation eliminated - Consistent results
► Transparent to the application
• Plug & play - No need for code changes
Accelerate your fabric!

Thank You
