© 2010 Voltaire Inc.
November 19, 2010
Voltaire Fabric Collective Accelerator™ (FCA)
Ghislain de Jacquelot – ghislaindj@voltaire.com
MPI Collectives
► Collective Operations = Group Communication (All to All, One to All, All to One)
► Synchronous by nature = consume many “Wait” cycles on large clusters
► Popular examples:
• Reduce
• Allreduce
• Barrier
• Bcast
• Gather
• Allgather
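To make the semantics of these collectives concrete, here is a minimal single-process sketch that models each rank's contribution as one list entry. It is not MPI code and does no communication; the function names are hypothetical and chosen only to mirror the operations listed above.

```python
from functools import reduce as fold
from operator import add

def reduce_to_root(values, op=add):
    """All to One: combine every rank's value; only the root gets the result."""
    return fold(op, values)

def allreduce(values, op=add):
    """All to All: combine every rank's value; every rank gets the result."""
    result = fold(op, values)
    return [result] * len(values)

def bcast(root_value, num_ranks):
    """One to All: every rank receives the root's value."""
    return [root_value] * num_ranks

def gather(values):
    """All to One: the root receives the list of every rank's value."""
    return list(values)

def allgather(values):
    """All to All: every rank receives the list of every rank's value."""
    return [list(values) for _ in values]

per_rank = [1, 2, 3, 4]          # one value per (simulated) rank
print(reduce_to_root(per_rank))  # 10
print(allreduce(per_rank, max))  # [4, 4, 4, 4]
```

In a real job every rank blocks until the operation completes, which is where the "Wait" cycles mentioned above come from.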
[Chart: Collective Operations % of MPI Job Runtime (percentage axis, 0 to 100) for ANSYS FLUENT, SAGE, CPMD, LSTC LS-DYNA, CD-Adapco STAR-CD and Dacapo]
Your cluster might be spending half its time on idle collective cycles
The Challenge:
Collective Operations Scalability
► Grouping algorithms are unaware of the topology and inefficient
► Network congestion due to “All-to-All” communication
► Slow nodes & OS involvement impair scalability and predictability
► The more powerful servers get (GPUs, more cores), the poorer collectives scale in the fabric
[Chart: expected vs. actual scaling]
The Voltaire InfiniBand Fabric:
Equipped for the Challenge
[Diagram: 4036 switch faceplates]
► Grid Director Switches: Fabric Processing Power
► Unified Fabric Manager (UFM): Topology Aware Orchestrator
Fabric computing in use to address the collective challenge
Introducing:
Voltaire Fabric Collective Accelerator
[Diagram: 4036 switch faceplates]
► Grid Director Switches: Fabric Processing Power
Breakthrough performance with no additional hardware
► Grid Director Switches: Collective operations offloaded to switch CPUs
► FCA Agent: Inter-core processing localized & optimized
► Unified Fabric Manager (UFM): Topology Aware Orchestrator
► FCA Manager:
• Topology-based collective tree
• Separate Virtual network
• IB multicast for result distribution
• Integration with job schedulers
Efficient Collectives with FCA
[Diagram: 8-core servers attached to tiers of 4036 switches; count labels on the tiers: 36, 648, 11664]
1. Pre-config
2. Inter-core processing
3. 1st tier offload
4. 2nd tier offload (result at root)
5. Result distribution (single message)
6. Allreduce on 100K cores in 25 usec
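The labels 36, 648 and 11664 on this slide match the maximum number of end ports in a one-, two- and three-tier fat tree built from 36-port switches, where every switch below the top tier uses 18 ports down and 18 up. The sketch below reproduces that arithmetic. Reading the slide's numbers this way is an assumption, and `fat_tree_end_ports` is a hypothetical helper name, not part of FCA or UFM.

```python
def fat_tree_end_ports(radix: int, tiers: int) -> int:
    """Maximum end ports of a full (non-blocking) fat tree.

    Assumes the standard folded-Clos layout: every switch below the top
    tier splits its ports evenly (radix/2 down, radix/2 up), and top-tier
    switches use all of their ports downward.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be an even number >= 2")
    if tiers < 1:
        raise ValueError("tiers must be >= 1")
    half = radix // 2
    return 2 * half ** tiers

# 36-port switches, one to three tiers
capacities = [fat_tree_end_ports(36, t) for t in (1, 2, 3)]
print(capacities)  # [36, 648, 11664]
```

Each added tier multiplies the reachable end ports by 18, which is why a reduction tree that follows the switch hierarchy stays shallow even for very large fabrics.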
FCA Benefits:
Slashing Job Runtime
► Slashing Runtime
► Eliminating Runtime Variation
• OS jitter – eliminated in switches
• Traffic congestion – significantly lower number of messages
• Cross-application interference – collectives offloaded on a private virtual network
[Chart: IMB Allreduce, 2048 cores. Completion time distribution (usec, 0 to 4000), server-based collectives vs. FCA-based collectives. FCA: <30 usec; Open MPI: >3000 usec]
FCA Benefits:
Unprecedented Scalability on HPC Clusters
[Chart: log-scale axis from 1 to 10000 against an axis from 0 to 1200; series: ompi-Allreduce-bynode, ompi-Barrier-bynode, FCA-Allreduce, FCA-Barrier]
► Extreme performance improvement on raw collectives
► Scale according to number of switch hops, not number of nodes – O(log18)
► As process count increases
• % of time spent in MPI increases
• % of time spent in collectives increases
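A rough way to see what "scale by switch hops" means: with 36-port switches (18 down-ports each), the number of switch tiers grows with the base-18 logarithm of the node count, whereas a host-based recursive-doubling exchange needs at least ceil(log2 P) communication rounds for P processes. The sketch below compares the two counts. The helper names are hypothetical, the full fat-tree layout is an assumption, and recursive doubling is only one of several algorithms an MPI library may choose.

```python
def fabric_tiers_needed(nodes: int, radix: int = 36) -> int:
    """Smallest number of fat-tree tiers whose capacity 2 * (radix/2)**tiers
    covers `nodes` end ports (assumes a full fat tree of `radix`-port switches)."""
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    half = radix // 2
    tiers = 1
    while 2 * half ** tiers < nodes:
        tiers += 1
    return tiers

def min_recursive_doubling_rounds(processes: int) -> int:
    """Lower bound ceil(log2(P)) on pairwise exchange rounds, computed
    with integer arithmetic to avoid floating-point rounding."""
    if processes < 1:
        raise ValueError("processes must be >= 1")
    return (processes - 1).bit_length()

for n in (36, 648, 11664):
    print(n, fabric_tiers_needed(n), min_recursive_doubling_rounds(n))
# 36 1 6
# 648 2 10
# 11664 3 14
```

The tier count is only a proxy for latency; it ignores per-hop processing time and the inter-core step on each server.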
Enabling capability computing on HPC clusters
> 100X > 50%
Additional Benefits
► Simple, fully integrated
• No changes to application required
► Tolerance to higher oversubscription (blocking) ratio
• Same performance at lower cost
► Enables use of non-blocking collectives
• Part of future MPI implementations
• FCA guarantees no computation power penalty
FCA: What is the alternative/competitive solution?
Criteria compared, FCA vs. NIC-based offload:
• Topology aware
• Network Congestion Elimination
• Fabric switches offload computation
• Result distribution based on IB multicast
• Support non-blocking collectives
• OS “noise” reduction
• Expected MPI Job runtime Improvement: FCA 30-40%, NIC-based offload 1-2%
A Fabric Wide Challenge requires a Fabric Wide Solution
Benchmarks 1/4
FCA Impact on Fluent
Rating: Higher is Better!
[Charts: Fluent rating at 88 ranks, InfiniBand vs. InfiniBand + FCA, one panel each for aircraft_2m, eddy_417k, sedan_4m and truck_111m]
Setup: 11 x HP DL160; Intel Xeon 5550; Parallel FLUENT 12.1.4 (1998); CentOS 5.4; Open MPI 1.4.1
Benchmarks 2/4
System Configuration
Newest installation:
► Nodes type: NEC HPC 1812Rb-2
• CPU: 2 x Intel X5550, MEM: 6 x 2GB, IB: 1 x Infinihost DDR onboard
► System Configuration: 186 nodes
• 24 nodes per switch (DDR), 12 QDR links to tier2 switches (non-blocking)
► OS: CentOS 5.4
► Open MPI: 1.4.1
► FCA: 1.0_RC3 rev 2760
► UFM: 2.3 RC7
► Switch: 3.0.629
[Diagram labels: 24 x DDR; 4 x QDR]
IMB (Pallas) Benchmark Results
[Chart: Collective latency (usec, log scale 10 to 10000) vs. number of ranks (16 ranks per node, 0 to 2500); series: ompi-Allreduce, ompi-Reduce, ompi-Barrier, FCA-Allreduce, FCA-Reduce, FCA-Barrier. Up to 100X faster]
[Chart: Collective run time reduction (%), FCA vs Open MPI, vs. number of ranks (0 to 2500); series: Allreduce, Reduce, Barrier. Up to 99.5% runtime reduction]
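The two headline figures, a speedup factor and a percentage runtime reduction, are two views of the same ratio. The small sketch below converts between them; the function names are hypothetical. Note that the slide does not say the two "up to" figures refer to the same data point.

```python
def runtime_reduction_pct(speedup: float) -> float:
    """Percentage of runtime removed when something runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * 100.0

def speedup_from_reduction(reduction_pct: float) -> float:
    """Speedup factor implied by removing `reduction_pct` percent of the runtime."""
    if not 0 <= reduction_pct < 100:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)

# A 100X speedup removes about 99% of the runtime;
# a 99.5% reduction corresponds to roughly a 200X speedup.
print(round(runtime_reduction_pct(100.0), 2))
print(round(speedup_from_reduction(99.5), 2))
```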
OpenFOAM - I
[Chart: OpenFOAM CFD aerodynamic benchmark (64 cores), runtime in seconds (0 to 5000): Open MPI 1.4.1 vs. Open MPI 1.4.1 + FCA]
► OpenFOAM
• Open source CFD solver produced by a commercial company, OpenCFD
• Used by many leading automotive companies
Benchmarks 3/4
System Configuration
► Nodes type: NEC HPC
• CPU: Nehalem X5560 2.8 GHz, 4 cores * 2 sockets, IB: 1 x Infinihost DDR HCA
► System Configuration: 700 nodes
• 30 nodes per switch (DDR), 6 QDR links to tier2 switches (oversubscribed)
► OS: Scientific Linux 5.3
► Open MPI: 1.4.1
► FCA: 1.1
► UFM: 2.3
► Switch: 3.0.629
[Diagram labels: 30 x DDR; 3 x QDR]
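The "non-blocking" and "oversubscribed" labels on the two system configurations can be checked from the link counts. The sketch below compares aggregate downlink and uplink bandwidth per edge switch, using the nominal 4x InfiniBand signaling rates (DDR 20 Gb/s, QDR 40 Gb/s). The helper name is hypothetical, and the calculation ignores encoding overhead, which affects both sides equally.

```python
# Nominal 4x InfiniBand signaling rates in Gb/s
LINK_GBPS = {"SDR": 10, "DDR": 20, "QDR": 40}

def oversubscription_ratio(down_links: int, down_rate: str,
                           up_links: int, up_rate: str) -> float:
    """Aggregate downlink bandwidth divided by aggregate uplink bandwidth.

    A ratio of 1.0 or less means the edge switch is non-blocking;
    anything above 1.0 is oversubscribed.
    """
    down = down_links * LINK_GBPS[down_rate]
    up = up_links * LINK_GBPS[up_rate]
    return down / up

# 186-node system: 24 DDR nodes per switch, 12 QDR uplinks
print(oversubscription_ratio(24, "DDR", 12, "QDR"))  # 1.0
# 700-node system: 30 DDR nodes per switch, 6 QDR uplinks
print(oversubscription_ratio(30, "DDR", 6, "QDR"))   # 2.5
```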
OpenFOAM - II
► ERCOFTAC UFR 2-02
• http://qnet-ercoftac.cfms.org.uk/index.php?title=Flow_past_cylinder
• Used in many areas of engineering, including civil and environmental
• Run with OpenFOAM (pimpleFoam solver)
[Chart: ERCOFTAC UFR 2-02, flow past a square cylinder (256 cores), axis 0 to 4000: Open MPI 1.4.1 vs. FCA]
Molecular Dynamics: LS1-Mardyn
► The case is 50000 molecules with a single Lennard-Jones site each; the distribution of molecules is homogeneous at the beginning of simulation time.
► "agglo" uses a custom reduce operator (not supported by FCA), while "split" uses a standard one
>95% Improvement
Benchmarks 4/4
Setup
► 80 x BL460 Blades each with two Intel(R) Xeon(R) CPU X5670 @ 2.93 GHz
► Voltaire QDR InfiniBand
► Platform MPI 8.0
► Fluent version 12.1
► Star-CD version 4.12
192 cores per
enclosure
Fluent 192 Cores
Rating: Higher is Better
[Charts: Fluent rating at 192 cores, PMPI vs. PMPI + FCA, one panel each for truck_poly_14m, truck_14m and truck_111m]
Star-CD A-Class benchmark 192 cores
Runtime – Lower is Better
Logistics & Roadmap
FCA Ordering & Packaging
SWL-00347 FCA Add-on License for 1 node
SWL-00344 UFM-FCA Bundle License for 1 node
► Switch CPU software shipping automatically on all switches
starting from version 3.0
• Recommended to upgrade to latest version
► FCA Add-on package includes:
• FCA Manager - add-on to UFM
• OMA - host add-on for Open MPI (not required for other MPIs once supported)
► Bundle includes the above as well as UFM itself
► FCA license is installed on the UFM server
FCA Roadmap
► FCA v1.1 (Available Q2 2010)
• Collective Operations
  - MPI_Reduce, MPI_Allreduce (MAX & SUM)
  - MPI_Bcast
  - Integer & floating point (32/64), up to 8 elements (128 byte)
  - MPI_Barrier
• Topologies
  - Fat Tree
  - HyperScale
  - Torus
• MPI
  - Open MPI
  - SDK available for MPI integration
► FCA v2.0 (Available Q4 2010)
• Allgather
• Support for all well known arithmetic functions for Reduce/Allreduce (Min, XOR, etc)
• Increased Message size for Bcast, Reduce & Allreduce
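The v1.1 limits listed above can be read as a simple eligibility rule for a reduction call. The sketch below encodes that reading as a predicate. It is purely illustrative: the function and constant names are hypothetical and are not part of the FCA SDK, the dtype labels are assumptions, and it is assumed here that the element limit applies to Reduce/Allreduce. The slide's byte figure is not modeled.

```python
# Limits taken from the FCA v1.1 roadmap bullets (reading them as a
# per-call eligibility rule is an assumption).
SUPPORTED_OPS = {"MAX", "SUM"}
SUPPORTED_DTYPES = {"int32", "int64", "float32", "float64"}
MAX_ELEMENTS = 8

def fits_fca_v11_reduce(op: str, dtype: str, count: int) -> bool:
    """Return True if a Reduce/Allreduce call falls within the v1.1 limits."""
    return (
        op.upper() in SUPPORTED_OPS
        and dtype in SUPPORTED_DTYPES
        and 1 <= count <= MAX_ELEMENTS
    )

print(fits_fca_v11_reduce("SUM", "float64", 4))   # True
print(fits_fca_v11_reduce("MIN", "float64", 4))   # False: MIN is listed for v2.0
print(fits_fca_v11_reduce("SUM", "float64", 16))  # False: more than 8 elements
```

A custom reduce operator, such as the one used by the "agglo" variant in the LS1-Mardyn benchmark, would fail the first check.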
FCA SDK – Integration with Additional MPIs
► Easy to use software development kit
► Integration to be performed by MPI vendor
► Package includes:
• Documentation
• High level & flow presentation
• Software packages
  - Dynamically linked library – binary only
  - Header files
  - Sample application
Coming Soon:
Platform MPI (formerly HP MPI) Support
► Platform MPI version 8.x - Q3 2010
► Initial benchmarking expected end of Q2 2010
► Other MPI vendors evaluating the technology as well
• Leveraging Voltaire SDK
Voltaire Fabric Collective Accelerator
Summary
► Fabric computing offload
• Combination of SW & HW in a single solution
• Offloading blocking computational tasks
• Algorithms leveraging the topology for computation (trees)
► Extreme MPI performance & scalability
• Capability computing on commodity clusters
• Up to two orders of magnitude (100x) faster collective runtime
• Scale by number of hops, not number of nodes
• Variation eliminated - Consistent results
► Transparent to the application
• Plug & play - No need for code changes
Accelerate your fabric!
Thank You
