To Infiniband and Beyond: High
Speed Interconnects in Commodity
           HPC Clusters
           Teresa Kaltz, PhD
          Research Computing
           December 3, 2009


                                   1
Interconnect Types on Top 500




On the latest TOP500 list, there is exactly one 10 GigE deployment,
compared to 181 InfiniBand-connected systems.
Michael Feldman, HPCwire Editor


                                                                      2
Top 500 Interconnects 2002-2009

[Chart: number of Top 500 systems by interconnect family (Ethernet, Infiniband, Other), 2002-2009; y-axis 0 to 500 systems]
                                                                            3
What is Infiniband Anyway?

•  Open, standard interconnect architecture



  –  http://www.infinibandta.org/index.php
  –  Complete specification available for download
•  Complete "ecosystem"
  –  Both hardware and software
•  High bandwidth, low latency, switch-based
•  Allows remote direct memory access (RDMA)
                                                     4
Why Remote DMA?

•  TCP offload engines reduce overhead by offloading
   protocol processing such as checksums
•  2 copies on receive: NIC → kernel → user
•  Solution is Remote DMA (RDMA)
        Per Byte               Percent Overhead
        User-system copy            16.5 %
        TCP Checksum                15.2 %
        Network-memory copy         31.8 %
        Per Packet
        Driver                      8.2 %
        TCP+IP+ARP protocols        8.2 %
        OS overhead                 19.8 %


                                                  5
What is RDMA?

[Diagram not reproduced]

                6
Infiniband Signalling Rate

•  Each link is a point-to-point serial connection
•  Usually aggregated into groups of four
•  Unidirectional effective bandwidth
   –  SDR 4X: 1 GB/s
   –  DDR 4X: 2 GB/s
   –  QDR 4X: 4 GB/s
•  Bidirectional bandwidth twice unidirectional
•  Many factors impact measured performance!


                                                     7
Infiniband Roadmap from IBTA

[Roadmap chart not reproduced]

                               8
DDR 4X Unidirectional Bandwidth


•  Achieved bandwidth limited by PCIe 8x Gen 1
•  Current platforms mostly ship with PCIe Gen 2

[Chart not reproduced]
                                           9
QDR 4X Unidirectional Bandwidth



•  Still seem to have bottleneck at host if using QDR

[Chart not reproduced]

   http://mvapich.cse.ohio-state.edu/performance/interNode.shtml   10
Latency Measurements: IB vs GbE

[Chart not reproduced]

                                  11
Infiniband Latency Measurements

[Chart not reproduced]

                                  12
Infiniband Silicon Vendors




•  Both switch and HCA parts
  –  Mellanox: Infiniscale, Infinihost
  –  Qlogic: Truescale, Infinipath
•  Many OEMs use their silicon
•  Large switches
  –  Parts arranged in fat tree topology

                                           13
Infiniband Switch Hardware

  24 port silicon product line (24, 48, 96, 144, 288 ports)
  Scales to thousands of ports
  Host-based and hardware-based subnet management
  Current generation (QDR) based on 36 port silicon
  Up to 864 ports in a single switch!!
                                                              14
Infiniband Topology

•  Infiniband uses credit-based flow control
   –  Need to avoid loops in topology that may produce
      deadlock

•  Common topology for small and medium size
   networks is a tree (Clos)
•  Mesh/torus more cost effective
   for large clusters (>2500 hosts)

                                                         15
Infiniband Routing

•  Infiniband is statically routed
•  Subnet management software discovers fabric
   and generates set of routing tables
  –  Most subnet managers support multiple routing
     algorithms
•  Tables updated with changes in topology only
•  Often cannot achieve theoretical bisection
   bandwidth with static routing
•  QDR silicon introduces adaptive routing

                                                     16
HPCC Random Ring Benchmark

[Chart: average ring bandwidth (MB/s, 0 to 1600) vs number of enclosures for four routing algorithms, "Routing 1" through "Routing 4"]
                                                                   17
Infiniband Specification for Software

•  IB specification does not define API
•  Actions are known as "verbs"
   –  Services provided to upper layer protocols
   –  Send verb, receive verb, etc
•  Community has standardized around open
   source distribution called OFED to provide verbs
•  Some Infiniband software is also available from
   vendors
   –  Subnet management

                                                   18
Application Support of Infiniband

•  All MPI implementations support native IB
   –  OpenMPI, MVAPICH, Intel MPI
•  Existing socket applications
   –  IP over IB
   –  Sockets direct protocol (SDP)
      •  Does NOT require re-link of application
•  Oracle uses RDS (reliable datagram sockets)
   –  First available in Oracle 10g R2
•  Developer can program to "verbs" layer

                                                   19
Infiniband Software Layers

[Diagram not reproduced]

                             20
OFED Software

•  OpenFabrics Enterprise Distribution software
   from the OpenFabrics Alliance
   –  http://www.openfabrics.org/
•  Contains everything needed to run Infiniband
   –  HCA drivers
   –  verbs implementation
   –  subnet management
   –  diagnostic tools
•  Versions qualified together

                                                  21
Openfabrics Software Components

[Diagram not reproduced]

                                  22
"High Performance" Ethernet

•  1 GbE cheap and ubiquitous
  –  hardware acceleration
  –  multiple multiport NICs
  –  supported in kernel
•  10 GbE still used primarily as uplinks from edge
   switches and as backbone
•  Some vendors providing 10 GbE to server
  –  low cost NIC on motherboard
  –  HCAs with performance proportional to cost

                                                   23
RDMA over Ethernet

•  NIC capable of RDMA is called RNIC
•  RDMA is primary method of reducing latency on
   host side
•  Multiple vendors have RNICs
  –  Mainstream: Broadcom, Intel, etc.
  –  Boutique: Chelsio, Mellanox, etc.
•  New Ethernet standards
  –  "Data Center Bridging"; "Converged Enhanced
     Ethernet"; "Data Center Ethernet"; etc

                                                   24
What is iWarp?

•  RDMA consortium (RDMAC) standardized some
   protocols which are now part of the IETF Remote
   Direct Data Placement (RDDP) working group
•  http://www.rdmaconsortium.org/home
•  Also defined SRP, iSER in addition to verbs
•  iWARP supported in OFED
•  Most specification work complete in ~2003



                                                25
RDMA over Ethernet?

The name ‘RoCEE’ (RDMA over Converged Enhanced Ethernet),
is a working name.

You might hear me say RoXE, RoE, RDMAoE, IBXoE, IBXE or
any other of a host of equally obscure names.


Tom Talpey, Microsoft Corporation
Paul Grun, System Fabric Works
August 2009




                                                            26
The Future: InfiniFibreNet

•  Vendors moving towards "converged fabrics"
•  Using same "fabric" for both networking and
   storage
•  Storage protocols and IB over Ethernet
•  Storage protocols over Infiniband
  –  NFS over RDMA, Lustre
•  Gateway switches and converged adapters
  –  Various combinations of Ethernet, IB and FC


                                                   27
Any Questions?




      THANK YOU!

(And no mention of The Cloud)




                                28
