Cluster 2010 Presentation

Optimization Techniques at the I/O Forwarding Layer

Kazuki Ohta (presenter):
Preferred Infrastructure, Inc., University of Tokyo

Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross:
Argonne National Laboratory

Yutaka Ishikawa:
University of Tokyo

                                                      Contact: kazuki.ohta@gmail.com
                                                                                   1
Background: Compute and Storage Imbalance

• Leadership-class computational scale:
  • 100,000+ processes
  • Advanced Multi-core architectures, Compute node OSs
• Leadership-class storage scale:
  • 100+ servers
  • Commercial storage hardware, Cluster file system


• Current leadership-class machines supply only 1 GB/s of storage
  throughput for every 10 TF of compute performance. This gap has grown by a
  factor of 10 in recent years.
• Bridging this imbalance between compute and storage is a critical
  problem for large-scale computation.

                                                                      2
Previous Studies: Current I/O Software Stack


[Diagram: the current I/O software stack]

   Parallel/Serial Applications
   High-Level I/O Libraries          -- storage abstraction, data portability
                                        (HDF5, NetCDF, ADIOS)
   MPI-IO | POSIX I/O (VFS, FUSE)    -- organizing accesses from many clients
                                        (ROMIO)
   Parallel File Systems             -- logical file system over many storage
                                        devices (PVFS2, Lustre, GPFS, PanFS,
                                        Ceph, etc.)
   Storage Devices

                                                                                3
Challenge: Millions of Concurrent Clients

• 1,000,000+ concurrent clients present a challenge to the current I/O stack
   • e.g., metadata performance, locking, the network incast problem, etc.
• An I/O forwarding layer is introduced.
   • All I/O requests are delegated to a dedicated I/O forwarder process.
   • The I/O forwarder reduces the number of clients seen by the file system for
     all applications, without requiring collective I/O.

[Diagram: the I/O path — many compute processes funnel their requests through a
 small number of I/O forwarder processes, which in turn access the parallel
 file system (PVFS2) servers.]

                                                                                   4
I/O Software Stack with I/O Forwarding


[Diagram: the I/O software stack with an I/O forwarding layer inserted between
 MPI-IO / POSIX I/O and the parallel file system.  The forwarding layer is the
 bridge between compute processes and the storage system (IBM ciod, Cray DVS,
 IOFSL).]

   Parallel/Serial Applications
   High-Level I/O Libraries
   MPI-IO | POSIX I/O (VFS, FUSE)
   I/O Forwarding
   Parallel File Systems
   Storage Devices

                                                                                5
Example I/O System: Blue Gene/P Architecture




[Diagram only: Blue Gene/P system architecture]

                                               6
I/O Forwarding Challenges

• Large requests
  • Latency of the forwarding
  • Memory limits at the I/O forwarding node
  • Variability in backend file system node performance
• Small requests
  • The current I/O forwarding mechanism reduces the number of clients, but
    does not reduce the number of requests.
  • Request processing overheads at the file systems


• We propose two optimization techniques for the I/O forwarding layer:
  • Out-of-order I/O pipelining, for large requests.
  • An I/O request scheduler, for small requests.
                                                                          7
Out-Of-Order I/O Pipelining
• Split large I/O requests into small fixed-size chunks.
• These chunks are forwarded out of order.

  [Diagram: request timelines for client, IOFSL, and file system threads,
   comparing no pipelining with out-of-order pipelining.]

• Benefits
   • Reduces forwarding latency by overlapping the I/O requests with the
     network transfer.
   • I/O sizes are not limited by the memory size at the forwarding node.
   • Little impact from the slowest file system node.
  (A code sketch of the chunking idea follows this slide.)
                                                                                              8
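A minimal sketch of the chunking idea, in C.  The chunk size, the
thread-per-chunk dispatch, and the plain pwrite() backend are illustrative
assumptions, not the actual IOFSL implementation (which bounds the number of
in-flight buffers and uses its own thread pool and file system dispatcher):

    /* Split one large write into fixed-size chunks; worker threads issue the
     * chunks independently, so they may complete out of order. */
    #define _XOPEN_SOURCE 700
    #include <fcntl.h>
    #include <pthread.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define CHUNK_SIZE (8 * 1024 * 1024)        /* assumed pipeline buffer size */

    struct chunk { int fd; const char *buf; size_t len; off_t off; };

    static void *write_chunk(void *arg)
    {
        struct chunk *c = arg;
        pwrite(c->fd, c->buf, c->len, c->off);  /* each chunk stands alone */
        return NULL;
    }

    /* Forward one large request as a pipeline of CHUNK_SIZE pieces. */
    static void forward_write(int fd, const char *buf, size_t len, off_t off)
    {
        size_t nchunks = (len + CHUNK_SIZE - 1) / CHUNK_SIZE;
        pthread_t    tid[nchunks];
        struct chunk c[nchunks];

        for (size_t i = 0; i < nchunks; i++) {
            c[i].fd  = fd;
            c[i].buf = buf + i * CHUNK_SIZE;
            c[i].off = off + (off_t)(i * CHUNK_SIZE);
            c[i].len = (i == nchunks - 1) ? len - i * CHUNK_SIZE : CHUNK_SIZE;
            pthread_create(&tid[i], NULL, write_chunk, &c[i]);
        }
        /* A real forwarder caps the number of in-flight chunks so that buffer
         * memory stays bounded; here we simply wait for all of them. */
        for (size_t i = 0; i < nchunks; i++)
            pthread_join(tid[i], NULL);
    }

    int main(void)
    {
        /* Illustrative use: forward a 64 MiB buffer to a local file. */
        static char data[64 * 1024 * 1024];
        int fd = open("chunks.out", O_CREAT | O_WRONLY, 0644);
        forward_write(fd, data, sizeof(data), 0);
        close(fd);
        return 0;
    }

Because the chunks are independent, a slow file system server delays only its
own chunks rather than the whole request, which is the point made above.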
I/O Request Scheduler

• Schedule and merge small requests at the forwarder
  • Reduces the number of seeks
  • Reduces the number of requests the file system actually sees
• Scheduling overhead must be minimal
  • Handle-Based Round-Robin algorithm for fairness between files
  • Ranges are managed in an interval tree
     • Contiguous requests are merged
  (A code sketch of this scheduling follows this slide.)

  [Diagram: per-handle queues (H) of Read/Write requests; the scheduler picks N
   requests from the queue Q and issues the I/O.]

                                                                          9
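The scheduling idea above can be sketched in C as follows.  A small sorted
array stands in for the interval tree, the handle ids and request sizes are
made up, and capacity checks and cascading merges are omitted for brevity:

    #include <stdio.h>

    #define MAX_RANGES 64

    struct range { long start, end; };                /* byte range [start, end) */

    struct handle_queue {
        int          fh;                              /* file handle id (illustrative) */
        struct range r[MAX_RANGES];                   /* sorted, non-overlapping ranges */
        int          n;
    };

    /* Insert a request, merging it with an overlapping or contiguous neighbour. */
    static void enqueue(struct handle_queue *q, long start, long end)
    {
        int i = 0;
        while (i < q->n && q->r[i].end < start) i++;  /* find insertion point */
        if (i < q->n && q->r[i].start <= end) {       /* overlap/contiguity: merge */
            if (start < q->r[i].start) q->r[i].start = start;
            if (end   > q->r[i].end)   q->r[i].end   = end;
        } else {                                      /* otherwise insert a new range */
            for (int j = q->n; j > i; j--) q->r[j] = q->r[j - 1];
            q->r[i].start = start; q->r[i].end = end; q->n++;
        }
    }

    /* Handle-based round-robin: take at most one merged range per handle per
     * pass, until N requests have been issued or all queues are empty. */
    static void issue_batch(struct handle_queue *qs, int nq, int batch)
    {
        int issued = 0, progress = 1;
        while (issued < batch && progress) {
            progress = 0;
            for (int h = 0; h < nq && issued < batch; h++) {
                if (qs[h].n == 0) continue;
                struct range r = qs[h].r[--qs[h].n];  /* a real scheduler pops in offset order */
                printf("fh=%d issue [%ld,%ld)\n", qs[h].fh, r.start, r.end);
                issued++;
                progress = 1;
            }
        }
    }

    int main(void)
    {
        struct handle_queue q[2] = { { .fh = 1 }, { .fh = 2 } };
        enqueue(&q[0], 0, 4096);
        enqueue(&q[0], 4096, 8192);   /* contiguous with the previous range: merged */
        enqueue(&q[1], 65536, 69632);
        issue_batch(q, 2, 4);         /* one merged range per handle is issued */
        return 0;
    }

Merging turns the two contiguous 4 KB requests on handle 1 into a single 8 KB
access, and the round-robin pass keeps handle 2 from being starved.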
I/O Forwarding and Scalability Layer (IOFSL)

• IOFSL Project [Nawab 2009]
  • Open-Source I/O Forwarding Implementation
  • http://www.iofsl.org/
• Portable across most HPC environments
  • Network independent
    • All network communication is done by BMI [Carns 2005]
       • TCP/IP, InfiniBand, Myrinet, Blue Gene/P Tree,
         Portals, etc.
  • File system independent
  • MPI-IO (ROMIO) / FUSE clients
                                                              10
IOFSL Software Stack

[Diagram: IOFSL software stack.  Client side — the application uses a POSIX
 interface (FUSE) or an MPI-IO interface (ROMIO-ZOIDFS); both sit on the
 ZOIDFS client API over BMI.  Server side — the IOFSL server receives requests
 over BMI and hands them to a file system dispatcher (PVFS2, libsysio).
 Client and server exchange I/O requests via the ZOIDFS protocol
 (TCP, InfiniBand, Myrinet, etc.).]

• Out-of-order I/O pipelining and the I/O request scheduler have been implemented
  in IOFSL and evaluated in two environments.
   • T2K Tokyo (Linux cluster) and ANL Surveyor (Blue Gene/P)
                                                                                    11
Evaluation on T2K: Spec

• T2K Open Supercomputer, Tokyo site
  • http://www.open-supercomputer.org/
  • 32-node research cluster
  • 16 cores per node: 4 × 2.3 GHz quad-core Opterons
  • 32 GB memory
  • 10 Gbps Myrinet network
  • SATA disk (read: 49.52 MB/s, write: 39.76 MB/s)
• One IOFSL server, four PVFS2 servers, 128 MPI processes
• Software
  • MPICH2 1.1.1p1
  • PVFS2 CVS (roughly 2.8.2)
     • Both are configured to use Myrinet
                                                         12
Evaluation on T2K: IOR Benchmark

• Each process issues the same amount of I/O
• Gradually increase the message size and observe the change in bandwidth
  • Note: modified to call fsync() for MPI-IO
  (A sketch of this access pattern follows this slide.)

  [Diagram: access pattern illustration, annotated with the message size.]

                                                       13
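For reference, a minimal MPI-IO sketch of this kind of access pattern,
including the MPI_File_sync() call corresponding to the fsync() modification
noted above.  The file name, message size, and transfer count are illustrative,
not the actual benchmark settings:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int msg_size = 1 << 20;        /* 1 MiB per transfer (illustrative) */
        const int xfers    = 16;             /* transfers per process (illustrative) */
        char *buf = calloc(msg_size, 1);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        for (int i = 0; i < xfers; i++) {
            /* Segmented layout: every process writes the same amount of data,
             * each at its own offset within each segment. */
            MPI_Offset off = ((MPI_Offset)i * nprocs + rank) * msg_size;
            MPI_File_write_at(fh, off, buf, msg_size, MPI_BYTE, MPI_STATUS_IGNORE);
        }
        MPI_File_sync(fh);                   /* force data to storage (the added step) */
        MPI_File_close(&fh);

        free(buf);
        MPI_Finalize();
        return 0;
    }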
Evaluation on T2K: IOR Benchmark, 128 procs

  [Chart: bandwidth (MB/s, 0-120) vs. message size (4 KB to 65536 KB) for
   ROMIO PVFS2, IOFSL none, and IOFSL hbrr.  Annotations: out-of-order
   pipelining improvement (29.5%); I/O scheduler improvement (40.0%).]

                                                                                 14
Evaluation on Blue Gene/P: Spec

• Argonne National Laboratory BG/P “Surveyor”
  • Blue Gene/P platform for research and development
  • 1,024 nodes, 4,096 cores
  • Four PVFS2 servers
  • DataDirect Networks S2A9550 SAN
• 256 compute nodes with 4 I/O nodes were used.

  [Diagram: Blue Gene/P packaging — node card: 4 cores; node board: 128 cores;
   rack: 4,096 cores.]
                                                                          15
Evaluation on BG/P: BMI PingPong

  [Chart: bandwidth (MB/s, 0-900) vs. buffer size (1 KB to 4096 KB), comparing
   BMI TCP/IP with BMI ZOID on the CNK BG/P tree network.]

                                                                                 16
Evaluation on BG/P: IOR Benchmark, 256 nodes

  [Chart: bandwidth (MiB/s, 0-800) vs. message size (1 KB to 65536 KB) for CIOD
   and IOFSL FIFO (32 threads).  Annotations: performance improvement (42.0%);
   performance drop (-38.5%).]

                                                                                17
Evaluation on BG/P: Thread Count Effect

  [Chart: bandwidth (MiB/s, 0-800) vs. message size (1 KB to 65536 KB) for
   IOFSL FIFO with 16 threads and with 32 threads.  Annotation: 16 threads >
   32 threads.]

                                                                                18
Related Work

• Computational Plant project @ Sandia National Laboratories
  • First introduced the I/O forwarding layer
• IBM Blue Gene/L, Blue Gene/P
  • All I/O requests are forwarded to I/O nodes
     • The compute-node OS can be stripped down to minimal
       functionality, reducing OS noise
  • ZOID: I/O forwarding project [Kamil 2008]
     • Only on Blue Gene
• Lustre Network Request Scheduler (NRS) [Qian 2009]
  • Request scheduler at the parallel file system nodes
  • Only simulation results
                                                             19
Future Work

• Event-driven server architecture
  • Reduced thread contention
• Collaborative caching at the I/O forwarding layer
  • Multiple I/O forwarders work collaboratively to cache data
    and also metadata
• Hints from MPI-IO
  • Better cooperation with collective I/O
• Evaluation on other leadership-scale machines
  • ORNL Jaguar, Cray XT4 and XT5 systems




                                                                    20
Conclusions
• Implementation and evaluation of two optimization techniques at the I/O
  forwarding layer
  • I/O pipelining that overlaps file system requests with network
    communication.
  • An I/O scheduler that reduces the number of independent, non-contiguous
    file system accesses.
• Demonstration of a portable I/O forwarding layer, and a performance comparison
  with the existing HPC I/O software stack.
  • Two environments
     • T2K Tokyo Linux cluster
     • ANL Blue Gene/P Surveyor
  • First I/O forwarding evaluations on a Linux cluster and Blue Gene/P
  • First comparison of the Blue Gene/P IBM I/O stack with an open-source I/O stack

                                                                            21
Thanks!

Kazuki Ohta (presenter):
Preferred Infrastructure, Inc., University of Tokyo

Dries Kimpe, Jason Cope, Kamil Iskra, Robert Ross:
Argonne National Laboratory

Yutaka Ishikawa:
University of Tokyo
                                                      Contact: kazuki.ohta@gmail.com
                                                                                  22
