Research Computing @ CU Boulder
Deploying RMACC Summit:
An HPC Resource for the Rocky Mountain Region
Thomas Hauser and Peter Ruprecht
Research Computing Group
University of Colorado Boulder    www.rc.colorado.edu
Outline
• Part 1 – Regional partnership
• Rocky Mountain Advanced Computing Consortium
(RMACC)
• Proposal process
• RFP
• Purchasing
• Part 2 – Technical details and implementation
• Motivation and design
• Technological challenges and successes
• OPA testing
• OPA plus GPFS
• Federation and authentication for RMACC users
Rocky Mountain
Advanced Computing
Consortium (RMACC)
• www.rmacc.org
• Collaboration among academic and research institutions to facilitate widespread effective use of HPC throughout the Rocky Mountain region
• 15-17 August 2017 – 7th RMACC HPC Symposium in Boulder
Joint Proposal to NSF MRI program
• Partnership between University of Colorado Boulder and Colorado State University
• Initially one NSF MRI proposal from each institution
• Collaborative proposal for one piece of equipment located in Boulder
• Savings on operations staff and costs
• CU Boulder PIs
• T. Hauser, P. Ruprecht, K. Jansen, J. Syvitski, A. Hasenfratz
• CSU PIs
• H. Siegel, P. Burns, E. Chong, J. Prenni
• 2 NSF awards
• #1532235 to CSU for $700,000
• #1532236 to CU Boulder for $2,030,000
Procurement - 2 Universities
• RFP process
• CU Boulder purchasing rules
• Input from CSU purchasing
• Evaluation team included CSU and CU Boulder members
• CU Boulder contract with review and approval from CSU legal
• One vendor project - billing to two different universities
• CSU owns the storage
• CU Boulder owns compute and networking
• Software licensing – site licenses
• Access to software controlled via LDAP groups
• CU Boulder
• CSU
• Credits to Pat Burns (CSU)
Operations and User Support
• Jonathon Anderson and Peter Ruprecht – CU Boulder
• Design and implementation leads
• Working closely with vendor
• Daniel Milroy and John Blaas - CU Boulder
• Performance testing and ongoing administration
• Richard Casey - CSU
• CSU technical liaison and user support
• Becky Yeager, Bruce Loftis, Shelley Knuth – CU Boulder
• RMACC outreach
• Joel Frahm, Nick Featherstone, Peter Ruprecht – CU Boulder
• User support
• Shelley Knuth
• Training lead
Requirements for the System
• User-driven requirements
• Analysis of usage patterns on existing system
• Outreach to key user communities
• Compute requirements
• 450 TFLOPS or greater
• 4 types of nodes
• Haswell
• Knights Landing
• GPU
• High Memory
• Single node performance within 5% variance
• Scratch storage requirements
• 1 PB or more of parallel file system
• 20+ GB/s bandwidth to nodes
• 10,000+ file creation operations/second
• 500 simultaneous file system clients
• High-performance interconnect
RMACC Summit: Node Types
• 457 general compute nodes (Dell C6320)
• 24 real cores (Intel Haswell)
• 128 GB RAM => 5 GB/core
• local SSD
• 5 high-memory nodes
• 48 real cores (Intel Haswell)
• 2048 GB RAM => 42 GB/core
• 12-drive local RAID
Summit: Node Types (cont’d)
• 11 GPGPU/visualization nodes (C4130)
• Same CPU and RAM as general
• 2x NVIDIA K80 card (effectively 4 GPUs)
• 20 Xeon Phi (“Knights Landing”) nodes
• Phi installed directly in CPU socket; not a separate accelerator card
• 68 cores / 272 HW threads
• installed in second phase, Q1 2017
Omni-Path (OPA) Interconnect
• Desirable for two main reasons: cost compared to EDR IB, and expected close integration with KNL
• Availability was formally announced near the end of our RFP process
• 100 Gb/s bandwidth
• Extremely low latency for MPI performance (1.5 µs); see the micro-benchmark sketch below
• "Islands" of 28-32 nodes, fully non-blocking
• 2:1 blocking factor between islands
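As an illustrative aside (not from the original slides), point-to-point latency and bandwidth of this kind are commonly spot-checked with the OSU micro-benchmarks from inside a two-node job; the commands and expected figures below are assumptions for the sketch, not measured Summit results:

    # Hedged sketch: spot-check OPA MPI latency and bandwidth with the OSU
    # micro-benchmarks from a 2-node Slurm allocation (assumes the benchmarks
    # are built in the current directory).
    srun -N 2 -n 2 ./osu_latency   # small-message latency; ~1-2 us expected on OPA
    srun -N 2 -n 2 ./osu_bw        # large-message bandwidth; ~12 GB/s (~100 Gb/s) expected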
OPA Challenges/Successes
• OPA software and driver stack from Intel is nicely integrated, but had some issues in early releases
• Not currently seeing many bugs; upgrade process is smoothing out
• Many management and monitoring utilities (see sketch below); voluminous output
• Application performance matches a similar IB-based cluster at another site
• Note that full performance testing must be done with a high-CPU, high-communication application
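As a hedged illustration of those management utilities, Intel's OPA FastFabric suite includes commands along these lines; exact option names can vary between software releases:

    # Illustrative OPA fabric health checks (FastFabric tools); not from the slides
    opainfo                  # local HFI port state, link speed and width
    opafabricinfo            # fabric-wide summary: nodes, switches, SM state
    opareport -o errors      # links with error counters above threshold
    opareport -o slowlinks   # links running below expected speed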
GPFS Scratch Storage
• 1.2 PB of high-performance scratch storage
• GPFS for parallel access and improved small-file and file-ops performance
• GPFS has distributed file management
• Metadata on SSD
• Contents of small files (<3.6 KB) fit directly in the inode
• Measured >21 GB/s parallel throughput and >18K file creations or deletions per second (benchmark sketch below)
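Numbers like these are typically produced with the IOR and mdtest benchmarks; the sketch below is hypothetical, with paths, task counts, and parameters chosen for illustration rather than taken from the actual acceptance-test configuration:

    # Hedged benchmark sketch (illustrative paths, node counts, and sizes)
    # Aggregate streaming bandwidth: file-per-process IOR, 1 MiB transfers
    srun -N 20 -n 480 ior -a POSIX -F -w -r -b 4g -t 1m -o /scratch/summit/ior-test/ior.dat
    # Metadata rate: create/stat/delete small files with mdtest
    srun -N 20 -n 480 mdtest -n 1000 -i 3 -u -d /scratch/summit/ior-test/mdtest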
GPFS plus OPA - problem
• GPFS was designed for close integration with Verbs
• But OPA has best performance with PSM2 …
• DDN SFA14K RAID controller was designed to have direct OPA connectivity – but that feature was not available in time
GPFS plus OPA - solution
• Very careful tuning of both OPA and EDR configurations in the NSD servers (see the sketch below)
• Intel OPA software update for better Verbs performance
• Vendor added more spindles to the scratch filesystem to boost backend bandwidth
• Now it’s working really well!
• Note that OPA has Traffic Flow Optimization (QoS) to reduce MPI latency on a fabric that also carries storage I/O – we don’t have this enabled
• More details about GPFS/OPA in the ARCC talk this afternoon
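To make the kind of tuning concrete, here is a minimal, hypothetical sketch of GPFS (Spectrum Scale) settings on dual-fabric NSD servers that bridge EDR IB on the storage side and OPA on the client side; the parameter values and the node class name are placeholders, not the actual Summit configuration:

    # Hypothetical NSD-server tuning sketch (values are placeholders)
    mmchconfig verbsRdma=enable -N nsdservers
    mmchconfig verbsPorts="mlx5_0/1 hfi1_0/1" -N nsdservers   # EDR HCA port and OPA HFI port
    mmchconfig maxMBpS=24000 -N nsdservers                    # raise per-node I/O throughput ceiling
    mmchconfig pagepool=16G -N nsdservers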
Authentication
• Two-factor auth required for ALL access
• PAM + sssd + LDAP allows separate domains for users (config sketch below)
• Use their local 2-factor auth system
• Username is user@domain, e.g., jane@colostate.edu
• Worked with CSU to access their user account info and 2-factor infrastructure
• CSU users create a Summit account using CSU username and credentials
• This account name is “CSU_ID@colostate.edu”
• @colostate logins are authenticated against CSU Duo 2-factor
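A minimal, hypothetical sssd.conf sketch of the multi-domain idea; server URIs, search bases, and other details are illustrative assumptions rather than the production configuration, and the Duo second factor would typically be enforced in the PAM stack (e.g., pam_duo) rather than in sssd itself:

    # /etc/sssd/sssd.conf -- illustrative multi-domain sketch only
    [sssd]
    services = nss, pam
    domains = rc.colorado.edu, colostate.edu

    [domain/rc.colorado.edu]
    id_provider = ldap
    auth_provider = ldap
    ldap_uri = ldaps://ldap.rc.colorado.edu          # hypothetical host
    ldap_search_base = dc=rc,dc=colorado,dc=edu
    use_fully_qualified_names = True                 # users log in as user@rc.colorado.edu

    [domain/colostate.edu]
    id_provider = ldap
    auth_provider = ldap
    ldap_uri = ldaps://ldap.colostate.edu            # hypothetical host
    ldap_search_base = dc=colostate,dc=edu
    use_fully_qualified_names = True                 # users log in as user@colostate.edu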
Authentication (cont’d)
• RMACC Summit is now a Level 3 XSEDE resource
• Provides “login allocations”
• RMACC users (non-UCB/CSU) will
• create an XSEDE account that will be associated with a UCB-RC account
• auth using XSEDE’s Duo 2-factor infrastructure
• log in through XSEDE Single Sign-On Hub (project XCI-36)
• Disclaimer – XSEDE login integration is still a work in progress … details may change
• Some uncertainty regarding future of GSI-SSH
Allocations - implementation
• Fairshare:
• Administratively more straightforward way to divide up access between multiple sites
• More user education required than with a “credit limit” allocation
• Slurm scheduler’s “Fair Tree” fairshare is configured with a target share for each allocation (config sketch below)
• This target corresponds to the number of core-hours awarded
• Fair Tree allows subdividing the share of a branch
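A hedged sketch of the Slurm scheduler settings behind this; the weights and decay values are placeholders rather than the production configuration:

    # slurm.conf excerpt (illustrative values only)
    PriorityType=priority/multifactor
    PriorityFlags=FAIR_TREE            # use the Fair Tree fairshare algorithm
    PriorityWeightFairshare=100000
    PriorityDecayHalfLife=14-0         # usage half-life of 14 days (placeholder)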
Allocations – fairshare tree
• Root share 100%
  • CU-Boulder share (67.5%)
    • General (20% of 67.5%)
    • Projects (80% of 67.5%)
      • UCB1
      • UCB2 …
  • CSU share (22.5%)
    • General (20%)
    • Projects (80%)
      • CSU1
      • CSU2 …
  • RMACC share (10%)
    • General (20%)
    • Projects (80%)
      • RMACC1
      • RMACC2 …
  • Condo shares (variable)
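A hedged illustration of how a tree like the one above can be encoded as Slurm account shares with sacctmgr; the account names and integer share values are assumptions that simply mirror the percentages shown (shares are relative, e.g., 675:225:100 for 67.5%, 22.5%, 10%):

    # Illustrative only: nested Slurm accounts with relative fairshare values
    sacctmgr add account ucb   parent=root fairshare=675
    sacctmgr add account csu   parent=root fairshare=225
    sacctmgr add account rmacc parent=root fairshare=100
    # within each institution, split General vs. Projects 20/80
    sacctmgr add account ucb-general  parent=ucb fairshare=20
    sacctmgr add account ucb-projects parent=ucb fairshare=80
    # individual project allocations subdivide the Projects branch
    sacctmgr add account ucb1 parent=ucb-projects fairshare=10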
Summit Condo
• Charging node hardware cost to researchers
• $1800/node subsidy to cover networking and data center costs
• Partners get priority access to an allocation share scaled to size of node purchase
• Condo nodes
• 47 from CU Boulder
• 25 from CSU
• 2 from CU Denver
• RMACC Summit was not operational when we issued the call for condo purchases
Conclusions
• Multi-site grant and procurement is possible, and saves overall operational cost
• Delays due to new technology must be expected and planned for
• Multi-site federation and authentication is harder than we expected – especially with 2-factor auth – but leveraging XSEDE should help
• Multi-site technical support and consulting is likely to be harder than expected
• Summit has provided new outreach and collaboration opportunities for RMACC
CU-Boulder Research Computing
www.rc.colorado.edu    rc-help@colorado.edu
CSU High-Performance Computing
www.acns.colostate.edu/hpc    help@colostate.edu
Thank you!  Questions?