4. Introduction: The National Supercomputing Centre (NSCC)
• A state-of-the-art national facility with computing and data resources that enables users to solve scientific and technological problems, and stimulates industry to use computing for problem solving, testing designs and advancing technologies.
• The facility is linked by high-bandwidth networks that connect these resources and provide high-speed access to users anywhere.
5. Introduction: Vision & Objectives
Vision: “Democratising Access to Supercomputing”
• Making Petascale Supercomputing accessible to the ordinary researcher
• Bringing Petascale Computing, Storage and Gigabit-speed networking to the ordinary person
Objectives of NSCC
1. Supporting National R&D Initiatives
2. Attracting Industrial Research Collaborations
3. Enhancing Singapore’s Research Capabilities
7. What is HPC?
• The term HPC stands for High Performance Computing or High Performance Computer
• Tightly coupled computers (nodes) linked by a high-speed interconnect
• Performance is measured in FLOPS (FLoating point Operations Per Second)
• Architectures
– NUMA (Non-Uniform Memory Access)
8. Major Domains where HPC is used
Engineering Analysis
• Fluid Dynamics
• Materials Simulation
• Crash simulations
• Finite Element Analysis
Scientific Analysis
• Molecular modelling
• Computational Chemistry
• High energy physics
• Quantum Chemistry
Life Sciences
• Genomic Sequencing and Analysis
• Protein folding
• Drug design
• Metabolic modelling
Seismic analysis
• Reservoir Simulations and modelling
• Seismic data processing
9. Major Domains where HPC is used
Chip design & Semiconductor
• Transistor simulation
• Logic Simulation
• Electromagnetic field solver
Computational Mathematics
• Monte-Carlo methods
• Time stepping and parallel time algorithms
• Iterative methods
Media and Animation
• VFX and visualization
• Animation
Weather research
• Atmospheric modelling
• Seasonal time-scale research
10. Major Domains where HPC is used
• And more
– Big data
– Information Technology
– Cyber security
– Banking and Finance
– Data mining
12. Executive Summary
• 1 Petaflop system
– About 1,300 nodes
– Homogeneous and heterogeneous architectures
• 13 Petabytes of storage
– One of the largest, state-of-the-art storage architectures
• Research and industry users
– A*STAR, NUS, NTU, SUTD
– And many more commercial and academic organizations
13. HPC Stack in NSCC
• Hardware: Fujitsu x86 servers, NVIDIA Tesla K40 GPUs, DDN storage
• Interconnect: Mellanox 100 Gbps network
• Operating system: RHEL 6.6 and CentOS 6.6
• File systems: Lustre & GPFS
• Scheduler: PBS Pro
• Tools: Intel Parallel Studio, Allinea tools
• HPC application software and application modules
17. Connection between GIS and NSCC
• The Genome Institute of Singapore (GIS) and the National Supercomputing Centre (NSCC) are about 2 km apart.
• The connection is ultra-high-speed, 500 Gbps enabled, and serves a large memory node (1 TB).
• GIS data volumes grew roughly 14x in three years: from 300 Gbytes/week in 2012 to 4,300 Gbytes/week in 2015.
18. Direct streaming of Sequence Data from GIS to the remote Supercomputer in NSCC (2 km away)
• Data sources at GIS: NGSP sequencers at B2 (Illumina + PacBio); POLARIS, genotyping and other platforms in L4~L8 (1 Gbps per sequencer / per machine, aggregated over 10 Gbps links).
• GIS connects to A*CRC-NSCC through the NSCC gateway over a 500 Gbps primary link; both sides have compute, a data manager and tiered storage, with 100 Gbps and 10 Gbps links on the NSCC side.
• STEP 1: Sequencers stream directly to NSCC storage (no footprint in GIS).
• STEP 2: Automated pipeline analysis starts once sequencing completes; processed data resides in NSCC.
• STEP 3: The data manager indexes and annotates the processed data and replicates the metadata to GIS, allowing data to be searched and retrieved from GIS.
A*CRC: A*STAR Computational Resource Centre
GIS: Genome Institute of Singapore
19. The Hardware
~1 PFlops System
• 1,288 nodes (dual socket, 12 cores/CPU, E5-2690 v3)
• 128 GB DDR4 per node
• 10 large memory nodes (1x 6 TB, 4x 2 TB, 5x 1 TB)
Over 13 PB Storage
• HSM tiered, 3 tiers
• 500 GB/s flash burst buffer I/O, 10x Infinite Memory Engine (IME)
EDR Interconnect
• Mellanox EDR fat tree within the cluster
• InfiniBand connection to all end-points (login nodes) at three campuses
• 40/80/500 Gbps throughput network extended to three campuses (NUS/NTU/GIS)
20. Compute nodes
• Large memory nodes
– 9 nodes configured with high memory
– Fujitsu Server PRIMERGY RX4770 M2
– Intel Xeon E7-4830 v3 @ 2.10 GHz
– 4x 1 TB, 4x 2 TB and 1x 6 TB memory configurations
– EDR InfiniBand
• Standard compute nodes
– 1,160 nodes
– Fujitsu Server PRIMERGY CX2550 M1
– 27,840 CPU cores
– Intel Xeon E5-2690 v3 @ 2.60 GHz
– 128 GB per server
– EDR InfiniBand
– Liquid cooling system
21. Accelerate your computing
Accelerator nodes
• 128 nodes with NVIDIA GPUs (otherwise identical to the standard compute nodes)
• NVIDIA Tesla K40 (2,880 CUDA cores each)
• 368,640 GPU cores in total
Visualization nodes
• 2 Fujitsu Celsius R940 graphic workstations
• Each with 2x NVIDIA Quadro K4200
• NVIDIA Quadro Sync support
22. NSCC Data Centre – Green features
Warm water cooling for CPUs (Cool-Central® Liquid Cooling technology)
– First free-cooling system in Singapore and South-East Asia
– Water is maintained at a temperature of 40ºC: it enters the racks at 40ºC and exits at 45ºC
– Equipment on a technical floor (the 18th) cools the water back down using only fans
– The system can easily be extended for future expansion
Green features of the Data Centre
– PUE of 1.4 (the average for Singapore is above 2.5)
23. Parallel file system
• Components
– Burst buffer
• 265 TB burst buffer
• 500 GB/s throughput
• Infinite Memory Engine (IME)
– Scratch
• 4 PB scratch storage
• 210 GB/s throughput
• SFA12KX EXAScaler storage
• Lustre file system
– Home and secure
• 4 PB persistent storage
• GRIDScaler storage
• 100 GB/s throughput
• IBM Spectrum Scale (formerly GPFS)
– Archive storage
• 5 PB storage
• Archive purposes only
• WOS-based archive system
29. Why PBS Professional (Scheduler)?
A workload management solution that maximizes the efficiency and utilization of high-performance computing (HPC) resources and improves job turnaround.
Robust Workload Management
• Floating licenses
• Scalability, with flexible queues
• Job arrays
• User and administrator interface
• Job suspend/resume
• Application checkpoint/restart
• Automatic file staging
• Accounting logs
• Access control lists
Advanced Scheduling Algorithms
• Resource-based scheduling
• Preemptive scheduling
• Optimized node sorting
• Enhanced job placement
• Advance & standing reservations
• Cycle harvesting across workstations
• Scheduling across multiple complexes
• Network topology scheduling
• Manages both batch and interactive work
• Backfilling
Reliability, Availability and Scalability
• Server failover feature
• Automatic job recovery
• System monitoring
• Integration with MPI solutions
• Tested to manage 1,000,000+ jobs per day
• Tested to accept 30,000 jobs per minute
• EAL3+ security
• Checkpoint support
30. Process Flow of a PBS Job
1. User submits job
2. PBS server returns a job ID
3. PBS scheduler requests a list of resources from the server *
4. PBS scheduler sorts all the resources and jobs *
5. PBS scheduler informs PBS server which host(s) that job can run on *
6. PBS server pushes job script to execution host(s)
7. PBS MoM executes job script
8. PBS MoM periodically reports resource usage back to PBS server *
9. When job is completed PBS MoM copies output and error files
10. Job execution completed/user notification sent
(Diagram: PBS server and PBS scheduler managing execution hosts A, B and C on the cluster network; the example job "pbsworks" requests ncpus, mem and host resources and runs on HOST A.)
Note: * This information is for debugging purposes only. It may change in future releases.
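To make the flow above concrete, here is a minimal sketch of submitting and monitoring a job with standard PBS Professional commands (the script name my_job.pbs, the job name and the resource values are illustrative, not NSCC defaults):

#!/bin/bash
# my_job.pbs - a minimal PBS Pro job script (illustrative)
#PBS -N hello_job              # job name
#PBS -l select=1:ncpus=24      # one node, 24 cores
#PBS -l walltime=00:10:00      # 10-minute walltime limit
#PBS -q normal                 # external queue (see the queue tables on the following slides)
cd ${PBS_O_WORKDIR}            # run from the directory the job was submitted from
echo "Running on $(hostname)"

From a login node:
qsub my_job.pbs                # submit; the PBS server returns a job ID (steps 1-2 above)
qstat -u $USER                 # list your queued and running jobs
qstat -f <job_id>              # detailed status of a specific job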
31. Compute Manager GUI: Job Submission Page
• Applications panel
– Displays the applications available on the registered PAS server
• Submission Form panel
– Displays a job submission form for the application selected in the Applications panel
• Directory Structure panel
– Displays the directory structure of the location specified in the Address box
• Files panel
– Displays the files and subdirectories of the directory selected in the Directory Structure panel
32. Job Queues & Scheduling Policies
External queue | Internal queue | Walltime limit | Other limits | Remarks
largemem | | 24 hours | To be decided | For jobs requiring more than 4 GB per core
normal | dev | 1 hour | 2 standard nodes per user | High-priority queue for testing and development work
normal | small | 24 hours | Up to 24 cores per job | For jobs that do not require more than one node
normal | medium | 24 hours | Up to limit as per prevailing policies | For standard job runs requiring more than one node
normal | long | 120 hours | 1 node per user | Low-priority queue for jobs that cannot be checkpointed
gpu | gpunormal | 24 hours | Up to limit as per prevailing policies | For “normal” jobs which require a GPU
gpu | gpulong | 240 hours | Up to limit as per prevailing policies | Low-priority queue for GPU jobs which cannot be checkpointed
33. Job Queues & Scheduling Policies
External queue | Internal queue | Walltime limit | Other limits | Remarks
iworkq | | 8 hours | 1 node per user | For visualisation
ime (look for it in the near future) | | 24 hours | Up to limit as per prevailing policies | For users who wish to experiment with DDN's IME burst buffer, which offers up to 500 GB/s of transfer speed
* Users only need to specify the 'External Queue' for job submission. Jobs will be routed to the internal queue depending on the job resource requirements.
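Only the external queue is specified at submission, as noted above. A hedged sketch of how that looks with qsub (script names and resource values are illustrative, and the exact resource keywords depend on NSCC's PBS configuration):

qsub -q normal -l select=2:ncpus=24 -l walltime=12:00:00 run.pbs   # multi-node standard run, routed internally (e.g. to "medium")
qsub -q largemem -l select=1:mem=500gb bigmem.pbs                  # for jobs needing more than 4 GB per core
qsub -q gpu gpu_job.pbs                                            # GPU job, routed to gpunormal/gpulong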
36. Parallel programming: OpenMP
• Available compilers (gcc/gfortran/icc/ifort)
– OpenMP (not to be confused with OpenMPI; used mainly for SMP programming)
• OpenMP (Open Multi-Processing)
• OpenMP is an API specification, whereas OpenMPI is an implementation of MPI
• An API for shared-memory parallel programming in C/C++ and Fortran
• Parallelization in OpenMP is achieved through threads
• Programming with OpenMP is comparatively easy, as it mainly involves compiler (pragma) directives
• An OpenMP program cannot communicate across nodes over the network; it is limited to shared memory
• Different stages of the program can use different numbers of threads
• A typical approach (fork-join: a master thread spawns a team of threads for each parallel region) is illustrated on the slide; see also the compile-and-run sketch below
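A minimal compile-and-run sketch for an OpenMP code on a single node (the source file hello_omp.c is hypothetical; Intel compilers are also available via Intel Parallel Studio):

gcc -fopenmp -O2 hello_omp.c -o hello_omp    # with icc, use -qopenmp instead
export OMP_NUM_THREADS=12                    # threads used by the parallel regions
./hello_omp                                  # runs within one shared-memory node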
37. Parallel programming: MPI
• MPI
– MPI stands for Message Passing Interface
– MPI is a library specification
– MPI implementations typically provide wrappers around standard compilers, with bindings for C/Fortran and also Java/Python
– Typically used for distributed-memory communication across nodes
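A minimal sketch of building and launching an MPI code with the wrapper and launcher provided by an MPI implementation (the source file hello_mpi.c is hypothetical; wrapper and launcher names can differ between MPI modules):

mpicc -O2 hello_mpi.c -o hello_mpi    # compiler wrapper around the underlying C compiler
mpirun -np 48 ./hello_mpi             # 48 processes, typically spread across the nodes allocated by PBS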
39. Allinea DDT
• DDT – Distributed Debugging Tool from Allinea
• Graphical interface for debugging
– Serial applications/codes
– OpenMP applications/codes
– MPI applications/codes
– CUDA applications/codes
• You control the pace of the code execution and examine the execution flow and variables
• Typical scenario
– Set a breakpoint in your code where you want execution to stop
– Let your code run until that point is reached
– Check the variables of concern
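A hedged sketch of launching DDT (the module name and program name are assumptions; compile with -g so the debugger can map back to source lines):

module load allinea          # module name is an assumption; check "module avail"
ddt ./myprog                 # debug a serial or OpenMP run in the GUI
ddt mpirun -np 4 ./myprog    # "express launch" of an MPI run under DDT (recent Allinea versions)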
40. Allinea MAP
• MAP – Application profiling tool from Allinea
• Graphical interface for profiling
– Serial applications/codes
– OpenMP applications/codes
– MPI applications/codes
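A hedged sketch of profiling a run with MAP (module, program and file names are assumptions):

module load allinea                    # module name is an assumption
map --profile mpirun -np 4 ./myprog    # non-interactive profiling run; writes a .map results file
map myprog_4p_*.map                    # open the collected profile in the MAP GUI (file name pattern may vary)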
44. GPU
• GPUs – Graphics Processing Units were initially built to deliver better graphics performance
• Subsequent research showed that GPUs also perform very well on general floating-point workloads
• This led to the term GPGPU (General-Purpose GPU)
• The CUDA Toolkit includes a compiler, math libraries, tools and debuggers
45. GPU in NSCC
• GPU configuration
– 128 GPU nodes in total
– Each server with 1 Tesla K40 GPU
– 128 GB host memory per server
– 12 GB device memory and 2,880 CUDA cores per GPU
• Connecting to a GPU server
– To compile a GPU application:
• Submit an interactive job requesting a GPU resource
• Compile the code with the NVCC compiler
– To submit a GPU job:
• Use qsub from the login nodes (see the sketch below), or
• Log in to Compute Manager
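A hedged sketch of the compile-and-submit workflow described above (the queue name "gpu" is taken from the earlier queue table; the ngpus resource keyword, the cuda module name and the file names are assumptions):

qsub -I -q gpu -l select=1:ncpus=24:ngpus=1 -l walltime=01:00:00   # interactive job on a GPU node
module load cuda                        # module name is an assumption
nvcc -O2 vector_add.cu -o vector_add    # compile with the CUDA compiler
./vector_add                            # quick test on the allocated Tesla K40
exit                                    # leave the interactive session
qsub -q gpu gpu_job.pbs                 # production runs go in as batch jobs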
47. What are Environment Modules?
• Environment Modules dynamically load/unload environment variables such as PATH, LD_LIBRARY_PATH, etc.
• They are based on module files written in the Tcl language
• They are shell independent
• Helpful for maintaining different versions of the same software
• Users also have the flexibility to create their own module files
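Typical day-to-day module commands look like this (the module name gcc is only an example; run module avail to see what NSCC actually provides):

module avail       # list available module files
module load gcc    # set PATH, LD_LIBRARY_PATH, etc. for that software
module list        # show currently loaded modules
module unload gcc  # undo the changes
module purge       # unload everything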
52. Managed Services offered
Infrastructure Services
• Computational resources
• Storage management
Incident Resolution
• Hardware break fix
• Software incident resolution
General Service Requests
• Data management
• Job management
• Software installation, etc.
Specialized Service Requests
• Code optimization
• Special queue configuration, etc.
Training Services
• Introductory class
• Code optimization techniques
• Parallel profiling, etc.
Helpdesk
• Portal/e-mail/phone
• Request for a service via the portal
• Interactive job submission portal
53. Where is NSCC
• NSCC Petascale
supercomputer in
Connexis building
• 40Gbps links extended to
NUS, NTU and GIS
• Login nodes are placed in
NUS, NTU and GIS
datacenters
• Access to NSCC is just
like your local HPC
system
1 Fusionopolis Way, Level-17 Connexis South
Tower, Singapore 138632
54. Supported login methods
• How do I log in?
– SSH
• From a Windows PC, use PuTTY or any standard SSH client; the hostname is nscclogin.nus.edu.sg (use NSCC credentials)
• From a Linux machine, use ssh username@nus.nscc.sg
• From a Mac, open a terminal and run ssh username@nus.nscc.sg
– File transfer
• Use SCP or any other secure-shell file transfer software from Windows
• Use the scp command to transfer files from Mac/Linux
– Compute Manager / Display Manager
• Open any standard web browser
• In the address bar, type https://nusweb.nscc.sg
• Use NSCC credentials to log in
– Outside campus
• Connect to the campus VPN to access the above-mentioned services
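A quick sketch using the hostnames given on this slide (the file and directory names are illustrative):

ssh username@nus.nscc.sg                          # replace "username" with your NSCC user ID
scp input.tar.gz username@nus.nscc.sg:~/data/     # upload a file to your home directory
scp username@nus.nscc.sg:~/data/output.log .      # download a result file to the current directory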
55. NSCC HPC Support (Proposed to be available by 15th Mar)
• Corporate Info – web portal
http://nscc.sg
• NSCC HPC web portal
http://help.nscc.sg
• NSCC support email
help@nscc.sg
• NSCC Workshop portal
http://workshop.nscc.sg
56. Help us improve. Take the online survey!
Visit: http://workshop.nscc.sg >> Survey
60. Web Site : http://nscc.sg
Helpdesk : https://help.nscc.sg
Email : help@nscc.sg
Phone : +65 6645 3412
62. User Enrollment
Instructions:
• Open https://help.nscc.sg
• Navigate to User Services -> Enrollment
• Click on Login
• Select your organization (NUS/NTU/A*Star) from the drop-down
• Input your credentials
Ref: https://help.nscc.sg -> User Guides -> User Enrollment guide
63. Login to the NSCC login nodes
• Download PuTTY from the internet
• Open PuTTY
• Type the login server name (login.nscc.sg)
• Input your credentials to log in
64. Compute Manager
• Open a web browser (Firefox or IE)
• Type https://nusweb.nscc.sg / https://ntuweb.nscc.sg / https://loginweb-astar.nscc.sg
• Use your credentials to log in
• Submit a sample job
73. Using Scratch space
#!/bin/bash
#PBS -N My_Job
# Name of the job
#PBS -l select=1:ncpus=24:mpiprocs=24
# Setting the number of nodes and CPUs to use
#PBS -W sandbox=private
# Get PBS to enter a private sandbox
#PBS -W stagein=file_io@wlm01:/home/adm/sup/fsg1/<my input directory>
# Directory where all the input files are available
# Files in the input directory will be copied to scratch space, creating a directory file_io
#PBS -W stageout=*@wlm01:/home/adm/sup/fsg1/<myoutput directory>
# Output directory path in my home directory
# Once the job is finished, the files from file_io in scratch will be copied back to <myoutput directory>
#PBS -q normal
cd ${PBS_O_WORKDIR}
echo "PBS_WORK_DIR is : $PBS_O_WORKDIR"
echo "PBS JOB DIR is: $PBS_JOBDIR"
# Notice that the output of pwd will be in the Lustre scratch space
echo "PWD is : `pwd`"
sleep 30
#mpirun ./a.out < input_file > output_file
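Assuming the script above is saved as scratch_job.pbs (the name is illustrative), it would be submitted and its staged-out results checked like this:

qsub scratch_job.pbs                          # stage-in runs before the job, stage-out after it finishes
qstat -u $USER                                # monitor the job
ls /home/adm/sup/fsg1/<myoutput directory>    # staged-out files appear here once the job completes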
Editor's notes
Algorithms & Numerical Techniques
Astronomy & Astrophysics
Augmented Reality
Big Data & Data Mining
Bioinformatics & Genomics
Business Intelligence & Analytics
Climate/Weather/Ocean Modeling
Cloud Computing
Computational Chemistry
Computational Fluid Dynamics (CFD)
Computational Photography
Computational Structural Mechanics
Computer Aided Design (CAD)
Computer Graphics & Visualization
Computer Vision & Machine Vision
Databases
DCC & Special Effects
Development Tools & Libraries
Economics
Education & Training
Electronic Design Automation (EDA)
Embedded & Robotics
Energy Exploration & Generation
Geoscience
Image Processing
Machine Learning & AI
Material Science
Medical Imaging
Mobile
Molecular Dynamics
Neuroscience
Physics
Programming Languages & Compilers
Quantum Chemistry
Ray Tracing
Signal/Audio Processing
Supercomputing
Video Processing
GIS’ capacity grew by 14 times within 3 years. We need more firepower to store & compute – As such GIS will need to work together with NSCC in order to process their ever growing amount of data. But transferring data by network will take at least a day. This was the typical situation ~ 6 months ago. Even though we know of the compute resources in FP, many researchers are reluctant to use them as they’ll end up spending most of their time waiting for data movement.
We are testing a 2km 500Gbps link from the sequencing labs in GIS to our supercomputers in Fusionopolis building direct from data generation to CPU and storage. A project task force has been set up.
We are also scheduling for the Systems Biology Garuda stems on our HPC cloud in time for live demo at the ICSB 2015 congress come November.
This image was extracted from current planning document. What I want to convey with this slide: Given the new network infrastructure, we’re going to be fully integrated with the up-coming NSCC. Not simply a matter of copying files there quickly. The network will enable us to use NSCC resources as it’s just next to our desk. i.e. The speed of transfer is so fast, latency so low that the distance becomes irrelevant.
Due to the high speed connection (500Gbps enabled), we can now stream sequencing data from GIS to remote supercomputers in NSCC (which is 2km away) to analyze sequence data!
Summary of setup (together with ACRC (LongBow and HPC FP) and ITSS):
GIS HS4000 is currently streaming sequencing data directly (no local footprint) to FP via IB or ExaNet. A single HS4000 will stream ~300 GB worth of data every 24 hours.
Once sequencing is completed, automated primary analysis is run.
Results from the analysis will return to GIS via the 500 Gbps IB link.
This simple-looking trial setup took quite a bit of effort to set up.
Power Usage Effectiveness: PUE = Total Facility Energy / IT Equipment Energy
Overall 14 Racks of storage and Parallel file system
PBS server
Central focus for a PBS complex
Routes job to compute host
Processes PBS commands
Provides central batch services
Server maintains its own server and queue settings
Daemon executes as pbs_server.bin
PBS MoM (machine-oriented miniserver)
Executes jobs at request of PBS scheduler
Monitors resource usage of running jobs
Enforces resource limits on jobs
Reports system resource limits, configuration
Daemon executes as pbs_mom
PBS scheduler
Queries list of running and queued jobs from the PBS server
Queries queue, server, and node properties
Queries resource consumption and availability from each PBS MoM
Sorts available jobs according to local scheduling policies
Determines which job is eligible to run next
Daemon executes as pbs_sched
Machine Oriented Mini-server
Stacks view
OpenMP Regions view
Functions view
Metrics view
Briefly run through the list of popular applications that are compatible with the NSCC HPC cluster.