BSDTW17: Johannes M Dieterich: High Performance Computing and GPU Acceleration: Where Do We Stand?
Personal Background
● Computational Chemist
● high-performance computing developer and user
FreeBSD experience
● user since 6.1
● development platform since 7.1
● ports committer since Jan 2017
HPC
● large problems
HPC usage by chemistry subfield
● materials, ...
Cluster science
● experiment and simulation
Typical HPC Installations
● 3552 nodes, 85k CPU cores, 2.7 PFLOPS, 222 TB RAM, 2.7 PB storage
HPC Ecosystem
● HW, OS, Tools, Numerical libs, Parallelization, Languages
Processor Architecture
● x86 dominates the top500
● a little bit of Power, Sparc64
OS
● Linux dominates
● a little bit of Unix (non x86 archs)
(Free)BSD and HPC
● Strengths
○ poudriere and pkg
■ custom, optimized packages for a whole cluster
○ libs, compilers setup ease
○ 31k+ ports
○ network and kernel performance
○ some automated regression testing
○ related: as virtualization host for contained HPC applications
Why Should We All Care?
● tech and visibility
○ GPU accel
○ numerical libs
○ performance
● Community capital
○ HPC developers
○ HPC users
○ new ideas
HPC Programming Languages
● Fortran dominates
● Some C/C++
● Tiny bit of Python
● Large part unknown (probably still Fortran)
● data from ARCHER
Compilers and Build Environments
● LLVM works well
● lang/gcc pitfalls:
○ -Wl,-rpath=/usr/local/lib/gcc5/
○ C++ standard library
○ OpenMP implementation mixing
● USES= fortran and USES= compiler:openmp both currently resolve to GCC
● Intel icpc, icc available
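The lang/gcc pitfalls above typically surface as C++/OpenMP runtime mix-ups at link or load time. A hypothetical port Makefile sketch (not a real port; the gcc5 runtime path is the one from the slide) showing the workaround:

```make
# Hypothetical sketch: a port needing Fortran and OpenMP pulls in lang/gcc
# via these USES, so the matching GCC runtime directory is pinned with an
# rpath to avoid mixing libstdc++/libgomp versions at load time.
USES=		fortran compiler:openmp
LDFLAGS+=	-Wl,-rpath=/usr/local/lib/gcc5/
```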
HPC make environments
● normal makes (gmake, cmake, …)
● homebrew (bash, perl, python, …)
● Makefiles hard code OS assumptions
devel/flang: LLVM-based
● 64bit-only Fortran compiler
● Fortran 77, 90, 95, 2003 well supported
Status
● port works for amd64
● patches not upstream
● based on LLVM 5.0
● jrm: made math/R use flang by default on x86
● jrm: added USES= fortran:flang
Parallelization: inter-node (MPI)
● MPI
○ message-based (sync/async)
○ high latency
○ explicit support required
○ (queue support required)
○ mpicc/mpif90 + magic
● Support in FreeBSD
○ OpenMP, mpich
○ fortran compiler limitations
Parallelization: intra-node (OpenMP)
#pragma omp parallel for default (none) shared(grid,sum)
● pragma-based
● low latency
● offloading capable
● $CC -fopenmp
Support in FreeBSD
● no support in base since LLVM switch
○ USES= compiler:openmp depends on lang/gcc
● D6362: use devel/llvm50 on x86
Software, Libs, and Capabilities
● Capabilities
○ queuing systems sysutils/slurm-wlm
○ networking
■ InfiniBand, Intel OmniPath
○ Libs
■ math/fftw3
■ science/libint, libxc (exchange-correlation for quantum chemistry)
■ math/tblis WIP tensor contraction framework
■ … many other non-comp-chemistry libs
BLAS and LAPACK
Linear Algebra
● standard interface for Fortran/C
● Linux: vendor libs like Intel MKL
Operations
● BLAS1: O(N)
○ e.g. dot product
● BLAS2: O(N^2)
○ e.g. matrix-vector product *gemv
● BLAS3: O(N^3)
○ e.g. matrix-matrix product
● LAPACK: complex ops
○ e.g. Cholesky decomposition *syrk
BLAS/LAPACK on FreeBSD: Mk/Uses/blaslapack.mk
● netlib (reference impl, not fast enough)
● atlas
● openblas
● gotoblas (legacy)
● blis/libflame
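A consumer port picks one of the backends above through the framework; a hypothetical sketch, assuming the `blaslapack:<backend>` argument syntax that the wishlist's `blaslapack:flame` entry also uses:

```make
# Hypothetical sketch, not a real port: request a specific BLAS/LAPACK
# backend through Mk/Uses/blaslapack.mk instead of hard-coding -lopenblas.
USES=		blaslapack:openblas
```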
dgemm performance on FreeBSD (Matrix-Matrix product)
● OpenBLAS g++/gfortran, ...
● (performance chart along M/N/K)
Porting HPC Apps to FreeBSD: Examples
● TigerCI
○ open-source
○ C++/Fortran + OpenMP
○ cmake
○ 10s of users
○ 74k SLOC
● ADF suite
○ commercial
○ Fortran, some C/C++, tcl + MPI
○ homebrew Python build system
● Challenges
○ time / funding
○ long-term viability
○ Linux assumptions
■ e.g. #!/bin/bash
○ BSDisms
■ e.g. -Wl,-rpath=
○ port dependencies
● Benefits
○ portability
○ compliance
○ exposure of bugs
Developing on FreeBSD
● Debug: gdb, lldb, valgrind
● Profiling: DTrace, hwpmc
HPC: accelerators
● none is still most common
● NVIDIA: CUDA
● Small bit of Xeon Phi
● Tiny bit of AMD OpenCL
GPU Accel
● CUDA
○ Fortran/C/C++, others
○ effectively proprietary
○ LLVM’s CUDA backend missing runtime libraries
● OpenCL
○ C/C++, others
○ general runtime loader ocl/loader
OpenCL on FreeBSD: CPU
● POCL
OpenCL: Intel
● beignet
● status: double precision support only experimental
○ double is critical for some algorithms
OpenCL: AMD
● clover
● status: doesn’t really work
Develop OpenCL On FreeBSD
● Tools
● devel/oclgrind: virtual OpenCL device simulator
● devel/cltune
● benchmarks/clpeak
● Libs:...
Language Bindings: java/aparapi
● JNI wrapper for OpenCL
Radeon Open Compute (ROCm) and FreeBSDDesktop
● Fundamentals
○ amdgpu: KMS driver
○ amdkfd: compute kernel driver
○ ROC Thunk: usermode library to amdkfd
○ runtime libraries
○ llvm fork for targeting ROCm
● Libs
○ OpenCL
○ HIP (CUDA emulator)
○ rocBLAS
○ rocFFT
○ MIOpen
○ hiptensorflow
Wishlist: near-term
● math/blis
● OpenMP
○ D6362 into tree
● OpenCL
○ fix lang/pocl and lang/clover?
● more people, ports
Wishlist: mid-term
● blaslapack:flame
● OpenMP
○ in base
● accel
○ amdkfd
○ ROCm ecosystem
● devel/flang
○ USES= fortran:flang for more ports
Wishlist: long-term
● blaslapack:flame
● OpenMP
○ non x86 archs
● accel
○ OpenCL via ROCm
● devel/flang
○ non x86 archs
● devel/linuxisms port?
Acknowledgements
(long list of contributors)
Q&A
● Audience remark: providing ports/linuxisms is not a problem at all, as long as no improper dependency exists.
● Will this support multiple GPUs in a single system? A: yes, multiple DRM devices.
● Experience with TI co-processors? A: no first-hand experience; lang/pocl seems to have some support.
● Why are there so few accelerators in top500?
○ A: a lot of legacy code, plus publication bias. Some models may not scale on certain GPU hardware due to memory bandwidth and size limits.
○ May change as languages and frameworks evolve; for now it takes too much manual labor and is thus limited by developer time.