PCCC20: Center for Computational Sciences, University of Tsukuba, "Latest Research Results from Interdisciplinary Computational Science"
1. https://www.ccs.tsukuba.ac.jp/
University of Tsukuba | Center for Computational Sciences
Mission of CCS
The CCS promotes "multidisciplinary computational science" based on the fusion
of computational science and computer science. To this end, the CCS
develops high-performance computing systems through "co-design". Its scientific
research areas cover particle physics, astrophysics, nuclear physics, nano-science, life
science, environmental science, and information science.
The CCS was reorganized in April 2004 from its predecessor, the Center for
Computational Physics, which was established in 1992. The CCS is a research institute for the
above-mentioned fields and also a joint-use facility for outside researchers.
Since 2010, the CCS has been approved as a national core center, the Advanced
Interdisciplinary Computational Science Collaboration Initiative (AISCI), by the Ministry of
Education, Culture, Sports, Science and Technology (MEXT). The CCS aims to play a
significant role in the development of multidisciplinary computational science.
Chronology and Major Events
1992  Foundation of the Center for Computational Physics (CCP)
1996  Completion of CP-PACS, a 0.6 TFLOPS MPP ranked No. 1 on the Top 500 in Nov. 1996
2002  Completion of HMCS (Heterogeneous Multi-Computer System), an 8.6 TFLOPS coupled CP-PACS/GRAPE-6 system
2004  Reorganization and expansion of CCP, renamed Center for Computational Sciences (CCS)
2006  Two major new computing facilities start operation:
      PACS-CS, a general-purpose 14.3 TFLOPS MPP cluster for computational sciences
      FIRST, an HMCS-E for astrophysical simulations (general-purpose 3.5 TFLOPS + gravity 35 TFLOPS)
2008  Completion of the T2K-Tsukuba system, a 95.4 TFLOPS cluster ranked No. 20 on the Top 500 in Jun. 2008
2012  HA-PACS Base Cluster is delivered with 802 TFLOPS of peak performance, ranked No. 41 on the Top 500 in Jun. 2012
2013  HA-PACS/TCA is added to the HA-PACS system with 364 TFLOPS of peak performance in Oct. 2013, expanding the total peak performance of HA-PACS to over 1.1 PFLOPS
      Joint Center for Advanced HPC (JCAHPC) established in alliance with the University of Tokyo
2014  COMA (PACS IX) is delivered with 1.001 PFLOPS of peak performance, ranked No. 51 on the Top 500 in Jun. 2014
2016  Oakforest-PACS is installed and starts operation at JCAHPC
2019  Cygnus is installed and starts operation
[Photos: CP-PACS, FIRST-Cluster, PACS-CS, T2K-Tsukuba, HA-PACS, COMA]
Current Supercomputers: Oakforest-PACS, Cygnus
2. 2+1 flavor QCD at Physical Point on very large lattices (master-field simulations)
Exploring QCD phase diagram
Research in Particle Physics
contact address: pr@ccs.tsukuba.ac.jp
Investigating the phase structure of QCD at non-zero temperature and density is very
important for understanding the properties of strongly interacting matter under extreme
conditions. It is known that the order of the phase transition depends on the masses
and the number of quark flavors, and that there should be so-called critical endlines,
lines of second-order phase transitions, in the space of quark masses, as shown in
Fig. 2a.
To determine the shape of the critical endline in the small quark mass region we are
carrying out lattice QCD simulations at finite temperature with 2+1 as well as 3
degenerate quark flavors on Cygnus and Oakforest-PACS. Fig. 2b shows our recent
estimation of the critical pion mass in 3 flavor QCD in the continuum limit including a
new calculation with the temporal lattice extent of 12, where the new result gives a
smaller upper bound than that of our previous calculation.
Fig. 1a:
Relative difference of the light hadron spectrum from
experiment. The only inputs are the pion, kaon, and omega baryon
masses, which determine the up-down quark mass, the strange quark mass, and
the lattice cutoff, respectively. Our results show good agreement
with experiment, although the errors are still not small for
some of the hadrons.
[K.-I. Ishikawa et al., https://arxiv.org/abs/1511.09222]
Fig. 1b:
A comparison of the pseudoscalar decay constants, fπ and fK, on
(10 fm)^4 and (5 fm)^4 volumes. We detect finite volume effects of 0.66% and 0.26%
on fπ and fK, respectively. The effect is very small and
negligible when comparing with the corresponding experiments. We
can now control and remove the finite volume effect completely by
using master-field simulations.
[K.-I. Ishikawa et al., Phys. Rev. D 99, 014504]
Hadrons are the constituents of atomic nuclei. Computing the mass
spectrum of hadrons from the first principles of quantum
chromodynamics (QCD), the fundamental theory of the strong interaction
described by quarks and gluons, is a principal subject in particle
physics.
After the quenched and subsequent 2-flavor QCD simulations on
CP-PACS, these studies were extended to 2+1 flavor QCD by
incorporating the dynamical strange quark, though the degenerate
up-down quark mass was much heavier than the physical one. On the
PACS-CS and T2K computers, we succeeded in reaching the
physical point. This calculation was followed by a larger-volume
simulation on the K computer.
Our current project aims to control and remove the systematic errors
that arose in the previous simulations from the finite volume and finite lattice
spacing. We are performing so-called master-field simulations on a very
large (10 fm)^4 volume with several lattice spacings using
Oakforest-PACS.
Fig. 2a:
Expected quark mass dependence of the
order of the QCD phase transition. Our goal
is to determine the shape of the critical
endline shown as a red curve in the lower-
left corner.
Fig. 2b:
Our recent estimation of the critical pion mass,
mπ,E, in 3 flavor QCD. The continuum extrapolation
including new data sets with the temporal extent
of 12 gives an upper bound mπ,E ≲ 110 MeV.
[Y. Kuramashi et al., Phys. Rev. D 101, 054509]
3. Vlasov-Poisson simulation of cosmic neutrinos in the large-scale structure
of the universe
Theoretical galaxy formation – numerical simulations reveal the fate of stars and gas
Solving the Mysteries of the Universe with Computational Astrophysics
When a cluster of stars forms, only a part of the natal cloud is
converted into stars, and the rest is ionized and heated by the
powerful stellar radiation and ejected outward. Using
radiation-hydrodynamic simulations, we found that star
formation is primarily controlled by the formation of ionized
regions, as well as the surface density and dust content of the
natal cloud. We developed a new semi-analytic model that
captures this behaviour and can be incorporated in subgrid
recipes for large-scale cosmological simulations.
Fukushima, Yajima, et al. (2020), MNRAS, 497, 3830
contact address: ayw@ccs.tsukuba.ac.jp / pr@ccs.tsukuba.ac.jp
We devise a physical model to determine the formation,
distribution, and kinematics of molecular gas clouds in
galaxies, and predict the intensities of carbon monoxide (CO)
lines and the molecular hydrogen (H2) abundance, taking into
account the interstellar radiation field and dust attenuation.
We apply the model to data from the Illustris-TNG
cosmological simulations and compare the CO luminosities
and H2 masses with recent observations of galaxies at low
and high redshifts. The model successfully reproduces the
observed CO-luminosity function and the total H2 mass in
the local universe.
Inoue, S., Yoshida, N. & Yajima, H., (2020) accepted for publication in MNRAS
Fig. 2a: The structure of the five brightest galaxies in CO(1-0) in the simulation.
Fig. 2b: Density evolution in the formation of star clusters. White circles indicate
stars and the green contours bound ionization regions.
Neutrinos are elementary particles ubiquitous in the universe. The Super-Kamiokande experiment revealed that
neutrinos have mass, which implies that neutrinos can dynamically affect the formation of large-scale structure (LSS) in
the universe. We perform numerical simulations of LSS formation incorporating the effect of massive neutrinos by
directly solving the collisionless Boltzmann equation in 6D phase-space on two supercomputers, FUGAKU and Oakforest-
PACS. Our highly optimized simulation code achieves almost ideal weak and strong scaling on FUGAKU.
Yoshikawa, K., Tanaka, S., Yoshida, N. & Saito, S. (2020) accepted for publication in ApJ.
Fig. 1a: Simulated distributions of massive neutrinos (color scale) and dark matter
(contours) as well as dark matter halos (white circles) at a) redshift z = 0 (the present),
and b) redshift z = 1 (about 7.9 Gyr ago).
Fig. 1b: Strong scaling of Vlasov simulations on the supercomputer
Fugaku. Run ID prefixes S, M, and L denote grid resolutions of
96³, 192³, and 384³, respectively, and the number denotes the
number of computational nodes in multiples of 144.
4. Are “free neutrons” in neutron stars free?
Computational Nuclear Physics
Although the atomic nucleus is a microscopic object on Earth, there is a gigantic nucleus in the universe: the neutron
star (Fig. 1). Near the surface of a neutron star, a periodic crystalline structure is formed and all the protons are
expected to be confined. In contrast, there are unbound neutrons, which are regarded as “free”. These free neutrons
play a key role in various observed phenomena, such as pulsar glitches and cooling.
Interactive Plot of Atomic nuclei and Computed Shapes (InPACS)
Measuring nuclear properties with accelerators is very expensive. The resulting data are valuable for a wide range of
technologies, so they are compiled by nuclear data centers around the world and opened to the public. We have
calculated almost all kinds of nuclides in the universe using energy density functional theory. The computations
complement missing experimental data. To publicize the computational nuclear data, we have opened a website,
InPACS, where you can interactively obtain various nuclear data and information.
contact address: nakatsukasa@nucl.ph.tsukuba.ac.jp
Fig. 3: Snapshot of InPACS web site.
Fig. 1: Structure of a neutron star
Courtesy of http://www.astroscu.unam.mx/neutrones/
[Fig. 2 axes: m*/m_n (0.6 to 1.1) versus ρ [fm^-3] (0 to 0.1)]
Fig. 2: Ratio of effective mass of
free neutrons in the neutron-
star crust (slab phase) to their
bare mass.
We have examined the properties of the “free neutrons” using nuclear
density functional calculations. Surprisingly, in a certain density region,
they are even “super-free”, which means that their mass is lighter in the
neutron star than in the vacuum (Fig. 2)!
This research was supported by
ImPACT project on Reduction and
Resource Recycling of High-level
Radioactive Wastes through Nuclear
Transmutation.
5. (a) Optical near-field generated in metal-organic framework, IRMOF-10
SALMON: Scalable Ab-initio Light-Matter simulator for Optics and Nanoscience
Optical Properties of Nano-materials in Real Time and Real Space
Quantum Condensed Matter Physics
Understanding the interaction between light and matter is the basis of a wide range of technologies. For this
purpose, it is essential to describe the electron dynamics induced in matter by the electromagnetic field of light on a
microscopic scale: 10^-9 m (nanometer) in space and 10^-15 s (femtosecond) in time. We have been developing
an open-source computer code, SALMON (Scalable Ab-initio Light-Matter simulator for Optics and
Nanoscience), that describes electron dynamics in molecules, nano-materials, and solids based on
first-principles time-dependent density functional theory [http://salmon-tddft.jp]. As a novel function of
SALMON, light propagation in nano-materials as well as in bulk media can be described, taking full
account of the nonlinearity and nonlocality of light-matter interactions at the ab-initio level. We expect
SALMON to be widely used in cutting-edge research in optics and nanoscience.
contact address: pr@ccs.tsukuba.ac.jp
[Figure labels: laser field, I = 10^10 W/cm^2, ω = 3.38 eV; intensity scale I (W/cm^2): 10^9, 10^10, 10^11, 10^12]
(b) Weak-scaling
performance on the
Fugaku system using
up to 27,648 nodes
to simulate 13,648
atoms.
When a light pulse irradiates nano-sized objects, a strong and spatially
localized electromagnetic field, called the near field, appears around the
object. The near field enables imaging beyond the limit of optical resolution and
enhances nonlinear optical processes. We perform first-principles calculations of
the photoexcitation dynamics of an acetylene molecule in a metal-organic
framework, IRMOF-10. Resonant laser excitation of IRMOF-10 generates an
optical near field around the two benzene rings that comprise the main
framework of IRMOF-10. We observe second-harmonic excitation caused by the spatial
nonuniformity of the optical near field.
(b) Optical property of metallic metasurface with sub-nm gaps
Thanks to rapid progress in the fabrication of nano-materials, it is
possible to manufacture periodic materials composed of uniformly
structured nano-objects. Here we investigate the optical properties
of quantum plasmonic metasurfaces composed of two-dimensional
arrays of metallic nano-spheres with sub-nm gaps using
time-dependent density functional theory (TDDFT), a fully quantum
mechanical approach. When the quantum and classical
descriptions are compared, the absorption rates of the
metasurface exhibit substantial differences at shorter gap distances.
The differences are caused by electron transport through the gaps
between the nano-objects.
[Figure labels: Absorption rates (Classical, TDDFT; axes: Energy, Gap distances); Current distribution (Re, Im; x, y; 0.4 nm)]
(a) A multiphysics simulation
solving Maxwell, time-
dependent Kohn-Sham,
and Newton equations is
performed on the Fugaku
system for a thin film of
amorphous SiO2 composed
of more than 10,000 atoms.
Disclaimer
The results obtained on the evaluation environment in the trial phase do not guarantee the performance, power and other attributes of the supercomputer Fugaku at the start of its public use operation.
6. University of Tsukuba | Center for Computational Sciences
Computational Elucidations for Biomolecules
The world of life is full of mystery. The actual molecular structures, motions, and chemical reactions of biological molecules
such as proteins, nucleic acids, carbohydrates, and lipids are still unclear. Using supercomputers, we have performed highly
demanding computations based on molecular dynamics (MD) and hybrid quantum mechanics/molecular mechanics
(QM/MM) methods, and we are answering some important biological questions.
Fig. 2: (a) Effective conformational sampling in MD simulations: Parallel Cascade Selection MD (PaCS-MD). To promote
conformational transitions, the following cycle is repeated in PaCS-MD: (I) selection of initial seeds (structures) that have a high potential to make the transition; (II)
conformational resampling by restarting multiple MD simulations from the selected initial seeds. [R. Harada et al., J. Chem. Phys. 139,
035103 (2013)]
(b) QM/MM model of the oxygen-evolving complex in photosystem II. Key intermediate states in the catalytic reaction “2H2O + 4hν → 4H+ + 4e− + O2”
have been investigated using the large model. [M. Shoji et al., Catal. Sci. Technol., 3, 1831 (2013).]
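The select-and-restart cycle of PaCS-MD described in Fig. 2a can be illustrated with a toy model. The sketch below is not the authors' implementation: a 1-D random walk stands in for an MD run, the distance to a hypothetical target coordinate stands in for the selection criterion, and all names and parameters (`short_md`, `pacs_cycle`, the step size, the number of seeds) are assumptions made for illustration.

```python
import random

def short_md(x0, n_steps, rng):
    """Toy stand-in for a short MD run: a 1-D random walk starting at x0."""
    traj = [x0]
    for _ in range(n_steps):
        traj.append(traj[-1] + rng.gauss(0.0, 0.1))
    return traj

def pacs_cycle(seeds, target, n_steps, rng):
    """One cycle: restart a short run from every seed (step II), then
    select the snapshots closest to the target as the next seeds (step I)."""
    snapshots = []
    for x0 in seeds:
        snapshots.extend(short_md(x0, n_steps, rng))
    snapshots.sort(key=lambda x: abs(x - target))
    return snapshots[:len(seeds)]

rng = random.Random(0)
target = 5.0            # hypothetical "product state" coordinate
seeds = [0.0] * 10      # all replicas start in the "reactant state"
best = [abs(seeds[0] - target)]
for _ in range(50):
    seeds = pacs_cycle(seeds, target, n_steps=20, rng=rng)
    best.append(abs(seeds[0] - target))

print(f"distance to target: {best[0]:.2f} -> {best[-1]:.2f}")
```

Because every short run keeps its starting structure among the candidate snapshots, the best distance in this toy model can never get worse from one cycle to the next.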
[Fig. 2b labels: 2H2O, 4H+, O2, QM region, CaMn4O5 cluster]
GPU-accelerated Molecular Orbital Calculation
Large-scale ab initio molecular orbital calculation is a target application in quantum chemistry for HPC systems,
and the fragment molecular orbital (FMO) method is one such application because it is designed for parallel computers.
We have developed a GPU-accelerated FMO calculation program with CUDA and obtained a 3.8x speedup over the CPU for an on-the-fly
FMO calculation of a 1,961-atom protein. [H. Umeda et al., IPSJ Transactions on Advanced Computing Systems 6, 4, (2013) 26-37. H. Umeda et al., SC15 poster (2015).]
(a)
Divides into fragments
Dimer SCF or ES-Dimer calc.
for each fragment-pair
SCF calc. for each
fragment with ESP (SCC)
Application      Lysozyme                        HA3
#Atoms           1,961                           23,460
#Nodes (#GPU)    8 (0)      8 (32)     Speedup   64 (256)
SCC              3,071 s    828 s      3.7x      0.52 hr
Dimer SCF        6,246 s    1,675 s    3.7x      0.90 hr
ES Dimer         407 s      78 s       5.2x      0.45 hr
Total            9,770 s    2,597 s    3.8x      1.97 hr
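The speedup column of the timing table is the ratio of the CPU-only time to the GPU-accelerated time for lysozyme. A short check of that arithmetic, with the timings copied from the table (the variable names are illustrative):

```python
# Lysozyme timings in seconds, copied from the table:
# 8 nodes without GPUs versus 8 nodes with 32 GPUs.
cpu_s = {"SCC": 3071, "Dimer SCF": 6246, "ES Dimer": 407, "Total": 9770}
gpu_s = {"SCC": 828, "Dimer SCF": 1675, "ES Dimer": 78, "Total": 2597}

speedup = {step: round(cpu_s[step] / gpu_s[step], 1) for step in cpu_s}
for step, s in speedup.items():
    print(f"{step:10s} {s:.1f}x")
```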
(b)
2 Hours for FMO
calculation with 256 GPUs
Influenza HA3 protein
(23,460 atoms, 721 fragments)
Fig. 1: (a) FMO calculation scheme, where a large molecule is divided into many small fragments. Total molecular properties are reconstructed from the
self-consistent field (SCF) calculations of fragments and fragment pairs with an electrostatic potential (ESP) that satisfies the self-consistent-charge (SCC)
condition.
(b) Performance of GPU-accelerated FMO calculations. The GPU-accelerated FMO-HF/6-31G(d) calculation of lysozyme on the HA-PACS base cluster shows
a 3.8x speedup.
(c) As a large-scale MO application, the FMO-HF/6-31G(d) calculation of the influenza HA3 protein is successfully performed with 256 GPUs within two hours.
MD and QM/MM simulations using supercomputers
contact address: shigeta@ccs.tsukuba.ac.jp
7. 338-gene analyses resolved the phylogenetic affiliation of a microeukaryote
Microheliella maris.
In silico structural modeling and analysis of translation elongation factor 1α proteins
Biological Sciences
contact address: yuji@ccs.tsukuba.ac.jp
Fig. 2: EF-1α and tRNA structures and surface electrostatic distribution.
(a) EF-1α structure of an archaeon (PDB ID: 3WXM). (b) tRNA structure
(PDB ID: 1EHZ). (c & d) divEF-1α models. Dotted lines in (a), (c) and (d)
indicate the surfaces interacting with tRNA.
Translation elongation factor-1α (EF-1α) interacts with tRNA
during protein synthesis. Some eukaryotes appear to possess
highly divergent EF-1α (divEF-1α), implying that these proteins
lack the ability to interact with tRNA. We modelled the tertiary
structures of divEF-1α and validated the model structures by
molecular dynamics simulations. We found that the molecular
surfaces of divEF-1α are partly negatively charged, suggesting
that they may not interact with negatively charged tRNA as
strongly as the canonical EF-1α, whose surfaces are positively
charged.
[Fig. 2 panel labels: (a) Canonical EF-1α; (b) tRNA; (c) divEF-1α in a diatom; (d) divEF-1α in a fungus; surface interacting with tRNA; color scale from -0.1 V to +0.1 V]
Sakamoto et al. 2019 ACS Omega 4:7308-7316
Previously published phylogenetic studies failed to elucidate the phylogenetic position of the
heliozoan microeukaryote Microheliella maris. Thus, we took a “phylogenomic” approach
to place M. maris accurately in the global tree of eukaryotes. In the phylogeny inferred
from an alignment containing 338 genes, M. maris branched with high statistical support at the base of the clade of a
diverse collection of microeukaryotes collectively called Cryptista.
Fig. 1a: Schematic cell drawing of Microheliella maris.
Fig. 1b: Maximum likelihood phylogeny inferred from the 338-gene alignment.
8. University of Tsukuba | Center for Computational Sciences
Simulation of Atmospheric General Circulation by Global Cloud Resolving Model,
NICAM
Development of LES Model for thermal environment at city scale
NICAM (Nonhydrostatic ICosahedral Atmospheric Model) is able to realistically reproduce
multi-scale cloud systems, including cumulus convection, tropical cyclones,
Arctic cyclones, the Madden-Julian Oscillation (MJO), and the Intertropical
Convergence Zone (ITCZ).
In Fig. 1, NICAM with glevel-10 (7-km horizontal resolution) successfully simulates
Typhoon Sinlaku near the Philippine Islands and Hurricane Ike near the Gulf of
Mexico.
Our group has been developing a Large Eddy
Simulation (LES) model for urban environments.
The main features of the model are (i)
building-resolving grids, (ii) roadside trees
resolved in the vertical direction, (iii) multiple
reflections of short- and long-wave radiation
between buildings and trees computed by the radiosity
method, (iv) resolved shadows from buildings
and trees, and (v) incorporation of cloud
physics and atmospheric radiation models.
A numerical simulation of the thermal environment
around Tokyo Station was conducted using the
Oakforest-PACS supercomputer. The total
number of grid points is about 100 million.
Division of Global Environmental Sciences
contact address: pr@ccs.tsukuba.ac.jp
Fig. 1: Numerical simulation of the general circulation of the atmosphere
produced by 7-km resolution NICAM.
Fig. 2: Road skin temperature distribution estimated by the CCS-LES model (2a) and helicopter
observation (2b). Black indicates buildings.
Hurricane forecast using an operational numerical weather prediction model
ECMWF OpenIFS
An easy-to-use version of the Integrated Forecasting System (IFS)
operated at ECMWF (European Centre for Medium-Range Weather Forecasts).
・Hydrostatic global spectral model
(max resolution T1279: about 14 km grid interval)
・Reduced Gaussian grid
・Hybrid MPI-OpenMP scheme
(non-GPU, non-FPGA)
Results - forecast of Hurricane Joaquin (2015) -
Experimental settings
Version:              cy40r1 (ECMWF, 2014), operational version from 19 Nov. 2013 to 11 May 2015
Initial condition:    Atmosphere: GFS high-res analysis; Land & Sea: ERA5 reanalysis
Model resolution:     T639 L91 (32 km grid spacing on the equator and 91 vertical levels)
Forecast length:      240 hours (960 time steps with dt = 900 s)
Computer
Parallelisation:      256 MPI procs (16 nodes * 16 procs/node), 4 OpenMP threads/process
Computation time:     3:12:38 (19 minutes for 1 day forecast)
Data size of output:  9.9 GB
Computation time decreased by 40% with the Intel MKL library compared with
LAPACK.
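The derived figures in the experimental settings follow directly from the listed values. A minimal check of that arithmetic (variable names are illustrative):

```python
from datetime import timedelta

forecast_hours = 240
dt_s = 900
n_steps = forecast_hours * 3600 // dt_s               # number of time steps

wall = timedelta(hours=3, minutes=12, seconds=38)     # computation time 3:12:38
forecast_days = forecast_hours / 24
minutes_per_day = wall.total_seconds() / 60 / forecast_days

mpi_procs = 16 * 16                                   # 16 nodes * 16 procs/node
threads = mpi_procs * 4                               # 4 OpenMP threads/process

print(n_steps, round(minutes_per_day, 1), threads)
```

The last value is simply the product of MPI processes and OpenMP threads per process; it is not stated in the original settings.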
Remark
The experimental result showed a cyclone track similar to the NCEP control
forecasts (thick line), suggesting that the initial conditions had a larger impact on
the track forecast than NWP models in this case.
Fig. 3: Predicted cyclone tracks of Hurricane Joaquin (coloured lines) by ECMWF
(Europe, left), the OpenIFS experiment (second left), NCEP (US, second right)
and JMA (Japan, rightmost). Black lines show the observed track.
9. University of Tsukuba | Center for Computational Sciences
Implementation of Parallel 3-D Real FFT with 2-D Decomposition on Intel Xeon Phi Clusters
Numerical Computation
contact address: pr@ccs.tsukuba.ac.jp
Development of a highly accurate Block Krylov solver
Background
The fast Fourier transform (FFT) is an algorithm that is widely used in science and engineering. A typical
decomposition for performing a parallel 3-D FFT is slabwise (1-D). This becomes an issue with very large MPI process counts
on a massively parallel cluster of many-core processors, because a slab decomposition cannot use more processes than there are planes along the decomposed axis.
Overview
We proposed an implementation of a parallel 3-D real FFT with 2-D decomposition on Intel Xeon Phi clusters. The
proposed implementation is based on the conjugate symmetry property of the discrete
Fourier transform (DFT) and the row-column FFT algorithm. We vectorized the FFT kernels using the Intel Advanced Vector
Extensions 512 (Intel AVX-512) instructions.
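The two ingredients named in the overview, conjugate symmetry and the row-column algorithm, can be checked on a small array. The sketch below is a serial NumPy illustration only; it does not show the MPI 2-D decomposition or the AVX-512 kernels of FFTE, and the array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6, 8))        # small real 3-D array

# Row-column algorithm: a 1-D real FFT along the last axis, then
# 1-D complex FFTs along the two remaining axes.
step1 = np.fft.rfft(x, axis=2)            # keeps 8 // 2 + 1 = 5 coefficients
step2 = np.fft.fft(step1, axis=1)
row_column = np.fft.fft(step2, axis=0)

reference = np.fft.rfftn(x)               # library 3-D real FFT
full = np.fft.fftn(x)                     # full complex 3-D FFT

# Conjugate symmetry for real input: X[-k0, -k1, -k2] == conj(X[k0, k1, k2]),
# so the discarded half of the last axis carries no new information.
idx = np.ix_(-np.arange(4) % 4, -np.arange(6) % 6, -np.arange(8) % 8)
mirrored = np.conj(full[idx])

print(np.allclose(row_column, reference), np.allclose(mirrored, full))
```

Storing only the non-redundant half roughly halves both the arithmetic and the data that would have to be exchanged in a distributed transpose.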
Performance
To evaluate the implemented 3-D real FFT with 2-D
decomposition, referred to as FFTE 7.0 (2-D
decomposition), we compared its performance with
that of FFTE 7.0 (1-D decomposition), FFTW
3.3.8, and P3DFFT 2.7.7. The performance results
demonstrate that the proposed implementation of the
parallel 3-D real FFT with 2-D decomposition
effectively improves performance by reducing the
communication time for larger numbers of MPI
processes on Intel Xeon Phi clusters.
Fig. 1: Performance of Parallel 3-D Real FFTs (N = 256 × 512 × 512)
Linear systems with multiple right-hand sides appear in many scientific applications, such as the computation of physical
quantities in lattice quantum chromodynamics (QCD) and the inner problems of eigensolvers for sparse matrices. Block Krylov
subspace methods are known to be efficient numerical methods for solving these linear systems
in terms of both the number of iterations and the computation time. However, the accuracy of the obtained solution
often deteriorates because of errors that occur in the computation of matrix-matrix multiplications. To improve the
accuracy of the obtained solution, we have developed a new Block Krylov subspace method named the Block GWBiCGSTAB
method [1]. The Block GWBiCGSTAB method is based on the group-wise updating technique. With this technique, the
matrix-matrix multiplications that cause accuracy degradation can be avoided. As shown in Fig. 2, the accuracy of the
solution generated by the Block GWBiCGSTAB method is higher than that of other methods.
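The accuracy measure plotted for these solvers is the true relative residual norm, which is recomputed from the final solution instead of being taken from the residual updated recursively inside the iteration. The sketch below shows only this metric, not the Block GWBiCGSTAB method; the per-column 2-norm, the random test matrix, and the direct solve standing in for a block Krylov solver are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 200, 4
A = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned test matrix (assumption)
B = rng.standard_normal((n, L))                   # L right-hand sides
X = np.linalg.solve(A, B)                         # stand-in for a block Krylov solver

def true_relative_residual(A, X, B):
    """Per-column ||b_j - A x_j||_2 / ||b_j||_2, recomputed from X."""
    R = B - A @ X
    return np.linalg.norm(R, axis=0) / np.linalg.norm(B, axis=0)

res = true_relative_residual(A, X, B)
print(res)
```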
Fig. 2: True relative residual norm as a function of the
number L of right-hand sides. The test problem is the
linear system derived from the lattice QCD calculation.
Problem size: 1,572,864.
[1] Hiroto Tadano and Ryosei Kuramoto, Accuracy improvement of the Block BiCGSTAB method for linear systems with multiple
right-hand sides by group-wise updating technique, J. Adv. Simulat. Sci. Eng., Vol. 6, No. 1, pp. 100-117, 2019.
10. Python is one of the most popular general-purpose programming
languages, and persistent memory (PMEM) is a new device that can
accelerate data-intensive computing. There is strong demand for an easy
way to use persistent memory from Python. Therefore, we focus on pmemkv,
a key-value store optimized for persistent memory, and its
Python bindings. We are currently evaluating pmemkv’s Python
bindings in detail for efficient use of PMEM in Python.
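The idea of byte-addressable persistent data can be sketched with Python's standard `mmap` module. This is not the pmemkv API, whose Python calls are not shown in this document; a regular temporary file stands in for a file on a PMEM-backed filesystem, and the file name and layout are assumptions for illustration.

```python
import mmap
import os
import struct
import tempfile

# A regular temporary file stands in for a file on a PMEM-backed filesystem.
path = os.path.join(tempfile.mkdtemp(), "pool.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 4096)

# Store a counter with ordinary memory writes into the mapping,
# without explicit read()/write() calls on the file.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 4096) as m:
        struct.pack_into("<Q", m, 0, 42)
        m.flush()

# Map the file again, as a restarted process would, and read the value back.
with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 4096) as m:
        (counter,) = struct.unpack_from("<Q", m, 0)

print(counter)
```

On real persistent memory, a library such as pmemkv additionally handles crash consistency, which this sketch does not attempt.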
http://oss-tsukuba.org/en/software/gfarm
Software Researches for Big Data and Extreme-Scale Computing
contact address: pr@ccs.tsukuba.ac.jp
Investigating the DAOS architecture for metadata operations
Research on a caching file system to exploit node-local storage
The open-source DAOS (Distributed Asynchronous Object Storage)
is notable for its rank on the IO-500 list and its use of Intel®
Optane™ Persistent Memory. In particular, its metadata performance
is remarkable compared to other systems.
We are investigating how the DAOS architecture achieves this metadata
performance and considering whether to integrate its
techniques into an existing system or to develop a new storage system with
persistent memory.
The performance gap between processors and disk-based storage is
growing in modern HPC systems. To reduce the gap, SSDs attached to
compute nodes have been used as a “node-local burst buffer”. We are
implementing a distributed file system that uses local SSDs as a caching
layer for the storage nodes. The system uses the FUSE library to
intercept file system calls and the Mochi framework for RPC data transfer.
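The caching-layer idea can be illustrated with a minimal read-through cache. The sketch below is hypothetical and much simpler than the system described: two local directories stand in for the node-local SSD and the remote storage node, plain file copies stand in for RPC transfers, and neither FUSE nor Mochi is used.

```python
import os
import shutil
import tempfile

class ReadThroughCache:
    """Hypothetical read-through cache: `cache_dir` plays the node-local SSD,
    `backend_dir` plays the remote storage node."""

    def __init__(self, backend_dir, cache_dir):
        self.backend_dir = backend_dir
        self.cache_dir = cache_dir
        self.hits = 0
        self.misses = 0

    def read(self, name):
        cached = os.path.join(self.cache_dir, name)
        if os.path.exists(cached):
            self.hits += 1
        else:
            # Cache miss: fetch the file from the backend once.
            self.misses += 1
            shutil.copyfile(os.path.join(self.backend_dir, name), cached)
        with open(cached, "rb") as f:
            return f.read()

backend = tempfile.mkdtemp()
cache = tempfile.mkdtemp()
with open(os.path.join(backend, "data.bin"), "wb") as f:
    f.write(b"payload")

fs = ReadThroughCache(backend, cache)
first = fs.read("data.bin")     # miss: copied from the backend
second = fs.read("data.bin")    # hit: served from the local cache
print(fs.misses, fs.hits)
```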
Acknowledgment
This work is partially supported by Multidisciplinary Cooperative Research Program in CCS, University of Tsukuba, New Energy and Industrial
Technology Development Organization (NEDO), and Fujitsu Laboratories Ltd.
Gfarm/BB: Gfarm File System for Node-local Burst Buffer
Accelerating Python Applications with Persistent Memory
Features include
•Open source
•Exploit local storage and data locality for scalable I/O performance
•InfiniBand support
•Data integrity checks to detect silent data corruption
•Production systems: 8PB JLDG, 100PB HPCI Storage, etc.
gfarmbb -h hostfile -m mount_point start
…
gfarmbb -h hostfile stop
Fig. 1: IOR file-per-process read/write performance on Cygnus supercomputer
Fig. 3: mdtest performance comparison of IO-500 10 node challenge scores
Fig. 4: Automation of construction/destruction of a swarm cluster
Fig. 2a: Memory-storage hierarchy
with persistent memory
Fig. 2b: Applications can directly access
data structures resident in persistent memory
without using buffers.
Acceleration of Deep Learning using PyTorch
with persistent memory
Persistent memory offers greater capacity than DRAM and significantly
better performance than storage. We use it for deep learning with
PyTorch. Usually, before deep learning is performed on the GPU, the
training data is copied from storage to main memory. We
exploit persistent memory to improve the performance.
11. Scalable Graph Analysis over Intel Xeon Phi Coprocessors
The structural graph clustering method SCAN is used successfully in many applications because it not only detects densely
connected nodes as clusters but also extracts sparsely connected nodes as hubs or outliers (Fig. 1). However, it is difficult to
apply SCAN to large-scale graphs because SCAN needs to evaluate the density for all adjacent nodes in the graph. To
address this problem, we present a novel algorithm, SCAN-XP, that runs on Intel Xeon Phi
coprocessors. We designed SCAN-XP to make the best use of the many cores in the Intel Xeon Phi by employing the following
approaches. First, SCAN-XP avoids the bottlenecks that arise in parallel graph computations by balancing the load
well among the cores. Second, SCAN-XP effectively exploits the 512-bit SIMD instructions implemented in each core to
speed up the density evaluations. As a result, SCAN-XP runs approximately 100 times faster than SCAN; for graphs with
100 million edges, SCAN-XP finishes in a few seconds (Fig. 2).
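The density evaluation that dominates SCAN's cost can be sketched as follows. The code uses the structural similarity and core-node definitions from the original SCAN algorithm, with thresholds ε and μ; it is a serial illustration on a made-up graph and does not reproduce the load balancing or SIMD techniques of SCAN-XP.

```python
from math import sqrt

def structural_similarity(adj, u, v):
    """sigma(u, v) = |N[u] & N[v]| / sqrt(|N[u]| * |N[v]|), closed neighbourhoods."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / sqrt(len(nu) * len(nv))

def core_nodes(adj, eps, mu):
    """Nodes whose closed neighbourhood has at least mu members with sigma >= eps."""
    cores = set()
    for u in adj:
        # u itself always counts, since sigma(u, u) = 1.
        similar = 1 + sum(1 for v in adj[u]
                          if structural_similarity(adj, u, v) >= eps)
        if similar >= mu:
            cores.add(u)
    return cores

# Illustrative graph: two triangles connected through node 3.
adj = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2, 4},
    4: {3, 5, 6},
    5: {4, 6},
    6: {4, 5},
}
cores = core_nodes(adj, eps=0.7, mu=3)
print(sorted(cores))
```

Node 3 is not a core here because it is only weakly similar to both of its neighbours; it bridges the two clusters, which is the hub role described above.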
Fig. 2: Overall performances
Fig. 1: Structural Graph Clustering SCAN
Table 1: Real-world Dataset
Noise-robust sleep stage scoring for mice using deep learning & big data
Database Group
Sleep stage scoring for mice is one of the most basic analyses in sleep research; however, this analysis is time-consuming
and requires considerable expertise and effort. Although several studies have proposed automated scoring methods, they
are not robust enough against noise in biological signals for research use. To develop a noise-robust scoring
method, we employ the following approaches.
1) Employing convolutional neural networks (CNN) and long short-term memory (LSTM), which can capture the features of
both the biological signals and the noise in them.
2) Training the model using noisy biological signals obtained from over 3000 mice.
Thanks to these improvements, the proposed method achieved a scoring accuracy of more than 95% for noisy biological
signals. This result indicates that our method is practical enough for use in sleep research.
contact address: {kitagawa, amagasa, shiokawa, horie}@cs.tsukuba.ac.jp
① Measure biological signals (EEG & EMG) from mice
② Split the signals into 20-sec. epochs (subsequences)
③ Assign sleep stages (W, NR, and R) to epochs
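Step ② of the procedure can be sketched as below. The 20-second epoch length is from the text; the sampling rate, the function name, and the choice to drop a trailing partial epoch are assumptions made for illustration.

```python
def split_into_epochs(signal, sampling_rate_hz, epoch_sec=20):
    """Split a 1-D signal into consecutive, non-overlapping epochs.
    A trailing partial epoch is dropped (an assumption of this sketch)."""
    samples_per_epoch = int(sampling_rate_hz * epoch_sec)
    n_epochs = len(signal) // samples_per_epoch
    return [signal[i * samples_per_epoch:(i + 1) * samples_per_epoch]
            for i in range(n_epochs)]

sampling_rate_hz = 250                      # hypothetical EEG sampling rate
eeg = [0.0] * (sampling_rate_hz * 65)       # 65 seconds of placeholder samples
epochs = split_into_epochs(eeg, sampling_rate_hz)
print(len(epochs), len(epochs[0]))
```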
[Fig. 4 labels: Inputs (EEG, EMG); Feature extraction (CNN with wide filters x2, CNN with narrow filters); Scoring model (LSTM, Dense, Softmax); output Stage {W, NR, R}]
Stage   Peak Freq. of EEG   Amplitude of EMG
W       7-11 Hz             Large
NR      1-6 Hz              Small
R       7-11 Hz             Smallest
Fig. 4: Structure of the proposed system
Fig. 3: Procedure of sleep stage scoring
Table 2: Feature of each stage
12. University of Tsukuba | Center for Computational Sciences
Computational Media Group
We are researching navigation for visually impaired people. We
propose a new interface that uses sound and vibration to
support the turn-by-turn navigation that is common among visually
impaired people. In our proposed interface, the target path is divided
into straight segments and points where the direction changes. The
navigation instructions given by sound and vibration are
carefully designed to give minimal yet sufficient cues to
visually impaired pedestrians. We have implemented a
preliminary system based on our proposal and conducted a
user experiment with visually impaired people. The results
suggest that our proposed approach is useful for navigating visually
impaired people.
Accurate Overlapping Method of Time-Lapse Images for World Heritage Site Investigation
A method is proposed to accurately
overlap multiple high-quality images with
different shooting positions and intervals
by combining corresponding point
information between images and 3D
shape information. In the proposed
method, the correct feature matching of
images obtained by rendering the 3D
model of the subject is used. In this
research, the subjects were the pillars of
the Angkor Thom Bayon Temple and the
epilithic microorganisms adhering to and
eroding their surfaces. Synthetic
transformation of a homography utilizing
the correct matches is employed to
overlap the target images.
contact address: pr@ccs.tsukuba.ac.jp
We proposes a method to improve the
quality of omnidirectional free-viewpoint
images using generative adversarial
networks (GAN). By estimating the 3D
information of the capturing space while
integrating the omnidirectional images
taken from multiple viewpoints, it is
possible to generate an arbitrary
omnidirectional appearance. However,
the image quality of free-viewpoint
images deteriorates due to artifacts
caused by 3D estimation errors and
occlusion. We solve this problem by using
GAN and, moreover, by focusing on
projective geometry during training, we
further improve image quality by
converting the omnidirectional image into
perspective-projection images.
Information Display Design on Turn-By-Turn Navigation for Visually Impaired People
Image-quality Improvement of Omnidirectional: Free-Viewpoint Images by GAN
(a): OFV image (no image-quality improvement).
(d): Correct image (captured image).
(b): Proposed method using learning by image
division (with image-quality improvement).
(c): Proposed method using learning with
omnidirectional images (with image-quality
improvement).
Location Estimation by CV
Reference Query
Orientation measurement by IMU
Goal
LR-correction Orientation
Signal to turn Mode change
Voice
Announce
Field test
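For the time-lapse overlapping method described in this group's panel, the role of a homography can be illustrated with a small Python sketch. It is not the authors' implementation: it assumes that the "synthetic transformation" refers to composing transformations, the matrices are made-up values, and `compose` and `apply_homography` are hypothetical helper names. It shows how two 3x3 homographies are combined into one and applied to a pixel coordinate.

```python
def compose(h2, h1):
    """Return the 3x3 matrix product h2 @ h1 (apply h1 first, then h2)."""
    return [
        [sum(h2[r][k] * h1[k][c] for k in range(3)) for c in range(3)]
        for r in range(3)
    ]


def apply_homography(h, x, y):
    """Map pixel (x, y) through homography h using homogeneous coordinates."""
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return xh / w, yh / w


# Made-up example: a translation by (10, 5) followed by a uniform 2x scaling.
translate = [[1, 0, 10], [0, 1, 5], [0, 0, 1]]
scale = [[2, 0, 0], [0, 2, 0], [0, 0, 1]]
combined = compose(scale, translate)
print(apply_homography(combined, 1, 1))  # (22.0, 12.0)
```

In practice the homography would be estimated from the correct feature matches rather than written by hand.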
13. Multi-Hybrid Accelerated Computing Platform
• Combining the strengths of different types of accelerators: GPU + FPGA
• GPU is still an essential accelerator for simple computation with a large degree of parallelism, providing ~10 TFLOPS peak performance
• FPGA is a new type of accelerator for application-specific hardware with programmability, sped up by pipelining the calculation
• FPGA is good for external communication between FPGAs with advanced high-speed interconnection up to 100Gbps x4 chan.
Supercomputer at CCS: Cygnus
OpenCL-ready High Speed FPGA Networking [1]
[Figure: Cygnus network overview. Deneb nodes and Albireo nodes (comp. nodes) are connected by the ordinary inter-node network (CPU, GPU) by IB EDR (100Gbps x4/node), with 4 ports x full bisection b/w. It is the ordinary inter-node communication channel for CPU and GPU, but they can also request it to FPGA. Albireo nodes are additionally connected by the inter-FPGA direct network.]
• Our new supercomputer “Cygnus”
• Operation started in May 2019
• Per node: 2x Intel Xeon CPUs, 4x NVIDIA V100 GPUs, and (Albireo nodes only) 2x Intel Stratix10 FPGAs
• Deneb: 49 CPU+GPU nodes
• Albireo: 32 CPU+GPU+FPGA nodes with a 2D-torus dedicated network for FPGAs (100Gbps x4)
Specification of Cygnus
Target GPU: NVIDIA Tesla V100
Target FPGA: Nallatech 520N
Item                     Specification
Peak performance         2.4 PFLOPS DP (GPU: 2.2 PFLOPS, CPU: 0.2 PFLOPS, FPGA: 0.6 PFLOPS SP), enhanced by mixed precision and variable precision on FPGA
# of nodes               81 (32 Albireo (GPU+FPGA) nodes, 49 Deneb (GPU-only) nodes)
Memory                   192 GiB DDR4-2666/node = 256GB/s, 32GiB x 4 for GPU/node = 3.6TB/s
CPU / node               Intel Xeon Gold (SKL) x2 sockets
GPU / node               NVIDIA V100 x4 (PCIe)
FPGA / node              Intel Stratix10 x2 (each with 100Gbps x4 links/FPGA and x8 links/node)
Global File System       Lustre, RAID6, 2.5 PB
Interconnection Network  Mellanox InfiniBand HDR100 x4 (two cables of HDR200 / node), 4 TB/s aggregated bandwidth
Programming Language     CPU: C, C++, Fortran, OpenMP; GPU: OpenACC, CUDA; FPGA: OpenCL, Verilog HDL
System Vendor            NEC
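The bandwidth figures in the specification can be cross-checked with a short Python calculation. This is only a sanity check under stated assumptions: 6 DDR4 channels per Xeon socket and 900 GB/s of HBM2 bandwidth per V100 are not given in the table and are assumed here.

```python
# Host memory: 2 sockets x 6 DDR4-2666 channels, 8 bytes per transfer.
# The 6-channel figure is an assumption about the Xeon Gold (SKL) sockets.
ddr4_gb_s = 2 * 6 * 2666e6 * 8 / 1e9

# GPU memory: 4 V100 GPUs, assuming 900 GB/s of HBM2 bandwidth each.
hbm2_gb_s = 4 * 900

# Interconnect: 81 nodes x 4 links x 100 Gbps, converted from bits to bytes.
aggregate_tb_s = 81 * 4 * 100 / 8 / 1000

print(round(ddr4_gb_s), hbm2_gb_s, round(aggregate_tb_s, 2))  # 256 3600 4.05
```

Under these assumptions the results agree with the 256GB/s, 3.6TB/s, and roughly 4 TB/s values listed in the table.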
• FPGA design plan
• Router
- For the dedicated network, this impl. is mandatory.
- Forwards packets to their destinations
• User Logic
- OpenCL kernel runs here.
- Inter-FPGA comm. can be controlled from the OpenCL kernel.
• SL3
- SerialLite III: Intel FPGA IP
- Includes transceiver modules for inter-FPGA data transfer.
- Users do not need to be aware of it
[Figure: SINGLE NODE (with FPGA). Per CPU: PCIe network (switch), 2 GPUs, 1 FPGA, 2 HCAs; each FPGA connects to the inter-FPGA direct network (100Gbps x4); the HCAs connect to the network switch (100Gbps x2).]
[Figure: SINGLE NODE (without FPGA). Per CPU: PCIe network (switch), 2 GPUs, 2 HCAs; the HCAs connect to the network switch (100Gbps x2).]
[Figure: Inter-FPGA direct network (only for Albireo nodes). 64 FPGAs on Albireo nodes are connected directly in a 2D-torus configuration without Ethernet switches. Marked ports: QSFP28 ports.]
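The 2D-torus wiring of the inter-FPGA direct network can be pictured with a short Python sketch. The 8 x 8 arrangement of the 64 FPGAs is an assumption (the slide only states "2D-Torus"), and `torus_neighbors` is a hypothetical helper; the direction names follow the x/y, neg/pos naming used by the channel identifiers in the OpenCL fragment.

```python
def torus_neighbors(x, y, width=8, height=8):
    """Return the four neighbours of FPGA (x, y) on a width x height 2D torus."""
    return {
        "x_neg": ((x - 1) % width, y),
        "x_pos": ((x + 1) % width, y),
        "y_neg": (x, (y - 1) % height),
        "y_pos": (x, (y + 1) % height),
    }


# Corner FPGA: the links wrap around to the opposite edges.
print(torus_neighbors(0, 0))
# {'x_neg': (7, 0), 'x_pos': (1, 0), 'y_neg': (0, 7), 'y_pos': (0, 1)}

# Every FPGA uses 4 links and each link is shared by two FPGAs.
total_links = 8 * 8 * 4 // 2
print(total_links)  # 128
```

Four links per FPGA is consistent with the 100Gbps x4 links per FPGA listed in the specification.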
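IPSJ SIG Technical Report
// Receive one 16-element vector from the input channel selected by in_port.
uint16 val = (uint16)(0);
if (in_port == 1) {
    val = read_channel_intel(fwd_x_neg_in);
} else if (in_port == 2) {
    val = read_channel_intel(fwd_x_pos_in);
} else if (in_port == 3) {
    val = read_channel_intel(fwd_y_neg_in);
} else if (in_port == 4) {
    val = read_channel_intel(fwd_y_pos_in);
}
// Add the local values to the received data.
val += (uint16)(
    v + 0, v + 1, v + 2, v + 3,
    v + 4, v + 5, v + 6, v + 7,
    v + 8, v + 9, v + 10, v + 11,
    v + 12, v + 13, v + 14, v + 15
);
// Write the result to the output channel selected by out_port.
// clocktime() is the authors' own time-measurement function.
ulong t_tmp = 0;
if (out_port == 1) {
    write_channel_intel(fwd_x_neg_out, clocktime(val, &t_tmp));
} else if (out_port == 2) {
    write_channel_intel(fwd_x_pos_out, clocktime(val, &t_tmp));
} else if (out_port == 3) {
    write_channel_intel(fwd_y_neg_out, clocktime(val, &t_tmp));
} else if (out_port == 4) {
    write_channel_intel(fwd_y_pos_out, clocktime(val, &t_tmp));
} else if (out_port == 5) {
    write_channel_intel(internal, clocktime(val, &t_tmp));
}
Fig. 13: Part of the OpenCL code of the toy program.
… kernel consists of two types of kernels. One is the kernel that performs the outbound data transfer, and the other is
the kernel that performs the return data transfer. Communication proceeds in a bucket-brigade manner, and once the whole
computation is complete, the result is returned to all nodes. The summation is performed on the outbound path, and the
result is broadcast on the return path.
Fig. 13 shows part of the code of the toy program. This code receives an input, adds a value to it, and outputs the
result; it is part of the kernel shown in gray in Fig. 12. read_channel_intel and write_channel_intel are built-in
functions that read from and write to a Channel, respectively, and clocktime is our own function for measuring time. The
if statements make it possible to switch the input and output Channels. This is necessary because CoE does not yet have a
routing function, so the external link on the FPGA board that is used for communication must be specified explicitly.
Fig. 14: Measurement results of the toy program.
Table 3: Protocol overhead
Element              Payload rate    Efficiency
Physical layer rate  103.125 Gbps
67b/64b              98.484 Gbps     ×0.955
Meta Frame           98.287 Gbps     ×0.998
SL3 Burst            96.813 Gbps     ×0.985
CoE Header           90.762 Gbps     ×0.938
Fig. 14 shows the results of the performance evaluation. A minimum latency of 2014 ns and a maximum throughput of
181.4 Gbps were obtained. Three nodes, ppx2-02, ppx2-03, and ppx2-05, were used in this experiment. As shown in Fig. 12,
ppx2-05 is the starting node and the data turns back at ppx2-03. The measured throughput was computed from the time
between the start of data transmission at the starting node and the end of data reception at the starting node. The data
size on the horizontal axis is the size of the data held by each node and corresponds to the count argument of
MPI_Allreduce. Compared with the 90.7 Gbps result of the pingpong benchmark, about twice the performance (181.4 Gbps) was
obtained. This is because communication and computation are pipelined, so sending and receiving take place at the same
time.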
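The bucket-brigade scheme of the toy program (summation on the outbound path, broadcast on the return path) can be mimicked on a host with a few lines of Python. This is only a functional sketch under assumptions: `chain_allreduce` is a hypothetical name, the nodes are modelled as list entries ordered from the starting node to the turnaround node, and no timing or pipelining behaviour is modelled.

```python
def chain_allreduce(node_values):
    """Sum equal-length vectors held by nodes arranged in a chain.

    node_values[0] belongs to the starting node and node_values[-1] to the
    node where the data turns back.
    """
    count = len(node_values[0])

    # Outbound path: each node adds its own data to the partial sum it
    # received and forwards the result to the next node.
    partial = [0] * count
    for values in node_values:
        partial = [p + v for p, v in zip(partial, values)]

    # Return path: the complete sum is relayed back node by node, so every
    # node ends up with the same result.
    results = [None] * len(node_values)
    for node in reversed(range(len(node_values))):
        results[node] = list(partial)
    return results


# Three nodes, as in the experiment; the values are made up.
print(chain_allreduce([[1, 2], [10, 20], [100, 200]]))
# [[111, 222], [111, 222], [111, 222]]
```

The length of each inner list plays the role of the count argument of MPI_Allreduce.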
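6. Discussion
6.1 pingpong benchmark
The maximum throughput obtained with the pingpong benchmark is 90.7 Gbps, which is only about 90% of the 100 Gbps used
in the physical layer. However, this performance is as intended by the design. Table 3 shows the theoretical communication
performance. In the evaluation environment, the physical-layer rate is 103.125 Gbps (4 × 25.78125 Gbps), the same rate as
the physical layer of 100Gb Ethernet. Table 3 shows how much protocol overhead there is relative to this physical-layer
rate. Of these, 67b/64b, Meta Frame, and SL3 Burst are overheads that originate from SerialLite III and were computed with
the formulas given in the official documentation [11]. CoE Header is the overhead caused by the header that CoE adds. A CoE
packet consists of 64 bytes, in which a 4-byte header …
© 2019 Information Processing Society of Japan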
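The chain of efficiencies in the protocol overhead table (Table 3) can be reproduced with a short Python calculation. This is a sketch under one assumption: the 4-byte CoE header is taken to be part of the 64-byte packet, which gives 60/64 = 0.9375 and matches the ×0.938 listed in the table. The other factors are copied from the table as printed.

```python
PHYSICAL_GBPS = 4 * 25.78125  # 103.125 Gbps

# Efficiency factors as listed in Table 3; the last one assumes that the
# 4-byte header is part of the 64-byte packet (60 / 64 = 0.9375).
factors = [
    ("67b/64b", 0.955),
    ("Meta Frame", 0.998),
    ("SL3 Burst", 0.985),
    ("CoE Header", 60 / 64),
]

rates = {}
rate = PHYSICAL_GBPS
for name, efficiency in factors:
    rate *= efficiency
    rates[name] = rate
    print(f"{name:<11} {rate:7.3f} Gbps")
```

The final value, about 90.762 Gbps, is the theoretical payload rate that the measured 90.7 Gbps of the pingpong benchmark approaches.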
Cluster System with FPGAs
• FPGA-based parallel comp. with OpenCL
- Needs a communication system suitable for OpenCL and Intel FPGAs
- Uses the Intel FPGA SDK for OpenCL
• Communication Integrated Reconfigurable CompUting System (CIRCUS)
- CIRCUS enables OpenCL code to communicate with other FPGAs on different nodes
- Extends Intel’s channel mechanism to external communications
- Pipeline manner: sending/receiving data from/to the compute pipeline directly
• Channel Extension: transferring data between kernels directly (low latency and high bandwidth)
• We can use a multiple-kernel design to exploit spatial parallelism in an FPGA
• I/O Channel
- connects OpenCL with peripherals
- We used this feature
[Figure: Comm. w/o channels: Source Kernel → Write → Global Memory (DDR4, off chip) → Read → Destination Kernel. Comm. w/ channels: Source Kernel → FIFO Channel → Destination Kernel.]
[Figure: CIRCUS backends on the FPGA: OpenCL Kernel, IO Channel, Serial Link (x4), 40G Eth., BSP, Network Controller, PCIe, OpenCL API, Interconnect]
Our proposed method
// The channels simple_out and simple_in are assumed to be declared elsewhere as I/O channels.
#pragma OPENCL EXTENSION cl_intel_channels : enable

// sender code on FPGA1: stream n floats from global memory into the outgoing channel
__kernel void sender(__global float* restrict x, int n) {
    for (int i = 0; i < n; i++) {
        float v = x[i];
        write_channel_intel(simple_out, v);
    }
}

// receiver code on FPGA2: read n floats from the incoming channel into global memory
__kernel void receiver(__global float* restrict x, int n) {
    for (int i = 0; i < n; i++) {
        float v = read_channel_intel(simple_in);
        x[i] = v;
    }
}
Pipelined communication experiment
[Figure: Recv. → Comp. → Send pipeline; A, B: start and end points of the timed interval; 90.7Gbps↑]
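The blocking FIFO behaviour of a channel between a sender and a receiver kernel can be imitated on a host with Python threads. This is only an analogy, not FPGA code: `queue.Queue` stands in for the channel, the two threads stand in for concurrently running kernels, and the FIFO depth of 4 is an arbitrary choice.

```python
import queue
import threading


def sender(x, channel):
    """Write every element of x to the channel, like write_channel_intel."""
    for v in x:
        channel.put(v)  # blocks while the FIFO is full


def receiver(n, channel, out):
    """Read n elements from the channel, like read_channel_intel."""
    for _ in range(n):
        out.append(channel.get())  # blocks while the FIFO is empty


data = [float(i) for i in range(16)]
received = []
fifo = queue.Queue(maxsize=4)  # FIFO depth of 4 is an arbitrary choice

t_send = threading.Thread(target=sender, args=(data, fifo))
t_recv = threading.Thread(target=receiver, args=(len(data), fifo, received))
t_send.start()
t_recv.start()
t_send.join()
t_recv.join()
print(received == data)  # True
```

As with channels, the data never passes through a shared array: the receiver consumes each value as soon as the sender has produced it, in order.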
Authentic Radiation Transfer [2]
• Accelerated Radiative transfer on grids Oct-Tree (ARGOT) has been developed at the Center for Computational Sciences, University of Tsukuba
• ART is one of the algorithms used in ARGOT and is the dominant part (90% or more of the computation time) of the ARGOT program
• ART is a ray-tracing-based algorithm
• The problem space is divided into meshes, and reactions are computed on each mesh
• The memory access pattern depends on the ray direction
• Not suitable for SIMD architectures
[Figure: Performance [Mmesh/s] versus mesh size, from (16,16,16) to (128,128,128), for CPU(14C), CPU(28C), P100(x1), and FPGA; higher is better.]
Table 2: Resource usage and clock frequency
size             # of PEs   ALMs (%)      Registers (%)  M20…
(16, 16, 16)     (2, 2, 2)  132,283 31%   267,828 31%    7…
(32, 32, 32)     (2, 2, 2)  169,882 40%   344,447 40%    7…
(64, 64, 64)     (2, 2, 2)  169,549 40%   344,512 40%    7…
(128, 128, 128)  (2, 2, 2)  169,662 40%   344,505 40%    7…
Table 3: Performance comparison between FPGA, CPU and GPU implementations. The unit is M mesh/sec.
Size           CPU(14C)  CPU(28C)  P100    FPGA
(16,16,16)     112.4     77.2      105.3   1282.8
(32,32,32)     158.9     183.4     490.4   1165.2
(64,64,64)     175.0     227.2     1041.4  1111.0
(128,128,128)  95.4      165.0     1116.1  1133.5
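The speedups implied by the performance comparison table can be computed directly. The Python snippet below only divides the numbers printed in the table; the interpretation that the "4.9x faster" annotation on the slide refers to the (64,64,64) FPGA versus CPU(28C) ratio is an inference from these ratios, not something the source states.

```python
# Performance in M mesh/s, copied from the performance comparison table.
performance = {
    (16, 16, 16): {"CPU(14C)": 112.4, "CPU(28C)": 77.2, "P100": 105.3, "FPGA": 1282.8},
    (32, 32, 32): {"CPU(14C)": 158.9, "CPU(28C)": 183.4, "P100": 490.4, "FPGA": 1165.2},
    (64, 64, 64): {"CPU(14C)": 175.0, "CPU(28C)": 227.2, "P100": 1041.4, "FPGA": 1111.0},
    (128, 128, 128): {"CPU(14C)": 95.4, "CPU(28C)": 165.0, "P100": 1116.1, "FPGA": 1133.5},
}

speedups = {}
for size, row in performance.items():
    vs_cpu = row["FPGA"] / row["CPU(28C)"]
    vs_gpu = row["FPGA"] / row["P100"]
    speedups[size] = (vs_cpu, vs_gpu)
    print(size, f"FPGA/CPU(28C) = {vs_cpu:.1f}x", f"FPGA/P100 = {vs_gpu:.2f}x")
```

For (64,64,64) the FPGA is about 4.9 times faster than CPU(28C) and within about 7% of the P100, while for the smallest mesh the FPGA leads both by more than a factor of ten.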
per link) multiple interconnection links (up to 4 channels) on
it. Additionally, an HLS programming environment such as OpenCL
is provided, and there are several types of research
that apply them to FPGA computing. In [3], Kobayashi et
al. show the basic features needed to use the high-speed interconnection
over FPGAs driven by OpenCL kernels. Therefore,
although the performance of our implementation is almost
the same as that of an NVIDIA P100 GPU, the overall performance with …
Fig. 5: Design Outline of ART on FPGA. [Components: DDR4 Memory, Memory Reader, Buffer, PE Array (2x2x2), Buffer, Memory Writer; connections: Channel, Memory Network]
each other. Each kernel computes the reaction between a mesh and a ray
in its own computation space, which is dedicated to that kernel. During
computation, a ray traverses multiple compute kernels depending
on its location. If a ray leaves a kernel’s space, its data is
transferred to a neighbor kernel through a channel.
Figure 5 shows the design outline of our implementation. “Memory Reader”
reads mesh data from DDR4 memory, which is seen as global memory
from the OpenCL language. “Memory Writer” is the counterpart to the reader
and updates mesh data with the result of the computation. It performs both read
and write memory accesses because it computes the integration of the gas reaction.
“Buffer” is a mesh data buffer that improves memory access performance.
“PE Array” is an array of PEs (Processing Elements). A PE computes the
kernel of the ART method. The array consists of multiple kernels. We show
the details of the PE network in the next subsection.
Since our implementation is a work in progress, it lacks some features of
the CPU implementation. During computation on an FPGA, all mesh data
must be placed in its internal BRAM (Block Random Access Memory).
The FPGA implementation does not support replacing mesh data as the
computation progresses. Therefore, the problem size that
an FPGA can solve is limited by the size of BRAM. The CPU implementation
supports inter-node parallelization using MPI (Message Passing
Interface), but the FPGA implementation does not support any networking
functionality and uses only one FPGA.
4.2 Parallelization using Channel in an FPGA
We describe the structure of the “PE Array” shown in Figure 5. A PE Array
consists of PEs and BEs (Boundary Elements) as shown in Figure 6.
• Our implementation uses a channel-based approach
• One of the extensions to OpenCL for FPGAs by Intel
• It enables much faster inter-kernel communication
• No external memory (DDR) access is required
• Lower resource utilization than DDR access
[Figure: without channels: Source Kernel → Write → Global Memory (DDR4, off chip) → Read → Destination Kernel; with channels: Source Kernel → FIFO Channel → Destination Kernel]
[Figure: a (16x16x16) mesh divided into (8x8x8) blocks]
• The problem space is divided into small blocks
• e.g. (16, 16, 16) → 8 blocks of (8, 8, 8)
• A PE is assigned to each small block
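The block decomposition can be illustrated with a small Python sketch that maps a global mesh coordinate to the PE that owns it. The function name `locate` and the row-by-row index arithmetic are illustrative assumptions; the slide only states that the (16, 16, 16) space is split into 8 blocks of (8, 8, 8) with one PE per block.

```python
def locate(mesh_xyz, mesh_size=(16, 16, 16), pe_grid=(2, 2, 2)):
    """Return (PE index, local coordinate) for a global mesh coordinate."""
    block = tuple(m // p for m, p in zip(mesh_size, pe_grid))  # (8, 8, 8)
    pe = tuple(c // b for c, b in zip(mesh_xyz, block))
    local = tuple(c % b for c, b in zip(mesh_xyz, block))
    return pe, local


print(locate((7, 0, 0)))  # ((0, 0, 0), (7, 0, 0))
print(locate((8, 0, 0)))  # ((1, 0, 0), (0, 0, 0))
```

A ray stepping from x = 7 to x = 8 changes its owning PE from (0, 0, 0) to (1, 0, 0), which is the point at which its data would be handed to the neighboring PE.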
[Figure: PEs and BEs laid out in the x-y plane and connected by channels (96bit x2: read, write) that carry ray data]
• PEs are connected to each other by channels
• PE: Processing Element
• BE: Boundary Element
• Kernels of PEs and BEs are started automatically by the autorun attribute
• Lower control overhead and resource usage because the number of host-controlled kernels decreases
4.9x faster
almost equal performance
Reference
[1] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, and Taisuke Boku, Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming, 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp.479-488, May 2019
[2] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Yuuma Oobata, Taisuke Boku, Makito Abe, Kohji Yoshikawa, and Masayuki Umemura, Accelerating Space Radiative Transfer on FPGA using OpenCL (Accepted), International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2018)
Acknowledgment
This research is part of the project titled “Development of Computing-Communication Unified Supercomputer in Next Generation” under the program of “Research and Development for Next-Generation Supercomputing Technology” by MEXT. We thank the Intel University Program for providing us with both hardware and software.
14. Overview
JCAHPC (Joint Center for Advanced HPC), a cooperative organization of the University of Tokyo and the University of
Tsukuba for the joint procurement and operation of the largest-scale supercomputer in Japan, introduced a new
supercomputer system, “Oakforest-PACS”, with 25 PFLOPS peak performance and started its operation on December 1st, 2016.
The Oakforest-PACS system was ranked #6 in the TOP500 List of November 2016 with 13.55 PFLOPS of Linpack performance,
and was also recognized as Japan's fastest supercomputer. The system is installed at the Kashiwa Research Complex II
building on the Kashiwa-no-Ha campus of the University of Tokyo.
The Oakforest-PACS system has 8,208 compute nodes, each of which uses the latest version of the Intel Xeon Phi processor
(code name: Knights Landing), with Intel Omni-Path Architecture as the high-performance interconnect. The Oakforest-PACS
system is the largest cluster solution with the Knights Landing processor as well as the largest configuration with
Omni-Path Architecture in the world. The system is integrated by Fujitsu Co. Ltd, and its PRIMERGY server is employed as
each compute node. Additionally, the system employs the Lustre shared file system (capacity: 26 PB) and IME (fast file
cache system, 940 TB), both of which are provided by DataDirect Networks (DDN).
All the compute nodes and servers, including login nodes, Lustre servers, and IME servers, are connected by a
full-bisection-bandwidth Fat-Tree interconnection network with Intel Omni-Path Architecture to provide highly flexible
job allocation over the nodes and high-performance file access.
Research & Education
The Oakforest-PACS is offered to researchers in Japan and their international collaborators through various types of
programs operated by HPCI under MEXT, and through the two universities' own supercomputer resource sharing programs.
It is expected to contribute to the dramatic development of new frontiers in various fields of study. The Oakforest-PACS
will also be used for the education and training of students and young researchers. We will continue to make further
social contributions through the operation of the Oakforest-PACS.
System Configuration
Compute Nodes: 25 PFlops, ×8,208
  CPU: Intel Xeon Phi 7250 (KNL 68 core, 1.4 GHz)
  Mem: 16 GB (MCDRAM, 490 GB/sec, effective) + 96 GB (DDR4-2400, 115.2 GB/sec)
  Fujitsu PRIMERGY CX1640 M1 x 8 node inside CX600 M1 (2U)
Interconnect: Omni-Path Architecture (100 Gbps), Full-bisection BW Fat-tree
  12 of 768 port Director Switch (Source by Intel)
  362 of 48 port Edge Switch (Uplink: 24, Downlink: 24)
Parallel File System: 26.2 PB, Lustre Filesystem, DDN ES14KX x10, 500 GB/s
File Cache System: 940TB, DDN IME14KX x25, 1560 GB/s
Login Node x20
Users: U. Tsukuba users, U. Tokyo users
Total peak performance          25 PFLOPS
Total number of compute nodes   8,208
Power consumption               4.2 MW (including cooling)
# of racks                      102
Cooling system (Compute Node)   Type: Warm-water cooling; direct cooling (CPU), rear door cooling (except CPU)
                                Facility: Cooling tower & Chiller
Cooling system (Others)         Type: Air cooling
                                Facility: PAC
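Several of the Oakforest-PACS figures can be cross-checked with a short Python calculation. This is a sanity check under stated assumptions: 32 double-precision floating-point operations per cycle per Knights Landing core and 6 DDR4-2400 channels per node are not given in the source and are assumed here.

```python
NODES = 8208
CORES = 68
CLOCK_HZ = 1.4e9
FLOP_PER_CYCLE = 32  # assumption: 2 AVX-512 FMA units x 8 doubles x 2 ops per core

peak_pflops = NODES * CORES * CLOCK_HZ * FLOP_PER_CYCLE / 1e15
ddr4_gb_s = 6 * 2400e6 * 8 / 1e9  # assumption: 6 DDR4-2400 channels per node

edge_downlinks = 362 * 24   # edge switch ports facing nodes and servers
edge_uplinks = 362 * 24     # edge switch ports facing the director switches
director_ports = 12 * 768

print(round(peak_pflops, 1), ddr4_gb_s)              # 25.0 115.2
print(edge_downlinks, edge_uplinks, director_ports)  # 8688 8688 9216
```

Under these assumptions the peak matches the quoted 25 PFLOPS, the DDR4 bandwidth matches 115.2 GB/sec, and the equal numbers of uplinks and downlinks per edge switch are consistent with a full-bisection fat-tree whose uplinks fit within the director switch ports.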
Joint Center for Advanced High Performance Computing
Joint Center for Advanced HPC | http://jcahpc.jp/
TOP 500 #6 (#1 in Japan), HPCG #3 (#2), Green 500 #6 (#2) @Nov. 2016
IO 500 #1 @Nov. 2017, Jun. 2018
IO-500 BW #1 @Jun. 2019