https://www.ccs.tsukuba.ac.jp/
University of Tsukuba | Center for Computational Sciences
Mission of CCS
The CCS promotes "multidisciplinary computational science" based on the fusion
of computational science and computer science. For this purpose, the CCS
develops high-performance computing systems through "co-design". The scientific
research areas cover particle physics, astrophysics, nuclear physics, nano-science, life
science, environmental science, and information science.
The CCS was reorganized in April 2004 from its predecessor, the Center for
Computational Physics, which was established in 1992. The CCS is a research institute for the
above-mentioned fields and also a joint-use facility for outside researchers.
Since 2010, the CCS has been approved as a national core center, the Advanced
Interdisciplinary Computational Science Collaboration Initiative (AISCI), by the Ministry of
Education, Culture, Sports, Science and Technology (MEXT). The CCS aims to play a
significant role in the development of multidisciplinary computational science.
Chronology and Major Events
1992: Foundation of the Center for Computational Physics (CCP)
1996: Completion of CP-PACS, a 0.6 TFLOPS MPP ranked No. 1 on the Top 500 in Nov. 1996
2002: Completion of HMCS (Heterogeneous Multi-Computer System), an 8.6 TFLOPS coupled CP-PACS/GRAPE-6 system
2004: Reorganization and expansion of CCP, renamed Center for Computational Sciences (CCS)
2006: Two major new computing facilities start operation:
  PACS-CS, a general-purpose 14.3 TFLOPS MPP cluster for computational sciences
  FIRST, an HMCS-E for astrophysical simulations (general-purpose 3.5 TFLOPS + gravity 35 TFLOPS)
2008: Completion of T2K-Tsukuba system, a 95.4 TFLOPS cluster ranked No. 20 on the Top 500 in Jun. 2008
2012: HA-PACS Base Cluster is delivered with 802 TFLOPS of peak performance, ranked No. 41 on the Top 500 in Jun. 2012
2013: HA-PACS/TCA is added to the HA-PACS system with 364 TFLOPS of peak performance in Oct. 2013, and the total peak performance of the HA-PACS system is expanded to over 1.1 PFLOPS
2013: Joint Center for Advanced HPC (JCAHPC) established in alliance with the University of Tokyo
2014: COMA (PACS IX) is delivered with 1.001 PFLOPS of peak performance, ranked No. 51 on the Top 500 in Jun. 2014
2016: Oakforest-PACS is installed and starts operation at JCAHPC
2019: Cygnus is installed and starts operation
[Photos of past supercomputers: CP-PACS, FIRST-Cluster, PACS-CS, T2K-Tsukuba, HA-PACS, COMA]
Current Supercomputers: Oakforest-PACS, Cygnus
2+1 flavor QCD at Physical Point on very large lattices (master-field simulations)
Exploring QCD phase diagram
Research in Particle Physics
contact address: pr@ccs.tsukuba.ac.jp
Investigating the phase structure of QCD at non-zero temperature and density is very
important for understanding the properties of strongly interacting matter under extreme
conditions. It is known that the order of the phase transition depends on the masses
and the number of flavors of quarks, and there should be so-called critical endlines,
lines of second-order phase transitions, in a certain space of quark masses, as shown in
Fig. 2a.
To determine the shape of the critical endline in the small quark mass region we are
carrying out lattice QCD simulations at finite temperature with 2+1 as well as 3
degenerate quark flavors on Cygnus and Oakforest-PACS. Fig. 2b shows our recent
estimation of the critical pion mass in 3 flavor QCD in the continuum limit including a
new calculation with the temporal lattice extent of 12, where the new result gives a
smaller upper bound than that of our previous calculation.
Fig. 1a:
Relative difference of the light hadron spectrum from
experiment. The only inputs are the pion, kaon, and omega baryon
masses, which determine the up-down and strange quark masses and
the lattice cutoff, respectively. Our results show good agreement
with experiment, although the errors are still not small for
some of the hadrons.
[K.-I. Ishikawa et al., https://arxiv.org/abs/1511.09222]
Fig. 1b:
A comparison of the pseudoscalar decay constants, fπ and fK, on
(10 fm)^4 and (5 fm)^4 volumes. We detect finite volume effects of 0.66% and 0.26%
on fπ and fK, respectively. The effect is very small and
negligible in comparison with the corresponding experiments. We
can now control and remove the finite volume effect completely by
using the master-field simulations.
[K.-I. Ishikawa et al., Phys. Rev. D 99, 014504]
Hadrons are the constituents of atomic nuclei. Computing the mass
spectrum of hadrons from the first principles of quantum
chromodynamics (QCD), the fundamental theory of the strong interaction
described by quarks and gluons, is a principal subject in particle
physics.
After quenched and subsequent 2 flavor QCD simulations on the CP-PACS,
those studies were extended to 2+1 flavor QCD by
incorporating the dynamical strange quark, though the degenerate
up-down quark mass was much heavier than the physical one. On the
PACS-CS and the T2K computers, we succeeded in reaching the
physical point. This calculation was followed by a larger volume
simulation on the K computer.
Our current project aims to control and remove the systematic errors
that arise from simulating on a finite volume with a finite lattice
spacing. We are performing so-called master-field simulations on a very
large (10 fm)^4 volume with several lattice spacings using the
Oakforest-PACS.
Fig. 2a:
Expected quark mass dependence of the
order of the QCD phase transition. Our goal
is to determine the shape of the critical
endline shown as a red curve in the lower-
left corner.
Fig. 2b:
Our recent estimation of the critical pion mass,
mπ,E, in 3 flavor QCD. The continuum extrapolation
including new data sets with the temporal extent
of 12 gives an upper bound mπ,E ≲ 110 MeV.
[Y. Kuramashi et al., Phys. Rev. D 101, 054509]
Vlasov-Poisson simulation of cosmic neutrinos in the large-scale structure of the universe
Theoretical galaxy formation: numerical simulations reveal the fate of stars and gas
Solving the Mysteries of the Universe with Computational Astrophysics
When a cluster of stars forms, only a part of the natal cloud is
converted into stars, and the rest is ionized and heated by the
powerful stellar radiation and ejected outward. Using
radiation-hydrodynamic simulations, we found that star
formation is primarily controlled by the formation of ionized
regions, as well as the surface density and dust content of the
natal cloud. We developed a new semi-analytic model that
captures this behaviour and can be incorporated in subgrid
recipes for large-scale cosmological simulations.
Fukushima, Yajima, et al. (2020), MNRAS, 497, 3830
contact address: ayw@ccs.tsukuba.ac.jp / pr@ccs.tsukuba.ac.jp
We devise a physical model to determine the formation,
distribution, and kinematics of molecular gas clouds in
galaxies, and predict the intensities of carbon monoxide (CO)
lines and the molecular hydrogen (H2) abundance, taking into
account the interstellar radiation field and dust attenuation.
We apply the model to data from the Illustris-TNG
cosmological simulations and compare the CO luminosities
and H2 masses with recent observations of galaxies at low
and high redshifts. The model successfully reproduces the
observed CO-luminosity function and the total H2 mass in
the local universe.
Inoue, S., Yoshida, N. & Yajima, H., (2020) accepted for publication in MNRAS
Fig. 2a: The structure of the five brightest galaxies in CO(1-0) in the simulation.
Fig. 2b: Density evolution in the formation of star clusters. White circles indicate
stars and the green contours bound ionization regions.
Neutrinos are elementary particles ubiquitous in the universe. The Super-Kamiokande experiment revealed that
neutrinos have mass, which implies that neutrinos can dynamically affect the formation of large-scale structure (LSS) in
the universe. We perform numerical simulations of LSS formation incorporating the effect of massive neutrinos by
directly solving the collisionless Boltzmann equation in 6D phase-space on two supercomputers, FUGAKU and Oakforest-
PACS. Our highly optimized simulation code achieves almost ideal weak and strong scaling on FUGAKU.
Yoshikawa, K., Tanaka, S., Yoshida, N. & Saito, S. (2020) accepted for publication in ApJ.
Fig. 1a: Simulated distributions of massive neutrinos (color scale) and dark matter
(contours) as well as dark matter halos (white circles) at a) redshift z = 0 (the present),
and b) redshift z = 1 (about 7.9 Gyr ago).
Fig. 1b: Strong scaling of VLASOV simulations on the supercomputer
FUGAKU. Run ID prefixes S, M, and L denote grid resolutions of
96³, 192³, and 384³, respectively, and the number denotes the
number of computational nodes in multiples of 144.
Are “free neutrons” in neutron stars free?
Computational Nuclear Physics
Although the nucleus is a microscopic object on earth, there is a gigantic nucleus in the universe: the neutron
star (Fig. 1). Near the surface of a neutron star, a periodic crystalline structure is formed and all the protons are
expected to be confined. In contrast, there are unbound neutrons, which are regarded as “free”. These free neutrons
play a key role in various observed phenomena, such as pulsar glitches and cooling.
Interactive Plot of Atomic nuclei and Computed Shapes (InPACS)
Measuring nuclear properties with accelerators is very expensive. The data obtained are valuable for various
technologies, so they are compiled by nuclear data centers around the world and made open to the public. We have
calculated almost all kinds of nuclides in the universe using energy density functional theory. The computation
complements missing experimental data. To publicize the computational nuclear data, we have opened a web
site, InPACS, where you can interactively obtain various nuclear data and information.
contact address: nakatsukasa@nucl.ph.tsukuba.ac.jp
Fig. 3: Snapshot of InPACS web site.
Fig. 1: Structure of a neutron star
Courtesy of http://www.astroscu.unam.mx/neutrones/
[Plot for Fig. 2: m*/mn (vertical axis, 0.6 to 1.1) versus ρ [fm^-3] (horizontal axis, 0 to 0.1)]
Fig. 2: Ratio of effective mass of
free neutrons in the neutron-
star crust (slab phase) to their
bare mass.
We have examined the properties of the “free neutrons” using nuclear
density functional calculations. Surprisingly, in a certain density region,
they are even “super-free”, which means that their effective mass in the
neutron star is smaller than their mass in the vacuum (Fig. 2)!
This research was supported by
ImPACT project on Reduction and
Resource Recycling of High-level
Radioactive Wastes through Nuclear
Transmutation.
(a) Optical near-field generated in metal-organic framework, IRMOF-10
SALMON: Scalable Ab-initio Light-Matter simulator for Optics and Nanoscience
Optical Properties of Nano-materials in Real Time and Real Space
Quantum Condensed Matter Physics
Understanding the interaction between light and matter is the basis of a wide range of technologies. For this
purpose, it is essential to describe the electron dynamics in matter induced by the electromagnetic fields of light on a
microscopic scale: 10^-9 m (nanometer) in space and 10^-15 s (femtosecond) in time. We have been developing
an open-source computer code, SALMON (Scalable Ab-initio Light-Matter simulator for Optics and
Nanoscience), that describes electron dynamics in molecules, nano-materials, and solids based on first-principles
time-dependent density functional theory [http://salmon-tddft.jp]. As a novel function of
SALMON, light propagation in nano-materials as well as in bulk media can be described, taking full
account of the nonlinearity and nonlocality of light-matter interactions at the ab-initio level. We expect
SALMON to be widely used in cutting-edge research in optics and nanoscience.
contact address: pr@ccs.tsukuba.ac.jp
[Figure labels: laser field, I = 10^10 W/cm^2, ω = 3.38 eV; intensity scale I (W/cm^2): 10^9, 10^10, 10^11, 10^12]
(b) Weak-scaling
performance on the
Fugaku system using
up to 27,648 nodes
to simulate 13,648
atoms.
When a light pulse irradiates nano-sized objects, a strong and spatially
localized electromagnetic field, called the near field, appears around the
object. The near field enables imaging beyond the limit of optical resolution and
enhances nonlinear optical processes. We perform first-principles calculations of
the photoexcitation dynamics of an acetylene molecule in a metal-organic
framework, IRMOF-10. Resonant laser excitation of the IRMOF-10 generates an
optical near field around the two benzene rings that comprise the main
framework of the IRMOF-10. The second harmonic excitation caused by spatial
nonuniformity of the optical near field is observed.
(b) Optical property of metallic metasurface with sub-nm gaps
Thanks to rapid progress in the fabrication of nano-materials, it is
possible to manufacture periodic materials composed of uniformly
structured nano-objects. Here we investigate the optical properties
of quantum plasmonic metasurfaces composed of two-dimensional
arrays of metallic nano-spheres with sub-nm gaps using
time-dependent density functional theory (TDDFT), a fully quantum
mechanical approach. When the quantum and classical
descriptions are compared, the absorption rates of the
metasurface exhibit substantial differences at shorter gap distances.
The differences are caused by electron transport through the gaps
between the nano-objects.
[Figure labels: Absorption rates (Classical, TDDFT; axes: Energy, Gap distances); Current distribution (Re, Im; x, y; 0.4 nm)]
(a) A multiphysics simulation
solving Maxwell, time-
dependent Kohn-Sham,
and Newton equations is
performed on the Fugaku
system for a thin film of
amorphous SiO2 composed
of more than 10,000 atoms.
Disclaimer
The results obtained on the evaluation environment in the trial phase do not guarantee the performance, power and other attributes of the supercomputer Fugaku at the start of its public use operation.
Computational Elucidations for Biomolecules
The world of life is full of mystery. The actual molecular structures, motions, and chemical reactions of biological molecules,
such as proteins, nucleic acids, carbohydrates, and lipids, are still unclear. Using supercomputers, we have performed highly
demanding computations based on molecular dynamics (MD) and hybrid quantum mechanics/molecular mechanics
(QM/MM) methods, and we are answering some important biological questions.
Fig. 2: (a) Effective conformational sampling with MD simulations: Parallel Cascade Selection MD (PaCS-MD). To promote conformational
transitions, the following cycle is repeated in PaCS-MD: (I) selection of initial seeds (structures) that have a high potential to make the transition; (II)
conformational resampling by restarting multiple MD simulations from the selected initial seeds. [R. Harada et al., J. Chem. Phys. 139,
035103 (2013)]
(b) QM/MM model of the oxygen-evolving complex in photosystem II. Key intermediate states in the catalytic reaction “2H2O + 4hν → 4H+ + 4e− + O2”
have been investigated using the large model. [M. Shoji et al., Catal. Sci. Technol., 3, 1831 (2013).]
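The PaCS-MD cycle described in the caption can be illustrated with a toy sketch. This is not the authors' implementation: the "MD run" is a hypothetical 1-D random walk, and seeds are ranked by distance to a known target coordinate, which is only one possible selection criterion. All names and parameters (`short_md`, `pacs_md`, the step size, the convergence threshold) are assumptions made for illustration.

```python
import random

def short_md(x0, n_steps, rng):
    """Toy stand-in for one short MD run: an unbiased 1-D random walk."""
    trajectory = [x0]
    for _ in range(n_steps):
        trajectory.append(trajectory[-1] + rng.gauss(0.0, 0.1))
    return trajectory

def pacs_md(start, target, n_replicas=10, n_steps=50, max_cycles=200, seed=0):
    """Repeat (II) parallel short runs and (I) seed selection until the target is reached."""
    rng = random.Random(seed)
    seeds = [start] * n_replicas
    for cycle in range(1, max_cycles + 1):
        # (II) Resample: restart one short run from every selected seed.
        snapshots = [x for s in seeds for x in short_md(s, n_steps, rng)]
        # (I) Select: keep the snapshots closest to the target as the next seeds.
        snapshots.sort(key=lambda x: abs(x - target))
        seeds = snapshots[:n_replicas]
        if abs(seeds[0] - target) < 0.05:
            return cycle, seeds[0]
    return max_cycles, seeds[0]

cycles, best = pacs_md(start=0.0, target=5.0)
print(f"reached x = {best:.3f} after {cycles} cycles")
```

Because each short run is unbiased, the walk only reaches the distant target through repeated selection; in a real application the ranking measure would be a structural quantity computed from MD snapshots.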
[Fig. 2(b) labels: 2H2O, 4H+, O2; QM region; CaMn4O5 cluster]
GPU-accelerated Molecular Orbital Calculation
Large-scale ab initio molecular orbital calculation is a target application in quantum chemistry for HPC systems,
and the fragment molecular orbital (FMO) method is one such application because it is designed for parallel computers.
We have developed a GPU-accelerated FMO calculation program with CUDA and obtained a 3.8x speedup over the CPU for an on-the-fly
FMO calculation of a protein with 1,961 atoms. [H. Umeda et al., IPSJ Transactions on Advanced Computing Systems 6, 4, (2013) 26-37. H. Umeda et al., SC15 poster (2015).]
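As a quick check of the quoted speedup, the ratios can be recomputed from the lysozyme timings reported in the performance table of Fig. 1(b) (8 nodes without GPUs versus 8 nodes with 32 GPUs). The dictionary names are illustrative only.

```python
# Wall-clock times in seconds for lysozyme (1,961 atoms), taken from the Fig. 1(b) table.
cpu_seconds = {"SCC": 3071, "Dimer SCF": 6246, "ES Dimer": 407, "Total": 9770}
gpu_seconds = {"SCC": 828, "Dimer SCF": 1675, "ES Dimer": 78, "Total": 2597}

# Speedup = CPU time / GPU time, rounded to one decimal as in the table.
speedup = {step: round(cpu_seconds[step] / gpu_seconds[step], 1) for step in cpu_seconds}

for step, factor in speedup.items():
    print(f"{step:10s} {factor}x")
```

The electrostatic (ES) dimer step benefits most from the GPUs, while the overall factor is dominated by the SCC and dimer SCF steps, which account for most of the run time.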
[Fig. 1(a) labels: divide into fragments; SCF calc. for each fragment with ESP (SCC); dimer SCF or ES-dimer calc. for each fragment pair]
[Fig. 1(b) table]
Application      Lysozyme                         HA3
#Atoms           1,961                            23,460
#Nodes (#GPU)    8 (0)      8 (32)     Speedup    64 (256)
SCC              3,071 s    828 s      3.7x       0.52 hr
Dimer SCF        6,246 s    1,675 s    3.7x       0.90 hr
ES Dimer         407 s      78 s       5.2x       0.45 hr
Total            9,770 s    2,597 s    3.8x       1.97 hr
[Fig. 1(c) labels: Influenza HA3 protein (23,460 atoms, 721 fragments); 2 hours for FMO calculation with 256 GPUs]
Fig. 1: (a) FMO calculation scheme, where large molecule is divided into many small fragments. Total molecular properties are reconstructed from the
self consistent field (SCF) calculations of fragments and fragment-pairs with SCC (self-consistent-charge)-condition-satisfied electrostatic potential
(ESP).
(b) Performance of GPU-accelerated FMO calculations. GPU-accelerated FMO-HF/6-31G(d) calculation of lysozyme with HA-PACS base cluster shows
3.8x speedups.
(c) As large-scale MO application, FMO-HF/6-31G(d) calculation of Influenza HA3 protein is successfully performed with 256 GPUs within two hours.
MD and QM/MM simulations using supercomputers
contact address: shigeta@ccs.tsukuba.ac.jp
338-gene analyses resolved the phylogenetic affiliation of a microeukaryote
Microheliella maris.
In silico structural modeling and analysis of translation elongation factor 1α proteins
Biological Sciences
contact address: yuji@ccs.tsukuba.ac.jp
Fig. 2: EF-1α and tRNA structures and surface electrostatic distribution.
(a) EF-1α structure of an archaeon (PDB ID: 3WXM). (b) tRNA structure
(PDB ID: 1EHZ). (c & d) divEF-1α models. Dotted lines in (a), (c) and (d)
indicate the surfaces interacting with tRNA.
Translation elongation factor-1α (EF-1α) interacts with tRNA
during protein synthesis. Some eukaryotes appear to possess
highly divergent EF-1α (divEF-1α), implying that these proteins
lack the ability to interact with tRNA. We modelled the tertiary
structures of divEF-1α and validated the model structures by
molecular dynamics simulations. We found that the molecular
surfaces of divEF-1α are partly negatively charged, suggesting
that they may not interact with negatively charged tRNA as
strongly as the canonical EF-1α with its positively charged
surfaces.
[Fig. 2 panel labels: (a) canonical EF-1α; (b) tRNA; (c) divEF-1α in a diatom; (d) divEF-1α in a fungus; surface interacting with tRNA; color scale from -0.1 V to +0.1 V]
Sakamoto et al. 2019 ACS Omega 4:7308-7316
Previously published phylogenetic studies failed to elucidate the phylogenetic position of the
heliozoan microeukaryote Microheliella maris. Thus, we took a “phylogenomic” approach
to place M. maris accurately in the global tree of eukaryotes. In the phylogeny inferred
from an alignment containing 338 genes, M. maris branched at the base of the clade of a
diverse collection of microeukaryotes collectively called Cryptista, with high statistical
support.
Fig. 1a: Schematic cell drawing of Microheliella maris.
Fig. 1b: Maximum likelihood phylogeny inferred from the 338-gene alignment.
Simulation of Atmospheric General Circulation by Global Cloud Resolving Model,
NICAM
Development of LES Model for thermal environment at city scale
NICAM (Nonhydrostatic ICosahedral Atmospheric Model) is able to realistically reproduce
multi-scale cloud systems, including cumulus convection, tropical cyclones,
Arctic cyclones, the Madden-Julian Oscillation (MJO), and the Intertropical
Convergence Zone (ITCZ).
In Fig. 1, NICAM with glevel-10 (7-km horizontal resolution) simulates
Typhoon Sinlaku near the Philippine Islands and Hurricane Ike near the Gulf of
Mexico well.
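The link between "glevel-10" and a 7-km resolution can be checked with a short calculation. The sketch assumes the usual icosahedral-grid convention, in which each refinement level quadruples the number of cells so that level g has 10·4^g + 2 grid points, and it estimates the spacing as the square root of the mean cell area. The Earth radius value and function names are assumptions for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed value)

def icosahedral_grid_points(glevel):
    """Grid points after `glevel` recursive subdivisions of an icosahedron."""
    return 10 * 4 ** glevel + 2

def mean_grid_spacing_km(glevel):
    """Square root of the mean area per grid point on the sphere."""
    sphere_area = 4.0 * math.pi * EARTH_RADIUS_KM ** 2
    return math.sqrt(sphere_area / icosahedral_grid_points(glevel))

print(icosahedral_grid_points(10))          # about 10.5 million horizontal points
print(round(mean_grid_spacing_km(10), 2))   # close to 7 km
```

Under these assumptions each additional glevel halves the grid spacing, so glevel-11 would correspond to roughly 3.5 km.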
Our group has been developing a Large Eddy
Simulation (LES) model for urban environments.
The main features of the model include (i)
resolving buildings, (ii) resolving roadside trees
in the vertical direction, (iii) computing multiple
reflections of short- and long-wave radiation
between buildings and trees by the radiosity
method, (iv) resolving shadows from buildings
and trees, and (v) incorporating cloud
physics and atmospheric radiation models.
A numerical simulation of the thermal environment
around Tokyo Station was conducted using the
Oakforest-PACS supercomputer. The total
number of grid points is about 100 million.
Division of Global Environmental Sciences
contact address: pr@ccs.tsukuba.ac.jp
Fig. 1: Numerical simulation of the general circulation of the atmosphere
produced by 7-km resolution NICAM.
Fig. 2: Road skin temperature distribution estimated by the CCS-LES model (2a) and helicopter
observation (2b). Black indicates buildings.
Hurricane forecast using an operational numerical weather prediction model
ECMWF OpenIFS
An easy-to-use version of the Integrated Forecasting System (IFS)
operated at ECMWF (European Centre for Medium-Range Weather Forecasts).
・Hydrostatic global spectral model
(max resolution T1279: about 14km grid interval)
・Reduced Gaussian Grid
・Hybrid MPI-OpenMP scheme
(Non-GPU, Non-FPGA)
Results - forecast of Hurricane Joaquin (2015) -
Experimental settings
Version: cy40r1 (ECMWF, 2014), operational version from 19 Nov. 2013 to 11 May 2015
Initial condition: Atmosphere: GFS high-res analysis; Land & Sea: ERA5 reanalysis
Model resolution: T639 L91 (32 km grid spacing on the equator and 91 vertical levels)
Forecast length: 240 hours (960 time steps with dt = 900 s)
Computer / Parallelisation: 256 MPI procs (16 nodes × 16 procs/node), 4 OpenMP threads/process
Computation time: 3:12:38 (19 minutes for 1 day forecast)
Data size of output: 9.9 GB
Computation time decreased by 40% with the Intel MKL library in comparison with
LAPACK.
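The derived figures in the experimental settings follow from simple arithmetic, which can be reproduced as below. All input values are taken from the settings listed above; the variable names are illustrative.

```python
forecast_hours = 240
time_step_s = 900
mpi_procs = 16 * 16          # 16 nodes x 16 processes per node
omp_threads_per_proc = 4
wall_clock_s = 3 * 3600 + 12 * 60 + 38   # 3:12:38

# Number of model time steps in the whole forecast.
n_steps = forecast_hours * 3600 // time_step_s

# Total number of OpenMP threads across all MPI processes.
total_threads = mpi_procs * omp_threads_per_proc

# Wall-clock minutes needed per forecast day.
minutes_per_forecast_day = wall_clock_s / 60 / (forecast_hours / 24)

print(n_steps, total_threads, round(minutes_per_forecast_day, 1))
```

This gives 960 time steps, 1,024 threads in total, and a little over 19 minutes of wall-clock time per forecast day, consistent with the listed values.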
Remark
The experimental result showed a cyclone track similar to the NCEP control
forecasts (thick line), suggesting that the initial conditions had a larger impact on
the track forecast than NWP models in this case.
Fig. 3: Predicted cyclone tracks of Hurricane Joaquin (coloured lines) by ECMWF
(Europe, left), the OpenIFS experiment (second left), NCEP (US, second right)
and JMA (Japan, rightmost). Black lines show the observed track.
Implementation of Parallel 3-D Real FFT with 2-D Decomposition on Intel Xeon Phi Clusters
Numerical Computation
contact address: pr@ccs.tsukuba.ac.jp
Development of a highly accurate Block Krylov solver
Background
The fast Fourier transform (FFT) is an algorithm which is currently widely used in science and engineering. A typical
decomposition for performing a parallel 3-D FFT is slabwise. This becomes an issue with very large MPI process counts
for a massively parallel cluster of many-core processors.
Overview
We proposed an implementation of a parallel 3-D real FFT with 2-D decomposition on Intel Xeon Phi clusters. The
proposed implementation of the parallel 3-D real FFT is based on the conjugate symmetry property of the discrete
Fourier transform (DFT) and the row-column FFT algorithm. We vectorized FFT kernels using the Intel Advanced Vector
Extensions 512 (Intel AVX-512) instructions.
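The two ingredients named in the overview, the row-column algorithm and the conjugate symmetry of the DFT of real data, can be demonstrated on a tiny array. The sketch below is serial and uses a naive O(n²) 1-D DFT in place of an optimized FFT kernel; it does not reproduce FFTE, its 2-D decomposition, or its vectorization, and all function names are hypothetical.

```python
import cmath
import random

def dft1d(x):
    """Naive O(n^2) 1-D DFT; a stand-in for an optimized FFT kernel."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def row_column_dft3d(a):
    """3-D DFT of a[i][j][k] computed as 1-D DFTs along each axis in turn."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    out = [[[complex(v) for v in row] for row in plane] for plane in a]
    for i in range(n0):                      # transform along the last axis
        for j in range(n1):
            out[i][j] = dft1d(out[i][j])
    for i in range(n0):                      # transform along the middle axis
        for k in range(n2):
            col = dft1d([out[i][j][k] for j in range(n1)])
            for j in range(n1):
                out[i][j][k] = col[j]
    for j in range(n1):                      # transform along the first axis
        for k in range(n2):
            col = dft1d([out[i][j][k] for i in range(n0)])
            for i in range(n0):
                out[i][j][k] = col[i]
    return out

def direct_dft3d(a, k0, k1, k2):
    """One output element straight from the 3-D DFT definition (for checking)."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return sum(a[i][j][k]
               * cmath.exp(-2j * cmath.pi * (i * k0 / n0 + j * k1 / n1 + k * k2 / n2))
               for i in range(n0) for j in range(n1) for k in range(n2))

random.seed(0)
n0, n1, n2 = 2, 3, 4
a = [[[random.random() for _ in range(n2)] for _ in range(n1)] for _ in range(n0)]
spec = row_column_dft3d(a)

indices = [(k0, k1, k2) for k0 in range(n0) for k1 in range(n1) for k2 in range(n2)]
# Row-column result compared with the definition.
rc_error = max(abs(spec[k0][k1][k2] - direct_dft3d(a, k0, k1, k2)) for k0, k1, k2 in indices)
# Real input: X[k] equals conj(X[-k]), so roughly half of the output is redundant.
sym_error = max(abs(spec[k0][k1][k2] - spec[-k0 % n0][-k1 % n1][-k2 % n2].conjugate())
                for k0, k1, k2 in indices)
print(rc_error, sym_error)
```

Because of the symmetry, a real-to-complex transform only needs to compute and communicate about half of the spectrum along one axis, which is the saving a real FFT exploits.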
Performance
To evaluate the implemented 3-D real FFT with 2-D
decomposition, referred to as FFTE 7.0 (2-D
decomposition), we compared its performance with
that of the FFTE 7.0 (1-D decomposition), the FFTW
3.3.8 and the P3DFFT 2.7.7. The performance results
demonstrate that the proposed implementation of
parallel 3-D real FFT with 2-D decomposition
effectively improves performance by reducing the
communication time for larger numbers of MPI
processes on Intel Xeon Phi clusters.
Fig. 1: Performance of Parallel 3-D Real FFTs (N = 256 × 512 × 512)
Linear systems with multiple right-hand sides appear in many scientific applications, such as the computation of physical
quantities in lattice quantum chromodynamics (QCD) and the inner problems of eigensolvers for sparse matrices. Block Krylov
subspace methods are known to be efficient numerical methods for solving these linear systems
in terms of the number of iterations and the computation time. However, the accuracy of the obtained solution
often deteriorates because of errors that occur in the computation of matrix-matrix multiplications. To improve the
accuracy of the obtained solution, we have developed a new Block Krylov subspace method named the Block GWBiCGSTAB
method [1]. The Block GWBiCGSTAB method is based on the group-wise updating technique. By using this technique, the
matrix-matrix multiplications that cause accuracy degradation can be avoided. As shown in Fig. 2, the accuracy of the
solution generated by the Block GWBiCGSTAB method is higher than that of other methods.
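The accuracy measure plotted in Fig. 2, the true relative residual norm, is computed explicitly from B - AX rather than taken from the residual that a solver updates by recurrence. A minimal sketch for a system AX = B with L right-hand sides follows; the use of the Frobenius norm and the small test matrices are assumptions, not details from the source.

```python
import math

def matmul(a, b):
    """Dense matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def frobenius_norm(m):
    return math.sqrt(sum(v * v for row in m for v in row))

def true_relative_residual(a, x, b):
    """||B - A X||_F / ||B||_F, evaluated directly from the current solution X."""
    ax = matmul(a, x)
    r = [[b[i][j] - ax[i][j] for j in range(len(b[0]))] for i in range(len(b))]
    return frobenius_norm(r) / frobenius_norm(b)

# Hypothetical 2 x 2 system with L = 2 right-hand sides.
A = [[4.0, 1.0], [1.0, 3.0]]
X_exact = [[1.0, 0.0], [2.0, 1.0]]
B = matmul(A, X_exact)
X_approx = [[1.0 + 1e-6, 0.0], [2.0, 1.0]]

print(true_relative_residual(A, X_exact, B))
print(true_relative_residual(A, X_approx, B))
```

A gap between this quantity and the recurrence residual reported by a solver is what indicates the loss of accuracy discussed above.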
Fig. 2: True relative residual norm as a function of the
number L of right-hand sides. The test problem is the
linear system derived from the lattice QCD calculation.
Problem size: 1,572,864.
[1] Hiroto Tadano and Ryosei Kuramoto, Accuracy improvement of the Block BiCGSTAB method for linear systems with multiple right-hand sides
by group-wise updating technique, J. Adv. Simulat. Sci. Eng., Vol. 6, No. 1, pp. 100-117, 2019.
Python is one of the most popular general-purpose programming
languages, and persistent memory (PMEM) is a new device that can
accelerate data-intensive computing. There is strong demand for an
easy way to use persistent memory from Python. Therefore, we focus on pmemkv,
which is a key-value store optimized for persistent memory, and its
Python bindings. We are currently evaluating pmemkv’s Python
bindings in detail for efficient use of PMEM in Python.
http://oss-tsukuba.org/en/software/gfarm
Software Research for Big Data and Extreme-Scale Computing
contact address: pr@ccs.tsukuba.ac.jp
Investigating the DAOS architecture for metadata operations
Research on a caching file system to exploit node-local storage
The open-source DAOS (Distributed Asynchronous Object Storage)
is notable for its rank on the IO-500 list and its use of Intel®
Optane™ Persistent Memory. In particular, its metadata performance
is remarkable compared to other systems.
We investigate which aspects of the DAOS architecture account for its remarkable metadata
performance and consider either integrating DAOS
techniques into an existing system or developing a new storage system with
persistent memory.
The performance gap between processors and disk-based storage is
growing in modern HPC systems. To reduce the gap, SSDs attached to
compute nodes have been used as a “node-local burst buffer”. We are
implementing a distributed file system that uses local SSDs as a caching
layer for the storage nodes. The system uses the FUSE library for system
call replacement and the Mochi framework for RPC data transfer.
Acknowledgment
This work is partially supported by the Multidisciplinary Cooperative Research Program in CCS, University of Tsukuba, the New Energy and Industrial
Technology Development Organization (NEDO), and Fujitsu Laboratories Ltd.
Gfarm/BB: Gfarm File System for Node-local
burst buffer
Accelerating Python Applications with
Persistent Memory
Features include
•Open source
•Exploit local storage and data locality for scalable I/O performance
•InfiniBand support
•Data integrity support against silent data corruption
•Production systems: 8PB JLDG, 100PB HPCI Storage, etc.
gfarmbb -h hostfile -m mount_point start
…
gfarmbb -h hostfile stop
Fig. 1: IOR file-per-process read/write performance on Cygnus supercomputer
Fig. 3: mdtest performance comparison of IO-500 10 node challenge scores
Fig. 4: Automation of construction/destruction of a swarm cluster
Fig. 2a: Memory-storage hierarchy
with persistent memory
Fig. 2b: Applications can directly access
the persistent memory resident data
structures without using buffers.
Acceleration of Deep Learning using PyTorch
with persistent memory
Persistent memory offers greater capacity than DRAM and significantly
better performance than storage. We use it for deep learning with
PyTorch. Usually, before performing deep learning using the GPU, the
training data is copied to the main memory from the storage. We
exploit the persistent memory to improve the performance.
Scalable Graph Analysis over Intel Xeon Phi Coprocessors
The structural graph clustering method SCAN has been used successfully in many applications because it not only detects densely
connected nodes as clusters but also extracts sparsely connected nodes as hubs or outliers (Fig. 1). However, it is difficult to
apply SCAN to large-scale graphs because SCAN needs to evaluate the density for all adjacent nodes in the graph. To
address this problem, we present a novel algorithm, SCAN-XP, that runs on Intel Xeon Phi
coprocessors. We designed SCAN-XP to make the best use of the many cores in the Intel Xeon Phi by employing the following
approaches. First, SCAN-XP avoids the bottlenecks that arise in parallel graph computations by balancing the load
well among the cores. Second, SCAN-XP effectively exploits the 512-bit SIMD instructions implemented in each core to
speed up the density evaluations. As a result, SCAN-XP runs approximately 100 times faster than SCAN; for graphs with
100 million edges, SCAN-XP finishes in a few seconds (Fig. 2).
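The per-edge "density" evaluation that dominates the cost of SCAN can be sketched as follows. The sketch assumes the structural similarity used in the original SCAN formulation, the size of the shared closed neighbourhood divided by the geometric mean of the two neighbourhood sizes; it is sequential and does not reproduce the load balancing or SIMD techniques of SCAN-XP. The example graph is hypothetical.

```python
import math

def structural_similarity(adj, u, v):
    """|N[u] & N[v]| / sqrt(|N[u]| * |N[v]|), with closed neighbourhoods N[x] = adj[x] + {x}."""
    nu = adj[u] | {u}
    nv = adj[v] | {v}
    return len(nu & nv) / math.sqrt(len(nu) * len(nv))

# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge (2, 3).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# One similarity evaluation per edge: this loop is what a parallel version distributes.
sims = {(u, v): structural_similarity(adj, u, v) for u, v in edges}
for edge, s in sims.items():
    print(edge, round(s, 3))
```

Edges inside a triangle score high, while the bridge edge scores lowest; thresholding these values is how clusters are separated from hubs and outliers.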
Fig. 2: Overall performances
Fig. 1: Structural Graph Clustering SCAN
Table 1: Real-world Dataset
Noise-robust sleep stage scoring for mice using deep learning & big data
Database Group
Sleep stage scoring for mice is one of the most basic analyses in sleep research; however, this analysis is time-consuming
and requires considerable expertise and effort. Although several studies have proposed automated scoring methods, they
are not robust enough against noise in biological signals for research use. To develop a noise-robust scoring
method, we employ the following approaches.
1) Employing convolutional neural networks (CNN) and long short-term memory (LSTM), which can capture the features of
both the biological signals and the noise in them.
2) Training the model using noisy biological signals obtained from over 3000 mice.
Thanks to these improvements, the proposed method achieved a scoring accuracy of more than 95% for noisy biological
signals. This result indicates that our method is practical enough for use in sleep research.
contact address: {kitagawa, amagasa, shiokawa, horie}@cs.tsukuba.ac.jp
[Fig. 3 labels: steps ①, ②, ③; stages W (Wake), NR (Non-REM), R (REM)]
① Measure biological signals (EEG & EMG) from mice
② Split the signals into 20-sec. epochs (subsequences)
③ Assign sleep stages (W, NR, and R) to epochs
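Step ② of the procedure, splitting a recording into 20-second epochs, can be sketched as below. The sampling rate of 250 Hz and the choice to drop a trailing partial epoch are assumptions for illustration; the source does not state either.

```python
def split_into_epochs(signal, sampling_rate_hz, epoch_seconds=20):
    """Cut a 1-D signal into consecutive, non-overlapping epochs; drop any partial tail."""
    samples_per_epoch = int(sampling_rate_hz * epoch_seconds)
    n_epochs = len(signal) // samples_per_epoch
    return [signal[i * samples_per_epoch:(i + 1) * samples_per_epoch]
            for i in range(n_epochs)]

# 65 seconds of dummy samples at a hypothetical 250 Hz sampling rate.
eeg = list(range(250 * 65))
epochs = split_into_epochs(eeg, sampling_rate_hz=250)
print(len(epochs), len(epochs[0]))
```

Each resulting epoch is then the unit to which one sleep stage (W, NR, or R) is assigned in step ③.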
[Fig. 4 labels: inputs (EEG, EMG); feature extraction (CNNs with wide filters, CNN with narrow filters); scoring model (LSTM, dense, softmax); output stage {W, NR, R}]
Stage Peak Freq. of EEG Amplitude of EMG
W 7-11 Hz Large
NR 1-6 Hz Small
R 7-11 Hz Smallest
Fig. 4: Structure of the proposed system
Fig. 3: Procedure of sleep stage scoring
Table 2: Feature of each stage
Computational Media Group
We are researching navigation for the visually impaired. We
propose a new interface that uses sound and vibration to
support the turn-by-turn navigation that is common for visually
impaired people. In our proposed interface, the target path is divided
into straight segments and points of direction change. The
navigation instructions given by sound and vibration are
carefully designed to give minimal yet sufficient cues to
visually impaired pedestrians. We have implemented a
preliminary system based on our proposal and conducted an
experiment with visually impaired participants. The results
imply that our proposed approach is useful for navigation by visually
impaired people.
Accurate Overlapping Method of Time-Lapse Images for World Heritage Site Investigation
A method is proposed to accurately
overlap multiple high-quality images with
different shooting positions and intervals
by combining corresponding point
information between images and 3D
shape information. In the proposed
method, the correct feature matching of
images obtained by rendering the 3D
model of the subject is used. In this
research, the subjects were the pillars of
the Angkor Thom Bayon Temple and the
epilithic microorganisms adhering to and
eroding their surfaces. Synthetic
transformation of a homography utilizing
the correct matches is employed to
overlap the target images.
contact	address:	pr@ccs.tsukuba.ac.jp
We proposes a method to improve the
quality of omnidirectional free-viewpoint
images using generative adversarial
networks (GAN). By estimating the 3D
information of the capturing space while
integrating the omnidirectional images
taken from multiple viewpoints, it is
possible to generate an arbitrary
omnidirectional appearance. However,
the image quality of free-viewpoint
images deteriorates due to artifacts
caused by 3D estimation errors and
occlusion. We solve this problem by using
GAN and, moreover, by focusing on
projective geometry during training, we
further improve image quality by
converting the omnidirectional image into
perspective-projection images.
Information	Display	Design	on	Turn-By-Turn	Navigation	for	Visually	Impaired	People
Image-quality	Improvement	of	Omnidirectional:		Free-Viewpoint	Images	by	GAN
(a):	OFV	image	(no	image-quality	improvement).	
(d):	Correct	image	(captured	image).	
(b):	Proposed	method	using	learning	by	image	
division	(with	image-quality	improvement).	
(c):	Proposed	method	using	learning	with	
omnidirectional	images	(with	image-quality	
improvement).	
Location	Estimation	by	CV
Reference Query
Orientation	measurement	by	IMU
Goal
LR-correction Orientation
Signal	to	turn Mode	change
Voice	
Announce
Field	test
Multi-Hybrid Accelerated Computing Platform
• Combining the strengths of different types of accelerators: GPU + FPGA
• The GPU is still an essential accelerator for simple, highly parallel workloads, providing ~10 TFLOPS peak performance
• The FPGA is a new type of accelerator that offers programmable, application-specific hardware and gains speed by pipelining calculations
• The FPGA is also well suited to external communication between FPGAs, with advanced high-speed interconnection of up to 100 Gbps x4 channels

OpenCL-ready High Speed FPGA Networking [1]
• FPGA design plan
• Router
- This implementation is mandatory for the dedicated network.
- Forwards packets to their destinations.
• User Logic
- The OpenCL kernel runs here.
- Inter-FPGA communication can be controlled from the OpenCL kernel.
• SL3
- SerialLite III: an Intel FPGA IP
- Includes the transceiver modules for inter-FPGA data transfer.
- Users do not need to care about it.

Supercomputer at CCS: Cygnus
• Our new supercomputer “Cygnus”
• Operation started in May 2019
• 2x Intel Xeon CPUs, 4x NVIDIA V100 GPUs, 2x Intel Stratix10 FPGAs
• Deneb: 49 CPU+GPU nodes
• Albireo: 32 CPU+GPU+FPGA nodes with a dedicated 2D-torus network for FPGAs (100 Gbps x4)
[Figure: Cygnus network. Deneb and Albireo compute nodes are all connected by the IB EDR network (100 Gbps x4/node, 4 ports, full bisection bandwidth), annotated "Ordinary inter-node communication channel for CPU and GPU, but they can also request it to FPGA". Albireo nodes are additionally connected by the inter-FPGA direct network. Labels: Albireo node (x32), Deneb node (x48)]
Target GPU: NVIDIA Tesla V100
Target FPGA: Nallatech 520N

Specification of Cygnus
Item | Specification
Peak performance | 2.4 PFLOPS DP (GPU: 2.2 PFLOPS, CPU: 0.2 PFLOPS, FPGA: 0.6 PFLOPS SP) ⇨ enhanced by mixed precision and variable precision on FPGA
# of nodes | 81 (32 Albireo (GPU+FPGA) nodes, 49 Deneb (GPU-only) nodes)
Memory | 192 GiB DDR4-2666/node = 256 GB/s, 32 GiB x 4 for GPU/node = 3.6 TB/s
CPU / node | Intel Xeon Gold (SKL) x2 sockets
GPU / node | NVIDIA V100 x4 (PCIe)
FPGA / node | Intel Stratix10 x2 (each with 100 Gbps x4 links/FPGA and x8 links/node)
Global File System | Lustre, RAID6, 2.5 PB
Interconnection Network | Mellanox InfiniBand HDR100 x4 (two cables of HDR200 / node), 4 TB/s aggregated bandwidth
Programming Language | CPU: C, C++, Fortran, OpenMP; GPU: OpenACC, CUDA; FPGA: OpenCL, Verilog HDL
System Vendor | NEC
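Several per-node figures in the Cygnus specification table can be cross-checked with simple arithmetic. The sketch below is an illustration under stated assumptions, not a vendor calculation: the six DDR4 channels per socket and the 900 GB/s HBM2 bandwidth per V100 are values assumed here, not given on the poster.

```python
"""Cross-check of bandwidth figures in the Cygnus specification table."""

# Assumptions (not stated on the poster): six DDR4 channels per Xeon Gold
# (Skylake) socket, 8 bytes per transfer, 900 GB/s of HBM2 per V100.
DDR4_TRANSFERS_PER_SEC = 2666e6
BYTES_PER_TRANSFER = 8
CHANNELS_PER_SOCKET = 6
SOCKETS_PER_NODE = 2
V100_HBM2_GB_PER_S = 900.0
GPUS_PER_NODE = 4

# Values taken from the table.
NODES = 81
IB_LINKS_PER_NODE = 4
IB_LINK_GBIT_PER_S = 100.0


def cpu_memory_bandwidth_gb_s() -> float:
    """DDR4-2666 bandwidth of one node in GB/s."""
    per_channel = DDR4_TRANSFERS_PER_SEC * BYTES_PER_TRANSFER
    return per_channel * CHANNELS_PER_SOCKET * SOCKETS_PER_NODE / 1e9


def gpu_memory_bandwidth_tb_s() -> float:
    """Combined GPU memory bandwidth of one node in TB/s."""
    return V100_HBM2_GB_PER_S * GPUS_PER_NODE / 1000.0


def aggregate_network_tb_s() -> float:
    """Aggregate InfiniBand bandwidth over all nodes in TB/s (8 bits per byte)."""
    return NODES * IB_LINKS_PER_NODE * IB_LINK_GBIT_PER_S / 8.0 / 1000.0


if __name__ == "__main__":
    print(f"CPU memory: {cpu_memory_bandwidth_gb_s():.0f} GB/s per node")
    print(f"GPU memory: {gpu_memory_bandwidth_tb_s():.1f} TB/s per node")
    print(f"Network:    {aggregate_network_tb_s():.2f} TB/s aggregated")
```

Under these assumptions the results agree with the table's 256 GB/s, 3.6 TB/s, and roughly 4 TB/s.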
[Figure: Node configurations. In a single node with FPGA (Albireo), each of the two CPUs has a PCIe network (switch) connecting two GPUs, one FPGA, and two HCAs; the HCAs connect to the network switch (100 Gbps x2) and the FPGA connects to the inter-FPGA direct network (100 Gbps x4). A single node without FPGA (Deneb) has the same structure without the FPGAs.]
[Figure: Inter-FPGA direct network (only for Albireo nodes). The 64 FPGAs on Albireo nodes are connected directly in a 2D-torus configuration through QSFP28 ports, without an Ethernet switch.]
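The 2D-torus wiring of the 64 FPGAs can be illustrated by computing the four neighbors of each FPGA. This is a sketch under assumptions: the 8 x 8 shape, the row-major numbering, and the function name `torus_neighbors` are hypothetical, since the poster states only that 64 FPGAs form a 2D torus.

```python
from typing import Dict


def torus_neighbors(fpga_id: int, width: int = 8, height: int = 8) -> Dict[str, int]:
    """Return the IDs of the four neighbors of an FPGA in a width x height 2D torus.

    FPGAs are numbered row by row (an assumption); indices wrap around at the
    edges, which is what makes the mesh a torus.
    """
    if not 0 <= fpga_id < width * height:
        raise ValueError("fpga_id out of range")
    x, y = fpga_id % width, fpga_id // width
    return {
        "x_neg": (x - 1) % width + y * width,
        "x_pos": (x + 1) % width + y * width,
        "y_neg": x + ((y - 1) % height) * width,
        "y_pos": x + ((y + 1) % height) * width,
    }


if __name__ == "__main__":
    # A corner FPGA still has four neighbors thanks to the wrap-around links.
    print(torus_neighbors(0))
```

With four links per FPGA, every FPGA uses exactly one link per direction, which matches the 100 Gbps x4 links per FPGA listed in the specification.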
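IPSJ SIG Technical Report
// Read one 16-lane vector from the input channel selected by in_port.
uint16 val = (uint16)(0);
if (in_port == 1) {
    val = read_channel_intel(fwd_x_neg_in);
} else if (in_port == 2) {
    val = read_channel_intel(fwd_x_pos_in);
} else if (in_port == 3) {
    val = read_channel_intel(fwd_y_neg_in);
} else if (in_port == 4) {
    val = read_channel_intel(fwd_y_pos_in);
}
// Add the local contribution (v + 0 ... v + 15) to the received vector.
val += (uint16)(
    v + 0, v + 1, v + 2, v + 3,
    v + 4, v + 5, v + 6, v + 7,
    v + 8, v + 9, v + 10, v + 11,
    v + 12, v + 13, v + 14, v + 15
);
// Write the result to the output channel selected by out_port.
// clocktime() is the authors' own time-measurement function.
ulong t_tmp = 0;
if (out_port == 1) {
    write_channel_intel(fwd_x_neg_out, clocktime(val, &t_tmp));
} else if (out_port == 2) {
    write_channel_intel(fwd_x_pos_out, clocktime(val, &t_tmp));
} else if (out_port == 3) {
    write_channel_intel(fwd_y_neg_out, clocktime(val, &t_tmp));
} else if (out_port == 4) {
    write_channel_intel(fwd_y_pos_out, clocktime(val, &t_tmp));
} else if (out_port == 5) {
    write_channel_intel(internal, clocktime(val, &t_tmp));
}
Fig. 13: Part of the OpenCL code of the toy program.

[...] consists of two types of kernels. One is the kernel that performs the outbound data transfer, and the other is the
kernel that performs the return data transfer. Communication is performed in a bucket-brigade manner, and once the whole
computation is complete, the result is returned to all nodes. The summation is performed on the outbound path, and the
result is broadcast on the return path.
Fig. 13 shows part of the code of the toy program. This code receives an input, adds a value to it, and outputs the
result; it is part of the kernel shown in gray in Fig. 12. The read_channel_intel and write_channel_intel functions are
built-in functions that read from and write to a Channel, respectively, and clocktime is our own function for measuring
time. The if statements switch the Channel used for input and output. This is needed because CoE does not yet have a
routing function, so the external link on the FPGA board that is used for communication must be specified explicitly.
Fig. 14 shows the results of the performance evaluation. We obtained a minimum latency of 2014 ns and a maximum
throughput of 181.4 Gbps. This experiment used three nodes: ppx2-02, ppx2-03, and ppx2-05. As shown in Fig. 12, ppx2-05
is the starting node and the data turns around at ppx2-03. The measured throughput was derived from the time between the
start of data transmission at the starting node and the end of data reception at the starting node. The data size on the
horizontal axis is the size of the data held by each node and corresponds to the count argument of MPI_Allreduce.
Compared with the pingpong benchmark result of 90.7 Gbps, the 181.4 Gbps obtained here is about twice as high. This is
because communication and computation are pipelined, so sending and receiving take place simultaneously.
Fig. 14: Measurement results of the toy program.

Table 3: Protocol overhead
Element | Payload rate | Efficiency
Physical layer rate | 103.125 Gbps |
67b/64b | 98.484 Gbps | ×0.955
Meta Frame | 98.287 Gbps | ×0.998
SL3 Burst | 96.813 Gbps | ×0.985
CoE Header | 90.762 Gbps | ×0.938

6. Discussion
6.1 pingpong benchmark
The maximum throughput obtained in the pingpong benchmark is 90.7 Gbps, which is only about 90% of the 100 Gbps used in
the physical layer. However, this performance is as intended by the design. Table 3 shows the theoretical communication
performance. In the evaluation environment, the physical layer rate is 103.125 Gbps (4 × 25.78125 Gbps), the same rate
as the physical layer of 100Gb Ethernet. Table 3 shows how much protocol overhead there is relative to that physical
layer rate. Of these elements, 67b/64b, Meta Frame, and SL3 Burst are overheads that originate from SerialLite III and
were calculated using the formulas given in the official documentation [11]. CoE Header is the overhead of the header
that CoE adds. A CoE packet consists of 64 bytes, and a 4-byte header [...]
© 2019 Information Processing Society of Japan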
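The payload rates in the protocol overhead table can be reproduced by multiplying the physical layer rate by each efficiency factor in turn. In this sketch the first three factors are the rounded values from the table, and the CoE header factor is taken as 60/64 on the assumption that the 4-byte header is counted inside the 64-byte packet, which is consistent with the ×0.938 in the table.

```python
from typing import List, Tuple

PHYSICAL_LAYER_GBPS = 4 * 25.78125  # four lanes, as in the report: 103.125 Gbps

# (element, efficiency). The CoE value 60/64 is an assumption: 4 header bytes
# out of a 64-byte packet.
OVERHEADS: List[Tuple[str, float]] = [
    ("67b/64b", 0.955),
    ("Meta Frame", 0.998),
    ("SL3 Burst", 0.985),
    ("CoE Header", 60 / 64),
]


def payload_rates(physical_gbps: float = PHYSICAL_LAYER_GBPS) -> List[Tuple[str, float]]:
    """Apply each protocol overhead in turn and return the remaining payload rate."""
    rates = []
    rate = physical_gbps
    for element, efficiency in OVERHEADS:
        rate *= efficiency
        rates.append((element, rate))
    return rates


if __name__ == "__main__":
    print(f"{'Physical layer':<15}{PHYSICAL_LAYER_GBPS:9.3f} Gbps")
    for element, rate in payload_rates():
        print(f"{element:<15}{rate:9.3f} Gbps")
```

The final value, about 90.76 Gbps, is the theoretical ceiling that the measured 90.7 Gbps of the pingpong benchmark approaches.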
• Communication Integrated Reconfigurable CompUting System (CIRCUS)
- CIRCUS enables OpenCL code to communicate with other FPGAs on different nodes
- It extends Intel's channel mechanism to external communication
- Pipelined manner: data is sent from and received into the compute pipeline directly
• FPGA-based parallel computing with OpenCL
- Needs a communication system suited to OpenCL and Intel FPGAs
- Uses the Intel FPGA SDK for OpenCL
・ I/O Channel
- Connects OpenCL with peripherals
- We used this feature
・ Channel Extension: transfers data between kernels directly (low latency and high bandwidth)
・ We can use a multiple-kernel design to exploit spatial parallelism in an FPGA
[Figure: Communication without channels: the source kernel writes to global memory (DDR4, off chip) and the destination kernel reads from it. Communication with channels: the source kernel sends data to the destination kernel through a FIFO channel.]
[Figure: Cluster System with FPGAs. Labels: OpenCL Kernel, BSP, IO Channel, Serial Link (x4), Network Controller, FPGA, PCIe, OpenCL API, Interconnect, CIRCUS Backends]

Our proposed method
sender code on FPGA1:
// Reads n floats from global memory and writes them to the external channel.
__kernel void sender(__global float* restrict x, int n) {
    for (int i = 0; i < n; i++) {
        float v = x[i];
        write_channel_intel(simple_out, v);
    }
}
receiver code on FPGA2:
// Reads n floats from the external channel and stores them in global memory.
__kernel void receiver(__global float* restrict x, int n) {
    for (int i = 0; i < n; i++) {
        float v = read_channel_intel(simple_in);
        x[i] = v;
    }
}

Pipelined communication experiment
[Figure: Recv., Comp., and Send stages of the pipeline; A and B mark the start and end points of the time measurement; 90.7 Gbps]
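The blocking FIFO behavior of a channel between a sender kernel and a receiver kernel can be mimicked on a CPU for illustration. In the sketch below a bounded `queue.Queue` stands in for the channel and two threads stand in for the kernels; the name `run_pipeline` and the FIFO depth are hypothetical and say nothing about the actual hardware.

```python
import queue
import threading
from typing import List, Sequence


def sender(channel: "queue.Queue[float]", x: Sequence[float]) -> None:
    """Write every element to the channel; put() blocks while the FIFO is full."""
    for v in x:
        channel.put(v)


def receiver(channel: "queue.Queue[float]", n: int, out: List[float]) -> None:
    """Read n elements from the channel; get() blocks while the FIFO is empty."""
    for _ in range(n):
        out.append(channel.get())


def run_pipeline(x: Sequence[float], depth: int = 4) -> List[float]:
    """Run sender and receiver concurrently, connected by a FIFO of the given depth."""
    channel: "queue.Queue[float]" = queue.Queue(maxsize=depth)
    out: List[float] = []
    threads = [
        threading.Thread(target=sender, args=(channel, x)),
        threading.Thread(target=receiver, args=(channel, len(x), out)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


if __name__ == "__main__":
    print(run_pipeline([1.0, 2.0, 3.0]))
```

Because the FIFO preserves order and both ends block, the receiver sees the data in exactly the order it was sent, without any intermediate buffer in global memory.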
Authentic Radiation Transfer [2]
• Accelerated Radiative transfer on Grids using Oct-Tree (ARGOT) has been developed at the Center for Computational Sciences, University of Tsukuba
• ART is one of the algorithms used in ARGOT and is the dominant part (90% or more of the computation time) of the ARGOT program
• ART is a ray-tracing-based algorithm
• The problem space is divided into meshes, and reactions are computed on each mesh
• The memory access pattern depends on the ray direction
• Not suitable for SIMD architectures
[Figure: Performance [Mmesh/s] (0 to 1400, higher is better) versus mesh size, from (16,16,16) to (128,128,128), for CPU(14C), CPU(28C), P100(x1), and FPGA]
Table 2: Resource usage and clock frequency
size | # of PEs | ALMs | (%) | Registers | (%) | M20…
(16, 16, 16) | (2, 2, 2) | 132,283 | 31% | 267,828 | 31% | 7…
(32, 32, 32) | (2, 2, 2) | 169,882 | 40% | 344,447 | 40% | 7…
(64, 64, 64) | (2, 2, 2) | 169,549 | 40% | 344,512 | 40% | 7…
(128, 128, 128) | (2, 2, 2) | 169,662 | 40% | 344,505 | 40% | 7…
Table 3: Performance comparison between FPGA, CPU and GPU implementations. The unit is M mesh/sec.
Size | CPU(14C) | CPU(28C) | P100 | FPGA
(16,16,16) | 112.4 | 77.2 | 105.3 | 1282.8
(32,32,32) | 158.9 | 183.4 | 490.4 | 1165.2
(64,64,64) | 175.0 | 227.2 | 1041.4 | 1111.0
(128,128,128) | 95.4 | 165.0 | 1116.1 | 1133.5
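The relative performance of the FPGA implementation follows directly from the performance comparison table. The sketch below computes the ratios from the tabulated values. Pairing the "4.9x faster" annotation with the (64,64,64) CPU(28C) result is an inference from the numbers, not something the poster states.

```python
from typing import Dict, Tuple

Size = Tuple[int, int, int]

# Performance in M mesh/sec, copied from the comparison table.
PERF: Dict[Size, Dict[str, float]] = {
    (16, 16, 16): {"CPU(14C)": 112.4, "CPU(28C)": 77.2, "P100": 105.3, "FPGA": 1282.8},
    (32, 32, 32): {"CPU(14C)": 158.9, "CPU(28C)": 183.4, "P100": 490.4, "FPGA": 1165.2},
    (64, 64, 64): {"CPU(14C)": 175.0, "CPU(28C)": 227.2, "P100": 1041.4, "FPGA": 1111.0},
    (128, 128, 128): {"CPU(14C)": 95.4, "CPU(28C)": 165.0, "P100": 1116.1, "FPGA": 1133.5},
}


def fpga_speedup(size: Size, baseline: str) -> float:
    """Ratio of FPGA performance to the given baseline for one mesh size."""
    return PERF[size]["FPGA"] / PERF[size][baseline]


if __name__ == "__main__":
    for size in PERF:
        ratios = ", ".join(
            f"{name}: {fpga_speedup(size, name):.2f}x"
            for name in ("CPU(14C)", "CPU(28C)", "P100")
        )
        print(size, ratios)
```

For the largest mesh the FPGA and the P100 are within about 2% of each other, while for the smallest mesh the FPGA is more than ten times faster than either CPU configuration or the GPU.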
[...] per link) multiple interconnection links (up to 4 channels) on
it. Additionally, an HLS programming environment such as OpenCL is provided, and there are several types of research
that involve them in FPGA computing. In [3], Kobayashi et al. show the basic features for utilizing the high-speed
interconnection over FPGAs driven by OpenCL kernels. Therefore, although the performance of our implementation is
almost the same as that of the NVIDIA P100 GPU, the overall performance with [...]
Fig. 5: Design Outline of ART on FPGA.
[Figure labels: DDR4 Memory, Memory Reader, Memory Writer, Buffer (x2), PE Array (2x2x2), Channel, Memory Network]
each other. Each kernel computes the reaction between a mesh and a ray in its own dedicated computation space. During
computation, a ray traverses multiple compute kernels depending on its location. If a ray leaves a kernel's space, its
data is transferred to a neighboring kernel through a channel.
Figure 5 shows the design outline of our implementation. “Memory Reader” reads mesh data from DDR4 memory, which is seen
as global memory from the OpenCL language. “Memory Writer” is the counterpart to the reader and updates mesh data with
the result of the computation. It performs both read and write memory accesses because it computes the integration of
the gas reaction. “Buffer” is a mesh data buffer that improves memory access performance. “PE Array” is an array of PEs
(Processing Elements). A PE computes the kernel of the ART method. The array consists of multiple kernels. We show the
details of the PE network in the next subsection.
Since our implementation is a work in progress, it lacks some features of the CPU implementation. During computation on
an FPGA, all mesh data must be placed in its internal BRAM (Block Random Access Memory). The FPGA implementation does
not support replacing mesh data as the computation progresses. Therefore, the problem size that an FPGA can solve is
limited by the size of the BRAM. The CPU implementation supports inter-node parallelization using MPI (Message Passing
Interface), but the FPGA implementation does not support any networking functionality and uses only one FPGA.
4.2 Parallelization using Channel in an FPGA
We describe the structure of the “PE Array” shown in Figure 5. A PE Array consists of PEs and BEs (Boundary Elements),
as shown in Figure 6.
• Our implementation uses a channel-based approach
• Channels are one of Intel's extensions to OpenCL for FPGAs
• They make inter-kernel communication much faster
• No external memory (DDR) access is required
• Lower resource utilization than DDR access
[Figure: Without channels, the source kernel writes to off-chip global memory (DDR4) and the destination kernel reads from it. With channels, the source kernel sends data to the destination kernel through a FIFO channel.]
• The problem space is divided into small blocks
• e.g. (16, 16, 16) → 8 blocks of (8, 8, 8)
• A PE is assigned to each small block
[Figure: A (16x16x16) mesh divided into (8x8x8) blocks. PEs and BEs in the x-y plane are connected by channels (96 bit x2, read and write) that carry ray data.]
• PEs are connected to each other by channels
• PE: Processing Element
• BE: Boundary Element
• The PE and BE kernels are started automatically by the autorun attribute
• Lower control overhead and resource usage because the number of host-controlled kernels decreases
[Performance chart annotations: 4.9x faster; almost equal performance]
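The block decomposition described above, for example a (16, 16, 16) mesh split over a (2, 2, 2) PE array into eight (8, 8, 8) blocks, can be sketched as a coordinate mapping. The function names `locate` and `crosses_pe_boundary` are hypothetical and only illustrate the idea; they are not taken from the FPGA implementation.

```python
from typing import Tuple

Vec3 = Tuple[int, int, int]


def locate(coord: Vec3, mesh: Vec3 = (16, 16, 16), pes: Vec3 = (2, 2, 2)) -> Tuple[Vec3, Vec3]:
    """Map a global mesh coordinate to (PE index, local coordinate inside that PE's block)."""
    block = tuple(m // p for m, p in zip(mesh, pes))  # e.g. (8, 8, 8)
    pe = tuple(c // b for c, b in zip(coord, block))
    local = tuple(c % b for c, b in zip(coord, block))
    return pe, local


def crosses_pe_boundary(current: Vec3, nxt: Vec3, mesh: Vec3 = (16, 16, 16),
                        pes: Vec3 = (2, 2, 2)) -> bool:
    """True if a ray stepping from `current` to `nxt` leaves its PE's block.

    In that case the ray data would have to be handed to a neighboring PE,
    which the design does through a channel.
    """
    return locate(current, mesh, pes)[0] != locate(nxt, mesh, pes)[0]


if __name__ == "__main__":
    print(locate((9, 3, 15)))
    print(crosses_pe_boundary((7, 0, 0), (8, 0, 0)))
```

A ray that stays inside one block is processed by a single PE, and only boundary crossings generate channel traffic.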
Reference
[1] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, and Taisuke Boku, Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming, 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp.479-488, May 2019
[2] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Yuuma Oobata, Taisuke Boku, Makito Abe, Kohji Yoshikawa, and Masayuki Umemura, Accelerating Space Radiative Transfer on FPGA using OpenCL (Accepted), International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2018)
Acknowledgment
This research is a part of the project titled “Development of Computing-Communication Unified Supercomputer in Next Generation” under the program of “Research and Development for Next-Generation Supercomputing Technology” by MEXT. We thank the Intel University Program for providing us with both hardware and software.
Overview
JCAHPC (Joint Center for Advanced HPC), a cooperative organization of the University of Tokyo and the University of
Tsukuba for the joint procurement and operation of the largest-scale supercomputer in Japan, introduced a new
supercomputer system, “Oakforest-PACS”, with 25 PFLOPS peak performance and started its operation on December 1st,
2016. The Oakforest-PACS system was ranked #6 in the TOP500 list of November 2016 with 13.55 PFLOPS of Linpack
performance and was also recognized as Japan's fastest supercomputer. The system is installed in the Kashiwa Research
Complex II building on the Kashiwa-no-Ha campus of the University of Tokyo.
The Oakforest-PACS system has 8,208 compute nodes, each of which consists of the latest version of the Intel Xeon Phi
processor (code name: Knights Landing) and uses Intel Omni-Path Architecture as the high-performance interconnect. The
Oakforest-PACS system is the largest cluster solution with the Knights Landing processor as well as the largest
configuration with Omni-Path Architecture in the world. The system is integrated by Fujitsu Co. Ltd, and its PRIMERGY
server is employed as each compute node. Additionally, the system employs the Lustre shared file system (capacity:
26 PB) and IME (fast file cache system, 940 TB), both of which are provided by DataDirect Networks (DDN).
All the compute nodes and servers, including the login nodes, Lustre servers, and IME servers, are connected by a
full-bisection-bandwidth Fat-Tree interconnection network based on Intel Omni-Path Architecture, which provides highly
flexible job allocation over the nodes and high-performance file access.
Research & Education
The Oakforest-PACS is offered to researchers in Japan and their international collaborators through various types of
programs operated by HPCI under MEXT and through the original supercomputer resource sharing programs of the two
universities.
It is expected to contribute to the dramatic development of new frontiers in various fields of study. The
Oakforest-PACS will also be used for the education and training of students and young researchers. We will continue to
make further social contributions through the operation of the Oakforest-PACS.
System Configuration
[Figure: Omni-Path Architecture (100 Gbps), full-bisection-bandwidth Fat-tree, built from 12 of 768-port Director Switches (source: Intel) and 362 of 48-port Edge Switches (uplink: 24, downlink: 24)]
Compute Nodes: 25 PFLOPS
- CPU: Intel Xeon Phi 7250 (KNL, 68 cores, 1.4 GHz)
- Mem: 16 GB (MCDRAM, 490 GB/sec, effective) + 96 GB (DDR4-2400, 115.2 GB/sec)
- ×8,208 nodes, Fujitsu PRIMERGY CX1640 M1 x 8 nodes inside CX600 M1 (2U)
Parallel File System: 26.2 PB, Lustre Filesystem, DDN ES14KX x10, 500 GB/s
File Cache System: 940 TB, DDN IME14KX x25, 1560 GB/s
Login Node x20, accessed by U. Tsukuba users and U. Tokyo users

Item | Specification
Total peak performance | 25 PFLOPS
Total number of compute nodes | 8,208
Power consumption | 4.2 MW (including cooling)
# of racks | 102
Cooling system (Compute Node) | Type: Warm-water cooling; direct cooling (CPU), rear door cooling (except CPU). Facility: Cooling tower & Chiller
Cooling system (Others) | Type: Air cooling. Facility: PAC
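The headline figures of the Oakforest-PACS configuration can be cross-checked from the per-node values. This is a rough sketch: the 32 double-precision operations per core per cycle (two AVX-512 FMA units with 8 lanes each) is an assumption about the Knights Landing core that does not appear on the poster.

```python
NODES = 8208
CORES_PER_NODE = 68
CLOCK_HZ = 1.4e9
# Assumption: 2 AVX-512 units x 8 double-precision lanes x 2 (fused multiply-add).
DP_FLOPS_PER_CYCLE = 32

EDGE_SWITCHES = 362
DOWNLINKS_PER_EDGE = 24
UPLINKS_PER_EDGE = 24
DIRECTOR_SWITCHES = 12
PORTS_PER_DIRECTOR = 768


def peak_pflops() -> float:
    """Theoretical double-precision peak of all compute nodes in PFLOPS."""
    return NODES * CORES_PER_NODE * CLOCK_HZ * DP_FLOPS_PER_CYCLE / 1e15


def fat_tree_ports() -> dict:
    """Port budget of the two-level fat-tree built from edge and director switches."""
    return {
        "edge_downlinks": EDGE_SWITCHES * DOWNLINKS_PER_EDGE,
        "edge_uplinks": EDGE_SWITCHES * UPLINKS_PER_EDGE,
        "director_ports": DIRECTOR_SWITCHES * PORTS_PER_DIRECTOR,
    }


if __name__ == "__main__":
    print(f"Peak performance: {peak_pflops():.2f} PFLOPS")
    print(fat_tree_ports())
```

Under this assumption the peak comes out at about 25 PFLOPS. The edge downlinks exceed the 8,208 compute nodes, leaving ports for login and file servers, and the equal number of uplinks and downlinks per edge switch is what gives the network its full bisection bandwidth.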
Joint Center for Advanced High Performance Computing
Joint Center for Advanced HPC | http://jcahpc.jp/
TOP 500 #6 (#1 in Japan), HPCG #3 (#2), Green 500 #6 (#2)
@Nov. 2016
IO 500 #1 @Nov. 2017, Jun. 2018
IO-500 BW #1 @Jun. 2019

More Related Content

What's hot

The Square Kilometre Array Science Cases (CosmoAndes 2018)
The Square Kilometre Array Science Cases (CosmoAndes 2018)The Square Kilometre Array Science Cases (CosmoAndes 2018)
The Square Kilometre Array Science Cases (CosmoAndes 2018)Joint ALMA Observatory
 
Eso1437a
Eso1437aEso1437a
Eso1437aGOASA
 
Detection of lyman_alpha_emission_from_a_triply_imaged_z_6_85_galaxy_behind_m...
Detection of lyman_alpha_emission_from_a_triply_imaged_z_6_85_galaxy_behind_m...Detection of lyman_alpha_emission_from_a_triply_imaged_z_6_85_galaxy_behind_m...
Detection of lyman_alpha_emission_from_a_triply_imaged_z_6_85_galaxy_behind_m...Sérgio Sacani
 
Radioastron observations of_the_quasar_3_c273_a_challenge_to_the_brightness_t...
Radioastron observations of_the_quasar_3_c273_a_challenge_to_the_brightness_t...Radioastron observations of_the_quasar_3_c273_a_challenge_to_the_brightness_t...
Radioastron observations of_the_quasar_3_c273_a_challenge_to_the_brightness_t...Sérgio Sacani
 
Young remmants of_type_ia_supernovae_and_their_progenitors_a_study_of_snr_g19_03
Young remmants of_type_ia_supernovae_and_their_progenitors_a_study_of_snr_g19_03Young remmants of_type_ia_supernovae_and_their_progenitors_a_study_of_snr_g19_03
Young remmants of_type_ia_supernovae_and_their_progenitors_a_study_of_snr_g19_03Sérgio Sacani
 
The stelar mass_growth_of_brightest_cluster_galaxies_in_the_irac_shallow_clus...
The stelar mass_growth_of_brightest_cluster_galaxies_in_the_irac_shallow_clus...The stelar mass_growth_of_brightest_cluster_galaxies_in_the_irac_shallow_clus...
The stelar mass_growth_of_brightest_cluster_galaxies_in_the_irac_shallow_clus...Sérgio Sacani
 
Periodic mass extinctions_and_the_planet_x_model_reconsidered
Periodic mass extinctions_and_the_planet_x_model_reconsideredPeriodic mass extinctions_and_the_planet_x_model_reconsidered
Periodic mass extinctions_and_the_planet_x_model_reconsideredSérgio Sacani
 
Exocometary gas in_th_hd_181327_debris_ring
Exocometary gas in_th_hd_181327_debris_ringExocometary gas in_th_hd_181327_debris_ring
Exocometary gas in_th_hd_181327_debris_ringSérgio Sacani
 
The shadow _of_the_flying_saucer_a_very_low_temperature_for_large_dust_grains
The shadow _of_the_flying_saucer_a_very_low_temperature_for_large_dust_grainsThe shadow _of_the_flying_saucer_a_very_low_temperature_for_large_dust_grains
The shadow _of_the_flying_saucer_a_very_low_temperature_for_large_dust_grainsSérgio Sacani
 
A possible carbonrich_interior_in_superearth_55_cancrie
A possible carbonrich_interior_in_superearth_55_cancrieA possible carbonrich_interior_in_superearth_55_cancrie
A possible carbonrich_interior_in_superearth_55_cancrieSérgio Sacani
 
The characterization of_the_gamma_ray_signal_from_the_central_milk_way_a_comp...
The characterization of_the_gamma_ray_signal_from_the_central_milk_way_a_comp...The characterization of_the_gamma_ray_signal_from_the_central_milk_way_a_comp...
The characterization of_the_gamma_ray_signal_from_the_central_milk_way_a_comp...Sérgio Sacani
 
Evidence for the_thermal_sunyaev-zeldovich_effect_associated_with_quasar_feed...
Evidence for the_thermal_sunyaev-zeldovich_effect_associated_with_quasar_feed...Evidence for the_thermal_sunyaev-zeldovich_effect_associated_with_quasar_feed...
Evidence for the_thermal_sunyaev-zeldovich_effect_associated_with_quasar_feed...Sérgio Sacani
 
Ringed structure and_a_gap_at_1_au_in_the_nearest_protoplanetary_disk
Ringed structure and_a_gap_at_1_au_in_the_nearest_protoplanetary_diskRinged structure and_a_gap_at_1_au_in_the_nearest_protoplanetary_disk
Ringed structure and_a_gap_at_1_au_in_the_nearest_protoplanetary_diskSérgio Sacani
 
No large population of unbound or wide-orbit Jupiter-mass planets
No large population of unbound or wide-orbit Jupiter-mass planets No large population of unbound or wide-orbit Jupiter-mass planets
No large population of unbound or wide-orbit Jupiter-mass planets Sérgio Sacani
 
First identification of_direct_collapse_black_holes_candidates_in_the_early_u...
First identification of_direct_collapse_black_holes_candidates_in_the_early_u...First identification of_direct_collapse_black_holes_candidates_in_the_early_u...
First identification of_direct_collapse_black_holes_candidates_in_the_early_u...Sérgio Sacani
 
Magnetic interaction of_a_super_cme_with_the_earths_magnetosphere_scenario_fo...
Magnetic interaction of_a_super_cme_with_the_earths_magnetosphere_scenario_fo...Magnetic interaction of_a_super_cme_with_the_earths_magnetosphere_scenario_fo...
Magnetic interaction of_a_super_cme_with_the_earths_magnetosphere_scenario_fo...Sérgio Sacani
 
Cold clumpy accretion_toward_an_active_supermasive_black_hole
Cold clumpy accretion_toward_an_active_supermasive_black_holeCold clumpy accretion_toward_an_active_supermasive_black_hole
Cold clumpy accretion_toward_an_active_supermasive_black_holeSérgio Sacani
 
Shock breakout and_early_light_curves_of_type_ii_p_supernovae_observed_with_k...
Shock breakout and_early_light_curves_of_type_ii_p_supernovae_observed_with_k...Shock breakout and_early_light_curves_of_type_ii_p_supernovae_observed_with_k...
Shock breakout and_early_light_curves_of_type_ii_p_supernovae_observed_with_k...Sérgio Sacani
 
The open cluster_ngc6520_and_the_nearby_dark_molecular_cloud_barnard_86
The open cluster_ngc6520_and_the_nearby_dark_molecular_cloud_barnard_86The open cluster_ngc6520_and_the_nearby_dark_molecular_cloud_barnard_86
The open cluster_ngc6520_and_the_nearby_dark_molecular_cloud_barnard_86Sérgio Sacani
 
Distances luminosities and_temperatures_of_the_coldest_known_substelar_objects
Distances luminosities and_temperatures_of_the_coldest_known_substelar_objectsDistances luminosities and_temperatures_of_the_coldest_known_substelar_objects
Distances luminosities and_temperatures_of_the_coldest_known_substelar_objectsSérgio Sacani
 

What's hot (20)

The Square Kilometre Array Science Cases (CosmoAndes 2018)
The Square Kilometre Array Science Cases (CosmoAndes 2018)The Square Kilometre Array Science Cases (CosmoAndes 2018)
The Square Kilometre Array Science Cases (CosmoAndes 2018)
 
Eso1437a
Eso1437aEso1437a
Eso1437a
 
Detection of lyman_alpha_emission_from_a_triply_imaged_z_6_85_galaxy_behind_m...
Detection of lyman_alpha_emission_from_a_triply_imaged_z_6_85_galaxy_behind_m...Detection of lyman_alpha_emission_from_a_triply_imaged_z_6_85_galaxy_behind_m...
Detection of lyman_alpha_emission_from_a_triply_imaged_z_6_85_galaxy_behind_m...
 
Radioastron observations of_the_quasar_3_c273_a_challenge_to_the_brightness_t...
Radioastron observations of_the_quasar_3_c273_a_challenge_to_the_brightness_t...Radioastron observations of_the_quasar_3_c273_a_challenge_to_the_brightness_t...
Radioastron observations of_the_quasar_3_c273_a_challenge_to_the_brightness_t...
 
Young remmants of_type_ia_supernovae_and_their_progenitors_a_study_of_snr_g19_03
Young remmants of_type_ia_supernovae_and_their_progenitors_a_study_of_snr_g19_03Young remmants of_type_ia_supernovae_and_their_progenitors_a_study_of_snr_g19_03
Young remmants of_type_ia_supernovae_and_their_progenitors_a_study_of_snr_g19_03
 
The stelar mass_growth_of_brightest_cluster_galaxies_in_the_irac_shallow_clus...
The stelar mass_growth_of_brightest_cluster_galaxies_in_the_irac_shallow_clus...The stelar mass_growth_of_brightest_cluster_galaxies_in_the_irac_shallow_clus...
The stelar mass_growth_of_brightest_cluster_galaxies_in_the_irac_shallow_clus...
 
Periodic mass extinctions_and_the_planet_x_model_reconsidered
Periodic mass extinctions_and_the_planet_x_model_reconsideredPeriodic mass extinctions_and_the_planet_x_model_reconsidered
Periodic mass extinctions_and_the_planet_x_model_reconsidered
 
Exocometary gas in_th_hd_181327_debris_ring
Exocometary gas in_th_hd_181327_debris_ringExocometary gas in_th_hd_181327_debris_ring
Exocometary gas in_th_hd_181327_debris_ring
 
The shadow _of_the_flying_saucer_a_very_low_temperature_for_large_dust_grains
The shadow _of_the_flying_saucer_a_very_low_temperature_for_large_dust_grainsThe shadow _of_the_flying_saucer_a_very_low_temperature_for_large_dust_grains
The shadow _of_the_flying_saucer_a_very_low_temperature_for_large_dust_grains
 
A possible carbonrich_interior_in_superearth_55_cancrie
A possible carbonrich_interior_in_superearth_55_cancrieA possible carbonrich_interior_in_superearth_55_cancrie
A possible carbonrich_interior_in_superearth_55_cancrie
 
The characterization of_the_gamma_ray_signal_from_the_central_milk_way_a_comp...
The characterization of_the_gamma_ray_signal_from_the_central_milk_way_a_comp...The characterization of_the_gamma_ray_signal_from_the_central_milk_way_a_comp...
The characterization of_the_gamma_ray_signal_from_the_central_milk_way_a_comp...
 
Evidence for the_thermal_sunyaev-zeldovich_effect_associated_with_quasar_feed...
Evidence for the_thermal_sunyaev-zeldovich_effect_associated_with_quasar_feed...Evidence for the_thermal_sunyaev-zeldovich_effect_associated_with_quasar_feed...
Evidence for the_thermal_sunyaev-zeldovich_effect_associated_with_quasar_feed...
 
Ringed structure and_a_gap_at_1_au_in_the_nearest_protoplanetary_disk
Ringed structure and_a_gap_at_1_au_in_the_nearest_protoplanetary_diskRinged structure and_a_gap_at_1_au_in_the_nearest_protoplanetary_disk
Ringed structure and_a_gap_at_1_au_in_the_nearest_protoplanetary_disk
 
No large population of unbound or wide-orbit Jupiter-mass planets
No large population of unbound or wide-orbit Jupiter-mass planets No large population of unbound or wide-orbit Jupiter-mass planets
No large population of unbound or wide-orbit Jupiter-mass planets
 
First identification of_direct_collapse_black_holes_candidates_in_the_early_u...
First identification of_direct_collapse_black_holes_candidates_in_the_early_u...First identification of_direct_collapse_black_holes_candidates_in_the_early_u...
First identification of_direct_collapse_black_holes_candidates_in_the_early_u...
 
Magnetic interaction of_a_super_cme_with_the_earths_magnetosphere_scenario_fo...
Magnetic interaction of_a_super_cme_with_the_earths_magnetosphere_scenario_fo...Magnetic interaction of_a_super_cme_with_the_earths_magnetosphere_scenario_fo...
Magnetic interaction of_a_super_cme_with_the_earths_magnetosphere_scenario_fo...
 
Cold clumpy accretion_toward_an_active_supermasive_black_hole
Cold clumpy accretion_toward_an_active_supermasive_black_holeCold clumpy accretion_toward_an_active_supermasive_black_hole
Cold clumpy accretion_toward_an_active_supermasive_black_hole
 
Shock breakout and_early_light_curves_of_type_ii_p_supernovae_observed_with_k...
Shock breakout and_early_light_curves_of_type_ii_p_supernovae_observed_with_k...Shock breakout and_early_light_curves_of_type_ii_p_supernovae_observed_with_k...
Shock breakout and_early_light_curves_of_type_ii_p_supernovae_observed_with_k...
 
The open cluster_ngc6520_and_the_nearby_dark_molecular_cloud_barnard_86
The open cluster_ngc6520_and_the_nearby_dark_molecular_cloud_barnard_86The open cluster_ngc6520_and_the_nearby_dark_molecular_cloud_barnard_86
The open cluster_ngc6520_and_the_nearby_dark_molecular_cloud_barnard_86
 
Distances luminosities and_temperatures_of_the_coldest_known_substelar_objects
Distances luminosities and_temperatures_of_the_coldest_known_substelar_objectsDistances luminosities and_temperatures_of_the_coldest_known_substelar_objects
Distances luminosities and_temperatures_of_the_coldest_known_substelar_objects
 

PCCC20: Center for Computational Sciences, University of Tsukuba, "Latest Research Results from Multidisciplinary Computational Science"

  • 1. University of Tsukuba | Center for Computational Sciences
    https://www.ccs.tsukuba.ac.jp/

    Mission of CCS

    The CCS promotes "multidisciplinary computational science" based on the fusion of computational science and computer science. To this end, the CCS develops high-performance computing systems through "co-design". Its research areas cover particle physics, astrophysics, nuclear physics, nano-science, life science, environmental science, and information science.

    The CCS was reorganized in April 2004 from its predecessor, the Center for Computational Physics, which was established in 1992. The CCS is both a research institute for the fields listed above and a joint-use facility for outside researchers. Since 2010, the CCS has been approved by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) as a national core center, the Advanced Interdisciplinary Computational Science Collaboration Initiative (AISCI). The CCS aims to play a significant role in the development of multidisciplinary computational science.

    Chronology and Major Events

    1992: Foundation of the Center for Computational Physics (CCP)
    1996: Completion of CP-PACS, a 0.6 TFLOPS MPP ranked No. 1 on the Top 500 in Nov. 1996
    2002: Completion of HMCS (Heterogeneous Multi-Computer System), an 8.6 TFLOPS coupled CP-PACS/GRAPE-6 system
    2004: Reorganization and expansion of CCP, renamed Center for Computational Sciences (CCS)
    2006: Two major new computing facilities start operation: PACS-CS, a general-purpose 14.3 TFLOPS MPP cluster for computational sciences, and FIRST, an HMCS-E for astrophysical simulations (general-purpose 3.5 TFLOPS + gravity 35 TFLOPS)
    2008: Completion of the T2K-Tsukuba system, a 95.4 TFLOPS cluster ranked No. 20 on the Top 500 in Jun. 2008
    2012: HA-PACS Base Cluster is delivered with 802 TFLOPS of peak performance, ranked No. 41 on the Top 500 in Jun. 2012
    2013: HA-PACS/TCA is added to the HA-PACS system with 364 TFLOPS of peak performance in Oct. 2013, expanding the total peak performance of the HA-PACS system to over 1.1 PFLOPS. The Joint Center for Advanced HPC (JCAHPC) is established in alliance with the University of Tokyo
    2014: COMA (PACS IX) is delivered with 1.001 PFLOPS of peak performance, ranked No. 51 on the Top 500 in Jun. 2014
    2016: Oakforest-PACS is installed and starts operation at JCAHPC
    2019: Cygnus is installed and starts operation

    Timeline figure labels: CP-PACS, FIRST-Cluster, PACS-CS, T2K-Tsukuba, HA-PACS, COMA, Oakforest-PACS; Current Supercomputers: Cygnus
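The "over 1.1 PFLOPS" total quoted for HA-PACS can be checked directly from the two peak figures given in the chronology. The short Python sketch below only adds the quoted numbers and converts units; the function and variable names are illustrative and do not come from the slides.

```python
# Peak-performance figures quoted in the chronology, in TFLOPS.
HA_PACS_BASE_TFLOPS = 802.0  # HA-PACS Base Cluster (2012)
HA_PACS_TCA_TFLOPS = 364.0   # HA-PACS/TCA extension (Oct. 2013)


def total_peak_pflops(*parts_tflops: float) -> float:
    """Sum component peak figures given in TFLOPS and convert to PFLOPS."""
    return sum(parts_tflops) / 1000.0


ha_pacs_total = total_peak_pflops(HA_PACS_BASE_TFLOPS, HA_PACS_TCA_TFLOPS)
print(f"HA-PACS total peak: {ha_pacs_total:.3f} PFLOPS")
```

The sum is 1166 TFLOPS, which is consistent with the slide's statement that the combined system exceeds 1.1 PFLOPS.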
  • 2. Research in Particle Physics
    contact address: pr@ccs.tsukuba.ac.jp

    2+1 flavor QCD at Physical Point on very large lattices (master-field simulations)

    Hadrons are the constituents of atomic nuclei. Computing the hadron mass spectrum from first principles of quantum chromodynamics (QCD), the fundamental theory of the strong interaction described in terms of quarks and gluons, is a principal subject in particle physics. After the quenched and subsequent 2 flavor QCD simulations on CP-PACS, these studies were extended to 2+1 flavor QCD by incorporating a dynamical strange quark, although the degenerate up-down quark mass was still much heavier than the physical one. On the PACS-CS and T2K computers, we succeeded in reaching the physical point. That calculation was followed by a larger-volume simulation on the K computer. Our current project aims to control and remove the systematic errors that the previous simulations incurred from the finite volume and finite lattice spacing. We are performing so-called master-field simulations on a very large (10 fm)^4 volume with several lattice spacings using Oakforest-PACS.

    Fig. 1a: Relative difference of the light hadron spectrum from experiment. The only inputs are the pion, kaon, and omega baryon masses, which determine the up-down quark mass, the strange quark mass, and the lattice cutoff, respectively. Our results agree well with experiment, although the errors are still not small for some of the hadrons. [K.-I. Ishikawa et al., https://arxiv.org/abs/1511.09222]

    Fig. 1b: A comparison of the pseudoscalar decay constants, f_π and f_K, on (10 fm)^4 and (5 fm)^4 volumes. We detect finite volume effects of 0.66% and 0.26% on f_π and f_K, respectively. The effect is very small and negligible for comparison with the corresponding experiments. We can now control and remove the finite volume effect completely by using master-field simulations. [K.-I. Ishikawa et al., Phys. Rev. D 99, 014504]

    Exploring QCD phase diagram

    Investigating the phase structure of QCD at non-zero temperature and density is essential for understanding the properties of strongly interacting matter under extreme conditions. The order of the phase transition is known to depend on the masses and the number of flavors of the quarks, and there should be so-called critical endlines, lines of second-order phase transitions, in the space of quark masses, as shown in Fig. 2a. To determine the shape of the critical endline in the small quark mass region, we are carrying out finite-temperature lattice QCD simulations with 2+1 flavors as well as with 3 degenerate quark flavors on Cygnus and Oakforest-PACS. Fig. 2b shows our recent estimate of the critical pion mass in 3 flavor QCD in the continuum limit, including a new calculation with a temporal lattice extent of 12; the new result gives a smaller upper bound than our previous calculation.

    Fig. 2a: Expected quark mass dependence of the order of the QCD phase transition. Our goal is to determine the shape of the critical endline, shown as a red curve in the lower-left corner.

    Fig. 2b: Our recent estimate of the critical pion mass, m_{π,E}, in 3 flavor QCD. The continuum extrapolation including new data sets with a temporal extent of 12 gives an upper bound m_{π,E} ≲ 110 MeV. [Y. Kuramashi et al., Phys. Rev. D 101, 054509]
  • 3. Solving the Mysteries of the Universe with Computational Astrophysics
    contact address: ayw@ccs.tsukuba.ac.jp / pr@ccs.tsukuba.ac.jp
    Vlasov-Poisson simulation of cosmic neutrinos in the large-scale structure of the universe
    Neutrinos are elementary particles ubiquitous in the universe. The Super-Kamiokande experiment revealed that neutrinos have mass, which implies that neutrinos can dynamically affect the formation of large-scale structure (LSS) in the universe. We perform numerical simulations of LSS formation incorporating the effect of massive neutrinos by directly solving the collisionless Boltzmann equation in 6D phase space on two supercomputers, FUGAKU and Oakforest-PACS. Our highly optimized simulation code achieves almost ideal weak and strong scaling on FUGAKU.
    Yoshikawa, K., Tanaka, S., Yoshida, N. & Saito, S. (2020) accepted for publication in ApJ.
    Fig. 1a: Simulated distributions of massive neutrinos (color scale) and dark matter (contours) as well as dark matter halos (white circles) at a) redshift z = 0 (the present), and b) redshift z = 1 (about 7.9 Gyr ago).
    Fig. 1b: Strong scaling of VLASOV simulations on the supercomputer FUGAKU. Run ID prefixes S, M, and L denote grid resolutions of 96³, 192³, and 384³, respectively, and the number denotes the number of computational nodes in multiples of 144.
    Theoretical galaxy formation: numerical simulations reveal the fate of stars and gas
    When a cluster of stars forms, only a part of the natal cloud is converted into stars, and the rest is ionized and heated by the powerful stellar radiation and ejected outward. Using radiation-hydrodynamic simulations, we found that star formation is primarily controlled by the formation of ionized regions, as well as the surface density and dust content of the natal cloud. We developed a new semi-analytic model that captures this behaviour and can be incorporated in subgrid recipes for large-scale cosmological simulations.
    Fukushima, Yajima, et al. (2020), MNRAS, 497, 3830
    We devise a physical model to determine the formation, distribution, and kinematics of molecular gas clouds in galaxies, and predict the intensities of carbon monoxide (CO) lines and the molecular hydrogen (H2) abundance, taking into account the interstellar radiation field and dust attenuation. We apply the model to data from the Illustris-TNG cosmological simulations and compare the CO luminosities and H2 masses with recent observations of galaxies at low and high redshifts. The model successfully reproduces the observed CO-luminosity function and the total H2 mass in the local universe.
    Inoue, S., Yoshida, N. & Yajima, H. (2020) accepted for publication in MNRAS
    Fig. 2a: The structure of the five brightest galaxies in CO(1-0) in the simulation.
    Fig. 2b: Density evolution in the formation of star clusters. White circles indicate stars and the green contours bound ionization regions.
    [Fig. 2 panels a) and b); scale bar: 100 kpc]
  • 4. Computational Nuclear Physics
    contact address: nakatsukasa@nucl.ph.tsukuba.ac.jp
    Are “free neutrons” in neutron stars free?
    Although nuclei are microscopic objects on Earth, the universe contains a gigantic nucleus: the neutron star (Fig. 1). Near the surface of a neutron star, a periodic crystalline structure is formed and all the protons are expected to be confined. In contrast, there are unbound neutrons, which are regarded as “free”. These free neutrons play a key role in various observed phenomena, such as pulsar glitches and cooling.
    We have examined the properties of the “free neutrons” with nuclear density functional calculations. Surprisingly, in a certain density region they are even “super-free”, which means that their mass is lighter in the neutron star than in the vacuum (Fig. 2)!
    This research was supported by the ImPACT project on Reduction and Resource Recycling of High-level Radioactive Wastes through Nuclear Transmutation.
    Fig. 1: Structure of a neutron star. Courtesy of http://www.astroscu.unam.mx/neutrones/
    Fig. 2: Ratio of the effective mass of free neutrons in the neutron-star crust (slab phase) to their bare mass. [Plot axes: m*/mn from 0.6 to 1.1 versus ρ [fm^-3] from 0 to 0.1]
    Interactive Plot of Atomic nuclei and Computed Shapes (InPACS)
    Measuring nuclear properties with accelerators is very expensive. The data obtained are valuable for a wide range of technologies, so they are compiled by nuclear data centers around the world and made open to the public. We have calculated almost all kinds of nuclides in the universe using energy density functional theory. The computation complements missing experimental data. To publicize the computed nuclear data, we have opened a web site, InPACS, where you can interactively obtain various nuclear data and information.
    Fig. 3: Snapshot of the InPACS web site.
  • 5. Quantum Condensed Matter Physics
    Optical Properties of Nano-materials in Real Time and Real Space
    contact address: pr@ccs.tsukuba.ac.jp
    SALMON: Scalable Ab-initio Light-Matter simulator for Optics and Nanoscience
    Understanding the interaction between light and matter is the basis of a wide range of technologies. For this purpose, it is essential to describe the electron dynamics in matter induced by the electromagnetic fields of light on a microscopic scale: 10^-9 m (nanometer) in space and 10^-15 s (femtosecond) in time. We have been developing an open-source computer code, SALMON (Scalable Ab-initio Light-Matter simulator for Optics and Nanoscience), that describes electron dynamics in molecules, nano-materials, and solids based on first-principles time-dependent density functional theory [http://salmon-tddft.jp]. As a novel function of SALMON, light propagation in nano-materials as well as in bulk media can be described taking full account of the nonlinearity and nonlocality of light-matter interactions at the ab-initio level. We expect SALMON to be widely used in cutting-edge research in optics and nanoscience.
    (a) A multiphysics simulation solving the Maxwell, time-dependent Kohn-Sham, and Newton equations is performed on the Fugaku system for a thin film of amorphous SiO2 composed of more than 10,000 atoms.
    (b) Weak-scaling performance on the Fugaku system using up to 27,648 nodes to simulate 13,648 atoms.
    Disclaimer: The results obtained on the evaluation environment in the trial phase do not guarantee the performance, power and other attributes of the supercomputer Fugaku at the start of its public use operation.
    (a) Optical near-field generated in metal-organic framework, IRMOF-10
    When a light pulse irradiates nano-sized objects, a strong and spatially localized electromagnetic field, called the near field, appears around the object. The near field enables imaging beyond the limit of optical resolution and enhances nonlinear optical processes. We perform first-principles calculations of the photoexcitation dynamics of an acetylene molecule in a metal-organic framework, IRMOF-10. Resonant laser excitation of the IRMOF-10 generates an optical near field around the two benzene rings that comprise the main framework of the IRMOF-10. The second harmonic excitation caused by the spatial nonuniformity of the optical near field is observed.
    [Figure labels: laser field with I = 10^10 W/cm^2 and ω = 3.38 eV; intensity scale I (W/cm^2) from 10^9 to 10^12]
    (b) Optical property of metallic metasurface with sub-nm gaps
    Thanks to rapid progress in the fabrication of nano-materials, it is possible to manufacture periodic materials composed of uniformly structured nano-objects. Here we investigate the optical properties of quantum plasmonic metasurfaces composed of two-dimensional arrays of metallic nano-spheres with sub-nm gaps using time-dependent density functional theory (TDDFT), a fully quantum mechanical approach. When the quantum and classical descriptions are compared, the absorption rates of the metasurface exhibit substantial differences at shorter gap distances. The differences are caused by electron transport through the gaps between the nano-objects.
    [Figure labels: Absorption rates (Energy, Gap distances; Classical, TDDFT); Current distribution (Re, Im; x, y; 0.4 nm)]
  • 6. Computational Elucidations for Biomolecules
    contact address: shigeta@ccs.tsukuba.ac.jp
    MD and QM/MM simulations using supercomputers
    The world of life is full of mystery. The actual molecular structures, motions and chemical reactions of biological molecules such as proteins, nucleic acids, carbohydrates and lipids are still unclear. Using supercomputers, we have performed highly demanding computations based on molecular dynamics (MD) and hybrid quantum mechanics/molecular mechanics (QM/MM) methods, and we are answering some important biological questions.
    Fig. 2: (a) Efficient conformational sampling in MD simulations: Parallel Cascade Selection MD (PaCS-MD). To promote the conformational transition, the following cycle is repeated in PaCS-MD: (I) selection of initial seeds (structures) that have a high potential to make the transition; (II) conformational resampling by restarting multiple MD simulations from the selected initial seeds. [R. Harada et al., J. Chem. Phys. 139, 035103 (2013)] (b) QM/MM model of the oxygen evolving complex in photosystem II. Key intermediate states in the catalytic reaction “2H2O + 4hν -> 4H+ + 4e– + O2” have been investigated using the large model. [M. Shoji et al., Catal. Sci. Technol., 3, 1831 (2013).]
    [Fig. 2 labels: (a) resampling criteria; (b) 2H2O, 4H+, O2, QM region, CaMn4O5 cluster]
    GPU-accelerated Molecular Orbital Calculation
    Large-scale ab initio molecular orbital calculation is a target application in quantum chemistry for HPC systems, and the fragment molecular orbital (FMO) method is one such application because it is designed for parallel computers. We have developed a GPU-accelerated FMO calculation program with CUDA and obtained a 3.8x speedup over the CPU for an on-the-fly FMO calculation of a 1,961-atom protein. [H. Umeda et al., IPSJ Transactions on Advanced Computing Systems 6, 4, (2013) 26-37. H. Umeda et al., SC15 poster (2015).]
    Fig. 1: (a) FMO calculation scheme, in which a large molecule is divided into many small fragments. Total molecular properties are reconstructed from the self-consistent field (SCF) calculations of fragments and fragment pairs with an electrostatic potential (ESP) that satisfies the self-consistent-charge (SCC) condition. (b) Performance of GPU-accelerated FMO calculations. The GPU-accelerated FMO-HF/6-31G(d) calculation of lysozyme on the HA-PACS base cluster shows a 3.8x speedup. (c) As a large-scale MO application, the FMO-HF/6-31G(d) calculation of the Influenza HA3 protein is successfully performed with 256 GPUs within two hours.
    [Fig. 1(a) labels: Divides into fragments; SCF calc. for each fragment with ESP (SCC); Dimer SCF or ES-Dimer calc. for each fragment-pair]
    [Fig. 1(c) labels: 2 Hours for FMO calculation with 256 GPUs; Influenza HA3 protein (23,460 atoms, 721 fragments)]
    Application      Lysozyme    Lysozyme    Speedup    HA3
    #Atoms           1,961       1,961                  23,460
    #Nodes (#GPU)    8 (0)       8 (32)                 64 (256)
    SCC              3,071 s     828 s       3.7x       0.52 hr
    Dimer SCF        6,246 s     1,675 s     3.7x       0.90 hr
    ES Dimer         407 s       78 s        5.2x       0.45 hr
    Total            9,770 s     2,597 s     3.8x       1.97 hr
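The speedup factors quoted for the lysozyme FMO calculation follow directly from the CPU-only and GPU timings reported above. A minimal Python sketch of that arithmetic is given below; the dictionary and function names are illustrative choices, not part of the FMO program, and only the timings themselves come from the slide.

```python
# Timings in seconds for the lysozyme runs, copied from the slide:
# 8 nodes without GPUs versus 8 nodes with 32 GPUs.
cpu_seconds = {"SCC": 3071, "Dimer SCF": 6246, "ES Dimer": 407, "Total": 9770}
gpu_seconds = {"SCC": 828, "Dimer SCF": 1675, "ES Dimer": 78, "Total": 2597}


def speedup(cpu: float, gpu: float) -> float:
    """Return the CPU-to-GPU speedup factor, rounded to one decimal place."""
    return round(cpu / gpu, 1)


# One speedup factor per calculation step, plus the overall total.
speedups = {step: speedup(cpu_seconds[step], gpu_seconds[step]) for step in cpu_seconds}

for step, factor in speedups.items():
    print(f"{step}: {factor}x")
```

The ES Dimer step benefits most from the GPUs, while the total is dominated by the SCC and Dimer SCF steps, which is why the overall factor stays close to theirs.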
  • 7. Biological Sciences
    contact address: yuji@ccs.tsukuba.ac.jp
    338-gene analyses resolved the phylogenetic affiliation of a microeukaryote Microheliella maris.
    Previously published phylogenetic studies failed to elucidate the phylogenetic position of the heliozoan microeukaryote Microheliella maris. We therefore took a “phylogenomic” approach to place M. maris accurately in the global tree of eukaryotes. In the phylogeny inferred from an alignment containing 338 genes, M. maris branched with high statistical support at the base of the clade of a diverse collection of microeukaryotes collectively called Cryptista.
    Fig. 1a: Schematic cell drawing of Microheliella maris.
    Fig. 1b: Maximum likelihood phylogeny inferred from the 338-gene alignment.
    In silico structural modeling and analysis of translation elongation factor 1α proteins
    Translation elongation factor-1α (EF-1α) interacts with tRNA during protein synthesis. Some eukaryotes appear to possess highly divergent EF-1α (divEF-1α), implying that these proteins lack the ability to interact with tRNA. We modelled the tertiary structures of divEF-1α and validated the model structures by molecular dynamics simulations. We found that the molecular surfaces of divEF-1α are partly negatively charged, suggesting that they may not interact with negatively charged tRNA as strongly as the canonical EF-1α, whose surfaces are positively charged.
    Sakamoto et al. 2019 ACS Omega 4:7308-7316
    Fig. 2: EF-1α and tRNA structures and surface electrostatic distribution. (a) EF-1α structure of an archaeon (PDB ID: 3WXM). (b) tRNA structure (PDB ID: 1EHZ). (c & d) divEF-1α models. Dotted lines in (a), (c) and (d) indicate the surfaces interacting with tRNA.
    [Fig. 2 labels: (a) Canonical EF-1α; (b) tRNA; (c) divEF-1α in a diatom; (d) divEF-1α in a fungus; Surface interacting with tRNA; color scale from -0.1 V to +0.1 V]
  • 8. Division of Global Environmental Sciences
    contact address: pr@ccs.tsukuba.ac.jp
    Simulation of Atmospheric General Circulation by Global Cloud Resolving Model, NICAM
    NICAM (Nonhydrostatic ICosahedral Atmospheric Model) can realistically reproduce multi-scale cloud systems such as cumulus convection, tropical cyclones, Arctic cyclones, the Madden-Julian Oscillation (MJO), and the Intertropical Convergence Zone (ITCZ). In Fig. 1, NICAM with glevel-10 (7-km horizontal resolution) simulates Typhoon Sinlaku near the Philippine Islands and Hurricane Ike near the Gulf of Mexico well.
    Fig. 1: Numerical simulation of the general circulation of the atmosphere produced by 7-km resolution NICAM.
    Development of LES Model for thermal environment at city scale
    Our group has been developing a Large Eddy Simulation (LES) model for the urban environment. The main features of the model are (i) resolving buildings, (ii) resolving roadside trees in the vertical direction, (iii) treating multiple reflections of short- and long-wave radiation between buildings and trees by the radiosity method, (iv) resolving shadows from buildings and trees, and (v) incorporating cloud physics and atmospheric radiation models. A numerical simulation of the thermal environment around Tokyo Station was conducted using the Oakforest-PACS supercomputer. The total number of grid points is about 100 million.
    Fig. 2: Road skin temperature distribution estimated by the CCS-LES model (2a) and by helicopter observation (2b). Black indicates buildings. [Figure labels: temperature scale in ℃; Tokyo Station]
    Hurricane forecast using an operational numerical weather prediction model
    ECMWF OpenIFS: an easy-to-use version of the Integrated Forecasting System (IFS) operated at ECMWF (European Centre for Medium-range Weather Forecasts).
    ・Hydrostatic global spectral model (max resolution T1279: about 14 km grid interval)
    ・Reduced Gaussian Grid
    ・Hybrid MPI-OpenMP scheme (Non-GPU, Non-FPGA)
    Results: forecast of Hurricane Joaquin (2015)
    Experimental settings
    Version: cy40r1 (ECMWF, 2014), the operational version from 19 Nov. 2013 to 11 May 2015
    Initial condition: Atmosphere: GFS high-res analysis; Land & Sea: ERA5 reanalysis
    Model resolution: T639 L91 (32 km grid spacing on the equator and 91 vertical levels)
    Forecast length: 240 hours (960 time steps with dt = 900 s)
    Computer parallelisation: 256 MPI procs (16 nodes * 16 procs/node), 4 OpenMP threads/process
    Computation time: 3:12:38 (19 minutes for a 1-day forecast)
    Data size of output: 9.9 GB
    Computation time has decreased by 40% with the Intel MKL library in comparison with LAPACK.
    Remark: The experimental result showed a cyclone track similar to the NCEP control forecasts (thick line), suggesting that the initial conditions had a larger impact on the track forecast than the choice of NWP model in this case.
    Fig. 3: Predicted cyclone tracks of Hurricane Joaquin (coloured lines) by ECMWF (Europe, left), the OpenIFS experiment (second left), NCEP (US, second right) and JMA (Japan, rightmost). Black lines show the observed track.
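The experimental settings of the OpenIFS run are internally consistent, which can be checked with a few lines of arithmetic. The Python sketch below is only a sanity check on the numbers quoted on the slide; the variable names are illustrative, and the assumption that each OpenMP thread occupies one core is mine, not the slide's.

```python
# Values quoted in the experimental settings.
forecast_hours = 240
dt_seconds = 900
nodes, procs_per_node, threads_per_proc = 16, 16, 4
wall_clock = (3, 12, 38)  # hours, minutes, seconds

# Number of model time steps needed to cover the forecast length.
time_steps = forecast_hours * 3600 // dt_seconds

# MPI ranks and total threads (one core per thread is an assumption).
mpi_procs = nodes * procs_per_node
total_threads = mpi_procs * threads_per_proc

# Wall-clock cost per simulated day, in minutes.
wall_seconds = wall_clock[0] * 3600 + wall_clock[1] * 60 + wall_clock[2]
forecast_days = forecast_hours / 24
minutes_per_day = wall_seconds / forecast_days / 60

print(time_steps, mpi_procs, total_threads, round(minutes_per_day, 1))
```

The step count and process count reproduce the 960 time steps and 256 MPI processes on the slide, and the per-day cost rounds to the quoted 19 minutes.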
  • 9. Numerical Computation
    contact address: pr@ccs.tsukuba.ac.jp
    Implementation of Parallel 3-D Real FFT with 2-D Decomposition on Intel Xeon Phi Clusters
    Background: The fast Fourier transform (FFT) is an algorithm that is widely used in science and engineering. A typical decomposition for performing a parallel 3-D FFT is slabwise. This becomes an issue with very large MPI process counts on a massively parallel cluster of many-core processors, since the number of slabs bounds the number of MPI processes that can be used.
    Overview: We proposed an implementation of a parallel 3-D real FFT with 2-D decomposition on Intel Xeon Phi clusters. The proposed implementation is based on the conjugate symmetry property of the discrete Fourier transform (DFT) and the row-column FFT algorithm. We vectorized the FFT kernels using the Intel Advanced Vector Extensions 512 (Intel AVX-512) instructions.
    Performance: To evaluate the implemented 3-D real FFT with 2-D decomposition, referred to as FFTE 7.0 (2-D decomposition), we compared its performance with that of FFTE 7.0 (1-D decomposition), FFTW 3.3.8 and P3DFFT 2.7.7. The results demonstrate that the proposed implementation effectively improves performance by reducing the communication time for larger numbers of MPI processes on Intel Xeon Phi clusters.
    Fig. 1: Performance of Parallel 3-D Real FFTs (N = 256 × 512 × 512)
    Development of the highly accurate Block Krylov solver
    Linear systems with multiple right-hand sides appear in many scientific applications, such as the computation of physical quantities in lattice Quantum Chromodynamics (QCD) and the inner problems of eigensolvers for sparse matrices. Block Krylov subspace methods are known to be efficient for solving these linear systems in terms of both the number of iterations and the computation time. However, the accuracy of the obtained solution often deteriorates because of errors that occur in the computation of matrix-matrix multiplications. To improve the accuracy of the obtained solution, we have developed a new Block Krylov subspace method named the Block GWBiCGSTAB method [1]. The Block GWBiCGSTAB method is based on the group-wise updating technique. By using this technique, the matrix-matrix multiplications that cause accuracy degradation can be avoided. As shown in Fig. 2, the accuracy of the solution generated by the Block GWBiCGSTAB method is higher than that of the other methods.
    Fig. 2: True relative residual norm as a function of the number L of right-hand sides. The test problem is the linear system derived from the lattice QCD calculation. Problem size: 1,572,864. [Figure annotation: Better]
    [1] Hiroto Tadano and Ryosei Kuramoto, Accuracy improvement of the Block BiCGSTAB method for linear systems with multiple right-hand sides by group-wise updating technique, J. Adv. Simulat. Sci. Eng., Vol. 6, No. 1, pp. 100-117, 2019.
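The conjugate symmetry property that the parallel 3-D real FFT on this slide relies on can be checked numerically. The following is a minimal Python sketch, not FFTE code: it uses a naive O(N^2) DFT and a made-up 8-point real signal (both assumptions chosen for illustration) to show that for real input X[N-k] equals the complex conjugate of X[k], so only N/2+1 coefficients need to be stored.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform: X[k] = sum_j x[j] * exp(-2*pi*i*j*k/n)."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


# Hypothetical real-valued input signal (illustrative values only).
signal = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.0, 1.5]
n = len(signal)
spectrum = dft(signal)

# Conjugate symmetry for real input: X[(n - k) mod n] == conj(X[k]).
max_err = max(
    abs(spectrum[(n - k) % n] - spectrum[k].conjugate()) for k in range(n)
)

# The non-redundant half of the spectrum: indices 0 .. n/2.
half = spectrum[: n // 2 + 1]

print(f"largest symmetry violation: {max_err:.2e}")
print(f"stored coefficients: {len(half)} of {n}")
```

A real-to-complex FFT exploits exactly this redundancy along one axis, which roughly halves the data that the remaining row-column passes have to process.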
  • 10. Software Researches for Big Data and Extreme-Scale Computing
    contact address: pr@ccs.tsukuba.ac.jp
    Gfarm/BB: Gfarm File System for Node-local burst buffer
    http://oss-tsukuba.org/en/software/gfarm
    Features include
    •Open source
    •Exploit local storage and data locality for scalable I/O performance
    •InfiniBand support
    •Data integrity is supported for silent data corruption
    •Production systems: 8PB JLDG, 100PB HPCI Storage, etc.
    gfarmbb -h hostfile -m mount_point start
    …
    gfarmbb -h hostfile stop
    Fig. 1: IOR file-per-process read/write performance on Cygnus supercomputer
    Accelerating Python Applications with Persistent Memory
    Python is one of the most popular general-purpose programming languages, and persistent memory (PMEM) is a new device that can accelerate data-intensive computing. There is a strong demand for easy use of persistent memory from Python. We therefore focus on pmemkv, a key-value store optimized for persistent memory, and its Python bindings. We are currently evaluating pmemkv's Python bindings in detail for efficient use of PMEM in Python.
    Fig. 2a: Memory-storage hierarchy with persistent memory
    Fig. 2b: Applications can directly access the persistent memory resident data structures without using buffers.
    Investigate DAOS architecture for metadata operation
    The open-source DAOS (Distributed Asynchronous Object Storage) is notable for its rank on the IO-500 list and its use of Intel® Optane™ Persistent Memory. In particular, its metadata performance is remarkable compared with other systems. We are investigating which aspects of the DAOS architecture account for this metadata performance, and we are considering either integrating DAOS techniques into an existing system or developing a new storage system with persistent memory.
    Fig. 3: mdtest performance comparison of IO-500 10 node challenge scores
    Research of caching file system to exploit node local storages
    The performance gap between processors and disk-based storage is growing in modern HPC systems. To reduce the gap, SSDs attached to compute nodes have been used as a “node local burst buffer”. We are implementing a distributed file system that uses local SSDs as a caching layer for the storage nodes. The system uses the fuse-library for replacing system calls and the mochi-framework for RPC data transfer.
    Fig. 4: Automation of construction/destruction of a swarm cluster
    Acceleration of Deep Learning using PyTorch with persistent memory
    Persistent memory offers greater capacity than DRAM and significantly better performance than storage. We use it for deep learning with PyTorch. Usually, before deep learning is performed on the GPU, the training data is copied from the storage to the main memory. We exploit the persistent memory to improve the performance.
    Acknowledgment: This work is partially supported by the Multidisciplinary Cooperative Research Program in CCS, University of Tsukuba, the New Energy and Industrial Technology Development Organization (NEDO), and Fujitsu Laboratories Ltd.
  • 11. Database Group
    contact address: {kitagawa, amagasa, shiokawa, horie}@cs.tsukuba.ac.jp
    Scalable Graph Analysis over Intel Xeon Phi Coprocessors
    The structural graph clustering method SCAN is used successfully in many applications because it not only detects densely connected nodes as clusters but also extracts sparsely connected nodes as hubs or outliers (Fig. 1). However, it is difficult to apply SCAN to large-scale graphs, since SCAN needs to evaluate the density for all adjacent nodes in the graph. To address this problem, we present a novel algorithm, SCAN-XP, that runs on Intel Xeon Phi coprocessors. We designed SCAN-XP to make the best use of the many cores in the Intel Xeon Phi with the following approaches. First, SCAN-XP avoids the bottlenecks that arise in parallel graph computations by balancing the load well among the cores. Second, SCAN-XP effectively exploits the 512-bit SIMD instructions implemented in each core to speed up the density evaluations. As a result, SCAN-XP runs approximately 100 times faster than SCAN; for graphs with 100 million edges, SCAN-XP finishes in a few seconds (Fig. 2).
    Fig. 1: Structural Graph Clustering SCAN
    Fig. 2: Overall performances
    Table 1: Real-world Dataset
    Noise-robust sleep stage scoring for mice using deep learning & big data
    Sleep stage scoring for mice is one of the most basic analyses in sleep research; however, this analysis is time-consuming and requires considerable expertise and effort. Although several studies have proposed automated scoring methods, they are not robust enough against noise in biological signals for research use. To develop a noise-robust scoring method, we employ the following approaches.
    1) Employing convolutional neural networks (CNN) and long short-term memory (LSTM), which can capture the features of both the biological signals and the noise in them.
    2) Training the model using noisy biological signals obtained from over 3000 mice.
    Thanks to these improvements, the proposed method achieved a scoring accuracy of more than 95% for noisy biological signals. This result indicates that our method is practical enough for sleep research use.
    Fig. 3: Procedure of sleep stage scoring
    ① Measure biological signals (EEG & EMG) from mice
    ② Split the signals into 20-sec. epochs (subsequences)
    ③ Assign sleep stages (W, NR, and R) to epochs: W (Wake), NR (Non-REM), R (REM)
    Fig. 4: Structure of the proposed system
    [Fig. 4 labels: Inputs (EEG, EMG); Feature extraction (CNN with wide filters, CNN with wide filters, CNN with narrow filters); Scoring model (LSTM, Dense, Softmax); Stage {W, NR, R}]
    Table 2: Feature of each stage
    Stage    Peak Freq. of EEG    Amplitude of EMG
    W        7-11 Hz              Large
    NR       1-6 Hz               Small
    R        7-11 Hz              Smallest
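For the graph clustering part of this slide, the "density" that SCAN evaluates for adjacent nodes is usually expressed as a structural similarity between the closed neighborhoods of two nodes. The slide does not spell out the formula, so the Python sketch below assumes the definition from the original SCAN formulation as I understand it, |Γ(u) ∩ Γ(v)| / sqrt(|Γ(u)| · |Γ(v)|), where Γ(v) contains v and its neighbors. The toy graph and all names are hypothetical, and this is plain sequential code, not SCAN-XP.

```python
from math import sqrt


def structural_similarity(adj, u, v):
    """Similarity of the closed neighborhoods of u and v (assumed SCAN definition)."""
    gamma_u = adj[u] | {u}
    gamma_v = adj[v] | {v}
    return len(gamma_u & gamma_v) / sqrt(len(gamma_u) * len(gamma_v))


# Hypothetical toy graph: two triangles (0-1-2 and 3-4-5) joined by the edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

# An edge inside a triangle versus the bridge between the triangles.
sim_in_cluster = structural_similarity(adj, 0, 1)
sim_bridge = structural_similarity(adj, 2, 3)

print(sim_in_cluster, sim_bridge)
```

Edges inside a dense group score high and bridging edges score low, which is what lets the method separate clusters from hubs and outliers. Because every edge needs one such set intersection, the evaluation is a natural target for the load balancing and SIMD acceleration described above.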
  • 12. Computational Media Group
    contact address: pr@ccs.tsukuba.ac.jp
    Information Display Design on Turn-By-Turn Navigation for Visually Impaired People
    We are researching navigation for the visually impaired. We propose a new interface that uses sound and vibration to support turn-by-turn navigation, which is common among visually impaired people. In the proposed interface, the target path is divided into straight segments and points where the direction changes. The navigation instructions given by sound and vibration are carefully designed to give minimal yet sufficient cues to a visually impaired person while walking. We have implemented a preliminary system based on our proposal and conducted an experiment with visually impaired participants. The results imply that the proposed approach is useful for navigation by visually impaired people.
    [Figure labels: Location Estimation by CV (Reference, Query); Orientation measurement by IMU; Goal; LR-correction; Orientation; Signal to turn; Mode change; Voice Announce; Field test]
    Accurate Overlapping Method of Time-Lapse Images for World Heritage Site Investigation
    A method is proposed to accurately overlap multiple high-quality images with different shooting positions and intervals by combining corresponding-point information between images with 3D shape information. The proposed method uses the correct feature matches of images obtained by rendering the 3D model of the subject. In this research, the subjects were the pillars of the Angkor Thom Bayon Temple and the epilithic microorganisms adhering to and eroding their surfaces. A synthetic homography transformation based on the correct matches is employed to overlap the target images.
    Image-quality Improvement of Omnidirectional Free-Viewpoint Images by GAN
    We propose a method to improve the quality of omnidirectional free-viewpoint images using generative adversarial networks (GAN). By estimating the 3D information of the captured space while integrating omnidirectional images taken from multiple viewpoints, it is possible to generate an arbitrary omnidirectional appearance. However, the quality of free-viewpoint images deteriorates because of artifacts caused by 3D estimation errors and occlusion. We solve this problem by using GAN. Moreover, by focusing on projective geometry during training, we further improve image quality by converting the omnidirectional image into perspective-projection images.
    (a): OFV image (no image-quality improvement). (b): Proposed method using learning by image division (with image-quality improvement). (c): Proposed method using learning with omnidirectional images (with image-quality improvement). (d): Correct image (captured image).
  • 13. Multi-Hybrid Accelerated Computing Platform
    Supercomputer at CCS: Cygnus
    • Combining the strengths of different types of accelerators: GPU + FPGA
    • GPU is still an essential accelerator for simple and large degrees of parallelism, providing ~10 TFLOPS peak performance
    • FPGA is a new type of accelerator offering application-specific hardware with programmability, sped up by pipelining the calculation
    • FPGA is good for external communication between FPGAs, with advanced high-speed interconnection of up to 100Gbps x4 channels
    • Our new supercomputer “Cygnus”
    • Operation started in May 2019
    • 2x Intel Xeon CPUs, 4x NVIDIA V100 GPUs, 2x Intel Stratix10 FPGAs
    • Deneb: 49 CPU+GPU nodes
    • Albireo: 32 CPU+GPU+FPGA nodes with 2D-torus dedicated network for FPGAs (100Gbps x4)
    [Network figure labels: comp. node; Deneb nodes; Albireo nodes; Albireo node (x32); Deneb node (x48); IB EDR Network (100Gbps x4/node); Ordinary inter-node network (CPU, GPU) by IB EDR with 4-ports x full bisection b/w; Ordinary inter-node communication channel for CPU and GPU, but they can also request it to FPGA; Inter-FPGA direct network]
    Specification of Cygnus
    Item: Specification
    Peak performance: 2.4 PFLOPS DP (GPU: 2.2 PFLOPS, CPU: 0.2 PFLOPS, FPGA: 0.6 PFLOPS SP) ⇨ enhanced by mixed precision and variable precision on FPGA
    # of nodes: 81 (32 Albireo (GPU+FPGA) nodes, 49 Deneb (GPU-only) nodes)
    Memory: 192 GiB DDR4-2666/node = 256GB/s, 32GiB x 4 for GPU/node = 3.6TB/s
    CPU / node: Intel Xeon Gold (SKL) x2 sockets
    GPU / node: NVIDIA V100 x4 (PCIe)
    FPGA / node: Intel Stratix10 x2 (each with 100Gbps x4 links/FPGA and x8 links/node)
    Global File System: Lustre, RAID6, 2.5 PB
    Interconnection Network: Mellanox InfiniBand HDR100 x4 (two cables of HDR200 / node), 4 TB/s aggregated bandwidth
    Programming Language: CPU: C, C++, Fortran, OpenMP; GPU: OpenACC, CUDA; FPGA: OpenCL, Verilog HDL
    System Vendor: NEC
    OpenCL-ready High Speed FPGA Networking [1]
    Target GPU: NVIDIA Tesla V100
    Target FPGA: Nallatech 520N
    • FPGA design plan
    • Router
      - For the dedicated network, this implementation is mandatory.
      - Forwards packets to their destinations.
    • User Logic
      - The OpenCL kernel runs here.
      - Inter-FPGA communication can be controlled from the OpenCL kernel.
    • SL3
      - SerialLite III: Intel FPGA IP
      - Includes transceiver modules for inter-FPGA data transfer.
      - Users don't need to care about it.
    [Figure: SINGLE NODE (with FPGA): CPU, PCIe network (switch), GPUs, FPGA, HCAs; Inter-FPGA direct network (100Gbps x4); Network switch (100Gbps x2)]
    [Figure: SINGLE NODE (without FPGA): CPU, PCIe network (switch), GPUs, HCAs; Network switch (100Gbps x2)]
    [Figure label: FPGA (only for Albireo nodes)]
    Inter-FPGA direct network: 64 FPGAs on Albireo nodes are connected directly in a 2D-torus configuration without an Ethernet switch.
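A 2D-torus connection of the 64 Albireo FPGAs means that every FPGA has four direct neighbors and that links wrap around at the edges. The Python sketch below illustrates this under stated assumptions: the slide does not give the torus dimensions, so the 8 x 8 arrangement is a guess, and the coordinate scheme, direction names, and function names are hypothetical.

```python
# Assumed layout: 64 FPGAs arranged as an 8 x 8 torus (not stated on the slide).
WIDTH, HEIGHT = 8, 8


def torus_neighbors(x, y, width=WIDTH, height=HEIGHT):
    """Return the four directly connected neighbors of node (x, y), with wrap-around."""
    return {
        "x_neg": ((x - 1) % width, y),
        "x_pos": ((x + 1) % width, y),
        "y_neg": (x, (y - 1) % height),
        "y_pos": (x, (y + 1) % height),
    }


def torus_hops(src, dst, width=WIDTH, height=HEIGHT):
    """Minimum number of link traversals between two nodes on the torus."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    # On each axis the shorter way may go through the wrap-around link.
    return min(dx, width - dx) + min(dy, height - dy)


corner_neighbors = torus_neighbors(0, 0)
print(corner_neighbors)
print(torus_hops((0, 0), (7, 7)), torus_hops((0, 0), (4, 4)))
```

With wrap-around links, opposite corners of the grid are only two hops apart, and under the assumed 8 x 8 layout no pair of FPGAs is more than eight hops apart.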
Legend: QSFP28 port.

IPSJ SIG Technical Report (情報処理学会研究報告), excerpt translated from Japanese:

    uint16 val = (uint16)(0);
    /* Select the input channel explicitly. */
    if (in_port == 1) {
        val = read_channel_intel(fwd_x_neg_in);
    } else if (in_port == 2) {
        val = read_channel_intel(fwd_x_pos_in);
    } else if (in_port == 3) {
        val = read_channel_intel(fwd_y_neg_in);
    } else if (in_port == 4) {
        val = read_channel_intel(fwd_y_pos_in);
    }
    /* Add this node's contribution to the received vector. */
    val += (uint16)(v + 0, v + 1, v + 2, v + 3, v + 4, v + 5, v + 6, v + 7,
                    v + 8, v + 9, v + 10, v + 11, v + 12, v + 13, v + 14, v + 15);
    ulong t_tmp = 0;
    /* Select the output channel explicitly; port 5 is the internal channel. */
    if (out_port == 1) {
        write_channel_intel(fwd_x_neg_out, clocktime(val, &t_tmp));
    } else if (out_port == 2) {
        write_channel_intel(fwd_x_pos_out, clocktime(val, &t_tmp));
    } else if (out_port == 3) {
        write_channel_intel(fwd_y_neg_out, clocktime(val, &t_tmp));
    } else if (out_port == 4) {
        write_channel_intel(fwd_y_pos_out, clocktime(val, &t_tmp));
    } else if (out_port == 5) {
        write_channel_intel(internal, clocktime(val, &t_tmp));
    }

Figure 13: Part of the OpenCL code of the toy program.

… consists of two kinds of kernels: one performs the outbound data transfer and the other performs the return data transfer. Communication proceeds in a bucket-brigade fashion, and once the whole computation is complete the result is returned to all nodes. The summation is performed on the outbound path, and the result is broadcast on the return path.

Figure 13 shows part of the toy program's code. This code receives an input, adds a value to it, and outputs the result; it is part of the kernel shown in grey in Figure 12. read_channel_intel and write_channel_intel are built-in functions that read from and write to a Channel, respectively, and clocktime is a custom function that measures time. The if statements select which Channel is used for input and output. This is needed because CoE does not yet have a routing function, so the external link on the FPGA board that carries the communication must be specified explicitly.

Figure 14 shows the results of the performance evaluation. The minimum latency was 2014 ns and the maximum throughput was 181.4 Gbps. This experiment used three nodes in total: ppx2-02, ppx2-03, and ppx2-05. As shown in Figure 12, ppx2-05 is the starting node and the data turns around at ppx2-03. The measured throughput was derived from the time between the start of data transmission at the starting node and the end of data reception at the starting node. The data size on the horizontal axis is the size of the data held by each node, which corresponds to the count argument of MPI_Allreduce. Compared with the pingpong benchmark result of 90.7 Gbps, 181.4 Gbps is about twice the performance; this is because communication and computation are pipelined, so sending and receiving take place at the same time.

Figure 14: Measurement results of the toy program.

Table 3: Protocol overhead
Element: payload rate, efficiency
Physical layer rate: 103.125 Gbps
67b/64b: 98.484 Gbps, ×0.955
Meta Frame: 98.287 Gbps, ×0.998
SL3 Burst: 96.813 Gbps, ×0.985
CoE Header: 90.762 Gbps, ×0.938

6. Discussion
6.1 pingpong benchmark
The maximum throughput obtained in the pingpong benchmark was 90.7 Gbps, which is only about 90% of the 100 Gbps physical layer. However, this performance is what the design intends. Table 3 shows the theoretical communication performance. In the evaluation environment the physical layer rate is 103.125 Gbps (4 × 25.78125 Gbps), the same rate as the 100Gb Ethernet physical layer. Table 3 shows how much protocol overhead is incurred relative to that physical layer rate. Of these, 67b/64b, Meta Frame, and SL3 Burst are overheads that originate in SerialLite III and were calculated with the formulas given in the official documentation [11]. CoE Header is the overhead of the header that CoE adds. A CoE packet consists of 64 bytes, with a 4-byte header …

© 2019 Information Processing Society of Japan
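The protocol-overhead table (Table 3) in the excerpt above is a running product of per-layer efficiencies, which the following minimal sketch reproduces. It uses the efficiency factors as printed in the table; the CoE header factor of 60/64 is an assumption inferred from the 64-byte packet and 4-byte header mentioned in the truncated text, and it agrees with the printed ×0.938 after rounding.

```python
# Reproduce the protocol-overhead table: each layer scales the payload rate
# of the layer above it by its efficiency factor.
PHYSICAL_GBPS = 4 * 25.78125  # 103.125 Gbps, as stated in the text

# Factors as printed in the table, except the CoE header.
# ASSUMPTION: 60 payload bytes per 64-byte packet (4-byte header), which
# matches the printed x0.938 after rounding.
LAYERS = [
    ("67b/64b", 0.955),
    ("Meta Frame", 0.998),
    ("SL3 Burst", 0.985),
    ("CoE Header", 60 / 64),
]


def payload_rates(physical_gbps, layers):
    """Return (layer name, payload rate in Gbps) after applying each layer in turn."""
    rate = physical_gbps
    rows = []
    for name, efficiency in layers:
        rate *= efficiency
        rows.append((name, round(rate, 3)))
    return rows


if __name__ == "__main__":
    for name, gbps in payload_rates(PHYSICAL_GBPS, LAYERS):
        print(f"{name:<12} {gbps:8.3f} Gbps")
```

The last row comes out at 90.762 Gbps, which is in line with the 90.7 Gbps measured by the pingpong benchmark.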
Cluster System with FPGAs
• Communication Integrated Reconfigurable CompUting System (CIRCUS)
  - CIRCUS enables OpenCL code to communicate with FPGAs on other nodes
  - It extends Intel's channel mechanism to external communications
  - Pipelined manner: data is sent from, and received into, the compute pipeline directly

    /* simple_out and simple_in are channels declared elsewhere. */
    __kernel void sender(__global float* restrict x, int n) {
        for (int i = 0; i < n; i++) {
            float v = x[i];
            write_channel_intel(simple_out, v);
        }
    }

    __kernel void receiver(__global float* restrict x, int n) {
        for (int i = 0; i < n; i++) {
            float v = read_channel_intel(simple_in);
            x[i] = v;
        }
    }
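The sender and receiver kernels above communicate only through a channel, which behaves like a blocking FIFO. As a rough host-side analogy (plain Python, not OpenCL, and not a model of CIRCUS or of the inter-FPGA link itself), the sketch below connects a sender and a receiver through a bounded queue. The FIFO depth and the `transfer` helper are assumptions made for illustration.

```python
# Host-side analogy of the sender/receiver kernels: a bounded FIFO stands in
# for the channel. Only the blocking FIFO behaviour is modelled.
import queue
import threading


def sender(x, channel):
    """Push every element into the channel, in order."""
    for v in x:
        channel.put(v)  # blocks while the FIFO is full


def receiver(n, channel, out):
    """Pop n elements from the channel and store them in arrival order."""
    for _ in range(n):
        out.append(channel.get())  # blocks while the FIFO is empty


def transfer(data, depth=4):
    """Run sender and receiver concurrently; depth is an arbitrary assumption."""
    channel = queue.Queue(maxsize=depth)
    received = []
    tx = threading.Thread(target=sender, args=(data, channel))
    rx = threading.Thread(target=receiver, args=(len(data), channel, received))
    tx.start()
    rx.start()
    tx.join()
    rx.join()
    return received


if __name__ == "__main__":
    print(transfer([1.0, 2.0, 3.0]))
```

Because both ends block on the FIFO, neither side needs to know the other's timing, which is the property that lets a channel feed a compute pipeline directly.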
[Figure: communication without channels (the source kernel writes to off-chip global memory (DDR4) and the destination kernel reads it back) versus communication with channels (the source kernel feeds the destination kernel through a FIFO channel).]
[Figure: FPGA block diagram with OpenCL kernels, I/O channels, serial links (x4), network controller, PCIe, OpenCL API, and interconnect; some labels are not recoverable.]
• FPGA-based parallel computing with OpenCL
  - Needs a communication system suited to OpenCL and Intel FPGAs
  - Uses the Intel FPGA SDK for OpenCL
• I/O Channel
  - Connects OpenCL with peripherals
  - We used this feature
• Channel Extension: transfers data between kernels directly (low latency and high bandwidth)
• We can use a multiple-kernel design to exploit spatial parallelism in an FPGA
CIRCUS Backends. Code labels: sender code on FPGA1, receiver code on FPGA2. Our proposed method.
Pipelined communication experiment: 90.7 Gbps↑
[Figure: Recv. → Comp. → Send pipeline. A, B: start and end points of the clock measurement.]

Authentic Radiation Transfer [2]
• Accelerated Radiative transfer on Grids Oct-Tree (ARGOT) has been developed at the Center for Computational Sciences, University of Tsukuba
• ART is one of the algorithms used in ARGOT and the dominant part (90% or more of the computation time) of the ARGOT program
• ART is a ray-tracing-based algorithm
• The problem space is divided into meshes and reactions are computed on each mesh
• The memory access pattern depends on the ray direction
• Not suitable for SIMD architectures
[Figure: performance (M mesh/s, axis 0 to 1400, higher is better) versus mesh size, from (16,16,16) to (128,128,128), for CPU(14C), CPU(28C), P100(x1), and FPGA.]

Table 2: Resource usage and clock frequency (cropped in the source after the M20 column)
size | # of PEs | ALMs (%) | Registers (%) | M20
(16, 16, 16) | (2, 2, 2) | 132,283 31% | 267,828 31% | 7
(32, 32, 32) | (2, 2, 2) | 169,882 40% | 344,447 40% | 7
(64, 64, 64) | (2, 2, 2) | 169,549 40% | 344,512 40% | 7
(128, 128, 128) | (2, 2, 2) | 169,662 40% | 344,505 40% | 7

Table 3: Performance comparison between FPGA, CPU and GPU implementations. The unit is M mesh/sec.
Size | CPU(14C) | CPU(28C) | P100 | FPGA
(16,16,16) | 112.4 | 77.2 | 105.3 | 1282.8
(32,32,32) | 158.9 | 183.4 | 490.4 | 1165.2
(64,64,64) | 175.0 | 227.2 | 1041.4 | 1111.0
(128,128,128) | 95.4 | 165.0 | 1116.1 | 1133.5

Paper excerpt (cropped in the source):
… per link) multiple interconnection links (up to 4 channels) on it. Additionally, an HLS programming environment such as OpenCL is provided, and there are several types of research that involve them in FPGA computing. In [3], Kobayashi et al. show the basic feature for utilizing the high-speed interconnection over FPGAs driven by OpenCL kernels. Therefore, although the performance of our implementation is almost the same as that of the NVIDIA P100 GPU, the overall performance with …

[Fig. 5: Design outline of ART on FPGA: DDR4 memory, Memory Reader, Memory Writer, Buffers, and a PE Array (2x2x2), connected by channels and a memory network.]

… each other. Each kernel computes the reaction between a mesh and a ray in its own computation space, which is dedicated to that kernel. During the computation, a ray traverses multiple compute kernels depending on its location. If a ray leaves a kernel's space, its data is transferred to a neighboring kernel through a channel. Figure 5 shows the design outline of our implementation. “Memory Reader” reads mesh data from DDR4 memory, which OpenCL sees as global memory. “Memory Writer” is the counterpart of the reader and updates mesh data with the result of the computation. It performs both read and write memory accesses because it computes the integration of the gas reaction. “Buffer” is a mesh data buffer that improves memory access performance. “PE Array” is an array of PEs (Processing Elements). A PE computes the kernel of the ART method. The array consists of multiple kernels. We show the details of the PE network in the next subsection. Since our implementation is a work in progress, it lacks some features of the CPU implementation.
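The hand-off of a ray between kernels described above can be sketched as an index calculation. The example below is a simplified illustration, not the authors' kernel: it assumes a (16, 16, 16) mesh split over the (2, 2, 2) PE array, advances a ray one cell at a time along a single axis (real ray traversal is more general), and the function names are hypothetical.

```python
# Simplified model of block decomposition and ray hand-off between PEs.
# ASSUMPTIONS: a (16, 16, 16) mesh, a (2, 2, 2) PE array, and rays that move
# one cell per step along an axis. Names are illustrative only.
MESH = (16, 16, 16)
PE_GRID = (2, 2, 2)
BLOCK = tuple(m // p for m, p in zip(MESH, PE_GRID))  # cells per PE: (8, 8, 8)


def owner_pe(cell):
    """Coordinates of the PE whose block contains the given mesh cell."""
    return tuple(c // b for c, b in zip(cell, BLOCK))


def local_index(cell):
    """Position of the cell inside its owner's block."""
    return tuple(c % b for c, b in zip(cell, BLOCK))


def step_ray(cell, direction):
    """Advance the ray by one cell and classify what happens.

    Returns (next_cell, event) where event is ("stay", pe) if the ray remains
    in the same PE, ("handoff", pe) if it must be sent to a neighbouring PE,
    or ("boundary", None) if it leaves the mesh.
    """
    nxt = tuple(c + d for c, d in zip(cell, direction))
    if any(n < 0 or n >= m for n, m in zip(nxt, MESH)):
        return nxt, ("boundary", None)
    here, there = owner_pe(cell), owner_pe(nxt)
    if here == there:
        return nxt, ("stay", here)
    return nxt, ("handoff", there)


if __name__ == "__main__":
    print(step_ray((7, 0, 0), (1, 0, 0)))   # crosses into the next PE
    print(step_ray((15, 0, 0), (1, 0, 0)))  # leaves the mesh
```

In this model a "handoff" event corresponds to writing the ray data to the channel of the neighbouring kernel, and a "boundary" event to the ray leaving the PE array.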
During computation on an FPGA, all mesh data must be placed in its internal BRAM (Block Random Access Memory). The FPGA implementation does not support replacing mesh data as the computation progresses. Therefore, the problem size that an FPGA can solve is limited by the size of the BRAM. The CPU implementation supports inter-node parallelization using MPI (Message Passing Interface), but the FPGA implementation does not support any networking functionality and uses only one FPGA.

4.2 Parallelization using Channel in an FPGA
We describe the structure of the “PE Array” shown in Figure 5. A PE Array consists of PEs and BEs (Boundary Elements) as shown in Figure 6.

[Figure: without channels, the source kernel writes to off-chip global memory (DDR4) and the destination kernel reads it back; with channels, the two kernels are connected by a FIFO channel.]
• Our implementation uses a channel-based approach
• Channels are one of Intel's extensions to OpenCL for FPGAs
• They make inter-kernel communication much faster
• No external memory (DDR) access is required
• Lower resource utilization than DDR access

[Figure: a (16x16x16) mesh divided into (8x8x8) blocks.]
• The problem space is divided into small blocks
• e.g. (16, 16, 16) → 8 blocks of (8, 8, 8)
• A PE is assigned to each small block

[Figure: PEs and BEs laid out in the x-y plane and connected by channels carrying ray data, 96 bit x 2 (read, write).]
• PEs are connected to each other by channels
• PE: Processing Element
• BE: Boundary Element
• The PE and BE kernels are started automatically by the autorun attribute
• Lower control overhead and resource usage because the number of host-controlled kernels decreases

Chart annotations: 4.9x faster; almost equal performance.

Reference
[1] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, and Taisuke Boku, Parallel Processing on FPGA Combining Computation and Communication in OpenCL Programming, 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 479-488, May 2019
[2] Norihisa Fujita, Ryohei Kobayashi, Yoshiki Yamaguchi, Yuuma Oobata, Taisuke Boku, Makito Abe, Kohji Yoshikawa, and Masayuki Umemura, Accelerating Space Radiative Transfer on FPGA using OpenCL (Accepted), International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies (HEART 2018)

Acknowledgment
This research is a part of the project titled “Development of Computing-Communication Unified Supercomputer in Next Generation” under the program of “Research and Development for Next-Generation Supercomputing Technology” by MEXT. We thank the Intel University Program for providing us with both hardware and software.
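Returning to the performance comparison on this slide, the relative speeds can be read straight off Table 3. The sketch below only divides the published figures; which row the “4.9x faster” annotation refers to is not stated on the slide, so matching it to the (64,64,64) FPGA versus CPU(28C) ratio is an observation, not a claim from the source.

```python
# Performance figures copied from Table 3 (unit: M mesh/s).
TABLE3 = {
    (16, 16, 16): {"CPU(14C)": 112.4, "CPU(28C)": 77.2, "P100": 105.3, "FPGA": 1282.8},
    (32, 32, 32): {"CPU(14C)": 158.9, "CPU(28C)": 183.4, "P100": 490.4, "FPGA": 1165.2},
    (64, 64, 64): {"CPU(14C)": 175.0, "CPU(28C)": 227.2, "P100": 1041.4, "FPGA": 1111.0},
    (128, 128, 128): {"CPU(14C)": 95.4, "CPU(28C)": 165.0, "P100": 1116.1, "FPGA": 1133.5},
}


def speedup(size, baseline, target="FPGA"):
    """Ratio of the target's throughput to the baseline's for one mesh size."""
    row = TABLE3[size]
    return row[target] / row[baseline]


if __name__ == "__main__":
    for size in TABLE3:
        cpu = speedup(size, "CPU(28C)")
        gpu = speedup(size, "P100")
        print(f"{size}: FPGA vs CPU(28C) = {cpu:.2f}x, FPGA vs P100 = {gpu:.2f}x")
```

For the largest mesh the FPGA and P100 figures differ by only about 2%, which fits the “almost equal performance” annotation, while for small meshes the FPGA is far ahead of both the CPU and the GPU.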
  • 14. Joint Center for Advanced High Performance Computing
Joint Center for Advanced HPC | http://jcahpc.jp/

Overview
JCAHPC (Joint Center for Advanced HPC), a cooperative organization of the University of Tokyo and the University of Tsukuba for the joint procurement and operation of the largest-scale supercomputer in Japan, introduced a new supercomputer system, “Oakforest-PACS”, with 25 PFLOPS peak performance, and started its operation on December 1st, 2016. The Oakforest-PACS system is ranked #6 in the TOP500 list of November 2016 with 13.55 PFLOPS of Linpack performance, and is also recognized as Japan's fastest supercomputer. The system is installed in the Kashiwa Research Complex II building on the Kashiwa-no-Ha campus of the University of Tokyo.
The Oakforest-PACS system has 8,208 compute nodes, each of which consists of the latest version of the Intel Xeon Phi processor (code name: Knights Landing), with Intel Omni-Path Architecture as the high-performance interconnect. It is the largest cluster solution with the Knights Landing processor, as well as the largest configuration with Omni-Path Architecture, in the world. The system is integrated by Fujitsu Co. Ltd., and its PRIMERGY server is employed as each compute node. Additionally, the system employs the Lustre shared file system (capacity: 26 PB) and IME (fast file cache system, 940 TB), both of which are provided by DataDirect Networks (DDN). All the compute nodes and servers, including login nodes, Lustre servers, and IME servers, are connected by a full-bisection-bandwidth fat-tree interconnection network with Intel Omni-Path Architecture, to provide highly flexible job allocation over the nodes and high-performance file access.

Research & Education
The Oakforest-PACS is offered to researchers in Japan and their international collaborators through various types of programs operated by HPCI under MEXT, and through the original supercomputer resource-sharing programs of the two universities. It is expected to contribute to the dramatic development of new frontiers in various fields of study. The Oakforest-PACS will also be used for the education and training of students and young researchers. We will continue to make further social contributions through the operation of the Oakforest-PACS.

System Configuration
[Figure: fat-tree built from 12 768-port director switches (source: Intel) and 362 48-port edge switches, each edge switch with 24 uplinks and 24 downlinks.]
Interconnect: Omni-Path Architecture (100 Gbps), full-bisection-bandwidth fat-tree
Parallel File System: 26.2 PB, Lustre filesystem, DDN ES14KX x 10, 500 GB/s
File Cache System: 940 TB, DDN IME14KX x 25, 1560 GB/s
Compute Nodes: 25 PFlops
  CPU: Intel Xeon Phi 7250 (KNL, 68 cores, 1.4 GHz)
  Mem: 16 GB (MCDRAM, 490 GB/sec, effective) + 96 GB (DDR4-2400, 115.2 GB/sec)
  x 8,208, Fujitsu PRIMERGY CX1640 M1 x 8 nodes inside CX600 M1 (2U)
Login nodes: Login Node x 20, serving U. Tsukuba users and U. Tokyo users

Total peak performance: 25 PFLOPS
Total number of compute nodes: 8,208
Power consumption: 4.2 MW (including cooling)
# of racks: 102
Cooling system, compute nodes: type: warm-water cooling, direct cooling (CPU) and rear-door cooling (except CPU); facility: cooling tower & chiller
Cooling system, others: type: air cooling; facility: PAC

TOP 500 #6 (#1 in Japan), HPCG #3 (#2), Green 500 #6 (#2) @Nov. 2016
IO 500 #1 @Nov. 2017, Jun. 2018
IO-500 BW #1 @Jun. 2019
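The headline figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope check, not an official calculation: the node count, core count, clock, Linpack result, and switch counts come from the slide, while the 32 double-precision FLOPs per cycle per core (two AVX-512 units, 8 doubles each, fused multiply-add) is an assumption about Knights Landing that the slide does not state.

```python
# Back-of-the-envelope check of the Oakforest-PACS figures on the slide.
NODES = 8208
CORES_PER_NODE = 68
CLOCK_GHZ = 1.4
# ASSUMPTION (not on the slide): 32 double-precision FLOPs per cycle per
# core, i.e. 2 AVX-512 units x 8 doubles x 2 (fused multiply-add).
FLOPS_PER_CYCLE = 32

EDGE_SWITCHES, DOWNLINKS, UPLINKS = 362, 24, 24
DIRECTOR_SWITCHES, DIRECTOR_PORTS = 12, 768


def peak_pflops(nodes=NODES, cores=CORES_PER_NODE, ghz=CLOCK_GHZ, fpc=FLOPS_PER_CYCLE):
    """Theoretical peak in PFLOPS (nodes x cores x GHz x FLOPs/cycle gives GFLOPS)."""
    return nodes * cores * ghz * fpc / 1e6


def linpack_efficiency(rmax_pflops=13.55):
    """Fraction of the theoretical peak achieved by the Linpack run."""
    return rmax_pflops / peak_pflops()


def fat_tree_ports():
    """Port totals at the edge and director levels of the fat-tree."""
    return {
        "edge_downlinks": EDGE_SWITCHES * DOWNLINKS,
        "edge_uplinks": EDGE_SWITCHES * UPLINKS,
        "director_ports": DIRECTOR_SWITCHES * DIRECTOR_PORTS,
    }


if __name__ == "__main__":
    print(f"peak: {peak_pflops():.2f} PFLOPS")
    print(f"Linpack efficiency: {linpack_efficiency():.1%}")
    print(fat_tree_ports())
```

Under that assumption the peak comes out at about 25.0 PFLOPS, matching the slide. The edge switches offer 8,688 downlinks for the 8,208 compute nodes plus servers, and the director switches have enough ports for every edge uplink, which is consistent with a full-bisection fat-tree.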