Submitted by:
NOMAN SIDDIQUI
SEC: A (Evening)
Seat No.: EB21102087
3rd Semester (BSCS)
Assignment Report:
Top 10 Supercomputers With Descriptive Information & Analysis
Submitted To:
SIR KHALID AHMED
Department of Computer Science - (UBIT)
UNIVERSITY OF KARACHI
Top 10 Supercomputers Report
What is a Supercomputer?
A supercomputer is a computer with a high level of performance as compared to
a general-purpose computer. The performance of a supercomputer is commonly
measured in floating-point operations per second (FLOPS) instead of million instructions
per second (MIPS). Since 2017, there have been supercomputers that can perform over
10^17 FLOPS (a hundred quadrillion FLOPS, 100 petaFLOPS or 100 PFLOPS).
Supercomputers play an important role in the field of computational science, and are used
for a wide range of computationally intensive tasks in various fields, including quantum
mechanics, weather forecasting, climate research, oil and gas exploration, molecular
modeling (computing the structures and properties of chemical compounds,
biological macromolecules, polymers, and crystals), and physical simulations (such as
simulations of the early moments of the universe, airplane and spacecraft aerodynamics,
the detonation of nuclear weapons, and nuclear fusion). They have been essential in the
field of cryptanalysis.
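Because FLOPS is the headline metric used throughout this report, it is worth recalling how a theoretical peak (Rpeak) follows from the hardware parameters: sockets × cores × clock × FLOPs per cycle. The short sketch below illustrates this formula only; the socket, core, clock, and FLOPs-per-cycle values are placeholders, not the specification of any machine covered later.

```c
#include <stdio.h>

/* Illustrative only: theoretical peak = sockets x cores x clock x FLOPs/cycle.
 * The numbers below are placeholders, not the spec of any machine in this report. */
int main(void) {
    double sockets = 2.0, cores = 24.0, clock_ghz = 2.0;
    double flops_per_cycle = 32.0;      /* e.g. 2 FMA units x 8 DP lanes x 2 ops */
    double node_peak_gflops = sockets * cores * clock_ghz * flops_per_cycle;
    printf("Node peak: %.1f GFLOPS (%.3f TFLOPS)\n",
           node_peak_gflops, node_peak_gflops / 1000.0);
    return 0;
}
```

The measured LINPACK figure (Rmax) reported on the TOP500 list is always some fraction of this theoretical peak.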
1. The Fugaku Supercomputer
Introduction:
Fugaku is a supercomputer (exascale on mixed-precision benchmarks, but petascale on the
mainstream HPL benchmark) at the RIKEN Center for Computational Science in Kobe, Japan. It started
development in 2014 as the successor to the K computer, and started operating in
2021. Fugaku made its debut in 2020, and became the fastest supercomputer in the world
in the June 2020 TOP500 list, as well as becoming the first ARM architecture-based
computer to achieve this. In June 2020, it achieved 1.42 exaFLOPS (in the HPL-AI
benchmark), making it the first supercomputer ever to exceed 1 exaFLOPS. As of
November 2021, Fugaku is the fastest supercomputer in the world. It is named after
an alternative name for Mount Fuji.
Block Diagram:
Functional Units:
Functional Units, Co-Design and System for the Supercomputer “Fugaku”
1. Performance estimation tool: This tool, taking Fujitsu FX100 (the previous
Fujitsu supercomputer) execution profile data as input, enables performance
projection for a given set of architecture parameters. The performance projection is
modeled according to the Fujitsu microarchitecture. This tool can also estimate the power
consumption based on the architecture model.
2. Fujitsu in-house processor simulator: We used an extended FX100 SPARC instruction-
set simulator and compiler, developed by Fujitsu, for preliminary studies in the initial
phase, and an Armv8+SVE simulator and compiler afterward.
3. Gem5 simulator for the Post-K processor: The Post-K processor simulator, based on
an open-source system-level processor simulator, Gem5, was developed by RIKEN during
the co-design for architecture verification and performance tuning. A fundamental
problem is the scale of scientific applications that are expected to be run on Post-K. Even
our target applications are thousands of lines of code and are written to use complex
algorithms and data structures. Although the processor simulators are capable of
providing very accurate performance results at the cycle level, they are very slow and are
limited to execution on a single processor without MPI communications between the
nodes. Our performance estimation tool is useful since it enables performance analysis
based on the execution profile taken from an actual run on the FX100 hardware. It has a
rich set of performance counters, including busy cycles for read/write memory access,
busy cycles for L1/L2 cache access, busy cycles of floating-point arithmetic, and cycles
for instruction commit. These features enable the performance projection for a new set of
hardware parameters by changing the busy cycles of functional blocks. The breakdown
of the execution time (cycles) can be calculated by summing the busy cycles of each
functional block in the pipeline according to the processor microarchitecture. Since the
execution time is estimated by a simple formula modeling the pipeline, it can be applied
to a region of uniform behavior such as a kernel loop.
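A minimal sketch of this kind of busy-cycle model is shown below. The struct layout, field names, and numbers are illustrative assumptions, not the actual Fujitsu tool; the sketch simply scales the baseline busy cycles of each functional block by hypothetical architecture parameters and sums them, as the text describes.

```c
#include <stdio.h>

/* Hypothetical sketch of a busy-cycle performance model (not the Fujitsu tool).
 * Profile counters come from a baseline run; scale factors model a new design. */
typedef struct {
    double mem_busy, cache_busy, fp_busy, commit_cycles;  /* baseline busy cycles */
} KernelProfile;

typedef struct {
    double mem_speedup, cache_speedup, fp_speedup;        /* e.g. faster HBM, wider SIMD */
} ArchParams;

static double estimate_cycles(const KernelProfile *p, const ArchParams *a) {
    /* Simple pipeline model: projected time is the sum of scaled busy cycles. */
    return p->mem_busy    / a->mem_speedup
         + p->cache_busy  / a->cache_speedup
         + p->fp_busy     / a->fp_speedup
         + p->commit_cycles;
}

int main(void) {
    KernelProfile k = { 4.0e9, 2.0e9, 6.0e9, 1.0e9 };     /* made-up baseline counters */
    ArchParams    a = { 4.0, 2.0, 2.0 };                  /* made-up projection factors */
    printf("Projected kernel cycles: %.3e\n", estimate_cycles(&k, &a));
    return 0;
}
```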
The first step of performance analysis is to identify kernels in each target application and
insert the library calls to get the execution profile. The total execution time is calculated
by summing the estimated execution time of each kernel using the performance
estimation tool with some architecture parameters. We repeated this process changing
several architecture parameters for design space exploration. Some important kernels
were extracted as independent programs. These kernels can be executed by the cycle-
level processor simulators for more accurate analysis. Since the performance estimation
tool is not able to take the impact of the out-of-order (O3) resources into account, the
Fujitsu in-house processor simulator was used to analyze a new instruction set and the
effect of changing the O3 resources. These kernels were also used for the processor
emulator for logic-design verification.
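The overall flow described here, estimating each kernel, summing the estimates, and repeating for different parameter sets, can be sketched as a small search loop. The configurations and cycle counts below are hypothetical placeholders, not the actual co-design tooling or data.

```c
#include <stdio.h>

/* Hypothetical design-space exploration sketch: for each candidate set of
 * architecture parameters, sum the projected time of every kernel.
 * All numbers are placeholders.                                        */
typedef struct { double mem_busy, fp_busy, other; } Kernel;   /* baseline cycles */
typedef struct { const char *name; double mem_speedup, fp_speedup; } Config;

int main(void) {
    Kernel kernels[] = { {4e9, 6e9, 1e9}, {8e9, 2e9, 1e9}, {3e9, 3e9, 2e9} };
    Config configs[] = { {"wider SIMD", 1.0, 2.0},
                         {"HBM memory", 4.0, 1.0},
                         {"both",       4.0, 2.0} };
    int nk = sizeof kernels / sizeof kernels[0];
    int nc = sizeof configs / sizeof configs[0];

    for (int c = 0; c < nc; c++) {
        double total = 0.0;
        for (int k = 0; k < nk; k++)
            total += kernels[k].mem_busy / configs[c].mem_speedup
                   + kernels[k].fp_busy  / configs[c].fp_speedup
                   + kernels[k].other;
        printf("%-12s projected cycles: %.2e\n", configs[c].name, total);
    }
    return 0;
}
```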
Co-Design of the Manycore Processor
Prior to the FLAGSHIP 2020 project, feasibility study projects were carried out to
investigate the basic design from 2012 to 2013. As a result, the basic architecture
suggested by the feasibility study was a large-scale system using a general-purpose
manycore processor with wide single-
instruction/multiple-data (SIMD) arithmetic units. The choice of the instruction set
architecture was an important decision for architecture design. Fujitsu offered the Armv8
instruction set with the Arm SIMD instruction set called the scalable vector extension
(SVE). The Arm instruction-set architecture has been widely accepted by software
developers and users not only for mobile processors, but also, recently, for HPC. For
example, Cavium Thunder X2 is an Arm processor designed for servers and HPC, and
has been used for several supercomputer systems, including Astra and Isambard. The
SVE is an extended SIMD instruction set. The most significant feature of the SVE is
vector-length-agnostic programming; as the name suggests, code written for it does not
depend on the hardware vector length. We decided to have two 512-bit-wide SIMD
arithmetic units, as
suggested by the feasibility study. The processor is custom designed by Fujitsu using
their microarchitecture as the backend of the processor core. Fujitsu proposed the basic
structure of the manycore processor architecture according to their microarchitecture:
Each core has an L1 cache, and a cluster of cores shares an L2 cache and a memory
controller. This cluster of cores is called a core-memory group (CMG). While other high-
performance processors, such as those of Intel and AMD, have L1 and L2 caches in the
core and share an L3 cache as a last-level cache, the core of our processor has only an
L1 cache to reduce the die size for the core. Our technology target for silicon fabrication
was 7-nm FinFET technology. The die size of the chip is the most dominant factor in
terms of cost. It is known that the cost of the chip increases in proportion to the size and
increases significantly beyond a certain size, and the yield of the chip becomes worse as
the size of the chip increases. One configuration is to use small chips and connect these
chips by multichip module (MCM) technology. Recently, AMD has used this “chiplet”
approach successfully. The advantage of this approach is that a small chip can be
relatively cheaper with a good yield. However, at the time of the basic design, the cost of
MCM was deemed too high, and a different kind of chip for the interconnect and I/O must
be made, resulting in even higher costs. The connection between chips on the MCM
would also increase the power consumption. Thus, our decision was to use a single large
die containing some CMGs and the network interface for interconnect and PCIe for I/O
connected by a network-on-chip. As a result, we decided to use 48 cores (plus four
assistant cores), organized as 12 cores/CMG × 4 CMGs. The size of the die fit within
about 400 mm², which was reasonable in terms of cost for 7-nm FinFET technology. As
the peak floating-point
performance of the central processing unit (CPU) chip was expected to reach a few
TFLOPS, the memory bandwidth of DDR4 was too low compared to the performance.
Thus, high-speed memory technologies, such as HBM and hybrid memory cube, were
examined to balance the memory bandwidth and arithmetic performance. The HBM is a
stacked memory chip connected via TSV on a silicon interposer. The HBM2 provides a
bandwidth of 256 GB/s per module, but the capacity of HBM2 is just up to 8 GiB, and the
cost is high because the silicon interposer is required. As a memory technology available
around 2019, HBM2 was chosen for its power efficiency and high memory bandwidth. We
decided not to use any additional DDR memory to reduce the cost. As described
previously, the number of HBM2 modules attached to CMGs is four, that is, the main
memory capacity is 32 GiB. Although it seems small for certain applications, we already
have many scalable applications developed for the K computer. Such scalable
applications can increase the problem size by increasing the number of used nodes. The
key to designing a cache architecture is to provide a high hit rate for many applications
and to prevent a bottleneck when data are supplied with full bandwidth from memory. We
examined various parameters, such as the line size, the number of ways, and the
capacity, in order to optimize the cache performance under the constraint of the size of
the area on the die and the amount of power consumption. To decide the cache structure
and size, we examined the impact of the cache configuration on the performance by
running some kernels extracted from target applications on the simulator for a single
CMG. We designed the cache to save power for accessing data in a set associative
cache. Data reads from the ways and the tag search may be performed in parallel to
reduce latency, but this may waste power because the data will not be used when the tag
does not match. In our design, data access is performed after a tag match. While this
causes a longer latency, there is less impact on performance in the case of
throughput-intensive HPC
applications. This design was applied to the L1 cache for vector access and the L2 cache,
resulting in the reduction of power by 10% in HPL with almost no performance
degradation. The microarchitecture is an O3 architecture designed by Fujitsu. The
amount of O3 resources was decided by the tradeoff between performance and the
impact on die size, based on the evaluation of some kernels extracted from the target
applications.
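The practical payoff of vector-length-agnostic SVE code is that the same binary runs correctly on any SVE vector width (the A64FX implements 512-bit vectors). A minimal daxpy-style sketch using the Arm C Language Extensions intrinsics is shown below; it is an illustration of the programming model, not code from the Fugaku software stack, and it assumes an SVE-capable compiler (for example, gcc with -march=armv8-a+sve).

```c
/* Vector-length-agnostic daxpy sketch using SVE ACLE intrinsics.
 * Illustrative only; compile with an SVE-capable compiler.        */
#include <arm_sve.h>
#include <stdint.h>

void daxpy_sve(int64_t n, double alpha, const double *x, double *y) {
    int64_t i = 0;
    svbool_t pg = svwhilelt_b64(i, n);          /* predicate covers remaining elements */
    while (svptest_any(svptrue_b64(), pg)) {
        svfloat64_t vx = svld1(pg, &x[i]);      /* predicated loads                    */
        svfloat64_t vy = svld1(pg, &y[i]);
        vy = svmla_x(pg, vy, vx, alpha);        /* y += alpha * x                      */
        svst1(pg, &y[i], vy);                   /* predicated store                    */
        i += svcntd();                          /* advance by the hardware vector length */
        pg = svwhilelt_b64(i, n);
    }
}
```

Nothing in the loop hard-codes the vector width: svcntd() and the predicate adapt to whatever length the hardware provides, which is exactly the property the co-design team valued in SVE.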
OVERVIEW OF FUGAKU SYSTEM
In 2019, the name of the system was decided as "Fugaku," and the installation was
completed in May 2020. The second-layer storage system is the global file system, a
Lustre-based parallel file system developed by Fujitsu. A Linux kernel runs on each node.
All system daemons run on two or four assistant cores. The CPU chip with two assistant
cores is used on compute-only nodes. The chip with four assistant cores is used on
compute-and-I/O nodes because such nodes service I/O functions requiring more CPU
resources.
Final specification for architecture parameters by our co-design:

Item   Co-design parameter            Spec.
Chip   CMG/chip                       4
       Core/chip                      48 (+4)*
       Memory/chip
        Technology                   HBM2
        Memory size                  32 GiB
        Memory BW                    1,024 GB/s
CMG    Core/CMG                       12 (+1)*
       L2 cache/CMG
        Size                         8 MiB
        Number of ways               16-way
        Load BW to L1                128 GB/s**
        Store BW from L1             64 GB/s**
        Line size                    256 bytes
Core   SIMD width                     512 bits
       Number of SIMD units           2
       L1D cache/core
        Size                         64 KiB
        Number of ways               4-way
        Load BW                      256 GB/s**
        Store BW                     128 GB/s**
       Out-of-order resources/core
        Reorder buffer               128 entries
        Reservation station          60 entries
        Physical SIMD registers      128
        Load buffer                  40 entries
        Store buffer                 24 entries
*Assistant core.
**Cache BW is at a CPU clock speed of 2 GHz.
Software Used:
Fugaku uses a "lightweight multi-kernel operating system" named IHK/McKernel. The
operating system runs both Linux and the McKernel lightweight kernel simultaneously
and side by side. The infrastructure that both kernels run on is termed the Interface for
Heterogeneous Kernels (IHK). The high-performance simulations are run on McKernel,
with Linux available for all other POSIX-compatible services.
2. Summit Supercomputer
Introduction:
Summit or OLCF-4 is a supercomputer developed by IBM for use at Oak Ridge National
Laboratory, capable of 200 petaFLOPS, making it the second fastest supercomputer
in the world (it held the number 1 position from November 2018 to June 2020). Its
LINPACK benchmark performance is 148.6 petaFLOPS. As of November 2019, the
supercomputer ranked as the 5th most energy-efficient in the world, with a measured
power efficiency of 14.668 gigaFLOPS/watt. Summit was also the first supercomputer to
reach exaop (a quintillion operations per second) speed, achieved in a mixed-precision
genomics analysis.
Block Diagram:
Software Used:
Red Hat Enterprise Linux is also widely deployed in National Labs and research centers
around the globe and is a proven platform for large-scale computing across multiple
hardware architectures. The total system design of Summit, consisting of 4,608 IBM
computer servers, aims to make it easier to bring research applications to this behemoth.
Part of this is the consistent environment provided by Red Hat Enterprise Linux.
Functional Units:
System Overview & Specifications
Summit is an IBM system located at the Oak Ridge Leadership Computing Facility. With
a theoretical peak double-precision performance of approximately 200 PF, it is one of
the most capable systems in the world for a wide range of traditional computational
science applications. It is also one of the “smartest” computers in the world for deep
learning applications with a mixed-precision capability in excess of 3 EF.
Core Pipeline
NVIDIA Tesla V100 GPU Architecture
3. Sierra Supercomputer
Introduction:
Sierra or ATS-2 is a supercomputer built for the Lawrence Livermore National
Laboratory for use by the National Nuclear Security Administration as the second
Advanced Technology System. It is primarily used for predictive applications in stockpile
stewardship, helping to assure the safety, reliability and effectiveness of the United
States' nuclear weapons.
Sierra is very similar in architecture to the Summit supercomputer built for the Oak Ridge
National Laboratory. The Sierra system uses IBM POWER9 CPUs in conjunction
with Nvidia Tesla V100 GPUs. The nodes in Sierra are Witherspoon IBM S922LC
OpenPOWER servers with two GPUs per CPU and four GPUs per node. These nodes
are connected with EDR InfiniBand. In 2019, Sierra was upgraded with IBM Power
System AC922 nodes.
Block Diagram:
Software Used:
The Summit and Sierra supercomputer cores are IBM POWER9 central processing units
(CPUs) and NVIDIA V100 graphic processing units (GPUs). NVIDIA claims that its GPUs
are delivering 95% of Summit’s performance. Both supercomputers use a Linux operating
system.
Functional Units:
Sierra boasts a peak performance of 125 petaFLOPS—125 quadrillion floating-point
operations per second. Early indications using existing codes and benchmark tests are
promising, demonstrating as predicted that Sierra can perform most required calculations
far more efficiently in terms of cost and power consumption than computers consisting of
CPUs alone. Depending on the application, Sierra is expected to be six to 10 times more
capable than LLNL’s 20-petaFLOP Sequoia, currently the world’s eighth-fastest
supercomputer.
To prepare for this architecture, LLNL has partnered with IBM and NVIDIA to rapidly
develop codes and prepare applications to effectively optimize the CPU/GPU nodes. IBM
and NVIDIA personnel worked closely with LLNL, both on-site and remotely, on code
development and restructuring to achieve maximum performance. Meanwhile, LLNL
personnel provided feedback on system design and the software stack to the vendor.
LLNL selected the IBM/NVIDIA system due to its energy and cost-efficiency, as well as
its potential to effectively run NNSA applications. Sierra’s IBM POWER9 processors
feature CPU-to-GPU connection via NVIDIA NVLink interconnect, enabling greater
memory bandwidth between each node so Sierra can move data throughout the system
for maximum performance and efficiency. Backing Sierra is 154 petabytes of IBM
Spectrum Scale, a software-defined parallel file system, deployed across 24 racks of
Elastic Storage Servers (ESS). To meet the scaling demands of the heterogeneous
systems, ESS delivers 1.54 terabytes per second in both read and write bandwidth and
can manage 100 billion files per file system.
“The next frontier of supercomputing lies in artificial intelligence,” said John Kelly, senior
vice president, Cognitive Solutions and IBM Research. “IBM's decades-long partnership
with LLNL has allowed us to build Sierra from the ground up with the unique design and
architecture needed for applying AI to massive data sets. The tremendous insights
researchers are seeing will only accelerate high-performance computing for research and
business.”
As the first NNSA production supercomputer backed by GPU-accelerated architecture,
Sierra’s acquisition required a fundamental shift in how scientists at the three NNSA
laboratories program their codes to take advantage of the GPUs. The system’s NVIDIA
GPUs also present scientists with an opportunity to investigate the use of machine
learning and deep learning to accelerate the time-to-solution of physics codes. It is
expected that simulation, leveraged by acceleration from artificial intelligence
technology, will be increasingly employed over the coming decade.
In addition to critical national security applications, a companion unclassified system,
called Lassen, also has been installed in the Livermore Computing Center. This
institutionally focused supercomputer will play a role in projects aimed at speeding cancer
drug discovery, precision medicine, research on traumatic brain injury, seismology,
climate, astrophysics, materials science, and other basic science benefiting society.
Sierra continues the long lineage of world-class LLNL supercomputers and represents
the penultimate step on NNSA’s road to exascale computing, which is expected to start
by 2023 with an LLNL system called “El Capitan.” Funded by the NNSA’s Advanced
Simulation and Computing (ASC) program, El Capitan will be NNSA’s first exascale
supercomputer, capable of more than a quintillion calculations per second—about 10
times greater performance than Sierra. Such computing power will be easily absorbed by
NNSA for its mission, which requires the most advanced computing capabilities and
deep partnerships with American industry.
4. Sunway TaihuLight Supercomputer
Introduction:
The Sunway TaihuLight is a Chinese supercomputer which, as of November 2021, is
ranked fourth in the TOP500 list, with a LINPACK benchmark rating of 93 petaflops. The
name is translated as "divine power, the light of Taihu Lake." This is nearly three times as
fast as the previous Tianhe-2, which ran at 34 petaflops. As of June 2017, it is ranked as
the 16th most energy-efficient supercomputer in the Green500, with an efficiency of
6.051 GFlops/watt. It was designed by the National Research Center of Parallel
Computer Engineering & Technology (NRCPC) and is located at the National
Supercomputing Center in Wuxi, Jiangsu province, China.
Block Diagram:
Software Used:
The system runs on its own operating system, Sunway RaiseOS 2.0.5, which is based
on Linux. The system has its own customized implementation of OpenACC 2.0 to aid the
parallelization of code.
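As a rough illustration of the directive-based style that OpenACC enables, a generic OpenACC 2.0 loop in C is sketched below. It is not taken from Sunway's customized implementation, which extends the standard to target the CPE clusters; it only shows the kind of annotation programmers write.

```c
/* Generic OpenACC 2.0 example (not Sunway-specific): offload a vector add.
 * Compile with an OpenACC-capable compiler (typically with an -acc flag).  */
#include <stdio.h>

#define N 1000000

int main(void) {
    static float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* The directive asks the compiler to parallelize the loop on the
     * accelerator and to manage data movement for a, b, and c.        */
    #pragma acc parallel loop copyin(a[0:N], b[0:N]) copyout(c[0:N])
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[0] = %f\n", c[0]);
    return 0;
}
```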
Functional Units
The Sunway TaihuLight supercomputer: an overview
The Sunway TaihuLight supercomputer is hosted at the National Supercomputing
Center in Wuxi (NSCC-Wuxi), which operates as a collaboration center between the City of Wuxi,
Jiangsu Province, and Tsinghua University. NSCC-Wuxi focuses on the development needs of
technological innovation and industrial upgrading around Jiangsu Province and the Yangtze
River Delta economic circle, as well as the demands of the national key strategies on science
and technology development.
The SW26010 many-core processor:
One major technology innovation of the Sunway TaihuLight supercomputer is the homegrown
SW26010 many-core processor. The general architecture of the SW26010 processor [10] is
shown in Figure 2. The processor includes four core-groups (CGs). Each CG includes one
management processing element (MPE), one computing processing element (CPE) cluster with
eight by eight CPEs, and one memory controller (MC). These four CGs are connected via the
network on chip (NoC). Each CG has its own memory space, which is connected to the MPE and
the CPE cluster through the MC. The processor connects to other outside devices through a
system interface (SI). The MPE is a complete 64-bit RISC core, which can run in both the user
and system modes. The MPE completely supports the interrupt functions, memory
management, superscalar processing, and out-of-order execution. Therefore, the MPE is an ideal
core for handling management and communication functions. In contrast, the CPE is also a 64-
bit RISC core, but with limited functions. The CPE can only run in user mode and does not
support interrupt functions. The design goal of this element is to achieve the maximum
aggregated computing power, while minimizing the complexity of the micro-architecture. The
CPE cluster is organized as an eight by eight mesh, with a mesh network to achieve low-latency
register data communication among the eight by eight CPEs. The mesh also includes a mesh
controller that handles interrupt and synchronization controls. Both the MPE and CPE support
256-bit vector instructions.
Subcomponent systems of the Sunway TaihuLight
In this section, we provide more detail about the various subcomponent systems of the
Sunway TaihuLight, specifically the computing, network, peripheral, maintenance and
diagnostic, power and cooling, and the software systems.
The computing system
Aiming for a peak performance of 125 PFlops, the computing system of the Sunway
TaihuLight is built using a fully customized integration approach with a number of different
levels: (1) computing node (one CPU per computing node); (2) super node (256
computing nodes per super node); (3) cabinet (4 super nodes per cabinet); and (4) the
entire computing system (40 cabinets). The computing nodes are the basic units of the
computing system, and include one SW26010 processor, 32 GB memory, a node
management controller, power supply, interface circuits, etc. Groups of 256 computing
nodes are integrated into a tightly coupled super node using a fully connected crossing
switch, so as to support computationally-intensive, communication-intensive, and I/O-
intensive computing jobs.
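The node and core counts implied by this hierarchy can be checked with simple arithmetic, as sketched below; the 260 cores per node follow from the 4 CGs × (1 MPE + 64 CPEs) described earlier, and the resulting totals match the TOP500 core count quoted in the comparison table at the end of this report.

```c
#include <stdio.h>

/* Arithmetic check of the TaihuLight hierarchy described in the text. */
int main(void) {
    int nodes_per_supernode   = 256;
    int supernodes_per_cabinet = 4;
    int cabinets              = 40;
    int cores_per_node        = 4 * (1 + 64);   /* 4 CGs x (1 MPE + 64 CPEs) = 260 */

    int nodes  = nodes_per_supernode * supernodes_per_cabinet * cabinets;
    long cores = (long)nodes * cores_per_node;

    printf("Compute nodes: %d\n", nodes);        /* 40,960      */
    printf("Total cores:   %ld\n", cores);       /* 10,649,600  */
    return 0;
}
```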
The network system
The network system consists of three different levels, with the central switching network
at the top, super node network in the middle, and resource-sharing network at the bottom.
The bisection network bandwidth is 70 TB/s, with a network diameter of 7. Each super
node includes 256 Sunway processors that are fully connected by the super node
network, which achieves both high bandwidth and low latency for all-to-all
communications among the entire 65536 processing elements. The central switching
network is responsible for building connections and enabling data exchange between
different super nodes. The resource-sharing network connects the sharing resources to
the super nodes, and provides services for I/O communication and fault tolerance of the
computing nodes.
The peripheral system
The peripheral system consists of the network storage system and the
peripheral management system. The network storage system includes both the storage
network and storage disk array, providing a total storage of 20 PB and a high-speed and
reliable data storage service for the computing nodes. The peripheral management
system includes the system console, management server, and management network,
which enable system management and service.
The power supply system and cooling system
The TaihuLight supercomputer uses a mutual-backup power input of 2 × 35 kV. The
cabinets of the system use a three-level (300 V-12 V-0.9 V) DC power supply mode.
The front-end power supply output is 300 V, which is directly linked to the cabinet. The
main power supply of the cabinet converts 300 V DC to 12 V DC, and the CPU power
supply converts 12 V into the voltage that the CPU needs. The cabinets of the
computing and network systems use indirect water cooling, while the peripheral devices
use air and water exchange, and the power system uses forced air cooling. The
cabinets use closed-loop, indirect parallel-flow water cooling with static hydraulic
pressure, which provides effective cooling for the full-scale Linpack run.
5. Tianhe-2A Supercomputer
Introduction:
Tianhe-2 was the world's fastest supercomputer according to the TOP500 lists for June 2013,
November 2013, June 2014, November 2014, June 2015, and November 2015. The
record was surpassed in June 2016 by the Sunway TaihuLight. In 2015, plans of the Sun
Yat-sen University in collaboration with Guangzhou district and city administration to
double its computing capacities were stopped by a U.S. government rejection of Intel's
application for an export license for the CPUs and coprocessor boards.
In response to the U.S. sanction, China introduced the Sunway TaihuLight supercomputer in
2016, which substantially outperforms the Tianhe-2 (the sanction also prompted the upgrade of
Tianhe-2 to Tianhe-2A, replacing US technology); TaihuLight now ranks fourth in the TOP500 list
while using completely domestic technology, including the Sunway manycore microprocessor.
Block Diagram:
Software Used:
Tianhe-2 ran on Kylin Linux, a Linux distribution developed by NUDT.
Functional Unit:
System Architecture & Compute Blade
The original TH-2 compute blade consisted of two nodes split into two modules: (1) the
Computer Processor Module (CPM) module and (2) the Accelerator Processor Unit
(APU) module (Figure 5). The CPM contained four Ivy Bridge CPUs, memory, and one
Xeon Phi KNC accelerator, and the APU contained five Xeon Phi KNC accelerators.
Connections from the Ivy Bridge CPUs to each of the KNC accelerators are made
through a ×16 PCI Express 2.0 multiboard with 10 Gbps of bandwidth. The actual
design and implementation of the board supports PCI Express 3.0, but the Xeon Phi
KNC accelerator only supports PCI Express 2.0. There was also a PCI Express
connection for the network interface controller (NIC).
With the upgraded TH-2A, the Intel Xeon Phi KNC accelerators have been replaced.
The CPM module still has four Ivy Bridge CPUs but is no longer housing an accelerator.
The APU now houses four Matrix-2000 accelerators instead of the five Intel Xeon Phi
KNC accelerators. So, in the TH-2A, the compute blade has two heterogeneous
compute nodes, and each compute node is equipped with two Intel Ivy Bridge CPUs
and two proprietary Matrix-2000 accelerators. Each node has 192 GB memory, and a
peak performance of 5.3376 Tflop/s. The Intel Ivy Bridge processors have not been
changed and are the same as in the original TH-2. Each of the Intel Ivy Bridge CPU’s 12
compute cores can perform 8 FLOPs per cycle per core, which results in 211.2 Gflop/s
total peak performance per socket (12 cores × 8 FLOPs per cycle × 2.2 GHz clock). The
logical structure of the compute node is shown in Figure 6. The two Intel Ivy Bridge
CPUs are linked using two Intel Quick Path Interconnects (QPI). Each CPU has four
memory channels with eight dual in-line memory module (DIMM) slots. CPU0 expands
its I/O devices using Intel’s Platform Controller Hub (PCH) chipset and connects with a
14G proprietary NIC through a ×16 PCI Express 3.0 connection. Each CPU also uses a
×16 PCI Express 3.0 connection to access the Matrix-2000 accelerators. Each
accelerator has eight memory channels. In a compute node, the CPUs are equipped
with 64 GB of DDR3 memory, while the accelerators are equipped with 128 GB of
DDR4 memory. With 17,792 compute nodes, the total memory capacity of the whole
system is 3.4 PB.
The TH-2A compute blade is composed of two parts: the CPM and the APU.
The CPM integrates four Ivy Bridge CPUs, and the APU integrates four Matrix-2000
accelerators. Each compute blade contains two heterogeneous compute nodes.
As stated earlier, the peak performance of each Ivy Bridge CPU is 211.2 Gflop/s, and
the peak performance of each Matrix-2000 accelerator is 2.4576 Tflop/s. Thus, the peak
performance of each compute node can be calculated as (0.2112 Tflop/s × 2) + (2.4576
Tflop/s × 2) = 5.3376 Tflop/s. With 17,792 compute nodes, the peak performance of the
whole system is 94.97 Pflop/s (5.3376 Tflop/s × 17,792 nodes = 94.97 Pflop/s total).
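The peak-performance arithmetic above can be reproduced directly, as in the short check below; all numbers are taken from the text.

```c
#include <stdio.h>

/* Reproduces the TH-2A peak-performance arithmetic given in the text. */
int main(void) {
    double cpu_peak  = 12 * 8 * 2.2;                  /* cores x FLOPs/cycle x GHz = 211.2 Gflop/s */
    double acc_peak  = 2457.6;                        /* Matrix-2000 accelerator, Gflop/s          */
    double node_peak = 2 * cpu_peak + 2 * acc_peak;   /* 5,337.6 Gflop/s = 5.3376 Tflop/s          */
    double sys_peak  = node_peak * 17792 / 1.0e6;     /* Gflop/s -> Pflop/s                        */

    printf("Node peak:   %.4f Tflop/s\n", node_peak / 1000.0);
    printf("System peak: %.2f Pflop/s\n", sys_peak);
    return 0;
}
```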
6. Frontera Supercomputer
Introduction:
In August 2018, Dell EMC and Intel announced intentions to jointly design Frontera, an
academic supercomputer funded by a $60 million grant from the National Science
Foundation that would replace Stampede2 at the University of Texas at Austin’s Texas
Advanced Computing Center (TACC). Those plans came to fruition in June 2019, when the two
companies deployed Frontera, which was formally unveiled in September 2019.
Intel claims that Frontera can achieve peak performance of 38.7 quadrillion floating point
operations per second, or petaflops, making it the world’s fastest computer designed for
academic workloads like modeling and simulation, big data, and machine learning. (That’s
compared with Stampede2’s peak performance of 18 petaflops.) Earlier this year,
Frontera earned the fifth spot on the twice-annual Top500 list with 23.5 petaflops on the
LINPACK benchmark, which ranks the world’s most powerful non-distributed computer
systems.
Block Diagram:
Software Used:
With a peak-performance rating of 38.7 petaFLOPS, the supercomputer is about twice as
powerful as TACC's Stampede2 system, which is currently the 19th fastest
supercomputer in the world. Dell EMC provided the primary computing system for
Frontera, based on Dell EMC PowerEdge™ C6420 servers.
Functional Unit
The Frontera system will provide academic researchers with the ability to handle artificial
intelligence-related jobs of a complexity that has not been possible before. "With the integration
of many Intel technologies, this supercomputer opens up a wealth of new possibilities in
scientific and technical research, fostering deeper understanding of complex issues related to
space research, cures, energy needs, and artificial intelligence," said Trish Damkroger, Intel
vice president and general manager.
Thousands of 2nd-generation Xeon Scalable processors with up to 28 cores ("Cascade Lake"),
housed in Dell EMC PowerEdge servers, handle Frontera's heavy computing tasks, alongside Nvidia
GPU nodes that provide single-precision computation. Frontera's processor architecture makes use
of Intel's Advanced Vector Extensions 512 (AVX-512); AVX-512 is an instruction-set extension
that doubles the number of FLOPS per clock cycle compared to the previous generation.
Another important part of a supercomputer is the cooling system. Frontera uses liquid cooling
for most of its nodes, with water- and oil-based cooling from CoolIT and Green Revolution
Cooling. The system uses Mellanox HDR and HDR-100 connections to transfer data at up to
200 Gb/s on each link between the switches that connect the 8,008 nodes across the system.
Each rack is expected to draw about 65 kilowatts of electricity, about one third of which TACC
sources from wind and solar power to save costs.
In terms of storage, Frontera has four different environments designed and built by
DataDirect Networks, with a total of more than 50 petabytes, paired with 3 petabytes
of NAND flash (equivalent to about 480 GB of SSD storage on each node). The storage
system also has extremely fast connectivity, with a speed of up to 1.5 terabytes per
second.
Finally, Frontera also makes use of Intel Optane DC, the non-volatile memory technology
developed by Intel and Micron Technology, which is pin-compatible with DDR4 and acts as a
large memory tier alongside a smaller DRAM pool (192 GB per node), thereby improving
performance. Intel Optane DC on Frontera is combined with the latest-generation Xeon
Scalable processors, delivering up to 287,000 operations per second, compared to 3,116
operations per second for conventional DRAM-based systems; with such equipment,
Frontera's reboot time takes only 17 seconds.
Basic specifications of Frontera supercomputer
Basic calculation system
The configuration of each node in Frontera is described as follows (Frontera has 8,008
available nodes):
 Processor: Intel Xeon Platinum 8280 ("Cascade Lake"); number of cores: 28
per socket, 56 per node; clock rate: 2.7 GHz (base frequency)
 Maximum node performance: 4.8 TF, double precision
 RAM: DDR4, 192 GB/node
 Local drive: 480 GB SSD/node
 Network: Mellanox InfiniBand, HDR-100
Subsystems
Liquid submerged system:
 Processor: 360 NVIDIA Quadro RTX 5000 GPUs
 RAM: 128 GB/node
 Cooling: GRC ICEraQ™ system
 Network: Mellanox InfiniBand, HDR-100
 Maximum performance: 4 PF single precision
Longhorn:
 Processor: IBM POWER9-hosted system with 448 NVIDIA V100 GPUs
 RAM: 256 GB/node
 Storage: 5-petabyte filesystem
 Network: InfiniBand EDR
 Maximum performance: 3.5 PF double precision; 7.0 PF single precision
7. Piz Daint
Introduction:
Piz Daint is a supercomputer in the Swiss National Supercomputing Centre, named after
the mountain Piz Daint in the Swiss Alps.
It was ranked 8th on the TOP500 ranking of supercomputers until the end of 2015, higher
than any other supercomputer in Europe. At the end of 2016, the computing performance
of Piz Daint was tripled to reach 25 petaflops; it thus became the third most powerful
supercomputer in the world. As of November 2021, Piz Daint is ranked 20th on
the TOP500. The original Piz Daint Cray XC30 system was installed in December
2012. This system was extended with Piz Dora, a Cray XC40 with 1,256 compute nodes,
in 2013.[9] In October 2016, Piz Daint and Piz Dora were upgraded and combined into
the current Cray XC50/XC40 system featuring Nvidia Tesla P100 GPUs.
Block Diagram:
Software Used:
Architecture: Intel Xeon E5-26xx (various), Nvidia Tesla P100
Operating system: Linux (Cray Linux Environment, CLE)
8. Trinity Supercomputer
Introduction:
Trinity (or ATS-1) is a United States supercomputer built by the National Nuclear Security
Administration (NNSA) for the Advanced Simulation and Computing
Program (ASC).[2] The aim of the ASC program is to simulate, test, and maintain the
United States nuclear stockpile.
Block Diagram:
Software Used:
Trinity uses a Sonexion-based Lustre file system with a total capacity of 78 PB.
Throughput on this tier is about 1.8 TB/s (1.6 TiB/s). It is used to stage data in preparation
for HPC operations. Data residence in this tier is typically several weeks.
Functional Unit
Trinity is a Cray XC40 supercomputer, with delivery over two phases; phase 1 is based
on Intel Xeon Haswell compute nodes, and phase 2 will add Intel Xeon Phi Knights
Landing (KNL) compute nodes.
Phase 1 was delivered and accepted in the latter part of 2016, and consists of 54
cabinets, including multiple node types. Foremost are 9436 Haswell-based compute
nodes, delivering ~1 PiB of memory capacity and ~11 PF/s of peak performance. Each
Haswell compute node features two 16-core Haswell processors operating at 2.3 GHz,
along with 128 GiB of DDR4-2133 memory, spread across 8 channels (4 per CPU).
Phase 1 also includes 114 Lustre router nodes and 300 burst buffer
nodes. Trinity utilizes a Sonexion-based Lustre filesystem with 78 PB of
usable storage and approximately 1.6 TB/s of bandwidth. However, due to the limited
number of Lustre router nodes in Phase 1, only about half of this bandwidth is currently
achievable. Phase 1 also includes all of the other typical service nodes: 2 boot, 2 SDB,
2 UDSL, 6 DVS, 12 MOM, and 10 RSIP. Additionally, Trinity utilizes 6 external login
nodes. Phase 2 is scheduled to begin delivery in mid-2016. It adds more than 9500
Xeon Phi Knights Landing (KNL) based compute nodes. Each KNL compute node
consists of a single KNL with 16 GiB of on-package memory and 96 GiB of DDR4-2400
memory. It has a peak performance of approximately 3 TF/s. In total, the KNL nodes
add ~1 PiB of memory capacity and ~29 PF/s peak performance. In addition to the
KNLs, Phase 2 also adds the balance of the Lustre router nodes (108 additional, total of
222) and burst buffer nodes (276 additional, total of 576). When all burst buffer nodes
are installed, they will provide 3.69 PB of raw storage capacity and 3.28 TB/s of
bandwidth.
BURST BUFFER INTEGRATION AND PERFORMANCE
1. Design
Trinity includes the first large scale instance of on-platform burst buffers using the Cray
DataWarp® product. The Trinity burst buffer is provided in two phases along with the
two phases of Trinity. The phase 1 burst buffer consists of 300 DataWarp nodes. This is
expanded to 576 DataWarp nodes by phase 2. In this section, unless otherwise noted,
the phase 1 burst buffer will be described. The 300 DataWarp nodes are built from Cray
service nodes, each with a 16-core Intel Sandy Bridge processor and 64 gigabytes of
memory. Storage on each DataWarp node is provided by two Intel P3608 Solid State
Drive (SSD) cards. The DataWarp nodes use the Aries high speed network for
communications with the Trinity compute nodes and for communications with the Lustre
Parallel File System (PFS) via the LNET router nodes. Each SSD card has 4 TB of
capacity and is attached to the service node via a PCI-E x4 interface. The SSD cards
are overprovisioned to improve the endurance of the card from the normal 3 Drive
Writes Per Day (DWPD) over 5 years to 10 DWPD over 5 years. This reduces the
available capacity of each card. The total usable capacity of the 300 DataWarp nodes is
1.7 PiB. The DataWarp nodes run a Cray provided version of Linux together with a
DataWarp specific software stack consisting of an enhanced Data Virtualization Service
(DVS) server and various configuration and management services. The DataWarp
nodes also provide a staging function that can be used to asynchronously move data
between the PFS and DataWarp. There is a centralized DataWarp registration service
that runs on one of the Cray System Management nodes. Compute nodes run a DVS
client that is enhanced to provide support for DataWarp. The DataWarp resources can
be examined and controlled via several DataWarp specific command line interface (CLI)
utilities that run on any of the system’s nodes. DataWarp can be configured to operate
in a number of different modes. The primary use case at ACES is to support checkpoint
and analysis files; these are supported by the striped scratch mode of DataWarp.
Striped scratch provides a single file name space that is visible to multiple compute
nodes with the file data striped across one or more DataWarp nodes. A striped private
mode is additionally available. In the future, paging space and cache modes may be
provided. This section will discuss LANL’s experience with striped scratch mode. A
DataWarp allocation is normally configured by job script directives. Trinity uses the
Moab Work Load Manager (WLM). The WLM reads the job script at job submission time
and records the DataWarp directives for future use. When the requested DataWarp
capacity is available, the WLM will start the job. Prior to the job starting, the WLM uses
DataWarp CLI utilities to request instantiation of a DataWarp allocation and any
requested stage-in of data from the PFS. After the job completes, the WLM requests
stage-out of data and then frees the DataWarp allocation. The stage-in and stage-out
happen without any allocated compute nodes or any compute node involvement. The
DataWarp allocation is made accessible via mount on only the compute nodes of the
requesting job. Unix file permissions are effective for files in DataWarp and are
preserved by stage-in and stage-out operations. A DataWarp allocation is normally only
available for the life of the requesting job, with the exception of a persistent DataWarp
allocation that may be accessed by multiple jobs, possibly simultaneously.
Simultaneous access by multiple jobs is used to support in-transit data visualization and
analysis use cases.
2. Integration
Correct operation of DataWarp in conjunction with the WLM was achieved after several
months of extended integration testing on-site at LANL. Numerous fixes and functional
enhancements have improved the stability and usability of the DataWarp feature on
Trinity. Owing to the time this effort required, production use of DataWarp has been
limited as of late April 2016.
3. Performance
All performance measurements were conducted with IOR. The runs were made with:
 one reader or writer process per node
 32 GiB total data read or written per node
 256, 512, or 1024 KiB block size
 node counts from 512 to 4096
 the DataWarp allocation striped across all 300 DataWarp nodes
These characteristics were selected to approximate the IO patterns expected when
applications use the HIO library. Additional investigation and optimization of IO
characteristics is needed.
9. AI Bridging Cloud Infrastructure
Introduction:
AI Bridging Cloud Infrastructure (ABCI) is a planned supercomputer being built at
the University of Tokyo for use in artificial intelligence, machine learning, and deep
learning. It is being built by Japan's National Institute of Advanced Industrial Science and
Technology. ABCI is expected to be completed in the first quarter of 2018 with a planned
performance of 130 petaFLOPS. The power consumption target is 3 megawatts, with a
planned power usage effectiveness of 1.1. If performance meets expectations, ABCI
would be the second most powerful supercomputer built, surpassing the then-leader
Sunway TaihuLight's 93 petaflops but still remaining behind Summit.
Block Diagram:
Software Used:
Along with Docker, Singularity and other tools, Univa Grid Engine plays a key role in
ABCI’s software stack, ensuring that workloads run as efficiently as possible.
Functional Units
The ABCI prototype, which was installed in March, consisted of 50 two-socket
“Broadwell” Xeon E5 servers, each equipped with 256 GB of main memory, 480 GB of
SSD flash memory, and eight of the Tesla P100 GPU accelerators in the SXM2 form
factor hooked to each other using the NVLink 1.0 interconnect. Another 68 nodes were
just plain vanilla servers, plus two nodes for interactive management and sixteen nodes
for other functions on the cluster. The system was configured with 4 PB of SF14K
clustered disk storage from DataDirect Networks running the GRIDScaler
implementation of IBM’s GPFS parallel file system, and the whole shebang was
clustered together using 100 Gb/sec EDR InfiniBand from Mellanox Technologies,
specifically 216 of its CS7250 director switches. Among the many workloads running on
this cluster was the Apache Spark in-memory processing framework.
The goal with the real ABCI system was to deliver a machine with somewhere between
130 petaflops and 200 petaflops of AI processing power, which means half precision
and single precision for the most part, with a power usage effectiveness (PUE) of
somewhere under 1.1, which is a ratio of the energy consumed for the datacenter
compared to the compute complex that does actual work. (This is about as good as
most hyperscale datacenters, by the way.) The system was supposed to have about 20
PB of parallel file storage and, with the compute, storage, and switching combined, burn
under 3 megawatts of juice.
The plan was to get the full ABCI system operational by the fourth quarter of 2017 or the
first quarter of 2018, and this obviously depended on the availability of the compute and
networking components. Here is how the proposed ABCI system was stacked up
against the K supercomputer at the RIKEN research lab in Japan and the Tsubame 3.0
machine at the Tokyo Institute of Technology:
The K machine, which is based on the Sparc64 architecture and which was the first
machine in the world to break the 10 petaflops barrier, will eventually be replaced by a
massively parallel ARM system using the Tofu interconnect made for the K system and
subsequently enhanced. The Oakforest-PACS machine, built by the University of Tokyo and
the University of Tsukuba, is based on a mix of "Knights Landing" Xeon Phi processors and
Omni-Path interconnect from Intel, and weighs in at 25 petaflops peak double precision.
It is not on this comparison table of big Japanese supercomputers. But perhaps it
should be.
While the Tsubame 3.0 machine is said to focus on double precision performance, the
big difference is really that the Omni-Path network hooking all of the nodes together in
Tsubame 3.0 was configured to maximize extreme injection bandwidth and to have very
high bi-section bandwidth across the network. The machine learning workloads that are
expected to run on ABCI are not as sensitive to these factors and, importantly, the idea
here is to build something that looks more like a high performance cloud datacenter that
can be replicated in other facilities, using standard 19-inch equipment rather than the
specialized HPE and SGI gear that TiTech has used in the Tsubame line to date. In the
case of both Tsubame 3.0 and ABCI, the thermal density of the compute and switching
is in the range of 50 kilowatts to 60 kilowatts per rack, which is a lot higher than the 3
kilowatts to 6 kilowatts per rack in a service provider datacenter, and the PUE of under
1.1 is a lot lower than the 1.5 to 3.0 rating of a typical service provider datacenter. (The
hyperscalers do a lot better than this average, obviously.)
This week, AIST awarded the job of building the full ABCI system to Fujitsu, and nailed
down the specs. The system will be installed at a shiny new datacenter at the Kashiwa
II campus of the University of Tokyo, and is now going to start operations in Fujitsu’s
fiscal 2018, which begins next April.
The ABCI system will be comprised of 1,088 of Fujitsu’s Primergy CX2570 server
nodes, which are half-width server sleds that slide into the Primergy CX400 2U chassis.
Each sled can accommodate two Intel “Skylake” Xeon SP processors, and in this case
AIST is using a Xeon SP Gold variant, presumably with a large (but not extreme)
number of cores. Each node is equipped with four of the Volta SXM2 GPU accelerators,
so the entire machine has 2,176 CPU sockets and 4,352 GPU sockets. The use of the
SXM2 variants of the Volta GPU accelerators requires liquid cooling because they run a
little hotter, but the system has an air-cooled option for the Volta accelerators that hook
into the system over the PCI-Express bus. The off-the-shelf models of the CX2570
server sleds also support the lower-grade Silver and Bronze Xeon SP processors as
well as the high-end Platinum chips, so AIST is going in the middle of the road. There
are Intel DC 4600 flash SSDs for local storage on the machine. It is not clear who won
the deal for the GPFS file system for this machine, and if it came in at 20 PB as
expected.
Fujitsu says that the resulting ABCI system will have 37 petaflops of aggregate peak
double precision floating point oomph, and will be rated at 550 petaflops, of which 525
petaflops comes from using the 16-bit Tensor Core units that were created
explicitly to speed up machine learning workloads. That is a lot more deep learning
performance than was planned, obviously.
AIST has amassed $172 million to fund the prototype and full ABCI machines as well as
build the new datacenter that will house this system.
About $10 million of that funding is for the datacenter, which had its ground breaking
this summer. The initial datacenter setup has a maximum power draw of 3.25
megawatts, and it has 3.2 megawatts of cooling capacity, of which 3 megawatts come
from a free cooling tower assembly and another 200 kilowatts comes from a chilling
unit. The datacenter has a single concrete slab floor, which is cheap and easy, and will
start out with 90 racks of capacity – that’s 18 for storage and 72 for compute – with
room for expansion.
10. SuperMUC-NG Supercomputer
Introduction:
SuperMUC was a supercomputer of the Leibniz Supercomputing Centre (LRZ) of
the Bavarian Academy of Sciences. It was housed in the LRZ's data centre
in Garching near Munich. It was decommissioned in January 2020, having been
superseded by the more powerful SuperMUC-NG. SuperMUC was the fastest European
supercomputer when it entered operation in the summer of 2012 and was at one point
ranked #20 in the Top500 list of the world's fastest supercomputers. SuperMUC served
European researchers in many fields, including medicine, astrophysics, quantum
chromodynamics, computational fluid dynamics, computational chemistry, life sciences,
genome analysis, and earthquake simulations.
Block Diagram:
Software Used:
Operating system (thin compute nodes): SUSE Linux Enterprise Server (SLES)
Batch scheduling system: SLURM
High-performance parallel filesystem: IBM Spectrum Scale (GPFS)
Programming environment: Intel Parallel Studio XE, GNU compilers
Functional Unit
Just like the CoolMUC-2, the SuperMUC-NG is located at the Leibniz Supercomputing
Centre in Germany and was built by Lenovo. The system has 311,040 physical cores
and a main memory of 719 TB, resulting in a peak performance of 26.9 PFlop/s. A
fat-tree is used as the network topology, and the bandwidth is 100 Gb/s using Intel's
Omni-Path interconnect [25]. The CPUs used in this system are Intel's Skylake Xeon Platinum
8174, with 24 cores clocked at 3.1 GHz [24]. The SuperMUC-NG was designed as a
general-purpose supercomputer to support applications of all scientific domains like life
sciences, meteorology, geophysics and climatology. The most dominant scientific
domain using LRZ’s supercomputers is Astrophysics. Recently they also made their
resources available for COVID-19 related research.
Fine grain: single processor core
• instruction parallelism
• multiple floating-point units
• SIMD-style parallelism: single instruction, multiple data
Medium grain: multi-core / multi-socket system
• independent processes or threads perform calculations on a shared memory area
Coarse grain: interconnected (independent) systems
• explicitly programmed data transfers between nodes of the system
• fulfill high memory requirements
Levels of Parallelism
Examples:
● Node Level (e.g. SuperMUC has approx. 10,000 nodes)
● Accelerator Level (e.g. SuperMIC has 2 CPUs and 2 Xeon Phi Accelerators)
● Socket Level (e.g. fat nodes have 4 CPU Sockets)
● Core Level (e.g. SuperMUC Phase 2 has 14 cores per CPU)
● Vector Level (e.g. AVX2 has 16 vector registers per core)
● Pipeline Level (how many simultaneous pipelines)
● Instruction Level (instructions per cycle)
Getting data from:                    Getting some food from:
CPU register     1 ns                 fridge                 10 s
L2 cache         10 ns                microwave              100 s
memory           80 ns                pizza service          800 s
network (IB)     200 ns               city mall              2,000 s
GPU (PCIe)       50,000 ns            mum sends cake         500,000 s
hard disk        500,000 ns           grown in own garden    5,000,000 s
Fine grain parallelism
On the CPU level, operations are composed from elementary instructions (instructions are
sent to one of the ports of the execution unit):
• load operand(s)
• perform arithmetic operations
• store result
• increment loop count / check for loop exit
• branching / jumping
Efficient if:
• good mix of instructions
• arguments are available from memory
• low data dependencies
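A trivial loop makes this instruction mix concrete; the comments in the sketch below indicate which elementary operations each part of the statement corresponds to.

```c
/* A simple kernel illustrating the elementary-instruction view above. */
void scale_add(int n, double alpha, const double *x, double *y) {
    for (int i = 0; i < n; i++) {     /* increment loop count / check for loop exit, branch */
        double xi = x[i];             /* load operand(s)                                     */
        double yi = y[i];             /* load operand(s)                                     */
        y[i] = yi + alpha * xi;       /* perform arithmetic (FMA), then store the result     */
    }
    /* Low data dependencies between iterations let the hardware pipeline and
     * SIMD-vectorize this loop, which is what fine-grain parallelism exploits. */
}
```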
Comparison Analysis of Supercomputers:

Rank  System                             Memory (GB)   Cores        Rmax (PFlop/s)  Rpeak (PFlop/s)  Power (kW)
1     Fugaku                             5,087,232     7,630,848    442.01          537.21           29,899
2     Summit                             2,801,664     2,414,592    148.60          200.79           10,096
3     Sierra                             1,382,400     1,572,480    94.64           125.71           7,438
4     Sunway TaihuLight                  1,310,720     10,649,600   93.01           125.44           15,371
5     Tianhe-2A                          2,277,376     4,981,760    61.44           100.68           18,482
6     Frontera                           1,537,536     448,448      23.5            38.7             6,000
7     Piz Daint                          365,056       387,872      21.23           27.15            2,272
8     Trinity                            0             979,072      20.16           41.46            8,000
9     AI Bridging Cloud Infrastructure   417,792       391,680      19.8            32.58            3,000
10    SuperMUC-NG                        75,840        305,856      19.4            26.83            2,000
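Two derived metrics make the table easier to compare: HPL efficiency (Rmax/Rpeak) and power efficiency (Rmax per watt). The short program below computes them for the top five systems from the values in the table; the power figures for the lower-ranked systems are rounded, so their ratios would be only approximate.

```c
#include <stdio.h>

/* Derived metrics from the comparison table: HPL efficiency and Gflops/W. */
typedef struct { const char *name; double rmax_pf, rpeak_pf, power_kw; } Sys;

int main(void) {
    Sys s[] = {
        {"Fugaku",           442.01, 537.21, 29899},
        {"Summit",           148.60, 200.79, 10096},
        {"Sierra",            94.64, 125.71,  7438},
        {"Sunway TaihuLight", 93.01, 125.44, 15371},
        {"Tianhe-2A",         61.44, 100.68, 18482},
    };
    int n = sizeof s / sizeof s[0];
    for (int i = 0; i < n; i++) {
        double hpl_eff  = 100.0 * s[i].rmax_pf / s[i].rpeak_pf;          /* percent       */
        double gflops_w = s[i].rmax_pf * 1.0e6 / (s[i].power_kw * 1e3);  /* Gflop/s per W */
        printf("%-18s  HPL efficiency %5.1f %%   %6.2f Gflops/W\n",
               s[i].name, hpl_eff, gflops_w);
    }
    return 0;
}
```

The Sunway TaihuLight result (about 6.05 Gflops/W) matches the Green500 efficiency quoted earlier in this report, which is a useful sanity check on the table values.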
Thank You
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 

Top 10 Supercomputers With Descriptive Information & Analysis

  • 1. Submitted by: NOMAN SIDDIQUI SEC: A (Evening) Seat No.: EB21102087 3rd Semester (BSCS) Assignment Report: Top 10 Supercomputers With Descriptive Information & Analysis Submitted To: SIR KHALID AHMED Department of Computer Science - (UBIT) UNIVERSITY OF KARACHI
  • 2. Top 10 Supercomputers Report What is Supercomputer? A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions per second (MIPS). Since 2017, there are supercomputers which can perform over 1017 FLOPS (a hundred quadrillion FLOPS, 100 petaFLOPS or 100 PFLOPS Supercomputers play an important role in the field of computational science, and are used for a wide range of computationally intensive tasks in various fields, including quantum mechanics, weather forecasting, climate research, oil and gas exploration, molecular modeling (computing the structures and properties of chemical compounds, biological macromolecules, polymers, and crystals), and physical simulations (such as simulations of the early moments of the universe, airplane and spacecraft aerodynamics, the detonation of nuclear weapons, and nuclear fusion). They have been essential in the field of cryptanalysis. 1. The Fugaku Supercomputer Introduction: Fugaku is a petascale supercomputer (while only at petascale for mainstream benchmark), at the Riken Center for Computational Science in Kobe, Japan. It started development in 2014 as the successor to the K computer, and started operating in 2021. Fugaku made its debut in 2020, and became the fastest supercomputer in the world in the June 2020 TOP500 list, as well as becoming the first ARM architecture-based computer to achieve this. In June 2020, it achieved 1.42 exaFLOPS (in HPL-AI benchmark making it the first ever supercomputer that achieved 1 exaFLOPS. As of November 2021, Fugaku is the fastest supercomputer in the world. It is named after an alternative name for Mount Fuji.
  • 3. Block Diagram: Functional Units: Functional Units, Co-Design and System for the Supercomputer “Fugaku” 1. Performance estimation tool: This tool, taking Fujitsu FX100 (FX100 is the previous Fujitsu supercomputer) execution profile data as an input, enables the performance projection by a given set of architecture parameters. The performance projection is modeled according to the Fujitsu microarchitecture. This tool can also estimate the power consumption based on the architecture model. 2. Fujitsu in-house processor simulator: We used an extended FX100 SPARC instruction- set simulator and compiler, developed by Fujitsu, for preliminary studies in the initial phase, and an Armv8þSVE simulator and compiler afterward. 3. Gem5 simulator for the Post-K processor: The Post-K processor simulator3 based on an opensource system-level processor simulator, Gem5, was developed by RIKEN during the co-design for architecture verification and performance tuning. A fundamental problem is the scale of scientific applications that are expected to be run on Post-K. Even our target applications are thousands of lines of code and are written to use complex algorithms and data structures. Although the processor simulators are capable of providing very accurate performance results at the cycle level, they are very slow and are limited to execution on a single processor without MPI communications between the nodes. Our performance estimation tool is useful since it enables performance analysis based on the execution profile taken from an actual run on the FX100 hardware. It has a
rich set of performance counters, including busy cycles for read/write memory access, busy cycles for L1/L2 cache access, busy cycles of floating-point arithmetic, and cycles for instruction commit. These features enable the performance projection for a new set of hardware parameters by changing the busy cycles of the functional blocks. The breakdown of the execution time (in cycles) can be calculated by summing the busy cycles of each functional block in the pipeline according to the processor microarchitecture. Since the execution time is estimated by a simple formula modeling the pipeline, it can be applied to a region of uniform behavior such as a kernel loop.
The first step of performance analysis is to identify kernels in each target application and insert the library calls to get the execution profile. The total execution time is calculated by summing the estimated execution time of each kernel using the performance estimation tool with a given set of architecture parameters. We repeated this process, changing several architecture parameters, for design-space exploration. Some important kernels were extracted as independent programs, so that they could be executed by the cycle-level processor simulators for more accurate analysis. Since the performance estimation tool is not able to take the impact of the out-of-order (O3) resources into account, the Fujitsu in-house processor simulator was used to analyze a new instruction set and the effect of changing the O3 resources. These kernels were also used with the processor emulator for logic-design verification.
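The busy-cycle model described above can be illustrated with a small sketch. This is not the actual Fujitsu/RIKEN tool; it is a minimal illustration in C, assuming hypothetical per-kernel counter values and simple scaling factors for the architecture parameters being explored.

```c
#include <stdio.h>

/* Per-kernel busy-cycle counters, as collected from an execution profile
   (counter categories follow the list above; the values are made up). */
typedef struct {
    const char *name;
    double mem_busy;    /* busy cycles for read/write memory access */
    double cache_busy;  /* busy cycles for L1/L2 cache access       */
    double fp_busy;     /* busy cycles of floating-point arithmetic */
    double commit;      /* cycles for instruction commit            */
} kernel_profile;

/* Architecture parameters modeled as scaling factors per functional block,
   e.g. doubling memory bandwidth roughly halves memory-busy cycles. */
typedef struct {
    double mem_scale, cache_scale, fp_scale, commit_scale;
} arch_params;

/* Estimated kernel time = sum of the scaled busy cycles of each block. */
static double estimate_kernel_cycles(const kernel_profile *k, const arch_params *a)
{
    return k->mem_busy   * a->mem_scale
         + k->cache_busy * a->cache_scale
         + k->fp_busy    * a->fp_scale
         + k->commit     * a->commit_scale;
}

int main(void)
{
    /* Hypothetical profile of two kernels taken on the baseline machine. */
    kernel_profile kernels[] = {
        { "stencil", 6.0e9, 2.0e9, 3.0e9, 1.0e9 },
        { "matmul",  1.0e9, 1.5e9, 8.0e9, 0.5e9 },
    };
    /* Candidate design point: 2x memory bandwidth, 2x SIMD width. */
    arch_params candidate = { 0.5, 1.0, 0.5, 1.0 };

    double total = 0.0;
    for (int i = 0; i < 2; i++) {
        double c = estimate_kernel_cycles(&kernels[i], &candidate);
        printf("%-8s %.3g cycles\n", kernels[i].name, c);
        total += c;   /* total time = sum over kernels */
    }
    printf("total    %.3g cycles\n", total);
    return 0;
}
```

In the real co-design flow the scaling comes from the Fujitsu microarchitecture model rather than simple multipliers, but the structure is the same: per-block busy cycles are summed per kernel, and kernel estimates are summed to get the total.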
Co-Design of the Manycore Processor
Prior to the FLAGSHIP 2020 project, feasibility study projects were carried out from 2012 to 2013 to investigate the basic design. As a result, the basic architecture suggested by the feasibility study was a large-scale system using a general-purpose manycore processor with wide single-instruction/multiple-data (SIMD) arithmetic units. The choice of the instruction-set architecture was an important decision for the architecture design. Fujitsu offered the Armv8 instruction set with the Arm SIMD instruction set called the Scalable Vector Extension (SVE). The Arm instruction-set architecture has been widely accepted by software developers and users, not only for mobile processors but also, recently, for HPC. For example, Cavium ThunderX2 is an Arm processor designed for servers and HPC, and has been used for several supercomputer systems, including Astra and Isambard. SVE is an extended SIMD instruction set. Its most significant feature is vector-length-agnostic programming: as the name suggests, code does not depend on the vector length. We decided to have two 512-bit-wide SIMD arithmetic units, as suggested by the feasibility study. The processor is custom designed by Fujitsu using their microarchitecture as the backend of the processor core. Fujitsu proposed the basic structure of the manycore processor architecture according to their microarchitecture: each core has an L1 cache, and a cluster of cores shares an L2 cache and a memory controller. This cluster of cores is called a core-memory group (CMG). While other high-performance processors, such as those of Intel and AMD, have L1 and L2 caches in the core and share an L3 cache as a last-level cache, the core of our processor has only an L1 cache, to reduce the die size of the core.
Our technology target for silicon fabrication was 7-nm FinFET technology. The die size of the chip is the most dominant factor in terms of cost. It is known that the cost of a chip increases in proportion to its size, rises significantly beyond a certain size, and that the yield becomes worse as the chip grows. One configuration is to use small chips and connect them by multichip-module (MCM) technology; recently, AMD has used this "chiplet" approach successfully. The advantage of this approach is that a small chip can be relatively cheap with a good yield. However, at the time of the basic design, the cost of MCM was deemed too high, and a different kind of chip for the interconnect and I/O would have had to be made, resulting in even higher costs. The connections between chips on the MCM would also increase the power consumption. Thus, our decision was to use a single large die containing several CMGs, the network interface for the interconnect, and PCIe for I/O, connected by a network-on-chip. As a result, we decided to use 48 cores (plus four assistant cores), organized as 12 cores/CMG × 4 CMGs. The die fitted within about 400 mm², which was reasonable in terms of cost for 7-nm FinFET technology.
As the peak floating-point performance of the central processing unit (CPU) chip was expected to reach a few TFLOPS, the memory bandwidth of DDR4 was too low compared to the performance. Thus, high-speed memory technologies, such as HBM and the hybrid memory cube, were examined to balance the memory bandwidth and arithmetic performance. HBM is a stacked memory chip connected via TSVs on a silicon interposer. HBM2 provides a bandwidth of 256 GB/s per module, but its capacity is only up to 8 GiB, and the cost is high because a silicon interposer is required. As a memory technology available around 2019, HBM2 was chosen for its power efficiency and high memory bandwidth. We decided not to use any additional DDR memory, to reduce the cost. As described previously, the number of HBM2 modules attached to the CMGs is four, that is, the main memory capacity is 32 GiB. Although this seems small for certain applications, we already have many scalable applications developed for the K computer, and such scalable applications can increase the problem size by increasing the number of nodes used.
The key to designing a cache architecture is to provide a high hit rate for many applications and to prevent a bottleneck when data are supplied at full bandwidth from memory. We examined various parameters, such as the line size, the number of ways, and the capacity, in order to optimize the cache performance under the constraints of die area and power consumption. To decide the cache structure and size, we examined the impact of the cache configuration on performance by running some kernels extracted from target applications on the simulator for a single CMG. We designed the cache to save power when accessing data in a set-associative cache. The data reads from the ways and the tag search may be performed in parallel to reduce latency, but this can waste power because the data will not be used when the tag does not match. In our design, data access is performed after a tag match. While this causes a longer latency, there is less impact on performance for throughput-intensive HPC applications. This design was applied to the L1 cache for vector access and to the L2 cache, resulting in a reduction of power by 10% in HPL with almost no performance degradation.
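The tag-before-data policy described above can be made concrete with a small sketch of a set-associative cache lookup. This is purely illustrative C, not the actual A64FX design: the geometry and replacement details are invented for the example, and the point is only that the data array is read for at most one way, after the tag comparison.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

#define NUM_SETS  256            /* illustrative geometry, not the real one */
#define NUM_WAYS  4
#define LINE_SIZE 256            /* bytes per line */

typedef struct {
    bool     valid;
    uint64_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line;

static cache_line cache[NUM_SETS][NUM_WAYS];

/* Sequential (power-saving) lookup: first compare tags only, then read the
   data array of the single matching way.  A parallel design would read the
   data arrays of all NUM_WAYS ways while the tags are compared, which is
   faster but burns energy on ways whose data is discarded. */
bool cache_read(uint64_t addr, uint8_t out[LINE_SIZE])
{
    uint64_t line = addr / LINE_SIZE;
    uint64_t set  = line % NUM_SETS;
    uint64_t tag  = line / NUM_SETS;

    int hit_way = -1;
    for (int w = 0; w < NUM_WAYS; w++) {          /* step 1: tag match only */
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            hit_way = w;
            break;
        }
    }
    if (hit_way < 0)
        return false;                             /* miss: go to next level */

    memcpy(out, cache[set][hit_way].data, LINE_SIZE); /* step 2: one data read */
    return true;
}

int main(void)
{
    uint8_t line[LINE_SIZE];
    printf("hit: %d\n", cache_read(0x1000, line)); /* cold cache: prints 0 */
    return 0;
}
```

Trading one extra step of latency for fewer data-array reads is exactly the tradeoff the report describes: throughput-bound HPC codes tolerate the latency, while HPL power drops by about 10%.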
The microarchitecture is an out-of-order (O3) architecture designed by Fujitsu. The amount of O3 resources was decided by the tradeoff between performance and the impact on the die size, based on the evaluation of some kernels extracted from the target applications.

OVERVIEW OF FUGAKU SYSTEM
In 2019, the name of the system was decided as "Fugaku," and the installation was completed in May 2020. Part of the storage system is the global file system, which is a Lustre-based parallel file system developed by Fujitsu. A Linux kernel runs on each node. All system daemons run on two or four assistant cores. The CPU chip with two assistant cores is used on compute-only nodes; the chip with four assistant cores is used on compute-and-I/O nodes, because such nodes service I/O functions requiring more CPU resources.

Final specification of the architecture parameters from the co-design:

Chip
- # CMGs/chip: 4
- # Cores/chip: 48 (+4)*

Memory/chip
- Technology: HBM2
- Memory size: 32 GiB
- Memory BW: 1,024 GB/s

CMG
- # Cores/CMG: 12 (+1)*

L2 cache/CMG
- Size: 8 MiB
- Associativity: 16-way
- Load BW to L1: 128 GB/s**
- Store BW from L1: 64 GB/s**
- Line size: 256 bytes

Core
- SIMD width: 512 bits
- # SIMD units: 2

L1D cache/core
- Size: 64 KiB
- Associativity: 4-way
- Load BW: 256 GB/s**
- Store BW: 128 GB/s**

Out-of-order resources/core
- Reorder buffer: 128 entries
- Reservation stations: 60 entries
- # Physical SIMD registers: 128
- Load buffer: 40 entries
- Store buffer: 24 entries

*Assistant core. **Cache BW at a CPU clock speed of 2 GHz.

Software Used:
Fugaku uses a "lightweight multi-kernel operating system" named IHK/McKernel. The operating system uses both Linux and the McKernel lightweight kernel, operating simultaneously and side by side. The infrastructure that both kernels run on is termed the Interface for Heterogeneous Kernels (IHK). The high-performance simulations are run on McKernel, with Linux available for all other POSIX-compatible services.

2. Summit Supercomputer

Introduction:
Summit, or OLCF-4, is a supercomputer developed by IBM for use at Oak Ridge National Laboratory, capable of 200 petaFLOPS, making it the second fastest supercomputer in the world (it held the number 1 position from November 2018 to June 2020). Its current LINPACK benchmark is clocked at 148.6 petaFLOPS. As of November 2019, the supercomputer ranked as the 5th most energy-efficient in the world, with a measured power efficiency of 14.668 gigaFLOPS/watt. Summit was also the first supercomputer to exceed an exaop (a quintillion mixed-precision operations per second) on a scientific application.

Block Diagram:
  • 8. Software Used: Red Hat Enterprise Linux is also widely deployed in National Labs and research centers around the globe and is a proven platform for large-scale computing across multiple hardware architectures. The total system design of Summit, consisting of 4,608 IBM computer servers, aims to make it easier to bring research applications to this behemoth. Part of this is the consistent environment provided by Red Hat Enterprise Linux. Functional Units: System Overview & Specifications Summit is an IBM system located at the Oak Ridge Leadership Computing Facility. With a theoretical peak double-precision performance of approximately 200 PF, it is one of the most capable systems in the world for a wide range of traditional computational science applications. It is also one of the “smartest” computers in the world for deep learning applications with a mixed-precision capability in excess of 3 EF.
Core Pipeline: NVIDIA Tesla V100 GPU Architecture

3. Sierra Supercomputer

Introduction:
Sierra, or ATS-2, is a supercomputer built for the Lawrence Livermore National Laboratory for use by the National Nuclear Security Administration as the second Advanced Technology System. It is primarily used for predictive applications in stockpile stewardship, helping to assure the safety, reliability and effectiveness of the United States' nuclear weapons. Sierra is very similar in architecture to the Summit supercomputer built for the Oak Ridge National Laboratory. The Sierra system uses IBM POWER9 CPUs in conjunction with NVIDIA Tesla V100 GPUs. The nodes in Sierra are Witherspoon IBM S922LC
  • 11. OpenPOWER servers with two GPUs per CPU and four GPUs per node. These nodes are connected with EDR InfiniBand. In 2019 Sierra was upgraded with IBM Power System A922 nodes. Block Diagram: Software Used: The Summit and Sierra supercomputer cores are IBM POWER9 central processing units (CPUs) and NVIDIA V100 graphic processing units (GPUs). NVIDIA claims that its GPUs are delivering 95% of Summit’s performance. Both supercomputers use a Linux operating system. Functional Units: Sierra boasts a peak performance of 125 petaFLOPS—125 quadrillion floating-point operations per second. Early indications using existing codes and benchmark tests are
  • 12. promising, demonstrating as predicted that Sierra can perform most required calculations far more efficiently in terms of cost and power consumption than computers consisting of CPUs alone. Depending on the application, Sierra is expected to be six to 10 times more capable than LLNL’s 20-petaFLOP Sequoia, currently the world’s eighth-fastest supercomputer. To prepare for this architecture, LLNL has partnered with IBM and NVIDIA to rapidly develop codes and prepare applications to effectively optimize the CPU/GPU nodes. IBM and NVIDIA personnel worked closely with LLNL, both on-site and remotely, on code development and restructuring to achieve maximum performance. Meanwhile, LLNL personnel provided feedback on system design and the software stack to the vendor. LLNL selected the IBM/NVIDIA system due to its energy and cost-efficiency, as well as its potential to effectively run NNSA applications. Sierra’s IBM POWER9 processors feature CPU-to-GPU connection via NVIDIA NVLink interconnect, enabling greater memory bandwidth between each node so Sierra can move data throughout the system for maximum performance and efficiency. Backing Sierra is 154 petabytes of IBM Spectrum Scale, a software-defined parallel file system, deployed across 24 racks of Elastic Storage Servers (ESS). To meet the scaling demands of the heterogeneous systems, ESS delivers 1.54 terabytes per second in both read and write bandwidth and can manage 100 billion files per file system. “The next frontier of supercomputing lies in artificial intelligence,” said John Kelly, senior vice president, Cognitive Solutions and IBM Research. “IBM's decades-long partnership with LLNL has allowed us to build Sierra from the ground up with the unique design and architecture needed for applying AI to massive data sets. The tremendous insights researchers are seeing will only accelerate high-performance computing for research and business.” As the first NNSA production supercomputer backed by GPU-accelerated architecture, Sierra’s acquisition required a fundamental shift in how scientists at the three NNSA laboratories program their codes to take advantage of the GPUs. The system’s NVIDIA GPUs also present scientists with an opportunity to investigate the use of machine learning and deep learning to accelerate the time-to-solution of physics codes. It is expected that simulation, leveraged by acceleration coming from the use of artificial intelligence technology will be increasingly employed over the coming decade. In addition to critical national security applications, a companion unclassified system, called Lassen, also has been installed in the Livermore Computing Center. This institutionally focused supercomputer will play a role in projects aimed at speeding cancer drug discovery, precision medicine, research on traumatic brain injury, seismology, climate, astrophysics, materials science, and other basic science benefiting society. Sierra continues the long lineage of world-class LLNL supercomputers and represents the penultimate step on NNSA’s road to exascale computing, which is expected to start by 2023 with an LLNL system called “El Capitan.” Funded by the NNSA’s Advanced
Simulation and Computing (ASC) program, El Capitan will be NNSA's first exascale supercomputer, capable of more than a quintillion calculations per second, about 10 times the performance of Sierra. Such computing power will be readily absorbed by NNSA for its mission, which requires the most advanced computing capabilities and deep partnerships with American industry.

4. Sunway TaihuLight Supercomputer

Introduction:
The Sunway TaihuLight is a Chinese supercomputer which, as of November 2021, is ranked fourth in the TOP500 list, with a LINPACK benchmark rating of 93 petaflops. The name translates as "divine power, the light of Taihu Lake." This is nearly three times as fast as the previous Tianhe-2, which ran at 34 petaflops. As of June 2017, it was ranked as the 16th most energy-efficient supercomputer on the Green500, with an efficiency of 6.051 GFlops/watt. It was designed by the National Research Center of Parallel Computer Engineering & Technology (NRCPC) and is located at the National Supercomputing Center in Wuxi, in Jiangsu province, China.

Block Diagram:
Software Used:
The system runs its own operating system, Sunway RaiseOS 2.0.5, which is based on Linux. The system has its own customized implementation of OpenACC 2.0 to aid the parallelization of code.

Functional Units:
The Sunway TaihuLight supercomputer: an overview. The Sunway TaihuLight supercomputer is hosted at the National Supercomputing Center in Wuxi (NSCC-Wuxi), which operates as a collaboration center between the City of Wuxi, Jiangsu Province, and Tsinghua University. NSCC-Wuxi focuses on the development needs of technological innovation and industrial upgrading around Jiangsu Province and the Yangtze River Delta economic circle, as well as the demands of the national key strategies on science and technology development.
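Since the report notes that TaihuLight provides a customized OpenACC 2.0 implementation for parallelizing code, a generic example helps show what that programming style looks like. The sketch below uses only standard OpenACC 2.0 directives in C; it is not Sunway-specific code, and the loop itself is just an illustrative vector operation.

```c
#include <stdio.h>

#define N 1000000

/* A simple axpy-style kernel annotated with standard OpenACC directives.
   On TaihuLight the vendor's customized OpenACC compiler would map such a
   loop onto the CPE cluster of a core group; with other compilers the same
   directives map onto whatever accelerator is targeted. */
void axpy(int n, float a, const float *restrict x, float *restrict y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

int main(void)
{
    static float x[N], y[N];
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    axpy(N, 3.0f, x, y);

    printf("y[0] = %f\n", y[0]);   /* expected: 5.0 */
    return 0;
}
```

Compiled without OpenACC support the directive is simply ignored and the loop runs serially, which is what makes the directive-based approach attractive for porting existing code.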
  • 15. The SW26010 many-core processor: One major technology innovation of the Sunway TaihuLight supercomputer is the homegrown SW26010 many-core processor. The general architecture of the SW26010 processor [10] is shown in Figure 2. The processor includes four core-groups (CGs). Each CG includes one management processing element (MPE), one computing processing element (CPE) cluster with eight by eight CPEs, and one memory controller (MC). These four CGs are connected via the network on chip (NoC). Each CG has its own memory space, which is connected to the MPE and the CPE cluster through the MC. The processor connects to other outside devices through a system interface (SI). The MPE is a complete 64-bit RISC core, which can run in both the user and system modes. The MPE completely supports the interrupt functions, memory management, superscalar processing, and outof-order execution. Therefore, the MPE is an ideal core for handling management and communication functions. In contrast, the CPE is also a 64- bit RISC core, but with limited functions. The CPE can only run in user mode and does not support interrupt functions. The design goal of this element is to achieve the maximum aggregated computing power, while minimizing the complexity of the micro-architecture. The CPE cluster is organized as an eight by eight mesh, with a mesh network to achieve low-latency register data communication among the eight by eight CPEs. The mesh also includes a mesh controller that handles interrupt and synchronization controls. Both the MPE and CPE support 256-bit vector instructions. 4 Subcomponent systems of the Sunway TaihuLight In this section, we provide more detail about the various subcomponent systems of the Sunway TaihuLight, specifically the computing, network, peripheral, maintenance and diagnostic, power and cooling, and the software systems. 4.1 The computing system Aiming for a peak performance of 125 PFlops, the computing system of the Sunway TaihuLight is built using a fully customized integration approach with a number of different levels: (1) computing node (one CPU per computing node); (2) super node s(256
computing nodes per super node); (3) cabinet (4 super nodes per cabinet); and (4) the entire computing system (40 cabinets). The computing nodes are the basic units of the computing system, and include one SW26010 processor, 32 GB of memory, a node management controller, power supply, interface circuits, etc. Groups of 256 computing nodes are integrated into a tightly coupled super node using a fully connected crossing switch, so as to support computationally intensive, communication-intensive, and I/O-intensive computing jobs.

4.2 The network system
The network system consists of three different levels, with the central switching network at the top, the super node network in the middle, and the resource-sharing network at the bottom. The bisection network bandwidth is 70 TB/s, with a network diameter of 7. Each super node includes 256 Sunway processors that are fully connected by the super node network, which achieves both high bandwidth and low latency for all-to-all communications among the entire 65,536 processing elements. The central switching network is responsible for building connections and enabling data exchange between different super nodes. The resource-sharing network connects the shared resources to the super nodes, and provides services for I/O communication and fault tolerance of the computing nodes.

4.3 The peripheral system
The peripheral system consists of the network storage system and the peripheral management system. The network storage system includes both the storage network and the storage disk array, providing a total storage of 20 PB and a high-speed, reliable data storage service for the computing nodes. The peripheral management system includes the system console, management server, and management network, which enable system management and service.

4.4 The power supply system and cooling system
The TaihuLight supercomputer uses a mutual-backup power input of 2 × 35 kV. The cabinets of the system use a three-level (300 V / 12 V / 0.9 V) DC power supply mode. The front-end power supply output is 300 V, which is directly linked to the cabinet. The main power supply of the cabinet converts 300 V DC to 12 V DC, and the CPU power supply converts 12 V into the voltage that the CPU needs. The cabinets of the computing and network systems use indirect water cooling, while the peripheral devices use air-water exchange, and the power system uses forced air cooling. The cabinets use closed-loop, static-hydraulic-pressure, indirect parallel-flow water cooling technology, which provides effective cooling for the full-scale Linpack run.
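The node/super-node/cabinet hierarchy above determines the headline core count, and it is easy to verify with a few lines of arithmetic. The sketch below just multiplies out the numbers given in the text (260 cores per SW26010: 4 core groups of 1 MPE + 64 CPEs each); it is a consistency check, not new data.

```c
#include <stdio.h>

int main(void)
{
    /* Figures taken from the system description above. */
    const int cabinets            = 40;
    const int supernodes_per_cab  = 4;
    const int nodes_per_supernode = 256;
    const int cgs_per_chip        = 4;
    const int cores_per_cg        = 1 + 64;   /* 1 MPE + 8x8 CPEs */

    long nodes = (long)cabinets * supernodes_per_cab * nodes_per_supernode;
    long cores = nodes * cgs_per_chip * cores_per_cg;

    /* 125 PFlops of peak spread over all nodes gives the per-chip peak. */
    double peak_pflops     = 125.0;
    double tflops_per_node = peak_pflops * 1000.0 / nodes;

    printf("nodes: %ld\n", nodes);                              /* 40,960     */
    printf("cores: %ld\n", cores);                              /* 10,649,600 */
    printf("peak per node: ~%.2f TFlops\n", tflops_per_node);   /* ~3.05      */
    return 0;
}
```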
5. Tianhe-2A Supercomputer

Introduction:
Tianhe-2A (Milky Way-2A) is a supercomputer developed by China's National University of Defense Technology (NUDT) and deployed at the National Supercomputer Center in Guangzhou. It was the world's fastest supercomputer according to the TOP500 lists for June 2013, November 2013, June 2014, November 2014, June 2015, and November 2015. The record was surpassed in June 2016 by the Sunway TaihuLight. In 2015, plans of Sun Yat-sen University, in collaboration with the Guangzhou district and city administration, to double its computing capacity were stopped by a U.S. government rejection of Intel's application for an export license for the CPUs and coprocessor boards. In response to the U.S. sanction, China introduced the Sunway TaihuLight supercomputer in 2016, which substantially outperforms Tianhe-2 (and also prompted the upgrade of Tianhe-2 to Tianhe-2A, replacing the U.S. technology) and now ranks fourth in the TOP500 list while using completely domestic technology, including the Sunway manycore microprocessor.

Block Diagram:
  • 18. Software Used: Tianhe-2 ran on Kylin Linux, a version of the operating system developed by NUDT Functional Unit: System Architecture & Compute Blade The original TH-2 compute blade consisted of two nodes split into two modules: (1) the Computer Processor Module (CPM) module and (2) the Accelerator Processor Unit (APU) module (Figure 5). The CPM contained four Ivy Bridge CPUs, memory, and one Xeon Phi KNC accelerator, and the APU contained five Xeon Phi KNC accelerators. Connections from the Ivy Bridge CPUs to each of the KNC accelerators are made through a ×16 PCI Express 2.0 multiboard with 10 Gbps of bandwidth. The actual design and implementation of the board supports PCI Express 3.0, but the Xeon Phi KNC accelerator only supports PCI Express 2.0. There was also a PCI Express connection for the network interface controller (NIC). With the upgraded TH-2A, the Intel Xeon Phi KNC accelerators have been replaced. The CPM module still has four Ivy Bridge CPUs but is no longer housing an accelerator. The APU now houses four Matrix-2000 accelerators instead of the five Intel Xeon Phi KNC accelerators. So, in the TH-2A, the compute blade has two heterogeneous compute nodes, and each compute node is equipped with two Intel Ivy Bridge CPUs and two proprietary Matrix-2000 accelerators. Each node has 192 GB memory, and a peak performance of 5.3376 Tflop/s. The Intel Ivy Bridge processors have not been changed and are the same as in the original TH-2. Each of the Intel Ivy Bridge CPU’s 12 compute cores can perform 8 FLOPs per cycle per core, which results in 211.2 Gflop/s total peak performance per socket (12 cores × 8 FLOPs per cycle × 2.2 GHz clock). The logical structure of the compute node is shown in Figure 6. The two Intel Ivy Bridge CPUs are linked using two Intel Quick Path Interconnects (QPI). Each CPU has four memory channels with eight dual in-line memory module (DIMM) slots. CPU0 expands its I/O devices using Intel’s Platform Controller Hub (PCH) chipset and connects with a 14G proprietary NIC through a ×16 PCI Express 3.0 connection. Each CPU also uses a ×16 PCI Express 3.0 connection to access the Matrix-2000 accelerators. Each accelerator has eight memory channels. In a compute node, the CPUs are equipped with 64 GB of DDR3 memory, while the accelerators are equipped with 128 GB of DDR4 memory. With 17,792 compute nodes, the total memory capacity of the whole system is 3.4 PB. H-2A compute blade is composed of two parts: the CPM (left) and the APU (middle). The CPM integrates four Ivy Bridge CPUs, and the APU integrates four Matrix2000 accelerators. Each compute blade contains two heterogeneous compute nodes. As stated earlier, the peak performance of each Ivy Bridge CPU is 211.2 Gflop/s, and the peak performance of each Matrix-2000 accelerator is 2.4576 Tflop/s. Thus, the peak performance of each compute node can be calculated as (0.2112 Tflop/s × 2) + (2.4576
Tflop/s × 2) = 5.3376 Tflop/s. With 17,792 compute nodes, the peak performance of the whole system is 94.97 Pflop/s (5.3376 Tflop/s × 17,792 nodes).

6. Frontera Supercomputer

Introduction:
In August 2018, Dell EMC and Intel announced intentions to jointly design Frontera, an academic supercomputer funded by a $60 million grant from the National Science Foundation that would replace Stampede2 at the University of Texas at Austin's Texas Advanced Computing Center (TACC). Those plans came to fruition in June 2019, when the two companies deployed Frontera. Intel claims that Frontera can achieve a peak performance of 38.7 quadrillion floating-point operations per second, or petaflops, making it the world's fastest computer designed for academic workloads like modeling and simulation, big data, and machine learning (compared with Stampede2's peak performance of 18 petaflops). In 2019, Frontera earned the fifth spot on the twice-annual TOP500 list with 23.5 petaflops on the LINPACK benchmark, which ranks the world's most powerful non-distributed computer systems.

Block Diagram:
Software Used:
With a peak-performance rating of 38.7 petaFLOPS, the supercomputer is about twice as powerful as TACC's Stampede2 system, which is currently the 19th fastest supercomputer in the world. Dell EMC provided the primary computing system for Frontera, based on Dell EMC PowerEdge C6420 servers.

Functional Unit:
The Frontera system provides academic researchers with the ability to handle artificial-intelligence-related jobs of extremely high, previously unattainable, complexity. "With the integration of many Intel-exclusive technologies, this supercomputer opens up a wealth of new possibilities in the field of scientific and technical research in general, thereby fostering deeper understanding of complex, scholarly issues related to space research, cures, energy needs, and artificial intelligence," said Trish Damkroger, Intel vice president and general manager.

Hundreds of 2nd-generation Xeon Scalable processors with up to 28 cores ("Cascade Lake"), housed in Dell EMC PowerEdge servers, are responsible for handling Frontera's heavy computing tasks, alongside NVIDIA nodes for single-precision calculation. Frontera's processor architecture builds on Intel's Advanced Vector Extensions 512 (AVX-512), an instruction-set extension that roughly doubles the number of FLOPS per clock compared to the previous generation. The cooling system is another extremely important part of the machine: Frontera uses liquid cooling for most of its nodes, with Dell EMC providing the water and cooling-oil infrastructure in combination with CoolIT and Green Revolution Cooling (GRC) systems. The supercomputer uses Mellanox HDR and HDR-100 connections to transfer data at up to 200 Gb/s on each link between the switches that connect the 8,008 nodes across the system. Each rack is expected to consume about 65 kilowatts of electricity, about a third of which TACC sources from wind and solar power to save costs.

In terms of storage, Frontera has four different environments designed and built by DataDirect Networks, totaling more than 50 petabytes, paired with 3 petabytes of NAND flash (equivalent to about 480 GB of SSD storage on each node). The storage also provides extremely fast connectivity, with a speed of up to 1.5 terabytes per second. Finally, Frontera makes use of Intel Optane DC persistent memory, the non-volatile memory technology developed by Intel and Micron Technology, which is pin- and DDR4-compatible and acts as a large additional memory tier alongside a smaller DRAM pool (192 GB per node), thereby improving performance. Combined with the latest-generation Xeon Scalable processors, Intel Optane DC on Frontera delivers up to 287,000 operations per second, compared to 3,116 operations per second for conventional DRAM-only systems, and Frontera's reboot time takes only 17 seconds.

Basic specifications of the Frontera supercomputer

Basic calculation system. The configuration of each of Frontera's 8,008 available nodes is as follows:
- Processor: Intel Xeon Platinum 8280 ("Cascade Lake"); 28 cores per socket, 56 per node; clock rate 2.7 GHz (base frequency)
- Maximum node performance: 4.8 TF, double precision
- RAM: DDR4, 192 GB/node
- Local drive: 480 GB SSD/node
- Network: Mellanox InfiniBand, HDR-100

Subsystems

Liquid-submerged system:
- Processor: 360 NVIDIA Quadro RTX 5000 GPUs
- RAM: 128 GB/node
- Cooling: GRC ICEraQ system
- Network: Mellanox InfiniBand, HDR-100
- Maximum performance: 4 PF single precision

Longhorn:
- Processor: IBM POWER9-hosted system with 448 NVIDIA V100 GPUs
- RAM: 256 GB/node
- Storage: 5-petabyte filesystem
- Network: InfiniBand EDR
- Maximum performance: 3.5 PF double precision; 7.0 PF single precision
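The node specification above is enough to reproduce the quoted 4.8 TF and 38.7 PF peak figures. The sketch below does that arithmetic, assuming the usual 32 double-precision FLOPs per cycle per core for a Cascade Lake core with two AVX-512 FMA units (an assumption about the SKU, not a number stated in the report).

```c
#include <stdio.h>

int main(void)
{
    /* Node configuration from the spec list above. */
    const int    sockets_per_node = 2;
    const int    cores_per_socket = 28;     /* Xeon Platinum 8280 */
    const double base_ghz         = 2.7;
    const int    nodes            = 8008;

    /* Assumption: 2 AVX-512 FMA units -> 2 * 2 * 8 = 32 DP FLOPs/cycle/core. */
    const int flops_per_cycle = 32;

    double node_gflops = sockets_per_node * cores_per_socket
                       * base_ghz * flops_per_cycle;           /* ~4838 GF */
    double system_pflops = node_gflops * nodes / 1.0e6;         /* ~38.7 PF */

    printf("per-node peak: %.2f TF\n", node_gflops / 1000.0);
    printf("system peak:   %.1f PF\n", system_pflops);
    return 0;
}
```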
7. Piz Daint

Introduction:
Piz Daint is a supercomputer in the Swiss National Supercomputing Centre, named after the mountain Piz Daint in the Swiss Alps. It was ranked 8th on the TOP500 ranking of supercomputers until the end of 2015, higher than any other supercomputer in Europe. At the end of 2016, the computing performance of Piz Daint was tripled to reach 25 petaflops; it thus became the third most powerful supercomputer in the world. As of November 2021, Piz Daint is ranked 20th on the TOP500. The original Piz Daint Cray XC30 system was installed in December 2012. This system was extended with Piz Dora, a Cray XC40 with 1,256 compute nodes, in 2013.[9] In October 2016, Piz Daint and Piz Dora were upgraded and combined into the current Cray XC50/XC40 system featuring NVIDIA Tesla P100 GPUs.

Block Diagram:

Software Used:
- Architecture: Intel Xeon E5-26xx (various), NVIDIA Tesla P100
- Operating system: Linux (CLE)

8. Trinity Supercomputer

Introduction:
Trinity (or ATS-1) is a United States supercomputer built by the National Nuclear Security Administration (NNSA) for the Advanced Simulation and Computing Program (ASC).[2] The aim of the ASC program is to simulate, test, and maintain the United States nuclear stockpile.

Block Diagram:
  • 23. Software Used: Trinity uses a Sonexion based Lustre file system with a total capacity of 78 PB. Throughput on this tier is about 1.8 TB/s (1.6 TiB/s). It is used to stage data in preparation for HPC operations. Data residence in this tier is typically several weeks. Functional Unit Trinity is a Cray XC40 supercomputer, with delivery over two phases; phase 1 is based on Intel Xeon Haswell compute nodes, and phase 2 will add Intel Xeon Phi Knights Landing (KNL) compute nodes. Phase 1 was delivered and accepted in the latter part of 2016, and consists of 54 cabinets, including multiple node types. Foremost are 9436 Haswell-based compute nodes, delivering ~1 PiB of memory capacity and ~11 PF/s of peak performance. Each Haswell compute node features two 16-core Haswell processors operating at 2.3 GHz, along with 128GiB of DDR4- 2133 memory, spread across 8 channels (4 per CPU). Phase 1 also includes 114 Lustre router nodes (see Section III.B) and 300 burst buffer nodes (see Section IV). Trinity utilizes a Sonexion based Lustre filesystem with 78 PB of usable storage and approximately 1.6 TB/s of bandwidth. However, due to the limited number of Lustre router nodes in Phase 1, only about half of this bandwidth is currently achievable. Phase 1 also includes all of the other typical service nodes: 2 boot, 2 SDB, 2 UDSL, 6 DVS, 12 MOM, and 10 RSIP. Additionally, Trinity utilizes 6 external login
  • 24. nodes. Phase 2 is scheduled to begin delivery in mid-2016. It adds more than 9500 Xeon Phi Knights Landing (KNL) based compute nodes. Each KNL compute node consists of a single KNL with 16 GiB of on-package memory and 96 GiB of DDR4- 2400 memory. It has a peak performance of approximately 3 TF/s. In total, the KNL nodes add ~1 PiB of memory capacity and ~29 PF/s peak performance. In addition to the KNLs, Phase 2 also adds the balance of the Lustre router nodes (108 additional, total of 222) and burst buffer nodes (276 additional, total of 576). When all burst buffer nodes are installed, they will provide 3.69 PB of raw storage capacity and 3.28 TB/s of bandwidth. BURST BUFFER INTEGRATION AND PERFORMANCE 1. Design Trinity includes the first large scale instance of on-platform burst buffers using the Cray DataWarp® product. The Trinity burst buffer is provided in two phases along with the two phases of Trinity. The phase 1 burst buffer consists of 300 DataWarp nodes. This is expanded to 576 DataWarp nodes by phase 2. In this section, unless otherwise noted, the phase 1 burst buffer will be described. The 300 DataWarp nodes are built from Cray service nodes, each with a 16 core Intel Sandy Bridge processor and 64 gigabytes of memory. Storage on each DataWarp node is provided by two Intel P3608 Solid State Drive (SSD) cards. The DataWarp nodes use the Aries high speed network for communications with the Trinity compute nodes and for communications with the Lustre Parallel File System (PFS) via the LNET router nodes. Each SSD card has 4 TB of capacity and is attached to the service node via a PCI-E x4 interface. The SSD cards are overprovisioned to improve the endurance of the card from the normal 3 Drive Writes Per Day (DWPD) over 5 years to 10 DWPD over 5 years. This reduces the available capacity of each card. The total usable capacity of the 300 DataWarp nodes is 1.7 PiB. The DataWarp nodes run a Cray provided version of Linux together with a DataWarp specific software stack consisting of an enhanced Data Virtualization Service (DVS) server and various configuration and management services. The DataWarp nodes also provide a staging function that can be used to asynchronously move data between the PFS and DataWarp. There is a centralized DataWarp registration service that runs on one of the Cray System Management nodes. Compute nodes run a DVS client that is enhanced to provide support for DataWarp. The DataWarp resources can be examined and controlled via several DataWarp specific command line interface (CLI) utilities that run on any of the system’s nodes. DataWarp can be configured to operate in a number of different modes. The primary use case at ACES is to support checkpoint and analysis files, these are supported by the striped scratch mode of DataWarp. Striped scratch provides a single file name space that is visible to multiple compute nodes with the file data striped across one or more DataWarp nodes. A striped private mode is additionally available. In the future, paging space and cache modes may be provided. This section will discuss LANL’s experience with striped scratch mode. A DataWarp allocation is normally configured by job script directives. Trinity uses the
Moab Work Load Manager (WLM). The WLM reads the job script at job submission time and records the DataWarp directives for future use. When the requested DataWarp capacity is available, the WLM starts the job. Prior to the job starting, the WLM uses DataWarp CLI utilities to request instantiation of a DataWarp allocation and any requested stage-in of data from the PFS. After the job completes, the WLM requests stage-out of data and then frees the DataWarp allocation. The stage-in and stage-out happen without any allocated compute nodes or any compute-node involvement. The DataWarp allocation is made accessible via mount only on the compute nodes of the requesting job. Unix file permissions are effective for files in DataWarp and are preserved by stage-in and stage-out operations. A DataWarp allocation is normally only available for the life of the requesting job, with the exception of a persistent DataWarp allocation that may be accessed by multiple jobs, possibly simultaneously. Simultaneous access by multiple jobs is used to support in-transit data visualization and analysis use cases.

2. Integration
Correct operation of DataWarp in conjunction with the WLM was achieved after several months of extended integration testing on-site at LANL. Numerous fixes and functional enhancements have improved the stability and usability of the DataWarp feature on Trinity. Due to this effort, production use of DataWarp had been limited as of late April 2016.

3. Performance
All performance measurements were conducted with IOR. The runs were made with:
- a reader or writer process per node
- 32 GiB total data read or written per node
- 256, 512, or 1024 KiB block size
- node counts from 512 to 4096
- the DataWarp allocation striped across all 300 DataWarp nodes

These characteristics were selected to approximate the I/O patterns expected when applications use the HIO library. Additional investigation and optimization of I/O characteristics is needed.
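The IOR configuration above (one writer per node, large sequential blocks into a single striped file) can be approximated with a few lines of MPI-IO. The sketch below is not IOR itself, and neither the path nor the transfer size comes from the report; it just shows the access pattern being measured: every rank writes its own contiguous chunk of one shared file at a rank-dependent offset.

```c
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Each rank writes CHUNK bytes of a shared file at offset rank*CHUNK,
   mimicking a "one writer per node, striped scratch file" IOR run.
   The target path is illustrative; on a DataWarp allocation it would live
   under the job's burst-buffer mount point. */
#define CHUNK (8 * 1024 * 1024)   /* 8 MiB per rank; the real runs used far more */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(CHUNK);
    memset(buf, rank & 0xFF, CHUNK);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "testfile.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    MPI_Offset offset = (MPI_Offset)rank * CHUNK;
    MPI_File_write_at_all(fh, offset, buf, CHUNK, MPI_BYTE, MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```

Whether the underlying blocks land on the burst buffer or on Lustre is decided by where the file lives, which is exactly why the WLM-managed stage-in/stage-out model described above lets applications keep this simple access pattern.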
9. AI Bridging Cloud Infrastructure

Introduction:
The AI Bridging Cloud Infrastructure (ABCI) is a supercomputer built at the University of Tokyo for use in artificial intelligence, machine learning, and deep learning. It was built by Japan's National Institute of Advanced Industrial Science and Technology (AIST). ABCI was expected to be completed in the first quarter of 2018, with a planned performance of 130 petaFLOPS, a target power consumption of 3 megawatts, and a planned power usage effectiveness of 1.1. If performance met expectations, ABCI would be the second most powerful supercomputer built, surpassing the then-leader Sunway TaihuLight's 93 petaflops, though still behind Summit.

Block Diagram:

Software Used:
  • 27. Along with Docker, Singularity and other tools, Univa Grid Engine plays a key role in ABCI’s software stack, ensuring that workloads run as efficiently as possible. Functional Units The ABCI prototype, which was installed in March, consisted of 50 two-socket “Broadwell” Xeon E5 servers, each equipped with 256 GB of main memory, 480 GB of SSD flash memory, and eight of the tesla P100 GPU accelerators in the SMX2 form factor hooked to each other using the NVLink 1.0 interconnect. Another 68 nodes were just plain vanilla servers, plus two nodes for interactive management and sixteen nodes for other functions on the cluster. The system was configured with 4 PB of SF14K clustered disk storage from DataDirect Networks running the GRIDScaler implementation of IBM’s GPFS parallel file system, and the whole shebang was clustered together using 100 Gb/sec EDR InfiniBand from Mellanox Technologies, specifically 216 of its CS7250 director switches. Among the many workloads running on this cluster was the Apache Spark in-memory processing framework. The goal with the real ABCI system was to deliver a machine with somewhere between 130 petaflops and 200 petaflops of AI processing power, which means half precision and single precision for the most part, with a power usage effectiveness (PUE) of somewhere under 1.1, which is a ratio of the energy consumed for the datacenter compared to the compute complex that does actual work. (This is about as good as most hyperscale datacenters, by the way.) The system was supposed to have about 20 PB of parallel file storage and, with the compute, storage, and switching combined, burn under 3 megawatts of juice. The plan was to get the full ABCI system operational by the fourth quarter of 2017 or the first quarter of 2018, and this obviously depended on the availability of the compute and networking components. Here is how the proposed ABCI system was stacked up against the K supercomputer at the RIKEN research lab in Japan and the Tsubame 3.0 machine at the Tokyo Institute of Technology: The K machine, which is based on the Sparc64 architecture and which was the first machine in the world to break the 10 petaflops barrier, will eventually be replaced by a massively parallel ARM system using the Tofu interconnect made for the K system and subsequently enhanced. The Oakforest-PACs machine built by University of Tokyo and University of Tsukuba is based on a mix of “Knights Landing” Xeon Phi processors and Omni-Path interconnect from Intel, and weighs in at 25 petaflops peak double precision. It is not on this comparison table of big Japanese supercomputers. But perhaps it should be. While the Tsubame 3.0 machine is said to focus on double precision performance, the big difference is really that the Omni-Path network hooking all of the nodes together in Tsubame 3.0 was configured to maximize extreme injection bandwidth and to have very high bi-section bandwidth across the network. The machine learning workloads that are expected to run on ABCI are not as sensitive to these factors and, importantly, the idea
  • 28. here is to build something that looks more like a high performance cloud datacenter that can be replicated in other facilities, using standard 19-inch equipment rather than the specialized HPE and SGI gear that TiTech has used in the Tsubame line to date. In the case of both Tsubame 3.0 and ABCI, the thermal density of the compute and switching is in the range of 50 kilowatts to 60 kilowatts per rack, which is a lot higher than the 3 kilowatts to 6 kilowatts per rack in a service provider datacenter, and the PUE at under 1.1 is a lot lower than the 1.5 to 3.0 rating a typical service provider datacenter. (The hyperscalers do a lot better than this average, obviously.) This week, AIST awarded the job of building the full ABCI system to Fujitsu, and nailed down the specs. The system will be installed at a shiny new datacenter at the Kashiwa II campus of the University of Tokyo, and is now going to start operations in Fujitsu’s fiscal 2018, which begins next April. The ABCI system will be comprised of 1,088 of Fujitsu’s Primergy CX2570 server nodes, which are half-width server sleds that slide into the Primergy CX400 2U chassis. Each sled can accommodate two Intel “Skylake” Xeon SP processors, and in this case AIST is using a Xeon SP Gold variant, presumably with a large (but not extreme) number of cores. Each node is equipped with four of the Volta SMX2 GPU accelerators, so the entire machine has 2,176 CPU sockets and 4,352 GPU sockets. The use of the SXM2 variants of the Volta GPU accelerators requires liquid cooling because they run a little hotter, but the system has an air-cooled option for the Volta accelerators that hook into the system over the PCI-Express bus. The off-the-shelf models of the CX2570 server sleds also support the lower-grade Silver and Bronze Xeon SP processors as well as the high-end Platinum chips, so AIST is going in the middle of the road. There are Intel DC 4600 flash SSDs for local storage on the machine. It is not clear who won the deal for the GPFS file system for this machine, and if it came in at 20 PB as expected. Fujitsu says that the resulting ABCI system will have 37 petaflops of aggregate peak double precision floating point oomph, and will be rated at 550 petaflops, and 525 petaflops off that comes from using the 16-bit Tensor Core units that were created explicitly to speed up machine learning workloads. That is a lot more deep learning performance than was planned, obviously. AIST has amassed $172 million to fund the prototype and full ABCI machines as well as build the new datacenter that will house this system. About $10 million of that funding is for the datacenter, which had its ground breaking this summer. The initial datacenter setup has a maximum power draw of 3.25 megawatts, and it has 3.2 megawatts of cooling capacity, of which 3 megawatts come from a free cooling tower assembly and another 200 kilowatts comes from a chilling unit. The datacenter has a single concrete slab floor, which is cheap and easy, and will start out with 90 racks of capacity – that’s 18 for storage and 72 for compute – with room for expansion.
  • 30. SuperMUC was a supercomputer of the Leibniz Supercomputing Centre (LRZ) of the Bavarian Academy of Sciences. It was housed in the LRZ's data centre in Garching near Munich. It was decommissioned in January 2020, having been superseded by the more powerful SuperMUC-NG. SuperMUC was the fastest European supercomputer when it entered operation in the summer of 2012 and was, at the time of writing, ranked #20 in the TOP500 list of the world's fastest supercomputers. SuperMUC served European researchers in many fields, including medicine, astrophysics, quantum chromodynamics, computational fluid dynamics, computational chemistry, life sciences, genome analysis and earthquake simulations.

Block Diagram:

Software Used:
Operating System (compute/thin nodes): SUSE Linux Enterprise Server (SLES)
Batch Scheduling System: SLURM
High Performance Parallel Filesystem: IBM Spectrum Scale (GPFS)
Programming Environment: Intel Parallel Studio XE, GNU compilers
  • 31. Functional Unit

Just like the CoolMUC-2, the SuperMUC-NG is located at the Leibniz Supercomputing Centre in Germany and was built by Lenovo. The system has 311,040 physical cores and a main memory of 719 TB, resulting in a peak performance of 26.9 PFlop/s. A fat-tree is used as the network topology and the bandwidth is 100 Gb/s using Intel's Omni-Path interconnect [25]. The CPUs used in this system are Intel's Skylake Xeon Platinum 8174, with 24 cores clocked at 3.1 GHz [24]. The SuperMUC-NG was designed as a general-purpose supercomputer to support applications from all scientific domains, such as life sciences, meteorology, geophysics and climatology. The most dominant scientific domain using LRZ's supercomputers is astrophysics. Recently, LRZ also made its resources available for COVID-19 related research.

Levels of parallelism (a minimal code sketch follows after this list):

Fine grain: single processor core
● instruction parallelism
● multiple floating point units
● SIMD-style parallelism: single instruction, multiple data

Medium grain: multi-core / multi-socket system
● independent processes or threads perform calculations on a shared memory area

Coarse grain: interconnected (independent) systems
● explicitly programmed data transfers between nodes of the system
● fulfill high memory requirements
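These grain levels map naturally onto code. Below is a minimal OpenMP sketch in C, assuming an OpenMP-capable compiler (e.g. gcc -fopenmp): the parallel loop spreads iterations over threads that share one memory area (medium grain), the simd clause asks the compiler to vectorize each thread's chunk of iterations (fine grain), and coarse-grain message passing between nodes is only noted in a comment. The array size and values are illustrative only.

/* Minimal sketch of the fine- and medium-grain levels described above,
 * using OpenMP. Compile with an OpenMP-capable compiler,
 * e.g. gcc -fopenmp levels.c */
#include <stdio.h>
#include <omp.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void) {
    /* Medium grain: independent threads work on a shared memory area.    */
    /* Fine grain: "simd" asks the compiler to vectorize each thread's    */
    /* chunk of iterations (single instruction, multiple data).           */
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++) {
        a[i] = 1.0;
        b[i] = 2.0;
        c[i] = a[i] + b[i];
    }

    /* Coarse grain would add explicitly programmed data transfers between
     * independent nodes (e.g. with MPI); that is omitted here. */
    printf("c[0] = %f, max threads = %d\n", c[0], omp_get_max_threads());
    return 0;
}

On a multi-socket node the same loop simply fans out over more threads without any change to the source; only the coarse-grain level requires explicit communication code.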
  • 32. Levels of Parallelism Examples:
● Node Level (e.g. SuperMUC has approx. 10,000 nodes)
● Accelerator Level (e.g. SuperMIC has 2 CPUs and 2 Xeon Phi accelerators)
● Socket Level (e.g. fat nodes have 4 CPU sockets)
● Core Level (e.g. SuperMUC Phase 2 has 14 cores per CPU)
● Vector Level (e.g. AVX2 has 16 vector registers per core)
● Pipeline Level (how many simultaneous pipelines)
● Instruction Level (instructions per cycle)

Memory access latency, with an everyday analogy:

Getting data from:                Getting some food from:
CPU register     1 ns             fridge                  10 s
L2 cache         10 ns            microwave               100 s
memory           80 ns            pizza service           800 s
network (IB)     200 ns           city mall               2,000 s
GPU (PCIe)       50,000 ns        mum sends cake          500,000 s
hard disk        500,000 ns       grown in own garden     5,000,000 s

Fine grain parallelism

On the CPU level, operations are composed from elementary instructions:
● load operand(s) (instructions are sent to one of the ports of the execution unit)
● perform arithmetic operations
● store result
● increment loop count / check for loop exit
● branching / jumping

Efficient if:
● good mix of instructions
● arguments are available from memory
● low data dependencies

(The data-dependency point is illustrated in the sketch below.)
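The low-data-dependency condition from the list above is the easiest one to see in code. In the sketch below (loop bounds and values are illustrative only), the first loop has fully independent iterations that the SIMD units and pipelines can overlap, while the second loop carries a dependency from one iteration to the next and therefore runs essentially serially.

/* Minimal sketch of how data dependencies limit fine grain parallelism.
 * Loop bounds and values are illustrative only. */
#include <stdio.h>

#define N 1000000

static double x[N], y[N];

int main(void) {
    for (int i = 0; i < N; i++)
        x[i] = (double)i;

    /* Independent iterations: each y[i] depends only on x[i], so several
     * iterations can be in flight at once (SIMD units, deep pipelines). */
    for (int i = 0; i < N; i++)
        y[i] = 2.0 * x[i] + 1.0;

    /* Loop-carried dependency: y[i] needs the y[i-1] computed in the
     * previous iteration, which serializes the loop. */
    for (int i = 1; i < N; i++)
        y[i] = y[i] + y[i - 1];

    printf("y[N-1] = %e\n", y[N - 1]);
    return 0;
}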
  • 33. Comparison Analysis of Supercomputers:

Rank  System                            Memory (GB)   Cores        Rmax (PFlop/s)  Rpeak (PFlop/s)  Power (kW)
1     Fugaku                            5,087,232     7,630,848    442.01          537.21           29,899
2     Summit                            2,801,664     2,414,592    148.60          200.79           10,096
3     Sierra                            1,382,400     1,572,480    94.64           125.71           7,438
4     Sunway TaihuLight                 1,310,720     10,649,600   93.01           125.44           15,371
5     Tianhe-2A                         2,277,376     4,981,760    61.44           100.68           18,482
6     Frontera                          1,537,536     448,448      23.50           38.70            6,000
7     Piz Daint                         365,056       387,872      21.23           27.15            2,272
8     Trinity                           0             979,072      20.16           41.46            8,000
9     AI Bridging Cloud Infrastructure  417,792       391,680      19.80           32.58            3,000
10    SuperMUC-NG                       75,840        305,856      19.40           26.83            2,000