Google Warehouse-
Scale Computer:
Hardware and
Performance
Prepared by: Tejhaskar Ashok Kumar
Master of Applied Science in Computer
Engineering
Memorial University of Newfoundland
Outline
 Introduction
 Architectural Overview of Google WSC
 Server
 Storage
 Network
 Hardware Accelerators
 GPU
 TPU
 Energy and Power Efficiency
 Performance of Google WSC
 Top-Down Micro-architectural Analysis Method
 Cooling Systems
Introduction
 A Warehouse-Scale Computer (WSC) comprises tens to thousands of clustered servers connected by a
network to serve thousands to millions of user requests.
 A WSC can be used to provide internet services: Search, Video Sharing, E-Commerce
 The difference between a traditional data center and a WSC:
 Data centers host services for multiple providers
 A WSC is run by a single organization
 A WSC exploits Request-Level Parallelism and Data-Level Parallelism
 WSC Design Goals:
 Cost-Performance
 Energy Efficiency
 Dependability via Redundancy
 Network I/O
 Interactive and Batch processing workloads
Architectural Overview of Google
WSC
 In general, a WSC is a building that acts as a single
computer, in which many server nodes and storage units are
tightly coupled by interconnection networks.
 Inside the building there are many containers, and each
container holds multiple servers arranged in racks and
interconnected with storage
 Each container also has cooling support to remove the
heat generated.
 A Google WSC uses more than 100,000 servers
 WSCs are designed to serve every internet-service request
reliably and quickly.
Fig 1 – WSC as a building
Servers
 The WSC uses low-end servers in a 1U or blade enclosure
format; the servers are mounted in a rack, and each
server is connected to an Ethernet switch.
 Google WSC uses 19-inch-wide racks, each of which can hold 48
blade servers connected to a rack switch.
 Each server has a PCIe link, which is used to connect
the CPU servers to the GPU and TPU trays.
 Each rack is about 7 ft high, 4 ft
wide, and 2 ft deep, and contains about 48 slots for
servers, power conversion cords, network
switches, and a battery backup tray that is used during
power interruptions.
 CPUs used in Google WSC: Intel Xeon Scalable Processor
(Cascade Lake), Intel Xeon Scalable Processor (Skylake), Intel
Xeon E7 (Broadwell E7), Intel Xeon E5 v4 (Broadwell E5), Intel
Xeon E5 v3 (Haswell), Intel Xeon E5 v2 (Ivy Bridge), Intel Xeon
E5 (Sandy Bridge), AMD EPYC Rome
Fig 2 – Server to Cluster
Fig 3 – Server Rack
Storage
 Google WSC relies on local disks and uses the Google File
System (GFS), a distributed file system developed by
Google
 A GFS cluster contains multiple nodes, divided into two
categories:
 Master node: maintains the file-system metadata
(namespace and chunk locations) needed to serve requests
 Chunkservers: store the data chunks and maintain the
copies of the data
 Google WSC storage keeps at least three replicas of each chunk to
improve dependability
 Each server's storage is interconnected with the other
server storage in its local rack, and every rack's storage
is interconnected with the cluster.
Fig 4 – Storage Hierarchy of Google WSC
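The three-replica, rack-aware idea above can be sketched in a few lines. GFS itself uses richer placement policies; this hypothetical `place_replicas` helper (name and signature invented for illustration) only shows why replicas are spread across racks: a single rack failure then cannot destroy all copies of a chunk.

```python
import random

def place_replicas(chunk_id, racks, replicas=3):
    """Pick `replicas` chunkservers for a chunk, one per rack where
    possible, so that one rack failure cannot lose every copy.
    `racks` maps rack name -> list of chunkserver names.
    Assumes the cluster has at least `replicas` servers in total;
    placement here does not depend on chunk_id."""
    chosen = []
    # One server from each of up to `replicas` randomly chosen racks.
    for rack in random.sample(list(racks), k=min(replicas, len(racks))):
        chosen.append(random.choice(racks[rack]))
    # If there are fewer racks than replicas, fill from any servers.
    all_servers = [s for servers in racks.values() for s in servers]
    while len(chosen) < replicas:
        extra = random.choice(all_servers)
        if extra not in chosen:
            chosen.append(extra)
    return chosen

racks = {"rack-a": ["s1", "s2"], "rack-b": ["s3", "s4"], "rack-c": ["s5"]}
print(place_replicas("chunk-0042", racks))
```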
Network
 The Google WSC network uses a Clos
network, a multistage network built from
low-port-count switches
 Clos networks are fault-tolerant and
provide excellent bandwidth; Google improves
the bandwidth of the network by adding more
stages to the multistage network
 Google uses a 3-stage Clos network:
 Ingress Stage (input stage)
 Middle Stage
 Egress Stage (output stage)
 Google uses the Jupiter Clos network inside its
WSCs.
Fig 5 – 3 stage clos network
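Two properties of 3-stage Clos fabrics can be checked numerically: Clos's classical condition that a fabric with n input ports per ingress switch is strictly non-blocking when there are m ≥ 2n − 1 middle-stage switches, and the fact that adding middle-stage switches adds uplink capacity (which is how Google grows bandwidth by adding stages and paths). The function names below are illustrative, not from any real library.

```python
def clos_is_strictly_nonblocking(n, m):
    """Clos's 1953 condition for a 3-stage network: with n input ports
    per ingress switch and m middle-stage switches, any new connection
    can always be routed when m >= 2n - 1."""
    return m >= 2 * n - 1

def clos_aggregate_uplink_gbps(m, r, link_gbps):
    """Total ingress-to-middle capacity: each of the r ingress switches
    has one uplink to each of the m middle switches, so adding middle
    switches scales bandwidth linearly."""
    return m * r * link_gbps

print(clos_is_strictly_nonblocking(n=4, m=7))        # True: 7 >= 2*4 - 1
print(clos_is_strictly_nonblocking(n=4, m=6))        # False
print(clos_aggregate_uplink_gbps(m=4, r=8, link_gbps=40))  # 1280
```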
Hardware Accelerators
 If the overall performance of a uniprocessor is too slow,
additional hardware can be used to speed up the
system. This hardware is called a hardware accelerator
 A hardware accelerator is a component that works alongside
the processor and executes certain tasks much faster than
the processor
 An accelerator appears as a device on the bus
 Google WSC uses:
 Graphical Processing Unit (GPU)
 Tensor Processing Unit (TPU)
Fig 6 – Hardware Accelerators
Graphical Processing Unit (GPU)
 In a Google WSC, each server CPU is connected to a PCIe-attached
accelerator tray holding multiple GPUs
 The GPUs within the tray are interconnected with NVLink, a
wire-based short-range communication protocol developed by Nvidia.
 Each SM has an L1 cache associated with its cores and shares an L2 cache.
 The presence of many CUDA cores in an Nvidia GPU makes
computation faster than on a CPU.
 In a GPU, the task to be executed is divided into several processes and sent to
several Processor Clusters (PCs) to achieve low memory latency and high
throughput.
 The GPU has smaller caches than a CPU, since more of the GPU's
transistors are dedicated to computation
 Also, with multiple cores, parallelism can be achieved by running
the processes effectively, quickly, and reliably
https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4
Fig 7 – Graphical Processing Unit
Graphical Processing Unit (GPU)
 For compute workloads, the following GPUs are used by
Google WSC (designed specifically for AI and data
center solutions):
 NVIDIA® Tesla® T4
 NVIDIA® Tesla® V100
 NVIDIA® Tesla® P100
 NVIDIA® Tesla® P4
 NVIDIA® Tesla® K80
GPU        | No of Cores | Memory Size | Memory Type | SM Count | Tensor Cores | L1 Cache  | L2 Cache | FP32 (float) performance
Tesla T4   | 2560        | 16 GB       | GDDR6       | 40       | 320          | 64 KB/SM  | 4 MB     | 8.141 TFLOPS
Tesla V100 | 5120        | 32 GB       | HBM2        | 80       | 640          | 128 KB/SM | 6 MB     | 14.13 TFLOPS
Tesla P100 | 3584        | 16 GB       | HBM2        | 56       | NA           | 24 KB/SM  | 4 MB     | 9.526 TFLOPS
Tesla P4   | 2560        | 8 GB        | GDDR5       | 20       | NA           | 48 KB/SM  | 2 MB     | 5.704 TFLOPS
Tesla K80  | 4992        | 24 GB       | GDDR5       | 26       | NA           | 32 KB/SM  | 1536 KB  | 8.226 TFLOPS
Fig – FP32 performance comparison of the five GPUs (TFLOPS)
https://cloud.google.com/compute/docs/gpus
Tensor Processing Unit (TPU)
 Google’s ASIC specifically designed for AI solutions
 Matrix Multiply Unit (MXU) – heart of TPU
 Contains 256x256 MAC
 Weight FIFO uses 8 GB off-chip DRAM to provides weight to the
MMU
 Unified Buffer (24 MB) keeps activation input/output of the
MMU and host
 Accumulators collects the 16 MB MMU products
11
Fig 7 – Inside the TPU
Parallel Processing on the Matrix Multiply Unit (MXU)
 A typical RISC processor processes a single operation (= scalar processing)
with each instruction
 A GPU uses vector processing and performs operations concurrently
on multiple SMs, executing hundreds to thousands of operations in a single
clock cycle
 To increase the number of operations per clock cycle further, Google
developed a matrix processor that processes hundreds of thousands of
operations (= matrix operations) in a single clock cycle
 To implement a large-scale matrix processor, Google uses a different
architecture than CPUs and GPUs, called a systolic array.
 The MXU reads each input value once and reuses it for many different
operations without storing it back to a register – unlike a CPU
 CPUs and GPUs often spend energy accessing multiple registers per
operation. A systolic array chains multiple ALUs together, reusing the
result of reading a single register.
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
Fig 8 – Register Access :CPU and GPU vs. TPU
Parallel Processing on the Matrix Multiply Unit (MXU)
 A systolic array is a homogeneous network of tightly coupled Data Processing Units (DPUs) called
nodes.
 Uses a Multiple Instruction, Single Data (MISD) architecture
 The design is "systolic" because the data flows through a network of hard-wired processor nodes
 The systolic array contains 256×256 = 65,536 ALUs, which means the TPU can process 65,536
8-bit integer multiply-and-add operations every cycle
 Clock frequency of the TPU = 700 MHz, so the TPU can compute 65,536 × 700 MHz ≈ 46 × 10^12
multiply-and-add operations per second
 Number of operations per cycle for CPU, GPU, and TPU:
 CPU: a few
 CPU (vector extension): tens
 GPU: tens of thousands
 TPU: hundreds of thousands, up to 128K
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
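The throughput arithmetic above is just the array size times the clock rate, which a two-line calculation confirms:

```python
# TPU v1 peak throughput, using the figures quoted on this slide.
MACS = 256 * 256           # 65,536 multiply-accumulate ALUs in the systolic array
CLOCK_HZ = 700e6           # 700 MHz clock

ops_per_second = MACS * CLOCK_HZ
print(ops_per_second)      # prints 45875200000000.0, i.e. ≈ 46 × 10^12 MACs/s
```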
Roofline Model – CPU vs. GPU vs. TPU
 The roofline model ties floating-point performance, memory
performance, and arithmetic intensity together in a 2-D graph.
 Arithmetic Intensity (AI)
 Ratio of floating-point operations per byte of memory
accessed.
 The X-axis is arithmetic intensity and the Y-axis is
performance in floating-point operations per second
 The graph is plotted using the following:
 Attainable GFLOPs/sec = Min(Peak Memory BW × AI,
Peak floating-point perf.)
 The comparison is made between the Intel Haswell (CPU), Nvidia
Tesla K80 (GPU), and TPUv1 for six different neural network
applications
 The six NN applications sit further below the ceiling on the CPU and
GPU than on the TPU – the TPU has higher performance
Fig 9 – Roofline Model of CPU
Fig 10 – Roofline Model of GPU
Fig 11 – Roofline Model of TPU
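The roofline formula above is a simple min() of two ceilings, as this sketch shows. The peak numbers here are illustrative placeholders, not the Haswell/K80/TPU figures from the talk:

```python
def attainable_gflops(ai, peak_gflops, peak_bw_gbs):
    """Roofline model: achieved performance is capped either by the
    memory roof (bandwidth x arithmetic intensity, in FLOPs/byte)
    or by the flat compute roof, whichever is lower."""
    return min(peak_bw_gbs * ai, peak_gflops)

# Illustrative machine: 1000 GFLOP/s peak compute, 100 GB/s peak bandwidth.
# The ridge point sits at AI = 1000 / 100 = 10 FLOPs/byte.
for ai in (0.5, 2.0, 8.0, 32.0):
    print(ai, attainable_gflops(ai, peak_gflops=1000.0, peak_bw_gbs=100.0))
# -> 50.0, 200.0, 800.0 (memory-bound), then 1000.0 (compute-bound)
```

Low-AI workloads land on the sloped memory roof; once AI passes the ridge point, performance saturates at the compute peak, which is why the six NN applications plot below their ceilings on the CPU and GPU.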
Energy Efficiency of a Google WSC
 System workloads have grown tremendously,
consuming a great deal of power and energy.
 The simple metric used to measure the efficiency of a
WSC is called power usage effectiveness, or PUE.
 PUE = (Total Facility Power) / (IT Equipment
Power)
 Power usage effectiveness is the ratio between the
total energy entering a WSC and the energy used by the IT
equipment inside the WSC
 Constant request traffic keeps the WSC busy all the time
and contributes to other kinds of energy losses, such as
power distribution loss, cooling loss, air loss, etc.
Energy Losses in a Google WSC (chart):
 Server Losses: 34%
 Rectifier Losses: 13%
 Power Distribution Loss: 14%
 Cooling: 18%
 Other Losses: 14%
 Air Loss: 7%
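The PUE definition above is a single ratio; a minimal sketch (with made-up facility numbers) shows how it is computed and why 1.0 is the ideal:

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness = total facility power / IT equipment
    power. 1.0 would mean every watt entering the building reaches the
    IT equipment; overhead (cooling, distribution losses) pushes it up."""
    return total_facility_kw / it_equipment_kw

# Hypothetical example: 1120 kW enters the facility, 1000 kW reaches IT gear.
print(pue(total_facility_kw=1120, it_equipment_kw=1000))  # 1.12
```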
Energy Efficiency of a Google WSC
 Here are some of the workloads considered to have the
highest power consumption:
 Web search: high request throughput and data-processing
requests
 Webmail: a disk-I/O-intensive internet service, where each machine is
configured with a large number of disk drives to run this
workload
 MapReduce: cluster processing that uses hundreds or thousands
of servers to process terabytes of data in large offline jobs
 To reduce power consumption, Google implements
 CPU voltage/frequency scaling
 DVFS reduces the servers' power consumption by
dynamically changing the voltage and frequency of a
CPU according to its load.
 Google reduces power by about 23% with this
technique.
Fig 12 – Energy Consumption Comparison
Performance of a Google WSC
 The overall WSC performance can be calculated by aggregating per-job performance
 WSC performance = Σᵢ weightᵢ × performance metricᵢ (i denotes a unique job ID)
 Weight - determines how much a job's performance affects the overall performance
 Performance metric - Google WSC uses IPC (instructions per cycle) to evaluate the
performance of a job
 Reasons for performance impact
 A CPU that suffers from memory latency and limited memory bandwidth loses
performance to stall cycles caused by cache misses.
 The lower performance is due to data cache misses and instruction cache misses;
these two kinds of misses lower the IPC.
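The weighted-sum formula above translates directly into code. The weights and IPC values here are made up for illustration:

```python
def wsc_performance(jobs):
    """Aggregate WSC performance = sum over jobs of weight_i * metric_i.
    `jobs` is an iterable of (weight, ipc) pairs; the weight reflects
    how much each job contributes to overall performance."""
    return sum(weight * ipc for weight, ipc in jobs)

# Hypothetical cluster: three jobs with weights summing to 1.0.
jobs = [(0.5, 1.2), (0.3, 0.8), (0.2, 2.0)]
print(wsc_performance(jobs))  # weighted sum ≈ 1.24
```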
Top-Down Micro-Architectural Analysis Method
 The Top-Down Micro-architecture Analysis Method (TDMAM) is used to identify performance issues in a processor.
 Google previously used a naïve, traditional approach to identify performance bottlenecks, and implemented
TDMAM in 2015.
 Simple, structured, quick
 The pipeline of a CPU used in a WSC is quite complex, as the pipeline is divided into two halves:
 Front-end: responsible for fetching the program code and decoding it into two or
more low-level hardware operations called micro-ops (uops)
 Back-end: the micro-ops are passed to a process called allocation. Once the micro-ops are allocated, the
back-end waits for an available execution unit and tries to execute the micro-ops.
 The pipeline slots are classified into four broad categories:
 Retiring - a micro-op leaves the queue and commits
 Bad speculation - a pipeline slot is wasted due to incorrect speculation
 Front-end bound - overheads due to fetching, instruction caches, and decoding
 Back-end bound - overheads due to the data cache hierarchy and the lack of instruction-level parallelism
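The four-way slot breakdown above can be computed from performance counters. This is a sketch of the standard level-1 Top-Down formulas (Yasin's method); the plain parameter names stand in for the actual Intel events (retired slots, issued slots, front-end undelivered slots, and recovery slots ≈ issue width × recovery cycles), and the sample counter values are invented:

```python
def topdown_level1(slots, uops_issued, uops_retired,
                   fe_undelivered, recovery_slots):
    """Level-1 Top-Down breakdown. `slots` is the total pipeline slots
    (issue width x cycles); the four returned fractions sum to 1."""
    retiring = uops_retired / slots
    frontend_bound = fe_undelivered / slots
    # Issued-but-not-retired uops plus recovery slots were wasted
    # on wrong-path work.
    bad_speculation = (uops_issued - uops_retired + recovery_slots) / slots
    # Whatever remains was stalled waiting on the back-end.
    backend_bound = 1.0 - retiring - frontend_bound - bad_speculation
    return {"retiring": retiring,
            "bad_speculation": bad_speculation,
            "frontend_bound": frontend_bound,
            "backend_bound": backend_bound}

breakdown = topdown_level1(slots=4_000_000, uops_issued=1_900_000,
                           uops_retired=1_600_000, fe_undelivered=800_000,
                           recovery_slots=100_000)
print(breakdown)  # back-end bound dominates, as on the WSC workloads
```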
Top-Down Micro-Architectural Analysis Method
 The chart shows the pipeline-slot breakdown of
applications running in Google WSC.
 There is a large number of stalled back-end cycles due to the lack of
instruction-level parallelism.
 The processor finds it difficult to run all the instructions
simultaneously, which increases memory stall time.
 To overcome this, Google uses Simultaneous
Multithreading (SMT) to hide the latency by overlapping the stall
cycles.
 SMT is an architectural feature that allows instructions from
more than one thread to execute in any given pipeline
stage at a time.
 SMT increases CPU performance by supporting
thread-level parallelism.
Cooling System
 The purpose of a WSC cooling system is to remove the heat
generated by the equipment.
 Google WSC uses ceiling-mounted cooling.
 This type of cooling system has a large plenum space
that removes hot air from the data center; once the plenum
space removes the heat, a fan coil
blows cold air back toward the intake of the
data center
 (1) Hot exhaust from the data center rises in a vertical plenum
space.
 (2) The hot air enters a large plenum space above the drop ceiling.
 (3) Heat is exchanged with process water in a fan coil unit.
 (4) The fan coil blows the cold air down toward the intake of the data center.
Fig 13 – Google’s Cooling System
Conclusion
 Computation in a WSC does not rely on a single machine; it requires hundreds
or thousands of machines connected over a network to achieve greater
performance. We also observed that Google WSC deploys hardware accelerators
such as GPUs and TPUs to increase performance and energy efficiency.
 Designing a performance-oriented and energy-efficient WSC is a main concern;
Google has implemented power-saving approaches and performance-
improvement mechanisms like SMT to reduce the stall cycles caused by cache
misses.
 Hence, Google uses the above-mentioned hardware and techniques to design
performant and energy-efficient warehouse-scale systems.
Thank You
Navigating Complexity: The Role of Trusted Partners and VIAS3D in Dassault Sy...
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Rums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdfRums floating Omkareshwar FSPV IM_16112021.pdf
Rums floating Omkareshwar FSPV IM_16112021.pdf
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
Unleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leapUnleashing the Power of the SORA AI lastest leap
Unleashing the Power of the SORA AI lastest leap
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
Hazard Identification (HAZID) vs. Hazard and Operability (HAZOP): A Comparati...
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 

Google warehouse scale computer

  • 1. Google Warehouse- Scale Computer: Hardware and Performance Prepared by: Tejhaskar Ashok Kumar Master of Applied Science in Computer Engineering Memorial University of Newfoundland
  • 2. Outline  Introduction  Architectural Overview of Google WSC  Server  Storage  Network  Hardware Accelerators  GPU  TPU  Energy and Power Efficiency  Performance of Google WSC  Top-Down Micro-architectural Analysis Method  Cooling Systems 2
  • 3. Introduction  A Warehouse-Scale Computer comprises of ten to thousands of cluster connected to a network to process thousands to millions of users’ request.  A WSC can be used to provide internet services : Search, Video Sharing, E-Commerce  The difference between a traditional data center and a WSC:  Data centers host services for multiple providers  WSC run by a single organization  WSC exploits Request Level Parallelism and Data Level Parallelism  WSC Design Goals:  Cost-Performance  Energy Efficiency  Dependability via Redundancy  Network I/O  Interactive and Batch processing workloads 3
  • 4. Architectural Overview of Google WSC  In general, WSC is a building which considered as a computer, where multiple server nodes, storage are tightly coupled with interconnection networks.  Inside the building, there are lot of containers and each container contains multiple servers arranged in a rack interconnected with storage  Each container also has some cooling support to eliminate the heat generated.  A Google WSC uses more than 100,000 servers  WSC’s are designed in a way to perform reliable and faster to perform every request powered by internet services. 4 Fig 1 – WSC as a building
  • 5. Servers  The WSC uses a low-end server in a 1U or blade enclosure format, and the servers are mounted in a rack and each server is interconnected with an Ethernet switch.  Google WSC uses a 19-inch wide rack which can hold 48 blade servers connected to a rack switch.  Each server has a PCIe link, which is useful for connecting the CPU servers with the GPU and TPU trays.  Every server is arranged in a rack, which is of 7ft high, 4ft wide, and 2ft deep, which contains about 48 slots for loading the server, power conversion chords, network switches, and a battery backup tray which is useful during the power interruption.  CPU used in Google WSC: Intel Xeon Scalable Processor (Cascade Lake), Intel Xeon Scalable Processor (Skylake), Intel Xeon E7 (Broadwell E7),Intel Xeon E5 v4 (Broadwell E5),Intel Xeon E5 v3 (Haswell),Intel Xeon E5 v2 (Ivy Bridge),Intel Xeon E5 (Sandy Bridge),AMD EPYC Rome 5 Fig 2 – Server to Cluster Fig 3 – Server Rack
  • 6. Storage  Google WSC rely on local disks and uses Google File System (GFS), which is a distributed file system developed by Google  A GFS cluster contains multiple nodes and divided into two categories:  Master Node: storing the data required for processing the current request  Chunkserver: maintaining the copies of the data  Google WSC storage maintains at least three replicas to improve the dependability  Every server storage is interconnected with the server storage in the local rack, and every server storage in local rack is interconnected with cluster. 6 Fig 4 – Storage Hierarchy of Google WSC
  • 7. Network  The Google WSC network uses a network called 'clos,' which is a multistage network that uses low port-count switches  These clos networks are fault-tolerant and provide excellent bandwidth. Google improves the bandwidth of the network by adding more stages to the multistage network  Google uses a 3-stage clos network:  Ingress Stage (input stage)  Middle Stage  Egress Stage (output stage)  Google uses Jupiter Clos Network inside their WSC. 7 Fig 5 – 3 stage clos network
Hardware Accelerators
 If the overall performance of a uniprocessor is too slow, additional hardware can be used to speed up the system. This hardware is called a hardware accelerator
 A hardware accelerator is a component that works alongside the processor and executes certain tasks much faster than the processor can
 An accelerator appears as a device on the bus
 Google WSC uses:
 Graphics Processing Unit (GPU)
 Tensor Processing Unit (TPU)
8
Fig 6 – Hardware Accelerators
Graphics Processing Unit (GPU)
 In a Google WSC, each CPU server is connected to a PCIe-attached accelerator tray holding multiple GPUs
 The GPUs within a tray are interconnected with NVLink, a wire-based near-range communication protocol developed by Nvidia.
 Each SM has an L1 cache associated with its cores, and the SMs share an L2 cache.
 The many CUDA cores in an Nvidia GPU make computation faster than on a CPU.
 In a GPU, the task to be executed is divided into several processes and sent to the Processor Clusters (PCs) to achieve low memory latency and high throughput.
 The GPU has smaller cache layers than a CPU, since more of the GPU's transistors are dedicated to computation
 Also, with multiple cores, parallelism can be achieved by running the processes effectively, quickly, and reliably
https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4
9
Fig 7 – Graphics Processing Unit
Graphics Processing Unit (GPU)
 For compute workloads, the following GPUs are used by Google WSC (designed specifically for AI and data-center solutions):
 NVIDIA® Tesla® T4
 NVIDIA® Tesla® V100
 NVIDIA® Tesla® P100
 NVIDIA® Tesla® P4
 NVIDIA® Tesla® K80

GPU         Cores  Memory        SM Count  Tensor Cores  L1 Cache   L2 Cache  FP32 Performance
Tesla T4    2560   16 GB GDDR6   40        320           64 KB/SM   4 MB      8.141 TFLOPS
Tesla V100  5120   32 GB HBM2    80        640           128 KB/SM  6 MB      14.13 TFLOPS
Tesla P100  3584   16 GB HBM2    56        N/A           24 KB/SM   4 MB      9.526 TFLOPS
Tesla P4    2560   8 GB GDDR5    20        N/A           48 KB/SM   2 MB      5.704 TFLOPS
Tesla K80   4992   24 GB GDDR5   26        N/A           32 KB/SM   1536 KB   8.226 TFLOPS

(Bar chart: FP32 performance of the five GPUs on a 0–15 TFLOPS scale)
https://cloud.google.com/compute/docs/gpus
10
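As a sanity check on the FP32 column, peak single-precision throughput can be estimated as 2 FLOPs (one fused multiply-add) per CUDA core per cycle. A minimal sketch; the boost-clock values are Nvidia's published figures, not numbers taken from the slide:

```python
def peak_fp32_tflops(cuda_cores, boost_clock_ghz):
    # Each CUDA core can retire one fused multiply-add (2 FLOPs) per cycle,
    # so peak FP32 = 2 * cores * clock. Divide by 1e3 to go GHz -> TFLOPS.
    return 2 * cuda_cores * boost_clock_ghz / 1e3

print(round(peak_fp32_tflops(2560, 1.590), 2))  # Tesla T4:   8.14 TFLOPS
print(round(peak_fp32_tflops(5120, 1.380), 2))  # Tesla V100: 14.13 TFLOPS
```

Both estimates match the table's 8.141 and 14.13 TFLOPS entries, which is a useful cross-check that the table's core counts are consistent with its performance numbers.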
Tensor Processing Unit (TPU)
 Google's ASIC, designed specifically for AI solutions
 The Matrix Multiply Unit (MXU) is the heart of the TPU
 It contains a 256x256 MAC array
 The Weight FIFO uses 8 GB of off-chip DRAM to provide weights to the MXU
 The Unified Buffer (24 MB) keeps the activation inputs/outputs of the MXU and the host
 The Accumulators (4 MiB) collect the 16-bit MXU products
11
Fig 7 – Inside the TPU
Parallel Processing on the Matrix Multiply Unit (MXU)
 A typical RISC processor processes a single operation (scalar processing) with each instruction
 A GPU uses vector processing and performs operations concurrently on multiple SMs; a GPU performs hundreds to thousands of operations in a single clock cycle
 To push the number of operations per clock cycle further, Google developed a matrix processor that processes hundreds of thousands of operations (matrix operations) in a single clock cycle
 To implement such a large-scale matrix processor, Google uses a different architecture from CPUs and GPUs, called a systolic array.
 The MXU reads each input value once and reuses it for many different operations without storing it back to a register – unlike a CPU
 CPUs and GPUs often spend energy accessing multiple registers per operation; a systolic array chains multiple ALUs together, reusing the result of reading a single register.
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
12
Fig 8 – Register Access: CPU and GPU vs. TPU
Parallel Processing on the Matrix Multiply Unit (MXU)
 A systolic array is a homogeneous network of tightly coupled Data Processing Units (DPUs) called nodes.
 It uses a Multiple Instruction, Single Data (MISD) architecture
 The design is called systolic because the data flows through a network of hard-wired processor nodes
 The systolic array contains 256x256 = 65,536 ALUs, which means the TPU can process 65,536 multiply-and-adds on 8-bit integers every cycle
 At the TPU's clock frequency of 700 MHz, it can compute 65,536 × 700 MHz ≈ 46 × 10^12 multiply-and-add operations per second
 Number of operations per cycle for CPU, GPU, and TPU:
 CPU: a few
 CPU (vector extension): tens
 GPU: tens of thousands
 TPU: hundreds of thousands, up to 128K
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
13
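The throughput figure above can be reproduced directly from the array dimensions and clock rate:

```python
ALUS = 256 * 256           # multiply-accumulate units in the systolic array
CLOCK_HZ = 700e6           # TPU v1 clock: 700 MHz

# Every ALU performs one multiply-and-add per cycle, so peak throughput
# is simply (number of ALUs) x (cycles per second).
ops_per_second = ALUS * CLOCK_HZ
print(ops_per_second / 1e12)  # ~45.9 tera-ops, i.e. the slide's 46 x 10^12
```

Note this is a peak number: it assumes every ALU is fed a useful operand each cycle, which is exactly what the single-read, data-flow design of the systolic array is meant to make possible.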
Roofline Model – CPU vs. GPU vs. TPU
 The roofline model ties floating-point performance, memory performance, and arithmetic intensity together in a 2-D graph.
 Arithmetic intensity
 The ratio of floating-point operations per byte of memory accessed.
 The X-axis is arithmetic intensity; the Y-axis is performance in floating-point operations per second
 The graph can be plotted using the following:
 Attainable GFLOPs/sec = min(Peak Memory BW × Arithmetic Intensity, Peak Floating-Point Performance)
 The comparison is made between an Intel Haswell (CPU), an Nvidia Tesla K80 (GPU), and the TPUv1 for six different neural-network applications
 The six NN applications sit farther below the ceiling on the CPU and GPU than on the TPU – the TPU delivers higher performance
14
Fig 9 – Roofline Model of CPU
Fig 10 – Roofline Model of GPU
Fig 11 – Roofline Model of TPU
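The roofline formula is simple enough to evaluate with a one-line helper. The bandwidth and peak numbers below are illustrative only, not the actual Haswell/K80/TPU parameters from the figures:

```python
def attainable_gflops(arithmetic_intensity, peak_bw_gbs, peak_gflops):
    """Roofline model: performance is capped either by memory bandwidth
    (the slanted part of the roof) or by peak compute (the flat ceiling)."""
    return min(peak_bw_gbs * arithmetic_intensity, peak_gflops)

# A hypothetical machine: 100 GB/s of memory bandwidth, 1000 GFLOP/s peak.
print(attainable_gflops(2, 100, 1000))   # low intensity  -> memory-bound: 200
print(attainable_gflops(50, 100, 1000))  # high intensity -> compute-bound: 1000
```

The crossover point (here at intensity 10 FLOPs/byte) is the "ridge" of the roofline; workloads to its left gain nothing from more ALUs, which is why cache-miss-heavy WSC jobs sit well below the ceiling on CPUs.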
Energy Efficiency of a Google WSC
 As workloads grow tremendously, a WSC consumes a great deal of power and energy.
 The simple metric used to calculate the efficiency of a WSC is called power usage effectiveness, or PUE.
 PUE = (Total Facility Power) / (IT Equipment Power)
 PUE is the ratio between the total energy entering a WSC and the energy used by the IT equipment inside it
 The constant stream of requests keeps the WSC busy all the time and contributes to other kinds of energy losses: power distribution loss, cooling loss, air loss, etc.
 Energy losses in a Google WSC:
 Server losses: 34%
 Rectifier losses: 13%
 Power distribution loss: 14%
 Cooling: 18%
 Other losses: 14%
 Air loss: 7%
15
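The PUE formula reduces to a single division; a PUE of 1.0 would mean every watt entering the facility reaches the IT equipment. The facility numbers below are hypothetical, chosen only to show the arithmetic:

```python
def pue(total_facility_kw, it_equipment_kw):
    """Power Usage Effectiveness: total power entering the facility
    divided by the power consumed by IT equipment. 1.0 is the ideal."""
    return total_facility_kw / it_equipment_kw

# A hypothetical 12 MW facility delivering 10 MW to servers:
print(pue(12_000, 10_000))  # 1.2
```

A PUE of 1.2 means 20% overhead went to cooling, power conversion, and the other loss categories tallied above.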
Energy Efficiency of a Google WSC
 Some of the workloads considered to have the highest power consumption:
 Web search: high request throughput and data-processing requests
 Webmail: a disk-I/O-heavy internet service, where each machine is configured with a large number of disk drives to run this workload
 MapReduce: clusters use hundreds or thousands of servers to process terabytes of data in large offline jobs
 To reduce power consumption, Google implements CPU voltage/frequency scaling:
 DVFS reduces a server's power consumption by dynamically changing the voltage and frequency of a CPU according to its load.
 Google reduces power by 23% with this technique.
16
Fig 12 – Energy Consumption Comparison
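Why scaling voltage and frequency together saves so much power follows from the classic CMOS dynamic-power model P = C·V²·f. The 15% voltage and 20% frequency reductions below are illustrative values, not Google's actual DVFS operating points:

```python
def dynamic_power(capacitance, voltage, frequency):
    # Classic CMOS dynamic-power model: P = C * V^2 * f.
    # Voltage enters squared, which is why lowering V pays off so well.
    return capacitance * voltage**2 * frequency

full = dynamic_power(1.0, 1.0, 1.0)     # normalized full-speed power
scaled = dynamic_power(1.0, 0.85, 0.8)  # drop voltage 15%, frequency 20%
print(f"power saved: {1 - scaled / full:.0%}")
```

Because voltage appears squared, a modest 15% voltage drop combined with a 20% clock reduction cuts dynamic power by roughly 42% in this model, at the cost of longer execution time at low load.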
Performance of a Google WSC
 Overall WSC performance can be calculated by aggregating per-job performance
 WSC performance = Σ weight_i × performance metric_i (i denotes a unique job ID)
 Weight: determines how much a job's performance affects the overall performance
 Performance metric: Google WSC uses IPC (instructions per cycle) to evaluate the performance of a job
 Reasons for performance impact:
 A CPU limited by memory latency and memory bandwidth suffers stall cycles due to cache misses.
 Both data-cache misses and instruction-cache misses contribute to a lower IPC.
17
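The weighted aggregation above can be sketched directly; the job names, weights, and IPC values here are made up for illustration:

```python
def wsc_performance(jobs):
    """Aggregate per-job performance as sum(weight_i * metric_i).
    `jobs` maps a job ID to a (weight, ipc) pair."""
    return sum(weight * ipc for weight, ipc in jobs.values())

jobs = {
    "websearch": (0.5, 1.2),  # (weight, measured IPC) -- hypothetical values
    "webmail":   (0.3, 0.8),
    "mapreduce": (0.2, 1.0),
}
print(round(wsc_performance(jobs), 2))  # 0.5*1.2 + 0.3*0.8 + 0.2*1.0 = 1.04
```

The weights let a fleet-wide metric reflect that a regression in a heavily weighted job (here, web search) matters more than the same regression in a batch job.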
Top-Down Micro-Architectural Analysis Method
 The Top-Down Micro-architectural Analysis Method (TMAM) is used to identify performance issues in a processor.
 Google previously used a naïve (traditional) approach to identify performance bottlenecks, and implemented TMAM in 2015.
 Simple, structured, quick
 The pipeline of a CPU used by a WSC is quite complex, as it is divided into two halves:
 Front-end: responsible for fetching the program code and decoding it into two or more low-level hardware operations called micro-ops (uops)
 Back-end: the micro-ops are passed to a process called allocation; once allocated, the back-end checks for an available execution unit and tries to execute the micro-ops.
 The pipeline slots are classified into four broad categories:
 Retiring: a micro-op leaves the queue and commits
 Bad speculation: a pipeline slot is wasted on incorrectly speculated work
 Front-end bound: overheads due to fetching, instruction caches, and decoding
 Back-end bound: overheads due to the data cache hierarchy and the lack of instruction-level parallelism.
18
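The four categories are computed from pipeline-slot counters: each cycle offers `width` issue slots (4 on the Intel cores discussed here), and every slot is attributed to exactly one category. The sketch below follows the standard level-1 Top-Down formulas; the parameter names and counter values are schematic stand-ins, not real Intel PMU event names:

```python
def topdown_level1(cycles, slots_retired, slots_not_delivered,
                   uops_issued, recovery_cycles, width=4):
    """Level-1 Top-Down breakdown from raw pipeline-slot counters.
    Every slot falls into exactly one of the four categories,
    so the fractions sum to 1."""
    slots = width * cycles
    retiring = slots_retired / slots            # uops that committed
    frontend = slots_not_delivered / slots      # front-end starved the back-end
    # Issued-but-not-retired uops plus slots lost to mispredict recovery:
    bad_spec = (uops_issued - slots_retired + width * recovery_cycles) / slots
    backend = 1.0 - retiring - frontend - bad_spec  # everything else stalled here
    return {"retiring": retiring, "frontend": frontend,
            "bad_speculation": bad_spec, "backend": backend}

# Made-up counter values for a back-end-bound workload:
print(topdown_level1(cycles=1000, slots_retired=1400,
                     slots_not_delivered=600, uops_issued=1600,
                     recovery_cycles=50))
```

With these numbers the back-end-bound fraction dominates (40% of slots), matching the WSC finding above that data-cache stalls, not fetch or misprediction, are the main limiter.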
Top-Down Micro-Architectural Analysis Method
 The chart shows the pipeline-slot breakdown of applications running in a Google WSC.
 A large number of cycles stall in the back-end due to the lack of instruction-level parallelism.
 The processor finds it difficult to run all the instructions simultaneously, which increases memory stall time.
 To overcome this, Google uses simultaneous multithreading (SMT) to hide the latency by overlapping the stall cycles.
 SMT is an architectural feature that allows instructions from more than one thread to be executed in any given pipeline stage at a time.
 SMT increases CPU performance by supporting thread-level parallelism.
19
Cooling System
 The purpose of a WSC cooling system is to remove the heat generated by the equipment.
 Google WSC uses ceiling-mounted cooling.
 This type of cooling system has a large plenum space that removes the hot air from the data center; a fan coil then blows the cold air back toward the intake of the data center
 (1) Hot exhaust from the data center rises in a vertical plenum space.
 (2) The hot air enters a large plenum space above the drop ceiling.
 (3) Heat is exchanged with process water in a fan coil unit,
 (4) which blows the cold air down toward the intake of the data center.
20
Fig 13 – Google's Cooling System
Conclusion
 Computation in a WSC does not rely on a single machine; it requires hundreds or thousands of machines connected over a network to achieve greater performance. We also observed that Google WSC deploys hardware accelerators such as GPUs and TPUs to increase performance and energy efficiency.
 Designing a performance-oriented and energy-efficient WSC is a central concern; Google has implemented power-saving approaches and performance-improvement mechanisms such as SMT to hide the stall cycles caused by cache misses.
 Hence, Google uses the hardware and techniques above to design performant and energy-efficient warehouse-scale systems.
21