This project examines the warehouse-scale computers (WSCs) that power the internet services we use today. It covers the hardware building blocks of a Google WSC and the architecture of hardware accelerators such as the Graphics Processing Unit (GPU) and the Tensor Processing Unit (TPU), which let warehouse-scale machines run heavy workloads and support application-specific machine learning and deep learning tasks. The project also discusses the energy efficiency of the processors used in a Google WSC and the mechanisms Google uses to enhance WSC performance.
1. Google Warehouse-Scale Computer: Hardware and Performance
Prepared by: Tejhaskar Ashok Kumar
Master of Applied Science in Computer
Engineering
Memorial University of Newfoundland
2. Outline
Introduction
Architectural Overview of Google WSC
Server
Storage
Network
Hardware Accelerators
GPU
TPU
Energy and Power Efficiency
Performance of Google WSC
Top-Down Micro-architectural Analysis Method
Cooling Systems
3. Introduction
A Warehouse-Scale Computer (WSC) comprises tens to thousands of clusters connected to a network to process thousands to millions of user requests.
A WSC can be used to provide internet services such as search, video sharing, and e-commerce.
The difference between a traditional data center and a WSC:
Data centers host services for multiple providers
A WSC is run by a single organization
WSC exploits Request Level Parallelism and Data Level Parallelism
WSC Design Goals:
Cost-Performance
Energy Efficiency
Dependability via Redundancy
Network I/O
Interactive and Batch processing workloads
4. Architectural Overview of Google
WSC
In general, a WSC is a building treated as a single
computer, in which multiple server nodes and storage are tightly
coupled by interconnection networks.
Inside the building there are many containers, and each
container holds multiple servers arranged in racks and
interconnected with storage.
Each container also has cooling support to remove the
heat generated.
A Google WSC uses more than 100,000 servers
WSCs are designed to serve every internet-service request
reliably and quickly.
Fig 1 – WSC as a building
5. Servers
The WSC uses low-end servers in a 1U or blade enclosure
format; the servers are mounted in racks, and each
server is connected to an Ethernet switch.
Google WSC uses a 19-inch wide rack which can hold 48
blade servers connected to a rack switch.
Each server has a PCIe link, which is useful for connecting
the CPU servers with the GPU and TPU trays.
The servers are arranged in racks, each 7 ft high, 4 ft
wide, and 2 ft deep, containing about 48 slots for
loading servers, power conversion cords, network
switches, and a battery backup tray that is used during
power interruptions.
CPUs used in Google WSCs: Intel Xeon Scalable Processor
(Cascade Lake), Intel Xeon Scalable Processor (Skylake),
Intel Xeon E7 (Broadwell E7), Intel Xeon E5 v4 (Broadwell E5),
Intel Xeon E5 v3 (Haswell), Intel Xeon E5 v2 (Ivy Bridge),
Intel Xeon E5 (Sandy Bridge), AMD EPYC Rome
Fig 2 – Server to Cluster
Fig 3 – Server Rack
6. Storage
A Google WSC relies on local disks and uses the Google File
System (GFS), a distributed file system developed by
Google.
A GFS cluster contains multiple nodes divided into two
categories:
Master node: stores the file-system metadata
(namespace and chunk locations) needed to serve requests
Chunkservers: store the data chunks and maintain
the replicas
Google WSC storage maintains at least three replicas of
each piece of data to improve dependability.
Each server's storage is interconnected with the storage
of the other servers in the local rack, and every rack's
storage is interconnected with the cluster.
Fig 4 – Storage Hierarchy of Google WSC
7. Network
The Google WSC network uses a Clos network, a
multistage network built from low-port-count
switches.
Clos networks are fault-tolerant and provide
excellent bandwidth; Google increases network
bandwidth by adding more stages to the
multistage network.
Google uses a 3-stage clos network:
Ingress Stage (input stage)
Middle Stage
Egress Stage (output stage)
Google uses the Jupiter Clos network inside its WSCs.
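To make the stage structure concrete, here is a minimal connectivity sketch of a 3-stage Clos fabric; the switch counts are illustrative placeholders, not Jupiter's actual dimensions. Each ingress switch links to every middle switch and each middle switch links to every egress switch, so every ingress/egress pair has one path per middle switch.

```python
# Minimal 3-stage Clos connectivity sketch (illustrative switch counts,
# not Google's real Jupiter dimensions). One path exists per middle
# switch for every ingress/egress pair, which is the source of the
# fabric's bandwidth and fault tolerance.

def clos_paths(n_ingress, n_egress, n_middle):
    """Map each (ingress, egress) switch pair to its middle-stage paths."""
    return {(i, e): [f"middle-{m}" for m in range(n_middle)]
            for i in range(n_ingress)
            for e in range(n_egress)}

paths = clos_paths(n_ingress=4, n_egress=4, n_middle=3)
# Every ingress/egress pair has exactly 3 disjoint middle-stage paths.
assert all(len(p) == 3 for p in paths.values())
# Adding a middle-stage switch adds a path per pair (more bandwidth).
assert all(len(p) == 4 for p in clos_paths(4, 4, 4).values())
```

Losing one middle switch removes one path per pair but leaves the others intact, which is the fault-tolerance property described above.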
Fig 5 – 3-stage Clos network
8. Hardware Accelerators
If the overall performance of a processor is too slow,
additional hardware can be used to speed up the
system; this hardware is called a hardware accelerator.
A hardware accelerator is a component that works alongside
the processor and executes specific tasks much faster than
the processor can.
An accelerator appears as a device on the bus
Google WSC uses:
Graphical Processing Unit (GPU)
Tensor Processing Unit (TPU)
Fig 6 – Hardware Accelerators
9. Graphical Processing Unit (GPU)
In a Google WSC, each server CPU is connected to a PCIe-attached
accelerator tray containing multiple GPUs.
The GPUs within a tray are interconnected with NVLink, a
wire-based short-range communication protocol developed by Nvidia.
Each streaming multiprocessor (SM) has an L1 cache associated with its cores, and all SMs share an L2 cache.
The presence of multiple CUDA cores in the Nvidia GPU makes the
computation faster than a CPU.
In a GPU, the task to be executed is divided into several processes and
distributed across several processor clusters (PCs) to achieve low memory
latency and high throughput.
A GPU has smaller cache layers than a CPU because it dedicates more
of its transistors to computation.
With multiple cores, parallelism is achieved by running the processes
effectively, quickly, and reliably.
https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4
Fig 7 – Graphical Processing Unit
10. Graphical Processing Unit (GPU)
For compute workloads, Google WSC uses the following
GPUs (designed specifically for AI and data-center
solutions):
NVIDIA® Tesla® T4
NVIDIA® Tesla® V100
NVIDIA® Tesla® P100
NVIDIA® Tesla® P4
NVIDIA® Tesla® K80
| GPU | No. of Cores | Memory Size | Memory Type | SM Count | Tensor Cores | L1 Cache | L2 Cache | FP32 (float) Performance |
|---|---|---|---|---|---|---|---|---|
| Tesla T4 | 2560 | 16 GB | GDDR6 | 40 | 320 | 64 KB/SM | 4 MB | 8.141 TFLOPS |
| Tesla V100 | 5120 | 32 GB | HBM2 | 80 | 640 | 128 KB/SM | 6 MB | 14.13 TFLOPS |
| Tesla P100 | 3584 | 16 GB | HBM2 | 56 | N/A | 24 KB/SM | 4 MB | 9.526 TFLOPS |
| Tesla P4 | 2560 | 8 GB | GDDR5 | 20 | N/A | 48 KB/SM | 2 MB | 5.704 TFLOPS |
| Tesla K80 | 4992 | 24 GB | GDDR5 | 26 | N/A | 32 KB/SM | 1536 KB | 8.226 TFLOPS |
[Bar chart: FP32 performance (TFLOPS) of the Tesla T4, V100, P100, P4, and K80]
https://cloud.google.com/compute/docs/gpus
11. Tensor Processing Unit (TPU)
Google’s ASIC, designed specifically for AI solutions
The Matrix Multiply Unit (MXU) is the heart of the TPU
Contains a 256×256 MAC array
The weight FIFO draws weights from 8 GB of off-chip DRAM and
feeds them to the MXU
A 24 MB unified buffer holds the activation inputs/outputs of the
MXU and the host
Accumulators (4 MiB) collect the MXU products
Fig 7 – Inside the TPU
12. Parallel Processing on the Matrix Multiply Unit (MXU)
A typical RISC processor processes a single operation per
instruction (scalar processing)
A GPU uses vector processing and performs operations concurrently
on multiple SMs, executing hundreds to thousands of operations
in a single clock cycle
To increase the number of operations per clock cycle further, Google
developed a matrix processor that processes hundreds of thousands of
operations (matrix operations) in a single clock cycle
To implement a large scale matrix processor, Google uses a different
architecture than CPUs and GPUs, called a systolic array.
MXU reads each input value once, and reuses it for many different
operations without storing it back to a register – Not like CPU
CPUs and GPUs often spend energy to access multiple registers per
operation. A systolic array chains multiple ALUs together, reusing the
result of reading a single register.
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
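As a toy illustration of the data-reuse idea above (not the TPU's real microarchitecture), the sketch below streams each activation past a fixed set of weights exactly once, accumulating partial sums the way a weight-stationary systolic column would:

```python
# Toy weight-stationary systolic computation (illustrative only):
# each activation is fetched once and flows past MAC cells that hold
# fixed weights, so no per-operation weight-register reads are needed.

def systolic_matvec(weights, activations):
    """Matrix-vector product; `weights` is a list of rows."""
    acc = [0] * len(weights)
    for j, a in enumerate(activations):    # each input fetched once
        for i, row in enumerate(weights):  # reused by every MAC cell
            acc[i] += row[j] * a
    return acc

W = [[1, 2], [3, 4]]
x = [5, 6]
assert systolic_matvec(W, x) == [17, 39]  # [1*5+2*6, 3*5+4*6]
```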
Fig 8 – Register access: CPU and GPU vs. TPU
13. Parallel Processing on the Matrix Multiply Unit (MXU)
A systolic array is a homogeneous network of tightly coupled Data Processing Units (DPUs) called
nodes.
Uses Multiple Instruction Single Data (MISD) architecture
The design is called systolic because the data flows in waves through a network of hard-wired processor nodes
The systolic array contains 256×256 = 65,536 ALUs, so the TPU can process 65,536
multiply-and-add operations on 8-bit integers every cycle
At a clock frequency of 700 MHz, the TPU can compute 65,536 × 700 MHz ≈ 46 × 10^12
multiply-and-add operations per second
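The peak-throughput arithmetic above, computed directly:

```python
# One multiply-and-add per ALU per cycle across the 256x256 systolic
# array, clocked at 700 MHz.
alus = 256 * 256
clock_hz = 700e6
macs_per_second = alus * clock_hz

assert alus == 65_536
assert macs_per_second == 45_875_200_000_000  # ~46 x 10^12 per second
```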
Number of operations per cycle for CPU, GPU, and TPU:

| Processor | Operations per cycle |
|---|---|
| CPU | a few |
| CPU (vector extension) | tens |
| GPU | tens of thousands |
| TPU | hundreds of thousands (up to 128K) |
https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
14. Roofline Model – CPU vs. GPU vs. TPU
The roofline model ties floating-point performance, memory performance,
and arithmetic intensity together in a 2-D graph.
Arithmetic intensity
The ratio of floating-point operations per byte of memory
accessed.
The x-axis is arithmetic intensity and the y-axis is
performance in floating-point operations per second
The roofline is plotted as:
Attainable GFLOP/s = min(Peak memory BW × arithmetic intensity,
Peak floating-point performance)
The comparison is made between an Intel Haswell CPU, an Nvidia
Tesla K80 GPU, and TPUv1 for six different neural-network
applications
The six NN applications sit further below their rooflines on the CPU
and GPU than on the TPU, meaning the TPU achieves higher performance
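The roofline formula can be sketched as below; the peak-bandwidth and peak-FLOP numbers are illustrative placeholders, not measured values for the Haswell CPU, the Tesla K80, or TPUv1.

```python
# Roofline formula from this slide. Peak values are made-up round
# numbers chosen only to show the two regimes.

def attainable_gflops(arithmetic_intensity, peak_bw_gb_s, peak_gflops):
    """Attainable GFLOP/s = min(peak memory BW x AI, peak FLOP/s)."""
    return min(peak_bw_gb_s * arithmetic_intensity, peak_gflops)

# Low arithmetic intensity: memory-bound, on the slanted part of the roof.
assert attainable_gflops(2, peak_bw_gb_s=100, peak_gflops=1000) == 200
# High arithmetic intensity: compute-bound, on the flat ceiling.
assert attainable_gflops(50, peak_bw_gb_s=100, peak_gflops=1000) == 1000
```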
Fig 9 – Roofline Model of CPU
Fig 10 – Roofline Model of GPU
Fig 11 – Roofline Model of TPU
15. Energy Efficiency of a Google WSC
System workloads grow tremendously and consume large
amounts of power and energy.
The simple metric used to measure the efficiency of a
WSC is power usage effectiveness, or PUE.
PUE = (Total Facility Power) / (IT Equipment
Power)
PUE is the ratio between the total energy entering a WSC
and the energy used by the IT equipment inside the WSC
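A quick worked example of the PUE formula; the power figures are made-up round numbers, and a PUE of 1.0 would mean zero facility overhead.

```python
# PUE exactly as defined on this slide: total facility power divided
# by IT equipment power.

def pue(total_facility_kw, it_equipment_kw):
    return total_facility_kw / it_equipment_kw

# 1200 kW enters the facility, 1000 kW reaches the IT equipment:
assert pue(total_facility_kw=1200, it_equipment_kw=1000) == 1.2
```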
Heavy request traffic keeps the WSC busy all the time
and contributes to other kinds of energy losses such as
power distribution loss, cooling loss, and air loss.
Energy losses in a Google WSC:
Server losses – 34%
Rectifier losses – 13%
Power distribution loss – 14%
Cooling – 18%
Other losses – 14%
Air loss – 7%
16. Energy Efficiency of a Google WSC
Here are some of the workloads considered the heaviest
power consumers:
Web search: high request throughput and heavy data-processing
demands
Webmail: a disk-I/O-bound internet service, where each machine is
configured with a large number of disk drives to run this
workload
MapReduce: clusters use hundreds or thousands of servers
to process terabytes of data in large offline jobs
To reduce power consumption, Google implements
CPU voltage/frequency scaling:
DVFS reduces server power consumption by
dynamically changing the voltage and frequency of a
CPU according to its load.
Google reduces power by 23% with this
technique.
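Why DVFS saves so much: dynamic CPU power scales roughly as C·V²·f, so lowering voltage and frequency together cuts power superlinearly. The constants below are illustrative, not Google's measurements.

```python
# Toy dynamic-power model: P = C * V^2 * f (capacitance, voltage,
# frequency). Values are illustrative placeholders.

def dynamic_power(capacitance, voltage, freq_hz):
    return capacitance * voltage ** 2 * freq_hz

full = dynamic_power(capacitance=1e-9, voltage=1.2, freq_hz=3.0e9)
scaled = dynamic_power(capacitance=1e-9, voltage=1.0, freq_hz=2.0e9)
# Dropping V from 1.2 to 1.0 and f from 3 GHz to 2 GHz more than
# halves dynamic power, even though frequency fell by only a third.
assert scaled / full < 0.5
```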
Fig 12 – Energy Consumption Comparison
17. Performance of a Google WSC
The overall WSC performance can be calculated by aggregating per-job performance:
WSC performance = Σᵢ (weightᵢ × performance metricᵢ), where i denotes a unique job ID
Weight: determines how much a job's performance affects the overall performance
Performance metric: Google WSC uses IPC (instructions per cycle) to evaluate the
performance of a job
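The aggregation formula above as a short sketch; the job names, weights, and IPC values are hypothetical.

```python
# Weighted sum of per-job performance, per the formula on this slide.

def wsc_performance(jobs):
    """jobs: {job_id: (weight, ipc)} -> overall weighted performance."""
    return sum(weight * ipc for weight, ipc in jobs.values())

jobs = {
    "websearch": (0.5, 1.2),  # hypothetical weight and measured IPC
    "webmail":   (0.3, 0.8),
    "mapreduce": (0.2, 1.0),
}
# 0.5*1.2 + 0.3*0.8 + 0.2*1.0 = 1.04
assert abs(wsc_performance(jobs) - 1.04) < 1e-9
```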
Reason for performance impact
A CPU limited by memory latency and memory bandwidth loses performance
to stall cycles caused by cache misses.
Both data-cache misses and instruction-cache misses contribute
to a lower IPC.
18. Top-Down Micro-Architectural Analysis Method
The Top-Down Micro-architecture Analysis method is used to identify all the performance issues related to a processor.
Google previously used a naïve (traditional) approach to identify performance bottlenecks and adopted
TDMAM in 2015.
TDMAM is simple, structured, and quick
The pipeline of a CPU used by a WSC is quite complex, and the pipeline is divided into two halves:
Front-end: responsible for fetching the program code and decoding it into two or
more low-level hardware operations called micro-ops (uops)
Back-end: the micro-ops are passed to a process called allocation; once the micro-ops are allocated, the
back-end waits for an available execution unit and executes the micro-ops.
The pipeline slots are classified into four broad categories:
Retiring – a micro-op leaves the queue and commits
Bad speculation – a pipeline slot is wasted due to incorrect speculation
Front-end bound – overheads due to fetching, instruction caches, and decoding
Back-end bound – overheads due to the data-cache hierarchy and the lack of instruction-level parallelism
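A simplified slot classification in the spirit of the four categories above; the counter inputs and bookkeeping are a sketch, not Intel's exact top-down formulas.

```python
# Toy top-down breakdown: divide pipeline slots into the four
# categories; back-end bound is whatever remains of the total.

def top_down(slots_total, slots_retired, slots_bad_spec, slots_fe_stall):
    retiring = slots_retired / slots_total
    bad_speculation = slots_bad_spec / slots_total
    frontend_bound = slots_fe_stall / slots_total
    backend_bound = 1.0 - retiring - bad_speculation - frontend_bound
    return {"retiring": retiring, "bad_speculation": bad_speculation,
            "frontend_bound": frontend_bound, "backend_bound": backend_bound}

breakdown = top_down(1000, 300, 50, 150)
assert abs(breakdown["backend_bound"] - 0.5) < 1e-9   # back-end dominates
assert abs(sum(breakdown.values()) - 1.0) < 1e-9      # fractions sum to 1
```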
19. Top-Down Micro-Architectural Analysis Method
The chart represents the pipeline slots breakdown of
applications running in Google WSC.
A large number of cycles stall in the back-end due to the lack of
instruction-level parallelism.
The processor finds it difficult to run all the instructions
simultaneously, which increases memory stall time.
To overcome this, Google uses Simultaneous
Multithreading (SMT) to hide the latency by overlapping
stall cycles.
SMT is an architectural feature that allows instructions from
more than one thread to execute in any given pipeline
stage at a time.
SMT increases CPU performance by supporting
thread-level parallelism.
20. Cooling System
The intention of WSC cooling systems is to remove the heat
generated by the equipment.
Google WSC uses Ceiling-mounted cooling as the cooling system.
This type of cooling system has a large plenum space that
removes hot air from the data center; once the hot air is
collected, a fan coil blows cold air back toward the intake
of the data center.
(1) Hot exhaust from the data center rises in a vertical plenum
space.
(2) The hot air enters a large plenum space above the drop ceiling.
(3) Heat is exchanged with process water in a fan coil unit.
(4) The fan coil blows the cold air down toward the intake of the data center.
Fig 13 – Google’s Cooling System
21. Conclusion
Computation in a WSC does not rely on a single machine; it requires hundreds
or thousands of machines connected over a network to achieve greater
performance. We also observed that Google WSCs deploy hardware accelerators
such as GPUs and TPUs to increase performance and energy efficiency.
Because designing a performance-oriented and energy-efficient WSC is a main concern,
Google has implemented power-saving approaches and performance-improvement
mechanisms such as SMT to eliminate the stall cycles caused by cache
misses.
Hence, Google uses the above-mentioned hardware and techniques to design
performant and energy-efficient warehouse-scale systems.