The document summarizes emerging computing trends in data centers, including:
1) The shift to multi-core CPU designs after Dennard scaling broke down, driven by the need for energy-efficient designs for cloud computing.
2) The rise of heterogeneous computing, using application-specific accelerators such as GPUs and FPGAs to improve efficiency for targeted workloads like machine learning.
3) How technologies developed for mobile and edge computing, such as ARM cores, can improve data center server efficiency through typical-use optimization rather than peak performance alone.
1. Qualcomm Datacenter Technologies, Inc.
Emerging Computing Trends in the Datacenter
Dileep Bhandarkar, Ph.D.
Vice President, Technology
Linaro Connect Keynote – 23 March 2018, Hong Kong
Created using DilEEP Neural Network
2. Outline
• Historical Perspective on 40 Years of Moore’s Law
– Single Core Era enabled by Dennard Scaling
• Post Dennard Scaling Drives Multi-Core Era
• The Shift to Energy Efficient Multi-Core Designs for the Cloud
• Heterogeneous Computing Era with Application Specific Accelerators
3. The First 50 Years after Shockley’s Transistor Invention
4. 1958: Jack Kilby’s Integrated Circuit
My 40+ Year Journey From Mainframes to Smartphones: https://www.youtube.com/watch?v=7ptXpNFY3XM
Bob Noyce’s Integrated Circuit
5. From 2,300 to >1 Billion Transistors
Moore’s Law video at http://www.cs.ucr.edu/~gupta/hpca9/HPCA-PDFs/Moores_Law_Video_HPCA9.wmv
6. Dennard Scaling
Device or Circuit Parameter            Scaling Factor
Device dimension (t_ox, L, W)          1/K
Doping concentration (Na)              K
Voltage (V)                            1/K
Current (I)                            1/K
Capacitance (εA/t)                     1/K
Delay time per circuit (VC/I)          1/K
Power dissipation per circuit (VI)     1/K²
Power density (VI/A)                   1
The benefits of scaling: as transistors get smaller, they can switch faster and use less power. Each new generation of process technology was expected to reduce minimum feature size by approximately 0.7x (K ≈ 1.4). A 0.7x reduction in linear feature size provided roughly a 2x increase in transistor density.
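That density claim follows directly from the scaling factors in the table above; a quick worked check, assuming ideal constant-field scaling with K ≈ 1.4:

```latex
% Constant-field (Dennard) scaling with K ~ 1.4, i.e. a 0.7x linear shrink
\[
\text{Area per transistor} \;\propto\; L \cdot W \;\rightarrow\; \frac{1}{K^{2}} \approx \frac{1}{1.4^{2}} \approx 0.5
\quad\Rightarrow\quad \text{transistor density} \approx 2\times
\]
\[
\text{Delay} \;\propto\; \frac{VC}{I} \;\rightarrow\; \frac{1}{K} \approx 0.7
\quad\Rightarrow\quad \text{frequency} \approx 1.4\times,
\qquad
\text{Power density} \;\propto\; \frac{VI}{A} \;\rightarrow\; \frac{K^{-1}\cdot K^{-1}}{K^{-2}} = 1
\]
```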
Dennard scaling broke down around 2004 with unscaled interconnect delays and our inability to scale the voltage and current due to reliability concerns.
But increasing transistor density (Moore’s Law) has continued to enable multicore designs.
7. THE MULTICORE ERA
SINGLE THREAD PERFORMANCE IMPROVEMENT SLOWING DOWN
PERFORMANCE DRIVEN BY HIGHER CORE COUNT
Post Dennard Scaling
9. The last 5 Generations of ~135W Xeon Processors
Slow improvement in IPC, but per-thread performance constrained by power
Performance data from www.spec.org
• 8 cores – Mar 2012
• 10 cores – Sep 2013
• 12 cores – Sep 2014
• 14 cores – Apr 2016
• 18 cores – Jul 2017
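Putting that trend in rough numbers from the dates and core counts above (a back-of-the-envelope estimate, not a quoted figure):

```latex
% Core count grew from 8 (Mar 2012) to 18 (Jul 2017), roughly 5.3 years
\[
\left(\frac{18}{8}\right)^{1/5.3} \approx 1.17
\quad\Rightarrow\quad \text{core count grew roughly } 16\text{--}17\%\ \text{per year within a nearly constant } \sim 135\,\text{W power envelope.}
\]
```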
10. No Improvement in Perf/Watt per Core, even with higher power
Performance data from www.spec.org
14. Disruptions Come from Below!
Mainframes → Minicomputers → RISC Systems → Desktop PCs → Notebooks → Smart Phones
(chart axes: Volume vs. Performance)
Bell’s Law: advances in hardware technology, networks, and interfaces allow new, smaller, more specialized computing devices to be introduced to serve a computing need.
15. Qualcomm Datacenter Technologies
Uniquely positioned to leverage mobile growth and drive datacenter process leadership
Mobile Technology Disrupting the Cloud Datacenter
• Then: fab process technology driven by PCs (~256M units) – 45nm, 32nm, 22nm, 14nm, 10nm
• Now: fab process technology driven by mobile phones (~1.5B smartphone units, timeline 2008–2018) – 65nm, 45nm, 28nm, 20nm, 14nm, 10nm (1st in the industry)
• A new world in the datacenter: the manufacturing process is mobile driven
16. Qualcomm Centriq™ 2400
What Cloud Means for Processor Architecture
• Throughput performance
• Thread Density
• Quality of Service
• Energy Efficiency
Key metrics:
• Perf / thread
• Perf / Watt
• Perf / mm²
The future requires a new approach to CPU design
17. Computational + server growth fuel datacenter energy efficiency considerations
• 2014: US datacenters consumed 70 billion kilowatt-hours of electricity
• Datacenters can cost between $10M and $20M per megawatt
• Unused datacenter capacity can be expensive
• 1 W of server power can cost about $1 per year in energy costs at 10 cents per kWh (see the sketch after this list)
• Server power related costs can be 30-50% of overall datacenter operating costs
• Servers need to be designed for average power consumption (not just max peak output)
• Hyper-efficient designs are necessary to improve server energy efficiency
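A minimal sketch of that $1-per-watt-per-year rule of thumb in Python; the $0.10/kWh rate comes from the slide, while the 200 W example server is an assumed value for illustration:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours

def annual_energy_cost(avg_watts: float, dollars_per_kwh: float = 0.10) -> float:
    """Yearly electricity cost for a load drawing avg_watts continuously."""
    kwh_per_year = avg_watts * HOURS_PER_YEAR / 1000.0
    return kwh_per_year * dollars_per_kwh

# 1 W running all year at $0.10/kWh is ~$0.88 -- roughly the $1/year rule of thumb.
print(f"1 W for a year: ${annual_energy_cost(1):.2f}")

# Assumed example: a server averaging 200 W costs ~$175/year in raw energy,
# before cooling and power-delivery overheads (PUE) are added on top.
print(f"200 W server:   ${annual_energy_cost(200):.2f}")
```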
19. Qualcomm Centriq 2400 Drives Perf/W and Perf/Thread Leadership

SKU              Cores   TDP     SIR2006   Price
QDF 2460         48      120 W   657       $1,995
Platinum 8180    28      205 W   775       $10,009 (Top Bin E7 Price)
Platinum 8170    26      165 W   653       $7,405
Platinum 8160    24      150 W   612       $4,702 (Top Bin E5 Price)
Gold 6138        20      125 W   504       $2,612

Relative to QDF 2460 (= 1.00):
SKU              Power   SIR2006   Perf/Watt   Perf/Core   Perf/Thread   Perf/$
QDF 2460         1.00    1.00      1.00        1.00        1.00          1.00
Platinum 8180    1.71    1.18      0.69        2.02        1.01          0.24
Gold 6138        1.04    0.77      0.74        1.84        0.92          0.59
Platinum 8160    1.25    0.93      0.75        1.86        0.93          0.40
Platinum 8170    1.38    0.99      0.72        1.70        0.85          0.27

(IsoPower and IsoPerf comparison points were highlighted on the original chart.)
Performance based on internal tests for SPECint_rate2006 (SIR) estimates using gcc -O2
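The normalized figures can be re-derived from the SKU specs listed above; a sketch in Python, where the thread counts are assumptions (48 single-threaded cores on the Centriq 2460, 2 threads per core on the Xeon SKUs):

```python
# Re-derive the slide's normalized metrics from the listed SKU specs.
# Thread counts are assumptions: Centriq 2460 = 48 single-threaded cores,
# Xeon SKUs = 2 threads per core.
skus = {
    #  name              cores  threads  tdp_w  sir2006  price_usd
    "QDF 2460":         (48,    48,      120,   657,      1995),
    "Platinum 8180":    (28,    56,      205,   775,     10009),
    "Gold 6138":        (20,    40,      125,   504,      2612),
    "Platinum 8160":    (24,    48,      150,   612,      4702),
    "Platinum 8170":    (26,    52,      165,   653,      7405),
}

def metrics(cores, threads, tdp, sir, price):
    return {
        "Power": tdp,
        "SIR2006": sir,
        "Perf/Watt": sir / tdp,
        "Perf/Core": sir / cores,
        "Perf/Thread": sir / threads,
        "Perf/$": sir / price,
    }

baseline = metrics(*skus["QDF 2460"])
for name, spec in skus.items():
    rel = {k: v / baseline[k] for k, v in metrics(*spec).items()}  # QDF 2460 = 1.0
    print(f"{name:15s}", {k: round(v, 2) for k, v in rel.items()})
```

Most of the ratios above fall out of this normalization to within chart-reading rounding.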
20. Qualcomm Centriq 2460 Lowers Average and Idle Power to Improve Cloud Server Density in Datacenters
[Chart: average power (Watts) across the SPECint®_rate2006 subtests (400.perlbench, 401.bzip2, 403.gcc, 429.mcf, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk) – 8 W idle power, median average power of 65 W, against a 120 W TDP]
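To illustrate why the gap between average power and TDP matters for density, a small sketch using the 65 W median and 120 W TDP from this slide; the 10 kW per-rack power budget is an assumed example value:

```python
# Why provisioning racks on measured average power beats provisioning on TDP.
# 65 W median average power and 120 W TDP come from the slide;
# the 10 kW rack power budget is an assumed example.
RACK_BUDGET_W = 10_000

def servers_per_rack(per_server_watts: float) -> int:
    return int(RACK_BUDGET_W // per_server_watts)

by_tdp = servers_per_rack(120)  # worst-case provisioning
by_avg = servers_per_rack(65)   # typical-use provisioning (needs capping for bursts)

print(f"Provisioned by TDP:           {by_tdp} servers/rack")
print(f"Provisioned by average power: {by_avg} servers/rack")
print(f"Density improvement:          {by_avg / by_tdp:.2f}x")
```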
21. • Are we really serious about energy efficiency?
• What should the Cost and Power constraints be?
• How many instruction sets is too many?
• x86, ARM, MIPS, Power, RISC-V
• Have we reached the limit of high core count? SW Scalability?
• Do we need to improve single thread general purpose performance?
• What should the power limit be for a single socket?
• How much performance are we willing to sacrifice for better security?
• Is there a fundamental conflict between multi-tenancy and security?
• Cost and convenience vs extreme security?
• When does device scaling end? Will there be a sub nm era?
Many Questions to Ponder?
23. • Energy efficiency must be an explicit design target
• Desktop PC CPU cores are too power hungry and not energy efficient
• Wimpy cores are not good enough for servers
• Servers can be designed by scaling up the energy-efficient mobile core design philosophy
• Many workloads run best on different kinds of specialized processing engines
• Each processing engine has its own strengths
Lessons from Mobile Computing
24. • Order of Magnitude higher computational efficiency than general purpose processors
• Can accept inefficient implementation to reduce time to market
• Many potential applications
– Machine Learning
– Encryption
– Data Compression
– Video processing
• Need reasonable volume for business case
• Algorithms need to be stable
• Can they be programmable? Where do FPGAs fit?
The Age of Application Specific Accelerators
25. The Emergence of Deep Neural Networks
Before the emergence of DNNs, algorithms and rule-based systems were laboriously hand-coded.
But by 2012, the ingredients for change were available:
– Sufficiently powerful GPUs
– Readily available large data sets on the internet
Deep Neural Networks are becoming pervasive.
The turning point was the ImageNet competition in 2012 – “ImageNet Classification with Deep Convolutional Neural Networks”, Neural Information Processing Systems Conference (NIPS 2012) – where a deep neural net enabled a performance breakthrough.
Now DNNs are simpler to develop and deploy, ushering in radical change in many fields and entire industries.
26. Deep Learning is Growing Exponentially
Source: Google
29. Where does compute need to be, and why?
Devices – Edge Cloud – Central Cloud
• Bandwidth / Backhaul traffic
• Compute Resources
• Power/Thermal Envelope
• Privacy & Security
• Latency
• Reliability
30. What is “Edge”?
Customer devices
◦ Smartphones, connected cars, drones, IoT sensors/devices
◦ < 2 ms latency; millions of devices
Customer premises
◦ Enterprises, homes, stadiums, cars
◦ < 5 ms latency; 1000s of devices
Cloudlets / edge nodes / edge gateways
◦ 5-20 ms latency
◦ Optionally co-located with access networks
◦ A few server racks per site
Centralized clouds
◦ > 100 ms latency
◦ 5-100 per operator or cloud service provider
◦ 100s-1000s of server racks per site
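As an illustration of how these tiers could be chosen at deployment time, a minimal sketch in Python: the tier names and latency bands come from the slide, while the selection policy and the 150 ms nominal figure for centralized clouds are assumptions for the example.

```python
# Illustrative placement helper: pick the most centralized tier whose
# worst-case round-trip latency still fits the workload's budget.
# Latency bands follow the slide; 150 ms for centralized clouds is an
# assumed nominal value standing in for ">100 ms".
TIERS = [  # ordered nearest to farthest
    ("customer device",    2),
    ("customer premises",  5),
    ("edge cloud",         20),
    ("centralized cloud",  150),
]

def place_workload(latency_budget_ms: float) -> str:
    """Return the farthest (typically cheapest) tier meeting the budget,
    falling back to running on the device itself."""
    for name, worst_case_ms in reversed(TIERS):
        if worst_case_ms <= latency_budget_ms:
            return name
    return "customer device"

print(place_workload(1))    # -> customer device (nothing else is fast enough)
print(place_workload(10))   # -> customer premises
print(place_workload(50))   # -> edge cloud
print(place_workload(300))  # -> centralized cloud
```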
32. CPU
• Free cycles available
• ISA enhancements
• Complementary with other accelerators
GPU
• Over-designed (cost, power) for AI
FPGA
• Offers flexibility
• Typically hard to program & expensive
ASIC
• Purpose-built
• Energy and cost efficient
• Expensive to design
• Least flexible
33. Training tends toward concentrated, centralized computation: GPUs, large DPUs, CPUs (higher cost).
Inference tends toward wide distribution: CPUs, small DPUs (low cost).
34. Thoughts on Future Silicon for Deep Learning
• CPUs are not powerful enough for training, but have free cycles available for inference – an opportunity for add-in accelerator cards
• Instruction set enhancements can improve performance
• GPUs have too much “extra baggage” that adds cost and power for features not needed for AI – an opportunity for domain-specific accelerators
• FPGAs offer more flexibility, but are difficult to program and expensive
• ASICs are energy and product cost efficient, but less flexible
• Deep neural networks are making significant strides in many areas: speech, vision, language, search, robotics, medical imaging & treatment, drug discovery …
• We have an opportunity to dramatically reshape our computing devices to better serve this emerging and growing market
• Expect to see lots of innovation and excitement in the years to come
35. Concluding Remarks
• Single-thread general purpose performance improvement is slowing down
• Energy efficiency is extremely important in datacenters
• The ARM architecture enables energy-efficient designs with good performance
• Typical-use efficiency is becoming more important than peak-output efficiency in enterprise data centers
• Idle-mode power will become more important for servers
• Smart power management can dynamically optimize server operation to improve efficiency in normal use
• Security improvements are needed even if they cost performance
• There is plenty of opportunity for innovation on new application-specific architectures targeted for specific workloads
Speculation Can Lead to a Meltdown!