Solve the colocation conundrum
Performance and density at scale with Kubernetes
Niklas Nielsen – Intel Corp
Legal Notices and Disclaimers
• Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer.
• No computer system can be absolutely secure.
• Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.
• This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.
• The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.
• No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
• Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate.
• Intel, Xeon, Atom, Core, and the Intel logo are trademarks of Intel Corporation in the United States and other countries.
• *Other names and brands may be claimed as the property of others.
• © 2017 Intel Corporation.
Two Google searches
Notice a difference?
[Figure: timelines (0 to 6 seconds) of the first search, the second search, and both together. Marked events: first 'O' typed, done typing, OSCON (and OSCON 2016) in the autocomplete list, Enter pushed, OSCON 2017 found, rest of search, OSCON context, OSCON logo. Durations marked: 2 seconds and more than 5 seconds.]
Let's talk about microservices
Everyone is pursuing microservice architectures
Single outliers have a big impact at scale
[Diagram: a monolithic service split into microservices A, B, C, D, and E; labels: developer velocity, resiliency, scale]
The number of components increases linearly
[Chart: x-axis 1 to 10, y-axis 0 to 12]
The number of internal requests grows superlinearly
A short experiment…
With one hundred services involved…
…one out of a hundred requests takes over one second…
[Diagram: services, each labeled 1/100]
It takes only one late request for the entire request to be slow.
"Come on, hurry up!"
How many users overall will experience a latency above one second?
A: <30%
B: 30-60%
C: 60-100%
Answer: C, 63%, experiencing one second or worse!
28% of customers will not return to a slow site [1]
[1] 2016 Holiday Retail Insights Report
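$P(>1\,\mathrm{s}) = 1 - (1 - R)^N$, where $R$ is the probability that a single service takes over one second and $N$ is the number of services a request touches (assuming the services are slow independently of each other).
$R = 1/100,\ N = 3$: $P(>1\,\mathrm{s}) = 1 - 0.99^{3} = 2.9701\%$
$R = 1/100,\ N = 100$: $P(>1\,\mathrm{s}) = 1 - 0.99^{100} \approx 63.4\%$
Jeffrey Dean and Luiz André Barroso. 2013. The tail at scale. Commun. ACM 56, 2 (February 2013), 74-80.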
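The formula is easy to check numerically. A minimal Python sketch, assuming (as the formula does) that each of the N sub-requests is slow independently with probability R; the function names are illustrative. It computes the closed form and compares it with a small Monte Carlo simulation of fan-out requests:

```python
import random


def p_slow(r: float, n: int) -> float:
    """Probability that at least one of n independent sub-requests is slow,
    when each is slow with probability r: 1 - (1 - r)^n."""
    return 1.0 - (1.0 - r) ** n


def simulate_p_slow(r: float, n: int, trials: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the same quantity: a user request is slow
    if any of its n fan-out calls lands in the slow tail."""
    rng = random.Random(seed)
    slow = 0
    for _ in range(trials):
        if any(rng.random() < r for _ in range(n)):
            slow += 1
    return slow / trials


for n in (1, 3, 100):
    print(f"N={n:3d}: exact {p_slow(0.01, n):.4%}, "
          f"simulated {simulate_p_slow(0.01, n, trials=20_000):.4%}")
```

The exact values are 1.0000%, 2.9701%, and 63.3968%; the simulated ones should land close to them.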
Variability accumulates when more than one system serves a request
[Figure: latency histogram (latency vs. frequency); 99% of requests make up the body of the distribution and the slowest 1% form the "tail"]
With microservices, scaling out is easy, but tail latency is hard to control.
You will have to deal with this.
What causes variability?
Resource sharing, both global and local
[Diagram: an aggressor (also called antagoniser or noisy neighbor) in best-effort tasks causes interference and contention, which in turn cause variability for high-priority tasks]
How have large infrastructure operators dealt with
variability?
Hedge your bets
[Diagram, five animation frames: Server 1, Server 2, Server 3, Server 4]
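Dean and Barroso describe "hedged requests": send the request to one replica and, if it has not answered within a short delay, send a backup to another replica and use whichever response arrives first. The Python sketch below simulates that idea under a deliberately simple, hypothetical latency model (99% of responses in 10 ms, 1% in 1000 ms); the model, the 50 ms hedge delay, and the function names are assumptions for illustration, not measurements.

```python
from __future__ import annotations

import random

FAST_MS, SLOW_MS, P_SLOW = 10.0, 1000.0, 0.01  # hypothetical latency model


def replica_latency(rng: random.Random) -> float:
    """Latency of one replica: usually fast, occasionally in the slow tail."""
    return SLOW_MS if rng.random() < P_SLOW else FAST_MS


def hedged_latency(rng: random.Random, hedge_after_ms: float) -> tuple[float, bool]:
    """Send to a primary replica; if it has not answered after hedge_after_ms,
    send a backup to a second replica and take whichever answers first.
    Returns (latency, whether a backup request was sent)."""
    primary = replica_latency(rng)
    if primary <= hedge_after_ms:
        return primary, False
    backup = hedge_after_ms + replica_latency(rng)
    return min(primary, backup), True


def run(trials: int, hedge_after_ms: float | None, seed: int = 1) -> tuple[float, float]:
    """Return (fraction of requests taking >= SLOW_MS, fraction that sent a backup)."""
    rng = random.Random(seed)
    slow = backups = 0
    for _ in range(trials):
        if hedge_after_ms is None:
            latency, hedged = replica_latency(rng), False
        else:
            latency, hedged = hedged_latency(rng, hedge_after_ms)
        slow += latency >= SLOW_MS
        backups += hedged
    return slow / trials, backups / trials


print("no hedging:    slow=%.4f" % run(100_000, None)[0])
slow, extra = run(100_000, hedge_after_ms=50.0)
print("hedge at 50ms: slow=%.4f, extra requests=%.4f" % (slow, extra))
```

In this toy model, backups are sent only for the roughly 1% of requests whose primary is late, yet a request stays slow only when both replicas are slow (about 0.01%).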
We built a tool to help you gain insight into the causes of variability: Swan
[Figure: latency (from best case to worse) against load (10% to 100%) for the best case and two interference scenarios (#1 and #2), compared with a latency objective]
experiment.go (pseudocode)
for load := 10%, 20%, ..., 100%
    for aggressor := A ... C
        for repetition := 1 ... 3
            start_kubernetes()
            start_memcached()
            start(aggressor)
            sustain_QPS(load)
            record_metrics()
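The same sweep as a runnable Python sketch. Every helper here is a hypothetical stand-in that only records what it was asked to do; in the pseudocode above the real steps start Kubernetes, memcached, and an aggressor, then drive load against the service. The order (aggressor started before load is applied) is an assumption. The point is the shape of the experiment: 10 load levels × 3 aggressors × 3 repetitions = 90 runs.

```python
from __future__ import annotations

from itertools import product

# Sweep dimensions taken from the pseudocode above.
LOADS = range(10, 101, 10)        # 10%, 20%, ..., 100% of peak load
AGGRESSORS = ["A", "B", "C"]      # placeholder names for aggressor workloads
REPETITIONS = range(1, 4)         # three repetitions per combination

log: list[str] = []               # records calls instead of real side effects


# Hypothetical stand-ins: real versions would deploy and drive workloads.
def start_kubernetes() -> None:
    log.append("kubernetes")


def start_memcached() -> None:
    log.append("memcached")


def start(aggressor: str) -> None:
    log.append(f"aggressor {aggressor}")


def sustain_qps(load: int) -> None:
    log.append(f"load {load}%")


def record_metrics(load: int, aggressor: str, repetition: int) -> dict:
    # A real implementation would collect tail latency here.
    return {"load": load, "aggressor": aggressor, "repetition": repetition}


results = []
# product() nests in the same order as the pseudocode: load, aggressor, repetition.
for load, aggressor, repetition in product(LOADS, AGGRESSORS, REPETITIONS):
    start_kubernetes()
    start_memcached()
    start(aggressor)
    sustain_qps(load)
    results.append(record_metrics(load, aggressor, repetition))

print(len(results))  # 90
```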
from swan import Experiment
# The argument is the experiment's ID.
experiment = Experiment('9F2DE9AF-177E-4E6F-A994-2FF59075448B')
experiment.profile()
[Diagram labels: Cassandra, Snap]
Why didn't Kubernetes' usual performance isolation protect the workload?
Not (only) a Kubernetes issue
[Diagram: processes 1, 2, and 3 time-sharing logical cores 1 and 2]
cgroup CPU shares are the de facto CPU isolation mechanism in container schedulers
[Diagram: example cpu.shares values 1024, 2048, and 1024; 2 versus 10240]
A tiny fraction of CPU time is enough to cause severe performance issues
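Under CPU contention, the Linux CFS scheduler divides CPU time among sibling cgroups in proportion to their `cpu.shares`. The Python sketch below computes those proportions for the share values on the slide (1024/2048/1024, and what appears to be 2 versus 10240). It also includes a helper that approximates how Kubernetes turns a CPU request in millicores into shares (1024 per CPU, with a floor of 2, the value BestEffort pods get); treat that helper as a summary of Kubernetes behavior, not its actual code.

```python
from __future__ import annotations


def cpu_fractions(shares: list[int]) -> list[float]:
    """Fraction of contended CPU time each sibling cgroup gets under CFS,
    proportional to its cpu.shares value."""
    total = sum(shares)
    return [s / total for s in shares]


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Approximation of Kubernetes' request-to-shares conversion:
    1024 shares per CPU, never below the minimum of 2."""
    return max(2, milli_cpu * 1024 // 1000)


print(cpu_fractions([1024, 2048, 1024]))  # [0.25, 0.5, 0.25]
best_effort, guaranteed = 2, milli_cpu_to_shares(10_000)  # 10 CPUs -> 10240 shares
print(guaranteed, f"{cpu_fractions([best_effort, guaranteed])[0]:.4%}")
```

Even the 2-share task gets about 0.02% of a contended CPU, and shares only take effect under contention, so it still runs and can disturb the high-priority task through shared hardware.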
Modern CPUs help reduce the causes of this interference
[Diagram: cores sharing an interconnect, the last-level cache, and memory bandwidth]
Intel® Resource Director Technology (RDT) is an umbrella for:
• Cache occupancy monitoring
• Memory bandwidth monitoring
• Cache Allocation
• Code and Data Prioritization
Scenario / Load (%)   10     20     30     40     50     60     70     80     90     100
Baseline              49%    46%    53%    48%    64%    73%    98%    108%   131%   113%
Experiment            876%   945%   946%   893%   953%   898%   887%   921%   851%   901%

Scenario / Load (%)   10     20     30     40     50     60     70     80     90     100
Baseline              52%    51%    45%    54%    60%    69%    89%    100%   101%   111%
Experiment            167%   504%   458%   521%   545%   917%   948%   878%   886%   971%

Scenario / Load (%)   10     20     30     40     50     60     70     80     90     100
Baseline              36%    34%    29%    40%    34%    42%    50%    67%    77%    98%
Experiment            31%    31%    30%    37%    47%    50%    65%    84%    346%   353%
Kubernetes QoS
Core isolation
Intel RDT
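One way to read tables like these: if each cell is the measured tail latency as a percentage of the service-level objective (an assumption here, because the slide does not label the unit), then any value above 100% is an SLO violation. The useful number is then the lowest load at which a violation first appears. A small Python sketch using rows copied from the first and third results tables:

```python
from __future__ import annotations

LOADS = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # load levels (%) from the table header


def first_violation(row: list[int], limit: int = 100) -> int | None:
    """Lowest load level whose value exceeds `limit` percent, or None if the
    row never does (assumes cells are latency as % of the SLO)."""
    for load, value in zip(LOADS, row):
        if value > limit:
            return load
    return None


tables = {
    "table 1 baseline":   [49, 46, 53, 48, 64, 73, 98, 108, 131, 113],
    "table 1 experiment": [876, 945, 946, 893, 953, 898, 887, 921, 851, 901],
    "table 3 baseline":   [36, 34, 29, 40, 34, 42, 50, 67, 77, 98],
    "table 3 experiment": [31, 31, 30, 37, 47, 50, 65, 84, 346, 353],
}
for name, row in tables.items():
    print(f"{name:20s} first violation at load: {first_violation(row)}")
```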
Cache Allocation
Code and Data Prioritization
# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=c" > /sys/fs/resctrl/p1/schemata
[Diagram: the full L3 cache (mask 0xf) split between group P0 (mask 0x3) and group P1 (mask 0xc)]
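Each schemata value is a capacity bitmask in which every set bit stands for one portion ("way") of the L3 cache. On the 4-bit (0xf) cache drawn on the slide, 0x3 gives P0 the lower two portions and 0xc gives P1 the upper two, so the groups never share cache capacity. Intel CAT generally expects each mask to be one contiguous run of set bits. The Python sketch below checks both properties; the helper names are illustrative, and on a real system the full mask should be read from /sys/fs/resctrl/info/L3/cbm_mask.

```python
from __future__ import annotations


def ways(mask: int) -> list[int]:
    """Indices of the cache portions (bits) enabled in a capacity bitmask."""
    return [bit for bit in range(mask.bit_length()) if mask >> bit & 1]


def is_contiguous(mask: int) -> bool:
    """True if the set bits form a single run, e.g. 0b0110 but not 0b0101."""
    if mask == 0:
        return False
    low = mask & -mask                  # lowest set bit
    return (mask + low) & mask == 0     # the carry clears a single run completely


FULL = 0xF                              # full L3 mask on the slide
p0, p1 = 0x3, 0xC                       # schemata values for groups p0 and p1
assert p0 | p1 == FULL                  # together they cover the whole cache
assert p0 & p1 == 0                     # and never overlap
print(ways(p0), ways(p1))               # [0, 1] [2, 3]
print(is_contiguous(p0), is_contiguous(0x5))  # True False
```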
# echo 1234 > /sys/fs/resctrl/p0/tasks
# echo C0 > /sys/fs/resctrl/p1/cpus
[Diagram: cores 0, 1, 2, and 3 assigned to resource groups P0 and P1]
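Writing to `tasks` moves a single process (here PID 1234) into a group, while `cpus` takes a hexadecimal bitmask of CPUs assigned to the group; tasks from the default group that run on those CPUs then use the group's schemata. `C0` is binary 1100 0000, i.e. CPUs 6 and 7; cores 2 and 3, for example, would be `c`. Newer kernels also provide a `cpus_list` file that accepts a plain list. A small Python sketch for converting between the two forms (helper names are illustrative):

```python
from __future__ import annotations


def cpus_from_mask(mask_hex: str) -> list[int]:
    """Decode a resctrl-style CPU bitmask such as 'C0' into CPU numbers."""
    mask = int(mask_hex.replace(",", ""), 16)  # long masks are comma-separated 32-bit words
    return [cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1]


def mask_from_cpus(cpus: list[int]) -> str:
    """Encode CPU numbers as a lowercase hex bitmask suitable for `cpus`."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


print(cpus_from_mask("C0"))     # [6, 7]
print(mask_from_cpus([2, 3]))   # c
```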
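[Diagram: a process's code and data (heap, stack) passing through a core's L1 instruction (I) and data (D) caches, L2, and L3; the program counter (pc) fetches code while a load such as *(0xf940) fetches data]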
# mount -t resctrl resctrl -o cdp /sys/fs/resctrl
# mkdir -p /sys/fs/resctrl/p0
# echo "L3DATA:0=3" >> /sys/fs/resctrl/p0/schemata
# echo "L3CODE:0=c" >> /sys/fs/resctrl/p0/schemata
[Diagram: two cores, each with L1 instruction (I) and data (D) caches and its own L2, sharing the L3]
Available in Linux 4.10
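With CDP enabled, each group has separate code and data masks for the L3, so its schemata file holds an `L3CODE` and an `L3DATA` line, each in the form `RESOURCE:<cache id>=<mask>;<cache id>=<mask>...`. The Python sketch below renders such lines for one or more cache domains and rejects masks that are empty, non-contiguous, or wider than the full mask before anything is written; the helper name and the 0xf default full mask (from the earlier slide) are assumptions.

```python
from __future__ import annotations


def schemata_line(resource: str, masks: dict[int, int], full_mask: int = 0xF) -> str:
    """Render one resctrl schemata line, e.g. 'L3DATA:0=3', after checking that
    every mask is non-empty, contiguous, and fits inside full_mask."""
    parts = []
    for cache_id, mask in sorted(masks.items()):
        low = mask & -mask  # lowest set bit, used for the contiguity test
        if mask == 0 or mask & ~full_mask or (mask + low) & mask:
            raise ValueError(f"bad mask {mask:#x} for cache id {cache_id}")
        parts.append(f"{cache_id}={mask:x}")
    return f"{resource}:" + ";".join(parts)


print(schemata_line("L3DATA", {0: 0x3}))  # L3DATA:0=3
print(schemata_line("L3CODE", {0: 0xC}))  # L3CODE:0=c
```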
Cache occupancy
Memory bandwidth
# perf stat -e intel_cqm/llc_occupancy/ -I 1000 dd if=/dev/zero of=/dev/null
# time counts unit events
1.000128952 229,376 Bytes intel_cqm/llc_occupancy/
2.000280860 327,680 Bytes intel_cqm/llc_occupancy/
3.000444894 360,448 Bytes intel_cqm/llc_occupancy/
4.000580058 360,448 Bytes intel_cqm/llc_occupancy/
How do you use this number?
$ lscpu
...
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
[Diagram: a process's occupancy of the last-level cache]
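One way to use the occupancy number is to divide it by the last-level cache size reported by `lscpu`. With the last sample above (360,448 bytes) and a 12288K L3, `dd` holds under 3% of the cache. A Python sketch of that arithmetic; the parsing helper is illustrative and assumes lscpu's `K` suffix means KiB, as in the output above. (The `intel_cqm` perf events were later removed from the kernel, around Linux 4.14, in favor of resctrl's own files such as `mon_data/mon_L3_00/llc_occupancy`; the arithmetic is the same.)

```python
def size_to_bytes(size: str) -> int:
    """Convert an lscpu-style size such as '12288K' or '2M' to bytes (binary units)."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    size = size.strip()
    if size[-1].upper() in units:
        return int(size[:-1]) * units[size[-1].upper()]
    return int(size)


def occupancy_fraction(occupancy_bytes: int, llc_size: str) -> float:
    """Share of the last-level cache currently held by the monitored process."""
    return occupancy_bytes / size_to_bytes(llc_size)


print(f"{occupancy_fraction(360_448, '12288K'):.2%}")  # 2.86%
```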
# perf stat -e intel_cqm/local_bytes/ -I 1000 dd if=/dev/zero of=/dev/null
# time counts unit events
1.000129604 0.20 MB intel_cqm/local_bytes/
2.000284311 0.00 MB intel_cqm/local_bytes/
3.000426805 0.00 MB intel_cqm/local_bytes/
4.000560934 0.07 MB intel_cqm/local_bytes/
How do you use this number?
[Diagram: cores connected through the interconnect to the last-level cache and memory; the memory bandwidth used by one process is highlighted]
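With `-I 1000`, perf prints one line per one-second interval, and each `local_bytes` value is the local memory traffic counted during that interval. Dividing by the interval length gives a bandwidth figure you can compare against what the platform sustains. A Python sketch that parses the interval lines shown above; it assumes exactly that four-column layout (time, count, unit, event) and MB units, and the function name is illustrative.

```python
from __future__ import annotations


def bandwidth_mb_per_s(perf_lines: list[str]) -> list[tuple[float, float]]:
    """Turn `perf stat -I` interval lines ('time count unit event') into
    (timestamp, MB/s) pairs, dividing each count by its interval length."""
    samples, prev_t = [], 0.0
    for line in perf_lines:
        fields = line.split()
        if len(fields) != 4 or line.lstrip().startswith("#"):
            continue  # skip the header and blank lines
        t, count, unit = float(fields[0]), float(fields[1].replace(",", "")), fields[2]
        if unit != "MB":
            raise ValueError(f"unexpected unit {unit!r}")
        samples.append((t, count / (t - prev_t)))
        prev_t = t
    return samples


output = """# time counts unit events
1.000129604 0.20 MB intel_cqm/local_bytes/
2.000284311 0.00 MB intel_cqm/local_bytes/
3.000426805 0.00 MB intel_cqm/local_bytes/
4.000560934 0.07 MB intel_cqm/local_bytes/"""

for t, bw in bandwidth_mb_per_s(output.splitlines()):
    print(f"{t:8.3f}s  {bw:.3f} MB/s")
```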
Cache Monitoring Technology (CMT): available in Linux 4.1
Memory Bandwidth Monitoring (MBM): available in Linux 4.6
What's next?
I'll leave you with four points:
• The number of services involved in a request is increasing superlinearly.
• The largest cluster users have dealt with accumulated variability for years.
• Intel® RDT helps reduce the sources of variability by letting you prioritize access to shared resources.
• Swan is a tool for understanding the effects of interference and how to avoid them.
Swan is available under the Apache 2.0 License and can be downloaded today:
https://github.com/intelsdi-x/swan
Read more about how to use Intel® RDT:
https://github.com/01org/intel-cmt-cat/
Thanks to all involved in this project
• Maciej Iwanowski, Pawel Palucki, Szymon Konefal, Maciej Patelczyk, Michal Stachowski, Arek Chylinski and the rest of the Swan team
• Andrew Herdich and the Intel RDT teams
• Tony Luck, Fenghua Yu and the Intel Linux kernel teams
Thank you!

 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 

Solve the colocation conundrum: Performance and density at scale with Kubernetes

  • 1. Solve the colocation conundrum: Performance and density at scale with Kubernetes. Niklas Nielsen, Intel Corp
  • 2. Legal Notices and Disclaimers: Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Learn more at intel.com, or from the OEM or retailer. No computer system can be absolutely secure. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance. This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps. The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document. Intel does not control or audit third-party benchmark data or the web sites referenced in this document. You should visit the referenced web site and confirm whether referenced data are accurate. Intel, Xeon, Atom, Core, and the Intel logo are trademarks of Intel Corporation in the United States and other countries. *Other names and brands may be claimed as the property of others. © 2017 Intel Corporation.
  • 8. [Figure: timeline (0 to 6 seconds) of the first and second searches, marking: first ‘O’ typed, done typing, OSCON in autocomplete list, OSCON 2016 in autocomplete list, pushed enter, OSCON 2017 found, rest of search, OSCON context, OSCON logo. First search: 2 seconds; second search: >5 seconds]
  • 10. Let’s talk about microservices
  • 11. Everyone is pursuing microservice architectures
  • 12. Single outliers have a big impact at scale
  • 15. The number of components increases linearly
  • 16. The number of internal requests grows superlinearly
  • 19. With one hundred services involved …
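  • 20. one out of a hundred requests takes over one second… [Figure: three services, each labeled 1/100]
  • 21. One late request is enough for the entire request to be slow. [Image caption: “Come on, hurry up!”]
  • 22. How many users overall will experience a latency above one second? A: <30%, B: 30-60%, C: 60-100%
  • 23. C: 63% experience a latency of one second or worse! 28% of customers will not return to a slow site [1]. [1] 2016 Holiday Retail Insights Report
  • 24. With $R$ the probability that a single service takes over one second and $N$ the number of services involved, $P(>1\,\mathrm{s}) = 1 - (1 - R)^N$. For $R = 1/100$ and $N = 3$: $P(>1\,\mathrm{s}) = 2.9701\%$. For $R = 1/100$ and $N = 100$: $P(>1\,\mathrm{s}) \approx 63.4\%$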
  • 25. Jeffrey Dean and Luiz André Barroso. 2013. The tail at scale. Commun. ACM 56, 2 (February 2013), 74-80
  • 26. Variability accumulates when more than one system serves a request
  • 27. [Figure: latency histogram (latency vs. frequency); 99% of requests fall in the body, and the slowest 1% form “the tail”]
  • 28. With microservices, scaling is easy, but tail latency is hard to control
  • 29. You will have to deal with this
  • 35. How have large infrastructure operators dealt with variability? Hedge your bets
  • 36-40. [Diagram, built up over five slides: Server 1, Server 2, Server 3, Server 4]
  • 41. We built a tool, Swan, to help you gain insight into the causes of variability
  • 43. [Figure: two panels, 100% load and 10% load, each comparing the best case with interference #1 and interference #2]
  • 45. Swan experiment loop (experiment.go, pseudocode):
        for load := 10% 20% ... 100%
          for aggressor := A ... C
            for repetition := 1 ... 3
              start_kubernetes()
              start_memcached()
              sustain_QPS(load)
              record_metrics()
              start(aggressor)
      Analysis (Python):
        import swan
        experiment = Experiment('9F2DE9AF-177E-4E6F-A994-2FF59075448B')
        experiment.profile()
      [Diagram labels: Cassandra, Snap]
  • 48. Why didn’t Kubernetes’ usual performance isolation protect the workload? It is not (only) a Kubernetes issue
  • 50. Cgroups CPU shares are the de facto CPU isolation mechanism in container schedulers [Figure values: 1024, 2048, 1024, 210240]
  • 51. A tiny fraction of CPU time is enough to cause severe performance issues
  • 52. Modern CPUs help reduce the causes of this interference
  • 53. [Diagram: cores sharing an interconnect, the last level cache, and memory bandwidth]
  • 54. Intel® Resource Director Technology (RDT) is an umbrella for: cache occupancy monitoring, memory bandwidth monitoring, Cache Allocation, and Code and Data Prioritization
  • 55. [Tables: Baseline vs. Experiment by load for three isolation mechanisms]
      Kubernetes QoS
        Scenario / Load (%)     10    20    30    40    50    60    70    80    90   100
        Baseline               49%   46%   53%   48%   64%   73%   98%  108%  131%  113%
        Experiment            876%  945%  946%  893%  953%  898%  887%  921%  851%  901%
      Core isolation
        Scenario / Load (%)     10    20    30    40    50    60    70    80    90   100
        Baseline               52%   51%   45%   54%   60%   69%   89%  100%  101%  111%
        Experiment            167%  504%  458%  521%  545%  917%  948%  878%  886%  971%
      Intel RDT
        Scenario / Load (%)     10    20    30    40    50    60    70    80    90   100
        Baseline               36%   34%   29%   40%   34%   42%   50%   67%   77%   98%
        Experiment             31%   31%   30%   37%   47%   50%   65%   84%  346%  353%
  • 56. Cache Allocation:
        # mount -t resctrl resctrl /sys/fs/resctrl
        # cd /sys/fs/resctrl
        # mkdir p0 p1
        # echo "L3:0=3" > /sys/fs/resctrl/p0/schemata
        # echo "L3:0=c" > /sys/fs/resctrl/p1/schemata
      [Diagram: the full L3 cache (mask 0xf) split into P0 (0x3) and P1 (0xc)]
  • 57. Cache Allocation:
        # echo 1234 > /sys/fs/resctrl/p0/tasks   # move PID 1234 into partition P0
        # echo C0 > /sys/fs/resctrl/p1/cpus      # assign the CPUs in hex bitmask C0 to P1
      [Diagram: Core 0 to Core 3, assigned to partitions P0 and P1]
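  • 59. Code and Data Prioritization: [Diagram: a process's code and data (heap, stack) passing through a core's L1 instruction (I) and data (D) caches, with an instruction fetch at pc and a data access *(0xf940), then L2 and L3]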
  • 61. Code and Data Prioritization:
        # mount -t resctrl resctrl -o cdp /sys/fs/resctrl
        # mkdir -p /sys/fs/resctrl/p0
        # echo "L3DATA:0=3" >> /sys/fs/resctrl/p0/schemata
        # echo "L3CODE:0=c" >> /sys/fs/resctrl/p0/schemata
      [Diagram: two cores, each with L1 instruction (I) and data (D) caches and an L2, sharing the L3]
  • 62. Cache Allocation and Code and Data Prioritization are available in Linux 4.10
  • 63. Cache occupancy:
        # perf stat -e intel_cqm/llc_occupancy/ -I 1000 dd if=/dev/zero of=/dev/null
        #           time             counts   unit events
             1.000128952            229,376  Bytes intel_cqm/llc_occupancy/
             2.000280860            327,680  Bytes intel_cqm/llc_occupancy/
             3.000444894            360,448  Bytes intel_cqm/llc_occupancy/
             4.000580058            360,448  Bytes intel_cqm/llc_occupancy/
  • 64. How do you use this number?
        $ lscpu
        ...
        L1d cache:  32K
        L1i cache:  32K
        L2 cache:   256K
        L3 cache:   12288K
      [Diagram: a process's occupancy of the last level cache]
  • 65. Memory bandwidth:
        # perf stat -e intel_cqm/local_bytes/ -I 1000 dd if=/dev/zero of=/dev/null
        #           time             counts   unit events
             1.000129604               0.20     MB intel_cqm/local_bytes/
             2.000284311               0.00     MB intel_cqm/local_bytes/
             3.000426805               0.00     MB intel_cqm/local_bytes/
             4.000560934               0.07     MB intel_cqm/local_bytes/
  • 66. How do you use this number? [Diagram: cores, interconnect, last level cache, and memory bandwidth, showing a process's bandwidth use]
  • 67. Cache Monitoring Technology (CMT), which measures cache occupancy, is available in Linux 4.1; Memory Bandwidth Monitoring (MBM) is available in Linux 4.6
  • 69. I'll leave you with 4 points
  • 70. The number of services involved in a request is increasing superlinearly
  • 71. The largest cluster users have dealt with accumulated variability for years
  • 72. Intel® helps reduce the sources of variability by prioritizing workloads with Intel® RDT
  • 73. Swan is a tool to understand the effects of interference and how to avoid it
  • 74. Swan is under the Apache 2.0 License and available for download today (https://github.com/intelsdi-x/swan). Read more about how to use Intel® RDT at https://github.com/01org/intel-cmt-cat/
  • 75. Thanks to all involved in this project: Maciej Iwanowski, Pawel Palucki, Szymon Konefal, Maciej Patelczyk, Michal Stachowski, Arek Chylinski and the rest of the Swan team; Andrew Herdich and the Intel RDT teams; Tony Luck, Fenghua Yu and the Intel Linux kernel teams
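The probability on slide 24 can be checked numerically. This is a minimal sketch that assumes, as the slide does, that each of the N services involved in a request is independently slow (over one second) with probability R. The function name `p_slow` is illustrative and is not part of any tool from the talk.

```python
def p_slow(r: float, n: int) -> float:
    """Probability that a request fanning out to n services sees at least one slow reply.

    Assumes each service is slow with probability r, independently of the
    others. The request is fast only if all n replies are fast, which
    happens with probability (1 - r) ** n.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must be a probability")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - r) ** n


if __name__ == "__main__":
    for n in (1, 3, 100):
        print(f"N={n:>3}: P(>1s) = {p_slow(0.01, n):.2%}")
```

With R = 1/100, this prints 1.00%, 2.97% and 63.40% for N = 1, 3 and 100. Each service is slow only 1% of the time, yet the chance of a slow response climbs quickly with fan-out.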
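Slide 27 calls the slowest 1% of requests "the tail". As an illustration only, the sketch below computes a nearest-rank percentile over a list of latency samples. The sample values are made up for the example and are not measurements from the talk.

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds: 98 fast requests, 2 slow outliers.
    latencies = [10.0] * 98 + [1200.0] * 2
    print("p50:", percentile(latencies, 50))  # 10.0
    print("p99:", percentile(latencies, 99))  # 1200.0
```

The median hides the outliers entirely, while the 99th percentile lands on them. This is why the talk focuses on the tail rather than on the average.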
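Slides 35 to 40 answer variability with "hedge your bets". One well-known form, described by Dean and Barroso in the paper cited on slide 25, is the hedged request. The client sends the request to one replica. If no answer arrives within a short delay, it sends a second copy to another replica and uses whichever reply comes first. The asyncio sketch below is illustrative only: `fake_server`, the delays and the 50 ms hedge threshold are hypothetical. The paper suggests deferring the second request until the first has been outstanding longer than a high percentile of expected latency.

```python
import asyncio
from functools import partial


async def fake_server(name: str, delay: float) -> str:
    """Hypothetical replica that answers with its name after `delay` seconds."""
    await asyncio.sleep(delay)
    return name


async def hedged_request(replicas, hedge_after: float) -> str:
    """Query replicas[0]; if it has not answered within `hedge_after` seconds,
    also query replicas[1] and return whichever reply arrives first.

    `replicas` holds zero-argument callables that return coroutines.
    The losing request is cancelled so it stops consuming resources.
    """
    primary = asyncio.create_task(replicas[0]())
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()

    # The primary is slow: hedge the bet with a second copy of the request.
    backup = asyncio.create_task(replicas[1]())
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    return done.pop().result()


async def main() -> None:
    backup = partial(fake_server, "server-2", 0.01)
    fast = partial(fake_server, "server-1", 0.01)
    slow = partial(fake_server, "server-1", 1.0)
    print(await hedged_request([fast, backup], hedge_after=0.05))  # server-1
    print(await hedged_request([slow, backup], hedge_after=0.05))  # server-2


if __name__ == "__main__":
    asyncio.run(main())
```

Waiting before sending the backup keeps the extra load small. Most requests finish before the threshold and never trigger a second copy.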
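Slides 50 and 51 rest on how cgroup CPU shares translate into CPU time. Shares are relative weights: when every group is runnable on the same CPU, each gets roughly shares / sum(shares) of it. The sketch below only does that arithmetic, and the group names are hypothetical. The value 2 for the best-effort job is an assumption: it is the minimum cpu.shares value cgroup v1 accepts, and, to our understanding, what Kubernetes assigns to BestEffort pods.

```python
def cpu_fractions(shares: dict[str, int]) -> dict[str, float]:
    """Fraction of a contended CPU each cgroup is entitled to.

    cpu.shares are relative weights: only the ratios between runnable
    groups matter. Assumes every listed group is runnable on the same CPU.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("shares must sum to a positive value")
    return {name: value / total for name, value in shares.items()}


if __name__ == "__main__":
    # The first three values from slide 50.
    print(cpu_fractions({"a": 1024, "b": 2048, "c": 1024}))
    # Hypothetical: a latency-critical service next to a best-effort job.
    for name, frac in cpu_fractions({"memcached": 1024, "batch": 2}).items():
        print(f"{name}: {frac:.2%}")
```

Under contention, the best-effort job is entitled to only about 0.19% of the CPU. Slide 51's point is that even this sliver can hurt. Shares divide CPU time, but not the shared last level cache and memory bandwidth shown on slide 53.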
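Slide 55's tables can be read as sensitivity profiles. The sketch makes two assumptions, since the slide does not spell out units or ordering. First, each percentage is the measured tail latency relative to the workload's latency target, so values above 100% violate it. Second, the three tables correspond, in order, to the labels Kubernetes QoS, Core isolation and Intel RDT. Under those assumptions, a small helper can find the highest load each scenario sustains before its first violation. The numbers are copied from the slide's Experiment rows, and the helper name is hypothetical.

```python
from typing import Optional

LOADS = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]  # load levels from slide 55, in %

# "Experiment" rows from slide 55, assumed to be % of the latency target.
EXPERIMENT = {
    "Kubernetes QoS": [876, 945, 946, 893, 953, 898, 887, 921, 851, 901],
    "Core isolation": [167, 504, 458, 521, 545, 917, 948, 878, 886, 971],
    "Intel RDT": [31, 31, 30, 37, 47, 50, 65, 84, 346, 353],
}


def max_load_within_target(row: list[int], limit: int = 100) -> Optional[int]:
    """Highest load reached before the first value above `limit`;
    None if the lowest load already violates it."""
    best = None
    for load, value in zip(LOADS, row):
        if value > limit:
            break
        best = load
    return best


if __name__ == "__main__":
    for scenario, row in EXPERIMENT.items():
        print(f"{scenario}: {max_load_within_target(row)}")
```

Under this reading, only the Intel RDT configuration stays within the target, up to 80% load. That agrees with the speaker note saying CAT keeps the SLA up to 80%.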
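The schemata values on slide 56 and the cpus value on slide 57 are hexadecimal bitmasks. The helpers below are a hypothetical sketch. They assume a 4-bit capacity bitmask in which 0xf is the full L3, as drawn on slide 56; real processors usually expose more bits, and the kernel reports the full mask in /sys/fs/resctrl/info/L3/cbm_mask. The helpers do three things:
- check that a cache mask is one contiguous run of set bits, which Intel's Cache Allocation has traditionally required;
- compute the share of the cache a mask covers;
- build the hexadecimal CPU mask that a group's cpus file expects.

```python
def is_contiguous(mask: int) -> bool:
    """True if `mask` is one non-empty run of consecutive 1 bits (e.g. 0x3, 0xc)."""
    if mask <= 0:
        return False
    lowest = (mask & -mask).bit_length() - 1  # index of the lowest set bit
    run = mask >> lowest  # shift the run down to bit 0, e.g. 0xc -> 0x3
    return run & (run + 1) == 0  # 0b0111 + 1 = 0b1000 shares no bits with 0b0111


def cache_share(mask: int, cbm_len: int) -> float:
    """Fraction of the L3 capacity bitmask covered by `mask`."""
    return bin(mask).count("1") / cbm_len


def cpus_mask(cpus: list[int]) -> str:
    """Hexadecimal CPU bitmask, the format a resctrl group's `cpus` file takes."""
    value = 0
    for cpu in cpus:
        value |= 1 << cpu
    return format(value, "x")


if __name__ == "__main__":
    p0, p1, full = 0x3, 0xC, 0xF  # masks from slide 56, assuming a 4-bit CBM
    assert is_contiguous(p0) and is_contiguous(p1)
    assert p0 & p1 == 0 and p0 | p1 == full  # disjoint, and together cover all of L3
    print(f"P0: {cache_share(p0, 4):.0%} of L3, P1: {cache_share(p1, 4):.0%} of L3")
    print("CPUs 6 and 7 ->", cpus_mask([6, 7]))
```

`cpus_mask([6, 7])` returns "c0". Read as a hexadecimal mask, the `C0` written on slide 57 therefore selects CPUs 6 and 7.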
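Slide 64 asks how to use the occupancy number. One simple reading is to express the llc_occupancy count as a fraction of the whole L3 that lscpu reports. The sketch below assumes that comparison. The parsing helper is hypothetical and only handles the K and M suffixes in the lscpu output shown on the slide.

```python
UNITS = {"K": 1024, "M": 1024 ** 2}  # lscpu's K and M are binary units


def parse_size(text: str) -> int:
    """Convert an lscpu-style cache size such as '12288K' to bytes."""
    text = text.strip()
    suffix = text[-1].upper()
    if suffix in UNITS:
        return int(text[:-1]) * UNITS[suffix]
    return int(text)  # plain number of bytes


def occupancy_fraction(occupancy_bytes: int, llc_size: str) -> float:
    """Share of the last level cache currently occupied by the monitored task."""
    return occupancy_bytes / parse_size(llc_size)


if __name__ == "__main__":
    # Last llc_occupancy sample from slide 63 against the L3 size on slide 64.
    print(f"{occupancy_fraction(360_448, '12288K'):.2%}")
```

This prints 2.86%, so in this example the dd process holds under 3% of the L3.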

Editor's Notes

  1. How is everyone feeling? Seen some good talks by now? Just getting started? This is a not-so-gentle introduction to Kubernetes performance. The most important thing for me is that you understand, and that I don’t lose you midway. As we all have different levels of experience, feel free to shout out if something doesn’t make sense.
  2. First off, since I am an Intel employee and this is a sponsor talk slot, I have to remind you of our legal notice. It covers mentions of our brand and legal protection, in general and for the contents of this talk.
  3. That aside, I want to conduct a small experiment. I’m going to show you two Google searches; see if you can tell the difference. Be aware that each one is only a few seconds long, so pay close attention.
  4. An artificial 100 ms delay per connection raised the response time from 2 to 5 seconds. I’ve tried to break down the response time here. Spend a few seconds on each graph, slowly explaining what the axes mean before diving into the interpretation.
  5. It might seem surprising, but 2.4 seconds is the sweet spot for users. Another way to interpret this is that, in online retail, customers start to turn away after this amount of time. User patience is steadily decreasing: users expect instantaneous responses even for the most complicated queries.
  6. Consider graphic here
  7. Consider graphic here
  8. Maybe get some numbers
  9. To give you an example of the interconnectedness, Netflix built a tool called Vizceral, which samples network requests.
  10. Give options
  11. Need to tie back to initial experiment
  12. Every request is like flipping a coin. Too information-dense; include a highlight. Don’t explain the equation; it is hard to talk to.
  13. Insert reference. Spend a few seconds on each graph, slowly explaining what the axes mean before diving into the interpretation. Highlights? At Google scale, this matters.
  14. The reason this is called the tail at scale
  15. Not only a problem for the largest companies in the world.
  16. Similar to how these fellas are probably dragging their owner in every direction, each user and system is competing for access to resources in modern data centers.
  17. Global: network oversubscription; queueing in leaf and spine switches. Local: issue slots, L1 and L2, and power budgets per core during SMT; L3, memory bandwidth, and power budget per socket; I/O bandwidth; network links; kernel caches.
  18. Talk about what makes an application perform as desired, and what happens when it isn’t performing as we expect.
  19. Spend a few seconds on each graph, slowly explaining what the axes mean before diving into the interpretation.
  20. Sensitivity profiles have been used in academia to show how sensitive a workload is to co-location. They have been used to demonstrate performance isolation in research from Stanford and Google [2]. Greener profiles indicate more resilience to interference.
  21. Networks in data centers have become so fast that memory access over the network can outperform disk access. Operators keep ‘cache clusters’, either of spare capacity or, more likely, dedicated to speeding up requests. This is a normal pattern used by the largest sites: Twitter, Facebook, Wikipedia. We chose memcached as a high-priority workload because it is notoriously hard to place anything next to.
  22. Kubernetes co-location
  23. Now, why is that?
  24. Compute the fractions. What the process scheduler does is find the process that is furthest away from its fair share and schedule it next.
  25. What we call interference
  26. Explain caches in a modern server CPU
  27. These experiments were run on a single-socket Xeon D-1541 platform running Linux. Highlights: core isolation alone is not enough; CAT reduces the interference and keeps the SLA up to 80% load.
  28. Explain axis
  29. Some applications are extremely sensitive to these kinds of workloads; online web search is one.
  30. Why does CDP matter?
  31. Maybe a more realistic example. Show what contention looks like.
  32. Maybe a more realistic example. Show what contention looks like.
  33. TODO Split into 4 slides
  34. TODO Split into 4 slides
  35. TODO Split into 4 slides
  36. How do you know how much to give to each partition?
  37. Tying things together
  38. Tying things together
  39. Tying things together
  40. Besides this
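Speaker note 24 summarizes the scheduler as picking the process furthest away from its fair share. The toy simulation below illustrates only that idea. It is a simplification, not the Linux CFS implementation (CFS tracks a weighted virtual runtime per task). The task names and tick count are hypothetical.

```python
from fractions import Fraction


def simulate(weights: dict[str, int], ticks: int) -> dict[str, int]:
    """Toy fair-share scheduler: each tick runs the task furthest below its share.

    After t ticks a task's fair share is t * weight / total_weight; its
    deficit is that share minus the ticks it has actually received.
    Exact fractions avoid floating-point drift; ties go to the first task.
    """
    total = sum(weights.values())
    runtime = {name: 0 for name in weights}
    for t in range(1, ticks + 1):
        chosen = max(
            weights,
            key=lambda name: Fraction(t * weights[name], total) - runtime[name],
        )
        runtime[chosen] += 1
    return runtime


if __name__ == "__main__":
    # Share values 1024, 2048 and 1024 as on slide 50; task names are hypothetical.
    print(simulate({"A": 1024, "B": 2048, "C": 1024}, ticks=4000))
    # {'A': 1000, 'B': 2000, 'C': 1000}
```

Every tick goes to whichever task is owed the most time, so over a long run each task's CPU time tracks its weight.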