Netflix
Performance Meetup
Global Client Performance
Fast Metrics
3G in Kazakhstan
● Global Internet:
○ faster (better networking)
○ slower (broader reach, congestion)
● Don't wait for it, measure it and deal
● Working app > Feature rich app
Making the Internet fast is slow.
We need to know what the Internet looks like, without averages, seeing the full distribution.
Logging Anti-Patterns
● Sampling
○ Missed data
○ Rare events
○ Problems aren’t equal in population
● Averages
○ Can't see the distribution
○ Outliers heavily distort: ∞, 0, negatives, errors
Instead, use the client as a map-reducer and send up aggregated data, less often.
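A minimal sketch of what client-side aggregation could look like, assuming log-spaced latency buckets. The `ClientHistogram` class, its bucket layout and the payload shape are hypothetical illustrations, not the format used in the talk.

```python
import math
from collections import Counter


class ClientHistogram:
    """Hypothetical client-side aggregator: counts samples per log-spaced bucket."""

    def __init__(self, buckets_per_decade=10):
        self.buckets_per_decade = buckets_per_decade
        self.counts = Counter()   # bucket index -> number of samples
        self.invalid = 0          # zero, negative, NaN or infinite samples

    def record(self, millis):
        # Keep pathological values out of the distribution but still count them.
        if not math.isfinite(millis) or millis <= 0:
            self.invalid += 1
            return
        index = math.floor(math.log10(millis) * self.buckets_per_decade)
        self.counts[index] += 1

    def flush(self):
        """Return one compact payload and reset, so uploads can be infrequent."""
        payload = {
            "buckets_per_decade": self.buckets_per_decade,
            "counts": dict(sorted(self.counts.items())),
            "invalid": self.invalid,
        }
        self.counts = Counter()
        self.invalid = 0
        return payload


hist = ClientHistogram()
for sample in [12.0, 15.0, 110.0, 950.0, 0.0, float("inf")]:
    hist.record(sample)
payload = hist.flush()  # one small upload instead of six raw log lines
```

Every sample lands in a bucket, so nothing is lost to sampling, and the server receives counts it can add across clients rather than averages.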
Sizing up the Internet.
Infinite (free) compute power!
● Calculate the inverse empirical cumulative distribution function by math.
○ ...or just use R, which is free and already knows how to do it
Get median, 95th, etc.
> library(HistogramTools)
> iecdf <- HistToEcdf(histogram, method='linear', inverse=TRUE)
> iecdf(0.5)
[1] 0.7975309 # median
> iecdf(0.95)
[1] 4.65 # 95th percentile
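For readers without R, the "by math" route can be sketched in Python. This is a rough counterpart to the linear-interpolation mode shown above, not a port of HistogramTools; `inverse_ecdf` is a hypothetical helper, and the bin edges and counts are made-up example data.

```python
from bisect import bisect_left
from itertools import accumulate


def inverse_ecdf(breaks, counts):
    """Build a function q -> value from histogram bin edges and per-bin counts.

    Assumes samples are spread uniformly inside each bin (linear interpolation).
    """
    if len(breaks) != len(counts) + 1:
        raise ValueError("need exactly one more break than counts")
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram is empty")
    # Cumulative fraction of samples at each bin edge, starting at 0.0.
    cdf = [0.0] + [c / total for c in accumulate(counts)]

    def quantile(q):
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        # Index of the first edge whose cumulative fraction reaches q.
        i = max(1, bisect_left(cdf, q))
        lo, hi = cdf[i - 1], cdf[i]
        if hi == lo:
            # Only reachable when q == 0 and the first bin is empty.
            return breaks[i]
        frac = (q - lo) / (hi - lo)
        return breaks[i - 1] + frac * (breaks[i] - breaks[i - 1])

    return quantile


# Made-up histogram: bin edges in seconds and sample counts per bin.
iecdf = inverse_ecdf([0, 1, 2, 4, 8], [50, 30, 15, 5])
median = iecdf(0.5)   # 1.0
p95 = iecdf(0.95)     # 4.0
```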
Data > Opinions.
Better than debating opinions.
Architecture is hard. Make it cheap to experiment where your users really are.
"There's no way that the client makes that many requests."
"No one really minds the spinner."
"Why should we spend time on that instead of COOLFEATURE?"
"We live in a 50ms world!"
We built Daedalus
[Figure: DNS Time, US vs. Elsewhere, Fast to Slow]
Interpret the data
● Visual → Numerical: need the IECDF for percentiles
○ ƒ(0.50) = 50th (median)
○ ƒ(0.95) = 95th
● Cluster to group similar experiences and get pretty colors (k-means, hierarchical, etc.)
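One way to picture the clustering step is to treat each region as a point made of its percentiles and run k-means over those points. The sketch below is plain Lloyd's algorithm with deterministic seeding; the `kmeans` helper and the (median, 95th percentile) values are made up for illustration and are not data from the talk.

```python
def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm on tuples of floats, with deterministic seeding."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    ordered = sorted(points)
    # Seed centroids with k points spread evenly through the sorted data.
    if k == 1:
        centroids = [ordered[len(ordered) // 2]]
    else:
        step = (len(ordered) - 1) / (k - 1)
        centroids = [ordered[round(i * step)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical (median, 95th percentile) DNS times in milliseconds per region.
regions = [(20, 60), (25, 70), (22, 65), (180, 900), (200, 1000), (190, 950)]
labels, centroids = kmeans(regions, k=2)
```

Regions that share a label have similar experiences and can share a color on a map.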
● Go there!
● Abstract analysis - hard
● Feeling reality is much simpler than looking at graphs. Build!
Practical Teleportation.
Make a Reality Lab.
Don't guess.
Developing a model based on production data, without missing the distribution of samples (network, render, responsiveness), will lead to better software.
Global reach doesn't need to be scary. @gcirino42 http://blogofsomeguy.com
Icarus
Martin Spier
@spiermar
Performance Engineering @ Netflix
Problem & Motivation
● Real-user performance monitoring solution
● More insight into the App performance
(as perceived by real users)
● Too many variables to trust synthetic
tests and labs
● Prioritize work around App performance
● Track App improvement progress over time
● Detect issues, internal and external
Device Diversity
● Netflix runs on all sorts of devices
● Smart TVs, Gaming Consoles, Mobile Phones, Cable TV boxes, ...
● Consistently evaluate performance
What are we monitoring?
● User Actions
(or things users do in the App)
● App Startup
● User Navigation
● Playing a Title
● Internal App metrics
What are we measuring?
● When does the timer start and stop?
● Time-to-Interactive (TTI)
○ Interactive, even if some items were not fully loaded and rendered
● Time-to-Render (TTR)
○ Everything above the fold (visible without scrolling) is rendered
● Play Delay
● Meaningful for what we are monitoring
High-dimensional Data
● Complex device categorization
● Geo regions, subregions, countries
● Highly granular network classifications
● High volume of A/B tests
● Different facets of the same user action
○ Cold, suspended and backgrounded App startups
○ Target view/page on App startup
Data Sketches
● Data structures that approximately
resemble a much larger data set
● Preserve essential features!
● Significantly smaller!
● Faster to operate on!
t-Digest
● t-Digest data structure
● Rank-based statistics
(such as quantiles)
● Parallel friendly
(can be merged!)
● Very fast!
● Really accurate!
https://github.com/tdunning/t-digest
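The merge property is what makes such sketches parallel friendly. The toy class below is not the t-digest algorithm (it has no scale function and makes no accuracy guarantees for the tails); `TinyDigest` and its methods are hypothetical names, used only to show how two independently built sketches can be combined and then queried for a quantile.

```python
class TinyDigest:
    """Toy mergeable quantile sketch. NOT the real t-digest algorithm."""

    def __init__(self, max_centroids=50):
        self.max_centroids = max_centroids
        self.centroids = []  # sorted list of [mean, weight]

    def add(self, value, weight=1.0):
        self.centroids.append([float(value), float(weight)])
        self.centroids.sort()
        self._compress()

    def merge(self, other):
        """Return a new sketch that summarizes both inputs."""
        merged = TinyDigest(self.max_centroids)
        merged.centroids = sorted(c[:] for c in self.centroids + other.centroids)
        merged._compress()
        return merged

    def _compress(self):
        # Repeatedly fuse the two adjacent centroids whose means are closest.
        while len(self.centroids) > self.max_centroids:
            i = min(range(len(self.centroids) - 1),
                    key=lambda j: self.centroids[j + 1][0] - self.centroids[j][0])
            (m1, w1), (m2, w2) = self.centroids[i], self.centroids[i + 1]
            self.centroids[i:i + 2] = [[(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2]]

    def quantile(self, q):
        """Approximate quantile: mean of the centroid that contains rank q."""
        if not self.centroids:
            raise ValueError("sketch is empty")
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        target = q * sum(w for _, w in self.centroids)
        seen = 0.0
        for mean, weight in self.centroids:
            seen += weight
            if seen >= target:
                return mean
        return self.centroids[-1][0]


# Two devices (or shards) build sketches independently, then they are merged.
left, right = TinyDigest(), TinyDigest()
for v in range(1, 501):
    left.add(v)
for v in range(501, 1001):
    right.add(v)
combined = left.merge(right)
approx_median = combined.quantile(0.5)
```

For production use, the library linked above is the better choice; the point here is only that merging never needs the raw samples.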
+ t-Digest sketches
iOS Median Comparison, Break by Country
iOS Median Comparison, Break by Country + iPhone 6S Plus
CDFs by UI Version
Warm Startup Rate
A/B Cell Comparison
Anomaly Detection
Going Forward
● Resource utilization metrics
● Device profiling
○ Instrumenting client code
● Explore other visualizations
○ Frequency heat maps
● Connection between perceived performance, acquisition and retention
@spiermar
Netflix
Autoscaling for experts
Vadim
Savings!
● Mid-tier stateless services are ~2/3 of the total
● Savings - 30% of mid-tier footprint (roughly 30K instances)
○ Higher savings if we break it down by region
○ Even higher savings on services that scale well
Why we autoscale - philosophical reasons
Why we autoscale - pragmatic reasons
● Encoding
● Precompute
● Failover
● Red/black pushes
● Curing cancer**
● And more...
** Hack-day project
Should you autoscale?
Benefits
● On-demand capacity: direct $$ savings
● RI capacity: re-purposing spare capacity
However, for each server group, beware of
● Uneven distribution of traffic
● Sticky traffic
● Bursty traffic
● Small ASG sizes (<10)
Autoscaling impacts availability - true or false?
False*
* If done correctly
Under-provisioning, however, can impact availability
● Autoscaling is not a problem
● The real problem is not knowing the performance characteristics of the service
AWS autoscaling mechanics
Aggregated metric feed → CloudWatch alarm → Notification → ASG scaling policy
Tunables
● Metric
● Threshold
● # of eval periods
● Scaling amount
● Warmup time
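How these tunables interact can be sketched with a small open-loop simulation. `ScalingRule` and `simulate` are hypothetical stand-ins, not the CloudWatch or Auto Scaling API, and warmup is modeled crudely as a pause during which the rule cannot fire again.

```python
from dataclasses import dataclass


@dataclass
class ScalingRule:
    """Hypothetical stand-in for a CloudWatch alarm plus an ASG scaling policy."""
    threshold: float      # metric value above which a period counts as a breach
    eval_periods: int     # consecutive breaching periods needed to fire
    scaling_amount: int   # instances to add each time the rule fires
    warmup_periods: int   # periods to wait before the rule may fire again


def simulate(rule, metric_by_period, capacity):
    """Return the capacity after each period for a scale-up rule."""
    history = []
    breaches = 0
    warmup_left = 0
    for value in metric_by_period:
        # A single non-breaching period resets the consecutive-breach count.
        breaches = breaches + 1 if value > rule.threshold else 0
        if warmup_left > 0:
            warmup_left -= 1
        elif breaches >= rule.eval_periods:
            capacity += rule.scaling_amount
            warmup_left = rule.warmup_periods
            breaches = 0
        history.append(capacity)
    return history


rule = ScalingRule(threshold=100, eval_periods=3, scaling_amount=2, warmup_periods=2)
history = simulate(rule, [90, 110, 120, 130, 140, 150, 160, 170, 80], capacity=10)
# history == [10, 10, 10, 12, 12, 12, 14, 14, 14]
```

More eval periods delay the first step; a larger scaling amount makes each step count for more.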
What metric to scale on?
Throughput
Pros
● Tracks a direct measure of work
● Linear scaling
● Predictable
Cons
● Thresholds tend to drift over time
● Prone to changes in request mixture
Resource utilization
Pros
● Requires less adjustment over time
Cons
● Less predictable
● More oscillation / jitter
Autoscaling on multiple metrics
Proceed with caution
● Harder to reason about scaling behavior
● Different metrics might contradict each other, causing oscillation
Typical Netflix configuration:
● Scale-up policy on throughput
● Scale-down policy on throughput
● Emergency scale-up policy on CPU, aka “the hammer rule”
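The three-policy configuration above can be read as a small decision function. This is a sketch only: `decide`, its thresholds and its step sizes are made-up illustrative values, not Netflix's settings.

```python
def decide(rps_per_instance, cpu_percent,
           scale_up_rps=100.0, scale_down_rps=60.0, hammer_cpu=80.0):
    """Return the suggested change in instance count.

    All thresholds and step sizes are illustrative assumptions.
    """
    if cpu_percent >= hammer_cpu:
        return 10    # emergency "hammer rule": big step, checked first
    if rps_per_instance > scale_up_rps:
        return 2     # normal scale-up on throughput
    if rps_per_instance < scale_down_rps:
        return -1    # gentle scale-down on throughput
    return 0         # inside the band: leave capacity alone
```

Scale-up and scale-down both read the same throughput metric, so they cannot contradict each other; CPU only ever adds capacity.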
Well-behaved autoscaling
Common mistakes - “no rush” scaling
Problem: scaling amounts too small, cooldown too long
Effect: scaling lags behind the traffic flow. Not enough capacity at peak, capacity wasted in trough
Remedy: increase scaling amounts, migrate to step policies
Common mistakes - twitchy scaling
Problem: scale-up policy is too aggressive
Effect: unnecessary capacity churn
Remedy: reduce scale-up amount, increase the # of eval periods
Common mistakes - should I stay or should I go
Problem: scale-up and scale-down thresholds are too close to each other
Effect: constant capacity oscillation
Remedy: move scale-up and scale-down thresholds farther apart
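The oscillation is easy to reproduce in a toy model. The sketch below assumes constant traffic, per-instance RPS thresholds and one-instance steps; `count_capacity_changes` and all numbers are hypothetical.

```python
def count_capacity_changes(traffic_rps, start_instances, up, down, ticks=20):
    """Count how often capacity changes under constant traffic.

    `up` and `down` are per-instance RPS thresholds; each breach moves
    capacity by one instance.
    """
    instances = start_instances
    changes = 0
    for _ in range(ticks):
        load = traffic_rps / instances
        if load > up:
            instances += 1
            changes += 1
        elif load < down and instances > 1:
            instances -= 1
            changes += 1
    return changes


close = count_capacity_changes(1000, 10, up=98, down=93)  # thresholds 5 RPS apart
apart = count_capacity_changes(1000, 10, up=98, down=80)  # thresholds 18 RPS apart
# close == 20: every tick flips between 10 and 11 instances
# apart == 1: one scale-up, then the load sits inside the band
```

With the narrow band, a single step overshoots the opposite threshold, so capacity never settles.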
AWS target tracking - your best bet!
● Think of it as a step policy with auto-steps
● You can also think of it as a thermostat
● Accounts for the rate of change in monitored metric
● Pick a metric, set the target value and warmup time - that’s it!
[Figure: Step vs. Target-tracking]
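The thermostat analogy can be approximated with a proportional estimate. This is not AWS's actual algorithm, which also accounts for the rate of change as noted above; `target_tracking_capacity` is a hypothetical helper that assumes the per-instance metric falls in proportion as instances are added.

```python
import math


def target_tracking_capacity(current_instances, metric_value, target_value,
                             min_instances=1, max_instances=1000):
    """Proportional estimate of the capacity that would bring the metric to target.

    Assumes the per-instance metric scales inversely with instance count.
    """
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    desired = math.ceil(current_instances * metric_value / target_value)
    # Clamp to the group's configured bounds.
    return max(min_instances, min(max_instances, desired))


scale_out = target_tracking_capacity(10, metric_value=75, target_value=50)  # 15
scale_in = target_tracking_capacity(10, metric_value=25, target_value=50)   # 5
```

The step size falls out of the distance from the target, which is why this behaves like a step policy with automatic steps.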
Netflix
PMCs on the Cloud
Brendan
90% CPU utilization:
[Figure: Busy | Waiting (“idle”)]
Reality:
90% CPU utilization:
[Figure: Busy | Waiting (“stalled”) | Waiting (“idle”)]
# perf stat -a -- sleep 10
Performance counter stats for 'system wide':
80018.188438 task-clock (msec) # 8.000 CPUs utilized (100.00%)
7,562 context-switches # 0.095 K/sec (100.00%)
1,157 cpu-migrations # 0.014 K/sec (100.00%)
109,734 page-faults # 0.001 M/sec
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
10.001715965 seconds time elapsed
Performance Monitoring Counters (PMCs) in most clouds
# perf stat -a -- sleep 10
Performance counter stats for 'system wide':
641320.173626 task-clock (msec) # 64.122 CPUs utilized [100.00%]
1,047,222 context-switches # 0.002 M/sec [100.00%]
83,420 cpu-migrations # 0.130 K/sec [100.00%]
38,905 page-faults # 0.061 K/sec
655,419,788,755 cycles # 1.022 GHz [75.02%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
536,830,399,277 instructions # 0.82 insns per cycle [75.02%]
97,103,651,128 branches # 151.412 M/sec [75.02%]
1,230,478,597 branch-misses # 1.27% of all branches [74.99%]
10.001622154 seconds time elapsed
AWS EC2 m4.16xl
Interpreting IPC & Actionable Items
IPC: Instructions Per Cycle (inverse of CPI)
● IPC < 1.0: likely memory stalled
○ Improve data usage and layout for better CPU caching and memory locality.
○ Choose larger CPU caches, faster memory busses and interconnects.
● IPC > 1.0: likely instruction bound
○ Reduce code execution, eliminate unnecessary work, cache operations, improve algorithm order. Can analyze using CPU flame graphs.
○ Faster CPUs.
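As a worked example, the rule of thumb can be applied to the `instructions` and `cycles` counts from the m4.16xl `perf stat` output earlier in this talk. The helper names are hypothetical, and the 1.0 cut-off is the slide's heuristic rather than a hard limit.

```python
def ipc(instructions, cycles):
    """Instructions per cycle from two raw PMC counts."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def interpret(ipc_value):
    """Apply the slide's rule of thumb to an IPC value."""
    if ipc_value < 1.0:
        return "likely memory stalled"
    if ipc_value > 1.0:
        return "likely instruction bound"
    return "borderline"


# Counts copied from the m4.16xl perf stat output.
value = ipc(instructions=536_830_399_277, cycles=655_419_788_755)
verdict = interpret(value)  # round(value, 2) == 0.82, "likely memory stalled"
```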
Intel Architectural PMCs
Event Name                  UMask  Event Select  Example Event Mask Mnemonic
UnHalted Core Cycles        00H    3CH           CPU_CLK_UNHALTED.THREAD_P
Instruction Retired         00H    C0H           INST_RETIRED.ANY_P
UnHalted Reference Cycles   01H    3CH           CPU_CLK_THREAD_UNHALTED.REF_XCLK
LLC Reference               4FH    2EH           LONGEST_LAT_CACHE.REFERENCE
LLC Misses                  41H    2EH           LONGEST_LAT_CACHE.MISS
Branch Instruction Retired  00H    C4H           BR_INST_RETIRED.ALL_BRANCHES
Branch Misses Retired       00H    C5H           BR_MISP_RETIRED.ALL_BRANCHES
Now available in AWS EC2 on full dedicated hosts (e.g., m4.16xl, …)
# pmcarch 1
CYCLES       INSTRUCTIONS  IPC   BR_RETIRED   BR_MISPRED  BMR%  LLCREF      LLCMISS    LLC%
90755342002  64236243785   0.71  11760496978  174052359   1.48  1542464817  360223840  76.65
75815614312  59253317973   0.78  10665897008  158100874   1.48  1361315177  286800304  78.93
65164313496  53307631673   0.82  9538082731   137444723   1.44  1272163733  268851404  78.87
90820303023  70649824946   0.78  12672090735  181324730   1.43  1685112288  343977678  79.59
76341787799  50830491037   0.67  10542795714  143936677   1.37  1204703117  279162683  76.83
[...]
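The derived columns follow from the raw counts. The sketch below recomputes them for the first row of the `pmcarch` output above; reading LLC% as a hit ratio is an inference from the numbers shown, and `derive` is a hypothetical helper, not part of pmc-cloud-tools.

```python
def derive(cycles, instructions, br_retired, br_mispred, llc_ref, llc_miss):
    """Recompute the ratio columns from one row of raw counts."""
    return {
        "IPC": round(instructions / cycles, 2),
        # Branch misprediction ratio, as a percentage of retired branches.
        "BMR%": round(100.0 * br_mispred / br_retired, 2),
        # Last-level cache hit ratio: references that did not miss.
        "LLC%": round(100.0 * (llc_ref - llc_miss) / llc_ref, 2),
    }


# First row of the pmcarch output.
row = derive(cycles=90755342002, instructions=64236243785,
             br_retired=11760496978, br_mispred=174052359,
             llc_ref=1542464817, llc_miss=360223840)
# row == {"IPC": 0.71, "BMR%": 1.48, "LLC%": 76.65}
```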
tiptop - [root]
Tasks: 96 total, 3 displayed screen 0: default
PID    [ %CPU]  %SYS  P  Mcycle  Minstr  IPC   %MISS  %BMIS  %BUS  COMMAND
3897   35.3     28.5  4  274.06  178.23  0.65  0.06   0.00   0.0   java
1319+  5.5      2.6   6  87.32   125.55  1.44  0.34   0.26   0.0   nm-applet
900    0.9      0.0   6  25.91   55.55   2.14  0.12   0.21   0.0   dbus-daemo
https://github.com/brendangregg/pmc-cloud-tools

Thermal Engineering-R & A / C - unit - V
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
scipt v1.pptxcxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx...
 
Double Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torqueDouble Revolving field theory-how the rotor develops torque
Double Revolving field theory-how the rotor develops torque
 
Air Compressor reciprocating single stage
Air Compressor reciprocating single stageAir Compressor reciprocating single stage
Air Compressor reciprocating single stage
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptxA CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
A CASE STUDY ON CERAMIC INDUSTRY OF BANGLADESH.pptx
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 

Netflix SRE perf meetup_slides

  • 4. Making the Internet fast is slow.
      ● Global Internet:
        ○ faster (better networking)
        ○ slower (broader reach, congestion)
      ● Don't wait for it, measure it and deal
      ● Working app > Feature rich app
  • 5. We need to know what the Internet looks like, without averages, seeing the full distribution.
  • 6. Logging Anti-Patterns
      ● Sampling
        ○ Missed data
        ○ Rare events
        ○ Problems aren’t equal in population
      ● Averages
        ○ Can't see the distribution
        ○ Outliers heavily distort: ∞, 0, negatives, errors
      Instead, use the client as a map-reducer and send up aggregated data, less often.
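One way to picture the "client as a map-reducer" idea is a small histogram that lives on the client and is uploaded as bucket counts instead of raw events. The following is a minimal sketch under stated assumptions: the bucket edges, the `ClientHistogram` class, and the JSON payload shape are all hypothetical and are not taken from the talk.

```python
import bisect
import json
import math

# Hypothetical bucket upper edges in seconds; values above the last edge
# fall into one open-ended overflow bucket.
BUCKET_EDGES = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


class ClientHistogram:
    """Aggregates timings on the client so only counts are uploaded."""

    def __init__(self, edges=BUCKET_EDGES):
        self.edges = list(edges)
        self.counts = [0] * (len(self.edges) + 1)  # +1 for the overflow bucket
        self.errors = 0  # bad samples are tallied separately, never averaged in

    def record(self, seconds):
        # None, NaN, infinity, and negatives are the values that distort averages;
        # here they are counted as errors instead of being bucketed.
        if seconds is None or math.isnan(seconds) or math.isinf(seconds) or seconds < 0:
            self.errors += 1
            return
        self.counts[bisect.bisect_left(self.edges, seconds)] += 1

    def flush(self):
        """Return the aggregated payload and reset the local counters."""
        payload = json.dumps(
            {"edges": self.edges, "counts": self.counts, "errors": self.errors}
        )
        self.counts = [0] * len(self.counts)
        self.errors = 0
        return payload
```

Calling `flush()` on a timer (say, once a minute) sends one small message per client regardless of how many events occurred, and the server can add the counts from many clients together without losing the shape of the distribution.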
  • 7. Sizing up the Internet.
  • 10. Get median, 95th, etc.
      ● Calculate the inverse empirical cumulative distribution function by math
        ○ ...or just use R, which is free and knows how to do it already:
          > library(HistogramTools)
          > iecdf <- HistToEcdf(histogram, method='linear', inverse=TRUE)
          > iecdf(0.5)
          [1] 0.7975309   # median
          > iecdf(0.95)
          [1] 4.65        # 95th percentile
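For readers who want to see the "by math" route, the inverse ECDF of a histogram can be sketched in a few lines of Python. This is an illustrative approximation of the linear-interpolation idea, not a port of `HistToEcdf`; the function name `inverse_ecdf` and the `breaks`/`counts` argument shapes are assumptions.

```python
from bisect import bisect_left
from itertools import accumulate


def inverse_ecdf(breaks, counts):
    """Return f(p): the value at quantile p, interpolated linearly inside buckets.

    breaks: bucket boundaries, one more than the number of buckets.
    counts: number of samples in each bucket.
    """
    if len(breaks) != len(counts) + 1:
        raise ValueError("need exactly one more break than counts")
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram is empty")
    # Cumulative probability at each break: 0 at the first, 1 at the last.
    cum = [0.0] + [c / total for c in accumulate(counts)]

    def f(p):
        if not 0.0 <= p <= 1.0:
            raise ValueError("p must be in [0, 1]")
        # First break whose cumulative probability reaches p.
        i = max(1, bisect_left(cum, p))
        lo, hi = cum[i - 1], cum[i]
        frac = 0.0 if hi == lo else (p - lo) / (hi - lo)
        return breaks[i - 1] + frac * (breaks[i] - breaks[i - 1])

    return f
```

For example, `inverse_ecdf([0, 1, 2, 4], [2, 6, 2])(0.5)` lands halfway through the middle bucket, because 20% of the samples sit below it and 80% sit at or below its upper edge.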
  • 14. Better than debating opinions. Architecture is hard. Make it cheap to experiment where your users really are.
      ● "There's no way that the client makes that many requests."
      ● "No one really minds the spinner."
      ● "Why should we spend time on that instead of COOLFEATURE?"
      ● "We live in a 50ms world!"
  • 16. Interpret the data
      ● Visual → Numerical: need the IECDF for percentiles
        ○ ƒ(0.50) = 50th (median)
        ○ ƒ(0.95) = 95th
      ● Cluster to group similar experiences and get pretty colors (k-means, hierarchical, etc.)
  • 21. Practical Teleportation.
      ● Go there!
      ● Abstract analysis - hard
      ● Feeling reality is much simpler than looking at graphs. Build!
  • 24. Don't guess. Developing a model based on production data, without missing the distribution of samples (network, render, responsiveness), will lead to better software. Global reach doesn't need to be scary.
      @gcirino42 http://blogofsomeguy.com
  • 27. Problem & Motivation
      ● Real-user performance monitoring solution
      ● More insight into the App performance (as perceived by real users)
      ● Too many variables to trust synthetic tests and labs
      ● Prioritize work around App performance
      ● Track App improvement progress over time
      ● Detect issues, internal and external
  • 28. Device Diversity
      ● Netflix runs on all sorts of devices
      ● Smart TVs, Gaming Consoles, Mobile Phones, Cable TV boxes, ...
      ● Consistently evaluate performance
  • 30. What are we monitoring?
      ● User Actions (or things users do in the App)
      ● App Startup
      ● User Navigation
      ● Playing a Title
      ● Internal App metrics
  • 31. What are we measuring?
      ● When does the timer start and stop?
      ● Time-to-Interactive (TTI)
        ○ Interactive, even if some items were not fully loaded and rendered
      ● Time-to-Render (TTR)
        ○ Everything above the fold (visible without scrolling) is rendered
      ● Play Delay
      ● Meaningful for what we are monitoring
  • 32. High-dimensional Data
      ● Complex device categorization
      ● Geo regions, subregions, countries
      ● Highly granular network classifications
      ● High volume of A/B tests
      ● Different facets of the same user action
        ○ Cold, suspended and backgrounded App startups
        ○ Target view/page on App startup
  • 36. Data Sketches
      ● Data structures that approximately resemble a much larger data set
      ● Preserve essential features!
      ● Significantly smaller!
      ● Faster to operate on!
  • 37. t-Digest
      ● t-Digest data structure
      ● Rank-based statistics (such as quantiles)
      ● Parallel friendly (can be merged!)
      ● Very fast!
      ● Really accurate!
      https://github.com/tdunning/t-digest
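To make the "can be merged" property concrete, here is a small mergeable quantile sketch in Python. It is deliberately not t-digest: it uses fixed relative-width (logarithmic) buckets in the style of DDSketch, which is simpler to show in a few lines. The class name, the `relative_error` parameter, and the restriction to positive values are assumptions of this sketch.

```python
import math
from collections import Counter


class LogBucketSketch:
    """Mergeable quantile sketch with logarithmic buckets (not t-digest)."""

    def __init__(self, relative_error=0.01):
        if not 0.0 < relative_error < 1.0:
            raise ValueError("relative_error must be in (0, 1)")
        # Bucket i covers (gamma**(i-1), gamma**i].
        self.gamma = (1 + relative_error) / (1 - relative_error)
        self.buckets = Counter()
        self.count = 0

    def add(self, value):
        if value <= 0:
            raise ValueError("this sketch only handles positive values")
        self.buckets[math.ceil(math.log(value, self.gamma))] += 1
        self.count += 1

    def merge(self, other):
        # Merging is plain addition of bucket counts, so sketches built in
        # parallel (per device, per region) combine without extra error.
        if other.gamma != self.gamma:
            raise ValueError("sketches must share the same relative_error")
        self.buckets.update(other.buckets)
        self.count += other.count

    def quantile(self, q):
        if self.count == 0:
            raise ValueError("sketch is empty")
        if not 0.0 <= q <= 1.0:
            raise ValueError("q must be in [0, 1]")
        rank = q * (self.count - 1)
        seen = 0
        for key in sorted(self.buckets):
            seen += self.buckets[key]
            if seen > rank:
                # Representative value for the bucket, within the
                # configured relative error of every value it holds.
                return 2 * self.gamma ** key / (self.gamma + 1)
```

Two sketches built over disjoint halves of a data set and then merged hold the same bucket counts as one sketch built over the whole set, which is the property that makes per-dimension aggregation cheap.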
  • 40. iOS Median Comparison, Break by Country
  • 41. iOS Median Comparison, Break by Country + iPhone 6S Plus
  • 42. CDFs by UI Version
  • 46. Going Forward
      ● Resource utilization metrics
      ● Device profiling
        ○ Instrumenting client code
      ● Explore other visualizations
        ○ Frequency heat maps
      ● Connection between perceived performance, acquisition and retention
      @spiermar
  • 48. Savings!
      ● Mid-tier stateless services are ~2/3rd of the total
      ● Savings - 30% of mid-tier footprint (roughly 30K instances)
        ○ Higher savings if we break it down by region
        ○ Even higher savings on services that scale well
  • 49. Why we autoscale - philosophical reasons
  • 50. Why we autoscale - pragmatic reasons
      ● Encoding
      ● Precompute
      ● Failover
      ● Red/black pushes
      ● Curing cancer**
      ● And more...
      ** Hack-day project
  • 51. Should you autoscale?
      Benefits
      ● On-demand capacity: direct $$ savings
      ● RI capacity: re-purposing spare capacity
      However, for each server group, beware of
      ● Uneven distribution of traffic
      ● Sticky traffic
      ● Bursty traffic
      ● Small ASG sizes (<10)
  • 52. Autoscaling impacts availability - true or false?
      * If done correctly
      Under-provisioning, however, can impact availability
      ● Autoscaling is not a problem
      ● The real problem is not knowing performance characteristics of the service
  • 53. AWS autoscaling mechanics
      Diagram labels: Aggregated metric feed, CloudWatch alarm, Notification, ASG scaling policy
      Tunables
      ● Metric
      ● Threshold
      ● # of eval periods
      ● Scaling amount
      ● Warmup time
  • 54. What metric to scale on? (two candidates compared: Throughput and Resource utilization)
      Pros
      ● Tracks a direct measure of work
      ● Linear scaling
      ● Predictable
      ● Requires less adjustment over time
      Cons
      ● Thresholds tend to drift over time
      ● Prone to changes in request mixture
      ● Less predictable
      ● More oscillation / jitter
  • 55. Autoscaling on multiple metrics
      Proceed with caution
      ● Harder to reason about scaling behavior
      ● Different metrics might contradict each other, causing oscillation
      Typical Netflix configuration:
      ● Scale-up policy on throughput
      ● Scale-down policy on throughput
      ● Emergency scale-up policy on CPU, aka “the hammer rule”
  • 57. Common mistakes - “no rush” scaling
      Problem: scaling amounts too small, cooldown too long
      Effect: scaling lags behind the traffic flow. Not enough capacity at peak, capacity wasted in trough
      Remedy: increase scaling amounts, migrate to step policies
  • 58. Common mistakes - twitchy scaling
      Problem: scale-up policy is too aggressive
      Effect: unnecessary capacity churn
      Remedy: reduce scale-up amount, increase the # of eval periods
  • 59. Common mistakes - should I stay or should I go
      Problem: -up and -down thresholds are too close to each other
      Effect: constant capacity oscillation
      Remedy: move -up and -down thresholds farther apart
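The "should I stay or should I go" failure mode is easy to reproduce with a toy simulation. The sketch below assumes a constant total load and a very simple rule (add or remove a fixed number of instances when per-instance throughput crosses a threshold); the function names and all numbers are hypothetical and do not describe a real AWS policy.

```python
def simulate(load_rps, instances, up, down, step, periods):
    """Return the instance count after each evaluation period.

    up / down are per-instance throughput thresholds (requests per second).
    """
    history = []
    for _ in range(periods):
        per_instance = load_rps / instances
        if per_instance > up:
            instances += step          # scale up
        elif per_instance < down and instances > step:
            instances -= step          # scale down, never below one step
        history.append(instances)
    return history


def count_changes(start, history):
    """Number of periods in which capacity changed."""
    changes, previous = 0, start
    for current in history:
        if current != previous:
            changes += 1
        previous = current
    return changes


# Thresholds 95 and 90 are too close: adding 2 instances drops the
# per-instance load from 100 to about 83, which is already below 90.
close = simulate(1000, 10, up=95, down=90, step=2, periods=20)
# Moving the scale-down threshold to 60 leaves room to settle.
wide = simulate(1000, 10, up=95, down=60, step=2, periods=20)
```

With the close thresholds the group flips between 10 and 12 instances in every period, while the wider gap settles at 12 after a single change.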
  • 60. AWS target tracking - your best bet!
      ● Think of it as a step policy with auto-steps
      ● You can also think of it as a thermostat
      ● Accounts for the rate of change in monitored metric
      ● Pick a metric, set the target value and warmup time - that’s it!
      Charts: Step vs. Target-tracking
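The thermostat analogy can be illustrated with a proportional rule: the further the metric is from its target, the larger the capacity adjustment. This is only a sketch of the idea; the exact algorithm AWS uses is not described in the slides, and the function name, the rounding choice, and the `min_size`/`max_size` defaults are assumptions.

```python
import math


def desired_capacity(current_instances, metric_value, target_value,
                     min_size=1, max_size=1000):
    """Proportional estimate of the capacity that brings the metric to target.

    metric_value and target_value are per-instance averages, for example
    requests per second per instance or CPU utilization percent.
    """
    if current_instances <= 0 or target_value <= 0:
        raise ValueError("current_instances and target_value must be positive")
    # Round up so the estimate errs on the side of more capacity.
    raw = math.ceil(current_instances * metric_value / target_value)
    return max(min_size, min(max_size, raw))
```

For instance, 10 instances running at 150 requests per second each against a target of 100 would suggest 15 instances, while the same group at 50 would suggest 5. The step size falls out of the distance from the target instead of being configured by hand, which is the sense in which the slide compares it to a step policy with automatic steps.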
  • 61. Netflix PMCs on the Cloud Brendan
  • 64. Performance Monitoring Counters (PMCs) in most clouds
      # perf stat -a -- sleep 10
       Performance counter stats for 'system wide':
          80018.188438 task-clock (msec)         #    8.000 CPUs utilized   (100.00%)
                 7,562 context-switches          #    0.095 K/sec           (100.00%)
                 1,157 cpu-migrations            #    0.014 K/sec           (100.00%)
               109,734 page-faults               #    0.001 M/sec
       <not supported> cycles
       <not supported> stalled-cycles-frontend
       <not supported> stalled-cycles-backend
       <not supported> instructions
       <not supported> branches
       <not supported> branch-misses
          10.001715965 seconds time elapsed
  • 65. AWS EC2 m4.16xl
      # perf stat -a -- sleep 10
       Performance counter stats for 'system wide':
         641320.173626 task-clock (msec)         #   64.122 CPUs utilized     [100.00%]
             1,047,222 context-switches          #    0.002 M/sec             [100.00%]
                83,420 cpu-migrations            #    0.130 K/sec             [100.00%]
                38,905 page-faults               #    0.061 K/sec
       655,419,788,755 cycles                    #    1.022 GHz               [75.02%]
       <not supported> stalled-cycles-frontend
       <not supported> stalled-cycles-backend
       536,830,399,277 instructions              #    0.82  insns per cycle   [75.02%]
        97,103,651,128 branches                  #  151.412 M/sec             [75.02%]
         1,230,478,597 branch-misses             #    1.27% of all branches   [74.99%]
          10.001622154 seconds time elapsed
  • 66. Interpreting IPC & Actionable Items
      IPC: Instructions Per Cycle (inverse of CPI)
      ● IPC < 1.0: likely memory stalled
        ○ Data usage and layout to improve CPU caching, memory locality.
        ○ Choose larger CPU caches, faster memory busses and interconnects.
      ● IPC > 1.0: likely instruction bound
        ○ Reduce code execution, eliminate unnecessary work, cache operations, improve algorithm order. Can analyze using CPU flame graphs.
        ○ Faster CPUs.
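As a worked example of the rule of thumb above, the following Python sketch computes IPC from the cycle and instruction counts in the m4.16xl `perf stat` output and applies the 1.0 threshold. The function names are hypothetical, and treating an IPC of exactly 1.0 as instruction bound is a choice made here because the slide does not cover that case.

```python
def ipc(instructions, cycles):
    """Instructions per cycle; the inverse of cycles per instruction (CPI)."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


def interpret(ipc_value):
    """Apply the slide's rule of thumb (a starting point, not a diagnosis)."""
    if ipc_value < 1.0:
        return "likely memory stalled"
    return "likely instruction bound"


# Counts copied from the m4.16xl perf stat sample.
sample_ipc = ipc(instructions=536_830_399_277, cycles=655_419_788_755)
verdict = interpret(sample_ipc)
```

Here `sample_ipc` rounds to 0.82, matching the "insns per cycle" figure that `perf` printed, so this workload falls on the memory-stalled side of the rule.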
  • 67. Intel Architectural PMCs
      Event Name                  UMask  Event Select  Example Event Mask Mnemonic
      UnHalted Core Cycles        00H    3CH           CPU_CLK_UNHALTED.THREAD_P
      Instruction Retired         00H    C0H           INST_RETIRED.ANY_P
      UnHalted Reference Cycles   01H    3CH           CPU_CLK_THREAD_UNHALTED.REF_XCLK
      LLC Reference               4FH    2EH           LONGEST_LAT_CACHE.REFERENCE
      LLC Misses                  41H    2EH           LONGEST_LAT_CACHE.MISS
      Branch Instruction Retired  00H    C4H           BR_INST_RETIRED.ALL_BRANCHES
      Branch Misses Retired       00H    C5H           BR_MISP_RETIRED.ALL_BRANCHES
      Now available in AWS EC2 on full dedicated hosts (eg, m4.16xl, …)
  • 68. https://github.com/brendangregg/pmc-cloud-tools
      # pmcarch 1
      CYCLES       INSTRUCTIONS  IPC   BR_RETIRED   BR_MISPRED  BMR%  LLCREF      LLCMISS    LLC%
      90755342002  64236243785   0.71  11760496978  174052359   1.48  1542464817  360223840  76.65
      75815614312  59253317973   0.78  10665897008  158100874   1.48  1361315177  286800304  78.93
      65164313496  53307631673   0.82  9538082731   137444723   1.44  1272163733  268851404  78.87
      90820303023  70649824946   0.78  12672090735  181324730   1.43  1685112288  343977678  79.59
      76341787799  50830491037   0.67  10542795714  143936677   1.37  1204703117  279162683  76.83
      [...]

      tiptop - [root]
      Tasks: 96 total, 3 displayed                              screen 0: default
        PID  [ %CPU]  %SYS  P  Mcycle  Minstr  IPC   %MISS  %BMIS  %BUS  COMMAND
       3897    35.3   28.5  4  274.06  178.23  0.65  0.06   0.00   0.0   java
       1319+    5.5    2.6  6   87.32  125.55  1.44  0.34   0.26   0.0   nm-applet
        900     0.9    0.0  6   25.91   55.55  2.14  0.12   0.21   0.0   dbus-daemo
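The ratio columns in the `pmcarch` output can be recomputed from the raw counters on the same row. The formulas in this sketch are inferred from the numbers in the sample (they reproduce the first row's IPC, BMR%, and LLC% values) and are not taken from the tool's source; the function name and dictionary keys are assumptions.

```python
def derived_metrics(cycles, instructions, br_retired, br_mispred, llc_ref, llc_miss):
    """Recompute the ratio columns from one row of raw counter values."""
    return {
        # Instructions per cycle.
        "IPC": instructions / cycles,
        # Branch misprediction ratio, as a percentage of retired branches.
        "BMR%": 100.0 * br_mispred / br_retired,
        # Last-level cache hit ratio, as a percentage of LLC references.
        "LLC%": 100.0 * (1.0 - llc_miss / llc_ref),
    }


# First row of the pmcarch sample.
row = derived_metrics(
    cycles=90_755_342_002,
    instructions=64_236_243_785,
    br_retired=11_760_496_978,
    br_mispred=174_052_359,
    llc_ref=1_542_464_817,
    llc_miss=360_223_840,
)
```

Rounded to two decimals, `row` gives 0.71, 1.48, and 76.65, which suggests that the LLC% column reports hits, so a higher value is better.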