SlideShare ist ein Scribd-Unternehmen logo
1 von 70
Netflix
Performance Meetup
Global Client Performance
Fast Metrics
3G in Kazakhstan
● Global Internet:
● faster (better networking)
● slower (broader reach, congestion)
● Don't wait for it, measure it and deal
● Working app > Feature rich app
Making the Internet fast
is slow.
We need to know what the Internet looks like,
without averages, seeing the full distribution.
● Sampling
○ Missed data
○ Rare events
○ Problems aren’t equal in
Population
● Averages
○ Can't see the distribution
○ Outliers heavily distort
∞, 0, negatives, errors
Logging Anti-Patterns
Instead, use the client as a map-reducer and send up aggregated
data, less often.
Sizing up the Internet.
Infinite (free) compute power!
● Calculate the inverse empirical cumulative
distribution function by math.
Get median, 95th, etc.
> library(HistogramTools)
> iecdf <- HistToEcdf(histogram,
method='linear’, inverse=TRUE)
> iecdf(0.5)
[1] 0.7975309 # median
> iecdf(0.95)
[1] 4.65 # 95th
percentile
o ...or just use R which is free and knows how
to do it already
Data > Opinions.
Better than debating opinions.
Architecture is hard. Make it cheap to experiment where your users really are.
"There's no way that the
client makes that many
requests.”
"No one really minds the
spinner."
"Why should we spend
time on that instead of
COOLFEATURE?"
"We live in a
50ms world!"
We built Daedalus
US
Elsewhere
Fast
Slow
DNS Time
● Visual → Numerical, need the IECDF for
Percentiles
○ ƒ(0.50) = 50th
(median)
○ ƒ(0.95) = 95th
● Cluster to get pretty colors similar experiences.
(k-means, hierarchical, etc.)
Interpret the data
● Go there!
● Abstract analysis - hard
● Feeling reality is much simpler than looking at graphs. Build!
Practical Teleportation.
Make a Reality Lab.
Don't guess.
Developing a model based on
production data, without missing the
distribution of samples (network, render,
responsiveness) will lead to better
software.
Global reach doesn't need to be scary. @gcirino42 http://blogofsomeguy.com
Icarus
Martin Spier
@spiermar
Performance Engineering @ Netflix
Problem & Motivation
● Real-user performance monitoring solution
● More insight into the App performance
(as perceived by real users)
● Too many variables to trust synthetic
tests and labs
● Prioritize work around App performance
● Track App improvement progress over time
● Detect issues, internal and external
Device Diversity
● Netflix runs on all sorts of devices
● Smart TVs, Gaming Consoles, Mobile Phones, Cable TV boxes, ...
● Consistently evaluate performance
What are we monitoring?
● User Actions
(or things users do in the App)
● App Startup
● User Navigation
● Playing a Title
● Internal App metrics
What are we measuring?
● When does the timer start and stop?
● Time-to-Interactive (TTI)
○ Interactive, even if
some items were not fully
loaded and rendered
● Time-to-Render (TTR)
○ Everything above the fold
(visible without scrolling)
is rendered
● Play Delay
● Meaningful for what we are monitoring
High-dimensional Data
● Complex device categorization
● Geo regions, subregions, countries
● Highly granular network
classifications
● High volume of A/B tests
● Different facets of the same user action
○ Cold, suspended and backgrounded
App startups
○ Target view/page on App startup
Data Sketches
● Data structures that approximately
resemble a much larger data set
● Preserve essential features!
● Significantly smaller!
● Faster to operate on!
t-Digest
● t-Digest data structure
● Rank-based statistics
(such as quantiles)
● Parallel friendly
(can be merged!)
● Very fast!
● Really accurate!
https://github.com/tdunning/t-digest
+ t-Digest sketches
iOS Median Comparison, Break by Country
iOS Median Comparison, Break by Country + iPhone 6S Plus
CDFs by UI Version
Warm Startup Rate
A/B Cell Comparison
Anomaly Detection
Going Forward
● Resource utilization metrics
● Device profiling
○ Instrumenting client code
● Explore other visualizations
○ Frequency heat maps
● Connection between perceived
performance, acquisition and
retention
@spiermar
Netflix
Autoscaling for experts
Vadim
● Mid-tier stateless services are ~2/3rd of the total
● Savings - 30% of mid-tier footprint (roughly 30K instances)
○ Higher savings if we break it down by region
○ Even higher savings on services that scale well
Savings!
Why we autoscale - philosophical reasons
Why we autoscale - pragmatic reasons
● Encoding
● Precompute
● Failover
● Red/black pushes
● Curing cancer**
● And more...
** Hack-day project
Should you autoscale?
Benefits
● On-demand capacity: direct $$ savings
● RI capacity: re-purposing spare capacity
However, for each server group, beware of
● Uneven distribution of traffic
● Sticky traffic
● Bursty traffic
● Small ASG sizes (<10)
Autoscaling impacts availability - true or false?
* If done correctly
Under-provisioning, however, can impact availability
● Autoscaling is not a problem
● The real problem is not knowing performance characteristics of the
service
AWS autoscaling mechanics
CloudWatch alarm ASG scaling policy
Aggregated metric feed
Notification
Tunables
Metric ● Threshold
● # of eval periods
● Scaling amount
● Warmup time
What metric to scale on?
Pros
● Tracks a direct measure of work
● Linear scaling
● Predictable
● Requires less adjustment over time
Cons
● Thresholds tend to drift over time
● Prone to changes in request mixture
● Less predictable
● More oscillation / jitter
Throughput
Resource
utilization
Autoscaling on multiple metrics
Proceed with caution
● Harder to reason about scaling behavior
● Different metrics might contradict each
other, causing oscillation
Typical Netflix configuration:
● Scale-up policy on throughput
● Scale-down policy on throughput
● Emergency scale-up policy on CPU, aka
“the hammer rule”
Well-behaved autoscaling
Common mistakes - “no rush” scaling
Problem: scaling amounts too
small, cooldown too long
Effect: scaling lags behind the
traffic flow. Not enough
capacity at peak, capacity
wasted in trough
Remedy: increase scaling
amounts, migrate to step
policies
Common mistakes - twitchy scaling
Problem: Scale-up policy is
too aggressive
Effect: unnecessary
capacity churn
Remedy: reduce scale-up
amount, increase the # of
eval periods
Common mistakes - should I stay or should I go
Problem: -up and -down
thresholds are too close to each
other
Effect: constant capacity
oscillation
Remedy: move -up and -down
thresholds farther apart
AWS target tracking - your best bet!
● Think of it as a step policy with auto-steps
● You can also think of it as a thermostat
● Accounts for the rate of change in monitored metric
● Pick a metric, set the target value and warmup time - that’s it!
Step Target-tracking
Netflix
PMCs on the Cloud
Brendan
Busy
Waiting
(“idle”)
90% CPU utilization:
Busy
Waiting
(“idle”)
Busy
Waiting
(“idle”)
Waiting
(“stalled”)
Reality:
90% CPU utilization:
# perf stat -a -- sleep 10
Performance counter stats for 'system wide':
80018.188438 task-clock (msec) # 8.000 CPUs utilized (100.00%)
7,562 context-switches # 0.095 K/sec (100.00%)
1,157 cpu-migrations # 0.014 K/sec (100.00%)
109,734 page-faults # 0.001 M/sec
<not supported> cycles
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
<not supported> instructions
<not supported> branches
<not supported> branch-misses
10.001715965 seconds time elapsed
Performance
Monitoring Counters
(PMCs) in most clouds
# perf stat -a -- sleep 10
Performance counter stats for 'system wide':
641320.173626 task-clock (msec) # 64.122 CPUs utilized [100.00%]
1,047,222 context-switches # 0.002 M/sec [100.00%]
83,420 cpu-migrations # 0.130 K/sec [100.00%]
38,905 page-faults # 0.061 K/sec
655,419,788,755 cycles # 1.022 GHz [75.02%]
<not supported> stalled-cycles-frontend
<not supported> stalled-cycles-backend
536,830,399,277 instructions # 0.82 insns per cycle [75.02%]
97,103,651,128 branches # 151.412 M/sec [75.02%]
1,230,478,597 branch-misses # 1.27% of all branches [74.99%]
10.001622154 seconds time elapsed
AWS EC2 m4.16xl
Interpreting IPC & Actionable Items
IPC: Instructions Per Cycle (invert of CPI)
● IPC < 1.0: likely memory stalled
○ Data usage and layout to improve CPU caching, memory locality.
○ Choose larger CPU caches, faster memory busses and interconnects.
● IPC > 1.0: likely instruction bound
○ Reduce code execution, eliminate unnecessary work, cache operations,
improve algorithm order. Can analyze using CPU flame graphs.
○ Faster CPUs.
Event Name Umask Event S. Example Event Mask Mnemonic
UnHalted Core Cycles 00H 3CH CPU_CLK_UNHALTED.THREAD_P
Instruction Retired 00H C0H INST_RETIRED.ANY_P
UnHalted Reference Cycles 01H 3CH CPU_CLK_THREAD_UNHALTED.REF_XCLK
LLC Reference 4FH 2EH LONGEST_LAT_CACHE.REFERENCE
LLC Misses 41H 2EH LONGEST_LAT_CACHE.MISS
Branch Instruction Retired 00H C4H BR_INST_RETIRED.ALL_BRANCHES
Branch Misses Retired 00H C5H BR_MISP_RETIRED.ALL_BRANCHES
Intel Architectural PMCs
Now available in AWS EC2 on full dedicated hosts (eg, m4.16xl, …)
# pmcarch 1
CYCLES INSTRUCTIONS IPC BR_RETIRED BR_MISPRED BMR% LLCREF LLCMISS LLC%
90755342002 64236243785 0.71 11760496978 174052359 1.48 1542464817 360223840 76.65
75815614312 59253317973 0.78 10665897008 158100874 1.48 1361315177 286800304 78.93
65164313496 53307631673 0.82 9538082731 137444723 1.44 1272163733 268851404 78.87
90820303023 70649824946 0.78 12672090735 181324730 1.43 1685112288 343977678 79.59
76341787799 50830491037 0.67 10542795714 143936677 1.37 1204703117 279162683 76.83
[...]
tiptop - [root]
Tasks: 96 total, 3 displayed screen 0: default
PID [ %CPU] %SYS P Mcycle Minstr IPC %MISS %BMIS %BUS COMMAND
3897 35.3 28.5 4 274.06 178.23 0.65 0.06 0.00 0.0 java
1319+ 5.5 2.6 6 87.32 125.55 1.44 0.34 0.26 0.0 nm-applet
900 0.9 0.0 6 25.91 55.55 2.14 0.12 0.21 0.0 dbus-daemo
https://github.com/brendangregg/pmc-cloud-tools
Netflix
Performance Meetup
Netflix
Performance Meetup

Weitere ähnliche Inhalte

Was ist angesagt?

The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaScyllaDB
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF SuperpowersBrendan Gregg
 
New Ways to Find Latency in Linux Using Tracing
New Ways to Find Latency in Linux Using TracingNew Ways to Find Latency in Linux Using Tracing
New Ways to Find Latency in Linux Using TracingScyllaDB
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationshadooparchbook
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and moreBrendan Gregg
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingBrendan Gregg
 
Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsBrendan Gregg
 
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowFrom Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowDatabricks
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Brendan Gregg
 
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization TechniqueSquare Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization TechniqueScyllaDB
 
Infrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusInfrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusMarco Pas
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink Forward
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit
 
Linux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringLinux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringGeorg Schönberger
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf toolsBrendan Gregg
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuSlim Baltagi
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceBrendan Gregg
 

Was ist angesagt? (20)

The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr SarnaDatabase Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
Database Performance at Scale Masterclass: Driver Strategies by Piotr Sarna
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF Superpowers
 
New Ways to Find Latency in Linux Using Tracing
New Ways to Find Latency in Linux Using TracingNew Ways to Find Latency in Linux Using Tracing
New Ways to Find Latency in Linux Using Tracing
 
Top 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applicationsTop 5 mistakes when writing Spark applications
Top 5 mistakes when writing Spark applications
 
BPF: Tracing and more
BPF: Tracing and moreBPF: Tracing and more
BPF: Tracing and more
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor Benchmarking
 
Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production Systems
 
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache AirflowFrom Idea to Model: Productionizing Data Pipelines with Apache Airflow
From Idea to Model: Productionizing Data Pipelines with Apache Airflow
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)
 
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization TechniqueSquare Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
 
Infrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using PrometheusInfrastructure & System Monitoring using Prometheus
Infrastructure & System Monitoring using Prometheus
 
Flink powered stream processing platform at Pinterest
Flink powered stream processing platform at PinterestFlink powered stream processing platform at Pinterest
Flink powered stream processing platform at Pinterest
 
Spark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon WhitearSpark Summit EU talk by Simon Whitear
Spark Summit EU talk by Simon Whitear
 
Linux Performance Profiling and Monitoring
Linux Performance Profiling and MonitoringLinux Performance Profiling and Monitoring
Linux Performance Profiling and Monitoring
 
Velocity 2015 linux perf tools
Velocity 2015 linux perf toolsVelocity 2015 linux perf tools
Velocity 2015 linux perf tools
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
How Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for PerformanceHow Netflix Tunes EC2 Instances for Performance
How Netflix Tunes EC2 Instances for Performance
 

Ähnlich wie Netflix SRE perf meetup_slides

Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...InfluxData
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Data Con LA
 
Gatling - Bordeaux JUG
Gatling - Bordeaux JUGGatling - Bordeaux JUG
Gatling - Bordeaux JUGslandelle
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uberconfluent
 
AWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAnthony Scata
 
Three Perspectives on Measuring Latency
Three Perspectives on Measuring LatencyThree Perspectives on Measuring Latency
Three Perspectives on Measuring LatencyScyllaDB
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...NETWAYS
 
Adventures in Observability - Clickhouse and Instana
Adventures in Observability - Clickhouse and InstanaAdventures in Observability - Clickhouse and Instana
Adventures in Observability - Clickhouse and InstanaMarcel Birkner
 
Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
 Adventures in Observability: How in-house ClickHouse deployment enabled Inst... Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
Adventures in Observability: How in-house ClickHouse deployment enabled Inst...Altinity Ltd
 
Container world 2019 Canary Release
Container world 2019 Canary ReleaseContainer world 2019 Canary Release
Container world 2019 Canary ReleaseBilly Yuen
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeDataWorks Summit
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexApache Apex
 
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)Brian Brazil
 
Ensuring Performance in a Fast-Paced Environment (CMG 2014)
Ensuring Performance in a Fast-Paced Environment (CMG 2014)Ensuring Performance in a Fast-Paced Environment (CMG 2014)
Ensuring Performance in a Fast-Paced Environment (CMG 2014)Martin Spier
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the Worldjhugg
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3LibbySchulze
 
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...Startupfest
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Flink Forward
 

Ähnlich wie Netflix SRE perf meetup_slides (20)

Streamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache PulsarStreamlio and IoT analytics with Apache Pulsar
Streamlio and IoT analytics with Apache Pulsar
 
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
Wayfair Storefront Performance Monitoring with InfluxEnterprise by Richard La...
 
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
Big Data Day LA 2016/ Big Data Track - Portable Stream and Batch Processing w...
 
Gatling - Bordeaux JUG
Gatling - Bordeaux JUGGatling - Bordeaux JUG
Gatling - Bordeaux JUG
 
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ UberKafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
Kafka Summit NYC 2017 - Scalable Real-Time Complex Event Processing @ Uber
 
AWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runnersAWS Techniques and lessons writing low cost autoscaling GitLab runners
AWS Techniques and lessons writing low cost autoscaling GitLab runners
 
Three Perspectives on Measuring Latency
Three Perspectives on Measuring LatencyThree Perspectives on Measuring Latency
Three Perspectives on Measuring Latency
 
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
OSMC 2018 | Learnings, patterns and Uber’s metrics platform M3, open sourced ...
 
Adventures in Observability - Clickhouse and Instana
Adventures in Observability - Clickhouse and InstanaAdventures in Observability - Clickhouse and Instana
Adventures in Observability - Clickhouse and Instana
 
Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
 Adventures in Observability: How in-house ClickHouse deployment enabled Inst... Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
Adventures in Observability: How in-house ClickHouse deployment enabled Inst...
 
Container world 2019 Canary Release
Container world 2019 Canary ReleaseContainer world 2019 Canary Release
Container world 2019 Canary Release
 
Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex Next Gen Big Data Analytics with Apache Apex
Next Gen Big Data Analytics with Apache Apex
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
 
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache ApexHadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
Hadoop Summit SJ 2016: Next Gen Big Data Analytics with Apache Apex
 
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
Provisioning and Capacity Planning Workshop (Dogpatch Labs, September 2015)
 
Ensuring Performance in a Fast-Paced Environment (CMG 2014)
Ensuring Performance in a Fast-Paced Environment (CMG 2014)Ensuring Performance in a Fast-Paced Environment (CMG 2014)
Ensuring Performance in a Fast-Paced Environment (CMG 2014)
 
Building a Database for the End of the World
Building a Database for the End of the WorldBuilding a Database for the End of the World
Building a Database for the End of the World
 
Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3Scaling Monitoring At Databricks From Prometheus to M3
Scaling Monitoring At Databricks From Prometheus to M3
 
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
Jeremy Edberg (MinOps ) - How to build a solid infrastructure for a startup t...
 
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
Keynote: Building and Operating A Serverless Streaming Runtime for Apache Bea...
 

Kürzlich hochgeladen

The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...ranjana rawat
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdfankushspencer015
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...Call Girls in Nagpur High Profile
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxupamatechverse
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxAsutosh Ranjan
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesPrabhanshu Chaturvedi
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...Call Girls in Nagpur High Profile
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxfenichawla
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Dr.Costas Sachpazis
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdfKamal Acharya
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxpranjaldaimarysona
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Christo Ananth
 

Kürzlich hochgeladen (20)

The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...Booking open Available Pune Call Girls Pargaon  6297143586 Call Hot Indian Gi...
Booking open Available Pune Call Girls Pargaon 6297143586 Call Hot Indian Gi...
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 
Introduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptxIntroduction to Multiple Access Protocol.pptx
Introduction to Multiple Access Protocol.pptx
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Coefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptxCoefficient of Thermal Expansion and their Importance.pptx
Coefficient of Thermal Expansion and their Importance.pptx
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(ANJALI) Dange Chowk Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Glass Ceramics: Processing and Properties
Glass Ceramics: Processing and PropertiesGlass Ceramics: Processing and Properties
Glass Ceramics: Processing and Properties
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptxBSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
BSides Seattle 2024 - Stopping Ethan Hunt From Taking Your Data.pptx
 
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
Sheet Pile Wall Design and Construction: A Practical Guide for Civil Engineer...
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service NashikCollege Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
College Call Girls Nashik Nehal 7001305949 Independent Escort Service Nashik
 
Processing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptxProcessing & Properties of Floor and Wall Tiles.pptx
Processing & Properties of Floor and Wall Tiles.pptx
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 

Netflix SRE perf meetup_slides

  • 4. ● Global Internet: ● faster (better networking) ● slower (broader reach, congestion) ● Don't wait for it, measure it and deal ● Working app > Feature rich app Making the Internet fast is slow.
  • 5. We need to know what the Internet looks like, without averages, seeing the full distribution.
  • 6. ● Sampling ○ Missed data ○ Rare events ○ Problems aren’t equal in Population ● Averages ○ Can't see the distribution ○ Outliers heavily distort ∞, 0, negatives, errors Logging Anti-Patterns Instead, use the client as a map-reducer and send up aggregated data, less often.
  • 7. Sizing up the Internet.
  • 9.
  • 10. ● Calculate the inverse empirical cumulative distribution function by math. Get median, 95th, etc. > library(HistogramTools) > iecdf <- HistToEcdf(histogram, method='linear’, inverse=TRUE) > iecdf(0.5) [1] 0.7975309 # median > iecdf(0.95) [1] 4.65 # 95th percentile o ...or just use R which is free and knows how to do it already
  • 11.
  • 12.
  • 14. Better than debating opinions. Architecture is hard. Make it cheap to experiment where your users really are. "There's no way that the client makes that many requests.” "No one really minds the spinner." "Why should we spend time on that instead of COOLFEATURE?" "We live in a 50ms world!"
  • 16. ● Visual → Numerical, need the IECDF for Percentiles ○ ƒ(0.50) = 50th (median) ○ ƒ(0.95) = 95th ● Cluster to get pretty colors similar experiences. (k-means, hierarchical, etc.) Interpret the data
  • 17.
  • 18.
  • 19.
  • 20.
  • 21. ● Go there! ● Abstract analysis - hard ● Feeling reality is much simpler than looking at graphs. Build! Practical Teleportation.
  • 23.
  • 24. Don't guess. Developing a model based on production data, without missing the distribution of samples (network, render, responsiveness) will lead to better software. Global reach doesn't need to be scary. @gcirino42 http://blogofsomeguy.com
  • 26.
  • 27. Problem & Motivation ● Real-user performance monitoring solution ● More insight into the App performance (as perceived by real users) ● Too many variables to trust synthetic tests and labs ● Prioritize work around App performance ● Track App improvement progress over time ● Detect issues, internal and external
  • 28. Device Diversity ● Netflix runs on all sorts of devices ● Smart TVs, Gaming Consoles, Mobile Phones, Cable TV boxes, ... ● Consistently evaluate performance
  • 29.
  • 30. What are we monitoring? ● User Actions (or things users do in the App) ● App Startup ● User Navigation ● Playing a Title ● Internal App metrics
  • 31. What are we measuring? ● When does the timer start and stop? ● Time-to-Interactive (TTI) ○ Interactive, even if some items were not fully loaded and rendered ● Time-to-Render (TTR) ○ Everything above the fold (visible without scrolling) is rendered ● Play Delay ● Meaningful for what we are monitoring
  • 32. High-dimensional Data ● Complex device categorization ● Geo regions, subregions, countries ● Highly granular network classifications ● High volume of A/B tests ● Different facets of the same user action ○ Cold, suspended and backgrounded App startups ○ Target view/page on App startup
  • 33.
  • 34.
  • 35.
  • 36. Data Sketches ● Data structures that approximately resemble a much larger data set ● Preserve essential features! ● Significantly smaller! ● Faster to operate on!
  • 37. t-Digest ● t-Digest data structure ● Rank-based statistics (such as quantiles) ● Parallel friendly (can be merged!) ● Very fast! ● Really accurate! https://github.com/tdunning/t-digest
  • 39.
  • 40. iOS Median Comparison, Break by Country
  • 41. iOS Median Comparison, Break by Country + iPhone 6S Plus
  • 42. CDFs by UI Version
  • 46. Going Forward ● Resource utilization metrics ● Device profiling ○ Instrumenting client code ● Explore other visualizations ○ Frequency heat maps ● Connection between perceived performance, acquisition and retention @spiermar
  • 48. ● Mid-tier stateless services are ~2/3rd of the total ● Savings - 30% of mid-tier footprint (roughly 30K instances) ○ Higher savings if we break it down by region ○ Even higher savings on services that scale well Savings!
  • 49. Why we autoscale - philosophical reasons
  • 50. Why we autoscale - pragmatic reasons ● Encoding ● Precompute ● Failover ● Red/black pushes ● Curing cancer** ● And more... ** Hack-day project
  • 51. Should you autoscale? Benefits ● On-demand capacity: direct $$ savings ● RI capacity: re-purposing spare capacity However, for each server group, beware of ● Uneven distribution of traffic ● Sticky traffic ● Bursty traffic ● Small ASG sizes (<10)
  • 52. Autoscaling impacts availability - true or false? * If done correctly Under-provisioning, however, can impact availability ● Autoscaling is not a problem ● The real problem is not knowing performance characteristics of the service
  • 53. AWS autoscaling mechanics CloudWatch alarm ASG scaling policy Aggregated metric feed Notification Tunables Metric ● Threshold ● # of eval periods ● Scaling amount ● Warmup time
  • 54. What metric to scale on? Pros ● Tracks a direct measure of work ● Linear scaling ● Predictable ● Requires less adjustment over time Cons ● Thresholds tend to drift over time ● Prone to changes in request mixture ● Less predictable ● More oscillation / jitter Throughput Resource utilization
  • 55. Autoscaling on multiple metrics Proceed with caution ● Harder to reason about scaling behavior ● Different metrics might contradict each other, causing oscillation Typical Netflix configuration: ● Scale-up policy on throughput ● Scale-down policy on throughput ● Emergency scale-up policy on CPU, aka “the hammer rule”
  • 57. Common mistakes - “no rush” scaling Problem: scaling amounts too small, cooldown too long Effect: scaling lags behind the traffic flow. Not enough capacity at peak, capacity wasted in trough Remedy: increase scaling amounts, migrate to step policies
  • 58. Common mistakes - twitchy scaling Problem: Scale-up policy is too aggressive Effect: unnecessary capacity churn Remedy: reduce scale-up amount, increase the # of eval periods
  • 59. Common mistakes - should I stay or should I go Problem: -up and -down thresholds are too close to each other Effect: constant capacity oscillation Remedy: move -up and -down thresholds farther apart
  • 60. AWS target tracking - your best bet! ● Think of it as a step policy with auto-steps ● You can also think of it as a thermostat ● Accounts for the rate of change in monitored metric ● Pick a metric, set the target value and warmup time - that’s it! Step Target-tracking
  • 61. Netflix PMCs on the Cloud Brendan
  • 64. # perf stat -a -- sleep 10 Performance counter stats for 'system wide': 80018.188438 task-clock (msec) # 8.000 CPUs utilized (100.00%) 7,562 context-switches # 0.095 K/sec (100.00%) 1,157 cpu-migrations # 0.014 K/sec (100.00%) 109,734 page-faults # 0.001 M/sec <not supported> cycles <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend <not supported> instructions <not supported> branches <not supported> branch-misses 10.001715965 seconds time elapsed Performance Monitoring Counters (PMCs) in most clouds
  • 65. # perf stat -a -- sleep 10 Performance counter stats for 'system wide': 641320.173626 task-clock (msec) # 64.122 CPUs utilized [100.00%] 1,047,222 context-switches # 0.002 M/sec [100.00%] 83,420 cpu-migrations # 0.130 K/sec [100.00%] 38,905 page-faults # 0.061 K/sec 655,419,788,755 cycles # 1.022 GHz [75.02%] <not supported> stalled-cycles-frontend <not supported> stalled-cycles-backend 536,830,399,277 instructions # 0.82 insns per cycle [75.02%] 97,103,651,128 branches # 151.412 M/sec [75.02%] 1,230,478,597 branch-misses # 1.27% of all branches [74.99%] 10.001622154 seconds time elapsed AWS EC2 m4.16xl
  • 66. Interpreting IPC & Actionable Items IPC: Instructions Per Cycle (invert of CPI) ● IPC < 1.0: likely memory stalled ○ Data usage and layout to improve CPU caching, memory locality. ○ Choose larger CPU caches, faster memory busses and interconnects. ● IPC > 1.0: likely instruction bound ○ Reduce code execution, eliminate unnecessary work, cache operations, improve algorithm order. Can analyze using CPU flame graphs. ○ Faster CPUs.
  • 67. Event Name Umask Event S. Example Event Mask Mnemonic UnHalted Core Cycles 00H 3CH CPU_CLK_UNHALTED.THREAD_P Instruction Retired 00H C0H INST_RETIRED.ANY_P UnHalted Reference Cycles 01H 3CH CPU_CLK_THREAD_UNHALTED.REF_XCLK LLC Reference 4FH 2EH LONGEST_LAT_CACHE.REFERENCE LLC Misses 41H 2EH LONGEST_LAT_CACHE.MISS Branch Instruction Retired 00H C4H BR_INST_RETIRED.ALL_BRANCHES Branch Misses Retired 00H C5H BR_MISP_RETIRED.ALL_BRANCHES Intel Architectural PMCs Now available in AWS EC2 on full dedicated hosts (eg, m4.16xl, …)
  • 68. # pmcarch 1 CYCLES INSTRUCTIONS IPC BR_RETIRED BR_MISPRED BMR% LLCREF LLCMISS LLC% 90755342002 64236243785 0.71 11760496978 174052359 1.48 1542464817 360223840 76.65 75815614312 59253317973 0.78 10665897008 158100874 1.48 1361315177 286800304 78.93 65164313496 53307631673 0.82 9538082731 137444723 1.44 1272163733 268851404 78.87 90820303023 70649824946 0.78 12672090735 181324730 1.43 1685112288 343977678 79.59 76341787799 50830491037 0.67 10542795714 143936677 1.37 1204703117 279162683 76.83 [...] tiptop - [root] Tasks: 96 total, 3 displayed screen 0: default PID [ %CPU] %SYS P Mcycle Minstr IPC %MISS %BMIS %BUS COMMAND 3897 35.3 28.5 4 274.06 178.23 0.65 0.06 0.00 0.0 java 1319+ 5.5 2.6 6 87.32 125.55 1.44 0.34 0.26 0.0 nm-applet 900 0.9 0.0 6 25.91 55.55 2.14 0.12 0.21 0.0 dbus-daemo https://github.com/brendangregg/pmc-cloud-tools