From Clouds to Roots
Brendan Gregg
Senior Performance Architect
Performance Engineering Team
bgregg@netflix.com, @brendangregg
September, 2014
Root Cause Analysis at Netflix
(Diagram, from clouds to roots: Devices → Zuul → ELB → Netflix Application: SG → ASG Cluster → ASG 1 / ASG 2 → AZ 1 / AZ 2 / AZ 3 → Instances (Linux) → JVM → Tomcat → Service → Hystrix / Ribbon → Load …
Tools at each level: Atlas, Chronos, Mogul, Vector, sar, *stat, stap, ftrace, rdmsr, …)
• Massive AWS EC2 Linux cloud
– Tens of thousands of server instances
– Autoscale by ~3k each day
– CentOS and Ubuntu
• FreeBSD for content delivery
– Approx 33% of US Internet traffic at night
• Performance is critical
– Customer satisfaction: >50M subscribers
– $$$ price/performance
– Develop tools for cloud-wide analysis
Brendan Gregg
• Senior Performance Architect, Netflix
– Linux and FreeBSD performance
– Performance Engineering team (@coburnw)
• Recent work:
– Linux perf-tools, using ftrace & perf_events
– Systems Performance, Prentice Hall
• Previous work includes:
– USE Method, flame graphs, latency & utilization heat maps, DTraceToolkit, iosnoop and others on OS X, ZFS L2ARC
• Twitter: @brendangregg
Last year at Surge…
• I saw a great Netflix talk by Coburn Watson:
• https://www.youtube.com/watch?v=7-13wV3WO8Q
• He's now my manager (and also still hiring!)
Agenda
• The Netflix Cloud
– How it works: ASG clusters, Hystrix, monkeys
– And how it may fail
• Root Cause Performance Analysis
– Why it's still needed
• Cloud analysis
• Instance analysis
Terms
• AWS: Amazon Web Services
• EC2: AWS Elastic Compute Cloud (cloud instances)
• S3: AWS Simple Storage Service (object store)
• ELB: AWS Elastic Load Balancers
• SQS: AWS Simple Queue Service
• SES: AWS Simple Email Service
• CDN: Content Delivery Network
• OCA: Netflix Open Connect Appliance (streaming CDN)
• QoS: Quality of Service
• AMI: Amazon Machine Image (instance image)
• ASG: Auto Scaling Group
• AZ: Availability Zone
• NIWS: Netflix Internal Web Service framework (Ribbon)
• MSR: Model Specific Register (CPU info register)
• PMC: Performance Monitoring Counter (CPU perf counter)
The Netflix Cloud
The Netflix Cloud
• Tens of thousands of cloud instances on AWS EC2, with S3 and Cassandra for storage
• Netflix is implemented by multiple logical services
(Diagram: ELB → EC2 Applications (Services) → EVCache, Cassandra, Elasticsearch, S3, SES, SQS)
Netflix Services
• Open Connect Appliances used for content delivery
(Diagram: Devices → Web Site, API, Streaming API → Authentication, User Data, Personalization, Viewing Hist., Client, DRM, QoS Logging, CDN Steering, Encoding → OCA CDN …)
Freedom and Responsibility
• Culture deck is true
– http://www.slideshare.net/reed2001/culture-1798664 (9M views!)
• Deployment freedom
– Service teams choose their own tech & schedules
– Purchase and use cloud instances without approvals
– Netflix environment changes fast!
Cloud Technologies
• Numerous open source technologies are in use:
– Linux, Java, Cassandra, Node.js, …
• Netflix also open sources: netflix.github.io
Cloud Instances
• Base server instance image + customizations by service teams (BaseAMI). Typically:
– Linux (CentOS or Ubuntu)
– Java (JDK 7 or 8): GC and Tomcat thread dump logging
– Application war files, base servlet, platform, hystrix, health check, metrics (Servo)
– Optional Apache, memcached, non-Java apps (incl. Node.js)
– Atlas monitoring, S3 log rotation, ftrace, perf, stap, custom perf tools
Scalability and Reliability
# / Problem / Solution
1. Load increases → Auto scale with ASGs
2. Poor performing code push → Rapid rollback with red/black ASG clusters
3. Instance failure → Hystrix timeouts and secondaries
4. Zone/Region failure → Zuul to reroute traffic
5. Overlooked and unhandled issues → Simian army
6. Poor performance → Atlas metrics, alerts, Chronos
1. Auto Scaling Groups
• Instances automatically added or removed by a custom scaling policy
– A broken policy could cause false scaling
• Alerts & audits used to check scaling is sane (see the sketch below)
(Diagram: Cloud Configuration Management → ASG with Instances; Scaling Policy driven by CloudWatch/Servo metrics: loadavg, latency, …)
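As a hedged illustration (not from the original deck), a group's scaling policies and the CloudWatch alarms that trigger them can be inspected from the AWS CLI; the group name myapp-v010 is a hypothetical placeholder:

$ aws autoscaling describe-auto-scaling-groups --auto-scaling-group-names myapp-v010   # current/min/max size
$ aws autoscaling describe-policies --auto-scaling-group-name myapp-v010               # attached scaling policies
$ aws cloudwatch describe-alarms --alarm-name-prefix myapp-v010                        # alarms that drive them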
2. ASG Clusters
• How code versions are really deployed
• Traffic managed by Elastic Load Balancers (ELBs)
• Fast rollback if issues are found
– Might rollback undiagnosed issues
• Canaries can also be used for testing (and automated)
(Diagram: ELB → ASG Cluster prod1: ASG-v010 (Instances, Canary) and ASG-v011 (Instances))
3. Hystrix
• A library for latency and fault tolerance for dependency services
– Fallbacks, degradation, fast fail and rapid recovery
– Supports timeouts, load shedding, circuit breaker
– Uses thread pools for dependency services
– Realtime monitoring
• Plus the Ribbon IPC library (NIWS), which adds even more fault tolerance
(Diagram: Tomcat Application → get A → Hystrix → Dependency A1; >100ms timeout → Dependency A2)
4. Redundancy
• All device traffic goes through the Zuul proxy:
– dynamic routing, monitoring, resiliency, security
• Availability Zone failure: run from 2 of 3 zones
• Region failure: reroute traffic
(Diagram: Zuul + Monitoring → AZ1, AZ2, AZ3)
5. Simian Army
• Ensures cloud handles failures through regular testing
• Monkeys:
– Latency: artificial delays
– Conformity: kills non-best-practices instances
– Doctor: health checks
– Janitor: unused instances
– Security: checks violations
– 10-18: geographic issues
– Chaos Gorilla: AZ failure
• We're hiring Chaos Engineers!
6. Atlas, alerts, Chronos
• Atlas: Cloud-wide monitoring tool
– Millions of metrics, quick rollups, custom dashboards
• Alerts: Custom, using Atlas metrics
– In particular, error & timeout rates on client devices
• Chronos: Change tracking
– Used during incident investigations
In Summary
# / Problem / Solution
1. Load increases → ASGs
2. Poor performing code push → ASG clusters
3. Instance issue → Hystrix
4. Zone/Region issue → Zuul
5. Overlooked and unhandled issues → Monkeys
6. Poor performance → Atlas, alerts, Chronos
• Netflix is very good at automatically handling failure
– Issues often lead to rapid instance growth (ASGs)
• Good for customers
– Fast workaround
• Good for engineers
– Fix later, 9-5
Typical Netflix Stack
(Diagram, with the problems/solutions 1.–6. enumerated at each layer: Devices → Zuul → ELB → Netflix Application: SG → ASG Cluster → ASG 1 / ASG 2 → AZ 1 / AZ 2 / AZ 3 → Instances (Linux) → JVM → Tomcat → Service → Hystrix / Ribbon → Load …; Monkeys; Dependencies, Atlas (monitoring), Discovery, …)
* Exceptions
• Apache Web Server
• Node.js
• …
Root Cause Performance Analysis
Root Cause Performance Analysis
• Conducted when:
– Growth becomes a cost problem
– More instances or roll backs don't work
  • Eg: dependency issue, networking, …
– A fix is needed for forward progress
  • "But it's faster on Linux 2.6.21 m2.xlarge!"
  • Staying on older versions for an undiagnosed (and fixable) reason prevents gains from later improvements
– To understand scalability factors
• Identifies the origin of poor performance
Root Cause Analysis Process
• From cloud to instance:
(Diagram: Netflix Application → SG → ASG Cluster → ELB → ASG 1 / ASG 2 → AZ 1 / AZ 2 / AZ 3 → Instances (Linux) → JVM → Tomcat → Service → …)
Cloud Methodologies
• Resource Analysis
– Any resources exhausted? CPU, disk, network
• Metric and event correlations
– When things got bad, what else happened?
– Correlate with distributed dependencies
• Latency Drilldowns
– Trace origin of high latency from request down through dependencies
• USE Method
– For every service, check: utilization, saturation, errors
Instance Methodologies
• Log Analysis
– dmesg, GC, Apache, Tomcat, custom
• USE Method
– For every resource, check: utilization, saturation, errors (see the command sketch below)
• Micro-benchmarking
– Test and measure components in isolation
• Drill-down analysis
– Decompose request latency, repeat
• And other system performance methodologies
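As a rough sketch of the USE Method on a single Linux instance (one possible mapping of standard tools to utilization, saturation, and errors; not the only one):

$ uptime                  # CPU saturation: load averages
$ mpstat -P ALL 1         # CPU utilization, per CPU
$ vmstat 1                # memory utilization (free), run-queue saturation (r column)
$ iostat -xz 1            # disk utilization (%util) and saturation (await)
$ sar -n DEV 1            # network interface utilization vs line rate
$ sar -n TCP,ETCP 1       # TCP saturation/errors (retransmits)
$ dmesg | tail            # recent errors: OOM kills, device and network errors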
Bad Instances
• Not all issues root caused
– "bad instance" != root cause
• Sometimes efficient to just kill "bad instances"
– They could be a lone hardware issue, which could take days for you to analyze
• But they could also be an early warning of a global issue. If you kill them, you don't know.
(Diagram: instances, one marked as the bad instance)
Bad Instance Anti-Method
1. Plot request latency per-instance
2. Find the bad instance
3. Terminate bad instance
4. Someone else's problem now!
(Screenshot: 95th percentile latency (Atlas Exploder); bad instance → Terminate!)
Cloud Analysis
Cloud Analysis
• Cloud analysis tools made and used at Netflix include:
Tool / Purpose
– Atlas: Metrics, dashboards, alerts
– Chronos: Change tracking
– Mogul: Metric correlation
– Salp: Dependency graphing
– ICE: Cloud usage dashboard
• Monitor everything: you can't tune what you can't see
Netflix Cloud Analysis Process
(Flow diagram, example path enumerated:
1. Check Issue: Atlas Alerts → Atlas Dashboards (or Create New Alert; Cost → ICE)
2. Check Events: Chronos
3. Drill Down: Atlas Metrics (may be redirected to a new target)
4. Check Dependencies: Mogul, Salp
5. Root Cause: Instance Analysis)
Atlas: Alerts
• Custom alerts based on the Atlas metrics
– CPU usage, latency, instance count growth, …
• Usually email or pager
– Can also deactivate instances, terminate, reboot
• Next step: check the dashboards
Atlas: Dashboards
(Screenshot; annotations: Custom Graphs, Set Time, Breakdowns, Interactive, Click Graphs for More Metrics)
Atlas: Dashboards
• Cloud wide and per-service (all custom)
• Starting point for issue investigations
1. Confirm and quantify issue
2. Check historic trend
3. Launch Atlas metrics view to drill down
(Screenshot: cloud wide streams per second (SPS) dashboard)
Atlas: Metrics
(Screenshot; annotations: Region, App, Metrics, Breakdowns, Options, Interactive Graph, Summary Statistics)
Atlas: Metrics
• All metrics in one system
• System metrics:
– CPU usage, disk I/O, memory, …
• Application metrics:
– latency percentiles, errors, …
• Filters or breakdowns by region, application, ASG, metric, instance, …
– Quickly narrow an investigation
• URL contains session state: sharable
Chronos: Change Tracking
(Screenshot; annotations: Criticality, App, Breakdown, Legend, Historic Event List)
Chronos: Change Tracking
• Quickly filter uninteresting events
• Performance issues often coincide with changes
• The size and velocity of Netflix engineering makes Chronos crucial for communicating change
Mogul: Correlations
• Comparing performance with per-resource demand
(Screenshots; annotations: App Latency, Throughput, Resource Demand, Correlation)
Mogul: Correlations
• Measures demand using Little's Law (worked example below)
– D = R * X
  D = Demand (in seconds per second)
  R = Average Response Time
  X = Throughput
• Discover unexpected problem dependencies
– That aren't on the service dashboards
• Mogul checks many other correlations
– Weeds through thousands of application metrics, showing you the most related/interesting ones
– (Scott/Martin should give a talk just on these)
• Bearing in mind correlation is not causation
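A quick worked example of D = R * X (numbers invented for illustration): a dependency averaging 10 ms per request at 2,000 requests/sec has D = 0.010 * 2000 = 20 seconds of busy time per second, i.e. roughly 20 threads' worth of concurrent demand.

$ awk 'BEGIN { R = 0.010; X = 2000; printf "D = %.1f seconds per second\n", R * X }'
D = 20.0 seconds per second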
Salp: Dependency Graphing
• Dependency graphs based on live trace data
• Interactive
• See architectural issues
(Screenshots; annotations: App, Dependencies, Their Dependencies, …)
ICE: AWS Usage
(Screenshot; annotations: Services, Cost per hour)
ICE: AWS Usage
• Cost per hour by AWS service, and Netflix application (service team)
– Identify issues of slow growth
• Directs engineering effort to reduce cost
Netflix Cloud Analysis Process
• In summary… (example path enumerated; plus some other tools not pictured)
(Flow diagram:
1. Check Issue: Atlas Alerts → Atlas Dashboards (or Create New Alert; Cost → ICE)
2. Check Events: Chronos
3. Drill Down: Atlas Metrics (may be redirected to a new target)
4. Check Dependencies: Mogul, Salp
5. Root Cause: Instance Analysis)
Generic Cloud Analysis Process
(Flow diagram, example path enumerated:
1. Check Issue: Alerts, Usage Reports (or Create New Alert; Cost)
2. Check Events: Change Tracking
3. Drill Down: Custom Dashboards → Metric Analysis (may be redirected to a new target)
4. Check Dependencies: Dependency Analysis
5. Root Cause: Instance Analysis)
Instance Analysis

Instance Analysis
• Locate, quantify, and fix performance issues anywhere in the system
Instance Tools
• Linux
– top, ps, pidstat, vmstat, iostat, mpstat, netstat, nicstat, sar, strace, tcpdump, ss, …
• System Tracing
– ftrace, perf_events, SystemTap
• CPU Performance Counters
– perf_events, rdmsr
• Application Profiling
– application logs, perf_events, Google Lightweight Java Profiler (LJP), Java Flight Recorder (JFR)
Tools in an AWS EC2 Linux Instance
(Diagram)
Linux Performance Analysis
• vmstat, pidstat, sar, etc., used mostly normally

$ sar -n TCP,ETCP,DEV 1
Linux 3.2.55 (test-e4f1a80b)    08/18/2014      _x86_64_        (8 CPU)

09:10:43 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
09:10:44 PM        lo     14.00     14.00      1.34      1.34      0.00      0.00      0.00
09:10:44 PM      eth0   4114.00   4186.00   4537.46  28513.24      0.00      0.00      0.00

09:10:43 PM  active/s passive/s    iseg/s    oseg/s
09:10:44 PM     21.00      4.00   4107.00  22511.00

09:10:43 PM  atmptf/s  estres/s retrans/s isegerr/s   orsts/s
09:10:44 PM      0.00      0.00     36.00      0.00      1.00
[…]

• Micro-benchmarking can be used to investigate hypervisor behavior that can't be observed directly (see the sketch below)
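One hedged example of such micro-benchmarking (not from the original deck): timing syscall-heavy and throughput-heavy dd runs, and checking the guest clock source, can expose hypervisor overheads that regular metrics won't show.

$ time dd if=/dev/zero of=/dev/null bs=1 count=1000k     # syscall-rate micro-benchmark
$ time dd if=/dev/zero of=/dev/null bs=1M count=10k      # memory/CPU throughput micro-benchmark
$ cat /sys/devices/system/clocksource/clocksource0/current_clocksource   # eg, xen vs tsc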
Instance Challenges
• Application Profiling
– For Java, Node.js
• System Tracing
– On Linux
• Accessing CPU Performance Counters
– From cloud guests
Application Profiling
• We've found many tools are inaccurate or broken
– Eg, those based on java hprof
• Stack profiling can be problematic:
– Linux perf_events: frame pointer for the JVM is often missing (by hotspot), breaking stacks. Also needs perf-map-agent loaded for symbol translation.
– DTrace: jstack() also broken by missing FPs
  https://bugs.openjdk.java.net/browse/JDK-6276264, 2005
• Flame graphs are solving many performance issues. These need working stacks.
Application Profiling: Java
• Java Flight Recorder
– CPU & memory profiling. Oracle. $$$
• Google Lightweight Java Profiler
– Basic, open source, free, asynchronous CPU profiler
– Uses an agent that dumps hprof-like output (usage sketch below)
• https://code.google.com/p/lightweight-java-profiler/wiki/GettingStarted
• http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
• Plus others at various times (YourKit, …)
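A rough sketch of running the LJP agent and turning its output into a flame graph (the agent path is a placeholder; traces.txt and stackcollapse-ljp.awk are as described in the links above):

$ java -agentpath:/path/to/liblagent.so -jar myapp.jar    # profile output written to traces.txt
$ git clone https://github.com/brendangregg/FlameGraph
$ FlameGraph/stackcollapse-ljp.awk < traces.txt | FlameGraph/flamegraph.pl > ljp.svg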
LJP CPU Flame Graph (Java)
(Screenshot; annotations: Stack frame; Mouse-over frames to quantify; Ancestry)
Linux System Profiling
• Previous profilers only show Java CPU time
• We use perf_events (aka the "perf" command) to sample everything else:
– JVM internals & libraries
– The Linux kernel
– Other apps, incl. Node.js
• perf CPU flame graphs:

# git clone https://github.com/brendangregg/FlameGraph
# cd FlameGraph
# perf record -F 99 -ag -- sleep 60
# perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg
perf CPU Flame Graph
(Screenshot; annotations: Broken Java stacks (missing frame pointer); Kernel TCP/IP; GC; Idle thread; Time; Locks; epoll)
Application Profiling: Node.js
• Performance analysis on Linux is a growing area
– Eg, new postmortem tools from 2 weeks ago: https://github.com/tjfontaine/lldb-v8
• Flame graphs are possible using Linux perf_events (perf) and v8 --perf_basic_prof (node v0.11.13+); see the sketch below
– Although there is currently a map growth bug; see: http://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html
• Also do heap analysis
– node-heapdump
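A hedged sketch of that workflow (app.js is a placeholder; --perf_basic_prof makes v8 write a /tmp/perf-PID.map symbol file that perf can use):

$ node --perf_basic_prof app.js &
$ perf record -F 99 -p $(pgrep -n node) -g -- sleep 30
$ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > node.svg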
Flame Graphs
• CPU sample flame graphs solve many issues
– We're automating their collection
– If you aren't using them yet, you're missing out on low hanging fruit!
• Other flame graph types useful as well
– Disk I/O, network I/O, memory events, etc
– Any profile that includes more stacks than can be quickly read
Linux Tracing
• ... now for something more challenging
Linux Tracing
• Too many choices, and many are still in-development:
– ftrace
– perf_events
– eBPF
– SystemTap
– ktap
– LTTng
– dtrace4linux
– sysdig
Linux Tracing
• A system tracer is needed to root cause many issues: kernel, library, app
– (There's a pretty good book covering use cases)
• DTrace is awesome, but the Linux ports are incomplete
• Linux does have ftrace and perf_events in the kernel source, which, it turns out, can satisfy many needs already!
Linux Tracing: ftrace
• Added by Steven Rostedt and others since 2.6.27
• Already enabled on our servers (3.2+)
– CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, …
– Use directly via /sys/kernel/debug/tracing (see the sketch below)
• Front-end tools to aid usage: perf-tools
– https://github.com/brendangregg/perf-tools
– Unsupported hacks: see WARNINGs
– Also see the trace-cmd front-end, as well as perf
• lwn.net: "Ftrace: The Hidden Light Switch"
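For reference, a minimal sketch of driving ftrace directly through its tracefs files (function_graph and do_sys_open are just example choices):

# cd /sys/kernel/debug/tracing
# echo do_sys_open > set_graph_function    # limit the graph to one kernel function
# echo function_graph > current_tracer     # enable the function graph tracer
# cat trace_pipe                           # stream output; Ctrl-C to stop
# echo nop > current_tracer                # disable tracing when done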
perf-tools: iosnoop
• Block I/O (disk) events with latency:

# ./iosnoop -ts
Tracing block I/O. Ctrl-C to end.
STARTs          ENDs            COMM         PID    TYPE DEV      BLOCK      BYTES   LATms
5982800.302061  5982800.302679  supervise    1809   W    202,1    17039600   4096     0.62
5982800.302423  5982800.302842  supervise    1809   W    202,1    17039608   4096     0.42
5982800.304962  5982800.305446  supervise    1801   W    202,1    17039616   4096     0.48
5982800.305250  5982800.305676  supervise    1801   W    202,1    17039624   4096     0.43
[…]

# ./iosnoop -h
USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]
   -d device       # device string (eg, "202,1")
   -i iotype       # match type (eg, '*R*' for all reads)
   -n name         # process name to match on I/O issue
   -p PID          # PID to match on I/O issue
   -Q              # include queueing time in LATms
   -s               # include start time of I/O (s)
   -t               # include completion time of I/O (s)
   -h               # this usage message
   duration         # duration seconds, and use buffers
[…]
perf-tools: iolatency
• Block I/O (disk) latency distributions:

# ./iolatency
Tracing block I/O. Output every 1 seconds. Ctrl-C to end.

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 2104     |######################################|
       1 -> 2       : 280      |######                                |
       2 -> 4       : 2        |#                                     |
       4 -> 8       : 0        |                                      |
       8 -> 16      : 202      |####                                  |

  >=(ms) .. <(ms)   : I/O      |Distribution                          |
       0 -> 1       : 1144     |######################################|
       1 -> 2       : 267      |#########                             |
       2 -> 4       : 10       |#                                     |
       4 -> 8       : 5        |#                                     |
       8 -> 16      : 248      |#########                             |
      16 -> 32      : 601      |####################                  |
      32 -> 64      : 117      |####                                  |
[…]
perf-tools: opensnoop
• Trace open() syscalls showing filenames:

# ./opensnoop -t
Tracing open()s. Ctrl-C to end.
TIMEs           COMM         PID    FD   FILE
4345768.332626  postgres     23886  0x8  /proc/self/oom_adj
4345768.333923  postgres     23886  0x5  global/pg_filenode.map
4345768.333971  postgres     23886  0x5  global/pg_internal.init
4345768.334813  postgres     23886  0x5  base/16384/PG_VERSION
4345768.334877  postgres     23886  0x5  base/16384/pg_filenode.map
4345768.334891  postgres     23886  0x5  base/16384/pg_internal.init
4345768.335821  postgres     23886  0x5  base/16384/11725
4345768.347911  svstat       24649  0x4  supervise/ok
4345768.347921  svstat       24649  0x4  supervise/status
4345768.350340  stat         24651  0x3  /etc/ld.so.cache
4345768.350372  stat         24651  0x3  /lib/x86_64-linux-gnu/libselinux…
4345768.350460  stat         24651  0x3  /lib/x86_64-linux-gnu/libc.so.6
4345768.350526  stat         24651  0x3  /lib/x86_64-linux-gnu/libdl.so.2
4345768.350981  stat         24651  0x3  /proc/filesystems
4345768.351182  stat         24651  0x3  /etc/nsswitch.conf
[…]
perf-tools: funcgraph
• Trace a graph of kernel code flow:

# ./funcgraph -Htp 5363 vfs_read
Tracing "vfs_read" for PID 5363... Ctrl-C to end.
# tracer: function_graph
#
#     TIME        CPU  DURATION                  FUNCTION CALLS
#      |          |     |   |                     |   |   |   |
 4346366.073832 |  0)               |  vfs_read() {
 4346366.073834 |  0)               |    rw_verify_area() {
 4346366.073834 |  0)               |      security_file_permission() {
 4346366.073834 |  0)               |        apparmor_file_permission() {
 4346366.073835 |  0)   0.153 us    |          common_file_perm();
 4346366.073836 |  0)   0.947 us    |        }
 4346366.073836 |  0)   0.066 us    |        __fsnotify_parent();
 4346366.073836 |  0)   0.080 us    |        fsnotify();
 4346366.073837 |  0)   2.174 us    |      }
 4346366.073837 |  0)   2.656 us    |    }
 4346366.073837 |  0)               |    tty_read() {
 4346366.073837 |  0)   0.060 us    |      tty_paranoia_check();
[…]
perf-tools: kprobe
• Dynamically trace a kernel function call or return, with variables, and in-kernel filtering:

# ./kprobe 'p:open do_sys_open filename=+0(%si):string' 'filename ~ "*stat"'
Tracing kprobe myopen. Ctrl-C to end.
 postgres-1172  [000] d... 6594028.787166: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
 postgres-1172  [001] d... 6594028.797410: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
 postgres-1172  [001] d... 6594028.797467: open: (do_sys_open+0x0/0x220) filename="pg_stat_tmp/pgstat.stat"
^C
Ending tracing...

• Add -s for stack traces; -p for PID filter in-kernel.
• Quickly confirm kernel behavior; eg: did a tunable take effect?
perf-tools (so far…)
(Diagram: tool coverage)
Heat Maps
• ftrace or perf_events for tracing disk I/O and other latencies as a heat map (see the sketch below):
(Screenshot: latency heat map)
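One hedged way to build such a heat map offline is to capture per-I/O latencies with iosnoop and feed time/latency pairs to trace2heatmap.pl from the HeatMap repo; treat the awk columns and default options as assumptions to verify:

$ ./iosnoop -ts 60 > out.iosnoop                          # 60s of block I/O with timestamps & latency
$ git clone https://github.com/brendangregg/HeatMap
$ awk 'NR > 1 { print $1, $NF }' out.iosnoop | HeatMap/trace2heatmap.pl > iolatency.svg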
Other Tracing Options
• SystemTap
– The most powerful of the system tracers
– We'll use it as a last resort: deep custom tracing
– I've historically had issues with panics and freezes
  • Still present in the latest version?
  • The Netflix fault tolerant architecture makes panics much less of a problem (that was the panic monkey)
• Instance canaries with DTrace are possible too
– OmniOS
– FreeBSD
Linux Tracing Future
• ftrace + perf_events cover much, but not custom in-kernel aggregations
• eBPF may provide this missing feature
– eg, in-kernel latency heat map (showing bimodal):
(Screenshots; annotations: Low latency cache hits; High latency device I/O; Time)
CPU Performance Counters
• … is this even possible from a cloud guest?
CPU Performance Counters
• Model Specific Registers (MSRs)
– Basic details: timestamp clock, temperature, power
– Some are available in EC2
• Performance Monitoring Counters (PMCs)
– Advanced details: cycles, stall cycles, cache misses, …
– Not available in EC2 (by default)
• Root cause CPU usage at the cycle level (see the sketch below)
– Eg, higher CPU usage due to more memory stall cycles
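Where PMCs are exposed (eg, bare metal; not default EC2), the generic perf events give a quick first look at stall cycles; a minimal sketch:

$ perf stat -e cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend -a sleep 10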
msr-cloud-tools
• Uses the msr-tools package and rdmsr(1)
– https://github.com/brendangregg/msr-cloud-tools

ec2-guest# ./cputemp 1
CPU1 CPU2 CPU3 CPU4      <- CPU Temperature
  61   61   60   59
  60   61   60   60
[...]

ec2-guest# ./showboost
CPU MHz     : 2500
Turbo MHz   : 2900 (10 active)
Turbo Ratio : 116% (10 active)
CPU 0 summary every 5 seconds...

TIME       C0_MCYC      C0_ACYC       UTIL  RATIO  MHz    <- Real CPU MHz
06:11:35   6428553166   7457384521    51%   116%   2900
06:11:40   6349881107   7365764152    50%   115%   2899
06:11:45   6240610655   7239046277    49%   115%   2899
06:11:50   6225704733   7221962116    49%   116%   2900
[...]
MSRs: CPU Temperature
• Useful to explain variation in turbo boost (if seen)
• Temperature for a synthetic workload:
(Graph)
MSRs: Intel Turbo Boost
• Can dynamically increase CPU speed up to 30+%
• This can mess up all performance comparisons
• Clock speed can be observed from MSRs using:
– IA32_MPERF: Bits 63:0 is TSC Frequency Clock Counter C0_MCNT (TSC relative)
– IA32_APERF: Bits 63:0 is TSC Frequency Clock Counter C0_ACNT (actual clocks)
• This is how msr-cloud-tools showboost works (raw rdmsr sketch below)
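For reference, a minimal sketch of reading those two counters directly with rdmsr(1) from msr-tools (0xE7/0xE8 are the standard IA32_MPERF/IA32_APERF addresses; the msr kernel module must be loaded):

# modprobe msr          # expose /dev/cpu/*/msr
# rdmsr -p 0 0xe7       # IA32_MPERF on CPU 0: reference (TSC-rate) cycles in C0
# rdmsr -p 0 0xe8       # IA32_APERF on CPU 0: actual cycles in C0
# sample both twice over an interval: delta(APERF)/delta(MPERF) ~= the turbo ratio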
PMCs
PMCs
• Needed for remaining low-level CPU analysis:
– CPU stall cycles, and stall cycle breakdowns
– L1, L2, L3 cache hit/miss ratio
– Memory, CPU Interconnect, and bus I/O
• Not enabled by default in EC2. Is possible, eg:

# perf stat -e cycles,instructions,r0480,r01A2 -p `pgrep -n java` sleep 10

 Performance counter stats for process id '17190':

    71,208,028,133 cycles        #  0.000 GHz               [100.00%]
    41,603,452,060 instructions  #  0.58 insns per cycle    [100.00%]
    23,489,032,742 r0480                                    [100.00%]   <- ICACHE.IFETCH_STALL
    20,241,290,520 r01A2                                                <- RESOURCE_STALLS.ANY

      10.000894718 seconds time elapsed
Using Advanced Perf Tools
• Everyone doesn't need to learn these
• Reality:
– A. Your company has one or more people for advanced perf analysis (perf team). Ask them.
– B. You are that person
– C. You buy a product that does it. Ask them.
• If you aren't the advanced perf engineer, you need to know what to ask for
– Flame graphs, latency heat maps, ftrace, PMCs, etc…
• At Netflix, we're building the (C) option: Vector
Future Work: Vector
(Screenshot; annotations: Utilization, Saturation, Errors, Per device, Breakdowns)
Future Work: Vector
• Real-time, per-second, instance metrics
• On-demand CPU flame graphs, heat maps, ftrace metrics, and SystemTap metrics
• Analyze from clouds to roots quickly, and from a web interface
• Scalable: other teams can use it easily
(Diagram: Atlas Alerts → Atlas Dashboards → Atlas Metrics → Mogul, Salp → Vector; ICE; Chronos)
In Summary
• 1. Netflix architecture
– Fault tolerance: ASGs, ASG clusters, Hystrix (dependency API), Zuul (proxy), Simian army (testing)
– Reduces the severity and urgency of issues
• 2. Cloud Analysis
– Atlas (alerts/dashboards/metrics), Chronos (event tracking), Mogul & Salp (dependency analysis), ICE (AWS usage)
– Quickly narrow focus from cloud to ASG to instance
• 3. Instance Analysis
– Linux tools (*stat, sar, …), perf_events, ftrace, perf-tools, rdmsr, msr-cloud-tools, Vector
– Read logs, profile & trace all software, read CPU counters
References & Links
https://netflix.github.io/#repo
http://techblog.netflix.com/2012/01/auto-scaling-in-amazon-cloud.html
http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html
http://www.slideshare.net/benjchristensen/performance-and-fault-tolerance-for-the-netflix-api-qcon-sao-paulo
http://www.slideshare.net/adrianco/netflix-nosql-search
http://www.slideshare.net/ufried/resilience-with-hystrix
https://github.com/Netflix/Hystrix, https://github.com/Netflix/Zuul
http://techblog.netflix.com/2011/07/netflix-simian-army.html
http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html
http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html
http://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html
Systems Performance: Enterprise and the Cloud, Prentice Hall, 2014
http://sourceforge.net/projects/nicstat/
perf-tools: https://github.com/brendangregg/perf-tools
Ftrace: The Hidden Light Switch: http://lwn.net/Articles/608497/
msr-cloud-tools: https://github.com/brendangregg/msr-cloud-tools
Thanks
• Coburn Watson, Adrian Cockcroft
• Atlas: Insight Engineering (Roy Rapoport, etc.)
• Mogul: Performance Engineering (Scott Emmons, Martin Spier)
• Vector: Performance Engineering (Martin Spier, Amer Ather)
Thanks
• Questions?
• http://techblog.netflix.com
• http://slideshare.net/brendangregg
• http://www.brendangregg.com
• bgregg@netflix.com
• @brendangregg

Weitere ähnliche Inhalte

Was ist angesagt?

Deep Dive into Apache Kafka
Deep Dive into Apache KafkaDeep Dive into Apache Kafka
Deep Dive into Apache Kafkaconfluent
 
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScalePinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScaleSeunghyun Lee
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsBrendan Gregg
 
Extreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and TuningExtreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and TuningMilind Koyande
 
Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18CodeOps Technologies LLP
 
Cruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersCruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersPrateek Maheshwari
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlJiangjie Qin
 
eBPF - Observability In Deep
eBPF - Observability In DeepeBPF - Observability In Deep
eBPF - Observability In DeepMydbops
 
OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...
OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...
OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...NETWAYS
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaDatabricks
 
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin OmerogluStorage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin OmerogluHostedbyConfluent
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaMetrics
 
Real-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotReal-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotXiang Fu
 
Vitess VReplication: Standing on the Shoulders of a MySQL Giant
Vitess VReplication: Standing on the Shoulders of a MySQL GiantVitess VReplication: Standing on the Shoulders of a MySQL Giant
Vitess VReplication: Standing on the Shoulders of a MySQL GiantMatt Lord
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudDatabricks
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large ScaleVerverica
 
OSMC 2022 | OpenTelemetry 101 by Dotan Horovit s.pdf
OSMC 2022 | OpenTelemetry 101 by Dotan Horovit s.pdfOSMC 2022 | OpenTelemetry 101 by Dotan Horovit s.pdf
OSMC 2022 | OpenTelemetry 101 by Dotan Horovit s.pdfNETWAYS
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks
 

Was ist angesagt? (20)

Deep Dive into Apache Kafka
Deep Dive into Apache KafkaDeep Dive into Apache Kafka
Deep Dive into Apache Kafka
 
ELK Stack
ELK StackELK Stack
ELK Stack
 
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's ScalePinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
Pinot: Enabling Real-time Analytics Applications @ LinkedIn's Scale
 
Linux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old SecretsLinux Performance Analysis: New Tools and Old Secrets
Linux Performance Analysis: New Tools and Old Secrets
 
Extreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and TuningExtreme Linux Performance Monitoring and Tuning
Extreme Linux Performance Monitoring and Tuning
 
Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18
 
Cruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clustersCruise Control: Effortless management of Kafka clusters
Cruise Control: Effortless management of Kafka clusters
 
Introduction to Kafka Cruise Control
Introduction to Kafka Cruise ControlIntroduction to Kafka Cruise Control
Introduction to Kafka Cruise Control
 
eBPF - Observability In Deep
eBPF - Observability In DeepeBPF - Observability In Deep
eBPF - Observability In Deep
 
OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...
OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...
OSMC 2022 | The Power of Metrics, Logs & Traces with Open Source by Emil-Andr...
 
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital KediaTuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
Tuning Apache Spark for Large-Scale Workloads Gaoxiang Liu and Sital Kedia
 
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin OmerogluStorage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
Storage Capacity Management on Multi-tenant Kafka Cluster with Nurettin Omeroglu
 
VictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - PreviewVictoriaLogs: Open Source Log Management System - Preview
VictoriaLogs: Open Source Log Management System - Preview
 
The basics of fluentd
The basics of fluentdThe basics of fluentd
The basics of fluentd
 
Real-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache PinotReal-time Analytics with Presto and Apache Pinot
Real-time Analytics with Presto and Apache Pinot
 
Vitess VReplication: Standing on the Shoulders of a MySQL Giant
Vitess VReplication: Standing on the Shoulders of a MySQL GiantVitess VReplication: Standing on the Shoulders of a MySQL Giant
Vitess VReplication: Standing on the Shoulders of a MySQL Giant
 
Apache Spark At Scale in the Cloud
Apache Spark At Scale in the CloudApache Spark At Scale in the Cloud
Apache Spark At Scale in the Cloud
 
Stephan Ewen - Experiences running Flink at Very Large Scale
Stephan Ewen -  Experiences running Flink at Very Large ScaleStephan Ewen -  Experiences running Flink at Very Large Scale
Stephan Ewen - Experiences running Flink at Very Large Scale
 
OSMC 2022 | OpenTelemetry 101 by Dotan Horovit s.pdf
OSMC 2022 | OpenTelemetry 101 by Dotan Horovit s.pdfOSMC 2022 | OpenTelemetry 101 by Dotan Horovit s.pdf
OSMC 2022 | OpenTelemetry 101 by Dotan Horovit s.pdf
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 

Andere mochten auch

ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016Brendan Gregg
 
Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsBrendan Gregg
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsBrendan Gregg
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF SuperpowersBrendan Gregg
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFBrendan Gregg
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBrendan Gregg
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 InstancesBrendan Gregg
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixBrendan Gregg
 
Modern Data Center Network Architecture - The house that Clos built
Modern Data Center Network Architecture - The house that Clos builtModern Data Center Network Architecture - The house that Clos built
Modern Data Center Network Architecture - The house that Clos builtCumulus Networks
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2Tanel Poder
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1Tanel Poder
 
Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)Tanel Poder
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFBrendan Gregg
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing ToolsBrendan Gregg
 
広く知ってほしいDNSのこと ―とあるセキュリティ屋から見たDNS受難の10年間―
広く知ってほしいDNSのこと ―とあるセキュリティ屋から見たDNS受難の10年間―広く知ってほしいDNSのこと ―とあるセキュリティ屋から見たDNS受難の10年間―
広く知ってほしいDNSのこと ―とあるセキュリティ屋から見たDNS受難の10年間―Tomohiro Nakashima
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersBrendan Gregg
 
4つの戦犯から考えるサービスづくりの失敗
4つの戦犯から考えるサービスづくりの失敗4つの戦犯から考えるサービスづくりの失敗
4つの戦犯から考えるサービスづくりの失敗toshihiro ichitani
 

Andere mochten auch (17)

ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016ACM Applicative System Methodology 2016
ACM Applicative System Methodology 2016
 
Stop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production SystemsStop the Guessing: Performance Methodologies for Production Systems
Stop the Guessing: Performance Methodologies for Production Systems
 
SREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREsSREcon 2016 Performance Checklists for SREs
SREcon 2016 Performance Checklists for SREs
 
Linux BPF Superpowers
Linux BPF SuperpowersLinux BPF Superpowers
Linux BPF Superpowers
 
Velocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPFVelocity 2017 Performance analysis superpowers with Linux eBPF
Velocity 2017 Performance analysis superpowers with Linux eBPF
 
Blazing Performance with Flame Graphs
Blazing Performance with Flame GraphsBlazing Performance with Flame Graphs
Blazing Performance with Flame Graphs
 
Performance Tuning EC2 Instances
Performance Tuning EC2 InstancesPerformance Tuning EC2 Instances
Performance Tuning EC2 Instances
 
Kernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at NetflixKernel Recipes 2017: Using Linux perf at Netflix
Kernel Recipes 2017: Using Linux perf at Netflix
 
Modern Data Center Network Architecture - The house that Clos built
Modern Data Center Network Architecture - The house that Clos builtModern Data Center Network Architecture - The house that Clos built
Modern Data Center Network Architecture - The house that Clos built
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 2
 
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
Tanel Poder - Troubleshooting Complex Oracle Performance Issues - Part 1
 
Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)Tanel Poder Oracle Scripts and Tools (2010)
Tanel Poder Oracle Scripts and Tools (2010)
 
Linux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPFLinux 4.x Tracing: Performance Analysis with bcc/BPF
Linux 4.x Tracing: Performance Analysis with bcc/BPF
 
Designing Tracing Tools
Designing Tracing ToolsDesigning Tracing Tools
Designing Tracing Tools
 
広く知ってほしいDNSのこと ―とあるセキュリティ屋から見たDNS受難の10年間―
広く知ってほしいDNSのこと ―とあるセキュリティ屋から見たDNS受難の10年間―広く知ってほしいDNSのこと ―とあるセキュリティ屋から見たDNS受難の10年間―
広く知ってほしいDNSのこと ―とあるセキュリティ屋から見たDNS受難の10年間―
 
Linux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF SuperpowersLinux 4.x Tracing Tools: Using BPF Superpowers
Linux 4.x Tracing Tools: Using BPF Superpowers
 
4つの戦犯から考えるサービスづくりの失敗
4つの戦犯から考えるサービスづくりの失敗4つの戦犯から考えるサービスづくりの失敗
4つの戦犯から考えるサービスづくりの失敗
 

Ähnlich wie Netflix Cloud Performance Analysis and Root Cause Diagnostics

YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixBrendan Gregg
 
Kaseya Connect 2012 - THE ABC'S OF MONITORING
Kaseya Connect 2012 - THE ABC'S OF MONITORINGKaseya Connect 2012 - THE ABC'S OF MONITORING
Kaseya Connect 2012 - THE ABC'S OF MONITORINGKaseya
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasMonal Daxini
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Robert Metzger
 
NetflixOSS for Triangle Devops Oct 2013
NetflixOSS for Triangle Devops Oct 2013NetflixOSS for Triangle Devops Oct 2013
NetflixOSS for Triangle Devops Oct 2013aspyker
 
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014Amazon Web Services
 
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵Amazon Web Services Korea
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly SolarWinds Loggly
 
Streaming meetup
Streaming meetupStreaming meetup
Streaming meetupkarthik_krk
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scalethelabdude
 
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...Lucidworks
 
Running & Monitoring Docker at Scale
Running & Monitoring Docker at ScaleRunning & Monitoring Docker at Scale
Running & Monitoring Docker at ScaleDatadog
 
Deploying and managing Solr at scale
Deploying and managing Solr at scaleDeploying and managing Solr at scale
Deploying and managing Solr at scaleAnshum Gupta
 
Maria DB Galera Cluster for High Availability
Maria DB Galera Cluster for High AvailabilityMaria DB Galera Cluster for High Availability
Maria DB Galera Cluster for High AvailabilityOSSCube
 
MariaDB Galera Cluster
MariaDB Galera ClusterMariaDB Galera Cluster
MariaDB Galera ClusterAbdul Manaf
 
Real world Scala hAkking NLJUG JFall 2011
Real world Scala hAkking NLJUG JFall 2011Real world Scala hAkking NLJUG JFall 2011
Real world Scala hAkking NLJUG JFall 2011Raymond Roestenburg
 

Ähnlich wie Netflix Cloud Performance Analysis and Root Cause Diagnostics (20)

YOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at NetflixYOW2018 Cloud Performance Root Cause Analysis at Netflix
YOW2018 Cloud Performance Root Cause Analysis at Netflix
 
Kaseya Connect 2012 - THE ABC'S OF MONITORING
Kaseya Connect 2012 - THE ABC'S OF MONITORINGKaseya Connect 2012 - THE ABC'S OF MONITORING
Kaseya Connect 2012 - THE ABC'S OF MONITORING
 
Tech4Africa 2014
Tech4Africa 2014Tech4Africa 2014
Tech4Africa 2014
 
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paas
 
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
Architecture of Flink's Streaming Runtime @ ApacheCon EU 2015
 
NetflixOSS for Triangle Devops Oct 2013
NetflixOSS for Triangle Devops Oct 2013NetflixOSS for Triangle Devops Oct 2013
NetflixOSS for Triangle Devops Oct 2013
 
mg-ccr-ws-10122008
mg-ccr-ws-10122008mg-ccr-ws-10122008
mg-ccr-ws-10122008
 
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
(APP309) Running and Monitoring Docker Containers at Scale | AWS re:Invent 2014
 
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵 [AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
[AWS Dev Day] 실습워크샵 | Amazon EKS 핸즈온 워크샵
 
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
AWS re:Invent presentation: Unmeltable Infrastructure at Scale by Loggly
 
Streaming meetup
Streaming meetupStreaming meetup
Streaming meetup
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
Solr Compute Cloud – An Elastic Solr Infrastructure: Presented by Nitin Sharm...
 
Running & Monitoring Docker at Scale
Running & Monitoring Docker at ScaleRunning & Monitoring Docker at Scale
Running & Monitoring Docker at Scale
 
Deploying and managing Solr at scale
Deploying and managing Solr at scaleDeploying and managing Solr at scale
Deploying and managing Solr at scale
 
Svc 202-netflix-open-source
Svc 202-netflix-open-sourceSvc 202-netflix-open-source
Svc 202-netflix-open-source
 
Maria DB Galera Cluster for High Availability
Maria DB Galera Cluster for High AvailabilityMaria DB Galera Cluster for High Availability
Maria DB Galera Cluster for High Availability
 
MariaDB Galera Cluster
MariaDB Galera ClusterMariaDB Galera Cluster
MariaDB Galera Cluster
 
Real world Scala hAkking NLJUG JFall 2011
Real world Scala hAkking NLJUG JFall 2011Real world Scala hAkking NLJUG JFall 2011
Real world Scala hAkking NLJUG JFall 2011
 

Mehr von Brendan Gregg

YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing PerformanceBrendan Gregg
 
IntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingIntelON 2021 Processor Benchmarking
IntelON 2021 Processor BenchmarkingBrendan Gregg
 
Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Performance Wins with eBPF: Getting Started (2021)
Performance Wins with eBPF: Getting Started (2021)Brendan Gregg
 
Systems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedSystems@Scale 2021 BPF Performance Getting Started
Systems@Scale 2021 BPF Performance Getting StartedBrendan Gregg
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
BPF Internals (eBPF)
BPF Internals (eBPF)BPF Internals (eBPF)
BPF Internals (eBPF)Brendan Gregg
 
Performance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedPerformance Wins with BPF: Getting Started
Performance Wins with BPF: Getting StartedBrendan Gregg
 
YOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceYOW2020 Linux Systems Performance
YOW2020 Linux Systems PerformanceBrendan Gregg
 
re:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at Netflixre:Invent 2019 BPF Performance Analysis at Netflix
re:Invent 2019 BPF Performance Analysis at NetflixBrendan Gregg
 
UM2019 Extended BPF: A New Type of Software
UM2019 Extended BPF: A New Type of SoftwareUM2019 Extended BPF: A New Type of Software

Netflix Cloud Performance Analysis and Root Cause Diagnostics

  • 13. Cloud Instances • Base server instance image + customizations by service teams (BaseAMI). Typically: Linux (CentOS or Ubuntu); Java (JDK 7 or 8); GC and Tomcat thread dump logging; application war files, base servlet, platform, Hystrix, health check, metrics (Servo); optional Apache, memcached, non-Java apps (incl. Node.js); Atlas monitoring, S3 log rotation, ftrace, perf, stap, custom perf tools
  • 14. Scalability and Reliability # Problem / Solution: 1 Load increases – auto scale with ASGs; 2 Poor performing code push – rapid rollback with red/black ASG clusters; 3 Instance failure – Hystrix timeouts and secondaries; 4 Zone/Region failure – Zuul to reroute traffic; 5 Overlooked and unhandled issues – Simian Army; 6 Poor performance – Atlas metrics, alerts, Chronos
  • 15. 1. Auto Scaling Groups • Instances automatically added or removed by a custom scaling policy – A broken policy could cause false scaling • Alerts & audits used to check scaling is sane [Diagram: a scaling policy (loadavg, latency, …) drives the ASG's instances via CloudWatch/Servo and cloud configuration management] (an illustrative AWS-level sketch follows)
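As a rough illustration of how a scaling policy and its triggering alarm fit together at the AWS level (the Netflix policies are custom and driven by Atlas/Servo metrics, so the group name, metric, and thresholds below are hypothetical):

# hypothetical example: add 2 instances when average CPU exceeds 60% for 10 minutes
$ aws autoscaling put-scaling-policy \
    --auto-scaling-group-name example-asg-v011 --policy-name scale-up-on-cpu \
    --adjustment-type ChangeInCapacity --scaling-adjustment 2
# the command above prints a policy ARN; use it as the alarm action:
$ aws cloudwatch put-metric-alarm \
    --alarm-name example-asg-high-cpu --metric-name CPUUtilization \
    --namespace AWS/EC2 --statistic Average --period 300 --threshold 60 \
    --comparison-operator GreaterThanThreshold --evaluation-periods 2 \
    --dimensions Name=AutoScalingGroupName,Value=example-asg-v011 \
    --alarm-actions <policy-ARN-from-previous-command>

A mistake at either step (wrong metric, wrong sign on the adjustment) is what "false scaling" looks like, hence the alerts and audits.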
  • 16. 2. ASG Clusters • How code versions are really deployed • Traffic managed by Elastic Load Balancers (ELBs) • Fast rollback if issues are found – Might roll back undiagnosed issues • Canaries can also be used for testing (and automated) [Diagram: an ELB in front of ASG cluster prod1, containing ASG-v010 and ASG-v011 instances plus a canary]
  • 17. 3. Hystrix • A library for latency and fault tolerance for dependency services – Fallbacks, degradation, fast fail and rapid recovery – Supports timeouts, load shedding, circuit breaker – Uses thread pools for dependency services – Realtime monitoring • Plus the Ribbon IPC library (NIWS), which adds even more fault tolerance [Diagram: a Tomcat application issues "get A" through Hystrix to Dependency A1, falling back to Dependency A2 after >100 ms]
  • 18. 4. Redundancy • All device traffic goes through the Zuul proxy: – dynamic routing, monitoring, resiliency, security • Availability Zone failure: run from 2 of 3 zones • Region failure: reroute traffic [Diagram: Zuul routing to AZ1, AZ2, AZ3, with monitoring]
  • 19. 5. Simian Army • Ensures the cloud handles failures through regular testing • Monkeys: – Latency: artificial delays – Conformity: kills non-best-practices instances – Doctor: health checks – Janitor: unused instances – Security: checks violations – 10-18: geographic issues – Chaos Gorilla: AZ failure • We're hiring Chaos Engineers!
  • 20. 6. Atlas, alerts, Chronos • Atlas: Cloud-wide monitoring tool – Millions of metrics, quick rollups, custom dashboards • Alerts: Custom, using Atlas metrics – In particular, error & timeout rates on client devices • Chronos: Change tracking – Used during incident investigations
  • 21. In Summary # Problem / Solution: 1 Load increases – ASGs; 2 Poor performing code push – ASG clusters; 3 Instance issue – Hystrix; 4 Zone/Region issue – Zuul; 5 Overlooked and unhandled issues – Monkeys; 6 Poor performance – Atlas, alerts, Chronos • Netflix is very good at automatically handling failure – Issues often lead to rapid instance growth (ASGs) • Good for customers – Fast workaround • Good for engineers – Fix later, 9-5
  • 22. Typical Netflix Stack [Diagram: devices → ELB → Zuul → application ASG cluster (ASG 1, ASG 2 across AZ 1–3) → service instances (Linux, JVM, Tomcat, Hystrix, Ribbon), plus dependencies, Atlas (monitoring), Discovery, and the monkeys; the problems/solutions 1–6 are enumerated on the diagram]
  • 23. * Exceptions • Apache Web Server • Node.js • …
  • 25. Root Cause Performance Analysis • Conducted when: – Growth becomes a cost problem – More instances or rollbacks don't work • E.g.: dependency issue, networking, … – A fix is needed for forward progress • "But it's faster on Linux 2.6.21 m2.xlarge!" • Staying on older versions for an undiagnosed (and fixable) reason prevents gains from later improvements – To understand scalability factors • Identifies the origin of poor performance
  • 26. Root Cause Analysis Process • From cloud to instance: [Diagram: Netflix → application → SG → ELB → ASG cluster → ASG 1/ASG 2 → AZ 1–3 → instances (Linux) → JVM → Tomcat → service]
  • 27. Cloud Methodologies • Resource Analysis – Any resources exhausted? CPU, disk, network • Metric and event correlations – When things got bad, what else happened? – Correlate with distributed dependencies • Latency Drilldowns – Trace the origin of high latency from the request down through dependencies • USE Method – For every service, check: utilization, saturation, errors
  • 28. Instance Methodologies • Log Analysis – dmesg, GC, Apache, Tomcat, custom • USE Method – For every resource, check: utilization, saturation, errors (a checklist sketch follows) • Micro-benchmarking – Test and measure components in isolation • Drill-down analysis – Decompose request latency, repeat • And other system performance methodologies
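As a minimal sketch of that USE checklist with stock Linux tools (one reasonable set of commands; others work just as well):

$ uptime                # CPU saturation: load averages vs CPU count
$ mpstat -P ALL 1       # CPU utilization, per CPU
$ vmstat 1              # "r" > CPU count = CPU saturation; si/so = memory saturation (swapping)
$ free -m               # memory utilization
$ iostat -xz 1          # disk utilization (%util), saturation (avgqu-sz), latency (await)
$ sar -n DEV,EDEV 1     # network utilization (throughput vs line rate) and errors/drops
$ dmesg | tail          # recent errors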
  • 29. Bad Instances • Not all issues are root caused – "bad instance" != root cause • Sometimes it is efficient to just kill "bad instances" – They could be a lone hardware issue, which could take days to analyze • But they could also be an early warning of a global issue. If you kill them, you don't know. [Diagram: a group of instances, one marked as a bad instance]
  • 30. Bad Instance Anti-Method 1. Plot request latency per instance 2. Find the bad instance 3. Terminate the bad instance 4. Someone else's problem now! [Screenshot: 95th percentile latency (Atlas Exploder), with the bad instance highlighted and terminated]
  • 32. Cloud Analysis • Cloud analysis tools made and used at Netflix include (tool – purpose): Atlas – metrics, dashboards, alerts; Chronos – change tracking; Mogul – metric correlation; Salp – dependency graphing; ICE – cloud usage dashboard • Monitor everything: you can't tune what you can't see
  • 33. Netflix Cloud Analysis Process [Flow diagram, example path enumerated: 1. Check Issue (Atlas alerts, cost/ICE) → 2. Check Events (Chronos) → 3. Drill Down (Atlas dashboards, Atlas metrics) → 4. Check Dependencies (Mogul, Salp) → 5. Root Cause (instance analysis); may redirect to a new target or create a new alert]
  • 34. Atlas: Alerts • Custom alerts based on the Atlas metrics – CPU usage, latency, instance count growth, … • Usually email or pager – Can also deactivate instances, terminate, reboot • Next step: check the dashboards
  • 36. Atlas: Dashboards [Screenshot labels: custom graphs, set time, breakdowns, interactive; click graphs for more metrics]
  • 37. Atlas: Dashboards • Cloud wide and per-service (all custom) • Starting point for issue investigations 1. Confirm and quantify the issue 2. Check the historic trend 3. Launch the Atlas metrics view to drill down [Screenshot: cloud-wide streams per second (SPS) dashboard]
  • 39. Atlas: Metrics [Screenshot labels: region, app, metrics, breakdowns, options, interactive graph, summary statistics]
  • 40. Atlas: Metrics • All metrics in one system • System metrics: – CPU usage, disk I/O, memory, … • Application metrics: – latency percentiles, errors, … • Filters or breakdowns by region, application, ASG, metric, instance, … – Quickly narrow an investigation • URL contains session state: sharable
  • 42. Chronos: Change Tracking [Screenshot labels: criticality, app breakdown, legend, historic event list]
  • 43. Chronos: Change Tracking • Quickly filter uninteresting events • Performance issues often coincide with changes • The size and velocity of Netflix engineering makes Chronos crucial for communicating change
  • 44. Mogul: Correlations • Comparing performance with per-resource demand
  • 45. Mogul: Correlations • Comparing performance with per-resource demand [Screenshot labels: app latency, throughput, resource demand correlation]
  • 46. Mogul: Correlations • Measures demand using Little's Law – D = R * X, where D = demand (in seconds per second), R = average response time, X = throughput (a worked example follows) • Discover unexpected problem dependencies – That aren't on the service dashboards • Mogul checks many other correlations – Weeds through thousands of application metrics, showing you the most related/interesting ones – (Scott/Martin should give a talk just on these) • Bearing in mind correlation is not causation
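A quick worked example of that demand calculation (illustrative numbers): if a dependency's average response time is R = 5 ms and its throughput is X = 200 requests/sec, then D = 0.005 s x 200/s = 1.0 second of busy time per second; roughly one thread, connection, or CPU is kept permanently busy serving it. A jump in D can flag a problem dependency even when its own dashboard looks healthy.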
  • 47. Salp: Dependency Graphing • Dependency graphs based on live trace data • Interactive • See architectural issues
  • 48. Salp: Dependency Graphing • Dependency graphs based on live trace data • Interactive • See architectural issues [Screenshot labels: the app, its dependencies, their dependencies, …]
  • 50. ICE: AWS Usage [Screenshot labels: services, cost per hour]
  • 51. ICE: AWS Usage • Cost per hour by AWS service, and Netflix application (service team) – Identify issues of slow growth • Directs engineering effort to reduce cost
  • 52. Netflix Cloud Analysis Process • In summary… [Flow diagram, example path enumerated, plus some other tools not pictured: 1. Check Issue (Atlas alerts, cost/ICE) → 2. Check Events (Chronos) → 3. Drill Down (Atlas dashboards, Atlas metrics) → 4. Check Dependencies (Mogul, Salp) → 5. Root Cause (instance analysis); may redirect to a new target or create a new alert]
  • 53. Generic Cloud Analysis Process [Flow diagram, example path enumerated: 1. Check Issue (alerts, usage reports, cost) → 2. Check Events (change tracking) → 3. Drill Down (custom dashboards, metric analysis) → 4. Check Dependencies (dependency analysis) → 5. Root Cause (instance analysis); may redirect to a new target or create a new alert]
  • 55. Instance Analysis • Locate, quantify, and fix performance issues anywhere in the system
  • 56. Instance Tools • Linux – top, ps, pidstat, vmstat, iostat, mpstat, netstat, nicstat, sar, strace, tcpdump, ss, … • System Tracing – ftrace, perf_events, SystemTap • CPU Performance Counters – perf_events, rdmsr • Application Profiling – application logs, perf_events, Google Lightweight Java Profiler (LJP), Java Flight Recorder (JFR)
  • 57. Tools in an AWS EC2 Linux Instance
  • 58. Linux Performance Analysis • vmstat, pidstat, sar, etc., are used in the usual way $ sar -n TCP,ETCP,DEV 1! Linux 3.2.55 (test-e4f1a80b) !08/18/2014 !_x86_64_!(8 CPU)! ! 09:10:43 PM IFACE rxpck/s txpck/s rxkB/s txkB/s rxcmp/s txcmp/s rxmcst/s! 09:10:44 PM lo 14.00 14.00 1.34 1.34 0.00 0.00 0.00! 09:10:44 PM eth0 4114.00 4186.00 4537.46 28513.24 0.00 0.00 0.00! ! 09:10:43 PM active/s passive/s iseg/s oseg/s! 09:10:44 PM 21.00 4.00 4107.00 22511.00! ! 09:10:43 PM atmptf/s estres/s retrans/s isegerr/s orsts/s! 09:10:44 PM 0.00 0.00 36.00 0.00 1.00! […]! • Micro-benchmarking can be used to investigate hypervisor behavior that can't be observed directly (a trivial sketch follows)
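For instance, a crude micro-benchmark run inside the guest can expose costs that the hypervisor hides from normal observability tools; both commands below are illustrative, and any repeatable, well-understood workload will do:

$ dd if=/dev/zero of=/dev/null bs=1 count=1000000   # ~1M tiny read+write syscalls; compare elapsed time across instance types and kernels
$ openssl speed sha256                              # CPU-bound baseline; sensitive to clock speed, turbo, and CPU steal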
  • 59. Instance Challenges • Application Profiling – For Java, Node.js • System Tracing – On Linux • Accessing CPU Performance Counters – From cloud guests
  • 60. Application Profiling • We've found many tools are inaccurate or broken – E.g., those based on java hprof • Stack profiling can be problematic: – Linux perf_events: the frame pointer for the JVM is often missing (omitted by hotspot), breaking stacks. Also needs perf-map-agent loaded for symbol translation. – DTrace: jstack() also broken by missing FPs, https://bugs.openjdk.java.net/browse/JDK-6276264, 2005 • Flame graphs are solving many performance issues. These need working stacks.
  • 61. Application Profiling: Java • Java Flight Recorder – CPU & memory profiling. Oracle. $$$ • Google Lightweight Java Profiler – Basic, open source, free, asynchronous CPU profiler – Uses an agent that dumps hprof-like output (a sketch of the workflow follows) • https://code.google.com/p/lightweight-java-profiler/wiki/GettingStarted • http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html • Plus others at various times (YourKit, …)
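A sketch of that LJP workflow (the agent path and trace output file name are placeholders; see the GettingStarted wiki and the blog post above for the exact options):

# start the JVM with the LJP agent attached (path and app name illustrative)
$ java -agentpath:/path/to/liblagent.so ... MyApp
# fold its hprof-like trace output and render a flame graph with the FlameGraph tools
$ ./stackcollapse-ljp.awk < traces.txt | ./flamegraph.pl > java-cpu.svg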
  • 62. LJP CPU Flame Graph (Java)
  • 63. LJP CPU Flame Graph (Java) [Screenshot annotations: stack frame, ancestry, mouse-over frames to quantify]
  • 64. Linux System Profiling • Previous profilers only show Java CPU time • We use perf_events (aka the "perf" command) to sample everything else: – JVM internals & libraries – The Linux kernel – Other apps, incl. Node.js • perf CPU flame graphs: # git clone https://github.com/brendangregg/FlameGraph! # cd FlameGraph! # perf record -F 99 -ag -- sleep 60! # perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg!
  • 65. perf CPU Flame Graph
  • 66. perf CPU Flame Graph Broken Java stacks (missing frame pointer) Kernel TCP/IP GC Idle Time thread Locks epoll
  • 67. Application Profiling: Node.js • Performance analysis on Linux is a growing area – E.g., new postmortem tools from 2 weeks ago: https://github.com/tjfontaine/lldb-v8 • Flame graphs are possible using Linux perf_events (perf) and v8 --perf_basic_prof (node v0.11.13+), as sketched below – Although there is currently a map growth bug; see: http://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html • Also do heap analysis – node-heapdump
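A sketch of that perf-based Node.js workflow (the sample rate and duration are arbitrary choices):

$ node --perf_basic_prof app.js &        # v8 writes /tmp/perf-<PID>.map for JIT symbol translation
$ perf record -F 99 -p $(pgrep -n node) -g -- sleep 30
$ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > node-cpu.svg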
  • 68. Flame Graphs • CPU sample flame graphs solve many issues – We're automating their collection – If you aren't using them yet, you're missing out on low-hanging fruit! • Other flame graph types are useful as well (see the disk I/O sketch below) – Disk I/O, network I/O, memory events, etc. – Any profile that includes more stacks than can be quickly read
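For example, a disk I/O flame graph can reuse the same toolchain by recording a tracepoint instead of timed CPU samples; a sketch (this tracepoint is one reasonable choice):

$ perf record -e block:block_rq_issue -ag -- sleep 60    # stacks that issued block device I/O
$ perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > diskio.svg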
  • 69. Linux Tracing • ... now for something more challenging
  • 70. Linux Tracing • Too many choices, and many are still in development: – ftrace – perf_events – eBPF – SystemTap – ktap – LTTng – dtrace4linux – sysdig
  • 71. Linux Tracing • A system tracer is needed to root cause many issues: kernel, library, app – (There's a pretty good book covering use cases) • DTrace is awesome, but the Linux ports are incomplete • Linux does have ftrace and perf_events in the kernel source, which – it turns out – can satisfy many needs already!
  • 72. Linux Tracing: ftrace • Added by Steven Rostedt and others since 2.6.27 • Already enabled on our servers (3.2+) – CONFIG_FTRACE, CONFIG_FUNCTION_PROFILER, … – Use directly via /sys/kernel/debug/tracing (a minimal sketch follows) • Front-end tools to aid usage: perf-tools – https://github.com/brendangregg/perf-tools – Unsupported hacks: see WARNINGs – Also see the trace-cmd front-end, as well as perf • lwn.net: "Ftrace: The Hidden Light Switch"
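The raw tracing interface that perf-tools wraps can also be driven by hand; a minimal sketch (the traced function is just an example):

# cd /sys/kernel/debug/tracing
# echo do_sys_open > set_ftrace_filter    # restrict tracing to one kernel function
# echo function > current_tracer
# cat trace_pipe                          # watch live events; Ctrl-C to stop
# echo nop > current_tracer               # always reset when done
# echo > set_ftrace_filter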
  • 73. perf-tools: iosnoop • Block I/O (disk) events with latency: # ./iosnoop -ts! Tracing block I/O. Ctrl-C to end.! STARTs ENDs COMM PID TYPE DEV BLOCK BYTES LATms! 5982800.302061 5982800.302679 supervise 1809 W 202,1 17039600 4096 0.62! 5982800.302423 5982800.302842 supervise 1809 W 202,1 17039608 4096 0.42! 5982800.304962 5982800.305446 supervise 1801 W 202,1 17039616 4096 0.48! 5982800.305250 5982800.305676 supervise 1801 W 202,1 17039624 4096 0.43! […]! # ./iosnoop -h! USAGE: iosnoop [-hQst] [-d device] [-i iotype] [-p PID] [-n name] [duration]! -d device # device string (eg, "202,1)! -i iotype # match type (eg, '*R*' for all reads)! -n name # process name to match on I/O issue! -p PID # PID to match on I/O issue! -Q # include queueing time in LATms! -s # include start time of I/O (s)! -t # include completion time of I/O (s)! -h # this usage message! duration # duration seconds, and use buffers! […]!
  • 74. perf-tools: iolatency • Block I/O (disk) latency distributions: # ./iolatency ! Tracing block I/O. Output every 1 seconds. Ctrl-C to end.! ! >=(ms) .. <(ms) : I/O |Distribution |! 0 -> 1 : 2104 |######################################|! 1 -> 2 : 280 |###### |! 2 -> 4 : 2 |# |! 4 -> 8 : 0 | |! 8 -> 16 : 202 |#### |! ! >=(ms) .. <(ms) : I/O |Distribution |! 0 -> 1 : 1144 |######################################|! 1 -> 2 : 267 |######### |! 2 -> 4 : 10 |# |! 4 -> 8 : 5 |# |! 8 -> 16 : 248 |######### |! 16 -> 32 : 601 |#################### |! 32 -> 64 : 117 |#### |! […]!
  • 75. perf-­‐tools: opensnoop • Trace open() syscalls showing filenames: # ./opensnoop -t! Tracing open()s. Ctrl-C to end.! TIMEs COMM PID FD FILE! 4345768.332626 postgres 23886 0x8 /proc/self/oom_adj! 4345768.333923 postgres 23886 0x5 global/pg_filenode.map! 4345768.333971 postgres 23886 0x5 global/pg_internal.init! 4345768.334813 postgres 23886 0x5 base/16384/PG_VERSION! 4345768.334877 postgres 23886 0x5 base/16384/pg_filenode.map! 4345768.334891 postgres 23886 0x5 base/16384/pg_internal.init! 4345768.335821 postgres 23886 0x5 base/16384/11725! 4345768.347911 svstat 24649 0x4 supervise/ok! 4345768.347921 svstat 24649 0x4 supervise/status! 4345768.350340 stat 24651 0x3 /etc/ld.so.cache! 4345768.350372 stat 24651 0x3 /lib/x86_64-linux-gnu/libselinux…! 4345768.350460 stat 24651 0x3 /lib/x86_64-linux-gnu/libc.so.6! 4345768.350526 stat 24651 0x3 /lib/x86_64-linux-gnu/libdl.so.2! 4345768.350981 stat 24651 0x3 /proc/filesystems! 4345768.351182 stat 24651 0x3 /etc/nsswitch.conf! […]!
  • 76. perf-­‐tools: funcgraph • Trace a graph of kernel code flow: # ./funcgraph -Htp 5363 vfs_read! Tracing "vfs_read" for PID 5363... Ctrl-C to end.! # tracer: function_graph! #! # TIME CPU DURATION FUNCTION CALLS! # | | | | | | | |! 4346366.073832 | 0) | vfs_read() {! 4346366.073834 | 0) | rw_verify_area() {! 4346366.073834 | 0) | security_file_permission() {! 4346366.073834 | 0) | apparmor_file_permission() {! 4346366.073835 | 0) 0.153 us | common_file_perm();! 4346366.073836 | 0) 0.947 us | }! 4346366.073836 | 0) 0.066 us | __fsnotify_parent();! 4346366.073836 | 0) 0.080 us | fsnotify();! 4346366.073837 | 0) 2.174 us | }! 4346366.073837 | 0) 2.656 us | }! 4346366.073837 | 0) | tty_read() {! 4346366.073837 | 0) 0.060 us | tty_paranoia_check();! […]!
  • 77. perf-tools: kprobe • Dynamically trace a kernel function call or return, with variables, and in-kernel filtering: # ./kprobe 'p:open do_sys_open filename=+0(%si):string' 'filename ~ "*stat"'! Tracing kprobe myopen. Ctrl-C to end.! postgres-1172 [000] d... 6594028.787166: open: (do_sys_open +0x0/0x220) filename="pg_stat_tmp/pgstat.stat"! postgres-1172 [001] d... 6594028.797410: open: (do_sys_open +0x0/0x220) filename="pg_stat_tmp/pgstat.stat"! postgres-1172 [001] d... 6594028.797467: open: (do_sys_open +0x0/0x220) filename="pg_stat_tmp/pgstat.stat"! ^C! Ending tracing...! • Add -s for stack traces; -p for a PID filter in-kernel. • Quickly confirm kernel behavior; e.g.: did a tunable take effect?
  • 79. Heat Maps • ftrace or perf_events for tracing disk I/O and other latencies as a heat map:
  • 80. Other Tracing Options • SystemTap – The most powerful of the system tracers – We'll use it as a last resort: deep custom tracing – I've historically had issues with panics and freezes • Still present in the latest version? • The Netflix fault-tolerant architecture makes panics much less of a problem (that was the panic monkey) • Instance canaries with DTrace are possible too – OmniOS – FreeBSD
  • 81. Linux Tracing Future • ftrace + perf_events cover much, but not custom in-kernel aggregations • eBPF may provide this missing feature – e.g., an in-kernel latency heat map (showing a bimodal distribution):
  • 82. Linux Tracing Future • ftrace + perf_events cover much, but not custom in-kernel aggregations • eBPF may provide this missing feature – e.g., an in-kernel latency heat map (showing a bimodal distribution): [Heat map annotations: low-latency cache hits, high-latency device I/O, time on the x-axis]
  • 83. CPU Performance Counters • … is this even possible from a cloud guest?
  • 84. CPU Performance Counters • Model Specific Registers (MSRs) – Basic details: Fmestamp clock, temperature, power – Some are available in EC2 • Performance Monitoring Counters (PMCs) – Advanced details: cycles, stall cycles, cache misses, … – Not available in EC2 (by default) • Root cause CPU usage at the cycle level – Eg, higher CPU usage due to more memory stall cycles
  • 85. msr-­‐cloud-­‐tools • Uses the msr-­‐tools package and rdmsr(1) – hlps://github.com/brendangregg/msr-­‐cloud-­‐tools ec2-guest# ./cputemp 1! CPU1 CPU2 CPU3 CPU4! 61 61 60 59! CPU Temperature 60 61 60 60! [...]! ec2-guest# ./showboost! CPU MHz : 2500! Turbo MHz : 2900 (10 active)! Turbo Ratio : 116% (10 active)! CPU 0 summary every 5 seconds...! ! TIME C0_MCYC C0_ACYC UTIL RATIO MHz! 06:11:35 6428553166 7457384521 51% 116% 2900! 06:11:40 6349881107 7365764152 50% 115% 2899! 06:11:45 6240610655 7239046277 49% 115% 2899! 06:11:50 6225704733 7221962116 49% 116% 2900! [...]! Real CPU MHz
  • 86. MSRs: CPU Temperature • Useful to explain variaFon in turbo boost (if seen) • Temperature for a syntheFc workload:
  • 87. MSRs: Intel Turbo Boost • Can dynamically increase CPU speed up to 30+% • This can mess up all performance comparisons • Clock speed can be observed from MSRs using – IA32_MPERF: Bits 63:0 is TSC Frequency Clock Counter C0_MCNT TSC relaFve – IA32_APERF: Bits 63:0 is TSC Frequency Clock Counter C0_ACNT actual clocks • This is how msr-­‐cloud-­‐tools showturbo works
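For reference, those MSRs can be read directly with rdmsr(1) from msr-tools (0xE7 and 0xE8 are the documented Intel addresses for IA32_MPERF and IA32_APERF; use deltas over an interval, not single readings):

# modprobe msr
# rdmsr -p0 0xE7      # IA32_MPERF: counts at the base (TSC) frequency while in C0
# rdmsr -p0 0xE8      # IA32_APERF: counts at the actual running frequency while in C0
# real MHz over an interval ~= base MHz * (delta APERF / delta MPERF)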
  • 88. PMCs
  • 89. PMCs • Needed for remaining low-level CPU analysis: – CPU stall cycles, and stall cycle breakdowns – L1, L2, L3 cache hit/miss ratio – Memory, CPU interconnect, and bus I/O • Not enabled by default in EC2, but possible, e.g.: # perf stat -e cycles,instructions,r0480,r01A2 -p `pgrep -n java` sleep 10! ! Performance counter stats for process id '17190':! ! 71,208,028,133 cycles # 0.000 GHz [100.00%]! 41,603,452,060 instructions # 0.58 insns per cycle [100.00%]! 23,489,032,742 r0480 [100.00%]! 20,241,290,520 r01A2! ! 10.000894718 seconds time elapsed! (r0480 = ICACHE.IFETCH_STALL, r01A2 = RESOURCE_STALLS.ANY)
  • 90. Using Advanced Perf Tools • Not everyone needs to learn these • Reality: – A. Your company has one or more people for advanced perf analysis (a perf team). Ask them. – B. You are that person. – C. You buy a product that does it. Ask them. • If you aren't the advanced perf engineer, you need to know what to ask for – Flame graphs, latency heat maps, ftrace, PMCs, etc. • At Netflix, we're building the (C) option: Vector
  • 92. Future Work: Vector [Screenshot labels: utilization, saturation, errors, per-device breakdowns]
  • 93. Future Work: Vector • Real-time, per-second, instance metrics • On-demand CPU flame graphs, heat maps, ftrace metrics, and SystemTap metrics • Analyze from clouds to roots quickly, and from a web interface • Scalable: other teams can use it easily [Diagram: Vector alongside Atlas alerts, Atlas dashboards, Atlas metrics, Mogul, Salp, ICE, Chronos]
  • 94. In Summary • 1. Netflix architecture – Fault tolerance: ASGs, ASG clusters, Hystrix (dependency API), Zuul (proxy), Simian Army (testing) – Reduces the severity and urgency of issues • 2. Cloud Analysis – Atlas (alerts/dashboards/metrics), Chronos (event tracking), Mogul & Salp (dependency analysis), ICE (AWS usage) – Quickly narrow focus from cloud to ASG to instance • 3. Instance Analysis – Linux tools (*stat, sar, …), perf_events, ftrace, perf-tools, rdmsr, msr-cloud-tools, Vector – Read logs, profile & trace all software, read CPU counters
  • 95. References & Links https://netflix.github.io/#repo http://techblog.netflix.com/2012/01/auto-scaling-in-amazon-cloud.html http://techblog.netflix.com/2012/06/asgard-web-based-cloud-management-and.html http://www.slideshare.net/benjchristensen/performance-and-fault-tolerance-for-the-netflix-api-qcon-sao-paulo http://www.slideshare.net/adrianco/netflix-nosql-search http://www.slideshare.net/ufried/resilience-with-hystrix https://github.com/Netflix/Hystrix, https://github.com/Netflix/Zuul http://techblog.netflix.com/2011/07/netflix-simian-army.html http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html http://www.brendangregg.com/blog/2014-06-12/java-flame-graphs.html http://www.brendangregg.com/blog/2014-09-17/node-flame-graphs-on-linux.html Systems Performance: Enterprise and the Cloud, Prentice Hall, 2014 http://sourceforge.net/projects/nicstat/ perf-tools: https://github.com/brendangregg/perf-tools Ftrace: The Hidden Light Switch: http://lwn.net/Articles/608497/ msr-cloud-tools: https://github.com/brendangregg/msr-cloud-tools
  • 96. Thanks • Coburn Watson, Adrian Cockcroft • Atlas: Insight Engineering (Roy Rapoport, etc.) • Mogul: Performance Engineering (Scott Emmons, Martin Spier) • Vector: Performance Engineering (Martin Spier, Amer Ather)
  • 97. Thanks • Questions? • http://techblog.netflix.com • http://slideshare.net/brendangregg • http://www.brendangregg.com • bgregg@netflix.com • @brendangregg