Learn about the tools and methodologies we use in production at Netflix to diagnose and fix performance issues, bugs and memory leaks -- all without having to restart or change our Node application. Find out about profiling and post mortem tools such as perf events and mdb, visualizations like flame graphs and latency distributions, and how they help us keep our Node stack efficient.
15. Snapshot What's Currently Executing
Stacktrace: A stack trace is a report of the active stack frames
at a certain point in time during the execution of a program.
> console.log(ex, ex.stack.split("\n"))
ReferenceError: ex is not defined
at repl:1:13
at REPLServer.defaultEval (repl.js:132:27)
at bound (domain.js:254:14)
at REPLServer.runBound [as eval] (domain.js:267:12)
at REPLServer.<anonymous> (repl.js:279:12)
at REPLServer.emit (events.js:107:17)
at REPLServer.Interface._onLine (readline.js:214:10)
at REPLServer.Interface._line (readline.js:553:8)
at REPLServer.Interface._ttyWrite (readline.js:830:14)
at ReadStream.onkeypress (readline.js:109:10)
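A stack trace can also be captured from code without throwing. The following is a minimal sketch, assuming a V8-based runtime such as Node (`Error.captureStackTrace` is V8-specific); the helper name `captureStack` and the functions `inner` and `outer` are made up for illustration.

```javascript
// Capture the current call stack as an array of frame strings.
function captureStack() {
  const holder = {};
  // V8-specific: attaches a .stack property to holder and omits
  // captureStack itself (and anything above it) from the trace.
  Error.captureStackTrace(holder, captureStack);
  return holder.stack
    .split('\n')                   // one entry per line
    .slice(1)                      // drop the header line
    .map((line) => line.trim());   // "at fn (file:line:col)"
}

// Hypothetical call chain, only here to produce recognizable frames.
function inner() { return captureStack(); }
function outer() { return inner(); }

const frames = outer();
console.log(frames.slice(0, 2).join('\n'));
```

The first two printed frames should name `inner` and then `outer`, with file positions that depend on where the snippet is run.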
16. Two Problems
1) How to sample stack traces from a running
process?
2) How to do 1) without affecting the process?
17. Linux Perf Events
PERF(1) perf Manual PERF(1)
NAME
perf - Performance analysis tools for Linux
SYNOPSIS
perf [--version] [--help] COMMAND [ARGS]
DESCRIPTION
Performance counters for Linux are a new kernel-based subsystem
that provide a framework for all things performance analysis.
It covers hardware level (CPU/PMU, Performance Monitoring Unit)
features and software features (software counters, tracepoints)
as well.
18. Sample Stack Traces w/ perf(1)
# perf record -F 99 -p `pgrep -n node` -g -- sleep 30
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.524 MB perf.data (~22912 samples) ]
27. Flamegraph
- Each box represents a function in the stack (a stack frame)
- x-axis: percent of time on CPU
- y-axis: stack depth
- colors: random, or can be a dimension
- https://github.com/brendangregg/FlameGraph
Figure legend: v8, libc, JS, built-ins
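To make the axes concrete, here is a rough sketch of how sampled stacks turn into box widths. It folds each sample into a semicolon-joined line with a count (the "folded" text format the FlameGraph scripts consume) and computes the share of samples that pass through a given stack prefix. The sample data and the helper names `foldStacks` and `widthPercent` are invented for illustration; this is not the FlameGraph implementation.

```javascript
// Fold sampled stacks (root-first arrays of frame names) into a
// Map of "a;b;c" -> number of samples with exactly that stack.
function foldStacks(samples) {
  const counts = new Map();
  for (const frames of samples) {
    const key = frames.join(';');
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Width of a box on the x-axis: percent of all samples whose stack
// starts with the given prefix (the box's frame and its ancestors).
function widthPercent(counts, prefix) {
  let total = 0;
  let hits = 0;
  for (const [key, n] of counts) {
    total += n;
    if (key === prefix || key.startsWith(prefix + ';')) hits += n;
  }
  return total === 0 ? 0 : (100 * hits) / total;
}

// Hypothetical samples, as if taken at a fixed frequency.
const samples = [
  ['main', 'handleRequest', 'parseJson'],
  ['main', 'handleRequest', 'parseJson'],
  ['main', 'handleRequest', 'render'],
  ['main', 'gc'],
];

const folded = foldStacks(samples);
for (const [stack, n] of folded) console.log(`${stack} ${n}`);
console.log(widthPercent(folded, 'main;handleRequest')); // 75
```

Three of the four samples pass through `handleRequest`, so its box would span 75% of the width, sitting one level above `main` on the y-axis.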
44. - Chafin, R. "Pioneer F & G Telemetry and Command Processor Core Dump
Program." JPL Technical Report XVI, no. 32-1526 (1971): 174.
"The method described in this article was designed
to provide a core dump... with a minimal impact
on the spacecraft... as the resumption of data
acquisition from the spacecraft is the highest
priority."
45. Core Dumps: A Brief History
- Magnetic core memory
- Dump out the contents of "core" memory for debugging
- "Core dump" was born
- Initially printed on paper!
- Postmortem debugging was born!
47. Production Constraints
- Uptime is critical
- Not easily reproducible
- Can't simulate environment
- Resume normal operations ASAP
50. Node Post Mortem Tooling
- Netflix uses Linux in Prod
- Linux: Work in progress
  - https://github.com/tjfontaine/lldb-v8
  - https://github.com/indutny/llnode
- Solaris: Full featured, compatible with Linux cores
  - https://github.com/joyent/mdb_v8
52. Socks & Duct Tape: Set Up a Debug Solaris Instance
EC2: http://omnios.omniti.com/wiki.php/Installation#IntheCloud
VM: http://omnios.omniti.com/wiki.php/Installation#Quickstart
75. Memory Leak Strategy
- Look at objects on heap for suspicious objects
- Take successive core dumps and compare object counts
- Growing object counts are likely leaking
- Inspect object for more context
- Walk reverse references to find root object
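The "compare object counts" step can be sketched as follows. This assumes the per-type object counts from each core dump have already been extracted into plain objects; the extraction itself is not shown, and the counts and the helper name `growingTypes` are invented for illustration.

```javascript
// Given successive snapshots (constructor name -> object count),
// return the types whose count rose in every successive snapshot,
// largest total growth first.
function growingTypes(snapshots) {
  if (snapshots.length < 2) return [];
  const first = snapshots[0];
  const last = snapshots[snapshots.length - 1];
  return Object.keys(last)
    .filter((name) =>
      snapshots.every((snap, i) =>
        i === 0 || (snap[name] || 0) > (snapshots[i - 1][name] || 0)))
    .map((name) => ({ name, first: first[name] || 0, last: last[name] }))
    .sort((a, b) => (b.last - b.first) - (a.last - a.first));
}

// Hypothetical counts taken from three successive dumps.
const snapshots = [
  { Module: 120, Buffer: 40, IncomingMessage: 12 },
  { Module: 4120, Buffer: 38, IncomingMessage: 15 },
  { Module: 9120, Buffer: 41, IncomingMessage: 11 },
];

const suspects = growingTypes(snapshots);
console.log(suspects);
```

Only `Module` grows in every snapshot here, so it is the type to inspect first. Types that fluctuate, like `Buffer` in this made-up data, are filtered out.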
90. Spot the Leak
var cache = {};
function checkCache(someModule) {
  var mod = cache[someModule];
  if (!mod) {
    try {
      mod = require(someModule);
      cache[someModule] = mod;
      return mod;
    } catch (e) {
      // Module could be client only, must catch.
      // Should cache the fact we caught an exception here.
      return {};
    }
  }
  return mod;
}
91. Root Cause
- Node caches metadata for each module
- If the require process throws an exception, the module metadata is leaked (bug?)
- A client-side module meant we were throwing during every request, and not caching the fact that we had tried to require it
- Each request leaks 3+ module metadata objects
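One possible fix, following the hint on the "Spot the Leak" slide, is to cache the failure as well so the load is attempted only once per module. This is a sketch, not the code that was deployed: the factory `makeCheckCache`, the `FAILED` sentinel, and the injectable `load` parameter are assumptions added so the example can run without a real missing module.

```javascript
// Sentinel recorded for modules that failed to load.
const FAILED = {};

// `load` stands in for require(); injecting it is an assumption made
// so the behavior can be demonstrated in isolation.
function makeCheckCache(load) {
  const cache = new Map();
  return function checkCache(someModule) {
    if (!cache.has(someModule)) {
      try {
        cache.set(someModule, load(someModule));
      } catch (e) {
        // Remember the failure instead of retrying on every request.
        cache.set(someModule, FAILED);
      }
    }
    const mod = cache.get(someModule);
    return mod === FAILED ? {} : mod;
  };
}

// Hypothetical loader that counts attempts and always throws,
// imitating a client-only module that cannot load on the server.
let attempts = 0;
const checkCache = makeCheckCache((name) => {
  attempts += 1;
  throw new Error(`Cannot find module '${name}'`);
});

for (let i = 0; i < 1000; i++) checkCache('client-only-module');
console.log(attempts); // 1
```

In an application the loader would be `require` itself, for example `makeCheckCache(require)`. The point is that the throwing path runs once per module name rather than once per request.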
92. Memory Leaks
- Take successive core dumps (gcore(1))
- Compare object counts (::findjsobjects)
- Growing objects are likely leaking
- Inspect object for more context (::jsprint)
- Walk reverse references to find root obj (::findjsobjects -r)
94. More State than Just Logs
- Detailed stack trace (::jsstack)
- Function args for each frame (::jsstack -vn0)
- Get state of any object and its provenance (::jsprint, ::jsconstructor)
- Get source code of any function (::jssource)
- Find arbitrary JS objects (::findjsobjects)
- Unmodified Node binary!
97. Production Debugging
- Runtime Performance
  - CPU profiling/flame graphs
- Runtime Crashes
  - Inspect program state with core dumps and mdb
- Memory leaks
  - Analyze objects and references with core dumps and mdb
99. Epilogue: State of Tooling
- Join the Working Group: https://github.com/nodejs/post-mortem
- Help make mdb_v8 cross platform: https://github.com/joyent/mdb_v8
- Contribute to https://github.com/tjfontaine/lldb-v8 and https://github.com/indutny/llnode
100. Acknowledgements
- mdb_v8
  - Dave Pacheco, TJ Fontaine, Julien Gilli, Bryan Cantrill
- CPU Profiling/Flamegraphs
  - Brendan Gregg, Google V8 team, Ali Ijaz Sheikh
- Linux Perf
  - Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Jiri Olsa, Peter Zijlstra
- lldb-v8
  - TJ Fontaine
- llnode
  - Fedor Indutny
103. Citations
- Slides 29-32 used with permission from "Java Mixed-Mode Flame Graphs", Brendan Gregg, Oct 2015
- Slide 26 used with permission from http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html