1. Advanced Performance Forensics
Uncovering the Mysteries of Performance and Scalability Incidents through Forensic Engineering
Stephen Feldman
Senior Director, Performance Engineering and Architecture
stephen.feldman@blackboard.com
2. Session Goals
The goals of today’s session are…
• Introduce the practice of performance forensics.
• Present an argument for session level analysis.
• Discuss the difference between Resources and
Interfaces.
• Present tools that can be used for performance
forensics at different layers of the architectural
stack and the client layer.
3. Definition of Performance Forensics
• The practice of collecting evidence, performing
interviews and modeling for the purpose of root
cause analysis of a performance or scalability
problem.
– In the context of a performance (response time) problem
– Discussing an individual event (session experience)
• Performance problems can be classified in two
main categories:
– Response Time Latency
– Queuing Latency
4. Performance Forensics Methodology
• Identify the Problem → Formulate a Hypothesis → Establish a Diagnosis → Root Cause
• Supporting activities along the way:
– Develop a problem statement
– Identify the most important operations that affect your business (Method-R)
– Interviewing and collecting evidence
– Data analysis, modeling and visualizing
– Sampling, simulating and session inspection
• Turn the problem statement into a diagnosis to get to root cause.
5. Putting Performance Forensics in Context
• Emphasis on the user and the user’s actions and
experiences.
– How can this be measured?
• Capture the response time experience and the
response time expectations of the user.
– Put user actions into perspective, in line with the goals
of Method-R (what is most important to the business)
• Identify the contributors to response latency
• Everyone needs to be involved
6. Measuring the Session
• When should this happen?
– When a problem statement cannot be developed from
the data you do have (evidence or interviews) and
more data needs to be collected.
• How should you go about this?
– Want to minimize disruption to the production
environment.
– Adaptive collection: Less Intensive to More Intensive
over time.
Basic Sampling → Continuous Collection → Profiling
7. Resources vs. Interfaces
• One of the most critical data points to collect
• Interfaces are critical for understanding
throughput and queuing models.
– Queuing is another cause of latency
– Also a cause of time-outs
• Resources are critical for understanding the cost
of performing a transaction.
– Core Resources: CPU, Memory and I/O
• Response Time = Service Time + Queue Time
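– Example (illustrative numbers): 40 ms of service time + 160 ms in queue
= 200 ms of response time; trimming service time and relieving the queue
are different tuning problems.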
8. The Importance of Wait Events
• Rise of Session Level Forensics
– Underlying theme with all of these tools is that the “Session” is more
important than the “System”
• Wait event tuning used to account for latency
– Exists in SQL Server (Waits and Queues) and Oracle (10046)
– Instrumentation in other tiers is not yet mature enough to represent waits this way
• Waits are statistical explanations of latency
• Each individual wait event might be deceiving, but
looking at both aggregates and outliers can explain why
a performance problem exists.
• When sampling directly, you usually have only about one hour
to act on the data.
12. Fiddler2
• Fiddler 2 measures end-to-end client responsiveness of
a web request.
• Little to no overhead (less intrusive forensics)
• Captures requests in order to present HTTP status codes, object sizes,
loading sequence, request processing time, and performance by bandwidth speed.
– Rough estimation of User Experience based on locality.
• Inspects every detail of the HTTP request
– Detailed session inspection
– Breakdown of the HTTP transaction
• Other Tools in Category: YSlow/Firebug, Charles Proxy,
LiveHTTPHeaders and IEInspector
14. Coradiant TrueSight
• Commercial tool used for passive user experience
monitoring.
• Captures page, object and session level data.
• Capable of defining Service Level Thresholds and
Automatic Incident Management.
• Used to trace back a session as if you were watching over
the user’s shoulder.
• Exceptional tool for trend analysis. (Less Intrusive)
• Primarily used in forensics as evidence for analysis.
• Other Tools in the Category: Quest User Experience and
Citrix EdgeSight
17. Log Analyzers
• Both commercial and open source tools are available to
parse and analyze http access logs.
• Provides trend data, client statistical data, http summary
information.
• Recommend using this data to study request and
bandwidth trends and to correlate them with resource
utilization graphs.
– The volume of data is very large
– Recommend working within small time slices (a rough slicing sketch follows this slide)
• Post-processing tool (No Impact to Application)
• Examples: Urchin, Summary, WebTrends, SawMill,
Surfstats and AlterWind Log Analyzer
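A rough sketch of the recommended time-slicing, assuming an Apache
common/combined-format access log (the field positions and the file name
access_log are assumptions; adjust for your log layout):

  # requests and bytes per minute, keyed on hh:mm from the timestamp field
  awk '{ split($4, t, ":"); key = t[2] ":" t[3];
         hits[key]++; bytes[key] += $10 }
       END { for (m in hits) print m, hits[m], bytes[m] }' access_log | sort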
18. JSTAT
• Low-intrusion statistics collector that provides
– Percentages of usage by each region
– Frequency/Counts of collections
– Time spent in pause state
• Can be invoked any time without restarting the JVM by
obtaining the Process ID
– Exception is on Windows when the JVM is run as a background
service
• Critical for understanding windows of stall time between
samples (a sample invocation follows this slide)
– Assume you sample every 5 seconds and observe a 3-second
pause time
– That means the application could only do useful work for 2 of those 5 seconds
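A minimal sampling sketch, assuming a Sun JDK on the path and that 12345
stands in for the real process id:

  jps                        # list local JVM process ids
  jstat -gcutil 12345 5000   # region utilization %, GC counts and GC time every 5 s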
22. -verbose:gc and -Xloggc
• JVM flags that enable GC logging
• Verbose GC logging is a low-overhead way to
collect GC data (less intrusive measurement)
– Requires a restart of the instance to run
• -XX:+PrintGCDetails is a recommended setting
to be used with:
– -XX:+PrintGCApplicationConcurrentTime
– -XX:+PrintGCApplicationStoppedTime
• Provides aggregate statistics about Pause
Times versus Working Times (a sample invocation follows this slide).
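A sketch of the full set of flags, assuming a HotSpot JVM of this era; the
application jar name is a placeholder:

  java -verbose:gc -Xloggc:gc.log \
       -XX:+PrintGCDetails \
       -XX:+PrintGCApplicationConcurrentTime \
       -XX:+PrintGCApplicationStoppedTime \
       -jar yourapp.jar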
24. IBM Pattern Modeling Tool for Java GC
• Post-processing tool used for visualizing a -verbose:gc
or -Xloggc file.
• Can make the effort of analyzing a log file
substantially easier.
• Represents pauses/stalls at particular times
• Has no effect on the application environment, as
it reads a dormant log file.
26. JHAT, JMAP and SAP Memory Analyzer
• JMap: Java Memory Map is a JVM tool that captures what is in the
heap at a given time as a heap dump (a dump-and-browse sequence is
sketched after this slide).
• Jhat: Java Heap Analysis Tool takes a heap dump and parses the data
into useful and human-digestible information about what's in the
JVM's memory.
– Provides text and OQL views into the dumped data
• SAP Memory Analyzer will visualize the same heap dumps.
• Should be run when a problem is occurring right now
– When the system is unresponsive
– When the JVM runs into continuous collections
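A minimal dump-and-browse sketch, assuming a Java 6 JDK and that 12345
stands in for the real process id:

  jmap -dump:format=b,file=heap.hprof 12345   # write a binary heap dump
  jhat heap.hprof                             # parse it, then browse http://localhost:7000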
28. ASH
• ASH: Active Session History
– Samples session activity in the system every second.
– 1 hour of history in memory for immediate access at your
fingertips
• ASH in Memory
– Collects active session data only
– A history of v$session_wait + v$session + extras, exposed through
v$active_session_history (a sample query follows this slide)
• Circular Buffer - 1M to 128M (~2% of SGA)
• Flushed every hour to disk or when buffer 2/3 full (it protects
itself so you can relax)
• Tools to Consider: Sesspack and Session Snapper
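A sketch of mining the in-memory hour of samples; the one-hour window and
the grouping are illustrative choices, not part of ASH itself:

  -- Top waits by session and statement over the last hour of ASH samples
  SELECT session_id, sql_id, event, COUNT(*) AS samples
  FROM   v$active_session_history
  WHERE  sample_time > SYSDATE - 1/24
  AND    session_state = 'WAITING'
  GROUP  BY session_id, sql_id, event
  ORDER  BY samples DESC;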
29. SQL Server Performance Dashboard
• Feature of SQL Server 2005 SP2
• Template reports that take advantage of DMVs (dynamic management views)
• Provides views into wait events
– Doesn’t link events to SQL IDs in the report
– Provides aggregate views of wait events
– Underlying DMVs: sys.dm_os_wait_stats (aggregate waits) and
sys.dm_exec_sessions (session-level data); sample queries follow this slide
• Complementary Tools: SQL Server Health and
History Tool and Quest Spotlight for SQL Server
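A sketch of querying the same DMVs directly; the TOP 10 cutoff and column
choices are illustrative:

  -- Heaviest wait types since the instance started
  SELECT TOP (10) wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
  FROM sys.dm_os_wait_stats
  ORDER BY wait_time_ms DESC;

  -- Current user sessions with accumulated CPU and I/O
  SELECT session_id, login_name, status, cpu_time, reads, writes
  FROM sys.dm_exec_sessions
  WHERE is_user_process = 1;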
31. Importance of Cost Execution Plans
• Can be generated against a database with low overhead
– Literal bind values are not needed
– Both SQL Server and Oracle can produce “Estimated Cost Plans”
(a sample request follows this slide)
• Each database uses an “Optimizer” that determines the
best path of execution of SQL
– Calculates IO, CPU and Number of Executes (Loop Conditions)
• Understanding cost operations on a particular object can
help change your tuning strategy (ex: TABLE ACCESS
BY INDEX ROWID)
• Cost is time
– Query cost refers to the estimated elapsed time, in seconds,
required to complete a query on a specific hardware
configuration.
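A sketch of requesting an estimated plan without executing the statement;
the table and predicate are placeholders:

  -- Oracle
  EXPLAIN PLAN FOR
    SELECT * FROM orders WHERE customer_id = :1;
  SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);

  -- SQL Server: subsequent statements return estimated plan rows instead of executing
  SET SHOWPLAN_ALL ON;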
32. RML and Profiler
• The RML utilities process SQL Server trace files and produce reports
showing how SQL Server is performing.
– Which application, database or login is using the most resources, and
which queries are responsible for that.
– Whether there were any plan changes for a batch during the time when
the trace was captured and how each of those plans performed.
– What queries are running slower in today's data compared to a previous
set of data
• Profiler captures statements, query counts/statistics, wait events
– Can capture and correlate profile data to Perfmon data
• Heavy overhead with both
• Other Tools to Consider: Quest Performance Analysis for SQL
Server
33. Oracle OEM and 10046
• With OEM, Oracle finally delivered a web-based
interface.
– Performance dashboard provides great historical and present
overview
– Access to ADDM and ASH simplifies job of DBA
– SQL History
• Problems
– Licensing is somewhat cost-prohibitive
– Still doesn’t provide wait events
• For 10046 you still need to run the trace yourself
and use a profiler reader like Hotsos P4.
– Sessions can be difficult to trace and capture (sample trace
commands follow this slide)
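A sketch of turning on a 10046 extended SQL trace; the sid/serial# values
are placeholders for the session under investigation:

  -- Trace your own session at level 12 (waits + binds)
  ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';
  -- ... run the suspect workload ...
  ALTER SESSION SET EVENTS '10046 trace name context off';

  -- Trace another session (10g and later)
  EXEC DBMS_MONITOR.session_trace_enable(session_id => 123, serial_num => 4567,
                                         waits => TRUE, binds => TRUE);
  EXEC DBMS_MONITOR.session_trace_disable(session_id => 123, serial_num => 4567);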
35. Want More?
• Check out my blog for postings of the
presentation:
http://sevenseconds.wordpress.com
• To view my resources and references for this
presentation, visit www.scholar.com
• Simply click “Advanced Search” and search by
sfeldman@blackboard.com and tag: ‘bbworld08’
or ‘forensics’