Using the big guns
Advanced OS performance tools and their more basic alternatives for
tackling production issues
Nikolai Savvinov, Deutsche Bank
About me
• worked with databases for 19 years (started with MySQL)
• working with Oracle database since approx. 2005
• Oracle database engineer, Deutsche Bank Moscow/London 2007-now
• Last 10 years, specialize in database performance/internals
• blog: savvinov.com, twitter: @oradiag
About this presentation
• 50/50 overview of tools / practical use cases
• All examples based on Oracle Linux 6 (kernel 4.1.12),
Exadata X6-2 EF 3-node cluster
• All opinions are my own and not of my employer (or Oracle corp)
• Caveat emptor
Why use OS performance tools?
• Database instruments things that are expected to be slow
• Also, bugs and blind spots
• Database doesn’t sit in a vacuum, there isn’t always clear-cut
separation between application/database/OS/hw layers
• Better understand Unix/Storage/Network when jointly
troubleshooting issues
What kind of database prod issues are solved with
OS performance tools
• Memory issues
• low-level memory leaks
• memory fragmentation
• NUMA
• swapping
• Kernel-level locking
• I/O issues
• e.g. suboptimal low-level I/O settings
• Filesystem issues
• e.g. too many files in a directory because of a bug that caused excessive tracing
• Network issues
• e.g. poor TCP throughput due to congestion events
Overview of Linux performance tools
Basic process level tools
• pidstat
• -u for CPU (default)
• -d for disk I/O
• -r for memory
• ps
• -e to show all processes
• -o to pick the output fields you want
• e.g. wchan, state, rss
• most useful when used “ASH-style” in ExaWatcher / OSWatcher
• especially when combined with “OEM-style” visualization
WCHAN
• WCHAN: the outermost system call where the process is waiting
• Zero-overhead stack profiling tool that is always on (w/ OSWatcher)
• Low frequency (every few seconds)
• Sometimes doesn’t go all the way for some reason
• Off CPU only
• Kernel-space only
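A minimal sketch of "ASH-style" WCHAN sampling, assuming a procps-style `ps` that accepts `-o state,wchan:32,comm`. The helper names (`parse_ps`, `top_wchan`, `sample_once`) are made up for illustration and are not part of ExaWatcher or OSWatcher. The sketch takes one `ps` snapshot and counts WCHAN values for processes that are actually blocked.

```python
import subprocess
from collections import Counter

# Assumed ps invocation: state, wide WCHAN column, command name.
PS_CMD = ["ps", "-e", "-o", "state,wchan:32,comm"]


def parse_ps(text):
    """Turn ps output (with its header line) into (state, wchan, comm) tuples."""
    rows = []
    for line in text.splitlines()[1:]:      # skip the header line
        parts = line.split(None, 2)         # comm may contain spaces
        if len(parts) == 3:
            rows.append(tuple(parts))
    return rows


def top_wchan(rows, skip_states=("S",)):
    """Count WCHAN values, ignoring the given states and '-' (not waiting)."""
    counts = Counter()
    for state, wchan, _comm in rows:
        if state in skip_states or wchan == "-":
            continue
        counts[wchan] += 1
    return counts.most_common()


def sample_once():
    """Take one snapshot of the live system (hypothetical helper)."""
    out = subprocess.run(PS_CMD, capture_output=True, text=True, check=True)
    return parse_ps(out.stdout)
```

Calling `top_wchan(sample_once())` every few seconds and keeping a timestamp with each result gives a low-frequency, off-CPU, kernel-space profile of the kind described above.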
WCHAN example
• Removed state=‘S’ (interruptible sleep)
• cma_acquire_dev is biggest
• CMA here is the RDMA connection manager (RDMA CM), not the contiguous memory allocator
• called inside rdma_bind_addr
• in drivers/infiniband/core/cma.c
• RDMA is core IB technology
• Exafusion relies on RDMA
• RDMA is not NUMA-friendly
DIY-profilers
• debugger command dumping a stack with a loop around it
• not always stable/safe/predictable
• pstack (Tanel Poder’s “poor man’s profiler”)
• /proc/<pid>/stack (Luca Canali’s kstacksampler)
• oradebug short_stack
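A rough sketch of the /proc/<pid>/stack approach. It is not Luca Canali's kstacksampler, only an illustration of the same idea. It assumes the file is readable (normally root only) and that each frame looks like `[<address>] function+0xoff/0xlen`. The names `fold_stack` and `sample_kernel_stacks` are hypothetical.

```python
import re
import time
from collections import Counter

# Function name follows the "]" and ends at "+" or whitespace.
FRAME_RE = re.compile(r"\]\s+([^\s+]+)")


def fold_stack(text):
    """Fold one /proc/<pid>/stack dump into 'outermost;...;innermost'."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match and not match.group(1).startswith("0x"):
            frames.append(match.group(1))   # skip raw-address terminator lines
    # The file lists the innermost frame first; flame graph tools want it last.
    return ";".join(reversed(frames))


def sample_kernel_stacks(pid, samples=10, interval=0.1):
    """Sample the kernel stack of one process and count identical stacks."""
    counts = Counter()
    for _ in range(samples):
        try:
            with open(f"/proc/{pid}/stack") as handle:
                folded = fold_stack(handle.read())
        except OSError:
            break                           # process gone or access denied
        counts[folded or "[no kernel stack]"] += 1
        time.sleep(interval)
    return counts
```

The folded "a;b;c count" form is what flame graph scripts typically consume, so the counter can be written out one line per stack for visualization.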
Profilers/tracers in Linux
• perf
• systemtap
• dtrace for Oracle Linux
• ftrace
• lttng
https://sourceware.org/systemtap/wiki/SystemtapDtraceComparison
http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html
Scope of profiling tools
• The output from a profiler does not always represent wall clock time
• The standard (on-CPU) profiling is only valid for CPU-intensive loads
• Off-CPU profiling can be tricky
(e.g. perf requires CONFIG_SCHEDSTATS=y which is not the case for
some Linux distributions), DIY-profilers or WCHAN can help
• User-space vs kernel-space: a profiler can only be giving you one half
of the picture, and not necessarily the one you need
• Be sure to pick the right profiler for the task
Safety considerations
• ptrace-based tools (strace, gdb, pstack etc.) are less safe
• systemtap had some teething problems in early days,
considered relatively safe now, but still has issues
• Oracle Linux dtrace, perf and bcc(bpf) are probably the safest
• It’s a good idea to build a large arsenal of tools
• Use UAT as much as possible: even if the issue doesn’t reproduce there in
its entirety, some aspects of it often can be reproduced
• If the problem can’t be reproduced on UAT, one can try to reproduce it as an
isolated mock-up activity on production, so if tracing crashes it, no biggie
• Balance of risk: side effects of diagnostics vs issue going undiagnosed
Case study: memory fragmentation
Step 1: high-level picture
• Problems started shortly after 18c upgrade
• The main symptom experienced by users was connection delays
• The obvious things to check were listener, cluster database alert and
kernel logs
• On the system level, somewhat elevated sys CPU was noticed
• WCHAN analysis didn’t reveal any interesting waits, but showed
extended periods of busy CPU for some of the listeners
• pidstat confirmed there were periods of 100% sys CPU for listener
processes
Step 2: getting stacks
• We know what processes we want
• We are interested in on-CPU samples
• We are interested in kernel-space stacks
• Two obvious choices: /proc/<pid>/stack based sampling or
“proper” stack profiling using perf
• Both are sufficiently safe, but perf can cause noticeable overhead, so
we started with kstacksampler, but then also used perf
Step 3 (optional here): visualization
Step 4: make sense out of results
• Identify biggest branch(es)
• Identify stack structure, e.g. in this example
• Oracle Network Session layer calls
• Oracle Network Transport calls
• VFS syscalls
• TCP syscalls
• page allocation
• direct compaction
• page migration
• Identify key elements
• Read relevant parts of documentation
• Look at the source code
• Read source comments + git blame/history
Sidenote: compaction and fragmentation
• Compaction is an algorithm for memory defragmentation
• Normally kernel shouldn’t care if memory is fragmented
• Some pieces do, however (like device drivers)
• Apparently, TCP implementation also relies on contiguous allocations
• A chunk is 2^N contiguous pages (4 kB each), where N is the allocation order
• When initial allocation attempt fails, there are a number of possible
fallback strategies (depending on GFP flags)
• One common scenario is direct compaction
• While doing compaction, the process will be unresponsive
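To make the "order" arithmetic concrete, here is a small sketch that converts an allocation order to bytes and reads free-chunk counts per order from /proc/buddyinfo. It assumes 4 kB pages and the usual `Node N, zone NAME c0 c1 ...` line layout. The function names are illustrative only.

```python
PAGE_SIZE = 4096  # assumed 4 kB pages


def chunk_bytes(order, page_size=PAGE_SIZE):
    """Size in bytes of one contiguous chunk of the given order."""
    return (1 << order) * page_size


def parse_buddyinfo(text):
    """Return {(node, zone): [free chunk count for order 0, 1, 2, ...]}."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected layout: Node 0, zone Normal c0 c1 c2 ...
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        result[(node, parts[3])] = [int(count) for count in parts[4:]]
    return result


def free_at_or_above(counts, order):
    """Number of free chunks that could satisfy a request of this order."""
    return sum(counts[order:])


if __name__ == "__main__":
    with open("/proc/buddyinfo") as handle:
        for key, counts in parse_buddyinfo(handle.read()).items():
            print(key, "free chunks of order >= 4:", free_at_or_above(counts, 4))
```

When the count of free chunks at or above the requested order is zero or close to it, a high-order allocation is likely to fall back to one of the slower paths such as direct compaction.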
Step 5: finding solution
Step 6: getting additional detail
sudo perf probe --add 'sk_stream_alloc_skb sk=%di size=%si gfp=%dx force=%cx'
sudo perf probe --add '__alloc_pages_nodemask gfp_mask=%di order=%si zonelist=%dx zonemask=%cx'
sudo perf record -e probe:sk_stream_alloc_skb --filter 'size>0x1000' \
  -e probe:__alloc_pages_nodemask --filter 'order>4' -agR sleep 100
============================================================
tgtd 5216 [004] 5706280.274976: probe:sk_stream_alloc_skb: (ffffffff816232d0)
sk=0xffff8804959b9800 size=0x1b30 gfp=0xd0 force=0x40
ibportstate 17206 [003] 5706280.311569: probe:__alloc_pages_nodemask:
(ffffffff81191fa0) gfp_mask=0x2c0 order=0x6 zonelist=0xffff88407ffd8e00
zonemask=0x0
7fff81193fa1 __alloc_pages_nodemask ([kernel.kallsyms])
7fff81069641 x86_swiotlb_alloc_coherent ([kernel.kallsyms])
7fffa01bd88f mlx4_alloc_icm ([kernel.kallsyms])
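The probe arguments in the output are hexadecimal, so a small decoder helps when reading many such lines. This is only a sketch: `parse_probe_fields` is a made-up helper, and the 4 kB page size is an assumption.

```python
import re

# key=0x... pairs as printed for the probe arguments
FIELD_RE = re.compile(r"(\w+)=(0x[0-9a-fA-F]+)")


def parse_probe_fields(line):
    """Extract key=0x... pairs from a perf probe output line as integers."""
    return {key: int(value, 16) for key, value in FIELD_RE.findall(line)}


skb = parse_probe_fields("sk=0xffff8804959b9800 size=0x1b30 gfp=0xd0 force=0x40")
alloc = parse_probe_fields(
    "gfp_mask=0x2c0 order=0x6 zonelist=0xffff88407ffd8e00 zonemask=0x0"
)

print("skb size in bytes:", skb["size"])              # 6960
print("pages requested:", 1 << alloc["order"])        # 64
print("bytes requested:", (1 << alloc["order"]) * 4096)  # 262144 with 4 kB pages
```

Read this way, the second sample is an order-6 request, i.e. 64 contiguous pages, which is the kind of allocation that becomes hard to satisfy on a fragmented system.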
Step 7: digging a little bit deeper…
Summary
• OS tools can be very useful or even necessary for troubleshooting
complex cluster or database issues
• Much can be done with basic risk-free OS tools like ps
• Some tracing/profiling tools are safer than others
• Low-level OS tools become safer over time, but can still carry risk
• There are ways to minimize the risk
• Weigh the risk of side effects against the risk of not solving the issue
Credits
• Brendan Gregg – Linux performance expert
• Tanel Poder, Luca Canali, Frits Hoogland, Andrey Nikolaev,
Alexander Anokhin – pioneered use of OS low-level tools in Oracle
troubleshooting
• Thanks UKOUG organizers for the opportunity!
Bonus slides
WCHAN example 2: NUMA balancing
WCHAN example 3: inode cache depletion
Systemtap safety
“In practice, there are several weak points in systemtap and the
underlying kprobes system at the time of writing. Putting probes
indiscriminately into unusually sensitive parts of the kernel (low level
context switching, interrupt dispatching) has reportedly caused crashes
in the past. We are fixing these bugs as they are found, and
constructing a probe point “blacklist”, but it is not complete”
Frank Ch. Eigler, Systemtap tutorial, November 2019
https://sourceware.org/systemtap/tutorial.pdf
Bcc (bpf) safety
It is unlikely that the programs should cause the kernel to crash, loop or
become unresponsive because they run in a safe virtual machine in the
kernel.
Hanging the system with a simple “cat”
Other low-level OS tools
• tcpdump
• iosnoop
Tcpdump for network performance
• SQL*Net tracing on client/server side doesn’t always reveal problem
• Network-side metrics don’t always reveal problem
• Various pings almost never reveal the problem
• Many TCP performance problems have to do with congestion control
• A variety of tools for analyzing the dumps, e.g. Wireshark
• Can dump to ASCII and use own tools
Our case
• Log file sync delays, sometimes spiking to tens of seconds
• Production synchronously replicated to standby via DataGuard
• Synchronicity was essential (Max Availability)
• Pings show nothing
• netops say the network is fine
Analysis
• We did a tcpdump capture at both ends
• Tcpdump shows congestion window (bytes in flight)
• It was shrinking in response to congestion events
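Bytes in flight are not printed by tcpdump directly; they can be derived from the sequence and ACK numbers in a sender-side capture. A minimal sketch follows, assuming the capture has already been reduced to time-ordered events with relative sequence numbers and no wraparound. The event format and the function name are hypothetical.

```python
def bytes_in_flight(events):
    """Derive bytes in flight over time from a sender-side capture.

    events: iterable of (timestamp, kind, number), ordered by time, where
      kind "data": number is the relative sequence number just past the
                   last byte of a segment sent by the local side
      kind "ack":  number is the relative ACK number received from the peer
    Returns a list of (timestamp, bytes_in_flight).
    """
    highest_sent = 0
    highest_acked = 0
    series = []
    for timestamp, kind, number in events:
        if kind == "data":
            # max() means retransmissions do not inflate the estimate
            highest_sent = max(highest_sent, number)
        elif kind == "ack":
            highest_acked = max(highest_acked, number)
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
        series.append((timestamp, max(0, highest_sent - highest_acked)))
    return series
```

Plotting the resulting series over time shows the sawtooth pattern of a sender backing off: the peak value drops sharply after each congestion event and then grows again.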
Remediation
• Remediation: netops removed bottlenecks, optimized QoS,
top users working on improving colocality of their estate
• Monitoring/alerting: how do we define thresholds for packet loss?
• Zero-loss networks are expensive
• In a non zero-loss network, what level of packet loss is acceptable?
• Relationship between throughput and packet loss in TCP was approximated
by Mathis in 1997
• $\text{throughput} \le \dfrac{\text{MSS}}{\text{latency} \cdot \sqrt{\text{packet loss}}}$
• Assuming 1 ms latency, this meant $\dfrac{1.5}{\sqrt{\text{packet loss}}}$ MB/s, or 0.25% packet loss for 30 MB/s
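The arithmetic behind the 0.25% figure can be checked with a few lines of code. This sketch uses the simplified Mathis bound, throughput ≤ MSS / (latency · √loss), and assumes an MSS of about 1500 bytes, which is what the 1.5 MB/s figure at 1 ms implies. Any constant factor of order one is left out, and the function names are illustrative.

```python
import math


def mathis_throughput(mss_bytes, latency_s, loss):
    """Upper bound on TCP throughput in bytes/s for a given loss fraction."""
    return mss_bytes / (latency_s * math.sqrt(loss))


def max_loss_for(target_bytes_per_s, mss_bytes, latency_s):
    """Largest packet loss fraction that still permits the target throughput."""
    return (mss_bytes / (latency_s * target_bytes_per_s)) ** 2


if __name__ == "__main__":
    mss = 1500        # bytes, assumed
    latency = 1e-3    # 1 ms
    loss = max_loss_for(30e6, mss, latency)
    print(f"acceptable loss for 30 MB/s: {loss:.4%}")
    print(f"bound at that loss: {mathis_throughput(mss, latency, loss) / 1e6:.1f} MB/s")
```

Because loss enters under a square root, halving the packet loss raises the throughput bound by only about 41%, while halving the latency doubles it.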
Iosnoop
• gives a high resolution picture of I/O usage (low-res can be obtained from iotop)
• ftrace based
• reported safe
• observed high performance overhead
Our case
• General slowness on one of the nodes during certain periods
• Nothing helpful in AWR/ASH
• ExaWatcher iostat showed I/O spikes on another node
• High “reliable message” waits on that other node
• iotop didn’t reveal the culprit
Analysis & remediation
• iosnoop told us the high I/O was from admin f/s housekeeping
• housekeeping had too much work due to excessive tracing
• processes were slow due to slow writes to trace files
• slowness propagated to another node via inter-node communication (“reliable message”)
• Remediation: excessive tracing patched, housekeeping job optimized, old files
moved manually, scheduling clash resolved
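Finding the culprit in per-I/O output is mostly a matter of aggregating by command. A small sketch, assuming iosnoop's default column layout of `COMM PID TYPE DEV BLOCK BYTES LATms` (treat that layout as an assumption and check it against your version). `io_by_command` is a hypothetical helper.

```python
from collections import defaultdict


def io_by_command(text):
    """Sum I/O count, bytes and latency per command from iosnoop-style output.

    Returns a list of (command, totals) sorted by bytes, largest first.
    """
    totals = defaultdict(lambda: {"ios": 0, "bytes": 0, "lat_ms": 0.0})
    for line in text.splitlines():
        # Split from the right so a command name with spaces stays intact.
        parts = line.rsplit(None, 6)
        if len(parts) != 7:
            continue
        comm, _pid, _type, _dev, _block, nbytes, lat_ms = parts
        try:
            nbytes, lat_ms = int(nbytes), float(lat_ms)
        except ValueError:
            continue                        # header or status line
        entry = totals[comm]
        entry["ios"] += 1
        entry["bytes"] += nbytes
        entry["lat_ms"] += lat_ms
    return sorted(totals.items(), key=lambda item: item[1]["bytes"], reverse=True)
```

Sorting by bytes puts the heaviest writer or reader on top. Sorting by summed latency instead would highlight the processes that suffer most from the contention.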

 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 

Using the big guns: Advanced OS performance tools for troubleshooting database issues

  • 1. Using the big guns Advanced OS performance tools and their more basic alternatives for tackling production issues Nikolai Savvinov, Deutsche Bank
  • 2. About me • worked with databases for 19 years (started with MySQL) • working with Oracle database since approx. 2005 • Oracle database engineer, Deutsche Bank Moscow/London 2007-now • Last 10 years, specialize in database performance/internals • blog: savvinov.com, twitter: @oradiag
  • 3. About this presentation • 50/50 overview of tools / practical use cases • All examples based on Oracle Linux 6 (kernel 4.1.12), Exadata X6-2 EF 3-node cluster • All opinions are my own and not of my employer (or Oracle corp) • Caveat emptor
  • 4. Why use OS performance tools? • Database instruments things that are expected to be slow • Also, bugs and blind spots • Database doesn’t sit in a vacuum, there isn’t always clear-cut separation between application/database/OS/hw layers • Better understand Unix/Storage/Network when jointly troubleshooting issues
  • 5. What kind of database prod issues are solved with OS performance tools • Memory issues • low-level memory leaks • memory fragmentation • NUMA • swapping • Kernel-level locking • I/O issues • e.g. suboptimal low-level I/O settings • Filesystem issues • e.g. too many files in a directory because of a bug leading to excessive tracing • Network issues • e.g. poor TCP throughput due to congestion events
  • 6. Overview of Linux performance tools
  • 7. Basic process level tools • pidstat • -u for CPU (default) • -d for disk I/O • -r for memory • ps • -e to do output for all processes • -o to pick fields you like • e.g. wchan, state, rss • most useful when used “ASH-style” in ExaWatcher / OSWatcher • especially when combined with “OEM-style” visualization
  • 8. WCHAN • WCHAN: the outermost system call where the process is waiting • Zero-overhead stack profiling tool that is always on (w/ OSWatcher) • Low frequency (every few seconds) • Sometimes doesn’t go all the way for some reason • Off CPU only • Kernel-space only
  • 9. WCHAN example • Removed state=‘S’ (interruptible sleep) • cma_acquire_dev is biggest • CMA = contiguous memory allocation • called inside rdma_bind_addr • in drivers/infiniband/core/cma.c • RDMA is core IB technology • Exafusion relies on RDMA • RDMA is not NUMA-friendly
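The WCHAN analysis described above amounts to counting wait channels across periodic ps samples while dropping processes in interruptible sleep. The following is a minimal sketch of that aggregation, assuming each sample line carries three whitespace-separated fields (state, wchan, command), as produced by something like `ps -e -o state,wchan:32,comm` with the header removed. The function name `top_wchans` and the sample rows are illustrative, not taken from the talk.

```python
from collections import Counter

def top_wchans(samples, skip_states=("S",)):
    """Count WCHAN values across ps samples, ignoring the given states.

    Assumed input: one line per process sample, "state wchan command".
    """
    counts = Counter()
    for line in samples:
        fields = line.split(None, 2)
        if len(fields) < 2:
            continue  # blank or truncated line
        state, wchan = fields[0], fields[1]
        if state[0] in skip_states or wchan == "-":
            continue  # idle sleepers and running processes carry no wait info
        counts[wchan] += 1
    return counts.most_common()

# Illustrative sample rows (made up for this sketch)
samples = [
    "S poll_schedule_timeout oracle",
    "D cma_acquire_dev oracle",
    "D cma_acquire_dev oracle",
    "R - oracle",
    "D rdma_bind_addr tnslsnr",
]
print(top_wchans(samples))
```

Feeding it the concatenated samples of an ExaWatcher or OSWatcher ps archive would give a rough "top waits" list for the chosen time window.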
  • 10. DIY-profilers • debugger command dumping a stack with a loop around it • not always stable/safe/predictable • pstack (Tanel Poder’s “poor man’s profiler”) • /proc/<pid>/stack (Luca Canali’s kstacksampler) • oradebug short_stack
  • 11. Profilers/tracers in Linux • perf • systemtap • dtrace for Oracle Linux • ftrace • lttng https://sourceware.org/systemtap/wiki/SystemtapDtraceComparison http://www.brendangregg.com/blog/2015-07-08/choosing-a-linux-tracer.html
  • 12. Scope of profiling tools • The output from a profiler does not always represent wall clock time • The standard (on-CPU) profiling is only valid for CPU-intensive loads • Off-CPU profiling can be tricky (e.g. perf requires CONFIG_SCHEDSTATS=y which is not the case for some Linux distributions), DIY-profilers or WCHAN can help • User-space vs kernel-space: a profiler can only be giving you one half of the picture, and not necessarily the one you need • Be sure to pick the right profiler for the task
  • 13. Safety considerations • ptrace-based tools (strace, gdb, pstack etc.) are less safe • systemtap had some teething problems in early days, considered relatively safe now, but still has issues • Oracle Linux dtrace, perf and bcc(bpf) are probably the safest • It’s a good idea to build a large arsenal of tools • Use UAT as much as possible: even if the issue doesn’t reproduce itself in all its entirety, doesn’t mean some aspects of it can’t be reproduced • If the problem can’t be reproduced on UAT, one can try to reproduce it as isolated mockup activity on production – so if tracing crashes it, no biggie • Balance of risk: side effects of diagnostics vs issue going undiagnosed
  • 15. Step 1: high-level picture • Problems started shortly after 18c upgrade • The main symptom experienced by users was connection delays • The obvious things to check were listener, cluster database alert and kernel logs • On the system level, somewhat elevated sys CPU was noticed • WCHAN analysis didn’t reveal any interesting waits, but showed extended periods of busy CPU for some of the listeners • pidstat confirmed there were periods of 100% sys CPU for listener processes
  • 16. Step 2: getting stacks • We know what processes we want • We are interested in on-CPU samples • We are interested in kernel-space stacks • Two obvious choices: /proc/<pid>/stack based sampling or “proper” stack profiling using perf • Both are sufficiently safe, but perf can cause noticeable overhead, so we started with kstacksampler, but then also used perf
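The /proc/<pid>/stack approach can be sketched as follows. This is not kstacksampler itself, only an illustration of the same idea: read the kernel stack of a process repeatedly, reduce each sample to a semicolon-joined, root-first string (the "folded" form that flame graph scripts commonly accept), and count identical stacks. The function names and the sample stack text (including the addresses) are hypothetical. Reading /proc/<pid>/stack normally requires root.

```python
import re
import time
from collections import Counter

# Matches lines such as "[<ffffffff8116a0bd>] try_to_compact_pages+0xdd/0x1f0"
FRAME = re.compile(r"^\[<[0-9a-fx]+>\]\s+(\S+)")

def fold_stack(stack_text):
    """Turn one /proc/<pid>/stack dump into 'outermost;...;innermost'."""
    frames = []
    for line in stack_text.splitlines():
        m = FRAME.match(line.strip())
        if not m:
            continue
        symbol = m.group(1).split("+")[0]  # drop +offset/size
        if symbol.startswith("0x"):
            continue  # unresolved address, e.g. the 0xffffffffffffffff terminator
        frames.append(symbol)
    # The file lists the innermost frame first, so reverse it
    return ";".join(reversed(frames))

def sample_kernel_stacks(pid, samples=100, interval=0.1):
    """Sample /proc/<pid>/stack and count identical folded stacks."""
    counts = Counter()
    for _ in range(samples):
        with open(f"/proc/{pid}/stack") as f:
            counts[fold_stack(f.read())] += 1
        time.sleep(interval)
    return counts

# Illustrative stack text (addresses are made up for this sketch)
example = """\
[<ffffffff811a2b10>] migrate_pages+0x1f0/0x8e0
[<ffffffff81169a2a>] compact_zone+0x3da/0x8f0
[<ffffffff8116a0bd>] try_to_compact_pages+0xdd/0x1f0
[<ffffffff81192f3b>] __alloc_pages_nodemask+0x60b/0xb70
[<ffffffffffffffff>] 0xffffffffffffffff
"""
print(fold_stack(example))
```

The sampling interval is a trade-off: a short interval gives better resolution but adds overhead on the sampled host.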
  • 17. Step 3 (optional here): visualization
  • 18. Step 4: make sense out of results • Identify biggest branch(es) • Identify stack structure, e.g. in this example • Oracle Network Session layer calls • Oracle Network Transport calls • VFS syscalls • TCP syscalls • page allocation • direct compaction • page migration • Identify key elements • Read relevant parts of documentation • Look at the source code • Read source comments + git blame/history
  • 19. Sidenote: compaction and fragmentation • Compaction is an algorithm for memory defragmentation • Normally kernel shouldn’t care if memory is fragmented • Some pieces do, however (like device drivers) • Apparently, TCP implementation also relies on contiguous allocations • A chunk is 2^N pages (4kB), N = order • When initial allocation attempt fails, there are a number of possible fallback strategies (depending on GFP flags) • One common scenario is direct compaction • While doing compaction, the process will be unresponsive
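To make the "chunk is 2^N pages" arithmetic concrete, the sketch below converts an allocation order to bytes and counts how many free chunks of at least a given order a zone still has, based on one line of /proc/buddyinfo. The line layout ("Node 0, zone Normal" followed by one free-block count per order, starting at order 0) is assumed to be the usual one; the helper names and the sample numbers are illustrative.

```python
PAGE_SIZE = 4096  # bytes per page, as stated on the slide

def chunk_bytes(order):
    """Size in bytes of a contiguous chunk of 2**order pages."""
    return PAGE_SIZE << order

def parse_buddyinfo_line(line):
    """Return (node, zone, counts), where counts[i] is free chunks of order i."""
    parts = line.split()
    node = int(parts[1].rstrip(","))
    zone = parts[3]
    counts = [int(x) for x in parts[4:]]
    return node, zone, counts

def free_chunks_at_least(counts, order):
    """Free chunks large enough to satisfy a request of the given order."""
    return sum(counts[order:])

# Illustrative line for a badly fragmented zone (numbers are made up)
line = "Node 0, zone   Normal   4381   1202    310     45      3      0      0      0      0      0      0"
node, zone, counts = parse_buddyinfo_line(line)
print(zone, chunk_bytes(6), free_chunks_at_least(counts, 6))
```

When the last value printed is zero, an order-6 request (256 kB) cannot be served from the free lists directly, which is the situation in which fallbacks such as direct compaction come into play.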
  • 20. Step 5: finding solution
  • 21. Step 6: getting additional detail
      sudo perf probe --add 'sk_stream_alloc_skb sk=%di size=%si gfp=%dx force=%cx'
      sudo perf probe --add '__alloc_pages_nodemask gfp_mask=%di order=%si zonelist=%dx zonemask=%cx'
      sudo perf record -e probe:sk_stream_alloc_skb --filter 'size>0x1000' -e probe:__alloc_pages_nodemask --filter 'order>4' -agR sleep 100
      ============================================================
      tgtd 5216 [004] 5706280.274976: probe:sk_stream_alloc_skb: (ffffffff816232d0) sk=0xffff8804959b9800 size=0x1b30 gfp=0xd0 force=0x40
      ibportstate 17206 [003] 5706280.311569: probe:__alloc_pages_nodemask: (ffffffff81191fa0) gfp_mask=0x2c0 order=0x6 zonelist=0xffff88407ffd8e00 zonemask=0x0
          7fff81193fa1 __alloc_pages_nodemask ([kernel.kallsyms])
          7fff81069641 x86_swiotlb_alloc_coherent ([kernel.kallsyms])
          7fffa01bd88f mlx4_alloc_icm ([kernel.kallsyms])
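The probe output reports arguments as hexadecimal `name=0x...` pairs, so a little post-processing helps when reading it. The sketch below parses the two event lines shown on the slide and converts the interesting fields to decimal: the requested socket buffer size, and the allocation order expressed in bytes (assuming 4 kB pages). The function name `parse_probe_line` is illustrative.

```python
import re

# name=0x<hex> pairs as printed for perf probe arguments
FIELD = re.compile(r"(\w+)=0x([0-9a-f]+)")

def parse_probe_line(line):
    """Return the probe arguments of one event line as {name: int}."""
    return {name: int(value, 16) for name, value in FIELD.findall(line)}

# The two event lines from the slide
skb_line = ("tgtd 5216 [004] 5706280.274976: probe:sk_stream_alloc_skb: "
            "(ffffffff816232d0) sk=0xffff8804959b9800 size=0x1b30 gfp=0xd0 force=0x40")
alloc_line = ("ibportstate 17206 [003] 5706280.311569: probe:__alloc_pages_nodemask: "
              "(ffffffff81191fa0) gfp_mask=0x2c0 order=0x6 "
              "zonelist=0xffff88407ffd8e00 zonemask=0x0")

skb = parse_probe_line(skb_line)
alloc = parse_probe_line(alloc_line)
print("skb size in bytes:", skb["size"])
print("allocation in bytes:", 4096 << alloc["order"])
```

In this reading, the first event is a request for a socket buffer larger than one page, and the second is an order-6 request, i.e. 64 contiguous pages.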
  • 22. Step 7: digging a little bit deeper…
  • 23. Summary • OS tools can be very useful or even necessary for troubleshooting complex cluster or database issues • Much can be done with basic risk-free OS tools like ps • Some tracing/profiling tools are safer than others • Low-level OS tools become safer over time, but can still carry risk • There are ways to minimize the risk • Weigh the risk of side effects against the risk of not solving the issue
  • 24. Credits • Brendan Gregg – Linux performance expert • Tanel Poder, Luca Canali, Frits Hoogland, Andrey Nikolaev, Alexander Anokhin – pioneered use of OS low-level tools in Oracle troubleshooting • Thanks UKOUG organizers for the opportunity!
  • 26. WCHAN example 2: NUMA balancing
  • 27. WCHAN example 3: inode cache depletion
  • 28. Systemtap safety “In practice, there are several weak points in systemtap and the underlying kprobes system at the time of writing. Putting probes indiscriminately into unusually sensitive parts of the kernel (low level context switching, interrupt dispatching) has reportedly caused crashes in the past. We are fixing these bugs as they are found, and constructing a probe point “blacklist”, but it is not complete” Frank Ch. Eigler, Systemtap tutorial, November 2019 https://sourceware.org/systemtap/tutorial.pdf
  • 29. Bcc (bpf) safety It is unlikely that the programs should cause the kernel to crash, loop or become unresponsive because they run in a safe virtual machine in the kernel.
  • 30. Hanging the system with a simple “cat”
  • 31. Other low-level OS tools • tcpdump • iosnoop
  • 32. Tcpdump for network performance • SQL*Net tracing on client/server side doesn’t always reveal problem • Network-side metrics don’t always reveal problem • Various pings almost never reveal the problem • Many TCP performance problems have to do with congestion control • A variety of tools for analyzing the dumps, e.g. Wireshark • Can dump to ASCII and use own tools
  • 33. Our case • Log file sync delays, sometimes spiking to tens of seconds • Production synchronously replicated to standby via DataGuard • Synchronicity was essential (Max Availability) • Pings show nothing • netops say the network is fine
  • 34. Analysis • We did a tcpdump capture at both ends • Tcpdump shows congestion window (bytes in flight) • It was shrinking in response to congestion events
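One common way to approximate the congestion window from a sender-side capture is to track bytes in flight: the highest sequence number sent so far minus the highest acknowledgement received. The sketch below shows that bookkeeping on already-decoded packets. The event tuple layout is an assumption of this example (it is not a tcpdump output format), and the sketch ignores sequence number wraparound, SACK and retransmissions.

```python
def bytes_in_flight(events):
    """Return [(timestamp, bytes_in_flight)] after each event.

    events: iterable of (timestamp, kind, number, length) where kind is
    "data" (number = starting sequence number of a sent segment) or
    "ack" (number = acknowledgement number received, length unused).
    """
    first_seq = None
    highest_sent = 0
    highest_acked = 0
    series = []
    for ts, kind, number, length in events:
        if kind == "data":
            if first_seq is None:
                # Baseline: nothing is outstanding before the first segment
                first_seq = number
                highest_sent = highest_acked = number
            highest_sent = max(highest_sent, number + length)
        elif kind == "ack" and first_seq is not None:
            highest_acked = max(highest_acked, number)
        if first_seq is not None:
            series.append((ts, highest_sent - highest_acked))
    return series

# Illustrative events (made up for this sketch)
events = [
    (0.000, "data", 1000, 1460),
    (0.001, "data", 2460, 1460),
    (0.002, "ack", 2460, 0),
    (0.003, "data", 3920, 1460),
]
print(bytes_in_flight(events))
```

Plotting the resulting series over time would show the shrinking described above: the peaks drop after each congestion event and then recover gradually.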
  • 35. Remediation • Remediation: netops removed bottlenecks, optimized QoS, top users working on improving colocality of their estate • Monitoring/alerting: how do we define thresholds for packet loss? • Zero-loss networks are expensive • In a non zero-loss network, what level of packet loss is acceptable? • Relationship between throughput and packet loss in TCP was approximated by Mathis in 1997 • $\text{throughput} \le \dfrac{\text{MSS}}{\text{latency} \cdot \sqrt{\text{packet loss}}}$ • Assuming 1 ms latency, this meant $\dfrac{1.5}{\sqrt{\text{packet loss}}}$ MB/s, or 0.25% for 30 MB/s
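The arithmetic behind the 0.25% figure can be checked with a few lines. The sketch below implements the simplified bound exactly as written on the slide (without the constant factor that appears in some statements of the Mathis formula) and inverts it to get the maximum tolerable loss for a target throughput. An MSS of 1500 bytes is assumed because it matches the 1.5 MB/s figure; the function names are illustrative.

```python
import math

def max_throughput(mss_bytes, latency_s, loss):
    """Upper bound on TCP throughput in bytes/s for a given loss fraction."""
    return mss_bytes / (latency_s * math.sqrt(loss))

def max_loss(mss_bytes, latency_s, target_bytes_per_s):
    """Largest loss fraction that still allows the target throughput."""
    return (mss_bytes / (latency_s * target_bytes_per_s)) ** 2

# Assumed inputs: 1500-byte MSS, 1 ms latency, 30 MB/s target
loss = max_loss(1500, 0.001, 30e6)
print(f"maximum tolerable packet loss: {loss:.4%}")
print(f"bound at that loss: {max_throughput(1500, 0.001, loss) / 1e6:.1f} MB/s")
```

Because loss enters under a square root, halving the tolerable loss buys only about 41% more throughput, while halving the latency doubles it.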
  • 36. Iosnoop • gives a high resolution picture of I/O usage (low-res can be obtained from iotop) • ftrace based • reported safe • observed high performance overhead
  • 37. Our case • General slowness on one of the nodes during certain periods • Nothing helpful in AWR/ASH • ExaWatcher iostat showed I/O spikes on another node • High “reliable message” waits on that other node • iotop didn’t reveal the culprit
  • 38. Analysis & remediation • iosnoop told us the high I/O was from admin f/s housekeeping • housekeeping had too much work due to excessive tracing • processes were slow due to slow writes to trace files • slowness propagated to another node via inter-node communication (“reliable message”) • Remediation: excessive tracing patched, housekeeping job optimized, old files moved manually, scheduling clash resolved
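Finding the culprit in per-I/O output is mostly a matter of grouping by command. The sketch below sums I/O count and bytes per command name, assuming the default seven-column iosnoop layout (COMM PID TYPE DEV BLOCK BYTES LATms) and command names without spaces. Both are assumptions of this example, and the sample rows are made up.

```python
from collections import defaultdict

def io_by_command(lines):
    """Return [(comm, io_count, total_bytes)] sorted by total bytes, descending."""
    totals = defaultdict(lambda: [0, 0])  # comm -> [count, bytes]
    for line in lines:
        fields = line.split()
        # Skip the header and anything that does not look like a data row
        if len(fields) != 7 or not fields[5].isdigit():
            continue
        comm, nbytes = fields[0], int(fields[5])
        totals[comm][0] += 1
        totals[comm][1] += nbytes
    return sorted(
        ((comm, count, nbytes) for comm, (count, nbytes) in totals.items()),
        key=lambda row: row[2],
        reverse=True,
    )

# Illustrative rows (made up for this sketch)
lines = [
    "COMM PID TYPE DEV BLOCK BYTES LATms",
    "find 4012 R 8,0 1048576 4096 0.35",
    "rm 4020 W 8,0 2097152 8192 1.10",
    "find 4013 R 8,0 1052672 4096 0.41",
    "oracle 5120 W 8,0 3145728 16384 12.50",
]
for row in io_by_command(lines):
    print(row)
```

Grouping by PID instead of command name is a one-line change and may be more useful when many processes share the same name.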