This session will cover advanced techniques in troubleshooting the Citrix NetScaler Appliance using tools such as Citrix TaaS, IPMI, nsconmsg, Wireshark and log analysis. We will review the usage of these tools along with case studies showing how to best troubleshoot common issues seen in operating Citrix NetScaler Appliances.
What you will learn
- Various tools available to troubleshoot issues and how to use them to isolate NetScaler issues
- Common deployment problems and how to isolate the causes
NetScaler Tech Support Bundle
> show techsupport
- Critical System Data
- In-Depth Performance Monitoring Stats
- Detailed Log Files
- USER Command Logging
/var/tmp/support/collector_P_10.10.10.10_21Apr2014_21_42_tar.gz
Let’s invest a minute chatting about the NetScaler File System.
With advanced NetScaler troubleshooting you’ll frequently find yourself in the BSD SHELL of the system (I’ll discuss the SHELL in more detail later in the presentation), so knowing the actual structure of the file system will greatly assist you in your troubleshooting efforts.
/var contains historical data in the form of logs and is one of the first places to look when trying to troubleshoot a NetScaler issue.
/var/log is the “traditional” location for logs in UNIX-based operating systems.
/var/nslog contains NetScaler-specific logs --- <click>
/var/nstrace will house all of the trace files taken on the NetScaler --- <click>
/var/crash & /var/core will contain any crash files or core dumps on the system --- <click>
/flash contains the actual NetScaler configuration file and any customizations that have been done --- <click>
/flash/nsconfig/ssl will store all of the SSL certificates installed on the system
<click>
/flash (cont.)
Flash also includes User Monitors and additional custom options as well --- <click>
/ (or the RAM drive) contains the operating system
So what happens if the various components fail?
Well, the appliance will be able to operate without /var, but will not be able to log any statistics or other relevant data.
The appliance cannot boot without /flash.
The appliance also cannot boot without /, the RAM drive.
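As a quick sanity check from the BSD SHELL, the locations just described can be probed in one pass. This is only a minimal sketch: the check_ns_dirs function name and its optional prefix argument are my own additions for illustration (the prefix lets you try it against a copied directory tree), not part of the NetScaler tooling.

```shell
# Report which of the key NetScaler directories exist.
# check_ns_dirs and its optional prefix argument are hypothetical helpers.
check_ns_dirs() {
    prefix=${1:-}
    missing=0
    for dir in /var/log /var/nslog /var/nstrace /var/crash /var/core \
               /flash/nsconfig/ssl; do
        if [ -d "${prefix}${dir}" ]; then
            echo "present: ${dir}"
        else
            echo "MISSING: ${dir}"
            missing=$((missing + 1))
        fi
    done
    # Non-zero exit status when at least one directory is absent.
    [ "$missing" -eq 0 ]
}
```

Run check_ns_dirs with no argument on the appliance itself; any line starting with MISSING points at a location worth investigating first.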
<click>
Now that we have a high-level overview of the key processes and file system structure, let’s invest some time reviewing Troubleshooting Tools & Techniques
<click>
The NetScaler tech support bundle, sometimes referred to as the collector file, is one of your very best resources in analyzing the health of your NetScaler appliance.
<click>
The tech support bundle captures critical system data about the performance of the appliance, error logs and a host of other extremely important data that can be used for analysis.
<click>
To create a new tech support bundle that can be analyzed for potential issues on the appliance, simply log into the NetScaler via your favorite SSH client and enter the command: > show techsupport
The tech support file will be generated and stored on the hard drive of the NetScaler in the /var/tmp/support directory, and the file name will start with collector_P or collector_S.
You can log into the NetScaler via WinSCP and navigate to the /var/tmp/support directory to transfer the collector file to your local computer.
IMPORTANT NOTE: If this appliance is part of an HA pair, make sure that you log into the SECONDARY appliance and collect a tech support bundle on it as well. Citrix Technical Support will use both support bundles to correlate issues between the HA pair.
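Because the file name encodes the HA role (P for the primary, S for the secondary), a small helper can confirm that you really have a bundle from each node before you contact support. This is a hedged sketch: collector_role and list_bundles are hypothetical names of my own, and the only facts relied on are the /var/tmp/support location and the collector_P / collector_S prefixes.

```shell
# Classify a tech support bundle by the HA role encoded in its file name.
# collector_role and list_bundles are hypothetical helper names.
collector_role() {
    case "$(basename "$1")" in
        collector_P_*) echo "primary" ;;
        collector_S_*) echo "secondary" ;;
        *)             echo "unknown"; return 1 ;;
    esac
}

# List every bundle in a directory together with its role
# (/var/tmp/support unless another directory is given).
list_bundles() {
    for f in "${1:-/var/tmp/support}"/collector_*; do
        [ -e "$f" ] || continue   # the glob matched nothing
        echo "$(collector_role "$f") $f"
    done
}
```

Running list_bundles on each node of the HA pair should show one primary and one secondary bundle between them.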
<click>
Another equally good method for harvesting the support bundle is via the NetScaler GUI.
Log into the NetScaler GUI via your favorite web browser, navigate to the System Node, then select Diagnostics, and then select Generate support file under the Technical Support Tools section.
Click Run to start the process, which is really just a set of scripts to harvest key data.
Once the process has completed, click Download… and you’ll be presented with a simple dialogue to choose a suitable download location on your local computer for the newly generated support bundle.
I’ll talk more about how you will use the tech support bundle shortly.
<click>
Let’s talk for a minute about Citrix Predictive Support.
Predictive Support (formerly known as TaaS) is an initiative from Citrix focused on making the support of your Citrix environment as easy as possible.
Citrix has developed tools and online analysis capabilities to help you collect environment information, analyze that information and receive tailored recommendations based on your Citrix environment and configuration.
<click>
The tools are focused on a single mission (data collection), and their impact on your environment is minimal in terms of disk space, prerequisites and performance during the data collection process.
Predictive Support is going to analyze the data captured in the support bundle and provide you with Tailored Recommendations, specific to your environment.
Let’s take a look at how you should use Predictive Support.
<click>
Log in to Predictive Support with your ** CITRIX ** username and password, upload your NetScaler tech support bundle with the ‘Upload Data’ option, then select the ‘Upload File’ option when prompted.
You’ll be presented with a simple dialogue option to browse your local computer where you saved the support bundle for upload.
Depending on the size of your support bundle and the speed of your internet connection, it may take a few minutes to accomplish this task.
The size of your support bundle will be directly affected by the size of your configuration file, the rate of traffic flowing through the appliance, error logs and potential crash files that may have been captured.
<click>
When you log into Predictive Support, you’ll see each of the files which have been uploaded. This will include your support bundle at a minimum, but may include trace files or other related log files that you have uploaded as well for analysis.
You can see in the example provided that there are (10) different issues which have been flagged from the uploaded support bundle.
You can also see that this support bundle is from the PRIMARY NetScaler in an HA (High Availability) pair, as the collector references a capital ‘P’ to identify the file as primary.
You’ll see a capital ‘S’ if the file is from the secondary.
Click on the line with the (10) issues and you’ll be presented with another dialogue screen which itemizes each of the respective issues for your review.
<click>
Once the tech support bundle has been uploaded, Predictive Support will execute a series of scripts against the bundle and will flag important issues that have been identified. Each issue will have a brief problem summary for review.
Items marked with the RED BELL icon are the most important issues to address first.
In this particular example we can see that the NetScaler crashed and produced a ‘crash file’, also referred to as a ‘core dump’.
Crash files are exceptionally helpful to Citrix, as in most cases the RCA (root cause analysis) for the actual crash can be identified by running what’s called a ‘back trace’ on the crash file to identify the reason for the crash.
Understanding why the NetScaler crashed from the back trace will provide key information to assist in stabilizing your environment and for charting a course to resolve the issue entirely.
Selecting the Crash file found on NetScaler link will take you to another dialogue with additional detail about the location of the crash files on the NetScaler.
<click>
Crash files will be stored on the hard drive of the NetScaler appliance.
You can log into the NetScaler with a tool such as WinSCP (http://winscp.net) and navigate to the /var/core directory to find the crash files. You may need to navigate even further into the directory structure depending on how many times the NetScaler may have crashed. You can see in the example above that there are (5) directories created under the root /var/core directory, which would represent different days or times for the relevant crash files.
You’ll note from a previous slide that I mentioned there are two locations for crash files, /var/core and /var/crash respectively. Make sure that you inspect both locations for potential crash files.
Tech Support will request these crash files for further analysis.
Additionally, the NetScaler tech support bundle may or may not include the crash files depending on file size, so it is important to inspect these directories for files.
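Since crash files can sit several directories deep and in either of the two locations, a recursive listing saves some clicking around. Here is a minimal sketch, assuming only the /var/core and /var/crash paths mentioned above; list_crash_files and its optional prefix argument are hypothetical names I have introduced for illustration.

```shell
# List every regular file under both crash locations, however deeply nested.
# list_crash_files and its optional prefix argument are hypothetical helpers.
list_crash_files() {
    prefix=${1:-}
    for dir in /var/core /var/crash; do
        if [ -d "${prefix}${dir}" ]; then
            find "${prefix}${dir}" -type f
        fi
    done
}
```

An empty result from list_crash_files simply means that neither directory currently holds a crash file.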
<click>
When you log into Predictive Support you’ll be presented with a NetScaler Overview. You’ll see the issues which have been flagged for your attention as previously referenced, but you can also navigate to the NetScaler Environment option (among the other options of course) to investigate key sub-systems of the appliance, such as CPU, Memory, System traffic rate, etc.
Select the area that you want to investigate to drill down further.
<click>
You can also select a different newnslog file to analyze the data for a different time-frame if so desired.
<click>
Navigating into each sub-system will provide you with an excellent selection of very intuitive and informative graphs to help pinpoint specific issues during certain time-frames.
Simply place your cursor over the interactive graph for even more specific data.
Additionally you can download the data detail as an Excel spreadsheet for further analysis, correlation or data manipulation with other data points.
Leveraging the information provided to you by Predictive Support will not only empower you to effectively troubleshoot your NetScaler appliance, but it will provide you with critical data points and helpful graphs to incorporate into a Post Mortem presentation for internal management as required.
I highly encourage you to leverage Predictive Support on a regular basis for a system health check. All you need to do is upload a fresh support bundle to gain instant insight into the health of your NetScaler appliance.
Later in the presentation I’ll show you how I use Predictive Support in two case studies that I will be sharing with you.
<click>
Let’s invest a brief minute discussing IPMI.
Many of the NetScaler appliances have been equipped with an IPMI (Intelligent Platform Management Interface), perhaps more commonly referred to in the industry as the LOM (Lights Out Management).
The MPX 8005/8015/8200/8400/8600/8800, MPX 11500/13500/14500/16500/18500/20500, MPX 11515/11520/11530/11540/11542, MPX 17550/19550/20550/21550, and MPX 22040/22060/22080/22100/22120 appliances have the LOM port on the front panel of the appliance.
By using the LOM, you can remotely monitor and manage the appliance, completely independent of the NetScaler software.
So what are the things that you can do with the LOM?
You can remotely change the NetScaler IP address, perform different power operations, and obtain information from the appliance, such as health monitoring information, the MAC address, serial number, and properties of the host, by connecting to the appliance through the LOM port.
<click>
Simply connect a computer with a standard copper cable to the LOM port. In a web browser, type the IP address of the LOM port to access the intuitive GUI, which by default is http://192.168.1.3.
You’ll need to ensure that the computer from which you’re accessing the LOM port has been configured for the same subnet. Once logged into the GUI you can modify the default IP address and associated username and password for critical access control.
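If you are unsure whether your computer's address shares a subnet with the LOM port, the comparison can be worked through as follows. This is a sketch under a stated assumption: the /24 prefix length used in the usage note is my own guess for illustration, since the subnet mask is not given here, and ip_to_int and same_subnet are hypothetical helper names.

```shell
# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed when two addresses fall in the same subnet.
# $1, $2: IPv4 addresses; $3: prefix length (assumed, e.g. 24).
same_subnet() {
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$2") & mask )) ]
}
```

For example, same_subnet 192.168.1.3 192.168.1.50 24 succeeds, while same_subnet 192.168.1.3 192.168.2.50 24 fails, telling you to readdress the computer first.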
Show commands in the NetScaler CLI primarily provide configuration and status information about the system or specified entity:
Show commands for SYSTEM information
sh node – is an essential command to leverage when troubleshooting HA-related issues
sh info - consolidates sh version, sh feature, sh mode in one output
sh license
<click>
Show commands for a Vserver and Service
sh lb vserver / sh cs vserver – generic output without an entity name vs. specific output when an entity is referenced
sh service – just like sh lb vserver, generic vs. specific output
sh persistencesession, helpful for tracking a persistent session without a trace
sh connectiontable, large output but also useful for connection tracking
<click>
Other helpful show commands
sh route & sh ip, sh
<click>
The primary function of the “stat” command is to provide statistical information about a particular entity.
Similar to the “show” commands, and as a general rule, there is generic and specific output based on whether or not an entity name is specified.
Common system stats
stat ns – system overview. Shows SSL card, disk, TCP, HTTP, SSL, IC, CMP, AppFW statistics
stat cpu – CPU utilization
stat interface – interface generic and specific information. The most useful output is achieved by specifying the interface number.
<click>
Common entity statistics
stat lb vserver – provides generic req/s data on bound services. Specific output gives us greater detail on how many connections are coming in and how the server is performing.
stat cs vserver – similar to stat lb vserver output
stat service – provides generic req/s data on all services. Specific output gives us extended data regarding connections and server performance.
<click>
Other common statistics commands
stat dns – DNS request/response/type statistics
stat ssl – SSL session statistics
stat http – HTTP statistics
<click>
As you’ve seen there are quite a few different show and stat commands accessed from the NSCLI which can provide you with very quick and insightful data about your NetScaler appliance.
There will be times where you would like to know more in-depth information about how your NetScaler appliance is performing. In addition you may need to investigate why a particular problem occurred.
This is where you’ll use nsconmsg and not the NSCLI.
nsconmsg logs all of the statistics in the NetScaler every (7) seconds. This will include performance statistics, error messages, console messages, etc. It logs almost every process that the NetScaler performs.
There are numerous different counters that you can harvest critical data from, and nsconmsg is one of the primary tools used by the Citrix support teams for performing analysis or RCA.
As highlighted in the slide, when using the nsconmsg command, make absolutely sure that you use a capital (K) and not a lower-case (k), or you’ll wipe out the log file.
<click>
Let’s take a look at an example nsconmsg command.
When you log into the NetScaler appliance via an SSH client, you’ll need to drop to the BSD SHELL with the command > shell
You’ll see the prompt change to the # (pound sign).
Change to the /var/nslog directory with the # cd /var/nslog command and then execute the referenced nsconmsg command string shown.
newnslog is the current log file, fqdn-ssl-vip is the name of the vserver, and ConLb=1 will present the load balancing stats
Again, use a capital (K) with the command.
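Because a single lower-case letter can cost you the log file, one defensive habit is to wrap the command so the dangerous option never reaches it. The following is only a sketch: safe_nsconmsg and the NSCONMSG_BIN variable are hypothetical names of my own, and the wrapper does nothing more than scan its arguments for a literal -k before handing them to nsconmsg unchanged.

```shell
# Wrapper that refuses the destructive lower-case -k option.
# safe_nsconmsg and NSCONMSG_BIN are hypothetical names for this sketch.
safe_nsconmsg() {
    for arg in "$@"; do
        if [ "$arg" = "-k" ]; then
            echo "refusing to run: use -K (capital) to read a log file" >&2
            return 1
        fi
    done
    # Pass all arguments through untouched.
    "${NSCONMSG_BIN:-nsconmsg}" "$@"
}
```

From /var/nslog, something like safe_nsconmsg -K newnslog -s ConLb=1 -d oldconmsg would then display the load balancing statistics; the vserver selector from the slide is deliberately left out here, as I am not reproducing that part of the command string.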
<click>
So in this example from the previous newnslog command I am analyzing the latest LB statistics of a particular vServer stored in the current log file, the newnslog.
<click for each number>
From this output we can see a few key data points:
The actual recorded time of the (7) second log record
The total # of monitor probes sent to the back-end service and the # of probes which have failed
You can see the VIP detail and associated Hits/sec, Mbps and Persistency method used
You can also see the Service associated with the VIP and specific details about that Service
Additionally you can reference the CPU and MEM utilization, coupled with the UP TIME of that VIP, referenced of course during the (7) second log record interval
<click>
Here are some additional examples of leveraging nsconmsg to perform in-depth analysis of major functions within the NetScaler system.
You may want to append | more to the end of your command to page through the statistics one screen at a time for ease of readability.
Also, as a point of reference, not every function offers all three detail levels (1, 2 and 3).
The detail of the command output will increase with the 2 or 3 options. So with load balancing stats, you’ll see significantly more information when appending the 3 to ConLb.
<click>
So far in this presentation I’ve discussed using the stat, show and nsconmsg commands for system analysis.
As referenced, the nsconmsg binary log file has considerable detail captured, but you may want to correlate different log files during the same time frame to see what may have caused a particular event to occur.
<click>
For example, you may want to see why an INTERFACE flapped, or perhaps see why all of your CSW Vservers went DOWN at the same time.
Each of the respective log files has a time stamp associated with each event record. If you notice that all CSW Vservers go DOWN at the same time in the newnslog, perhaps an engineer within your company logged into the NetScaler appliance and issued a particular command that caused the condition to occur.
The ns.log log file located in the /var/log directory captures all of the USER NSCLI or SHELL commands.
<click>
You can use the following command in the /var/log directory to search through ALL of the ns.log log files to see if a particular command was issued at the time at which all of the CSW Vservers went DOWN:
# zgrep -i CMD_EXECUTED ns.log* | more
The ns.log and messages log files are among the log files that I reference most frequently for certain time-frames when attempting to correlate events in the system. There are numerous other log files that have excellent data recorded that you can leverage for analysis purposes, so don’t be bashful, dig around the system to familiarize yourself with each respective file.
You can simply use the cat or more command against a standard log file to examine the contents for your edification. The more familiar you are with the various log files, the more confident and efficient you will be at determining the RCA (root cause analysis).
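To correlate against a specific moment, the zgrep search for CMD_EXECUTED can be narrowed to the time stamp you spotted in the newnslog. This is a minimal sketch: cmds_near is a hypothetical helper name, and it assumes that each record begins with a syslog-style time stamp such as "Apr 21 21:41:07", so matching on a time stamp prefix selects a window.

```shell
# Show CMD_EXECUTED records whose line contains a given time stamp prefix.
# cmds_near is a hypothetical helper name.
# $1: directory holding the ns.log* files; $2: prefix such as "Apr 21 21:4"
cmds_near() {
    # zgrep reads both the current ns.log and the compressed older files.
    zgrep -i CMD_EXECUTED "$1"/ns.log* | grep -F -- "$2"
}
```

For instance, cmds_near /var/log "Apr 21 21:4" would list every command logged between 21:40 and 21:49 on that day, which you can then line up against the moment the Vservers went DOWN.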
<click>
Leverage Wireshark to perform detailed traffic analysis. There’s an old saying, “Packets never lie”. NetScaler log file analysis is a fantastic resource, but when you really need to get into the packets to see the details of what may be occurring, you’ll want to use the Wireshark tool in conjunction with a NetScaler trace file to improve your chances of determining the RCA.
Take a NetScaler trace to capture the bits and bytes, and then dig into the details with Wireshark.
As a note, I frequently browse the Wireshark web page to download the latest developer editions to keep my revision as up-to-date as I can: http://www.wireshark.org
<click>
The key message that I want to share here is … invest the time to enhance your default Wireshark configuration. There are many excellent additions that can be configured into your edition of Wireshark within 10 or 15 minutes to greatly speed up your analysis towards determining root cause for an issue experienced.
You can create custom menu options, such as HTTP errors, bad TCP packets, etc. A push of a menu button can instantly apply a comprehensive filter combination that you don’t have to memorize!
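To make the idea of a saved filter concrete, here are two display-filter expressions of the kind such buttons might apply. Treat this as a sketch rather than my actual configuration: the variable and function names are hypothetical, the filters use standard Wireshark field names, and tshark (Wireshark's command-line companion, which I have not otherwise covered) is assumed to be installed wherever you run it.

```shell
# Display-filter expressions built from standard Wireshark field names.
# The variable names are hypothetical.
HTTP_ERRORS='http.response.code >= 400'
BAD_TCP='tcp.analysis.flags && !tcp.analysis.window_update'

# Apply a display filter to a saved trace file with tshark.
# filter_trace is a hypothetical helper; $1: trace file, $2: display filter.
filter_trace() {
    tshark -r "$1" -Y "$2"
}
```

The same expressions can be pasted straight into the Wireshark filter bar and saved there, so a single click applies them without any memorization.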
<click>
You can add custom columns to your Wireshark view that will assist you in your analysis. You can see from my example that I have created a few key columns that I use all the time.
You just have to use your creativity to enhance your Wireshark edition.
<click>
Now we’re going to focus on some key troubleshooting techniques and look at a few case studies.
<click>
When I examined the NIC counters I could see that interfaces 1/1 and 1/2 were dropping packets due to rate limiting. This was because the NetScaler appliance was exceeding its system limits per the purchased license. Basically, there were more packets hitting the NetScaler than it was licensed for.
When the NetScaler drops packets because of a rate limit, it is hard policing at the NIC. This causes significant issues for TCP, with a high rate of retransmissions further exacerbating the problem with superfluous traffic.
The end result was that when USERS were attempting to access their XA or XD sessions, sometimes it would take several tries before the application or desktop would launch; and then when launched, there was quite a bit of slowness while using the resource.
The resolution was two-fold: 1) The customer correctly VLAN’d their different IP subnets, binding the subnets to the VLANs and associated interfaces to correctly segment their traffic. 2) The customer purchased an upgraded NetScaler license to facilitate the growth in their traffic base.
<click>
The moral of the story is to leverage Citrix Predictive Support often with a tech support bundle.
Pay attention to the issues which have been flagged for your attention.
Use the NetScaler NSCLI to gain quick insight into live performance.
Dig into the counters with nsconmsg and review the associated log files in the BSD SHELL to give you critical insight into the relative health and performance of the NetScaler appliance.
Follow this systematic, but really straight-forward process and you’ll be well on your way to determining the RCA for issues experienced much more efficiently. When all else fails, contact Citrix Technical Support and we’ll be more than happy to engage, partnering with you towards problem resolution.
<click>
We all like extra goodies, so I’ve put together a few Resources that I believe will help bolster your NetScaler toolkit!
<click>
Here are some excellent resources for your reference and review at a later time:
Comprehensive NetScaler Counters
Wireshark Developer Editions
Customizing Wireshark Tutorial
Citrix Predictive Support Forum
NSTRACE Options
How To Manage VLAN’s, Interfaces and Subnets
<click>
Conclusion: So let’s see what we’ve actually covered.
<click>
During this presentation I have provided you with:
<click>
An Overview of the NetScaler System to give you a high-level understanding of the core system --- <click>
I shared with you some excellent Troubleshooting Tools that are available at your disposal --- <click>
I also discussed a few key Troubleshooting Techniques that you can use to diagnose issues with your NetScaler appliance --- <click>
I then highlighted two different Case Studies leveraging the tools and techniques that I shared with you in the presentation --- <click>
In addition I have provided you with a few Resources for your future reference and edification --- DO NOT CLICK
Again I want to thank you for your kind attention during this presentation.
<click>
Q & A
This is your opportunity to ask a few questions.
As a brief note, if during this Q & A session we don’t have enough time to address your particular question, please do find me while I’m here at TechEdge and I’ll be quite happy to chat with you for any follow-up questions that you may have.
ASK: Are there any questions?
Wait for the questions to be asked and answered and then…
<click>
Real quick before you leave…
Conference surveys are available online at www.citrixsynergy.com.
Please do provide your valued feedback by 6:00 p.m. tonight to be entered to win one of many prizes.
In addition, you’ll be able to download each of the respective presentations starting Monday, May 19th from the My Event Planning Tool
<click>