Webinar recording - https://www1.gotomeeting.com/register/753997104
Citrix NetScaler has a rich suite of Web-based management tools. To dig deep when troubleshooting NetScaler, though, sometimes it’s best to roll up your sleeves and dig out the command line!
The goal of this session is to demystify some useful command line tools and provide a tactical approach to troubleshooting of NetScaler.
In this session we will demonstrate troubleshooting approaches using the command line and many tips for common issues seen in customer deployments.
In this session you will learn about:
· Differences between NetScaler kernel and BSD
· Processes and disk layout
· Look up stats and statuses
· Troubleshoot using various different logs
· Use counters to help identify issues
In-depth Troubleshooting on NetScaler using Command Line Tools
1. Andrew Sandford
Senior Readiness Specialist, Worldwide Support Readiness EMEA
Citrix Support Secrets Webinar Series
27 March 2014
Access to the appliance CLI is through the serial console, or by connecting with SSH to the NetScaler management IP or a SNIP with Management Access enabled.
This lands you in the NetScaler CLI, which has custom commands. Some TCSH-like shortcuts apply: CTRL-A brings you to the start of the line, and CTRL-E brings you to the end.
Most day-to-day NetScaler commands begin with one of the following verbs:
· show: Display information about an entity.
· add: Create an entity.
· remove: Delete an entity.
· set: Change/modify an entity.
· enable: Turn a feature or setting ON.
· disable: Turn a feature or setting OFF.
· force: Used in High Availability to Sync & Failover.
· bind: Create a relationship between two entities.
· unbind: Remove a relationship between two entities.
> set cli prompt %u@%h-%T
> set cli mode -color ON
> help set cli prompt
Autocomplete is your best friend: <tab> and ?
Prompt variables:
· %! - will be replaced by the history event number
· %u - will be replaced by the NetScaler user name
· %h - will be replaced by the NetScaler hostname
· %t - will be replaced by the current time
· %T - will be replaced by the current time (24 hr format)
· %d - will be replaced by the current date
· %s - will be replaced by the node state
There is another shell (BASH) that is only used for file handling, never to configure the NetScaler. Note the different prompts.
> shell (to enter the BASH shell in BSD)
# Ctrl+D to exit BASH, or type exit
All of the UNIX command goodness in FreeBSD!
1. Show feature provides an output that allows us to quickly identify which features are enabled. Notice these common features are disabled:
· LB
· IC
· REWRITE
This output would almost certainly indicate a misconfiguration.
2. At a bare minimum most deployments will have at LEAST “load balancing” enabled. If LB is off, you can see issues where a vserver won't come up, or it will only utilize 1 service. If you have any inexplicable errors where features just don’t work at all despite a proper config, it is almost ALWAYS because the feature has been disabled. This is a common mistake; I make it frequently.
In this case we specified a vserver name. This is the most desirable way to execute this command, as it shows you the most detail, namely the state of your bound services.
The key difference between this example and the generic example is the bound service summary.
We now know that this vserver is down because the only service bound to it is down. A vserver will go down when all of its bound services are down.
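The rule in the last sentence can be modelled in a few lines. This is only an illustrative Python sketch, not NetScaler code: the function name and the "UP"/"DOWN" strings are assumptions made for the example, and treating a vserver with no bound services as DOWN is also an assumption.

```python
def vserver_state(bound_service_states):
    """Return "UP" if at least one bound service is UP, otherwise "DOWN".

    Illustrative model only. An empty list (no bound services) is
    reported as "DOWN" here, which is an assumption of this sketch.
    """
    return "UP" if any(state == "UP" for state in bound_service_states) else "DOWN"


# The case from the slide: the only bound service is down.
print(vserver_state(["DOWN"]))
# One healthy service is enough to keep the vserver up.
print(vserver_state(["DOWN", "UP"]))
```

When a vserver shows DOWN, the quickest check is therefore the bound service summary in the vserver-specific show output.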
Two different directories. One for cores and one for crashes
· Pitboss controls the processes on a NetScaler
· If the pitboss detects a process failing it will try to restart it
· The nsppe process is in userland so can be “warm” started
· If a process fails 5 times, on the sixth failure the NetScaler will undergo a full reboot
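The restart rule above can be sketched as a small counter. This is a hypothetical Python model, not pitboss itself: the class name, the per-process counting and the assumption that failure counts never reset are all mine.

```python
class RestartPolicy:
    """Toy model of the rule: restart on failures 1-5, reboot on the 6th."""

    MAX_RESTARTS = 5  # from the notes: five failures are tolerated

    def __init__(self):
        # Failure count per process name (assumption: counted per process,
        # never reset).
        self.failures = {}

    def on_failure(self, process):
        count = self.failures.get(process, 0) + 1
        self.failures[process] = count
        return "restart" if count <= self.MAX_RESTARTS else "reboot"


policy = RestartPolicy()
for attempt in range(1, 7):
    print(attempt, policy.on_failure("nsppe"))
```

Repeated restarts of the same process are therefore worth investigating before the sixth failure takes the whole appliance down.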
1. Show commands primarily provide configuration and status information about the system or specified entity.
2. Show commands for SYSTEM info:
· Show node: talk about various bad states. Essential to troubleshooting HA issues
· Consolidate show version, show feature, show mode into “show info”
· Show license
3. Show commands for a vserver and service:
· sh lb vserver / sh cs vserver: generic command vs specific referencing an entity
· sh service: just like sh lb vserver, generic vs specific output
· sh persistentSessions: helpful for tracking a persistent session without a trace
· sh connectiontable: large output but also useful for connection tracking
4. Other show commands:
· sh route, sh ip, sh
· sh dns addRec -type proxy: useful for debugging cached DNS records
1. Why is this node down?
2. Things to notice here:
· Node State: NOT UP
· Master State: Secondary
3. It's down because unused interfaces are enabled and not receiving heartbeats. If we compare which interfaces are enabled to which interfaces are not receiving heartbeats, we can determine that 1/7 is the only interface receiving heartbeats.
4. We can correct this by disabling all interfaces except for 1/7 (which is the only interface in use).
5. Notice the partner node: it is secondary but has only one interface enabled. It's *NOT* failing for the same reason; notice the Node State: STAYSECONDARY setting.
6. So we know the node we are on is down because its interfaces are misconfigured, and the partner node is forced to stay secondary.
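The comparison in steps 3 and 4 is a set difference: enabled interfaces minus those receiving heartbeats. A minimal Python sketch, assuming you have copied the two interface lists out of the show node output by hand; the function name and every interface other than 1/7 are hypothetical.

```python
def interfaces_to_disable(enabled, receiving_heartbeats):
    """Enabled interfaces that are not receiving HA heartbeats, sorted."""
    return sorted(set(enabled) - set(receiving_heartbeats))


# Hypothetical lists; only 1/7 comes from the example on the slide.
enabled = ["1/5", "1/6", "1/7", "1/8"]
receiving = ["1/7"]
print(interfaces_to_disable(enabled, receiving))
```

Anything the function returns is an enabled interface that HA treats as critical but that carries no heartbeats, which is the condition that keeps the node NOT UP.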
BSD/NS relationship:
· BSD controls disk/time slicing, primarily a bootloader for NS kernel
· Logs are written by BSD
· Logs are rolled by BSD
· NetScaler runs inside of FreeBSD, nsppe
· Consumes majority of userland processor (99%)
· NetScaler controls NICs/packet processing/etc.
Key processes:
· nsppe
· nsaaad
· nsconfigd
· nsauthd
· nslog.sh
· nsnetsvc
· nsumond
· nsconmsg
· Dynamic routing processes
Taken on an MPX 15500
1. Things to notice here:
· Various hit counters
· Client conn vs Server conn
· Spillover
· Service stats (TTFB, transactions)
A significant portion of the information we care about is already available from “stat lb vserver [name]”.
Things to notice:
· TTFB
· Surge queue
1. Discuss an overview of each mount point:
· “/var” contains historical data in the form of logs. This is one of the first places to look when trying to troubleshoot a NetScaler issue.
  · /var/log is the “traditional” location for logs in Unix/Linux operating systems
  · /var/nslog contains NetScaler specific logs
· “/flash” contains configuration and customizations
  · rc.netscaler
  · Any modified configs from /etc
  · User monitors
  · Kernel itself
· “/” contains the OS
  · RAM drive
  · Avoid writing anything to it, no reason to
2. What happens if the components fail?
· Device can operate without /var, but will not be able to log
· Device cannot boot without /flash
· Device cannot boot without /
1. THIS IS AN INCOMPLETE LIST; THESE ARE THE MOST COMMON FILES.
ns.log (INCLUDING BUT NOT LIMITED TO):
· contains NSCLI commands
· contains syslog messages
· useful for reconstructing user input and event timelines
· by far the most informative file, containing the most information in one place
messages contains:
· system events
· authentication messages
· system startup messages
· commands executed under shell
· console messages
1. THIS IS AN INCOMPLETE LIST.
· newnslog: current live file
· newnslog.*.gz: rolled log files
· ns.log: contains newnslog related events
· nsumond.log: contains log output for userland monitors using KAS
· nslog.nextfile: next newnslog file to be written
1. If a problem hasn’t been solved by what we have done so far:
· Logfile analysis
· Show commands
· Stat commands
Then the next step is to get even further debug information from nsconmsg.
2. Nsconmsg logs all of the statistics we have seen so far, and additionally there are literally thousands of other counters we don’t see in the NSCLI which are logged. All of these counters are recorded every 7 seconds and written to the file in a binary format. Nsconmsg is one of the primary tools support uses to debug issues.
3. Some things we can get from newnslog are:
· Events: UP/DOWN messages for entities (vservers, services), HA events, interface events, etc. Most of these events are also logged to syslog.
· Console messages: mostly BSD messages, disk write errors, etc.
· System statistics: all counters are captured every 7 seconds. We can view LB statistics, system statistics.
· System counters: mostly for software debug
· Load balancing counters: ConLb shows us detailed load balancing statistics
This is a freshly booted device so we see a variety of messages here:
· Service up/down events
  · Internal services coming up
· Bootup messages
  · CPU
  · CONFIG START
  · Ubsec_0 UP
· Interface events
  · Disabled interfaces
· HA events
  · Version mismatch message
  · Remote node UP
-d consmsg provides output on any BSD console messages.
This console output consists only of bootup output, but you may also see things like:
· IP conflicts
· NIC errors (duplex issues)
· lack of file handlers
· OS errors, etc.
-d oldconmsg provides CPU and memory utilization output.
I use this to quickly establish trends in CPU/MEM utilization: just let the output scroll, watch the mem/cpu figures, and see if they increase steadily.
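If you copy the utilization figures out of the scrolling output, the "increases steadily" check can be done with a couple of helper functions. This is a rough Python sketch under stated assumptions: the function names are hypothetical, the samples are typed in by hand rather than parsed from nsconmsg, and consecutive samples are assumed to be 7 seconds apart (the counter interval mentioned earlier in these notes).

```python
SAMPLE_INTERVAL_S = 7  # assumed spacing between consecutive samples


def growth_per_minute(samples):
    """Average change per minute between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed_s = (len(samples) - 1) * SAMPLE_INTERVAL_S
    return (samples[-1] - samples[0]) * 60.0 / elapsed_s


def rises_steadily(samples):
    """True if the series never drops and ends higher than it started."""
    never_drops = all(b >= a for a, b in zip(samples, samples[1:]))
    return len(samples) >= 2 and never_drops and samples[-1] > samples[0]


memory_pct = [40.0, 41.0, 42.0, 43.0]  # hypothetical readings
print(rises_steadily(memory_pct), round(growth_per_minute(memory_pct), 2))
```

A series that rises steadily over many samples points toward a leak or sustained load rather than a short spike.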
1. -s ConLB=2 provides detailed debug output regarding load balancing:
· Specific detail on the lbvserver and the types of traffic it's handling
· Specific detail on services and the traffic THEY are subsequently handling
2. Most interested in the following sections:
· Hits, particularly Pers (persistence status to explain hits)
· Pkt (packet stats)
· Conn (Current server, Maximum server, Open Established, Established, Reuse Pool, Surge queue)
1. HDD issues: primary failure is that logging fails, /var is missing
· Check df, are any drives missing?
· Check dmesg, are there any drives missing or errors?
· Run fsck on the drive to check for errors
· Attempt to re-mount the drive
2. Flash issues: config save issues, sync fails, device fails to boot
· The box won't boot without flash, so if the NetScaler is running the device mounted OK.
· Check df, is /flash missing or full?
· Check dmesg, is flash missing or getting any errors?
· Run fsck on the drive to check for errors
3. Memory starvation: dropped sessions, can't allocate memory for other tasks (CPU profile, etc.)
· Feature memory allocation: IC, APPFW, TCPBUFFERING
· ConMEM
4. CPU overutilization
· SNMP polling
· Newnslogs roll
· USIP? CMP?
· Anything in userland?
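The two df checks above (is /var missing, is /flash missing or full) can be sketched as a small parser. This is an illustrative Python example under assumptions: the function name is hypothetical, the sample text is made up in the usual df column layout rather than taken from a real appliance, and the device names in it are placeholders.

```python
# Hypothetical df output: /flash is full and /var is not mounted at all.
SAMPLE = """Filesystem  1K-blocks    Used   Avail Capacity  Mounted on
/dev/md0       422318  404190    9682    98%    /
/dev/ad0s1a   1796886 1796886       0   100%    /flash
"""


def check_mounts(df_output, required=("/var", "/flash"), full_at=100):
    """Map each required mount point to "ok", "full" or "missing"."""
    capacity = {}
    for line in df_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        percent = fields[-2].rstrip("%") if len(fields) >= 6 else ""
        if percent.isdigit():
            # Last column is the mount point, the one before it is Capacity.
            capacity[fields[-1]] = int(percent)
    result = {}
    for mount in required:
        if mount not in capacity:
            result[mount] = "missing"
        elif capacity[mount] >= full_at:
            result[mount] = "full"
        else:
            result[mount] = "ok"
    return result


print(check_mounts(SAMPLE))
```

A "missing" /var matches the logging-failure symptom in item 1, and a "full" /flash matches the config save failures in item 2; dmesg and fsck remain the next steps either way.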
http://support.citrix.com/article/CTX109304
Perl script: /netscaler/showtechsupport.pl
Also available in the UI: System > Diagnostics > Generate support file
1. If we haven't been able to solve the issue with stat and show commands, and can't find anything in the logs, the next step might be to get a sniff. Personally I like to get a sniff first and work forward from there, assuming the problem warrants it. There are 2 kinds of traces:
· nstrace: preferable, gives extended NetScaler data
· nstcpdump: an all-NIC wrapper for tcpdump. Quick, familiar, uses tcpdump syntax.
2. Common syntax: -sz 0, or with a filter. Truncated traces are almost always worthless.
3. 9.0 adds filter support to nstrace. This is the preferred acquisition method, as we write extended data to the capture that is useful for session tracking, NIC tracking, operation tracking, etc. Requires a custom build of Wireshark to view.
· If you use a filter, use the -link option to capture all of the other related traffic on that session.
· Operators: ==, !=, etc.
4. Nstcpdump is good for looking at live traffic, or for when traces need to be viewed with a standard Wireshark build. All we are doing is an all-NIC trace when invoking tcpdump via nstcpdump.sh. Since we now have filtering in nstrace, it is much more preferable to use that method to capture instead of nstcpdump. Filtering in tcpdump is expensive in userspace due to the way we provide the data to tcpdump.
· Common syntax: -X to print payload in ASCII
· Common syntax: -w to write the capture to a file
· Tcpdump is not effective due to the way packets are captured from the kernel
Cover various switches; the most common syntax will be “nstrace.sh -sz 0”.
· -sz: specifies size of data to be captured
· -nf: number of files in a cycle
· -tcpdump: writes file in tcpdump format (doesn’t need special Wireshark build)
· -filter: specifies the filter to apply
· -link: link associated traffic from filter
type qualifiers say what kind of thing the id name or number refers to. Possible types are host, net and port. E.g., `host foo', `net 128.3', `port 20'. If there is no type qualifier, host is assumed.
dir qualifiers specify a particular transfer direction to and/or from id. Possible directions are src, dst, src or dst and src and dst. E.g., `src foo', `dst net 128.3', `src or dst port ftp-data'. If there is no dir qualifier, src or dst is assumed. For `null' link layers (i.e. point to point protocols such as slip) the inbound and outbound qualifiers can be used to specify a desired direction.
proto qualifiers restrict the match to a particular protocol. Possible protos are: ether, fddi, ip, arp, rarp, decnet, lat, moprc, mopdl, tcp and udp. E.g., `ether src foo', `arp net 128.3', `tcp port 21'. If there is no proto qualifier, all protocols consistent with the type are assumed. E.g., `src foo' means `(ip or arp or rarp) src foo' (except the latter is not legal syntax), `net bar' means `(ip or arp or rarp) net bar' and `port 53' means `(tcp or udp) port 53'.
The -r option must not be given to the script, because internally we supply the ‘-r -’ option as a default entry so that TCPDUMP reads the traces from the standard input. So it is logical not to supply a ‘-r’ option from the CLI.
The -i option must be avoided too. This is because TCPDUMP listens on only 1 interface at a time. In our case we are dumping packets that arrive on all the interfaces on to the standard output. If you want to view per interface packets, ‘nstrace.sh’ can be used with the ‘-tcpdump 1 -nic 1’ option as input.
The ‘-F’ option is not encouraged. If this is to be put to full use, use ‘nstrace.sh’ or ‘nstcpdump.sh -w <file>’ and do an offline filtering using TCPDUMP directly. This is because we felt that, since you are using the standard output to dump the traces, it would not be a wise idea to use ‘complex’ filter expressions. So the better idea is to store the entire trace in a file, view it using Ethereal or another graphical packet analyzer and, based on the fields you are interested in, generate your ‘expression’ file and use TCPDUMP directly on the shell with the ‘-F’ option to filter the captured trace.
-s Snarf snaplen bytes of data from each packet rather than the default of 68 (with SunOS's NIT, the minimum is actually 96). 68 bytes is adequate for IP, ICMP, TCP and UDP but may truncate protocol information from name server and NFS packets (see below). Packets truncated because of a limited snapshot are indicated in the output with ``[|proto]'', where proto is the name of the protocol level at which the truncation has occurred. Note that taking larger snapshots both increases the amount of time it takes to process packets and, effectively, decreases the amount of packet buffering. This may cause packets to be lost. You should limit snaplen to the smallest number that will capture the protocol information you're interested in.
-T Force packets selected by "expression" to be interpreted as the specified type. Currently known types are rpc (Remote Procedure Call), rtp (Real-Time Applications protocol), rtcp (Real-Time Applications control protocol), vat (Visual Audio Tool), and wb (distributed White Board).
Nstcpdump is just a wrapper for tcpdump, standard syntax applies.
# nsapimgr -K <nstrace-file> -s tcpdump=1 -k <tcpdump-file>
Offline conversion of traces that are in our NSTRACE format to TCPDUMP format.
· <nstrace-file>: the file, which is in the NSTRACE format.
· <tcpdump-file>: the file into which the traces are to be converted and dumped in the TCPDUMP format.
· 0=nstrace-format (default)