IPSO VRRP Troubleshooting
The purpose of this article is to help in troubleshooting VRRP-related issues on Nokia Check Point firewalls. One of the most common problems in Nokia VRRP implementations is that interfaces on the active and standby firewalls go into the Master/Master state. The main reason for this is that the individual VRIDs on the master and backup firewalls cannot see each other's VRRP multicast advertisements.
The first step is to check the VRRP state of the interfaces. This is how you can check that:
PrimaryFW-A[admin]# iclid
PrimaryFW-A> show vrrp
VRRP State
Flags: On
6 interface enabled
6 virtual routers configured
0 in Init state
0 in Backup state
6 in Master state
PrimaryFW-A>
PrimaryFW-A> exit
Bye.
PrimaryFW-A[admin]#
SecondaryFW-B[admin]# iclid
SecondaryFW-B> sh vrrp
VRRP State
Flags: On
6 interface enabled
6 virtual routers configured
0 in Init state
4 in Backup state
2 in Master state
SecondaryFW-B>
SecondaryFW-B> exit
Bye.
SecondaryFW-B[admin]#
In the example shown, two virtual routers are in the Master state on both firewalls at once: the primary reports all six VRIDs as Master, while the secondary reports four in Backup and two in Master.
The next step is to run tcpdump to see whether the VRRP multicasts are reaching the interface in question.
As the first troubleshooting measure, run a tcpdump on the problematic interface of both the master and the backup firewall. To find out which interface is problematic, "echo sh vrrp int | iclid" should give you the answer: the problematic interface is the one on the backup firewall that is in a Master state, as in the sketch below.
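For example (a sketch only; the exact iclid output layout varies by IPSO release, but it resembles the clish "show vrrp interfaces" output shown later in this article):
SecondaryFW-B[admin]# echo sh vrrp int | iclid
Scan the output on the backup firewall for any VRID whose State is Master; that VRID's interface is the one to capture on.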
PrimaryFW-A[admin]# tcpdump -i eth-s4p2c0 proto vrrp
tcpdump: listening on eth-s4p2c0
00:46:11.379961 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
00:46:12.399982 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
00:46:13.479985 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
00:46:14.560007 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
The tcpdump on the primary firewall shows that the VRRP multicast advertisements are leaving the interface.
Next, run the tcpdump on the secondary firewall.
SecondaryFW-B[admin]# tcpdump -i eth-s4p2c0 proto vrrp
tcpdump: listening on eth-s4p2c0
00:19:38.507294 O 192.168.1.2 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 95 [tos 0xc0]
00:19:39.527316 O 192.168.1.2 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 95 [tos 0xc0]
00:19:40.607328 O 192.168.1.2 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 95 [tos 0xc0]
00:19:41.687351 O 192.168.1.2 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 95 [tos 0xc0]
00:19:42.707364 O 192.168.1.2 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 95 [tos 0xc0]
Now you can see that the interfaces on both the primary and the secondary firewall are sending VRRP multicasts (every packet is marked O for outbound), yet neither firewall sees the other's advertisements arriving. This means there is a communication breakdown between the two, most likely caused by a network issue.
Once the network issue is resolved, communication is restored and the interface with the lower priority will transition to the Backup state.
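A quick way to check for inbound advertisements directly (a sketch; the I/O direction markers are as shown in the captures above, and the 20-packet limit is arbitrary):
PrimaryFW-A[admin]# tcpdump -i eth-s4p2c0 -c 20 proto vrrp | grep " I "
If this prints nothing while outbound (O) advertisements keep flowing, the peer's multicasts are not reaching this interface.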
Now let us discuss another scenario where firewall interfaces are stuck in the Master/Master state.
Again, run a tcpdump on both interfaces in question:
PrimaryFW-A[admin]# tcpdump -i eth-s4p2c0 proto vrrp
tcpdump: listening on eth-s4p2c0
00:46:11.206994 I 10.10.10.1 > 224.0.0.18: VRRPv2-adver 20: vrid 103 pri 95 [tos 0xc0]
00:46:11.379961 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
00:46:12.286990 I 10.10.10.1 > 224.0.0.18: VRRPv2-adver 20: vrid 103 pri 95 [tos 0xc0]
00:46:12.399982 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
00:46:13.307014 I 10.10.10.1 > 224.0.0.18: VRRPv2-adver 20: vrid 103 pri 95 [tos 0xc0]
00:46:13.479985 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
00:46:14.387098 I 10.10.10.1 > 224.0.0.18: VRRPv2-adver 20: vrid 103 pri 95 [tos 0xc0]
00:46:14.560007 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
00:46:15.467064 I 10.10.10.1 > 224.0.0.18: VRRPv2-adver 20: vrid 103 pri 95 [tos 0xc0]
00:46:15.580010 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
SecondaryFW-B[admin]# tcpdump -i eth-s4p2c0 proto vrrp
tcpdump: listening on eth-s4p2c0
00:19:38.507294 O 192.168.1.2 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 95 [tos 0xc0]
00:19:38.630075 I 10.10.10.2 > 224.0.0.18: VRRPv2-adver 20: vrid 103 pri 100 [tos 0xc0]
00:19:39.527316 O 192.168.1.2 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 95 [tos 0xc0]
00:19:39.710131 I 10.10.10.2 > 224.0.0.18: VRRPv2-adver 20: vrid 103 pri 100 [tos 0xc0]
00:19:40.607328 O 192.168.1.2 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 95 [tos 0xc0]
00:19:40.790142 I 10.10.10.2 > 224.0.0.18: VRRPv2-adver 20: vrid 103 pri 100 [tos 0xc0]
00:19:41.687351 O 192.168.1.2 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 95 [tos 0xc0]
00:19:41.810150 I 10.10.10.2 > 224.0.0.18: VRRPv2-adver 20: vrid 103 pri 100 [tos 0xc0]
In the capture above, look at the VRID numbers of the incoming and outgoing packets: the inbound advertisements carry VRID 103 while this interface announces VRID 102, so the VRIDs do not match. This is an indication that the cabling is not correct. The cables going to the VRID 102 and VRID 103 segments are not connected correctly and need to be swapped to fix this issue.
Swap the cables and the issue will be resolved; the firewall with the higher priority will go into the Master state.
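To confirm the mismatch, you can list the distinct VRIDs seen on an interface in one pass (a sketch using standard sed and sort; the 20-packet limit is arbitrary):
PrimaryFW-A[admin]# tcpdump -i eth-s4p2c0 -c 20 proto vrrp | sed 's/.*\(vrid [0-9]*\).*/\1/' | sort -u
A correctly cabled segment should return a single VRID; two different VRIDs on one interface points to swapped cables.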
A properly functioning firewall pair will look like this:
PrimaryFW-A[admin]# iclid
PrimaryFW-A> sh vrrp
VRRP State
Flags: On
6 interface enabled
6 virtual routers configured
0 in Init state
0 in Backup state
6 in Master state
PrimaryFW-A> exit
Bye.
PrimaryFW-A[admin]#
SecondaryFW-B[admin]# iclid
SecondaryFW-B> sh vrrp
VRRP State
Flags: On
6 interface enabled
6 virtual routers configured
0 in Init state
6 in Backup state
0 in Master state
SecondaryFW-B> exit
Bye.
SecondaryFW-B[admin]#
If you were to run a tcpdump on the healthy interface, this is how it would look:
PrimaryFW-A[admin]# tcpdump -i eth-s4p2c0 proto vrrp
tcpdump: listening on eth-s4p2c0
18:25:44.015711 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
18:25:45.095726 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
18:25:46.175751 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
18:25:47.195770 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
18:25:48.275819 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
18:25:49.355812 O 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
^C
97 packets received by filter
0 packets dropped by kernel
PrimaryFW-A[admin]#
SecondaryFW-B[admin]# tcpdump -i eth-s4p2c0 proto vrrp
tcpdump: listening on eth-s4p2c0
18:26:07.415446 I 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
18:26:08.495451 I 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
18:26:09.515480 I 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
18:26:10.595486 I 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
18:26:11.675485 I 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
18:26:12.695522 I 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
18:26:13.775590 I 192.168.1.1 > 224.0.0.18: VRRPv2-adver 20: vrid 102 pri 100 [tos 0xc0]
^C
14 packets received by filter
0 packets dropped by kernel
SecondaryFW-B[admin]#
VRRP transitions can happen due to several causes:
The first (and most common) cause is that one or more of the monitored interfaces loses link state.
The second cause is that, due to network issues, the VRRP hello packets originating from the master VRRP member are not seen on the backup.
The third cause is that one of the Check Point critical devices fails to check in its state with the kernel within the specified timeout.
Solution
VRRP Transitions due to loss of link state
It is often difficult to determine whether a VRRP transition has occurred due to a loss of link state on one of the monitored interfaces. To isolate the failover cause to a link transition on one of these interfaces, do the following:
Gather switch statistics from the devices directly connected to the VRRP pair and analyze them to determine whether a link transition occurred.
Run the following command to determine which interface is losing link state and causing the transition to occur.
(NOTE: this command shows Up to Down transitions only. The counter will not increment when the link state goes from Down to Up.)
ipso[admin]# clish -c "show interfacemonitor"
Interface Monitor
Interface eth1c0
Status up
Logical Name eth1c0
State PhysAvail,LinkAvail,Up,Broadcast,Multicast,AutoLink
MTU 1518
Up to Down Transitions 1
Interface eth2c0
Status up
Logical Name eth2c0
State PhysAvail,LinkAvail,Up,Broadcast,Multicast,AutoLink
MTU 1518
Up to Down Transitions 1
Interface eth3c0
Status up
Logical Name eth3c0
State PhysAvail,LinkAvail,Up,Broadcast,Multicast,AutoLink
MTU 1518
Up to Down Transitions 1
Interface eth4c0
Status down
Logical Name eth4c0
Interface loop0c0
Status up
Logical Name loop0c0
State PhysAvail,LinkAvail,Up,Loopback,Multicast
MTU 0
Up to Down Transitions 0
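Because the counter only ever increases, an intermittent flap can be caught by sampling it periodically and comparing the results. A minimal sketch (assuming a Bourne-compatible shell; it is invoked via sh -c here since the IPSO admin shell may be csh):
ipso[admin]# sh -c 'while true; do date; clish -c "show interfacemonitor" | grep -E "Interface|Transitions"; sleep 300; done'
Whichever interface's Up to Down Transitions counter climbs between samples is the one causing the failovers.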
ipso[admin]# clish -c "show vrrp interfaces"
VRRP Interfaces
Interface eth1c0
Number of virtual routers: 1
Flags: MonitoredCircuitMode
Authentication: NoAuthentication
VRID 10
State: Master                      Time since transition: 85236
BasePriority: 110                  Effective Priority: 110
Master transitions: 3              Flags:
Advertisement interval: 1          Router Dead Interval: 3
VMAC Mode: VRRP                    VMAC: 00:00:5e:00:01:0a
Primary address: 10.207.159.5      Next advertisement:
Number of Addresses: 1
10.207.159.88
Monitored circuits
eth3c0 (priority 10)
Interface eth3c0
Number of virtual routers: 1
Flags: MonitoredCircuitMode
Authentication: NoAuthentication
VRID 10
State: Master                      Time since transition: 85236
BasePriority: 110                  Effective Priority: 110
Master transitions: 3              Flags:
Advertisement interval: 1          Router Dead Interval: 3
VMAC Mode: VRRP                    VMAC: 00:00:5e:00:01:0a
Primary address: 192.168.159.4     Next advertisement:
Number of Addresses: 1
192.168.159.88
Monitored circuits
eth1c0 (priority 10)
VRRP Transitions due to not receiving VRRP hello packets
To determine whether the VRRP hello packets from the master are seen on the backup, you will need to run tcpdump on each interface configured for VRRP and look for the inbound hello packets. The following command will show all VRRP hello packets:
ipso[admin]# tcpdump -vv -i eth1c0 proto vrrp
tcpdump: listening on eth1c0
18:18:20.605420 I 10.207.159.5 > 224.0.0.18: VRRPv2-adver 20: vrid 10 pri 110 int 1 sum 9684 naddrs 1 10.207.159.88 [tos 0xc0] (ttl 255, id 14906)
36 packets received by filter
0 packets dropped by kernel
When analyzing a VRRP hello packet, there are several fields to look at:
VRID: make sure the packets you are looking at belong to the VRID in question.
pri: the effective priority being announced to the other VRRP member.
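If several VRIDs share an interface, the capture can be narrowed to the VRID in question (a sketch; vrid 10 is just the example value from the capture above, and tcpdump's -l flag forces line buffering so the pipe flows in real time):
ipso[admin]# tcpdump -l -vv -i eth1c0 proto vrrp | grep "vrid 10 "
The trailing space in the pattern keeps vrid 10 from also matching vrid 100 and up.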
VRRP Transitions due to a failure of a Check Point Critical Device
VRRP will monitor the state of the Check Point processes only if "FW Monitoring" is selected in the VRRP configuration. For troubleshooting purposes this can be disabled from Voyager to rule out a critical device failure. Nokia does not recommend that customers run with this setting disabled in a production environment.
A Check Point critical device is a process that is monitored by the cpha daemon. These devices must report their state to the kernel within the specified timeout. If a device fails to report its state to the kernel within that timeout, the kernel will assume there is a problem with the process and will force a VRRP failover.
Note: when "FW Monitoring" is enabled on VRRP, any backward clock move will cause fwd to go into a problem state and, as a result, a VRRP failover will occur.
To obtain a list of the Check Point Critical Devices and timeouts run the
following command:
ipso[admin]# cphaprob -i list
Built-in Devices:
Device Name: IPSO member status
Current state: OK
Registered Devices:
Device Name: Synchronization
Registration number: 0
Timeout: none
Current state: OK
Time since last report: 102563 sec
Device Name: Filter
Registration number: 1
Timeout: none
Current state: OK
Time since last report: 102548 sec
Device Name: cphad
Registration number: 2
Timeout: 5 sec
Current state: OK
Time since last report: 0.2 sec
Device Name: fwd
Registration number: 3
Timeout: 5 sec
Current state: OK
Time since last report: 0.6 sec
To enable debugging (which will write an event to the messages file and console upon a critical device failure), run the following command:
ipso[admin]# ipsctl -w net:log:partner:status:debug 1
This will log to the console and to /var/log/messages. To turn off the console logging:
ipso[admin]# ipsctl -w net:log:sink:console 0
After enabling debugging, analyze the /var/log/messages file and look for lines containing "noksr". The log event will look like the following:
Oct 12 18:55:28 IP650A [LOG_DEBUG] kernel: netlog:noksr_timeout .. Firewall-1/cphad expired
Oct 12 18:55:28 IP650A [LOG_DEBUG] kernel: netlog:noksr_timeout .. Firewall-1/fwd expired
By analyzing this information you will be able to determine exactly which critical device has failed. You should then look at the timeout value for this critical device to determine whether the value is high enough.
In situations of relatively high CPU usage, a failover may occur because the critical device does not get the CPU time required to check its state in with the kernel. If the machine is under heavy load, it is recommended to increase the timeout to 600 seconds.
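For example, using the cphaprob register syntax shown later in this section, raising the fwd timeout to 600 seconds would look like this:
ipso[admin]# cphaprob -d fwd -t 600 -s ok -p register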
If the above does not improve the situation, use the following command to completely remove fwd from the "response" list:
ipso[admin]# cphaprob -d fwd unregister
Take into consideration that this means a failover will not occur if the fwd daemon crashes during normal operation.
To change a timeout value to a higher value use the following command:
ipso[admin]# cphaprob -d [device] -t [timeout] -s [state] -p register
Example:
ipso[admin]# cphaprob -d fwd -t 120 -s ok -p register
This command registers the fwd process with the state "OK" and a timeout value of 120 seconds.
(NOTE: this command will not survive a reboot, so it needs to be added to the fwstart script or to rc.local, preceded by a 60-second sleep, to make it persistent across reboots.)
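A minimal sketch of what that rc.local addition might look like (values taken from the example above; adjust the device and timeout to your environment):
sleep 60
cphaprob -d fwd -t 120 -s ok -p register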
Quick reference:
show vrrp interfaces
Detailed VRRP configuration, including priority, hello interval, and VRID.
clish -c "show interfacemonitor"
Displays interface transitions.
cphaprob -i list
Displays Check Point critical processes and their timeouts.
To log critical process failures:
ipsctl -w net:log:partner:status:debug 1
This logs to the console and to /var/log/messages. To turn off the console logging:
ipsctl -w net:log:sink:console 0
To change the timeout value of a monitored process:
cphaprob -d [device] -t [timeout] -s [state] -p register