The document discusses troubleshooting common issues in OpenStack, specifically focusing on tracebacks, Nova issues, and Neutron issues. It provides tips on reading tracebacks and diagnosing specific failures related to the Nova scheduler, Neutron DHCP agent, L2 agent, and L3 agent. Key troubleshooting techniques include checking logs, packet captures, and debugging configuration issues. The presenters emphasize becoming familiar with underlying technologies like Open vSwitch, iptables, and Linux bridging to properly diagnose OpenStack problems.
3. 3
What we’re here to talk about…
Troubleshooting OpenStack Issues:
• Tracebacks
• Common Nova issues
• Common Neutron Issues
4. 4
Slides available at SlideShare
These slides will be available at the following location after this presentation:
http://www.slideshare.net/JamesDenton1
8. 8
Traceback 101
• When errors occur, sometimes exceptions are raised.
• When an exception is caught, an error and a list of the functions
that got us to the point of the error are logged. This is a
traceback.
• The traceback output can be useful to operators and
developers and allows them to trace the steps to the error.
• As you’ll see, a traceback doesn’t always provide clear
insight into the real error.
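As a minimal illustration (the MySQL error text is invented to mirror the slide's example), here is how an exception propagates up a call chain and what the resulting traceback looks like:

```python
import traceback

def connect_db():
    # Hypothetical stand-in for a failed MySQL connection attempt
    raise ConnectionError("Can't connect to MySQL server on '127.0.0.1'")

def init_service():
    connect_db()  # the error actually originates one level deeper

try:
    init_service()
except ConnectionError:
    tb = traceback.format_exc()
    print(tb)

# Reading bottom to top: the last line names the real error, and the
# frames above it (init_service -> connect_db) show how we got there.
```

Note how the top of the traceback only tells you an `init_service` call failed; the bottom line is what names the actual problem.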
9. 9
Deciphering a traceback is a bit like reading the Matrix
11. 11
Tips on reading a traceback
Read from the bottom to the top
– The last few lines are the most relevant
In this case, within the init function the program was unable to connect to MySQL.
16. 16
No valid host was found
This error is likely seen when booting an instance. Common reasons for failing:
• There really are no hosts available
• Networking issues on compute node
• Lack of resources
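The first cause above can be checked directly. As a sketch, here is a scan for DOWN compute services; the row format below is a simplification of `nova service-list` output, not the exact CLI layout:

```python
# Rows here approximate `nova service-list` output (binary, host, state);
# the real CLI prints a table and exact columns vary by release.
sample_rows = [
    ("nova-scheduler", "controller1", "up"),
    ("nova-conductor", "controller1", "up"),
    ("nova-compute",   "compute1",    "down"),
    ("nova-compute",   "compute2",    "down"),
]

# If every nova-compute service is down, "No valid host" is expected:
down_computes = [host for binary, host, state in sample_rows
                 if binary == "nova-compute" and state != "up"]
print("compute hosts down:", down_computes)
```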
23. 23
Bringing it back together
Taking a look at the function, we can see there is no exception handling:
24. 24
Bringing it back together
By adding some exception handling to the function…
25. 25
Bringing it back together
… we get a nice, clean error that clearly indicates what is wrong
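The screenshots are not reproduced here, but the idea can be sketched in plain Python. The `get_interface_mac` name matches the function seen in the traceback; its body below is invented for illustration. Wrapping the call in a try/except lets the agent log which interface failed:

```python
import logging

LOG = logging.getLogger(__name__)

def get_interface_mac(interface):
    # Invented lookup table standing in for the real ioctl-based lookup
    known = {"eth2": "fa:16:3e:aa:bb:cc"}
    if interface not in known:
        raise IOError("[Errno 19] No such device")
    return known[interface]

def get_interface_mac_checked(interface):
    try:
        return get_interface_mac(interface)
    except IOError:
        # Logging the interface name turns a bare "No such device"
        # traceback into an error that points at the misconfiguration.
        LOG.error("Unable to get MAC address for interface %s: "
                  "no such device", interface)
        raise
```

Called with the typo'd interface name, the log line immediately identifies the offending mapping.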
26. 26
Bringing it back together
Interface mappings can be found in the ML2 configuration file:
If eht2 does not exist on this host, the Neutron agent may be unable to complete
the network configuration.
30. 30
NTP! NTP! NTP!
Wonky behavior caused by inconsistencies in time between hosts
• Services and agents can appear DOWN
when they’re UP
• Service and agent flapping can cause
scheduling issues
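The scheduler-side check can be sketched as follows. The 60-second threshold mirrors Nova's default `service_down_time`; the timestamps are illustrative:

```python
from datetime import datetime, timedelta

# Nova's default service_down_time is 60 seconds
SERVICE_DOWN_TIME = timedelta(seconds=60)

def service_is_up(last_seen, now):
    """A service is 'up' only if its last check-in is recent enough."""
    return (now - last_seen) <= SERVICE_DOWN_TIME

now = datetime(2015, 10, 27, 12, 0, 0)
# The compute node checked in 5 seconds ago by its own clock, but this
# controller's clock is running 4 minutes fast:
last_seen = now - timedelta(minutes=4, seconds=5)
print(service_is_up(last_seen, now))  # False -> node appears DOWN
```

With clocks in sync, the same check-in would land well inside the window; this is why time skew alone can produce "No valid host".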
32. 32
Neutron architecture
Neutron is composed of various services and agents responsible for building and
maintaining the virtual network:
Failures can occur at any point.
34. 34
Neutron architecture
The DHCP agent is responsible for:
• Creating network namespaces
• Configuring dnsmasq – a DHCP
server
When instances are created, IPs are
statically assigned.
35. 35
Neutron architecture
Failures of the DHCP agent on a host can result in:
• Instances not getting their
initial lease
• Instances not renewing
their lease
36. 36
Dnsmasq Basics
As subnets and ports are created, the DHCP agent is responsible for configuring
the files used by dnsmasq to provide DHCP services to the network:
• /var/lib/neutron/dhcp/<network_uuid>/host
When dnsmasq hands out a lease, it updates its active lease database.
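As a sketch, here is how dnsmasq's MAC-to-IP lookup works against that host file. The line format shown is an approximation of what Neutron writes; fields can vary by release:

```python
# Sample lines in the style of Neutron's dnsmasq host file:
# <MAC>,<hostname>,<IP> (an approximation; fields vary by release)
sample = """\
fa:16:3e:aa:bb:cc,host-10-0-0-5.openstacklocal,10.0.0.5
fa:16:3e:dd:ee:ff,host-10-0-0-6.openstacklocal,10.0.0.6
"""

assignments = {}
for line in sample.splitlines():
    mac, hostname, ip = line.split(",")
    assignments[mac] = ip

# dnsmasq answers a DHCPDISCOVER by matching the client's MAC:
print(assignments["fa:16:3e:aa:bb:cc"])  # 10.0.0.5
```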
37. 37
Dnsmasq Basics
By default, dnsmasq writes its logs to:
• /var/log/syslog (Ubuntu, Debian)
• /var/log/messages (RHEL, CentOS, Fedora)
38. 38
Troubleshooting DHCP
If there are issues obtaining an IP, start with packet captures on the following
devices:
• Compute node:
– Tap interface
– Bridge interface
– Physical interface
• Network node:
– Physical interface
– Bridge interface
– Veth interface
– Namespace interface
Listen on UDP ports 67 and 68. You should see the full DHCP cycle in the packet
capture on most interfaces.
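A capture is healthy when it shows the full DISCOVER/OFFER/REQUEST/ACK exchange in order. A small sketch of that check, assuming the message types have already been parsed out of tcpdump output:

```python
# Message types would come from parsed tcpdump output; the sample
# lists below are illustrative.
FULL_CYCLE = ["DHCPDISCOVER", "DHCPOFFER", "DHCPREQUEST", "DHCPACK"]

def cycle_complete(messages):
    """True if the capture contains the full DHCP sequence, in order."""
    remaining = iter(messages)
    return all(step in remaining for step in FULL_CYCLE)

working = ["DHCPDISCOVER", "DHCPOFFER", "DHCPREQUEST", "DHCPACK"]
broken = ["DHCPDISCOVER", "DHCPDISCOVER", "DHCPDISCOVER"]  # unanswered
print(cycle_complete(working), cycle_complete(broken))  # True False
```

Where the sequence breaks off tells you which hop to investigate next: repeated unanswered DISCOVERs point at the switching layer or dnsmasq.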
39. 39
Troubleshooting DHCP – Packet Captures
• Working example:
• Non-working example:
When DHCP isn’t working, investigate the switching layer or dnsmasq.
41. 41
…we’ve got a live bug: DHCPNAK!
42. 42
Troubleshooting DHCP – DHCPNAK Issues
I see DHCPNAK packets. HELP!
• Likely means the DHCP agent was restarted and active lease file deleted
• Instances may receive DHCPNAK when requesting / renewing address
• This may result in delayed or no connectivity
• Addressed in a patch for bug #1345947, which sets dnsmasq to renew the
lease anyway without sending a NAK and to repopulate its lease file
43. 43
Troubleshooting DHCP – DHCPNAK Issues
When a network is scheduled to more than 1 DHCP agent, there may be issues:
• That fix expected only 1 DHCP server in the network!
• The DHCPREQUEST packet sent on renewal attempt is received by all
DHCP agents (it’s a broadcast, after all)
• The renewal attempt is accepted by the agent that provided the original lease
• At the same time, the renewal attempt is rejected by the agent that didn’t
provide the original lease
44. 44
Troubleshooting DHCP – DHCPNAK Issues
The end result? The client honors the DHCPNAK and restarts the DHCP process
45. 45
Troubleshooting DHCP – DHCPNAK Issues
However, there is hope!
• Bug 1457900 addresses the multiple DHCP agent issue
• The fix is to pre-populate the dnsmasq leases file on all DHCP agents with all
known MACs/IPs for respective networks
• Fixed in Liberty, coming to a backport near you!
47. 47
Neutron architecture
The L2 agent is responsible for:
• Programming the virtual switching
infrastructure
• Applying security groups
48. 48
Neutron architecture
Failures of the L2 agent on a host can result in:
• Lack of instance connectivity
• Security group issues
• ERROR state during nova boot
50. 50
Troubleshooting OVS connections
Every interface plugged into the integration bridge
should have a local VLAN ID that is unique to that node,
no matter what the network type (VLAN, flat, local,
VXLAN, GRE):
If the tag is missing, try restarting the OVS agent to
force a rebuild of the integration bridge VLAN tagging
and corresponding flows.
51. 51
Troubleshooting OVS connections
If you see an OVS port in VLAN 4095, it typically means
that the agent was unable to find a corresponding
Neutron port in the database:
When this happens, it usually means that the port was
deleted from the DB manually or as part of another
action that did not complete successfully.
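Both symptoms can be spotted mechanically once `ovs-vsctl show` output has been parsed. A sketch; the port names and tags below are made up for illustration:

```python
# Port-to-tag pairs as they might be parsed from `ovs-vsctl show`;
# names and tags here are invented for illustration.
DEAD_VLAN = 4095

ports = {
    "tap1234abcd": 1,     # healthy: local VLAN tag assigned
    "tap5678efgh": None,  # no tag: restart the OVS agent to rebuild flows
    "tap9abc0def": 4095,  # dead VLAN: Neutron port may be gone from the DB
}

for name, tag in ports.items():
    if tag is None:
        print(name, "-> missing local VLAN tag")
    elif tag == DEAD_VLAN:
        print(name, "-> tagged 4095; check the Neutron DB for the port")
```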
52. 52
Troubleshooting OVS connections
Useful commands include:
• ovs-vsctl show
– High-level view of virtual bridges on the respective node
– Shows local VLAN IDs for each port
• ovs-ofctl dump-flows BRIDGE
– Show the flow rules for the respective bridge
– The flow rules determine how traffic is manipulated and forwarded
• ovs-ofctl show BRIDGE
– Port-level view of respective virtual switch
– Shows port IDs on the bridge. Useful when reading flows.
53. 53
Troubleshooting LinuxBridge connections
When troubleshooting L2
connectivity issues, run packet
captures on the highlighted interfaces:
[Diagram: a compute node and a network node. On each host, instance tap
interfaces (tap0, tap1) plug into a Linux bridge (brqXXXX), which connects
to a VLAN subinterface (eth1.100) on the physical interface eth1 (no IP
address); eth0 carries the MGMT & API IP address. The network node also
hosts the qdhcp and qrouter namespaces, whose ns-/tap- and qr- interfaces
attach to its bridge. Both hosts connect to the physical network switch.]
54. 54
Troubleshooting LinuxBridge connections
In a working environment, every interface will connect to a bridge that
corresponds to a Neutron network:
If a bridge is missing, check the agent log to see if there is an
error.
[Diagram: two Linux bridges on one host, one for Network A (a VXLAN
network) and one for Network B (a VLAN network), each with its own
attached interfaces.]
55. 55
Troubleshooting LinuxBridge connections
Useful commands include:
• brctl show
– High-level view of virtual bridges on the respective node
– One bridge for each network
• bridge fdb show
– Shows the bridge forwarding database
– Useful for knowing how MAC addresses are reached
• ip neigh show
– Shows the ARP cache
56. 56
Binding Failed is back!
• Usually seen when booting instance or attaching interface
• Typically result of Neutron misconfiguration or agent issues
• Not limited to just instance ports
Unexpected vif_type=binding_failed
57. 57
Binding Failed is back!
In this example, both the DHCP and L3 agent ports were in binding_failed status:
58. 58
Binding failed is back!
In this case, a look at the L2 agent log shows the misconfiguration:
If the agent is stopped or in a restart loop, port bindings will likely fail.
59. 59
Binding Failed: The Fallout
For existing DHCP and L3 ports you may need to:
• Fix router port:
– Unschedule tenant network from L3
agent
– Reschedule tenant network to L3
agent
– This creates new port
• Fix DHCP port:
– Unschedule tenant network from DHCP
agent
– Delete DHCP port
– Reschedule tenant network to DHCP agent
– This creates new port
60. 60
L2 agent troubleshooting tips
• Check to make sure the respective L2 agent is configured
properly and is running (not restarting!)
• Make sure OVS is running (if applicable)
• Check the Neutron agent logs
– /var/log/neutron/neutron-*-linuxbridge-agent.log
– /var/log/neutron/neutron-*-openvswitch-agent.log
Tips:
62. 62
Neutron architecture
The L3 agent is responsible for:
• Creating network
namespaces for each
router
• Providing routing between
networks
• Providing NAT to instances
65. 65
L3 agent troubleshooting tips
• Check to make sure the L3 agent is running and configured
properly
• Perform packet captures within the router namespace and other
interfaces to observe traffic entering and leaving the router
• Check iptables within the router namespace to observe the
proper rules have been created
• Check the Neutron L3 agent log:
– /var/log/neutron/l3-agent.log
Tips:
67. 67
MTU
If the plumbing looks good, but you still experience connectivity issues to instances
over certain protocols, it may be worth checking out the MTU size.
• Overlay network header can cause packet to exceed
MTU
• Often manifests itself as SSH issues
• Try ssh -v to see where it hangs
• Pass lower MTU with DHCP option 26
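The arithmetic behind the recommendation: VXLAN encapsulation adds roughly 50 bytes of headers, so the instance MTU must be lowered accordingly. A quick sketch:

```python
# VXLAN adds an outer IP (20) + UDP (8) + VXLAN (8) header plus the
# encapsulated inner Ethernet header (14): roughly 50 bytes total.
VXLAN_OVERHEAD = 20 + 8 + 8 + 14

physical_mtu = 1500
instance_mtu = physical_mtu - VXLAN_OVERHEAD
print(instance_mtu)  # 1450 -- a safe value to push via DHCP option 26
```

Small packets (pings) fit fine either way, which is why the problem only shows up with larger payloads such as the SSH key exchange.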
68. 68
Don’t forget security groups!
• Try applying a test rule
• Test connectivity from a namespace
• Verify iptables on compute nodes
• L2 agents are responsible for applying rules
When things are plumbed up correctly and
everything looks normal, there may be an issue
with security group rules.
69. 69
Neutron architecture
Other issues can only be observed at scale:
• Race conditions
• System limits too low
• No disk space available
• Syslog is your friend
71. 71
Neutron failures
Many common Neutron failures can be traced back to misconfigurations of the:
• Neutron configuration file
• ML2 configuration file
• Interface configuration files
74. 74
Takeaways
• Turn on DEBUG mode
• Check syslog
• Start services by hand
• Start out with simple configurations
• Reach out to community
• Gather as much information as possible before submitting a bug
75. 75
Don’t be afraid to break things
76. 76
Stop by the Rackspace booth in the marketplace
Free book giveaways at the Rackspace booth during the morning and afternoon
breaks!
Just mention that this presentation is geared towards OpenStack operators
JAMES
OpenStack is a complex system with a lot of moving parts. Many things that can go wrong cannot be determined through the API and can only be seen by someone on the infrastructure nodes themselves
Old methods of "turning it off and on again" no longer apply.
If you restart, you may lose valuable information about the state of the environment that could help pinpoint the problem.
OpenStack relies on a collection of python programs to build clouds. When things fail to operate as expected, you might see an error in a log file or the console that says ‘traceback’. What is that?
FIRST BLOCK: An exception is an event that occurs during the execution of a program when an error is encountered. When you TRY something and it fails, you can raise an EXCEPTION.
SECOND BLOCK: The standard Python traceback module produces useful information about where and why an error occurred. That information can greatly assist operators and developers in detecting the cause of the error.
A traceback is the output of an exception that's been raised.
James:
I think we’ve all seen a traceback before
The traceback contains some useful information about the error.
Sometimes it's easy to understand other times... it's not.
No Valid Host is a generic error often seen for a number of different reasons.
There are different conditions depending on the version of OpenStack you are running. Newer versions have better error reporting and lower failure rates.
Let’s take a look at this example. In this example there are a few key things to identify:
The status of the VM - The instance is clearly in an ERROR state.
Also, an IP address has been assigned to this instance.
What do we know by looking at this: the Neutron API is functional enough to assign an IP.
(Do a nova show with the ID, or the instance name if it's unique)
Using nova show, we can see additional details about the instance
The instance name is a key identifier, as that is how virsh identifies the instance on the compute node.
Notice that the instance has been scheduled to a node. When the instance has been scheduled, it's safe to assume that node has met the criteria of the scheduling filters. The instance was scheduled to a node, but the fault area indicates an issue was experienced during the launching of this VM.
The message “there are not enough hosts available” is what is reported to the user, but that message is not terribly helpful when it comes to troubleshooting what happened.
The nova compute logs on the compute node should provide a good indication as to what went wrong. Binding failed?? Now what??
When nova creates a virtual machine instance, it must “plug” each virtual network interface into the virtual switch. The virtual network interface is known as the ‘VIF’. Nova uses drivers (specified in nova.conf) to interface with virtual switches. When Nova is unable to interface with the network agent and properly setup the port, the vif_type is set to ‘binding_failed’ and eventually an error is triggered.
Because the error is related to networking, let’s take a look at the network agent log…
JAMES
In this example, the LinuxBridge agent log is continuously reporting a CRITICAL failure stating NO SUCH DEVICE. But what device??? Let’s dig in a little deeper.
So don’t judge me!
Taking a look at the get_interface_mac function in the traceback, we can see that the interface is passed to the function and there is no exception handling here.
The interface is passed to another function for processing, and that function is likely returning 'No Such Device'. How do we find out what device it is?
By adding some exception handling to the function, we’re able to catch the error and present a more useful error message
As the result of adding some exception handling, or at a minimum, some logging, we can now see that the agent is complaining about interface eht2.
Neutron interacts with interfaces defined in the plugin configuration file. In this case, the ML2 configuration file.
Depending on the driver you will have different options. When the agent is started, it consults the config file for information about bridge mappings and interfaces. When those interfaces in the config don’t actually exist, an error may be generated and the agent will fail to start.
In this example, the physical_interface_mapping was incorrect and pointed to an interface that did not exist on the host. When the agent was started and couldn't find the eht2 interface, the agent kept restarting as a result of the failure.
By changing the interface from eht2 to eth2 and restarting the agent, we were able to successfully create instances on that host.
Let’s take a look at another example.
In this example, we see 'No valid host was found' as the fault. Similar to the last example, but different in a couple of ways. First, no additional info about the failure is provided. In addition, the instance does not appear to have been scheduled to a host.
If there isn’t a host identified in the nova show, the error is likely localized to the scheduler node.
Relevant logs on the controller node can be found in /var/log/nova.
Take a look at the scheduler and conductor logs. In this case, the scheduler log reported ‘no valid host was found’.
Using the nova service-list command, we found that the nova-compute service, at times, was UP and DOWN. In DOWN moments, the failures were observed.
What could cause that??
When a service or agent checks in, the database is updated with the time of check in. Other services, such as the scheduler, depend on that check in time to determine if the service is available.
The scheduler determines the availability of a host by comparing the difference between its local time and the ‘last seen time’ of the compute node. By default, that difference cannot exceed 60 seconds. If it’s greater than 60 seconds, the node is considered ‘unavailable’.
If you have wide variances in time, in this case 4 minutes between controllers, you may see inconsistent behavior in the environment.
So there really WAS no valid host found at that point in time!
Neutron is composed of various services and agents that are responsible for constructing and maintaining the virtual network.
Let’s start with the DHCP agent. When you create an instance, Neutron statically assigns IP addresses to ports associated with those instances.
Instances will then use a DHCP client to obtain that address and configure the interface.
In the standard Neutron architecture, the DHCP agent builds network namespaces for each network that each contain their own dnsmasq process.
Failures of the DHCP agent can result in:
• instances not getting an initial lease
• instances not renewing a lease
The DHCP agent constructs a host file that is used by dnsmasq to provide IPs to clients that ask for one.
When a client sends a DHCPDISCOVER packet, dnsmasq looks to the host file for IP information respective to the client’s MAC address:
Dnsmasq logs the DHCP cycle in syslog. The full lease cycle can be observed in the syslog:
The DHCPDISCOVER is the client requesting an IP. This is a broadcast.
The DHCPOFFER is the server proposing an address. This is unicast.
The DHCPREQUEST is the client requesting the proposed address. This is a broadcast.
The DHCPACK is the server acknowledging the request.
If your instance doesn't get its lease, and applying an IP directly to the interface doesn't work either, consider running packet captures on the following interfaces:
Taps
Bridges
Physical
With the proper tcpdump syntax, you should see all messages on all interfaces.
In a working example, the full DHCP cycle can be observed on the tap interface of the instance
In the non-working example, the DHCPDISCOVER message appears to go unanswered. This may be the result of L2 connectivity issues or issues with dnsmasq, an example of which we’ll cover next.
Now that we know how the DHCP process works, let's talk about a bug that may be impacting a lot of you out there that may not realize it.
If your instance is having issues procuring or renewing a lease, you may see DHCPNAK packets when troubleshooting.
In some releases, when the DHCP agent is restarted the dnsmasq process loses track of leases it has handed out. When this happens, instances that attempt to renew their lease will be met with a NAK packet, causing the DHCP lease cycle to start over. This can result in a brief loss of connectivity as the instance works to procure its IP address again.
A patch was introduced to allow dnsmasq to quietly rebuild its lease file without sending a NAK, but the fix relied on there being only one DHCP server in the network.
In highly-available environments, when the network is scheduled to multiple DHCP agents and the instance attempts to renew its lease, each DHCP server will see the request and all may respond. The renewal attempt is accepted by the agent that provided the original lease, while the agents that did not provide it reject the attempt with a NAK.
The end result is that the client will start the DHCP process over, briefly interrupting connectivity in the process.
The new method of handling this issue is for Neutron to pre-populate the lease DB for each DHCP server, much like it does with the host file. This way, when the agent is restarted, dnsmasq is reloaded with a populated lease database!
The Neutron L2 agents are responsible for programming the virtual switching infrastructure when instances and ports are created.
Failures of the L2 agent can often result in:
• lack of instance connectivity
• Security group issues
• errors booting instances
When using OVS, there are a lot of moving parts. The OVS agent connects instances to bridges, applies security group rules, and maintains flow rules that dictate how traffic is forwarded.
When clients have connectivity issues, it is worth starting with packet captures on the highlighted interfaces, starting with those closest to the instance.
The LinuxBridge agent is a little simpler in its implementation compared to OVS. Again, when clients complain of connectivity issues it is worth performing captures on the highlighted interfaces to see where traffic may be dropped.
If the packet makes it out of a server and is not seen again, it may be necessary to take a look at the physical infrastructure. Improper physical switch configurations are commonly responsible for network issues.
Knowing how a particular agent provides network connectivity across the cloud is important to troubleshooting potential issues.
Sometimes, a restart of the respective L2 agent is needed to rebuild connections and flows that restore connectivity. Knowing what connections should be made and what flows should exist will help you make the call.
As we saw with the Nova example earlier, when Nova or Neutron are not able to determine how to “attach” interfaces to the respective bridge, or there are other issues with the L2 agent on the host, you will often see a ‘binding failed’ error. Common issues are ML2 misconfigurations that can usually be identified by looking at the OVS or LB agent log files.
Oftentimes, the agent may constantly restart until the configuration is corrected. At first glance it appears UP and available, but it isn't able to do its job.
In this example, the user had created a tenant network and attached it to a Neutron router. The Neutron API successfully completed those tasks without error.
However, instances were unable to obtain an IP or hit their gateway when manually configured. Looking at the bridge, we found that neither the DHCP nor the router ports were connected.
A look at the L2 agent log reported that VXLAN had been enabled in the ML2 config, but there was an issue with the specified IP address. A look at the host revealed that the specified IP was not configured on any interface. To solve this problem, I configured the address on an interface and restarted the agent.
If this happens to you, try unscheduling the network from the respective agent and rescheduling.
Failures of the L3 agent can result in:
• failure to route traffic if the Neutron routers have not been created properly or interfaces have not been added
• missing SNAT/DNAT rules in the namespace
• and more
Check logs on the nodes at /var/log/neutron/neutron-l3-agent.log
When floating IPs are associated with an instance/port, there are changes made to iptables within the corresponding router namespace. These rules dictate how traffic is translated when it egresses and ingresses the router.
If and when the agent is exhibiting issues, these rules may not get applied and floating ips will not operate.
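As an illustration of what to look for, the agent installs a DNAT rule for inbound traffic and an SNAT rule for outbound traffic per floating IP. The chain names and rule text below are simplified approximations, not Neutron's exact output:

```python
def floating_ip_rules(floating_ip, fixed_ip):
    """Approximate NAT rules installed in a router namespace for one
    floating IP; chain names and options are illustrative only."""
    return [
        # Ingress: traffic to the floating IP is DNAT'd to the fixed IP
        "-A neutron-l3-agent-PREROUTING -d %s -j DNAT --to %s"
        % (floating_ip, fixed_ip),
        # Egress: traffic from the fixed IP is SNAT'd to the floating IP
        "-A neutron-l3-agent-float-snat -s %s -j SNAT --to %s"
        % (fixed_ip, floating_ip),
    ]

for rule in floating_ip_rules("203.0.113.10", "10.0.0.5"):
    print(rule)
```

If `iptables -t nat -S` inside the namespace shows nothing resembling this pair for a floating IP, the agent has not finished its work.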
If everything looks good, but you notice packet loss or issues with SSH, you may be exceeding the MTU of the interface. This is most often seen when overlay technologies like VXLAN are used. The addition of the overlay header causes the packet to exceed the MTU. You can pass a lower MTU via DHCP with option 26.
If things are plumbed up correctly but you’re still experiencing issues, make sure to verify security group rules are not prohibiting traffic flow.
Try applying a secondary security group to the port that allows limited connectivity (ICMP/SSH) from a particular IP or group. Test connectivity from the DHCP or router namespace, then branch out from there. Test connectivity to the fixed IP before testing the floating IP externally.
Continuous work is being done on enhancing the operation of the L2/L3 agents, but some issues can only be observed at scale:
• race conditions
• default kernel parameters that are too low
If you are experiencing random, unexplainable issues, consult the syslog to see if the system itself is reporting issues.
Often, failures can be traced back to Layer 8 issues – those originating between the keyboard and the chair.
It goes without saying that misconfigured files will definitely cause issues within the environment. A misconfiguration can result in a service not starting at all, or can lead you down a long troubleshooting path thanks to obscure symptoms and messages.
familiarize yourself with the underlying technologies
Do your best to familiarize yourself with a working environment so that you know how to spot an issue.
If you’re new to OpenStack, consider installing an all-in-one distro that will allow you to setup a prescribed environment that you can reverse engineer and learn from.
Examples of this are OSAD and RDO. The docs on openstack.org are also very helpful in setting up basic environments.
Create virtual machines manually.
Create Linux bridges manually. Place physical interfaces and your VM tap interfaces in them.
Create OVS bridges manually. Create a few flows, or just use the NORMAL flow. Assign VLAN tags to ports.
Work to figure out how it all fits together.
Work to get it working. It may not be pretty, but break things and put them back together.
And remember, you’re not alone. There’s a strong community here that is willing to help.