The document discusses troubleshooting common issues in OpenStack, specifically focusing on tracebacks, Nova issues, and Neutron issues. It provides tips on reading tracebacks and diagnosing specific failures related to the Nova scheduler, Neutron DHCP agent, L2 agent, and L3 agent. Key troubleshooting techniques include checking logs, packet captures, and debugging configuration issues. The presenters emphasize becoming familiar with underlying technologies like Open vSwitch, iptables, and Linux bridging to properly diagnose OpenStack problems.
3. 3
What we’re here to talk about…
Troubleshooting OpenStack Issues:
• Tracebacks
• Common Nova issues
• Common Neutron Issues
4. 4
Slides available at SlideShare
These slides will be available at the following location after this presentation:
http://www.slideshare.net/JamesDenton1
8. 8
Traceback 101
• When errors occur, sometimes exceptions are raised.
• When an exception is caught, an error and a list of the functions
that got us to the point of the error are logged. This is a
traceback.
• The traceback output can be useful to operators and
developers and allows them to trace the steps to the error.
• As you’ll see, a traceback doesn’t always provide clear
insight into the real error.
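As a minimal illustration (the MySQL error text is invented to mirror the slide's example), here is how an exception propagates up a call chain and what the resulting traceback looks like:

```python
import traceback

def connect_db():
    # Hypothetical stand-in for a failed MySQL connection attempt
    raise ConnectionError("Can't connect to MySQL server on '127.0.0.1'")

def init_service():
    connect_db()  # the error actually originates one level deeper

try:
    init_service()
except ConnectionError:
    tb = traceback.format_exc()
    print(tb)

# Reading bottom to top: the last line names the real error, and the
# frames above it (init_service -> connect_db) show how we got there.
```

Note how the top of the traceback only tells you an `init_service` call failed; the bottom line is what names the actual problem.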
9. 9
Deciphering a traceback is a bit like reading the Matrix
11. 11
Tips on reading a traceback
Read from the bottom to the top
– The last few lines are the most relevant
In this case, within the init function the program was unable to connect to MySQL.
16. 16
No valid host was found
This error is likely seen when booting an instance. Common reasons for failing:
• There really are no hosts available
• Networking issues on compute node
• Lack of resources
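The first cause above can be checked directly. As a sketch, here is a scan for DOWN compute services; the row format below is a simplification of `nova service-list` output, not the exact CLI layout:

```python
# Rows here approximate `nova service-list` output (binary, host, state);
# the real CLI prints a table and exact columns vary by release.
sample_rows = [
    ("nova-scheduler", "controller1", "up"),
    ("nova-conductor", "controller1", "up"),
    ("nova-compute",   "compute1",    "down"),
    ("nova-compute",   "compute2",    "down"),
]

# If every nova-compute service is down, "No valid host" is expected:
down_computes = [host for binary, host, state in sample_rows
                 if binary == "nova-compute" and state != "up"]
print("compute hosts down:", down_computes)
```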
23. 23
Bringing it back together
Taking a look at the function, we can see there is no exception handling:
24. 24
Bringing it back together
By adding some exception handling to the function…
25. 25
Bringing it back together
… we get a nice, clean error that clearly indicates what is wrong
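The screenshots are not reproduced here, but the idea can be sketched in plain Python. The `get_interface_mac` name matches the function seen in the traceback; its body below is invented for illustration. Wrapping the call in a try/except lets the agent log which interface failed:

```python
import logging

LOG = logging.getLogger(__name__)

def get_interface_mac(interface):
    # Invented lookup table standing in for the real ioctl-based lookup
    known = {"eth2": "fa:16:3e:aa:bb:cc"}
    if interface not in known:
        raise IOError("[Errno 19] No such device")
    return known[interface]

def get_interface_mac_checked(interface):
    try:
        return get_interface_mac(interface)
    except IOError:
        # Logging the interface name turns a bare "No such device"
        # traceback into an error that points at the misconfiguration.
        LOG.error("Unable to get MAC address for interface %s: "
                  "no such device", interface)
        raise
```

Called with the typo'd interface name, the log line immediately identifies the offending mapping.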
26. 26
Bringing it back together
Interface mappings can be found in the ML2 configuration file:
If eht2 does not exist on this host, the Neutron agent may be unable to complete
the network configuration.
30. 30
NTP! NTP! NTP!
Wonky behavior caused by inconsistencies in time between hosts
• Services and agents can appear DOWN
when they’re UP
• Service and agent flapping can cause
scheduling issues
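The scheduler-side check can be sketched as follows. The 60-second threshold mirrors Nova's default `service_down_time`; the timestamps are illustrative:

```python
from datetime import datetime, timedelta

# Nova's default service_down_time is 60 seconds
SERVICE_DOWN_TIME = timedelta(seconds=60)

def service_is_up(last_seen, now):
    """A service is 'up' only if its last check-in is recent enough."""
    return (now - last_seen) <= SERVICE_DOWN_TIME

now = datetime(2015, 10, 27, 12, 0, 0)
# The compute node checked in 5 seconds ago by its own clock, but this
# controller's clock is running 4 minutes fast:
last_seen = now - timedelta(minutes=4, seconds=5)
print(service_is_up(last_seen, now))  # False -> node appears DOWN
```

With clocks in sync, the same check-in would land well inside the window; this is why time skew alone can produce "No valid host".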
32. 32
Neutron architecture
Neutron is composed of various services and agents responsible for building and
maintaining the virtual network:
Failures can occur at any point.
34. 34
Neutron architecture
The DHCP agent is responsible for:
• Creating network namespaces
• Configuring dnsmasq – a DHCP
server
When instances are created, IPs are
statically assigned.
35. 35
Neutron architecture
Failures of the DHCP agent on a host can result in:
• Instances not getting their
initial lease
• Instances not renewing
their lease
36. 36
Dnsmasq Basics
As subnets and ports are created, the DHCP agent is responsible for configuring
the files used by dnsmasq to provide DHCP services to the network:
• /var/lib/neutron/dhcp/<network_uuid>/host
When dnsmasq hands out a lease, it updates its active lease database.
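As a sketch, here is how dnsmasq's MAC-to-IP lookup works against that host file. The line format shown is an approximation of what Neutron writes; fields can vary by release:

```python
# Sample lines in the style of Neutron's dnsmasq host file:
# <MAC>,<hostname>,<IP> (an approximation; fields vary by release)
sample = """\
fa:16:3e:aa:bb:cc,host-10-0-0-5.openstacklocal,10.0.0.5
fa:16:3e:dd:ee:ff,host-10-0-0-6.openstacklocal,10.0.0.6
"""

assignments = {}
for line in sample.splitlines():
    mac, hostname, ip = line.split(",")
    assignments[mac] = ip

# dnsmasq answers a DHCPDISCOVER by matching the client's MAC:
print(assignments["fa:16:3e:aa:bb:cc"])  # 10.0.0.5
```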
37. 37
Dnsmasq Basics
By default, dnsmasq writes its logs to:
• /var/log/syslog (Ubuntu, Debian)
• /var/log/messages (RHEL, CentOS, Fedora)
38. 38
Troubleshooting DHCP
If there are issues obtaining an IP, start with packet captures on the following
devices:
• Compute node:
– Tap interface
– Bridge interface
– Physical interface
• Network node:
– Physical interface
– Bridge interface
– Veth interface
– Namespace interface
Listen on UDP ports 67 and 68. You should see the full DHCP cycle in the packet
capture on most interfaces.
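A capture is healthy when it shows the full DISCOVER/OFFER/REQUEST/ACK exchange in order. A small sketch of that check, assuming the message types have already been parsed out of tcpdump output:

```python
# Message types would come from parsed tcpdump output; the sample
# lists below are illustrative.
FULL_CYCLE = ["DHCPDISCOVER", "DHCPOFFER", "DHCPREQUEST", "DHCPACK"]

def cycle_complete(messages):
    """True if the capture contains the full DHCP sequence, in order."""
    remaining = iter(messages)
    return all(step in remaining for step in FULL_CYCLE)

working = ["DHCPDISCOVER", "DHCPOFFER", "DHCPREQUEST", "DHCPACK"]
broken = ["DHCPDISCOVER", "DHCPDISCOVER", "DHCPDISCOVER"]  # unanswered
print(cycle_complete(working), cycle_complete(broken))  # True False
```

Where the sequence breaks off tells you which hop to investigate next: repeated unanswered DISCOVERs point at the switching layer or dnsmasq.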
39. 39
Troubleshooting DHCP – Packet Captures
• Working example:
• Non-working example:
When DHCP isn’t working, investigate the switching layer or dnsmasq.
41. 41
…we’ve got a live bug: DHCPNAK!
42. 42
Troubleshooting DHCP – DHCPNAK Issues
I see DHCPNAK packets. HELP!
• Likely means the DHCP agent was restarted and active lease file deleted
• Instances may receive DHCPNAK when requesting / renewing address
• This may result in delayed or no connectivity
• Addressed in a patch for bug #1345947, which sets dnsmasq to renew the
lease anyway without sending a NAK and to repopulate its lease file
43. 43
Troubleshooting DHCP – DHCPNAK Issues
When a network is scheduled to more than 1 DHCP agent, there may be issues:
• That fix expected only 1 DHCP server in the network!
• The DHCPREQUEST packet sent on renewal attempt is received by all
DHCP agents (it’s a broadcast, after all)
• The renewal attempt is accepted by the agent that provided the original lease
• At the same time, the renewal attempt is rejected by the agent that didn’t
provide the original lease
44. 44
Troubleshooting DHCP – DHCPNAK Issues
The end result? The client honors the DHCPNAK and restarts the DHCP process
45. 45
Troubleshooting DHCP – DHCPNAK Issues
However, there is hope!
• Bug 1457900 addresses the multiple DHCP agent issue
• The fix is to pre-populate the dnsmasq leases file on all DHCP agents with all
known MACs/IPs for respective networks
• Fixed in Liberty, coming to a backport near you!
47. 47
Neutron architecture
The L2 agent is responsible for:
• Programming the virtual switching
infrastructure
• Applying security groups
48. 48
Neutron architecture
Failures of the L2 agent on a host can result in:
• Lack of instance connectivity
• Security group issues
• ERROR state during nova boot
50. 50
Troubleshooting OVS connections
Every interface plugged into the integration bridge
should have a local VLAN ID that is unique to that node,
no matter what the network type (VLAN, flat, local,
VXLAN, GRE):
If the tag is missing, try restarting the OVS agent to
force a rebuild of the integration bridge VLAN tagging
and corresponding flows.
51. 51
Troubleshooting OVS connections
If you see an OVS port in VLAN 4095, it typically means
that the agent was unable to find a corresponding
Neutron port in the database:
When this happens, it usually means that the port was
deleted from the DB manually or as part of another
action that did not complete successfully.
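Both symptoms can be spotted mechanically once `ovs-vsctl show` output has been parsed. A sketch; the port names and tags below are made up for illustration:

```python
# Port-to-tag pairs as they might be parsed from `ovs-vsctl show`;
# names and tags here are invented for illustration.
DEAD_VLAN = 4095

ports = {
    "tap1234abcd": 1,     # healthy: local VLAN tag assigned
    "tap5678efgh": None,  # no tag: restart the OVS agent to rebuild flows
    "tap9abc0def": 4095,  # dead VLAN: Neutron port may be gone from the DB
}

for name, tag in ports.items():
    if tag is None:
        print(name, "-> missing local VLAN tag")
    elif tag == DEAD_VLAN:
        print(name, "-> tagged 4095; check the Neutron DB for the port")
```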
52. 52
Troubleshooting OVS connections
Useful commands include:
• ovs-vsctl show
– High-level view of virtual bridges on the respective node
– Shows local VLAN IDs for each port
• ovs-ofctl dump-flows BRIDGE
– Show the flow rules for the respective bridge
– The flow rules determine how traffic is manipulated and forwarded
• ovs-ofctl show BRIDGE
– Port-level view of respective virtual switch
– Shows port IDs on the bridge. Useful when reading flows.
53. 53
Troubleshooting LinuxBridge connections
When troubleshooting L2
connectivity issues, run packet
captures on the highlighted interfaces:
[Diagram: a compute node and a network node. On each host, instance tap
interfaces (tap0, tap1) plug into a Linux bridge (brqXXXX), which connects
to a VLAN subinterface (eth1.100) on the physical interface eth1 (no IP
address); eth0 carries the MGMT & API IP address. The network node also
hosts the qdhcp and qrouter namespaces, whose ns-/tap- and qr- interfaces
attach to its bridge. Both hosts connect to the physical network switch.]
54. 54
Troubleshooting LinuxBridge connections
In a working environment, every interface will connect to a bridge that
corresponds to a Neutron network:
If a bridge is missing, check the agent log to see if there is an
error.
[Diagram: two Linux bridges on one host, one for Network A (a VXLAN
network) and one for Network B (a VLAN network), each with its own
attached interfaces.]
55. 55
Troubleshooting LinuxBridge connections
Useful commands include:
• brctl show
– High-level view of virtual bridges on the respective node
– One bridge for each network
• bridge fdb show
– Shows the bridge forwarding database
– Useful for knowing how MAC addresses are reached
• ip neigh show
– Shows the ARP cache
56. 56
Binding Failed is back!
• Usually seen when booting instance or attaching interface
• Typically result of Neutron misconfiguration or agent issues
• Not limited to just instance ports
Unexpected vif_type=binding_failed
57. 57
Binding Failed is back!
In this example, both the DHCP and L3 agent ports were in binding_failed status:
58. 58
Binding failed is back!
In this case, a look at the L2 agent log shows the misconfiguration:
If the agent is stopped or in a restart loop, port bindings will likely fail.
59. 59
Binding Failed: The Fallout
For existing DHCP and L3 ports you may need to:
• Fix router port:
– Unschedule tenant network from L3
agent
– Reschedule tenant network to L3
agent
– This creates new port
• Fix DHCP port:
– Unschedule tenant network from DHCP
agent
– Delete DHCP port
– Reschedule tenant network to DHCP agent
– This creates new port
60. 60
L2 agent troubleshooting tips
• Check to make sure the respective L2 agent is configured
properly and is running (not restarting!)
• Make sure OVS is running (if applicable)
• Check the Neutron agent logs
– /var/log/neutron/neutron-*-linuxbridge-agent.log
– /var/log/neutron/neutron-*-openvswitch-agent.log
Tips:
62. 62
Neutron architecture
The L3 agent is responsible for:
• Creating network
namespaces for each
router
• Providing routing between
networks
• Providing NAT to instances
65. 65
L3 agent troubleshooting tips
• Check to make sure the L3 agent is running and configured
properly
• Perform packet captures within the router namespace and other
interfaces to observe traffic entering and leaving the router
• Check iptables within the router namespace to observe the
proper rules have been created
• Check the Neutron L3 agent log:
– /var/log/neutron/l3-agent.log
Tips:
67. 67
MTU
If the plumbing looks good, but you still experience connectivity issues to instances
over certain protocols, it may be worth checking out the MTU size.
• Overlay network header can cause packet to exceed
MTU
• Often manifests itself as SSH issues
• Try ssh -v to see where it hangs
• Pass lower MTU with DHCP option 26
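The arithmetic behind the recommendation: VXLAN encapsulation adds roughly 50 bytes of headers, so the instance MTU must be lowered accordingly. A quick sketch:

```python
# VXLAN adds an outer IP (20) + UDP (8) + VXLAN (8) header plus the
# encapsulated inner Ethernet header (14): roughly 50 bytes total.
VXLAN_OVERHEAD = 20 + 8 + 8 + 14

physical_mtu = 1500
instance_mtu = physical_mtu - VXLAN_OVERHEAD
print(instance_mtu)  # 1450 -- a safe value to push via DHCP option 26
```

Small packets (pings) fit fine either way, which is why the problem only shows up with larger payloads such as the SSH key exchange.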
68. 68
Don’t forget security groups!
• Try applying a test rule
• Test connectivity from a namespace
• Verify iptables on compute nodes
• L2 agents are responsible for applying rules
When things are plumbed up correctly and
everything looks normal, there may be an issue
with security group rules.
69. 69
Neutron architecture
Other issues can only be observed at scale:
• Race conditions
• System limits too low
• No disk space available
• Syslog is your friend
71. 71
Neutron failures
Many common Neutron failures can be traced back to misconfigurations of the:
• Neutron configuration file
• ML2 configuration file
• Interface configuration files
74. 74
Takeaways
• Turn on DEBUG mode
• Check syslog
• Start services by hand
• Start out with simple configurations
• Reach out to community
• Gather as much information as possible before submitting a bug
75. 75
Don’t be afraid to break things
76. 76
Stop by the Rackspace booth in the marketplace
Free book giveaways at the Rackspace booth during the morning and afternoon
breaks!
Just mention that this presentation is geared towards OpenStack operators
JAMES
OpenStack is a complex system with a lot of moving parts. Many things that can go wrong cannot be determined through the API and can only be seen by someone on the infrastructure nodes themselves
Old methods of "turning it off and on again" no longer apply.
If you restart, you may lose valuable information about the state of the environment that could help pinpoint the problem.
OpenStack relies on a collection of python programs to build clouds. When things fail to operate as expected, you might see an error in a log file or the console that says ‘traceback’. What is that?
FIRST BLOCK: An exception is an event that occurs during the execution of a program when an error is encountered. When you TRY something and it fails, you can raise an EXCEPTION.
SECOND BLOCK: The standard Python traceback module produces useful information about where and why an error occurred. That information can greatly assist operators and developers in detecting the cause of the error.
A traceback is the output of an exception that's been raised.
James:
I think we’ve all seen a traceback before
The traceback contains some useful information about the error.
Sometimes it's easy to understand other times... it's not.
No Valid Host is a generic error often seen for a number of different reasons.
There are different conditions depending on the version of OpenStack you are running. Newer versions have better error reporting and lower failure rates.
Let’s take a look at this example. In this example there are a few key things to identify:
The status of the VM - The instance is clearly in an ERROR state.
Also, an IP address has been assigned to this instance.
What do we know by looking at this: the Neutron API is functional enough to assign an IP.
(Do a nova show with the ID, or the instance name if it's unique)
Using nova show, we can see additional details about the instance
The instance name is a key identifier, as that is how virsh identifies the instance on the compute node.
Notice that the instance has been scheduled to a node. When the instance has been scheduled, it's safe to assume that node has met the criteria of the scheduling filters. The instance was scheduled to a node, but the fault area indicates an issue was experienced during the launching of this VM.
The message “there are not enough hosts available” is what is reported to the user, but that message is not terribly helpful when it comes to troubleshooting what happened.
The nova compute logs on the compute node should provide a good indication as to what went wrong. Binding failed?? Now what??
When nova creates a virtual machine instance, it must “plug” each virtual network interface into the virtual switch. The virtual network interface is known as the ‘VIF’. Nova uses drivers (specified in nova.conf) to interface with virtual switches. When Nova is unable to interface with the network agent and properly setup the port, the vif_type is set to ‘binding_failed’ and eventually an error is triggered.
Because the error is related to networking, let’s take a look at the network agent log…
JAMES
In this example, the LinuxBridge agent log is continuously reporting a CRITICAL failure stating NO SUCH DEVICE. But what device??? Let’s dig in a little deeper.
So don’t judge me!
Taking a look at the get_interface_mac function in the traceback, we can see that the interface is passed to the function and there is no exception handling here.
The interface is passed to another function for processing, and that function is likely returning 'No Such Device'. How do we find out what device it is?
By adding some exception handling to the function, we’re able to catch the error and present a more useful error message
As the result of adding some exception handling, or at a minimum, some logging, we can now see that the agent is complaining about interface eht2.
Neutron interacts with interfaces defined in the plugin configuration file. In this case, the ML2 configuration file.
Depending on the driver you will have different options. When the agent is started, it consults the config file for information about bridge mappings and interfaces. When those interfaces in the config don’t actually exist, an error may be generated and the agent will fail to start.
In this example, the physical_interface_mapping was incorrect and pointed to an interface that did not exist on the host. When the agent was started and couldn't find the eht2 interface, the agent kept restarting as a result of the failure.
By changing the interface from eht2 to eth2 and restarting the agent, we were able to successfully create instances on that host.
Let’s take a look at another example.
In this example, we see 'No valid host was found' as the fault. Similar to the last example, but different in a couple of ways. First, no additional info about the failure is provided. In addition, the instance does not appear to have been scheduled to a host.
If there isn’t a host identified in the nova show, the error is likely localized to the scheduler node.
Relevant logs on the controller node can be found in /var/log/nova.
Take a look at the scheduler and conductor logs. In this case, the scheduler log reported ‘no valid host was found’.
Using the nova service-list command, we found that the nova-compute service, at times, was UP and DOWN. In DOWN moments, the failures were observed.
What could cause that??
When a service or agent checks in, the database is updated with the time of check in. Other services, such as the scheduler, depend on that check in time to determine if the service is available.
The scheduler determines the availability of a host by comparing the difference between its local time and the ‘last seen time’ of the compute node. By default, that difference cannot exceed 60 seconds. If it’s greater than 60 seconds, the node is considered ‘unavailable’.
If you have wide variances in time, in this case 4 minutes between controllers, you may see inconsistent behavior in the environment.
So there really WAS no valid host found at that point in time!
Neutron is composed of various services and agents that are responsible for constructing and maintaining the virtual network.
Let’s start with the DHCP agent. When you create an instance, Neutron statically assigns IP addresses to ports associated with those instances.
Instances will then use a DHCP client to obtain that address and configure the interface.
In the standard Neutron architecture, the DHCP agent builds network namespaces for each network that each contain their own dnsmasq process.
Failures of the DHCP agent can result in:
• instances not getting an initial lease
• instances not renewing a lease
The DHCP agent constructs a host file that is used by dnsmasq to provide IPs to clients that ask for one.
When a client sends a DHCPDISCOVER packet, dnsmasq looks to the host file for IP information respective to the client’s MAC address:
Dnsmasq logs the DHCP cycle in syslog. The full lease cycle can be observed in the syslog:
The DHCPDISCOVER is the client requesting an IP. This is a broadcast.
The DHCPOFFER is the server proposing an address. This is unicast.
The DHCPREQUEST is the client requesting the proposed address. This is a broadcast.
The DHCPACK is the server acknowledging the request.
If your instance doesn't get its lease, and applying an IP directly to the interface doesn't work either, consider running packet captures on the following interfaces:
Taps
Bridges
Physical
With the proper tcpdump syntax, you should see all messages on all interfaces.
In a working example, the full DHCP cycle can be observed on the tap interface of the instance
In the non-working example, the DHCPDISCOVER message appears to go unanswered. This may be the result of L2 connectivity issues or issues with dnsmasq, an example of which we’ll cover next.
Now that we know how the DHCP process works, let's talk about a bug that may be impacting a lot of you out there that may not realize it.
If your instance is having issues procuring or renewing a lease, you may see DHCPNAK packets when troubleshooting.
In some releases, when the DHCP agent is restarted the dnsmasq process loses track of leases it has handed out. When this happens, instances that attempt to renew their lease will be met with a NAK packet, causing the DHCP lease cycle to start over. This can result in a brief loss of connectivity as the instance works to procure its IP address again.
A patch was introduced to allow dnsmasq to quietly rebuild its lease file without sending a NAK, but the fix relied on there being only one DHCP server in the network.
In highly-available environments, when the network is scheduled to multiple DHCP agents and the instance attempts to renew its lease, each DHCP server will see the request and all may respond. The renewal attempt is accepted by the agent that provided the original lease, while the agents that did not provide it reject the attempt with a NAK.
The end result is that the client will start the DHCP process over, briefly interrupting connectivity in the process.
The new method of handling this issue is for Neutron to pre-populate the lease DB for each DHCP server, much like it does with the host file. This way, when the agent is restarted, dnsmasq is reloaded with a populated lease database!
The Neutron L2 agents are responsible for programming the virtual switching infrastructure when instances and ports are created.
Failures of the L2 agent can often result in:
• lack of instance connectivity
• Security group issues
• errors booting instances
When using OVS, there are a lot of moving parts. The OVS agent connects instances to bridges, applies security group rules, and maintains flow rules that dictate how traffic is forwarded.
When clients have connectivity issues, it is worth starting with packet captures on the highlighted interfaces, starting with those closest to the instance.
The LinuxBridge agent is a little simpler in its implementation compared to OVS. Again, when clients complain of connectivity issues it is worth performing captures on the highlighted interfaces to see where traffic may be dropped.
If the packet makes it out of a server and is not seen again, it may be necessary to take a look at the physical infrastructure. Improper physical switch configurations are commonly responsible for network issues.
Knowing how a particular agent provides network connectivity across the cloud is important to troubleshooting potential issues.
Sometimes, a restart of the respective L2 agent is needed to rebuild connections and flows that restore connectivity. Knowing what connections should be made and what flows should exist will help you make the call.
As we saw with the Nova example earlier, when Nova or Neutron are not able to determine how to “attach” interfaces to the respective bridge, or there are other issues with the L2 agent on the host, you will often see a ‘binding failed’ error. Common issues are ML2 misconfigurations that can usually be identified by looking at the OVS or LB agent log files.
Oftentimes, the agent may constantly restart until the configuration is corrected. At first glance it appears UP and available, but it isn't able to do its job.
In this example, the user had created a tenant network and attached it to a Neutron router. The Neutron API successfully completed those tasks without error.
However, instances were unable to obtain an IP or hit their gateway when manually configured. Looking at the bridge, we found that neither the DHCP nor the router ports were connected.
A look at the L2 agent log reported that VXLAN had been enabled in the ML2 config, but there was an issue with the specified IP address. A look at the host revealed that the specified IP was not configured on any interface. To solve this problem, I configured the address on an interface and restarted the agent.
If this happens to you, try unscheduling the network from the respective agent and rescheduling.
Failures of the L3 agent can result in:
• failure to route traffic if the Neutron routers have not been created properly or interfaces have not been added
• missing SNAT/DNAT rules in the namespace
• and more
Check logs on the nodes at /var/log/neutron/neutron-l3-agent.log
When floating IPs are associated with an instance/port, there are changes made to iptables within the corresponding router namespace. These rules dictate how traffic is translated when it egresses and ingresses the router.
If and when the agent is exhibiting issues, these rules may not get applied and floating ips will not operate.
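As an illustration of what to look for, the agent installs a DNAT rule for inbound traffic and an SNAT rule for outbound traffic per floating IP. The chain names and rule text below are simplified approximations, not Neutron's exact output:

```python
def floating_ip_rules(floating_ip, fixed_ip):
    """Approximate NAT rules installed in a router namespace for one
    floating IP; chain names and options are illustrative only."""
    return [
        # Ingress: traffic to the floating IP is DNAT'd to the fixed IP
        "-A neutron-l3-agent-PREROUTING -d %s -j DNAT --to %s"
        % (floating_ip, fixed_ip),
        # Egress: traffic from the fixed IP is SNAT'd to the floating IP
        "-A neutron-l3-agent-float-snat -s %s -j SNAT --to %s"
        % (fixed_ip, floating_ip),
    ]

for rule in floating_ip_rules("203.0.113.10", "10.0.0.5"):
    print(rule)
```

If `iptables -t nat -S` inside the namespace shows nothing resembling this pair for a floating IP, the agent has not finished its work.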
If everything looks good, but you notice packet loss or issues with SSH, you may be exceeding the MTU of the interface. This is most often seen when overlay technologies like VXLAN are used. The addition of the overlay header causes the packet to exceed the MTU. You can pass a lower MTU via DHCP with option 26.
If things are plumbed up correctly but you’re still experiencing issues, make sure to verify security group rules are not prohibiting traffic flow.
Try applying a secondary security group to the port that allows limited connectivity (ICMP/SSH) from a particular IP or group. Test connectivity from the DHCP or router namespace, then branch out from there. Test connectivity to the fixed IP before testing the floating IP externally.
Continuous work is being done on enhancing the operation of the L2/L3 agents, but some issues can only be observed at scale:
• race conditions
• default kernel parameters that are too low
If you are experiencing random, unexplainable issues, consult the syslog to see if the system itself is reporting issues.
Often, failures can be traced back to Layer 8 issues – those originating between the keyboard and the chair.
It goes without saying that misconfigured files will definitely cause issues within the environment. A misconfiguration can result in a service not starting at all, or can lead you down a long troubleshooting path thanks to obscure symptoms and messages.
familiarize yourself with the underlying technologies
Do your best to familiarize yourself with a working environment so that you know how to spot an issue.
If you’re new to OpenStack, consider installing an all-in-one distro that will allow you to setup a prescribed environment that you can reverse engineer and learn from.
Examples of this are OSAD and RDO. The docs on openstack.org are also very helpful in setting up basic environments.
Create virtual machines manually.
Create Linux bridges manually. Place physical interfaces and your VM tap interfaces in them.
Create OVS bridges manually. Create a few flows, or just use the NORMAL flow. Assign VLAN tags to ports.
Work to figure out how it all fits together.
Work to get it working. It may not be pretty, but break things and put them back together.
And remember, you’re not alone. There’s a strong community here that is willing to help.