NO VALID HOST WAS FOUND
Troubleshooting tracebacks and other
common failure scenarios
2
The presenters…
WADE
LEWIS
OpenStack
Architect
JAMES
DENTON
Principal
Architect
3
What we’re here to talk about…
Troubleshooting OpenStack Issues:
• Tracebacks
• Common Nova issues
• Common Neutron issues
4
Slides available at SlideShare
These slides will be available at the following location after this presentation:
http://www.slideshare.net/JamesDenton1
5
OpenStack is complex
OpenStack is a complex system:
• Many moving parts
• Limited visibility to problems via API
6
Troubleshooting methods
What is a traceback?
8
Traceback 101
• When errors occur, sometimes exceptions are raised.
• When an exception is caught, the error and the list of function calls
that led to it are logged. This is a traceback.
• The traceback output can be useful to operators and
developers and allows them to trace the steps to the error.
• As you’ll see, a traceback doesn’t always provide clear
insight into the real error.
9
Deciphering a traceback is a bit like reading the Matrix
11
Tips on reading a traceback
Read from the bottom to the top
– The last few lines are the most relevant
In this case, within the init function the program was unable to connect to MySQL.
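The bottom-up reading tip can be tried out with Python's standard traceback module. The following is a minimal sketch, not code from the talk: the Engine class, the DatabaseConnectionError type, and the database URL are hypothetical stand-ins for a service that fails to reach MySQL inside its init function.

```python
import traceback


class DatabaseConnectionError(Exception):
    """Hypothetical stand-in for a MySQL connection failure."""


class Engine:
    def __init__(self, url):
        # Simulate the failure described above: the init function
        # cannot reach the database.
        raise DatabaseConnectionError(f"Can't connect to MySQL server at {url}")


def start_service():
    # An outer caller; it shows up near the top of the traceback.
    return Engine("mysql://db.example.invalid/nova")


def summarize(exc):
    """Return the last traceback line and the innermost function name."""
    frames = traceback.extract_tb(exc.__traceback__)
    innermost = frames[-1]  # the frame closest to the error
    last_line = traceback.format_exception_only(type(exc), exc)[-1].strip()
    return last_line, innermost.name


try:
    start_service()
except DatabaseConnectionError as exc:
    last_line, function = summarize(exc)
    print(f"{function}: {last_line}")
```

The summary keeps only what sits at the bottom of the traceback: the exception message and the function that raised it.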
Nova
14
“No valid host was found. What the heck does that mean?!”
16
No valid host was found
This error is likely seen when booting an instance. Common reasons for failing:
• There really are no hosts available
• Networking issues on compute node
• Lack of resources
17
So you spun up an instance…
18
Identify the host. If there is one…
19
Check the compute logs on the compute node
20
Check the networking logs on the compute node
21
Bringing it back together
!! WARNING !!
The following example may not utilize Python or Neutron coding best
practices.
22
Bringing it back together
Let’s take a look at that traceback:
23
Bringing it back together
Taking a look at the function, we can see there is no exception handling:
24
Bringing it back together
By adding some exception handling to the function…
25
Bringing it back together
… we get a nice, clean error that clearly indicates what is wrong
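The function shown on the slides is not reproduced in this transcript, so the following is only a sketch of the same idea. The names read_mac_unhandled, read_mac, and InterfaceMappingError are hypothetical, an empty temporary directory stands in for /sys/class/net, and the interface name eht2 is borrowed from the ML2 mapping discussed on the next slide.

```python
import os
import tempfile


class InterfaceMappingError(Exception):
    """Hypothetical error type, used only in this illustration."""


def read_mac_unhandled(interface, sysfs_root):
    # No exception handling: a missing interface escapes as a raw
    # FileNotFoundError and ends up in the log as a traceback.
    with open(os.path.join(sysfs_root, interface, "address")) as handle:
        return handle.read().strip()


def read_mac(interface, sysfs_root):
    # Same lookup, but the low-level error is translated into a message
    # that names the likely cause.
    try:
        return read_mac_unhandled(interface, sysfs_root)
    except FileNotFoundError as exc:
        raise InterfaceMappingError(
            f"Interface {interface} does not exist on this host; "
            "check the interface mappings in the ML2 configuration file"
        ) from exc


# An empty temporary directory stands in for /sys/class/net on a host
# that has no interface named eht2.
with tempfile.TemporaryDirectory() as fake_sysfs:
    try:
        read_mac("eht2", fake_sysfs)
    except InterfaceMappingError as exc:
        message = str(exc)

print(f"ERROR: {message}")
```

Chaining with "from exc" keeps the original traceback available for developers while operators get the one-line explanation.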
26
Bringing it back together
Interface mappings can be found in the ML2 configuration file:
If eht2 does not exist on this host, the Neutron agent may be unable to complete
the network configuration.
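A mapping like this can be sanity-checked before the agent is started. The sketch below assumes a comma-separated physnet:interface value, as used by the LinuxBridge agent's physical_interface_mappings option. The sample value and the set of interfaces are made up for illustration. On a real Linux host the interface names could be read from /sys/class/net.

```python
def parse_interface_mappings(value):
    """Parse 'physnet1:eth1,physnet2:eth2' into {physnet: interface}."""
    mappings = {}
    for entry in value.split(","):
        entry = entry.strip()
        if not entry:
            continue
        physnet, _, interface = entry.partition(":")
        mappings[physnet.strip()] = interface.strip()
    return mappings


def missing_interfaces(mappings, present):
    """Return mapped interfaces that do not exist on the host."""
    return sorted(iface for iface in mappings.values() if iface not in present)


# Hypothetical configuration value containing the typo from the slide.
mappings = parse_interface_mappings("physnet1:eht2")

# Hypothetical set of interfaces found on the host.
present = {"lo", "eth0", "eth1", "eth2"}

print(missing_interfaces(mappings, present))  # prints ['eht2']
```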
27
Next example: When there isn’t a host…
28
Check the scheduler and conductor logs
• /var/log/nova/nova-scheduler.log
• /var/log/nova/nova-conductor.log
29
First pass:
2015-10-16 17:14:18
Second pass:
2015-10-16 17:10:10
30
NTP! NTP! NTP!
Wonky behavior caused by inconsistencies in time between hosts
• Services and agents can appear DOWN
when they’re UP
• Service and agent flapping can cause
scheduling issues
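Why clock skew makes a healthy service look DOWN can be shown with a simplified liveness check. This is a sketch under stated assumptions, not Nova's implementation: the 60-second threshold is assumed, and the two timestamps from the previous slide are reused as stand-ins for the clocks of two hosts.

```python
from datetime import datetime, timedelta

# Assumed liveness threshold for this illustration.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def appears_up(last_heartbeat, controller_now, down_time=SERVICE_DOWN_TIME):
    """A service looks alive if its last heartbeat is recent enough."""
    return abs(controller_now - last_heartbeat) <= down_time


controller_clock = datetime(2015, 10, 16, 17, 14, 18)
compute_clock = datetime(2015, 10, 16, 17, 10, 10)  # about four minutes behind

# The compute node has just reported in, but the heartbeat carries the
# timestamp of its own (skewed) clock.
skew = controller_clock - compute_clock
print(skew.total_seconds())                            # prints 248.0
print(appears_up(compute_clock, controller_clock))     # prints False
print(appears_up(controller_clock, controller_clock))  # prints True
```

With the clocks in sync the same heartbeat passes the check, which is why the service appears to flap as time drifts in and out of tolerance.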
Neutron
32
Neutron architecture
Neutron is composed of various services and agents responsible for building and
maintaining the virtual network:
Failures can occur at any point.
DHCP Agent
34
Neutron architecture
The DHCP agent is responsible for:
• Creating network namespaces
• Configuring dnsmasq – a DHCP
server
When instances are created, IPs are
statically assigned.
35
Neutron architecture
Failures of the DHCP agent on a host can result in:
• Instances not getting their
initial lease
• Instances not renewing
their lease
36
Dnsmasq Basics
As subnets and ports are created, the DHCP agent is responsible for configuring
the files used by dnsmasq to provide DHCP services to the network:
• /var/lib/neutron/dhcp/<network_uuid>/host
When dnsmasq hands out the lease, it updates its active lease database.
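To confirm that the agent has written an entry for a given port, the host file can be parsed and searched by MAC address. The sketch below assumes one comma-separated entry per line in the order MAC, hostname, IP. That layout and the sample entries are assumptions for illustration and should be checked against a real file.

```python
def parse_host_file(text):
    """Map each MAC address to its (hostname, ip) entry."""
    entries = {}
    for line in text.splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) < 3:
            continue  # skip blank or unexpected lines
        mac, hostname, ip = fields[0], fields[1], fields[2]
        entries[mac.lower()] = (hostname, ip)
    return entries


# Made-up sample content in the assumed layout.
sample = """\
fa:16:3e:00:00:01,host-10-0-0-5.openstacklocal,10.0.0.5
fa:16:3e:00:00:02,host-10-0-0-6.openstacklocal,10.0.0.6
"""

hosts = parse_host_file(sample)
print(hosts.get("fa:16:3e:00:00:01"))
print("fa:16:3e:00:00:99" in hosts)  # prints False
```

A MAC address that is missing from the file suggests the agent has not processed the port yet.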
37
Dnsmasq Basics
By default, dnsmasq writes its logs to:
• /var/log/syslog (Ubuntu, Debian)
• /var/log/messages (RHEL, CentOS, Fedora)
38
Troubleshooting DHCP
If there are issues obtaining an IP, start with packet captures on the following
devices:
• Compute node:
– Tap interface
– Bridge interface
– Physical interface
• Network node:
– Physical interface
– Bridge interface
– Veth interface
– Namespace interface
Listen on UDP ports 67 and 68. You should see the full DHCP cycle in the packet
capture on most interfaces.
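Once a capture has been decoded, checking for the full cycle is mechanical. The sketch below uses the DHCP message type codes from option 53 (RFC 2132). The helper names and the sample capture are hypothetical, and the step of extracting type codes from packets is left out.

```python
# DHCP message types as carried in option 53 (RFC 2132).
DHCP_MESSAGE_TYPES = {
    1: "DISCOVER", 2: "OFFER", 3: "REQUEST", 4: "DECLINE",
    5: "ACK", 6: "NAK", 7: "RELEASE", 8: "INFORM",
}

# The exchange expected when an instance obtains its first lease.
FULL_CYCLE = ("DISCOVER", "OFFER", "REQUEST", "ACK")


def missing_steps(type_codes):
    """Return the steps of the full cycle that were not observed."""
    seen = {DHCP_MESSAGE_TYPES.get(code) for code in type_codes}
    return [step for step in FULL_CYCLE if step not in seen]


# Hypothetical capture on a tap interface: the instance keeps asking,
# but no reply ever makes it back.
tap_capture = [1, 1, 1]
print(missing_steps(tap_capture))   # prints ['OFFER', 'REQUEST', 'ACK']
print(missing_steps([1, 2, 3, 5]))  # prints []
```

Comparing the result interface by interface shows where along the path the replies stop.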
39
Troubleshooting DHCP – Packet Captures
• Working example:
• Non-working example:
When DHCP isn’t working, investigate the switching layer or dnsmasq.
Now we know how it works…
40
… we’ve got a live bug: DHCPNAK!
41
42
Troubleshooting DHCP – DHCPNAK Issues
I see DHCPNAK packets. HELP!
• Likely means the DHCP agent was restarted and the active lease file was deleted
• Instances may receive a DHCPNAK when requesting or renewing an address
• This may result in delayed or no connectivity
• Addressed in the patch for bug #1345947, which has dnsmasq renew the
lease anyway, without sending a NAK, and repopulate its lease file
43
Troubleshooting DHCP – DHCPNAK Issues
When a network is scheduled to more than 1 DHCP agent, there may be issues:
• That fix expected only 1 DHCP server in the network!
• The DHCPREQUEST packet sent on renewal attempt is received by all
DHCP agents (it’s a broadcast, after all)
• The renewal attempt is accepted by the agent that provided the original lease
• At the same time, the renewal attempt is rejected by the agent that didn’t
provide the original lease
44
Troubleshooting DHCP – DHCPNAK Issues
The end result? The client honors the DHCPNAK and restarts the DHCP process.
45
Troubleshooting DHCP – DHCPNAK Issues
However, there is hope!
• Bug 1457900 addresses the multiple DHCP agent issue
• The fix is to pre-populate the dnsmasq leases file on all DHCP agents with all
known MACs/IPs for respective networks
• Fixed in Liberty, coming to a backport near you!
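The multi-agent behavior and the effect of pre-populating the leases can be modeled with a toy simulation. This is a deliberately simplified sketch, not dnsmasq or Neutron code. The DhcpAgent class, the agent names, and the addresses are all hypothetical.

```python
class DhcpAgent:
    """Toy model of a DHCP server that only knows its own lease table."""

    def __init__(self, name, leases=None):
        self.name = name
        self.leases = dict(leases or {})  # MAC -> IP

    def handle_request(self, mac, ip):
        # Acknowledge renewals for known leases, reject everything else.
        if self.leases.get(mac) == ip:
            return "DHCPACK"
        return "DHCPNAK"


def renew(agents, mac, ip):
    # The renewal is a broadcast, so every agent on the network answers.
    return {agent.name: agent.handle_request(mac, ip) for agent in agents}


mac, ip = "fa:16:3e:00:00:01", "10.0.0.5"

# Only the agent that issued the lease knows about it.
before = renew([DhcpAgent("agent-a", {mac: ip}), DhcpAgent("agent-b")], mac, ip)

# Every agent's lease table is pre-populated with the known MAC/IP pairs.
after = renew(
    [DhcpAgent("agent-a", {mac: ip}), DhcpAgent("agent-b", {mac: ip})], mac, ip
)

print(before)  # prints {'agent-a': 'DHCPACK', 'agent-b': 'DHCPNAK'}
print(after)   # prints {'agent-a': 'DHCPACK', 'agent-b': 'DHCPACK'}
```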
L2 Agent
47
Neutron architecture
The L2 agent is responsible for:
• Programming the virtual switching
infrastructure
• Applying security groups
48
Neutron architecture
Failures of the L2 agent on a host can result in:
• Lack of instance connectivity
• Security group issues
• ERROR state during nova boot
49
Troubleshooting OVS connections
When troubleshooting L2 connectivity issues,
run packet captures on highlighted
interfaces:
50
Troubleshooting OVS connections
Every interface plugged into the integration bridge
should have a local VLAN ID that is unique to that node,
no matter what the network type (VLAN, flat, local,
VXLAN, GRE):
If the tag is missing, try restarting the OVS agent to
force a rebuild of the integration bridge VLAN tagging
and corresponding flows.
51
Troubleshooting OVS connections
If you see an OVS port in VLAN 4095, it typically means
that the agent was unable to find a corresponding
Neutron port in the database:
When this happens, it usually means that the port was
deleted from the DB manually or as part of another
action that did not complete successfully.
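Ports parked in VLAN 4095 can be picked out of the ovs-vsctl show output with a few lines of Python. The sketch below parses a made-up excerpt whose port names are hypothetical. On a real node the text would come from running the command itself.

```python
import re

# Illustrative excerpt in the style of "ovs-vsctl show"; names are made up.
SAMPLE_OUTPUT = '''\
    Bridge br-int
        Port "tap6f5fb5a0-4e"
            tag: 1
            Interface "tap6f5fb5a0-4e"
        Port "tapdeadbeef-00"
            tag: 4095
            Interface "tapdeadbeef-00"
        Port br-int
            Interface br-int
                type: internal
'''


def port_tags(output):
    """Map each port name to its local VLAN tag (None if untagged)."""
    tags = {}
    current = None
    for line in output.splitlines():
        stripped = line.strip()
        port = re.match(r'Port "?([^"]+)"?$', stripped)
        if port:
            current = port.group(1)
            tags[current] = None
            continue
        tag = re.match(r"tag: (\d+)$", stripped)
        if tag and current is not None:
            tags[current] = int(tag.group(1))
    return tags


dead_ports = [name for name, tag in port_tags(SAMPLE_OUTPUT).items() if tag == 4095]
print(dead_ports)  # prints ['tapdeadbeef-00']
```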
52
Troubleshooting OVS connections
Useful commands include:
• ovs-vsctl show
– High-level view of virtual bridges on the respective node
– Shows local VLAN IDs for each port
• ovs-ofctl dump-flows BRIDGE
– Show the flow rules for the respective bridge
– The flow rules determine how traffic is manipulated and forwarded
• ovs-ofctl show BRIDGE
– Port-level view of respective virtual switch
– Shows port IDs on the bridge. Useful when reading flows.
53
Troubleshooting LinuxBridge connections
When troubleshooting L2 connectivity issues, run packet captures on highlighted interfaces:
[Diagram: a network node and a compute node connected through a physical network switch. Each node has eth0 (IP address for MGMT & API), eth1 (no IP address), the VLAN subinterface eth1.100, and a Linux bridge brqXXXX. On the network node, the qdhcp and qrouter namespaces attach to the bridge via tap2xxxx/ns-2xxxx and tap1xxxx/qr-1xxxx. On the compute node (KVM), VM0 and VM1 (eth0) attach via tap0 and tap1.]
54
Troubleshooting LinuxBridge connections
In a working environment, every interface will connect to a bridge that
corresponds to a Neutron network:
[Figure: bridges for Network A (VXLAN network) and Network B (VLAN network)]
If a bridge is missing, check the agent log to see if there is an
error.
55
Troubleshooting LinuxBridge connections
Useful commands include:
• brctl show
– High-level view of virtual bridges on the respective node
– One bridge for each network
• bridge fdb show
– Shows the bridge forwarding database
– Useful for knowing how MAC addresses are reached
• ip neigh show
– Shows the ARP cache
56
Binding Failed is back!
• Usually seen when booting instance or attaching interface
• Typically result of Neutron misconfiguration or agent issues
• Not limited to just instance ports
Unexpected
vif_type=binding_failed
57
Binding Failed is back!
In this example, both the DHCP and L3 agent ports were in binding_failed status:
58
Binding failed is back!
In this case, a look at the L2 agent log shows the misconfiguration:
If the agent is stopped or in a restart loop, port bindings will likely fail.
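Ports stuck in this state can be picked out of a port listing by filtering on the binding:vif_type attribute. The sketch below works on plain dictionaries shaped like Neutron port records. The IDs and host names are made up, and fetching the ports from the API is left out.

```python
def failed_bindings(ports):
    """Return (id, device_owner, host) for ports whose binding failed."""
    return [
        (port["id"], port.get("device_owner"), port.get("binding:host_id"))
        for port in ports
        if port.get("binding:vif_type") == "binding_failed"
    ]


# Illustrative data only; IDs and host names are hypothetical.
ports = [
    {"id": "port-1", "device_owner": "network:dhcp",
     "binding:host_id": "network01", "binding:vif_type": "binding_failed"},
    {"id": "port-2", "device_owner": "network:router_interface",
     "binding:host_id": "network01", "binding:vif_type": "binding_failed"},
    {"id": "port-3", "device_owner": "compute:nova",
     "binding:host_id": "compute01", "binding:vif_type": "ovs"},
]

for port_id, owner, host in failed_bindings(ports):
    print(port_id, owner, host)
```

Grouping the failures by host points at the node whose L2 agent needs attention.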
59
Binding Failed: The Fallout
For existing DHCP and L3 ports you may need to:
• Fix router port:
– Unschedule tenant network from L3
agent
– Reschedule tenant network to L3
agent
– This creates new port
• Fix DHCP port:
– Unschedule tenant network from DHCP
agent
– Delete DHCP port
– Reschedule tenant network to DHCP agent
– This creates new port
60
L2 agent troubleshooting tips
• Check to make sure the respective L2 agent is configured
properly and is running (not restarting!)
• Make sure OVS is running (if applicable)
• Check the Neutron agent logs
– /var/log/neutron/neutron-*-linuxbridge-agent.log
– /var/log/neutron/neutron-*-openvswitch-agent.log
L3 Agent
62
Neutron architecture
The L3 agent is responsible for:
• Creating network
namespaces for each
router
• Providing routing between
networks
• Providing NAT to instances
63
Neutron architecture
Failures of the L3 agent on a host can result in:
• Failure to route traffic
• Floating IPs not
functioning
64
Neutron architecture
65
L3 agent troubleshooting tips
• Check to make sure the L3 agent is running and configured
properly
• Perform packet captures within the router namespace and other
interfaces to observe traffic entering and leaving the router
• Check iptables within the router namespace to observe the
proper rules have been created
• Check the Neutron L3 agent log:
– /var/log/neutron/l3-agent.log
More Neutron…
67
MTU
If the plumbing looks good, but you still experience connectivity issues to instances
over certain protocols, it may be worth checking out the MTU size.
• Overlay network header can cause packet to exceed
MTU
• Often manifests itself as SSH issues
• Try ssh -v to see where it hangs
• Pass lower MTU with DHCP option 26
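The arithmetic behind the lower MTU is simple to work out. The sketch below assumes VXLAN over IPv4 on a 1500-byte physical network. The header sizes are the standard ones for that encapsulation, and the printed dnsmasq line is just one assumed way of handing out option 26.

```python
# Bytes added to every encapsulated packet (VXLAN over IPv4).
VXLAN_OVERHEAD = {
    "outer IPv4 header": 20,
    "outer UDP header": 8,
    "VXLAN header": 8,
    "inner Ethernet header": 14,
}


def instance_mtu(physical_mtu, overhead=VXLAN_OVERHEAD):
    """Largest instance MTU that still fits after encapsulation."""
    return physical_mtu - sum(overhead.values())


mtu = instance_mtu(1500)
print(mtu)  # prints 1450

# DHCP option 26 is the interface MTU; in a dnsmasq configuration file
# it could be forced like this (assumed deployment detail).
print(f"dhcp-option-force=26,{mtu}")  # prints dhcp-option-force=26,1450
```

Other overlay types carry different overhead, so the table needs adjusting for GRE or for IPv6 underlays.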
68
Don’t forget security groups!
When things are plumbed up correctly and everything looks normal, there may be
an issue with security group rules.
• Try applying a test rule
• Test connectivity from a namespace
• Verify iptables on compute nodes
• L2 agents are responsible for applying rules
69
Neutron architecture
Other issues can only be observed at scale:
• Race conditions
• System limits too low
• No disk space available
• Syslog is your friend
Takeaways
71
Neutron failures
Many common Neutron failures can be traced back to misconfigurations of the:
• Neutron configuration file
• ML2 configuration file
• Interface configuration files
72
Takeaways
Get familiar with the underlying technologies:
• KVM
• Open vSwitch
• Linux bridging
• iptables
73
Takeaways
Familiarize yourself with a working environment
so that you know how to spot an issue.
74
Takeaways
• Turn on DEBUG mode
• Check syslog
• Start services by hand
• Start out with simple configurations
• Reach out to community
• Gather as much information as possible before submitting a bug
75
Don’t be afraid to break things
76
Stop by the Rackspace booth in the marketplace
Free book giveaways at the Rackspace booth during the morning and afternoon
breaks!
ONE FANATICAL PLACE | SAN ANTONIO, TX 78218
US SALES: 1-800-961-2888 | US SUPPORT: 1-800-961-4454 | WWW.RACKSPACE.COM
© RACKSPACE LTD. | RACKSPACE® AND FANATICAL SUPPORT® ARE SERVICE MARKS OF RACKSPACE US, INC. REGISTERED IN THE UNITED STATES AND OTHER COUNTRIES. | WWW.RACKSPACE.COM
Thank you

Gen AI in Business - Global Trends Report 2024.pdf
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Troubleshooting Tracebacks

  • 1. NO VALID HOST WAS FOUND: Troubleshooting tracebacks and other common failure scenarios
  • 3. What we’re here to talk about… Troubleshooting OpenStack issues: • Tracebacks • Common Nova issues • Common Neutron issues
  • 4. Slides available at SlideShare: These slides will be available at the following location after this presentation: http://www.slideshare.net/JamesDenton1
  • 5. OpenStack is complex: OpenStack is a complex system: • Many moving parts • Limited visibility to problems via API
  • 7. What is a traceback?
  • 8. Traceback 101 • When errors occur, sometimes exceptions are raised. • When an exception is caught, an error and a list of functions that got us to the point of the error are logged. This is a traceback. • The traceback output can be useful to operators and developers and allows them to trace the steps to the error. • As you’ll see, a traceback doesn’t always provide clear insight into the real error.
  • 9. Deciphering a traceback is a bit like reading the Matrix
  • 10. Deciphering a traceback is a bit like reading the Matrix
  • 11. Tips on reading a traceback: Read from the bottom to the top. The last few lines are the most relevant. In this case, within the init function the program was unable to connect to MySQL.
  • 12. Slides available at SlideShare: These slides will be available at the following location after this presentation: http://www.slideshare.net/JamesDenton1
  • 13. Nova
  • 14. “No valid host was found. What the heck does that mean?!”
  • 15. No valid host was found
  • 16. No valid host was found: This error is likely seen when booting an instance. Common reasons for failing: • There really are no hosts available • Networking issues on compute node • Lack of resources
  • 17. So you spun up an instance..
  • 18. Identify the host. If there is one…
  • 19. Check the compute logs on the compute node
  • 20. Check the networking logs on the compute node
  • 21. Bringing it back together: !! WARNING !! The following example may not utilize Python or Neutron coding best practices.
  • 22. Bringing it back together: Let’s take a look at that traceback:
  • 23. Bringing it back together: Taking a look at the function, we can see there is no exception handling:
  • 24. Bringing it back together: By adding some exception handling to the function…
  • 25. Bringing it back together: … we get a nice, clean error that clearly indicates what is wrong
  • 26. Bringing it back together: Interface mappings can be found in the ML2 configuration file: If eht2 does not exist on this host, the Neutron agent may be unable to complete the network configuration.
  • 27. Next example: When there isn’t a host…
  • 28. Check the scheduler and conductor logs • /var/log/nova/nova-scheduler.log • /var/log/nova/nova-conductor.log
  • 29. First pass: 2015-10-16 17:14:18 Second pass: 2015-10-16 17:10:10
  • 30. NTP! NTP! NTP! Wonky behavior caused by inconsistencies in time between hosts • Services and agents can appear DOWN when they’re UP • Service and agent flapping can cause scheduling issues
  • 32. Neutron architecture: Neutron is composed of various services and agents responsible for building and maintaining the virtual network: Failures can occur at any point.
  • 34. Neutron architecture: The DHCP agent is responsible for: • Creating network namespaces • Configuring dnsmasq – a DHCP server When instances are created, IPs are statically assigned.
  • 35. Neutron architecture: Failures of the DHCP agent on a host can result in: • Instances not getting their initial lease • Instances not renewing their lease
  • 36. Dnsmasq Basics: As subnets and ports are created, the DHCP agent is responsible for configuring the files used by dnsmasq to provide DHCP services to the network: • /var/lib/neutron/dhcp/<network_uuid>/host When dnsmasq hands out the lease, it updates its active lease database.
  • 37. Dnsmasq Basics: By default, dnsmasq writes its logs to: • /var/log/syslog (Ubuntu, Debian) • /var/log/messages (RHEL, CentOS, Fedora)
  • 38. Troubleshooting DHCP: If there are issues obtaining an IP, start with packet captures on the following devices: • Compute node: – Tap interface – Bridge interface – Physical interface • Network node: – Physical interface – Bridge interface – Veth interface – Namespace interface Listen on UDP ports 67 and 68. You should see the full DHCP cycle in the packet capture on most interfaces.
  • 39. Troubleshooting DHCP – Packet Captures • Working example: • Non-working example: When DHCP isn’t working, investigate the switching layer or dnsmasq.
  • 40. Now we know how it works…
  • 41. … we’ve got a live bug: DHCPNAK!
  • 42. Troubleshooting DHCP – DHCPNAK Issues: I see DHCPNAK packets. HELP! • Likely means the DHCP agent was restarted and active lease file deleted • Instances may receive DHCPNAK when requesting / renewing address • This may result in delayed or no connectivity • Addressed in patch for bug #1345947, which sets dnsmasq to renew the lease anyway without sending a NAK and repopulate its lease file
  • 43. Troubleshooting DHCP – DHCPNAK Issues: When a network is scheduled to more than 1 DHCP agent, there may be issues: • That fix expected only 1 DHCP server in the network! • The DHCPREQUEST packet sent on renewal attempt is received by all DHCP agents (it’s a broadcast, after all) • The renewal attempt is accepted by the agent that provided the original lease • At the same time, the renewal attempt is rejected by the agent that didn’t provide the original lease
  • 44. Troubleshooting DHCP – DHCPNAK Issues: The end result? The client honors the DHCPNAK and restarts the DHCP process
  • 45. Troubleshooting DHCP – DHCPNAK Issues: However, there is hope! • Bug 1457900 addresses the multiple DHCP agent issue • The fix is to pre-populate the dnsmasq leases file on all DHCP agents with all known MACs/IPs for respective networks • Fixed in Liberty, coming to a backport near you!
  • 47. Neutron architecture: The L2 agent is responsible for: • Programming the virtual switching infrastructure • Applying security groups
  • 48. Neutron architecture: Failures of the L2 agent on a host can result in: • Lack of instance connectivity • Security group issues • ERROR state during nova boot
  • 49. Troubleshooting OVS connections: When troubleshooting L2 connectivity issues, run packet captures on highlighted interfaces:
  • 50. Troubleshooting OVS connections: Every interface plugged into the integration bridge should have a local VLAN ID that is unique to that node, no matter what the network type (VLAN, flat, local, VXLAN, GRE): If the tag is missing, try restarting the OVS agent to force a rebuild of the integration bridge VLAN tagging and corresponding flows.
  • 51. Troubleshooting OVS connections: If you see an OVS port in VLAN 4095, it typically means that the agent was unable to find a corresponding Neutron port in the database: When this happens, it usually means that the port was deleted from the DB manually or as part of another action that did not complete successfully.
  • 52. Troubleshooting OVS connections: Useful commands include: • ovs-vsctl show – High-level view of virtual bridges on the respective node – Shows local VLAN IDs for each port • ovs-ofctl dump-flows BRIDGE – Show the flow rules for the respective bridge – The flow rules determine how traffic is manipulated and forwarded • ovs-ofctl show BRIDGE – Port-level view of respective virtual switch – Shows port IDs on the bridge. Useful when reading flows.
  • 53. Troubleshooting LinuxBridge connections: When troubleshooting L2 connectivity issues, run packet captures on highlighted interfaces: [Diagram: a network node (qdhcp and qrouter namespaces attached through tap2xxxx/ns-2xxxx and tap1xxxx/qr-1xxxx) and a compute node (KVM instances VM0 and VM1 on tap0 and tap1). On each node, eth0 carries the IP address for MGMT & API, and a brqXXXX Linux bridge attaches to eth1.100 on eth1 (no IP address), which connects to the physical network switch.]
  • 54. Troubleshooting LinuxBridge connections: In a working environment, every interface will connect to a bridge that corresponds to a Neutron network: [Figure labels: Network A (VXLAN network), Network B (VLAN network)] If a bridge is missing, check the agent log to see if there is an error.
  • 55. Troubleshooting LinuxBridge connections: Useful commands include: • brctl show – High-level view of virtual bridges on the respective node – One bridge for each network • bridge fdb show – Shows the bridge forwarding database – Useful for knowing how MAC addresses are reached • ip neigh show – Shows the ARP cache
  • 56. Binding Failed is back! • Usually seen when booting instance or attaching interface • Typically result of Neutron misconfiguration or agent issues • Not limited to just instance ports Unexpected vif_type=binding_failed
  • 57. Binding Failed is back! In this example, both the DHCP and L3 agent ports were in binding_failed status:
  • 58. Binding Failed is back! In this case, a look at the L2 agent log shows the misconfiguration: If the agent is stopped or in a restart loop, port bindings will likely fail.
  • 59. Binding Failed: The Fallout: For existing DHCP and L3 ports you may need to: • Fix router port: – Unschedule tenant network from L3 agent – Reschedule tenant network to L3 agent – This creates new port • Fix DHCP port: – Unschedule tenant network from DHCP agent – Delete DHCP port – Reschedule tenant network to DHCP agent – This creates new port
  • 60. L2 agent troubleshooting tips: • Check to make sure the respective L2 agent is configured properly and is running (not restarting!) • Make sure OVS is running (if applicable) • Check the Neutron agent logs – /var/log/neutron/neutron-*-linuxbridge-agent.log – /var/log/neutron/neutron-*-openvswitch-agent.log
  • 62. Neutron architecture: The L3 agent is responsible for: • Creating network namespaces for each router • Providing routing between networks • Providing NAT to instances
  • 63. Neutron architecture: Failures of the L3 agent on a host can result in: • Failure to route traffic • Floating IPs not functioning
  • 65. L3 agent troubleshooting tips: • Check to make sure the L3 agent is running and configured properly • Perform packet captures within the router namespace and other interfaces to observe traffic entering and leaving the router • Check iptables within the router namespace to observe the proper rules have been created • Check the Neutron L3 agent log: – /var/log/neutron/l3-agent.log
  • 67. MTU: If the plumbing looks good, but you still experience connectivity issues to instances over certain protocols, it may be worth checking out the MTU size. • Overlay network header can cause packet to exceed MTU • Often manifests itself as SSH issues • Try ssh -v to see where it hangs • Pass lower MTU with DHCP option 26
  • 68. Don’t forget security groups! When things are plumbed up correctly and everything looks normal, there may be an issue with security group rules. • Try applying a test rule • Test connectivity from a namespace • Verify iptables on compute nodes • L2 agents are responsible for applying rules
  • 69. Neutron architecture: Other issues can only be observed at scale: • Race conditions • System limits too low • No disk space available • Syslog is your friend
  • 71. Neutron failures: Many common Neutron failures can be traced back to misconfigurations of the: • Neutron configuration file • ML2 configuration file • Interface configuration files
  • 72. Takeaways: Get familiar with the underlying technologies: • KVM • Open vSwitch • Linux bridging • IPtables
  • 73. Takeaways: Familiarize yourself with a working environment so that you know how to spot an issue.
  • 74. Takeaways: • Turn on DEBUG mode • Check syslog • Start services by hand • Start out with simple configurations • Reach out to community • Gather as much information as possible before submitting a bug
  • 75. Don’t be afraid to break things
  • 76. Stop by the Rackspace booth in the marketplace: Free book giveaways at the Rackspace booth during the morning and afternoon breaks!
  • 77. Slides available at SlideShare: These slides will be available at the following location after this presentation: http://www.slideshare.net/JamesDenton1
  • 78. Thank you. ONE FANATICAL PLACE | SAN ANTONIO, TX 78218 | US SALES: 1-800-961-2888 | US SUPPORT: 1-800-961-4454 | WWW.RACKSPACE.COM | © RACKSPACE LTD. | RACKSPACE® AND FANATICAL SUPPORT® ARE SERVICE MARKS OF RACKSPACE US, INC. REGISTERED IN THE UNITED STATES AND OTHER COUNTRIES.
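
Slides 23 to 25 describe adding exception handling so that the error names the missing interface, and slide 11 advises reading a traceback from the bottom up. The sketch below illustrates both ideas together. It is not the Neutron code shown in the talk: `HOST_INTERFACES`, `_read_mac`, `get_interface_mac_unhandled`, and `last_traceback_line` are hypothetical names introduced here, and the dictionary lookup merely stands in for whatever low-level call the agent actually makes.

```python
import errno
import logging
import os
import traceback

LOG = logging.getLogger(__name__)

# Hypothetical stand-in for the interfaces present on a host. A real agent
# would ask the kernel instead of consulting a dictionary.
HOST_INTERFACES = {"eth2": "fa:16:3e:00:00:02"}


def _read_mac(interface):
    """Return the MAC of an interface, or raise ENODEV if it does not exist."""
    if interface not in HOST_INTERFACES:
        raise OSError(errno.ENODEV, os.strerror(errno.ENODEV))
    return HOST_INTERFACES[interface]


def get_interface_mac_unhandled(interface):
    """No exception handling: the resulting error never names the interface."""
    return _read_mac(interface)


def get_interface_mac(interface):
    """Catch the low-level error and re-raise it with the interface name."""
    try:
        return _read_mac(interface)
    except OSError as exc:
        LOG.error("Unable to read MAC address of interface %s: %s",
                  interface, exc.strerror)
        raise RuntimeError(
            "Interface %s not found on this host: %s" % (interface, exc.strerror)
        ) from exc


def last_traceback_line(func, *args):
    """Call func and return the final line of its traceback, if it raises."""
    try:
        func(*args)
    except Exception:
        return traceback.format_exc().strip().splitlines()[-1]
    return None


if __name__ == "__main__":
    # Read from the bottom: only the second message says which device is missing.
    print(last_traceback_line(get_interface_mac_unhandled, "eht2"))
    print(last_traceback_line(get_interface_mac, "eht2"))
```

With the unhandled version, the last line of the traceback reports only the operating system error. With the wrapped version, the last line includes the misspelled interface name, which is the detail an operator needs in order to check the interface mappings.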

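The MTU point on slide 67 comes down to simple arithmetic. As a rough sketch, assuming VXLAN over an IPv4 underlay with no VLAN tag on the inner frame, the encapsulation adds 50 bytes to every packet. The function names and the 1500-byte physical MTU below are illustrative assumptions, not values taken from the talk.

```python
# Per-packet bytes added by VXLAN over IPv4, counted against the underlay MTU.
# Assumes an untagged inner Ethernet frame and an IPv4 outer header.
VXLAN_OVERHEAD_IPV4 = {
    "outer_ipv4_header": 20,
    "outer_udp_header": 8,
    "vxlan_header": 8,
    "inner_ethernet_header": 14,
}


def instance_mtu(physical_mtu, overhead=VXLAN_OVERHEAD_IPV4):
    """Largest instance MTU that still fits in the underlay once encapsulated."""
    return physical_mtu - sum(overhead.values())


def fits(ip_packet_size, physical_mtu):
    """True if an instance IP packet of this size survives encapsulation."""
    return ip_packet_size <= instance_mtu(physical_mtu)


if __name__ == "__main__":
    print(instance_mtu(1500))   # MTU to hand to instances on a 1500-byte underlay
    print(fits(1500, 1500))     # a full-size packet from an unadjusted instance
```

Under these assumptions, an instance left at the default 1500-byte MTU produces packets that no longer fit once encapsulated, which is consistent with SSH sessions that connect and then hang. The lower value is what would be passed to instances with DHCP option 26, as the slide suggests.
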
Editor's notes

  1. Just mention that this presentation is geared towards openstack operators
  3. JAMES OpenStack is a complex system with a lot of moving parts. Many things that can go wrong cannot be determined through the API and can only be seen by someone on the infrastructure nodes themselves
  4. Old methods of "turning it off and on again" no longer apply. If you restart, you may lose valuable information about the environment that
  5. OpenStack relies on a collection of python programs to build clouds. When things fail to operate as expected, you might see an error in a log file or the console that says ‘traceback’. What is that?
  6. FIRST BLOCK: An exception is an event that occurs during the execution of a program when an error is encountered. When you TRY something and it fails, you can raise an EXCEPTION. SECOND BLOCK: The standard Python traceback module produces useful information about where and why an error occurred. That information can greatly assist operators and developers in detecting the cause of the error. A traceback is the output of an exception that's been raised.
  7. James: I think we’ve all seen a traceback before
  8. The traceback contains some useful information about the error. Sometimes it's easy to understand other times... it's not.
  9. Just mention that this presentation is geared towards openstack operators
  10. No Valid Host is a generic error often seen for a number of different reasons.
  11. There are different conditions depending on the version of OpenStack you are running. Newer versions have better error reporting and lower failure rates.
  12. Let’s take a look at this example. In this example there are a few key things to identify: The status of the VM: the instance is clearly in an ERROR state. Also, an IP address has been assigned to this instance. What do we know by looking at this? The Neutron API is functional enough to assign an IP. (Do a nova show with the ID, or with the instance name if it’s unique.)
  13. Using nova show, we can see additional details about the instance. The instance name is a key identifier, as that is how virsh identifies the instance on the compute node. Notice that the instance has been scheduled to a node. When the instance has been scheduled, it’s safe to assume that node has met the criteria of the scheduling filter. The instance was scheduled to a node, but the fault area indicates an issue was experienced during the launching of this VM. The message “there are not enough hosts available” is what is reported to the user, but that message is not terribly helpful when it comes to troubleshooting what happened.
  14. The nova compute logs on the compute node should provide a good indication as to what went wrong. Binding failed?? Now what?? When nova creates a virtual machine instance, it must “plug” each virtual network interface into the virtual switch. The virtual network interface is known as the ‘VIF’. Nova uses drivers (specified in nova.conf) to interface with virtual switches. When Nova is unable to interface with the network agent and properly setup the port, the vif_type is set to ‘binding_failed’ and eventually an error is triggered. Because the error is related to networking, let’s take a look at the network agent log…
  15. JAMES In this example, the LinuxBridge agent log is continuously reporting a CRITICAL failure stating NO SUCH DEVICE. But what device??? Let’s dig in a little deeper.
  16. So don’t judge me!
  17. Taking a look at the get_interface_mac function in the traceback, we can see that the interface is passed to the function and there is no exception handling here. The interface is passed to another function for processing, and that function is likely returning 'No Such Device'. How do we find out what device it is?
  18. By adding some exception handling to the function, we’re able to catch the error and present a more useful error message
  19. As the result of adding some exception handling, or at a minimum, some logging, we can now see that the agent is complaining about interface eht2.
  20. Neutron interacts with interfaces defined in the plugin configuration file. In this case, the ML2 configuration file. Depending on the driver you will have different options. When the agent is started, it consults the config file for information about bridge mappings and interfaces. When those interfaces in the config don’t actually exist, an error may be generated and the agent will fail to start. In this example, the physical_interface_mapping was incorrect and pointed to an interface that did not exist on the host. When the agent was started and couldn’t find the eht2 interface, the agent kept restarting as a result of the failure. By changing the interface from eht2 to eth2 and restarting the agent, we were able to successfully create instances on that host.
  21. Let’s take a look at another example. In this example, we see 'No valid host was found' as the fault. Similar to the last example, but different in a couple of ways. First, no additional info about the failure is provided. In addition, the instance does not appear to have been scheduled to a host. If there isn’t a host identified in the nova show, the error is likely localized to the scheduler node.
  22. Relevant logs on the controller node can be found in /var/log/nova. Take a look at the scheduler and conductor logs. In this case, the scheduler log reported ‘no valid host was found’.
  23. Using the nova service-list command, we found that the nova-compute service, at times, was UP and DOWN. In DOWN moments, the failures were observed. What could cause that??
  24. When a service or agent checks in, the database is updated with the time of check in. Other services, such as the scheduler, depend on that check in time to determine if the service is available. The scheduler determines the availability of a host by comparing the difference between its local time and the ‘last seen time’ of the compute node. By default, that difference cannot exceed 60 seconds. If it’s greater than 60 seconds, the node is considered ‘unavailable’. If you have wide variances in time, in this case 4 minutes between controllers, you may see inconsistent behavior in the environment. So there really WAS no valid host found at that point in time!
  25. Neutron is composed of various services and agents that are responsible for constructing and maintaining the virtual network.
  26. Let’s start with the DHCP agent. When you create an instance, Neutron statically assigns IP addresses to ports associated with those instances. Instances will then use a DHCP client to obtain that address and configure the interface. In the standard Neutron architecture, the DHCP agent builds network namespaces for each network that each contain their own dnsmasq process.
  27. Failures of the DHCP agent can result in: • instances not getting an initial lease • instances not renewing a lease
  28. The DHCP agent constructs a host file that is used by dnsmasq to provide IPs to clients that ask for one. When a client sends a DHCPDISCOVER packet, dnsmasq looks to the host file for IP information respective to the client’s MAC address:
  29. Dnsmasq logs the DHCP cycle in syslog. The full lease cycle can be observed in the syslog: The DHCPDISCOVER is the client requesting an IP; this is a broadcast. The DHCPOFFER is the server proposing an address; this is unicast. The DHCPREQUEST is the client requesting the proposed address; this is a broadcast. The DHCPACK is the server acknowledging the request.
  30. If your instance doesn’t get its lease, and applying an IP directly to the interface doesn’t work either, consider running packet captures on the following interfaces: taps, bridges, and physical interfaces. With the proper tcpdump syntax, you should see all messages on all interfaces.
  31. In a working example, the full DHCP cycle can be observed on the tap interface of the instance In the non-working example, the DHCPDISCOVER message appears to go unanswered. This may be the result of L2 connectivity issues or issues with dnsmasq, an example of which we’ll cover next.
  32. Now that we know how the DHCP process works, let's talk about a bug that may be impacting a lot of you out there that may not realize it.
  34. If your instance is having issues procuring or renewing a lease, you may see DHCPNAK packets when troubleshooting. In some releases, when the DHCP agent is restarted the dnsmasq process loses track of leases it has handed out. When this happens, instances that attempt to renew their lease will be met with a NAK packet, causing the DHCP lease cycle to start over. This can result in a brief loss of connectivity as the instance works to procure its IP address again. A patch was introduced to allow dnsmasq to quietly rebuild its lease file without sending a NAK, but the fix relied on there being only one DHCP server in the network.
  35. In highly-available environments, when the network is scheduled to multiple DHCP agents and the instance attempts to renew its lease, each DHCP server will see the request and all may respond. The renewal attempt is accepted by the agent that provided the original lease, while the agents that didn’t provide it reject the renewal attempt with a NAK.
  36. The end result is that the client will start the DHCP process over, briefly interrupting connectivity in the process.
  37. The new method of handling this issue is for Neutron to pre-populate the lease DB for each DHCP server, much like it does with the host file. This way, when the agent is restarted, dnsmasq is reloaded with a populated lease database!
  38. The Neutron L2 agents are responsible for programming the virtual switching infrastructure when instances and ports are created.
  39. Failures of the L2 agent can often result in: • lack of instance connectivity • Security group issues • errors booting instances
  40. When using OVS, there are a lot of moving parts. The OVS agent connects instances to bridges, applies security group rules, and maintains flow rules that dictate how traffic is forwarded. When clients have connectivity issues, it is worth starting with packet captures on the highlighted interfaces, starting with those closest to the instance.
  41. The linuxbridge agent is a little simpler in its implementation compared to OVS. Again, when clients complain of connectivity issues it is worth performing captures on the highlighted interfaces to see where traffic may be dropped. If the packet makes it out of a server and is not seen again, it may be necessary to take a look at the physical infrastructure. Improper physical switch configurations are commonly responsible for network issues. Knowing how a particular agent provides network connectivity across the cloud is important to troubleshooting potential issues. Sometimes, a restart of the respective L2 agent is needed to rebuild connections and flows that restore connectivity. Knowing what connections should be made and what flows should exist will help you make the call.
  42. As we saw with the Nova example earlier, when Nova or Neutron are not able to determine how to “attach” interfaces to the respective bridge, or there are other issues with the L2 agent on the host, you will often see a ‘binding failed’ error. Common issues are ML2 misconfigurations that can usually be identified by looking at the OVS or LB agent log files. Often, the agent may constantly restart until the configuration is corrected. At first glance it appears UP and available, but it isn’t able to do its job.
  43. In this example, the user had created a tenant network and attached it to a Neutron router. The Neutron API successfully completed those tasks without error. However, instances were unable to obtain an IP or hit their gateway when manually configured. Looking at the bridge, we found that neither the DHCP nor the router ports were connected.
  44. A look at the L2 agent log reported that VXLAN had been enabled in the ML2 config, but there was an issue with the specified IP address. A look at the host revealed that the specified IP was not configured on any interfaces. To help solve this problem, I configured the address on an interface and restarted the agent.
  45. If this happens to you, try unscheduling the network from the respective agent and rescheduling.
  46. Failures of the L3 agent can result in, among other things: • failure to route traffic if the Neutron routers have not been created properly or interfaces have not been added • missing SNAT/DNAT rules in the namespace. Check logs on the nodes at /var/log/neutron/neutron-l3-agent.log
  47. When floating IPs are associated with an instance/port, there are changes made to iptables within the corresponding router namespace. These rules dictate how traffic is translated when it egresses and ingresses the router. If and when the agent is exhibiting issues, these rules may not get applied and floating ips will not operate.
  48. If everything looks good, but you notice packet loss or issues with SSH, you may be exceeding the MTU of the interface. This is most often seen when overlay technologies like VXLAN are used. The addition of the overlay header causes the packet to exceed the MTU. You can pass a lower MTU via DHCP with option 26.
  49. If things are plumbed up correctly but you’re still experiencing issues, make sure to verify security group rules are not prohibiting traffic flow. Try applying a secondary security group to the port that allows limited connectivity (ICMP/SSH) from a particular IP or group. Test connectivity from the DHCP or router namespace, then branch out from there. Test connectivity to the fixed IP before testing the floating IP externally.
  50. Continuous work is being done on enhancing the operation of the L2/L3 agents, but some issues can only be observed at scale: • race conditions • default kernel parameters too low. If you are experiencing random, unexplainable issues, consult the syslog to see if the system itself is reporting issues.
  51. Often, failures can be traced back to Layer 8 issues – those originating between the keyboard and the chair. It goes without saying that misconfigured files will definitely cause issues within the environment. A misconfiguration can result in a service not starting at all, or can lead you down a long troubleshooting path thanks to obscure symptoms and messages.
  52. Familiarize yourself with the underlying technologies.
  53. Do your best to familiarize yourself with a working environment so that you know how to spot an issue. If you’re new to OpenStack, consider installing an all-in-one distro that will allow you to setup a prescribed environment that you can reverse engineer and learn from. Examples of this are OSAD and RDO. The docs on openstack.org are also very helpful in setting up basic environments.
  54. Create virtual machines manually. Create Linux bridges manually, and place physical interfaces and your VM tap interfaces in them. Create OVS bridges manually. Create a few flows, or just use the NORMAL flow. Assign VLANs to ports. Work to figure out how it all fits together.
  55. Work to get it working. It may not be pretty, but break things and put them back together. And remember, you’re not alone. There’s a strong community here that is willing to help.
  56. Just mention that this presentation is geared towards openstack operators
  57. Just mention that this presentation is geared towards openstack operators
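
The liveness check described in note 24 can be sketched in a few lines. This is a simplified illustration rather than Nova's actual code: `service_is_up` and `SERVICE_DOWN_TIME` are names made up here, the 60-second threshold is the default quoted in the note, and the check-in time is an invented value chosen to sit near the two controller timestamps shown on slide 29.

```python
from datetime import datetime, timedelta

# Default threshold quoted in the speaker notes; treated here as an assumption.
SERVICE_DOWN_TIME = timedelta(seconds=60)


def service_is_up(last_seen, local_now, down_time=SERVICE_DOWN_TIME):
    """Treat a service as up if its last check-in is recent by the local clock."""
    return abs(local_now - last_seen) <= down_time


if __name__ == "__main__":
    # Illustrative check-in time recorded for a compute node.
    last_seen = datetime(2015, 10, 16, 17, 14, 10)

    # The two controller clocks from slide 29, roughly four minutes apart.
    controller_a_now = datetime(2015, 10, 16, 17, 14, 18)
    controller_b_now = datetime(2015, 10, 16, 17, 10, 10)

    print(service_is_up(last_seen, controller_a_now))  # 8 seconds apart
    print(service_is_up(last_seen, controller_b_now))  # 240 seconds apart
```

The same check-in looks fresh from one controller and stale from the other, so whether the compute node counts as available depends on which scheduler handles the request. That matches the flapping UP and DOWN states described in note 23 and is why consistent NTP configuration matters.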