Troubleshooting is a valuable skill for service providers: it reduces downtime and saves costs. It involves both technical skills, like understanding systems, and soft skills, like communication. In today's complex, distributed environments, troubleshooting has become more of an art. The process runs from fully understanding the problem, through observing its behavior, localizing the issue, testing resolutions, and monitoring for success, all supported by historical monitoring data and people skills. Automating parts of troubleshooting and integrating the various tools also help this process scale.
2. Background
• 17+ years of experience as a service provider.
• MSP - We own your infrastructure stack.
• Managed Public, Private, and Hybrid Cloud.
• High Touch - Monitoring, Managing, Securing all layers (app/db/fw/nw/cdn).
• Mix of Open Source, Commercial, and Proprietary Software.
• Context of an MSP, but applicable to all IT.
3. The Art of Troubleshooting
• Troubleshooting is a standalone skill - Must be Taught, Trained, Documented, Reviewed.
• Technical skills are important, but soft skills cannot be forgotten.
• Overshadowed by the big buzz around hyped DevOps technologies.
• Other important soft skills - Resourcefulness, Communication, Technical Documentation, Mentoring.
4. Why Care?
• We're already doing it, internally or customer-facing.
• We're trusted to do it well.
• We're judged by how we act during a crisis.
• Technology-agnostic skill - it never becomes obsolete and only grows more important as systems become more complex.
• Applicable to infrastructure & software development.
• Allows us to scale - it's how we manage many complex environments with small teams.
• Reduce downtime & Save lives!!
5. Brief History
• Hardware & software systems traditionally monolithic in nature.
• Flat topology without abstraction.
• Mostly physical infrastructure.
• Perpetual configurations.
• No automation.
• Easier troubleshooting.
6. Today
• Distributed, virtualized, and abstracted infrastructure - Network, Storage, Compute.
• Self-healing & Autoscaling.
• Microservices-based architecture.
• Increased complexity with many benefits (scaling, CI, time).
• Decreased visibility, control, and ease of troubleshooting.
• Troubleshooting == An Art.
7. Where to Start?
• Fast resolution starts with absolute understanding of the issue.
• Is this the cause or a symptom?
• Was it brought to us pre-diagnosed?
• What is the context?
• Why is this a problem? Why is it important?
• No tunnel vision. Look at the big picture.
• Examples: 'Email Issue', 'Wifi/Speed', 'DDoS & API'.
8. Observe
• Can you have a clear understanding of the issue without seeing it for yourself?
• Before attempting resolution, be sure you can reproduce it (see the sketch below).
• Gain perspective using tools - remote logins, screen shares, screenshots.
• Understand expected behavior vs the fault.
• Software development & debugging tools.
• Supplement lack of perspective with solid communication.
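One way to make reproduction concrete is a minimal Python sketch that repeatedly exercises a suspect endpoint and compares observed behavior against the expectation; the URL, expected status, and attempt count are hypothetical placeholders:

    # repro.py - confirm we can reproduce a reported fault before touching anything.
    # URL and expected status are hypothetical placeholders.
    import urllib.request
    import urllib.error

    URL = "https://example.com/health"   # hypothetical endpoint under suspicion
    EXPECTED_STATUS = 200
    ATTEMPTS = 10

    failures = 0
    for attempt in range(1, ATTEMPTS + 1):
        try:
            status = urllib.request.urlopen(URL, timeout=5).status
        except urllib.error.URLError as exc:
            status = f"error: {exc.reason}"
        if status != EXPECTED_STATUS:
            failures += 1
        print(f"attempt {attempt}: got {status}, expected {EXPECTED_STATUS}")

    # Only move on to resolution once the fault reproduces consistently.
    print(f"fault reproduced on {failures}/{ATTEMPTS} attempts")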
9. Localize
• Drill down - peel back the layers of the onion.
• Process of elimination to localize the issue.
• Examples: static vs dynamic content, local storage vs shared/network storage, directly test and bypass layers, rule out the network.
• Follow the flow of data and test each step (sketched below).
• Major outages - localize to commonality - what do the affected services have in common?
• Use monitoring tools or SPOTs (single points of truth) to localize.
• Don't assume anything - NIH (Not invented here).
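A minimal Python sketch of following the flow of data and testing each step - DNS, then TCP, then HTTP - by process of elimination; the host, port, and path are hypothetical:

    # localize.py - walk the data path layer by layer and report the first failure.
    # Host, port, and path are hypothetical placeholders.
    import socket
    import urllib.request

    HOST, PORT, PATH = "example.com", 443, "/"

    def check_dns():
        socket.gethostbyname(HOST)                                   # name resolution layer

    def check_tcp():
        socket.create_connection((HOST, PORT), timeout=5).close()    # network/transport layer

    def check_http():
        urllib.request.urlopen(f"https://{HOST}{PATH}", timeout=5)   # application layer

    # Test each step in the order the data flows; the first failing layer
    # localizes the fault by process of elimination.
    for name, check in [("DNS", check_dns), ("TCP", check_tcp), ("HTTP", check_http)]:
        try:
            check()
            print(f"{name}: OK")
        except OSError as exc:
            print(f"{name}: FAIL ({exc}) <- fault is at or below this layer")
            break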
10. Resolve and Test
• Scientific Method: Question -> Hypothesis -> Experiment -> Observation -> Analysis -> Conclusion.
• Make small, incremental changes.
• Use sandboxes where possible.
• Document each test, result, and change (a lightweight example follows).
• Use the ability to reproduce and observe to confirm resolution.
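To keep the documentation discipline cheap, one option is an append-only log of every hypothesis, change, and result; a minimal sketch, where the file name and record fields are assumptions rather than a prescribed format:

    # changelog.py - append-only record of every hypothesis, change, and result.
    # File name and record fields are assumptions, not a prescribed format.
    import json
    import datetime

    LOG_FILE = "troubleshooting_log.jsonl"

    def record(hypothesis, change, result):
        entry = {
            "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "hypothesis": hypothesis,   # what we believed was wrong
            "change": change,           # the one small thing we changed
            "result": result,           # what we observed afterwards
        }
        with open(LOG_FILE, "a") as fh:
            fh.write(json.dumps(entry) + "\n")

    # One small, incremental change per record keeps cause and effect traceable.
    record(
        hypothesis="stale DNS cache is returning the old IP",
        change="flushed resolver cache on app01",
        result="lookups now return the new IP; error rate unchanged",
    )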
11. Monitor for Success
• Fast resolution starts with a solid monitoring strategy.
• Empower DevOps to create layered alerts.
• Create simple interfaces, scripts, and APIs to allow easy addition of monitoring to standardized systems.
• Make it part of the development & build process.
• Monitor the Big and the Small (sketched below):
• Big - The finished product. Look for the expected result via the end-user interface to ensure all services are functioning properly.
• Small - Every layer! OK/FAIL monitoring for multi-dimensional services and 3rd-party providers.
• If set up properly, big and small alerts should trigger simultaneously, allowing instant localization.
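A minimal sketch of the big-and-small idea: one end-to-end check plus OK/FAIL checks per layer, run in the same pass so a failing small check instantly localizes a failing big check. The check names and bodies are placeholders:

    # layered_checks.py - run the "big" end-to-end check and the "small"
    # per-layer checks together; names and check bodies are placeholders.

    def check_end_to_end():
        return True    # e.g. fetch the finished product via the end-user interface

    def check_database():
        return True    # e.g. run a trivial query

    def check_cache():
        return False   # e.g. set/get a sentinel key (simulated failure here)

    def check_cdn():
        return True    # e.g. fetch a known static asset

    CHECKS = {
        "BIG: end-to-end": check_end_to_end,
        "small: database": check_database,
        "small: cache": check_cache,
        "small: cdn": check_cdn,
    }

    # Big and small alerts fire in the same pass, so the failing
    # small check immediately localizes a big-check failure.
    for name, check in CHECKS.items():
        print(f"{name}: {'OK' if check() else 'FAIL'}")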
16. Automated Troubleshooting
• Trigger action based on logging events (syslog, Splunk, Logstash) - see the sketch below.
• Scan configuration files for dangerous conditions.
• Pipe events to software designed to diagnose and take action.
• Actions: disable interfaces, line cards, services, or servers; notify operations.
• Platform that allows easy addition of new tests based on experience.
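A minimal sketch of event-driven remediation: log lines are matched against patterns and piped to actions. The log lines, patterns, and actions are illustrative assumptions, not actual production rules:

    # auto_ts.py - pipe log events through pattern-matched handlers.
    # Patterns, actions, and sample lines are illustrative assumptions.
    import re

    def notify_operations(line):
        print(f"NOTIFY OPS: {line.strip()}")             # stand-in for paging/chat

    def disable_service(line):
        print(f"WOULD DISABLE SERVICE: {line.strip()}")  # stand-in for real remediation

    # Each lesson learned becomes one new (pattern, action) pair - cheap to extend.
    RULES = [
        (re.compile(r"out of memory", re.I), notify_operations),
        (re.compile(r"link flap|interface \S+ down", re.I), disable_service),
    ]

    def handle(stream):
        for line in stream:
            for pattern, action in RULES:
                if pattern.search(line):
                    action(line)

    # In production the stream would be tailed syslog or a Splunk/Logstash feed;
    # two canned lines keep the sketch self-contained.
    handle([
        "kernel: Out of memory: killed process 1234 (java)",
        "sw01: interface eth0/3 down - link flap detected",
    ])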
17. Tools & Integration
• Use a Single Point of Truth, or SPOT (e.g. Collins).
• Integrate best-in-breed 3rd-party tools with the SPOT (Nagios, MRTG, ManageEngine, VMware, Xen, ScienceLogic).
• Combine communication, monitoring, and culture (a fan-out sketch follows).
• We've connected: Nagios, phone calls, LiveChats, ticket updates, WO updates, Ansible deployments, network alerts, network capacity alerts, physical data center alerts, DDoS attacks, Confluence.
• Make documentation & diagram management easier. "If it's not documented, it didn't happen."
20. Historical Data a Must
• Establish baselines.
• Am I looking at a preexisting condition?
• What changed, and when?
• Data Aggregation.
• 3rd party: DataDog, ScienceLogic, GroundWorks, NewRelic.
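A minimal sketch of using historical data to answer "am I looking at a preexisting condition?": establish a baseline from past samples and flag deviations. The numbers are illustrative sample values only:

    # baseline.py - establish a baseline from history and ask "is this new?".
    # Sample values are illustrative only.
    import statistics

    history = [120, 115, 130, 125, 118, 122, 128]   # e.g. daily p95 latency, ms
    current = 190

    mean = statistics.mean(history)
    stdev = statistics.stdev(history)

    # A simple z-score separates "preexisting condition" from "something changed".
    z = (current - mean) / stdev
    if abs(z) > 3:
        print(f"{current} deviates from baseline ({mean:.0f} +/- {stdev:.0f}): something changed")
    else:
        print(f"{current} is within the historical baseline")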
24. People Factor
• Why is it not taught alongside technical skills?
• Make it part of the interview process.
• Adopt the approach used by the medical field - bedside clinics. Use your superstars.
• Make it part of your culture and develop a reward system around it.
• Collapse silos; broaden context for everyone.