2. @diego_pacheco
● Cat's Father
● Head of Software Architecture
● Agile Coach
● SOA/Microservices Expert
● DevOps Practitioner
● Speaker
● Author
diegopacheco
http://diego-pacheco.blogspot.com.br/
About me...
https://diegopacheco.github.io/
tinyurl.com/diegopacheco
3. Things we will talk about...
01 DON'TS: Things you should avoid.
02 How to run an investigation: Science 101
03 Tools & Tricks: Observability, Debug Tricks, Linux Tools...
04 Closing the Case: Post Mortems, RCAs and Retrospectives
5. Don't: FREAK OUT
● Avoid putting extra pressure on yourself.
● Avoid worrying about time.
● Avoid making comparisons like "this should be done in 1h or so".
● As time passes, it is easy to freak out.
● Making progress is your friend; don't make it only about solving the mystery.
● Your eyes can trick you, so use tools.
● Make sure you do things properly:
  ○ Comparing strings
  ○ Read errors carefully.
● Don't forget to breathe.
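Comparing strings is a good case of "your eyes can trick you". A minimal Python sketch, assuming the two values are already available as strings: it reports the first position where they differ and names the character, so invisible differences (trailing spaces, non-breaking spaces, look-alike letters) show up. The function names and the sample host name are hypothetical.

```python
import unicodedata


def first_difference(expected, actual):
    """Return (index, expected_char, actual_char) for the first mismatch, or None if equal."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i, e, a
    if len(expected) != len(actual):
        # One string is a prefix of the other; the shorter side yields "".
        i = min(len(expected), len(actual))
        return i, expected[i:i + 1], actual[i:i + 1]
    return None


def describe(ch):
    """Render a character so that invisible ones become readable."""
    if ch == "":
        return "<end of string>"
    return f"{ch!r} U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


if __name__ == "__main__":
    expected = "db-host.internal"
    actual = "db-host.internal\u00a0"  # hypothetical value with a trailing non-breaking space
    diff = first_difference(expected, actual)
    if diff is None:
        print("strings are identical")
    else:
        index, e, a = diff
        print(f"first difference at index {index}: expected {describe(e)}, got {describe(a)}")
```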
6. Don't: Do too many things at a time
● Don't change 2 things at a time:
  ○ Code
  ○ Vars / Config
  ○ Tries
● One hypothesis at a time.
● Otherwise, how do you know what did what?
● You need to be:
  ○ Methodical
  ○ Boring
  ○ Slow
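The "one change at a time" rule can be sketched in Python as below. This is only an illustration under assumptions: the configuration is a plain dictionary, and `service_is_healthy` is a hypothetical stand-in for whatever manual or automated check tells you the problem is gone. Each trial starts from a fresh copy of the baseline, so changes never pile up and every result can be attributed to exactly one change.

```python
from typing import Any, Callable, Dict


def try_one_at_a_time(
    baseline: Dict[str, Any],
    candidates: Dict[str, Any],
    check: Callable[[Dict[str, Any]], bool],
) -> Dict[str, bool]:
    """Apply each candidate change in isolation and record whether the check passes."""
    results = {}
    for key, new_value in candidates.items():
        trial = dict(baseline)   # fresh copy: changes never accumulate
        trial[key] = new_value   # exactly one change per trial
        results[key] = check(trial)
    return results


def service_is_healthy(config: Dict[str, Any]) -> bool:
    # Hypothetical check: pretend the service only works with a timeout of 5s or more.
    return config["timeout_s"] >= 5


if __name__ == "__main__":
    baseline = {"timeout_s": 1, "retries": 0}
    candidates = {"timeout_s": 10, "retries": 3}
    outcome = try_one_at_a_time(baseline, candidates, service_is_healthy)
    for key, fixed in outcome.items():
        print(f"changing only {key!r}: {'fixed' if fixed else 'still broken'}")
```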
7. How to Run an Investigation
Science 101
● Have Theories
● Write Down Facts
● Have a Partner (Pair Programming)
● Minimize Investigation Efforts
11. Have Theories
● It's like a guessing game: how did the murder happen?
● Think about the most likely thing it could be.
● There are classical offenders like:
  ○ NPE (lack of validation/tests)
  ○ Code change
  ○ Config change
  ○ Credentials change
  ○ Typo somewhere
  ○ Missing security groups
  ○ Not enough resources (memory, disk space)
  ○ OOM Killer
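Some of these offenders leave recognizable traces in logs, which helps to rank theories quickly. The Python sketch below counts log lines matching a few well-known messages. The signature names and the sample lines are assumptions for illustration, and the exact wording of the messages can vary with JVM and kernel versions, so treat the patterns as a starting point.

```python
import re
from collections import Counter

# Assumed signatures; extend the table with whatever your own stack prints.
SIGNATURES = {
    "npe": re.compile(r"NullPointerException"),
    "oom_java": re.compile(r"OutOfMemoryError"),
    "oom_killer": re.compile(r"Out of memory: Kill(ed)? process"),
    "disk_full": re.compile(r"No space left on device"),
}


def count_offenders(lines):
    """Count how many lines match each known offender signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


if __name__ == "__main__":
    sample = [  # hypothetical log lines
        "java.lang.NullPointerException at com.example.Foo.bar",
        "kernel: Out of memory: Killed process 4242 (java)",
        "write failed: No space left on device",
        "request served in 12 ms",
    ]
    for name, hits in count_offenders(sample).most_common():
        print(f"{name}: {hits}")
```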
12. Write down Facts
● Write it down (paper, txt file, Evernote, Google Docs).
● Write down important facts like:
  ○ Class names, methods, variables
● Make sure you don't get lost.
● It's easy to forget important pieces of information when you are worried and want to fix the issue fast.
21. Write your own tools
● They help to nail down problems faster.
● Simple utilities.
● Programs or scripts (Groovy, Python, Rust, Go).
● Quick diagnostics.
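As an illustration of how small such a tool can be, here is a Python sketch of a quick disk-space diagnostic built on the standard library's `shutil.disk_usage`. The function name, the report fields and the 10% threshold are assumptions for the example, not recommendations.

```python
import shutil


def disk_report(path=".", min_free_ratio=0.10):
    """Summarize disk usage for `path` and flag it when free space is below the threshold."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    free_ratio = usage.free / usage.total if usage.total else 0.0
    return {
        "path": path,
        "total_gb": round(usage.total / 1e9, 2),
        "free_gb": round(usage.free / 1e9, 2),
        "free_ratio": free_ratio,
        "ok": free_ratio >= min_free_ratio,
    }


if __name__ == "__main__":
    report = disk_report(".")
    status = "OK" if report["ok"] else "LOW DISK SPACE"
    print(f"{report['path']}: {report['free_gb']} GB free of {report['total_gb']} GB [{status}]")
```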
22. Could you compare with something else?
● It's very likely the issue is a code change, so it may have been working before. Some digging is needed.
● Not always possible (e.g. a new feature).
● Use diff tools.
● Compare with previous versions that worked.
● Run previous tests that worked.
● Debug and write down the differences.
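When a dedicated diff tool is not at hand, the comparison can be sketched with Python's standard `difflib`. The example assumes you have the text of a version that worked and of the one that fails; the sample configuration contents are hypothetical.

```python
import difflib
from typing import List


def config_diff(working: str, broken: str) -> List[str]:
    """Return a unified diff between the working and the broken version."""
    return list(
        difflib.unified_diff(
            working.splitlines(),
            broken.splitlines(),
            fromfile="working",
            tofile="broken",
            lineterm="",
        )
    )


if __name__ == "__main__":
    working = "pool_size=10\ntimeout_s=30\n"  # hypothetical last known good config
    broken = "pool_size=10\ntimeout_s=3\n"    # hypothetical current config
    for line in config_diff(working, broken):
        print(line)
```

Lines starting with `-` exist only in the working version and lines starting with `+` only in the broken one, which gives you a short list of differences to write down.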
23. Closing the Case
Fix, Share & Improve
● Adding More Tests
● Post Mortems
● RCAs
● Retrospectives
24. Adding more Tests
● Boy Scout rule.
● Closing the case means you found the issue.
● So we need to create a test that reproduces the issue.
● Having the test will prove we don't suffer from it again.
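A regression test for a closed case might look like the Python sketch below. Everything in it is hypothetical: it assumes the bug was a crash when a timeout setting was missing, and `parse_timeout` stands for the already fixed function. The point is that the test encodes the exact input that triggered the issue.

```python
def parse_timeout(raw, default=30):
    """Hypothetical fixed function: a missing or blank value used to crash here."""
    if raw is None or raw.strip() == "":
        return default
    return int(raw.strip())


def test_missing_timeout_falls_back_to_default():
    # This is the input that caused the original incident.
    assert parse_timeout(None) == 30


def test_blank_timeout_falls_back_to_default():
    assert parse_timeout("   ") == 30


def test_valid_timeout_still_parses():
    # Guard against the fix breaking the normal path.
    assert parse_timeout(" 15 ") == 15


if __name__ == "__main__":
    for test in (
        test_missing_timeout_falls_back_to_default,
        test_blank_timeout_falls_back_to_default,
        test_valid_timeout_still_parses,
    ):
        test()
    print("regression tests passed")
```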
26. Closing the Case: RCA
● Similar to Post Mortems.
● Lean tool. Can be used with 5 whys, fishbone and other systems thinking tools.
● Excel file:
  ○ Issue
  ○ Classification
  ○ Why it happened
  ○ How we make sure it does not happen again
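If a spreadsheet is not handy, the same four columns can be kept as a CSV file that Excel opens directly. A minimal Python sketch follows; the column identifiers and the sample record are assumptions made for the example.

```python
import csv
import io

# Column identifiers are assumed; they mirror the four items listed above.
RCA_FIELDS = ["issue", "classification", "why_it_happened", "prevention"]


def rca_to_csv(records):
    """Serialize RCA records (dicts keyed by RCA_FIELDS) into CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=RCA_FIELDS)
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


if __name__ == "__main__":
    records = [
        {  # hypothetical example entry
            "issue": "Checkout requests timing out",
            "classification": "Config change",
            "why_it_happened": "Timeout was lowered without a load test",
            "prevention": "Add a regression test and review config changes",
        }
    ]
    print(rca_to_csv(records))
```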