Monitoring systems has traditionally been the responsibility of Ops teams. But our goal is to align devs, ops, & other roles in the organization (aka DevOps), so we need to ensure that they all monitor critical business systems & that they do so in ways that take advantage of the unique perspective each role offers. In this session, I’ll break down the expansive monitoring landscape into 5 categories that each provide a unique view of your systems. I’ll show how each category helps your team gain complete observability, avoid blind spots, & work together to quickly resolve issues & outages.
100% Visibility - Jason Yee - Codemotion Amsterdam 2018
1. 100% VISIBILITY
HOLISTICALLY VIEWING SYSTEMS
2. AMBIGUOUS CYLINDERS
PERSPECTIVE MATTERS
3. JASON YEE
DOCS & TALKS
TRAVEL HACKER
POKEMON TRAINER
WHISKEY HUNTER
TW: @gitbisect
EM: jyee@datadoghq.com
4. DATADOG
SAAS-BASED MONITORING
TRILLIONS OF POINTS/DAY
WE’RE HIRING:
jobs.datadoghq.com
TW: @datadoghq
5. WHERE ARE WE GETTING VISIBILITY?
22. METRICS
• Often combined or aggregated
• Useful for spotting trends/patterns
• Send alerts from metrics
• Help catch known unknowns
Unown Pokemon
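A minimal sketch of the metrics bullets above: raw samples are aggregated into windows, and an alert fires from the aggregated points. The latency samples, the 300 ms threshold, and the names `aggregate` and `should_alert` are all hypothetical, chosen only for illustration.

```python
from statistics import mean


def aggregate(samples, window):
    """Average raw samples into fixed-size windows (the 'combined or aggregated' step)."""
    return [mean(samples[i:i + window]) for i in range(0, len(samples), window)]


def should_alert(aggregated, threshold, consecutive=2):
    """Alert when the last `consecutive` aggregated points all exceed the threshold."""
    recent = aggregated[-consecutive:]
    return len(recent) == consecutive and all(p > threshold for p in recent)


# Hypothetical request latencies in milliseconds.
latency_ms = [100, 110, 90, 105, 400, 420, 450, 430]
points = aggregate(latency_ms, window=2)
alert = should_alert(points, threshold=300)
```

Requiring several consecutive points over the threshold is one common way to avoid alerting on a single noisy sample; it only catches failure modes you already thought to set a threshold for, which is the "known unknowns" point.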
28. LOGS
• Event-based
• Easy to read for humans
• Well structured & easy to parse/grep for computers
• Ideally verbose & contain a lot of information
• Useful for finding details of an event
• Help catch unknown unknowns
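A minimal sketch of event-based logs that stay readable for humans and easy to parse for computers, assuming one JSON object per line. The field names and the helpers `log_event` and `find_events` are hypothetical; timestamps are left out only to keep the example deterministic.

```python
import json


def log_event(event, **fields):
    """Return one JSON log line: readable by humans, parseable by machines."""
    return json.dumps({"event": event, **fields}, sort_keys=True)


def find_events(lines, **match):
    """Parse log lines and keep events whose fields equal every key in `match`."""
    parsed = (json.loads(line) for line in lines)
    return [e for e in parsed if all(e.get(k) == v for k, v in match.items())]


# Hypothetical events from the app; verbose fields make later digging easier.
lines = [
    log_event("login", user="ash", status="ok"),
    log_event("login", user="misty", status="failed", reason="bad_hash"),
    log_event("photo_view", user="ash", photo_id=42),
]
failed = find_events(lines, event="login", status="failed")
```

Because every field is kept, the same lines can answer questions nobody planned for in advance, which is where logs help with unknown unknowns.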
29. BACKEND VISIBILITY
The Data
• Metrics
• Logs
• Traces
The Tools
• Application Monitoring
• Log Management
• APM
32. TRACES
• Request-based
• Follow activity from request across function and service calls.
• Useful for following code to answer “Where?” and “How long?”
33. FRONTEND VISIBILITY
The Data
• Metrics
The Tools
• Real-User Monitoring (RUM)
• Synthetics
36. PEOPLE & ROBOTS
• RUM & Synthetics work best together
• RUM provides insight into how users actually use a product
• Synthetics operate independently of users
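A minimal sketch of a synthetic check: a scripted request that runs whether or not any real user is on the site. To keep it runnable offline, the HTTP client is passed in as `fetch`; `fake_fetch`, the example URLs, and the one-second budget are hypothetical stand-ins.

```python
import time


def synthetic_check(fetch, url, max_seconds=1.0):
    """Run one scripted request and report whether it passed."""
    start = time.perf_counter()
    try:
        status = fetch(url)
    except Exception as exc:
        return {"url": url, "ok": False, "error": str(exc)}
    elapsed = time.perf_counter() - start
    ok = status == 200 and elapsed <= max_seconds
    return {"url": url, "ok": ok, "status": status, "seconds": elapsed}


def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP client; returns a status code."""
    if "cdn" in url:
        raise ConnectionError("CDN unreachable")
    return 200


results = [
    synthetic_check(fake_fetch, url)
    for url in ("https://app.example/login", "https://cdn.example/puppy.jpg")
]
```

In practice a scheduler would run checks like this every few minutes from several locations, so a failure is visible even at 3 a.m. when no users are generating RUM data.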
37. DATE-A-DOG
WHAT’S IT ALL MEAN?
TINDER FOR PUPS
38. THIS APP IS GREAT!
WHO’S A GOOD BOY?!?
39. I GOTTA TELL MY FRIENDS ABOUT THIS APP!
THEY’RE SO CUTE!!!
40. AND MY FRIENDS ARE GONNA TELL THEIR FRIENDS…
AAAWWWWWWW!!!
41. WHAT JUST HAPPENED?!?
WHERE’D THE PUPPIES GO?
42. HOW DO WE KNOW SOMETHING WENT WRONG?
USERS ARE HAVING A HORRIBLE EXPERIENCE
44. REAL-USER MONITORING
HOW DO WE KNOW?
45. REAL-USER MONITORING
HOW DO WE KNOW SOMETHING WENT WRONG?
47. SYNTHETICS
HOW DO WE KNOW SOMETHING WENT WRONG?
50. SCENARIO: THIRD PARTY CDN OUTAGE
We host puppy photos on Fastly & the app pulls directly from the Fastly CDN. Fastly suffers a massive DDoS attack.
• RUM & Synthetics: Will alert and can show what assets are slow or are not being served.
• APM, Application and Infrastructure Monitoring: No alerts. Everything is fine!
51. TRACING (APM)
HOW DO WE KNOW?
52. TRACING (APM)
HOW DO WE KNOW SOMETHING WENT WRONG?
54. TRACING (APM)
HOW DO WE KNOW WHAT WENT WRONG?
60. SCENARIO: SERVICE OUTAGE
We use an image resizing/optimizing service that resizes images asynchronously. It has issues. Images are returned slowly.
• RUM & Synthetics: Might see alerts, but will not know where the problem is.
• Application & Infrastructure Monitoring: Everything is fine!
• APM: Can alert on latency and show where in the code you are making the API calls.
61. APPLICATION MONITORING
HOW DO WE KNOW?
65. SCENARIO: DEV DEPLOYS BAD CODE
Developer accidentally deploys code that improperly checks password hashes, so all user logins fail.
• RUM & Synthetics, APM: No alerts. Everything is fine!
• Infrastructure Monitoring: No alerts. Everything is fine!
• Application Monitoring: Will alert on the impact to custom metrics and can help identify why.
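A minimal sketch of the kind of custom metric that could catch this scenario, assuming the app counts login outcomes itself. `LoginMetrics` and the 50% failure threshold are hypothetical; a real setup would ship these counts to a metrics backend rather than keep them in memory.

```python
from collections import Counter


class LoginMetrics:
    """Hypothetical in-process counter for a custom business metric."""

    def __init__(self):
        self.counts = Counter()

    def record(self, success):
        """Count one login attempt by outcome."""
        self.counts["success" if success else "failure"] += 1

    def failure_rate(self):
        """Fraction of recorded logins that failed (0.0 when nothing is recorded)."""
        total = self.counts["success"] + self.counts["failure"]
        return self.counts["failure"] / total if total else 0.0


metrics = LoginMetrics()
for ok in [False] * 9 + [True]:  # after the bad deploy, almost every login fails
    metrics.record(ok)
alert = metrics.failure_rate() > 0.5
```

The requests themselves are fast and return normally, which is why latency-based tools stay quiet; only a metric that encodes the business outcome shows the failure.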
66. APPLICATION MONITORING
HOW DO WE KNOW SOMETHING WENT WRONG?
67. INFRASTRUCTURE MONITORING
HOW DO WE KNOW?
68. INFRASTRUCTURE MONITORING
HOW DO WE KNOW SOMETHING WENT WRONG?
71. SCENARIO: WE’RE TOO POPULAR
Everyone loves puppies and we’re completely out of resources.
• RUM & Synthetics, APM, Application Monitoring: Alerts that latency is high. Will not be able to help identify why.
• Infrastructure Monitoring: Alerts on high resource use and may be able to trigger automatic remediation.
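A minimal sketch of alerting on resource use with a simple automatic remediation, assuming CPU percentage is the resource and "add one instance" is the remedy. `plan_capacity`, the 80% threshold, and the instance cap are hypothetical values for illustration.

```python
def plan_capacity(cpu_percent, instances, high=80.0, max_instances=10):
    """Return (alert, desired_instances) for a list of recent CPU readings."""
    avg = sum(cpu_percent) / len(cpu_percent)
    if avg <= high:
        return False, instances
    # Over the threshold: alert and ask for one more instance, up to the cap.
    return True, min(instances + 1, max_instances)


busy = plan_capacity([95, 90, 92], instances=3)
calm = plan_capacity([40, 50], instances=3)
```

The cap matters: without it, a runaway feedback loop or a traffic spike could scale the fleet (and the bill) without limit.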
72. ANOMALY DETECTION
HOW DO WE KNOW SOMETHING WENT WRONG?
73. HOW DO WE KNOW
WHAT WENT WRONG?
74. UNTIL YOU FIND THE CAUSES
RECURSE RECURSE RECURSE
76. LOGS
EXPLORING WHAT WENT WRONG
79. HOW TO GET 100% VISIBILITY?
• Think about your system as a whole
• Get multiple perspectives
• Consider all 5 observability tools:
  • RUM
  • Synthetics
  • Tracing
  • Application+Infrastructure Monitoring
  • Logs
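A minimal sketch of why multiple perspectives matter, encoding which tools alert in each of the four scenarios from the scenario slides and then asking which scenarios a given tool set would miss. The key names and `blind_spots` are hypothetical; RUM & Synthetics are treated as one entry because the slides pair them, and their "might see alerts" in the slow image service case is counted as alerting, which is an assumption.

```python
# Which tools alert in each scenario, as stated on the scenario slides.
ALERTS = {
    "third-party CDN outage": {"rum_synthetics"},
    "slow image service": {"rum_synthetics", "apm"},  # RUM/Synthetics only "might" alert
    "bad code deploy": {"app"},
    "out of resources": {"rum_synthetics", "apm", "app", "infra"},
}


def blind_spots(tools):
    """Scenarios that none of the given tools would alert on."""
    have = set(tools)
    return sorted(name for name, alerting in ALERTS.items() if not alerting & have)


backend_only = blind_spots({"apm", "infra"})
everything = blind_spots({"rum_synthetics", "apm", "app", "infra"})
```

With only APM and infrastructure monitoring, two of the four scenarios go unnoticed; with every perspective in place, none do.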
80. QUESTIONS?
@gitbisect
jyee@datadoghq.com
81. SLIDES: http://bit.ly/cm-100viz