10. Why Sensu?
• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)
15. How to use Sensu
• Don’t use all of this!
• Standalone checks only
• Default in Puppet module
19. Sensu Data Flow
• Sensu client runs checks locally on each machine
• Results are published to RabbitMQ Cluster
• Sensu Server in H/A Cluster
• Processes check results from RabbitMQ
• Invokes appropriate handlers
• Writes state to Redis
• Redis + Redis Sentinel
• 2+ instances in each cluster
• Read by the Sensu API
• Every layer is behind HAProxy
21. Mutually Assured Monitoring
• Multiple independent Sensu clusters per data centre/environment
• 2+ RabbitMQ Servers
• 2+ Redis Servers
• 2+ Sensu Server/API Servers
• The clusters monitor each other
26. Machine Readable Config
• /etc/sensu/conf.d/checks/$check_name.json
• One check per file
• Extensible with arbitrary metadata
• Hash merge
• Never edit by hand!
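To illustrate the one-check-per-file layout above, here is a minimal shell sketch that writes a single check definition. The `command`, `interval`, `standalone`, and `handlers` keys are ordinary Sensu check attributes; `team` and `runbook` are hypothetical custom metadata keys, and the `write_check` helper and its arguments are invented for this example. In the setup described here Puppet generates these files, so this only shows the shape of the output.

```shell
# Write one Sensu check definition to <conf_dir>/<check_name>.json.
# "team" and "runbook" are hypothetical custom metadata keys.
# No JSON escaping is done, so the command must not contain double quotes.
write_check() {
  local conf_dir="$1" check_name="$2" command="$3"
  mkdir -p "$conf_dir"
  cat > "${conf_dir}/${check_name}.json" <<EOF
{
  "checks": {
    "${check_name}": {
      "command": "${command}",
      "interval": 60,
      "standalone": true,
      "handlers": ["default"],
      "team": "operations",
      "runbook": "http://example.com/runbooks/${check_name}"
    }
  }
}
EOF
}
```

For example, `write_check /tmp/sensu-demo/checks check_ssh "check_tcp -p 22"` would produce `/tmp/sensu-demo/checks/check_ssh.json`.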
29. Let Puppet Do The Work
• Puppet is already working in the environment
• It knows everything about every node in the environment
• Puppet is human readable
44. Check Scripts
• Same as Nagios checks
• Simple text output
• POSIX exit codes
• Result sent to Sensu Server, along with check definition
• Includes all custom metadata
• Custom handlers process the extra data
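The check contract above (one line of text plus an exit code) can be sketched in shell. The function name `check_threshold` and its arguments are hypothetical; the exit codes follow the usual Nagios plugin convention of 0 for OK, 1 for WARNING, 2 for CRITICAL, and 3 for UNKNOWN.

```shell
# Nagios-style check: print one line of text, signal state via the exit code.
# 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
check_threshold() {
  local value="$1" warn="$2" crit="$3"
  # Reject anything that is not a non-negative integer.
  case "$value" in
    ''|*[!0-9]*)
      echo "UNKNOWN: value '${value}' is not a non-negative integer"
      return 3
      ;;
  esac
  if [ "$value" -ge "$crit" ]; then
    echo "CRITICAL: value is ${value} (threshold ${crit})"
    return 2
  elif [ "$value" -ge "$warn" ]; then
    echo "WARNING: value is ${value} (threshold ${warn})"
    return 1
  fi
  echo "OK: value is ${value}"
  return 0
}
```

A standalone check script would call the function with a measured value and then exit with its return code, which the Sensu client reports as the check status.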
57. Single Source of Truth
• DNS is the canonical source for Sensu servers
• Configure things in one place
# Use DNS to detect if this server is a sensu server
$local_sensu_server_ips_array = gethostbyname2array("sensu.local-${::habitat}.yelpcorp.com")
$array_intersection = intersection($::all_ipaddresses, $local_sensu_server_ips_array)
# If one of our IP addresses is in the DNS entries, we must be a sensu server!
if size($array_intersection) > 0 {
  $is_sensu_server = true
  $local_sensu_server_array = gethostbyaddr2array($local_sensu_server_ips_array)
}
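The same "am I listed in DNS?" test can be sketched outside Puppet. In this shell version the two address lists are passed in as arguments so the logic runs without a resolver; the function name `ip_lists_intersect` and the hostname in the usage comment are hypothetical.

```shell
# Succeeds if any address in the first list also appears in the second.
# Both arguments are space-separated lists of IP addresses.
ip_lists_intersect() {
  local local_ips="$1" server_ips="$2" ip
  for ip in $local_ips; do
    # Pad with spaces so 10.0.0.5 does not match 10.0.0.50.
    case " ${server_ips} " in
      *" ${ip} "*) return 0 ;;
    esac
  done
  return 1
}

# Hypothetical usage (hostname is an assumption, not from the talk):
#   server_ips="$(dig +short sensu.example.com | tr '\n' ' ')"
#   ip_lists_intersect "$(hostname -I)" "$server_ips" && is_sensu_server=true
```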
75. rc0.d
function is_sensu_cli_available {
  # Succeeds only if sensu-cli is on the PATH.
  which sensu-cli >/dev/null
}
deregister () {
  if is_sensu_cli_available; then
    # By this run level, sensu-client should be stopped, but we can stop anyway.
    service sensu-client stop 2>/dev/null
    sensu-cli client delete "$(/opt/puppet-omnibus/bin/facter fqdn)"
  fi
}
case "$1" in
  start)
    echo "$0 does nothing on start."
    ;;
  stop)
    # Refuse to run on any runlevel except 0.
    if runlevel | grep -q 0; then
      deregister
    fi
    ;;
esac
80. rc6.d
# Assumption: fqdn is not set on the slide; resolved here as in the rc0.d script.
fqdn="$(/opt/puppet-omnibus/bin/facter fqdn)"
function is_sensu_cli_available {
  # Succeeds only if sensu-cli is on the PATH.
  which sensu-cli >/dev/null
}
function silence_for {
  if is_sensu_cli_available; then
    if sensu-cli stash show "silence/${fqdn}" | grep -q "total items"; then
      echo "Silence already detected for ${fqdn}. Not replacing the existing silence"
    else
      echo "Automatically silencing this host, ${fqdn} for $1 seconds in sensu..."
      sensu-cli silence "${fqdn}" --expire "$1" --reason "Reboot initiated."
    fi
  else
    echo "sensu-cli is unavailable"
  fi
}
case "$1" in
  start)
    echo "Sensu silence does nothing on start"
    exit 0
    ;;
  stop)
    # Refuse to run on any runlevel except 6.
    if runlevel | grep -q 6; then
      echo "Sensu silenced by reboot " | logger -s -t sensu-silence
      silence_for 1800 2>&1 | logger -s -t sensu-silence
    fi
    ;;
esac
83. Cluster Checks
• Assert some % of machines are healthy
• Use to reduce alert noise
• If a cluster becomes unavailable, you want someone to be paged
• If one machine becomes unavailable, it’s not a problem - open a JIRA ticket to get it fixed in core hours
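The percentage assertion behind a cluster check can be sketched as a small shell function. The name `check_cluster`, its arguments, and the 80% threshold in the example are hypothetical; how the healthy and total counts are gathered is left out.

```shell
# Nagios-style cluster check: CRITICAL when fewer than min_percent of
# hosts are healthy, OK otherwise. Percentages use integer division.
check_cluster() {
  local healthy="$1" total="$2" min_percent="$3"
  if [ "$total" -le 0 ]; then
    echo "UNKNOWN: no hosts in cluster"
    return 3
  fi
  local percent=$(( healthy * 100 / total ))
  if [ "$percent" -lt "$min_percent" ]; then
    echo "CRITICAL: ${healthy}/${total} hosts healthy (${percent}%, need ${min_percent}%)"
    return 2
  fi
  echo "OK: ${healthy}/${total} hosts healthy (${percent}%)"
  return 0
}
```

With a threshold of 80, `check_cluster 9 10 80` stays OK when a single machine is down, so nobody is paged, while `check_cluster 7 10 80` goes CRITICAL.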