Superb Supervision of Short-lived Servers with Sensu
Paul O’Connor
paul.oconnor@yelp.com
Yelp’s Mission
Connecting people with great local businesses.
Short-Lived Servers
• Servers in auto-scaling groups
• Short batch servers existing while the batch runs
• Ensuring latest image build
Why Sensu?
• Designed to be pluggable / extensible
• Arbitrary check metadata
• Simple model
• Components do exactly one thing
• Ruby
• Not afraid to extend (or fork!)
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
How to use Sensu
• Don’t use all of this!
• Standalone checks only
• Default in Puppet module
Sensu Data Flow
• Sensu client runs checks locally on each machine
• Results are published to RabbitMQ Cluster
• Sensu Server in H/A Cluster
• Processes check results from RabbitMQ
• Invokes appropriate handlers
• Writes state to Redis
• Redis + Redis Sentinel
• 2+ instances in each cluster
• Read by the Sensu API
• Every layer is behind HAProxy
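The flow above can be sketched as a toy, in-memory model. This is an illustration only, not Sensu's code: a `queue.Queue` stands in for RabbitMQ, a dict stands in for Redis, and every function name here is hypothetical.

```python
import json
import queue

# Stand-ins (assumptions for illustration): a Queue plays RabbitMQ, a dict plays Redis.
transport = queue.Queue()
state_store = {}


def client_publish(client_name, check_name, status, output):
    """Client side: a check has run locally, so publish its result."""
    result = {
        "client": client_name,
        "check": {"name": check_name, "status": status, "output": output},
    }
    transport.put(json.dumps(result))


def server_process(handlers):
    """Server side: drain results, record state, invoke handlers on failures."""
    handled = []
    while not transport.empty():
        result = json.loads(transport.get())
        key = f"{result['client']}/{result['check']['name']}"
        state_store[key] = result["check"]["status"]  # the "writes state" step
        if result["check"]["status"] != 0:
            for handler in handlers:
                handled.append(handler(result))
    return handled


def api_read(client_name, check_name):
    """API side: read the last recorded status, or None if nothing was seen."""
    return state_store.get(f"{client_name}/{check_name}")
```

The point of the sketch is the separation of roles: the client only publishes, the server only consumes and dispatches, and readers only look at stored state.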
Mutually Assured Monitoring
• Multiple independent Sensu clusters per data centre/environment
• 2+ RabbitMQ Servers
• 2+ Redis Servers
• 2+ Sensu Server/API Servers
• The clusters monitor each other
Machine Readable Config
• /etc/sensu/conf.d/checks/$check_name.json
• One check per file
• Extensible with arbitrary metadata
• Hash merge
• Never edit by hand!
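The "hash merge" bullet can be illustrated with a small recursive merge. This is a minimal sketch that assumes one-check-per-file fragments are combined key by key; the exact merge rules Sensu applies may differ, and the two fragments below are made up for the example.

```python
import copy


def deep_merge(base, extra):
    """Return a new dict: nested dicts merge key by key, other values from extra win."""
    merged = copy.deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Two hypothetical fragments, as if already parsed from separate JSON files.
disk = {"checks": {"check_disk_slash": {"interval": 300, "page": True}}}
ssh = {"checks": {"check_ssh": {"interval": 60, "team": "operations"}}}

config = deep_merge(disk, ssh)
```

Because both fragments share the top-level `checks` key, the merged result holds both checks side by side, which is what makes one file per check practical.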
Let Puppet Do The Work
• Puppet is already working in the environment
• It knows everything about every node in the environment
• Puppet is human readable
monitoring_check
monitoring_check { 'check_disk_slash':
  page          => true,
  check_every   => '5m',
  alert_after   => '30m',
  realert_every => 10,
  runbook       => 'http://wiki/check_disk/slash',
  command       => '/usr/lib/nagios/plugins/check_disk -c 10% -K 10 -p /',
}
sensu::check
• monitoring_check wraps this
• Writes a JSON file for each check
• Comment safe
{
  "checks": {
    "check_disk_slash": {
      "standalone": true,
      "handlers": [ "default" ],
      "command": "/usr/lib/nagios/plugins/check_disk -c 5% -K 5 -p /",
      "dependencies": [ ],
      "interval": 300,
      "timeout": 300,
      "alert_after": 300,
      "realert_every": "10",
      "runbook": "http://wiki/disk-slash",
      "team": "operations",
      "irc_channels": "operations-notifications",
      "notification_email": "undef",
      "ticket": false,
      "project": false,
      "page": true,
      "tip": "Try: sudo du -h -x --max-depth=1 /",
      "tags": [ ]
    }
  }
}
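How a human-friendly definition could turn into a JSON file like the one above can be sketched as follows. This is not the Puppet module's implementation: `human_time_to_seconds` and `render_check` are hypothetical helpers, and the set of default keys is an assumption based on the example.

```python
import json

_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def human_time_to_seconds(value):
    """Convert '5m' or '30m' style strings (or plain numbers) to seconds."""
    if isinstance(value, int):
        return value
    if value.isdigit():
        return int(value)
    number, unit = value[:-1], value[-1]
    return int(number) * _UNITS[unit]


def render_check(name, command, check_every, **metadata):
    """Build the per-check structure; extra keyword arguments become custom metadata."""
    check = {
        "standalone": True,
        "handlers": ["default"],
        "command": command,
        "interval": human_time_to_seconds(check_every),
    }
    check.update(metadata)
    return {"checks": {name: check}}


doc = render_check(
    "check_disk_slash",
    "/usr/lib/nagios/plugins/check_disk -c 5% -K 5 -p /",
    "5m",
    page=True,
    team="operations",
    runbook="http://wiki/disk-slash",
)
text = json.dumps(doc, indent=2, sort_keys=True)
```

Keys such as `page`, `team` and `runbook` are not special here. They simply travel with the check as extra metadata.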
Check Scripts
• Same as Nagios checks
• Simple text output
• POSIX exit codes
• Result sent to Sensu Server, along with check definition
• Includes all custom metadata
• Custom handlers process the extra data
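A check in this style prints one line of text and reports its state through the exit code, following the Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch, with a hypothetical function name and thresholds chosen only for illustration:

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # Nagios plugin exit codes


def check_disk_free(path, warn_pct, crit_pct, usage=shutil.disk_usage):
    """Return (exit_code, one_line_output) for the free space on path.

    The usage argument exists so the function can be exercised without a real disk.
    """
    try:
        stats = usage(path)
    except OSError as err:
        return UNKNOWN, f"DISK UNKNOWN - {err}"
    free_pct = 100.0 * stats.free / stats.total
    if free_pct < crit_pct:
        return CRITICAL, f"DISK CRITICAL - {free_pct:.1f}% free on {path}"
    if free_pct < warn_pct:
        return WARNING, f"DISK WARNING - {free_pct:.1f}% free on {path}"
    return OK, f"DISK OK - {free_pct:.1f}% free on {path}"
```

A script built on this would print the message and pass the code to `sys.exit`, for example `code, message = check_disk_free("/", 20, 10)`.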
Handlers
• base
• JIRA
• email
• irc
• pagerduty
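How a handler might use the custom metadata can be sketched as a small routing function. The routing rules below are assumptions made for illustration, not Yelp's handler code; the key names (`page`, `ticket`, `project`, `irc_channels`, `notification_email`) come from the example check definition, and treating the string "undef" as "no address" is likewise an assumption.

```python
import json


def choose_notifications(event):
    """Map custom check metadata to the notification channels to use."""
    check = event["check"]
    actions = []
    if check.get("page"):
        actions.append("pagerduty")
    if check.get("ticket") and check.get("project"):
        actions.append(f"jira:{check['project']}")
    if check.get("irc_channels"):
        actions.append(f"irc:{check['irc_channels']}")
    email = check.get("notification_email")
    if email and email != "undef":  # assumption: "undef" means no address was set
        actions.append(f"email:{email}")
    return actions


def handle(raw_event):
    """Parse an event delivered as JSON text and decide where it should go."""
    return choose_notifications(json.loads(raw_event))
```

Keeping the decision in a pure function makes the routing easy to test separately from the code that actually talks to PagerDuty, JIRA, IRC or a mail server.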
How Do The Checks Get Executed?
• Each machine runs the client
• Client is managed entirely by Puppet
Situational Awareness
Single Source of Truth
• DNS is canonical source for sensu servers
• Configure things in one place
# Use DNS to detect whether this server is a Sensu server
$local_sensu_server_ips_array = gethostbyname2array("sensu.local-${::habitat}.yelpcorp.com")
$array_intersection = intersection($::all_ipaddresses, $local_sensu_server_ips_array)
# If one of our IP addresses is in the DNS entries, we must be a Sensu server!
if size($array_intersection) > 0 {
  $is_sensu_server = true
  $local_sensu_server_array = gethostbyaddr2array($local_sensu_server_ips_array)
}
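The same idea, a host deciding its role by comparing its own addresses with a DNS record, can be sketched outside Puppet. The helper names are hypothetical and this is not the Puppet functions' implementation; only `socket.gethostbyname_ex` is a real library call.

```python
import socket


def resolve_ips(hostname, resolver=socket.gethostbyname_ex):
    """Return the set of IPv4 addresses behind a DNS name (empty set on failure)."""
    try:
        _name, _aliases, addresses = resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        return set()
    return set(addresses)


def is_sensu_server(local_ips, server_ips):
    """True when any local address also appears in the server DNS record."""
    return bool(set(local_ips) & set(server_ips))
```

With this shape, adding or removing a server means changing one DNS record, and every host reaches the same conclusion the next time it evaluates the check.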
Automatic Monitoring
• Cron Jobs - check if a job was completed successfully
• cron::d
if $staleness_threshold {
  $actual_command = "/nail/sys/bin/success_wrapper '${reporting_name}' ${command}"
  cron::staleness_check { $reporting_name:
    threshold => $staleness_threshold,
    params    => $staleness_check_params,
    user      => $user,
  }
} else {
  $actual_command = $command
}
define cron::staleness_check(
  $threshold,
  $params,
  $user,
) {
  $threshold_s = cron_human_time_to_seconds($threshold)
  # Check whether we are fresh five times per threshold, not to exceed 1 hour
  if $threshold_s / 5 > 3600 {
    $check_every = 3600
  } else {
    $check_every = $threshold_s / 5
  }
  $check_title = "${name}_staleness"
  $overrides = {
    'command'     => "/nail/sys/bin/cron_staleness_check ${name} ${threshold_s}",
    'check_every' => $check_every,
    'needs_sudo'  => true,
    'alert_after' => '2m',
  }
  $check_data = { "$check_title" =>
    merge(
      $params,
      $overrides
    )
  }
  create_resources('monitoring_check', $check_data)
  file { "/nail/run/success_wrapper/${name}":
    ensure => 'file',
    owner  => $user,
    mode   => '640',
  } ->
  Monitoring_check[$check_title]
}
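The interval arithmetic and the staleness test can be sketched as below. The slides do not show `success_wrapper` or `cron_staleness_check`, so the behaviour here is an assumption: the wrapper is taken to update a marker file's modification time on success, and the check compares that time with the threshold. Both function names are hypothetical.

```python
import os
import time


def check_interval(threshold_s, max_interval=3600):
    """Check five times per threshold, but never less often than once an hour."""
    return min(threshold_s // 5, max_interval)


def staleness_status(marker_path, threshold_s, now=None, getmtime=os.path.getmtime):
    """Return (exit_code, message): 0 if the marker is fresh, 2 if stale or missing.

    Assumption: the marker's modification time records the last successful run.
    """
    now = time.time() if now is None else now
    try:
        age = now - getmtime(marker_path)
    except OSError:
        return 2, f"CRITICAL - {marker_path} is missing"
    if age > threshold_s:
        return 2, f"CRITICAL - last success {int(age)}s ago (threshold {threshold_s}s)"
    return 0, f"OK - last success {int(age)}s ago"
```

For a daily job with a threshold of 86400 seconds, a fifth of the threshold would be 17280 seconds, so the one-hour cap applies and the check runs every 3600 seconds.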
Automatic Remediation
• Make the computer try something before paging
• Try it repeatedly if necessary
monitoring_check { 'check_syslogd':
  page                => true,
  check_every         => '5m',
  alert_after         => '10m',
  realert_every       => 10,
  runbook             => 'http://wiki/syslogd',
  command             => '/usr/lib/nagios/plugins/check_proc syslogd',
  remediation_action  => '/etc/init.d/syslogd start',
  remediation_retries => 1,
}
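One way to picture "try something before paging" is as a decision made on each consecutive failure. The policy below is an assumption for illustration only; the slides do not say how the real handler orders waiting, remediation and paging, and `next_action` is a hypothetical name.

```python
def next_action(occurrences, alert_after_checks, remediation_retries):
    """Decide what to do on the Nth consecutive failure of a check.

    occurrences: consecutive failing results so far (1 = first failure).
    alert_after_checks: failures tolerated before anything happens.
    remediation_retries: how many times to run the remediation action.
    """
    if occurrences <= alert_after_checks:
        return "wait"
    attempts_so_far = occurrences - alert_after_checks - 1
    if attempts_so_far < remediation_retries:
        return "remediate"
    return "page"
```

Under this assumed policy, a check that runs every 5 minutes with a 10 minute grace period and one retry would wait on failures 1 and 2, run the remediation action on failure 3, and page on failure 4.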
Server Maintenance
• Don’t alert the on-call engineer while someone is working on a server
rc0.d
function is_sensu_cli_available {
  which sensu-cli >/dev/null
  return $?
}
deregister () {
  if is_sensu_cli_available ; then
    # By this run level, sensu-client should be stopped, but we can stop anyway.
    service sensu-client stop 2>/dev/null
    sensu-cli client delete "$(/opt/puppet-omnibus/bin/facter fqdn)"
  fi
}
case "$1" in
  start)
    echo "$0 does nothing on start."
    ;;
  stop)
    # Refuse to run on any runlevel except 0.
    if runlevel | grep -q 0; then
      deregister
    fi
    ;;
esac
rc6.d
# Assumption: the slide does not show where fqdn is set; facter is used here
# in the same way as in the rc0.d script.
fqdn="$(/opt/puppet-omnibus/bin/facter fqdn)"
function is_sensu_cli_available {
  which sensu-cli >/dev/null
  return $?
}
function silence_for {
  if is_sensu_cli_available; then
    if sensu-cli stash show "silence/${fqdn}" | grep -q "total items"; then
      echo "Silence already detected for ${fqdn}. Not replacing the existing silence"
    else
      echo "Automatically silencing this host, ${fqdn} for $1 seconds in sensu..."
      sensu-cli silence "${fqdn}" --expire "$1" --reason "Reboot initiated."
    fi
  else
    echo "sensu-cli is unavailable"
  fi
}
case "$1" in
  start)
    echo "Sensu silence does nothing on start"
    exit 0
    ;;
  stop)
    # Refuse to run on any runlevel except 6.
    echo "Sensu silenced by reboot " | logger -s -t sensu-silence
    silence_for 1800 2>&1 | logger -s -t sensu-silence
    ;;
esac
Cluster Checks
• Assert some % of machines are healthy
• Use to reduce alert noise
• If a cluster becomes unavailable, you want someone to be paged
• If one machine becomes unavailable, it’s not a problem: open a JIRA ticket to get it fixed in core hours
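A cluster check of this kind reduces to a percentage test over per-host results. A minimal sketch, assuming the per-host statuses have already been collected into a dict; `cluster_status` and its threshold parameter are hypothetical names:

```python
def cluster_status(host_statuses, min_healthy_pct):
    """Return (exit_code, message) for a cluster given per-host check statuses.

    host_statuses maps host name -> exit code of the underlying check (0 = OK).
    """
    if not host_statuses:
        return 3, "UNKNOWN - no hosts reported"
    total = len(host_statuses)
    healthy = sum(1 for status in host_statuses.values() if status == 0)
    pct = 100.0 * healthy / total
    if pct < min_healthy_pct:
        return 2, f"CRITICAL - {healthy}/{total} hosts healthy ({pct:.0f}%)"
    return 0, f"OK - {healthy}/{total} hosts healthy ({pct:.0f}%)"
```

With a 70% threshold, one failed host out of four leaves the cluster check green, so only the per-host check (which could open a ticket rather than page) reports the problem.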
@YelpEngineering
fb.com/YelpEngineers
engineeringblog.yelp.com
github.com/yelp

Curso chino principiante
 

Superb Supervision of Short-lived Servers with Sensu

  • 2. Yelp’s Mission Connecting people with great local businesses.
  • 6. Short-Lived Servers • Servers in auto-scaling groups • Short batch servers existing while the batch runs • Ensuring latest image build
  • 10. Why Sensu? • Designed to be pluggable / extensible • Arbitrary check metadata • Simple model • Components do exactly one thing • Ruby • Not afraid to extend (or fork!)
  • 12. https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/ “Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
  • 15. How to use Sensu • Don’t use all of this! • Standalone checks only • Default in Puppet module
  • 19. Sensu Data Flow • Sensu client runs checks locally on each machine • Results are published to RabbitMQ Cluster • Sensu Server in H/A Cluster • Processes check results from RabbitMQ • Invokes appropriate handlers • Writes state to Redis • Redis + Redis Sentinel • 2+ instances in each cluster • Read by the Sensu API • Every layer is behind HAProxy
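The data flow on this slide can be sketched in a few lines. This is a toy model only: an in-memory `queue.Queue` stands in for RabbitMQ, a plain dict stands in for Redis, and the function names `run_check` and `process_results` are made up for illustration. Calling handlers only on non-zero results is a simplification, not a description of Sensu's event logic.

```python
import queue


def run_check(name, command_fn):
    """Client side: run a check locally and build a result record."""
    status, output = command_fn()
    return {"check": name, "status": status, "output": output}


def process_results(transport, state, handlers):
    """Server side: drain the queue, record state, call handlers on failures."""
    while True:
        try:
            result = transport.get_nowait()
        except queue.Empty:
            break
        state[result["check"]] = result["status"]
        if result["status"] != 0:
            for handler in handlers:
                handler(result)


transport = queue.Queue()  # stand-in for the RabbitMQ cluster
state = {}                 # stand-in for Redis
alerts = []                # records what the single handler received

# Two clients publish results; the server then processes them.
transport.put(run_check("check_disk_slash", lambda: (2, "DISK CRITICAL")))
transport.put(run_check("check_syslogd", lambda: (0, "PROCS OK")))
process_results(transport, state, [alerts.append])
```

After the run, `state` holds the latest status per check (what an API layer would read back), and `alerts` holds only the failing disk check.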
  • 21. Mutually Assured Monitoring • Multiple independent Sensu clusters per data centre/environment • 2+ RabbitMQ Servers • 2+ Redis Servers • 2+ Sensu Server/API Servers • Each cluster monitors each other
  • 26. Machine Readable Config • /etc/sensu/conf.d/checks/$check_name.json • One check per file • Extensible with arbitrary metadata • Hash merge • Never edit by hand!
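The "one check per file" plus "hash merge" idea can be illustrated with a small recursive merge over JSON fragments. This is a sketch under assumptions: `deep_merge` and `load_config` are hypothetical helpers, and Sensu's own merge rules (for example, how arrays are combined) may differ from what is shown here.

```python
import json


def deep_merge(base, override):
    """Return a new dict where nested dicts are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(fragments):
    """Merge a sequence of JSON documents into one configuration hash."""
    config = {}
    for text in fragments:
        config = deep_merge(config, json.loads(text))
    return config


# Each string stands in for the contents of one file under conf.d/checks/.
fragments = [
    '{"checks": {"check_disk_slash": {"command": "check_disk -p /", "interval": 300}}}',
    '{"checks": {"check_syslogd": {"command": "check_proc syslogd", "interval": 300}}}',
    '{"checks": {"check_disk_slash": {"team": "operations"}}}',
]
config = load_config(fragments)
```

Because every file contributes to the same top-level `checks` hash, adding or removing a check is just adding or removing a file, which is what makes the directory easy for a tool like Puppet to manage.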
  • 29. Let Puppet Do The Work • Puppet is already working in the environment • It knows everything about every node in the environment • Puppet is human readable
  • 30. monitoring_check
        monitoring_check { 'check_disk_slash':
          page          => true,
          check_every   => '5m',
          alert_after   => '30m',
          realert_every => 10,
          runbook       => 'http://wiki/check_disk/slash',
          command       => '/usr/lib/nagios/plugins/check_disk -c 10% -K 10 -p /',
        }
  • 37. sensu::check • monitoring_check wraps this • Writes a JSON file for each check • Comment safe
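To make the "writes a JSON file for each check" step concrete, here is a rough Python sketch of turning human-friendly parameters into a check definition. It is not the Puppet module's code: `human_time_to_seconds` and `render_check` are hypothetical names, and the set of supported time units is an assumption.

```python
import json
import re

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def human_time_to_seconds(value):
    """Convert strings such as '5m' to seconds; integers pass through."""
    if isinstance(value, int):
        return value
    match = re.fullmatch(r"(\d+)([smhd])", value)
    if match is None:
        raise ValueError(f"unrecognised duration: {value!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def render_check(name, command, check_every, alert_after, **metadata):
    """Build the JSON document for one standalone check."""
    definition = {
        "standalone": True,
        "command": command,
        "interval": human_time_to_seconds(check_every),
        "alert_after": human_time_to_seconds(alert_after),
    }
    definition.update(metadata)  # arbitrary extra keys ride along untouched
    return json.dumps({"checks": {name: definition}}, indent=2, sort_keys=True)


rendered = render_check(
    "check_disk_slash",
    command="/usr/lib/nagios/plugins/check_disk -c 10% -K 10 -p /",
    check_every="5m",
    alert_after="30m",
    page=True,
    runbook="http://wiki/check_disk/slash",
)
```

The string in `rendered` is what would be written to a per-check file; the custom keys (`page`, `runbook`) are carried through unchanged so that handlers can read them later.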
  • 38.
        {
          "checks": {
            "check_disk_slash": {
              "standalone": true,
              "handlers": [ "default" ],
              "command": "/usr/lib/nagios/plugins/check_disk -c 5% -K 5 -p /",
              "dependencies": [ ],
              "interval": 300,
              "timeout": 300,
              "alert_after": 300,
              "realert_every": "10",
              "runbook": "http://wiki/disk-slash",
              "team": "operations",
              "irc_channels": "operations-notifications",
              "notification_email": "undef",
              "ticket": false,
              "project": false,
              "page": true,
              "tip": "Try: sudo du -h -x --max-depth=1 /",
              "tags": [ ]
            }
          }
        }
  • 44. Check Scripts • Same as Nagios checks • Simple text output • Posix Exit Codes • Result sent to Sensu Server, along with check definition • Includes all custom metadata • Custom handlers process the extra data
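A check script in this style only needs to print one line of text and exit with a Nagios-style status code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below is a hypothetical free-space check written for illustration, with made-up default thresholds; it is not one of the plugins from the talk.

```python
import shutil

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_free_space(free_percent, warn_below=20.0, crit_below=10.0):
    """Map a free-space percentage to (exit status, one-line message)."""
    if free_percent < crit_below:
        return CRITICAL, f"DISK CRITICAL - {free_percent:.1f}% free"
    if free_percent < warn_below:
        return WARNING, f"DISK WARNING - {free_percent:.1f}% free"
    return OK, f"DISK OK - {free_percent:.1f}% free"


def check_disk(path="/"):
    """Measure free space on a path; report UNKNOWN if it cannot be read."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as error:
        return UNKNOWN, f"DISK UNKNOWN - {error}"
    return evaluate_free_space(100.0 * usage.free / usage.total)


def main(path="/"):
    """Print the message and return the status for use as the exit code."""
    status, message = check_disk(path)
    print(message)
    return status
```

Saved as an executable script, the entry point would end with `sys.exit(main())` so that the client sees the status as the process exit code and the printed line as the check output.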
  • 49. Handlers • base • JIRA • email • irc • pagerduty
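Since each result arrives with the check's custom metadata attached, a handler layer can decide where a notification goes from the data alone. The routing rules below are an assumption made for illustration (including treating the literal string "undef" as "no email address"); the talk does not spell out how the real handlers decide.

```python
def select_handlers(check):
    """Pick notification channels for a failing check from its metadata."""
    selected = []
    if check.get("page"):
        selected.append("pagerduty")
    if check.get("ticket"):
        selected.append("jira")
    if check.get("irc_channels"):
        selected.append("irc")
    email = check.get("notification_email")
    if email and email != "undef":  # assumption: "undef" means unset
        selected.append("email")
    return selected


# Metadata copied from the check_disk_slash definition shown earlier.
disk_check = {
    "team": "operations",
    "irc_channels": "operations-notifications",
    "notification_email": "undef",
    "ticket": False,
    "page": True,
}
channels = select_handlers(disk_check)
```

With this metadata the sketch routes the alert to PagerDuty and IRC only, because `ticket` is false and no usable email address is set.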
  • 51. How Do The Checks Get Executed? • Each machine runs the client • Client is managed entirely by Puppet
  • 54. Single Source of Truth • DNS is canonical source for sensu servers • Configure things in one place
        # Use DNS to detect if this server is a sensu server
        $local_sensu_server_ips_array = gethostbyname2array("sensu.local-${::habitat}.yelpcorp.com")
        $array_intersection = intersection($::all_ipaddresses, $local_sensu_server_ips_array)
        # If one of our ipaddresses are in the dns entries, we must be a sensu server!
        if size($array_intersection) > 0 {
          $is_sensu_server = true
          $local_sensu_server_array = gethostbyaddr2array($local_sensu_server_ips_array)
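The same DNS-as-source-of-truth test can be expressed outside Puppet. The sketch below mirrors the intersection logic from the slide using Python's standard `socket.gethostbyname_ex`; the function names and the injectable `resolver` parameter are illustrative choices, not part of the original code.

```python
import socket


def resolve_ips(hostname):
    """Return all IPv4 addresses for a name, or an empty list on failure."""
    try:
        _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    except OSError:
        return []
    return addresses


def is_sensu_server(local_ips, server_hostname, resolver=resolve_ips):
    """True if any local address appears in the DNS record for the servers."""
    server_ips = set(resolver(server_hostname))
    return bool(server_ips & set(local_ips))


def fake_resolver(hostname):
    """Canned answer so the example runs without touching real DNS."""
    return ["10.0.0.5", "10.0.0.6"]


on_server = is_sensu_server(["127.0.0.1", "10.0.0.5"], "sensu.example.invalid", fake_resolver)
on_client = is_sensu_server(["127.0.0.1", "10.0.0.9"], "sensu.example.invalid", fake_resolver)
```

Passing the resolver in makes the decision easy to exercise in isolation; in real use the default `resolve_ips` would query the Sensu server record directly.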
  • 59. Automatic Monitoring • Cron Jobs - check if a job was completed successfully • cron::d
        if $staleness_threshold {
          $actual_command = "/nail/sys/bin/success_wrapper '${reporting_name}' ${command}"
          cron::staleness_check { $reporting_name:
            threshold => $staleness_threshold,
            params    => $staleness_check_params,
            user      => $user,
          }
        } else {
          $actual_command = $command
        }
  • 62. define cron::staleness_check( $threshold, $params, $user, ) { $threshold_s = cron_human_time_to_seconds($threshold) # Check whether we are fresh five times per threshold, not to exceed 1 hour if $threshold_s / 5 > 3600 { $check_every = 3600 } else { $check_every = $threshold_s / 5 } $check_title = "${name}_staleness" $overrides = { 'command' => "/nail/sys/bin/cron_staleness_check ${name} ${threshold_s}", 'check_every' => $check_every, 'needs_sudo' => true, 'alert_after' => '2m', } $check_data = { "$check_title" => merge( $params, $overrides ) } create_resources('monitoring_check', $check_data) file { "/nail/run/success_wrapper/${name}": ensure => 'file', owner => $user, mode => '640', } -> Monitoring_check[$check_title] }
  • 63. define cron::staleness_check( $threshold, $params, $user, ) { $threshold_s = cron_human_time_to_seconds($threshold) # Check whether we are fresh five times per threshold, not to exceed 1 hour if $threshold_s / 5 > 3600 { $check_every = 3600 } else { $check_every = $threshold_s / 5 } $check_title = "${name}_staleness" $overrides = { 'command' => "/nail/sys/bin/cron_staleness_check ${name} ${threshold_s}", 'check_every' => $check_every, 'needs_sudo' => true, 'alert_after' => '2m', } $check_data = { "$check_title" => merge( $params, $overrides ) } create_resources('monitoring_check', $check_data) file { "/nail/run/success_wrapper/${name}": ensure => 'file', owner => $user, mode => '640', } -> Monitoring_check[$check_title] }
  • 64. define cron::staleness_check( $threshold, $params, $user, ) { $threshold_s = cron_human_time_to_seconds($threshold) # Check whether we are fresh five times per threshold, not to exceed 1 hour if $threshold_s / 5 > 3600 { $check_every = 3600 } else { $check_every = $threshold_s / 5 } $check_title = "${name}_staleness" $overrides = { 'command' => "/nail/sys/bin/cron_staleness_check ${name} ${threshold_s}", 'check_every' => $check_every, 'needs_sudo' => true, 'alert_after' => '2m', } $check_data = { "$check_title" => merge( $params, $overrides ) } create_resources('monitoring_check', $check_data) file { "/nail/run/success_wrapper/${name}": ensure => 'file', owner => $user, mode => '640', } -> Monitoring_check[$check_title] }
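The define above wires up a check command but the deck does not show the check script itself. As a minimal sketch only, assuming the success wrapper touches a marker file on each successful run and that GNU `stat` is available, the interval rule and the freshness test could look like this. The function names `check_interval` and `staleness_status` are hypothetical and are not the real `cron_staleness_check`.

```shell
# Hypothetical sketch; not the real /nail/sys/bin/cron_staleness_check.

# Check five times per threshold, but never less often than once an hour
# (mirrors the $check_every logic in the Puppet define).
check_interval() {
  local threshold_s interval
  threshold_s=$1
  interval=$(( threshold_s / 5 ))
  if [ "$interval" -gt 3600 ]; then
    interval=3600
  fi
  echo "$interval"
}

# Return 0 if the marker file was modified within threshold_s seconds,
# 2 (the Nagios/Sensu CRITICAL exit code) if it is older or missing.
staleness_status() {
  local marker threshold_s now mtime age
  marker=$1
  threshold_s=$2
  if [ ! -e "$marker" ]; then
    echo "CRITICAL: $marker does not exist"
    return 2
  fi
  now=$(date +%s)
  mtime=$(stat -c %Y "$marker")   # GNU stat; BSD/macOS would need `stat -f %m`
  age=$(( now - mtime ))
  if [ "$age" -gt "$threshold_s" ]; then
    echo "CRITICAL: last success ${age}s ago (threshold ${threshold_s}s)"
    return 2
  fi
  echo "OK: last success ${age}s ago"
  return 0
}
```

For example, `check_interval 600` prints `120`, while a one-day threshold (`check_interval 86400`) is capped at `3600`.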
  • 69. Automatic Remediation • Make the computer try something before paging • Try it repeatedly if necessary
      monitoring_check { 'check_syslogd':
        page                => true,
        check_every         => '5m',
        alert_after         => '10m',
        realert_every       => 10,
        runbook             => 'http://wiki/syslogd',
        command             => '/usr/lib/nagios/plugins/check_proc syslogd',
        remediation_action  => '/etc/init.d/syslogd start',
        remediation_retries => 1
      }
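To make the "try something before paging" idea concrete, here is a minimal standalone sketch of a check-remediate-recheck loop. It is an illustration under assumptions, not how `monitoring_check` or the Sensu handlers implement `remediation_action` and `remediation_retries`; the function name `check_with_remediation` is hypothetical.

```shell
# Hypothetical sketch: run a check and, while it fails, run a remediation
# command up to <retries> times before reporting CRITICAL (i.e. before paging).
check_with_remediation() {
  local check_cmd remediation_cmd retries attempt
  check_cmd=$1
  remediation_cmd=$2
  retries=$3
  attempt=0
  while ! sh -c "$check_cmd" >/dev/null 2>&1; do
    if [ "$attempt" -ge "$retries" ]; then
      echo "CRITICAL: check still failing after $attempt remediation attempt(s)"
      return 2
    fi
    attempt=$(( attempt + 1 ))
    echo "check failed; remediation attempt $attempt of $retries"
    # A failing remediation is not fatal; the next check decides the outcome.
    sh -c "$remediation_cmd" >/dev/null 2>&1 || true
  done
  echo "OK: check passing after $attempt remediation attempt(s)"
  return 0
}
```

With the values from the slide, a call might look like `check_with_remediation '/usr/lib/nagios/plugins/check_proc syslogd' '/etc/init.d/syslogd start' 1`.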
  • 70. Server Maintenance • Don’t alert the on-call engineer if someone is working on a server
  • 71. rc0.d
      function is_sensu_cli_available {
        which sensu-cli >/dev/null
        return $?
      }

      deregister () {
        if is_sensu_cli_available ; then
          # By this run level, sensu-client should be stopped, but we can stop anyway.
          service sensu-client stop 2>/dev/null
          sensu-cli client delete "$(/opt/puppet-omnibus/bin/facter fqdn)"
        fi
      }

      case "$1" in
        start)
          echo "$0 does nothing on start."
          ;;
        stop)
          # Refuse to run on any runlevel except 0.
          if runlevel | grep -q 0; then
            deregister
          fi
          ;;
      esac
  • 76. rc6.d
      function is_sensu_cli_available {
        which sensu-cli >/dev/null
        return $?
      }

      # Note: ${fqdn} is not defined on this slide; it is assumed to be set
      # earlier in the script.
      function silence_for {
        if is_sensu_cli_available; then
          if sensu-cli stash show "silence/${fqdn}" | grep -q "total items"; then
            echo "Silence already detected for ${fqdn}. Not replacing the existing silence"
          else
            echo "Automatically silencing this host, ${fqdn} for $1 seconds in sensu..."
            sensu-cli silence "${fqdn}" --expire "$1" --reason "Reboot initiated."
          fi
        else
          echo "sensu-cli is unavailable"
        fi
      }

      case "$1" in
        start)
          echo "Sensu silence does nothing on start"
          exit 0
          ;;
        stop)
          # Refuse to run on any runlevel except 6.
          echo "Sensu silenced by reboot " | logger -s -t sensu-silence
          silence_for 1800 2>&1 | logger -s -t sensu-silence
          ;;
      esac
  • 83. Cluster Checks • Assert some % of machines are healthy • Use to reduce alert noise • If a cluster becomes unavailable, you want someone to be paged • If one machine becomes unavailable, it’s not a problem - open a JIRA ticket to get it fixed in core hours
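The "assert some % of machines are healthy" rule can be sketched as a small threshold function. This is an illustration only: the deck does not show how the healthy and total counts are gathered, so they are plain arguments here, and the name `cluster_status` is hypothetical. Exit codes follow the usual Nagios/Sensu convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```shell
# Hypothetical sketch: CRITICAL if fewer than min_pct percent of the
# machines in a cluster are healthy, OK otherwise.
cluster_status() {
  local healthy total min_pct pct
  healthy=$1
  total=$2
  min_pct=$3
  if [ "$total" -le 0 ]; then
    echo "UNKNOWN: no machines in cluster"
    return 3
  fi
  # Integer arithmetic: the percentage is rounded down.
  pct=$(( healthy * 100 / total ))
  if [ "$pct" -lt "$min_pct" ]; then
    echo "CRITICAL: ${healthy}/${total} healthy (${pct}%, need ${min_pct}%)"
    return 2
  fi
  echo "OK: ${healthy}/${total} healthy (${pct}%)"
  return 0
}
```

With an 80% threshold, one failed machine out of ten (`cluster_status 9 10 80`) stays OK and can be handled by a ticket, while `cluster_status 3 10 80` returns CRITICAL and would page.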