Dylan Ferreira from FuseMail shares how his team uses the Telegraf syslog input plugin to troubleshoot their systems faster and more reliably. Dylan covers how to set up Rsyslog and Telegraf to filter logs, then how to configure Kapacitor to look for interesting events in raw logs and trigger alerts to your team. He then brings this all together in a dashboard for your teams to use.
4. Agenda
● Our TICK stack layout
● Configuring Rsyslog & Telegraf
● Doing more with Telegraf plugins
● Using Kapacitor to search & sanitize log data
● Exploring this data with Grafana
5. How did we get here?
● Multi-dimensional TSDB.
● Very high performance.
● All data is aggregated on ingest.
● Can store raw metrics.
● Multi-dimensional TSDB.
● Complex aggregations (CQs).
● Multiple fields!
● Can store raw metrics.
● Flat structure with metadata stored in the metric name (pre-v1).
● Fixed per-series aggregations
6. What is Telegraf?
“Telegraf is InfluxData's open source
plugin-driven server agent for collecting
and reporting metrics.”
● Inputs
● Processors
● Aggregators
● Outputs
● Stream Buffer
● Metrics Router
12. The OOM Killer
/**
* oom_badness - heuristic function to determine which candidate task to kill
* @p: task struct of which task we should calculate
* @totalpages: total present RAM allowed for page allocation
* @memcg: task's memory controller, if constrained
* @nodemask: nodemask passed to page allocator for mempolicy ooms
*
* The heuristic for determining which task to kill is made to be as simple and
* predictable as possible. The goal is to return the highest value for the
* task consuming the most memory to avoid subsequent oom failures.
*/
13. Armouring Telegraf Against the OOM Killer
/etc/systemd/system/telegraf.service.d/oom_score_adj.conf
[Service]
OOMScoreAdjust=-600
/proc/$PID/oom_score_adj: value between -1000 and +1000
/proc/$PID/oom_adj: value between -17 and +15 (older, deprecated interface)
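One way to confirm that the drop-in took effect is to read the adjustment back from procfs. The following is a minimal Python sketch and not part of the original talk; the helper name `read_oom_score_adj` and its `proc_root` parameter are hypothetical and exist only for illustration.

```python
from pathlib import Path


def read_oom_score_adj(pid: int, proc_root: str = "/proc") -> int:
    """Return the OOM score adjustment (-1000..+1000) recorded for a PID.

    proc_root is an illustrative parameter (an assumption, not a kernel or
    systemd concept) so the helper can be pointed at a fake procfs tree.
    """
    raw = Path(proc_root, str(pid), "oom_score_adj").read_text()
    value = int(raw.strip())
    if not -1000 <= value <= 1000:
        raise ValueError(f"unexpected oom_score_adj value: {value}")
    return value
```

After a `systemctl daemon-reload` and a restart of Telegraf, calling this helper with the Telegraf PID should return -600 if the override above was picked up.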
15. Rsyslog Config
$ActionQueueType LinkedList
$ActionQueueFileName telegraf
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
# forward over tcp with octet framing according to RFC 5425
*.* @@(o)localhost:6514;RSYSLOG_SyslogProtocol23Format
# all logs that contain the string "**WARNING**"
if ($msg contains '**WARNING**') then
@@(o)localhost:6514;RSYSLOG_SyslogProtocol23Format
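The `(o)` option in the forwarding rules above selects octet-counted framing, where each message on the TCP stream is prefixed by its length in octets and a single space. The sketch below illustrates that framing in Python under stated assumptions: the function names `octet_frame` and `split_frames` are hypothetical, and the splitter assumes the buffer holds only complete frames.

```python
def octet_frame(message: str) -> bytes:
    """Prefix a syslog message with its length in octets and a space."""
    payload = message.encode("utf-8")
    return str(len(payload)).encode("ascii") + b" " + payload


def split_frames(stream: bytes) -> list:
    """Split a buffer of complete octet-counted frames into messages.

    Assumption: no partial frame is present at the end of the buffer.
    """
    messages = []
    pos = 0
    while pos < len(stream):
        space = stream.index(b" ", pos)      # end of the length prefix
        length = int(stream[pos:space])      # payload size in octets
        start = space + 1
        messages.append(stream[start:start + length])
        pos = start + length
    return messages
```

Because the length is counted in octets rather than characters, a message may contain newlines or multi-byte characters without confusing the receiver, which is the main reason to prefer this framing over newline-delimited transport.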
17. Telegraf Output
● syslog
○ tags
■ severity (string)
■ facility (string)
■ hostname (string)
■ appname (string)
○ fields
■ version (integer)
■ severity_code (integer)
■ facility_code (integer)
■ timestamp (integer): the time recorded in the syslog message
■ procid (string)
■ msgid (string)
■ sdid (bool)
■ Structured Data (string)
○ timestamp: the time the message was received (the timestamp given by Telegraf); the timestamp field above holds the original time from the syslog message
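The `severity_code` and `facility_code` fields correspond to the two halves of the syslog PRI value, which RFC 5424 defines as facility * 8 + severity. A minimal Python sketch of that decoding follows; the name `decode_pri` is hypothetical, and the keyword list is the conventional RFC 5424 severity naming, assumed here to line up with the `severity` tag.

```python
# Conventional severity keywords, indexed by severity code (0..7).
SEVERITY_KEYWORDS = [
    "emerg", "alert", "crit", "err", "warning", "notice", "info", "debug",
]


def decode_pri(pri: int) -> tuple:
    """Split a syslog PRI value into (facility_code, severity_code)."""
    if not 0 <= pri <= 191:
        raise ValueError(f"PRI out of range: {pri}")
    # PRI = facility * 8 + severity, so divmod recovers both parts.
    return divmod(pri, 8)
```

For example, a message starting with `<34>` decodes to facility code 4 and severity code 2.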
18. Timestamps & Ingest Latency
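// Stream syslog points, grouped by host, in aligned one-minute windows.
var data = stream
    |from()
        .database('syslog')
        .retentionPolicy('autogen')
        .measurement('syslog')
    |groupBy('hostname')
    |window()
        .align()
        .period(1m)
        .every(1m)
    // Lag = time Telegraf received the message minus the time recorded in the message.
    |eval(lambda: unixNano("time") - "timestamp")
        .as('timestamp_lag_ns')

// Average the lag over each window.
var mean_latency = data
    |mean('timestamp_lag_ns')
        .as('val')

mean_latency
    |log()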
Typical Latency: under 1ms
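The same calculation can be sketched outside Kapacitor to make the arithmetic explicit: for each point, subtract the original syslog timestamp from the time Telegraf received it, then average per host. This Python version is an illustration only; the function name `mean_lag_ns` and the tuple layout of the input are assumptions, not something taken from the talk.

```python
from collections import defaultdict
from statistics import mean


def mean_lag_ns(points):
    """Return the mean ingest lag per hostname, in nanoseconds.

    Each point is assumed to be a (hostname, received_ns, original_ns)
    tuple, with both times expressed as Unix nanoseconds.
    """
    lags = defaultdict(list)
    for hostname, received_ns, original_ns in points:
        lags[hostname].append(received_ns - original_ns)
    return {host: mean(values) for host, values in lags.items()}
```

A negative mean for a host would suggest clock skew between that host and the machine running Telegraf rather than genuine ingest delay.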
26. Rewriting your data with regexReplace
IP Addresses
var ipv4_address = /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/
|eval(lambda: regexReplace(ipv4_address, "message", '<ipv4-address>'))
    .as('message')
    .keep()
Email Addresses
var email_address = /\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*/
|eval(lambda: regexReplace(email_address, "message", '<email-address>'))
    .as('message')
    .keep()
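To check what these two substitutions do to a message before deploying them in Kapacitor, the same patterns can be tried in Python. This is a rough equivalent rather than the TICKscript itself; the function name `sanitize` is hypothetical, and Python's `re` engine is assumed to treat these particular patterns the same way as Kapacitor's regular expressions.

```python
import re

# Same patterns as the TICKscript snippets, with the escapes written out.
IPV4_ADDRESS = re.compile(r"\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}")
EMAIL_ADDRESS = re.compile(r"\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*")


def sanitize(message: str) -> str:
    """Replace IPv4 and email addresses in a log message with placeholders."""
    message = IPV4_ADDRESS.sub("<ipv4-address>", message)
    return EMAIL_ADDRESS.sub("<email-address>", message)
```

Replacing the variable parts with fixed placeholders means otherwise identical messages collapse to the same string, which makes them easier to group and count.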
33. The Kernel & Process Names
Memory cgroup out of memory: Kill process 3056 (upstart-socket-) score 1057 or sacrifice child
The kernel truncates process names to 15 characters (TASK_COMM_LEN is 16 bytes, one of which holds the terminating NUL)
and stores the result in /proc/<PID>/comm
e.g. upstart-socket-bridge becomes upstart-socket-
include/linux/sched.h
/* Task command name length: */
#define TASK_COMM_LEN 16
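When matching OOM killer log lines back to a service, it can help to apply the same truncation to the full process name first. A minimal Python sketch, assuming the name is plain text encoded as UTF-8; the helper name `truncate_comm` is hypothetical.

```python
TASK_COMM_LEN = 16  # from include/linux/sched.h, includes the trailing NUL


def truncate_comm(name: str) -> str:
    """Return the process name as it would appear in /proc/<PID>/comm."""
    # Keep at most 15 bytes; the 16th byte is reserved for the NUL.
    raw = name.encode("utf-8")[: TASK_COMM_LEN - 1]
    # Drop any multi-byte character cut in half by the truncation.
    return raw.decode("utf-8", errors="ignore")
```

Names of 15 bytes or fewer pass through unchanged, so the helper can be applied to every known service name before comparing against the log.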