Presentation for the Splunk User Group in London (June 2018), highlighting the importance of time extraction and parsing within a Splunk deployment, answering questions such as
"Why is it relevant to have the correct timestamp in my Splunk events?", "How can I troubleshoot latency in my environment?" and "What can I do if my events are arriving late?"
1. Splunk User Group London
Daniel Hernandez
Dealing with delayed events
2. Agenda
1. Housekeeping rules & Introductions
2. Time extraction and parsing in Splunk
3. Monitoring delayed events
4. Impact on Splunk workflow
5. Potential risks deriving from delayed events
3. Housekeeping
Feel free to stand up and grab a refill. Splunk brought us here, pizza keeps us here.
Ask away! You're not interrupting unless you're asking where the exit is.
Join the community, connect, share. Reach out if you'd like to contribute.
4. Introduction – Daniel Hernandez
• Background in Networks and Security.
• Splunk SCC1, working with ECS for about a year and a half in Security.
• Currently leveraging Splunk for a SIEM replacement project in the banking sector.
6. "Time is what keeps everything from happening at once"
1. Timezones: time values from different locations may differ – a lot.
2. Realtime/batch processing: sometimes logs are collected in hourly/daily chunks.
3. Correlation searches (rules) and forensic investigations rely on the extracted time.
_time: log generation time, extracted from the log itself (Extracted Time).
_indextime: event indexing time, generated by an indexer (Index Time).
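A quick way to see the two timestamps side by side (the index name here is just a placeholder):

index=my_index earliest=-1h
| eval event_time=strftime(_time, "%F %T"), index_time=strftime(_indextime, "%F %T")
| table event_time index_time host source

Any sizeable gap between the two columns is delay; a negative gap is clock skew.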
7. First things first!
First of all >
1. Make sure the _time field is extracted correctly!
2. You don't want to use Splunk to report on internal network metrics: if _time falls back to arrival time, your searches reflect when data arrived, not when events happened.
3. Time extraction should be transparent for dashboards, alerts, and reports.
Second of all >
Check your clock skew to monitor any potential delays within the Splunk infrastructure.
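A quick skew sanity check, sketched here with an arbitrary 60-second threshold: events whose _time sits well ahead of their _indextime usually point at a clock or time zone problem on the source.

index=* earliest=-15m latest=+1d
| eval skew_sec=_time-_indextime
| where skew_sec>60
| stats count by host, sourcetype

Setting latest=+1d matters: with the default latest=now, events carrying future timestamps would be filtered out before you ever saw them.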
9. Monitoring clock skew >
Code openly available on GitHub.
Based on Simple XML and tstats.
Uses a moving average of the delay to display clock-skew violations.
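Not the exact code from the repo, but a sketch of the kind of tstats search it builds on; tstats can aggregate _indextime because it is an indexed field, which keeps the dashboard fast:

| tstats max(_indextime) AS index_time WHERE index=* earliest=-4h BY index, sourcetype, _time span=1s
| eval delay_sec=index_time-_time
| timechart span=5m avg(delay_sec) BY sourcetype

From there the dashboard smooths delay_sec with a moving average (e.g. trendline) and flags anything that crosses a threshold.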
10. Symptoms
Events collected from a forwarder or from a log file are not yet searchable in Splunk.
Even though the timestamps of the events are within the search time range, a search does not return the events.
Later, a search over the same time range returns the events.
12. Narrowing down the issue:
source=mysource
| eval delay_sec=_indextime-_time
| timechart min(delay_sec) avg(delay_sec) max(delay_sec) by host
source=mysource
| eval delay_sec=_indextime-_time
| timechart min(delay_sec) avg(delay_sec) max(delay_sec) by source
Determine the common denominator between them. For example, all of the
delayed events might be from the same log file or the same host or source type.
Also, compare the delay from your events with the delay from the internal
Splunk logs.
index=_internal source=*splunkd.log*
| eval delay_sec=_indextime-_time
| timechart min(delay_sec) avg(delay_sec) max(delay_sec) by host
13. Finding the root cause
If some sources are delayed but not others, this indicates a problem with the input:
• Thruput limits (quick check below)
• Network limits
• Time zone issues
• Windows event log delays
If all the logs are delayed, including the internal logs, then the delay is a forwarding issue.
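For the thruput case specifically, a forwarder logs a message when it hits the maxKBps limit in limits.conf; the exact wording may vary between versions, but a search along these lines will surface throttled hosts:

index=_internal source=*splunkd.log* "Current data throughput" "has reached maxKBps"
| timechart count by host

Hosts that show up here regularly need their maxKBps raised or their inputs split.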
14. Data Pipeline
At a very high level:
• Parsing queue/pipeline: responsible for source typing, line breaking, timestamping, event boundaries, regex.
• Indexing queue/pipeline: event segmentation and indexing, index building.
Splunk Admin 101:
There's a lot that can go wrong. You'll find A LOT of creative ways to trash your data pipeline in a shared environment.
15. Avoid (trouble)shooting yourself in the foot >
Make sure your Monitoring Console is set up appropriately!
You want to keep a close eye on your queues and spot any potential bottlenecks.
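The Monitoring Console charts are driven by queue metrics in metrics.log; if you want the raw view, something like this (queue names assumed to match your version's pipeline) shows how full each queue is running:

index=_internal source=*metrics.log* group=queue (name=parsingqueue OR name=aggqueue OR name=typingqueue OR name=indexqueue)
| eval fill_pct=round(current_size_kb/max_size_kb*100, 2)
| timechart avg(fill_pct) by name

A queue that sits near 100% is your bottleneck: everything upstream of it backs up and shows as delay.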
16. What will go wrong?
Congratulations, you've found severe delays in your Splunk infrastructure. What can you expect?
Inconsistent dashboards. Inconsistent reports. And different results each time they're run.
You know that saved searches can't be re-run: once a scheduled window has passed, the scheduler won't go back over it, so late-arriving events are silently skipped.
So your funky real-time correlation searches are going to miss events.
17. What can you do?
• Focus on _indextime when writing real-time correlation searches:
index=funky_index _index_earliest=-1h@h _index_latest=now
| <my_funky_correlation_search>
• The scheduled saved search will capture events as they're indexed.
• Events will appear delayed, but they won't be missed by the alert.
• Chase other teams to get it fixed ASAP! Teamwork brings it home.
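A sketch of how the scheduled version might look, run hourly over the previous hour of index time (the index name and output fields are placeholders):

index=funky_index _index_earliest=-1h@h _index_latest=@h
| eval indexed_at=strftime(_indextime, "%F %T"), delay_sec=_indextime-_time
| table _time indexed_at delay_sec host source

Snapping both boundaries to @h makes consecutive runs tile with no gaps and no overlap: every event is examined exactly once, when it is indexed, no matter how late it arrives.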