© 2019 SPLUNK INC.
Best Practices for Data Collection
Richard Morgan – Principal Architect
Improving Event Collection via measurement
© 2019 SPLUNK INC.
During the course of this presentation, we may make forward-looking statements regarding future events or
the expected performance of the company. We caution you that such statements reflect our current
expectations and estimates based on factors currently known to us and that actual events or results could
differ materially. For important factors that may cause actual results to differ from those contained in our
forward-looking statements, please review our filings with the SEC.
The forward-looking statements made in this presentation are being made as of the time and date of its live
presentation. If reviewed after its live presentation, this presentation may not contain current or accurate
information. We do not assume any obligation to update any forward-looking statements we may make. In
addition, any information about our roadmap outlines our general product direction and is subject to change
at any time without notice. It is for informational purposes only and shall not be incorporated into any contract
or other commitment. Splunk undertakes no obligation either to develop the features or functionality
described or to include any such feature or functionality in a future release.
Splunk, Splunk>, Listen to Your Data, The Engine for Machine Data, Splunk Cloud, Splunk Light and SPL are trademarks and registered trademarks of Splunk Inc. in
the United States and other countries. All other brand names, product names, or trademarks belong to their respective owners. © 2019 Splunk Inc. All rights reserved.
Forward-Looking Statements
© 2019 SPLUNK INC.
Spreading that Splunk across EMEA since 2013
Self-professed data junkie and SPL addict
© 2019 SPLUNK INC.
[Diagram: an iceberg with the search heads and indexers at the visible tip and the forwarders below the waterline]
A Splunk installation is much like an iceberg: the visible tip is the indexers, search heads, cluster master, etc.
But this deck is about what lies beneath - the vast and sprawling data collection hierarchy.
© 2019 SPLUNK INC.
Why is event collection tuning important?
1. Data collection is the foundation of any Splunk instance
2. Event distribution underpins how linear scaling is achieved
3. Events must be synchronized in time to correlate across hosts
4. Events must arrive in a timely fashion for alerts to be effective
© 2019 SPLUNK INC.
Use EVENT_BREAKER and / or forceTimeBasedAutoLB
Use indexer discovery whenever possible
Use multiple pipelines on intermediate forwarders
Use autoLBVolume for variable data rate sources
Configure LINE_BREAKER and SHOULD_LINEMERGE=false
Explicitly configure date time format for each sourcetype
A summary of data collection best practices
© 2019 SPLUNK INC.
What is event
distribution?
© 2019 SPLUNK INC.
What is Good Event Distribution?
[Diagram: a forwarder receives 100 MB and sends 25 MB to each of four indexers]
Event distribution is how Splunk spreads its incoming data across multiple indexers
© 2019 SPLUNK INC.
Why is Good Event Distribution important?
[Diagram: execution of a search over a fixed time range takes 10 seconds on each of four evenly loaded indexers; the search head completes in 11 seconds]
Event distribution is critical for the even distribution of search (computation) workload
© 2019 SPLUNK INC.
What is ‘bad’ Event Distribution?
Bad event distribution is when the spread of events is uneven across the indexers
[Diagram: a forwarder receives 100 MB and sends 70 MB to one indexer and 10 MB to each of the other three]
© 2019 SPLUNK INC.
Bad Event Distribution affects search
Search time becomes unbalanced: searches take longer to complete, reducing throughput
[Diagram: execution of a search over a fixed time range takes 10 seconds on three indexers but 90 seconds on the overloaded one; the search head completes in 91 seconds]
© 2019 SPLUNK INC.
What is event
delay?
© 2019 SPLUNK INC.
What is good event delay?
[Diagram: App → log file → UF → IUF → indexer → search head, connected by write, read, s2s, s2s and search]
It should take between 3-5 seconds from event generation to that event being searchable
© 2019 SPLUNK INC.
How do you calculate event delay?
| eval event_delay=_indextime - _time
The timestamp of the event is _time, the time of indexing is _indextime, and the time of search is now()
[Diagram: App → log file → UF → IUF → indexer → search head]
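As a minimal illustration (the index, sourcetype and time range are just examples), the calculation can be wrapped in a search that summarizes delay per host:
index=_internal sourcetype=splunkd earliest=-15m
| eval event_delay=_indextime - _time
| stats avg(event_delay) as avg_delay max(event_delay) as max_delay by host
| sort - max_delay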
© 2019 SPLUNK INC.
How time affects
event distribution
© 2019 SPLUNK INC.
autoLBFrequency (outputs.conf)
autoLBFrequency = <integer>
* The amount of time, in seconds, that a forwarder sends data to an indexer
  before redirecting outputs to another indexer in the pool.
* Use this setting when you are using automatic load balancing of outputs
  from universal forwarders (UFs).
* Every 'autoLBFrequency' seconds, a new indexer is selected randomly from
  the list of indexers provided in the server setting of the target group stanza.
* Default: 30
30 seconds of 1 MB/s is 30 MB for each connection!
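A minimal outputs.conf sketch (the group name and servers are placeholders) showing where the setting lives:
[tcpout:my_indexers]
server = idx1.example.com:9997, idx2.example.com:9997, idx3.example.com:9997
# illustrative value; 30 is the default, later slides discuss when to raise or lower it
autoLBFrequency = 30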
© 2019 SPLUNK INC.
Event distribution improves over time
Switching at 30 seconds
5 mins = 10 connections
1 hour = 120 connections
1 day = 2880 connections
How long before all indexers have data?
10 indexers = 5 mins
50 indexers = 25 mins
100 indexers = 50 mins
Larger clusters take longer to get “good” event distribution
© 2019 SPLUNK INC.
Data is distributed across the indexers over time
Each indexer is allocated 30s of data via Randomized Round Robin
© 2019 SPLUNK INC.
earliest=-90s latest=now
Indexer 4 has no data to process
© 2019 SPLUNK INC.
earliest=-180s latest=now
Indexer 4 has 2x the data to process
© 2019 SPLUNK INC.
It’s not Round Robin (RR), it’s Randomized RR
Round robin:
Round 1 order: 1,2,3,4
Round 2 order: 1,2,3,4
Round 3 order: 1,2,3,4
Round 4 order: 1,2,3,4
Randomized Round Robin:
Round 1 order: 1,4,3,2
Round 2 order: 4,1,2,3
Round 3 order: 2,4,3,1
Round 4 order: 1,3,2,4
Splunk’s randomized round robin algorithm quickens event distribution
© 2019 SPLUNK INC.
Event distribution improves over time
[Chart: standard deviation across indexers plotted against time]
Initially we start connecting to just one indexer and the standard deviation is high. As time passes the standard deviation improves, and as time goes to infinity it should trend to zero. Note that most searches execute over the last 15 mins.
© 2019 SPLUNK INC.
Types of data
flows
© 2019 SPLUNK INC.
Types of data flow coming into Splunk
Periodic data, typically a scripted input, very spiky. Think AWS CloudWatch, think event delay.
Constant data rates, nice and smooth, think metrics.
Variable data rates, typically driven by usage, think web logs.
One shot, much like periodic data.
It can be difficult to optimize for every type of data flow
© 2019 SPLUNK INC.
Time-based LB does not work well on its own
Time-based autoLB is the default and is set to 30s
© 2019 SPLUNK INC.
autoLBVolume (outputs.conf)
autoLBVolume = <integer>
* The volume of data, in bytes, to send to an indexer before a new indexer
  is randomly selected from the list of indexers provided in the server
  setting of the target group stanza.
* This setting is closely related to the 'autoLBFrequency' setting.
  The forwarder first uses 'autoLBVolume' to determine if it needs to
  switch to another indexer. If the 'autoLBVolume' is not reached,
  but the 'autoLBFrequency' is, the forwarder switches to another indexer
  as the forwarding target.
* A non-zero value means that volume-based forwarding is active.
* 0 means the volume-based forwarding is not active.
* Default: 0
Switching too fast can result in lower throughput
© 2019 SPLUNK INC.
autoLBVolume vs timeBasedAutoLB
With variable data rates timeBasedAutoLB will keep the switching rate constant and vary the volume of data
For variable data rates autoLBVolume will keep data volumes constant and vary the switching rate
For constant data rates they are the same.
Use both options together for best effect
© 2019 SPLUNK INC.
autoLBVolume + time based is much better
autoLBVolume is better for variable and bursty data flows
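A minimal outputs.conf sketch (group name, servers and values are illustrative) combining both settings, so a busy forwarder switches on volume while a quiet one still switches on time:
[tcpout:my_indexers]
server = idx1.example.com:9997, idx2.example.com:9997, idx3.example.com:9997
# switch after ~10 MB, or after 30 seconds if that volume is never reached
autoLBVolume = 10485760
autoLBFrequency = 30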
© 2019 SPLUNK INC.
How to measure
event distribution
© 2019 SPLUNK INC.
Call the REST API to get RT ingestion
| rest /services/server/introspection/indexer
| eventstats stdev(average_KBps) avg(average_KBps)
© 2019 SPLUNK INC.
Ingestion over time
index=_internal Metrics TERM(group=thruput) TERM(name=thruput)
sourcetype=splunkd
[| dbinspect index=*
| stats values(splunk_server) as indexer
| eval host_count=mvcount(indexer),
search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"]
| eval host_pipeline=host."-".ingest_pipe
| timechart minspan=30sec limit=0 per_second(kb) by host_pipeline
Too much data being
ingested by a single
indexer
© 2019 SPLUNK INC.
index=_internal Metrics TERM(group=thruput) TERM(name=thruput)
sourcetype=splunkd
[| dbinspect index=*
| stats values(splunk_server) as indexer
| eval host_count=mvcount(indexer),
search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"]
| eval host_pipeline=host."-".ingest_pipe
| timechart minspan=30sec limit=0 per_second(kb) by host_pipeline
Indexer thruput search explained
Use TERM to speed up
search execution
Use metrics to get pipeline
statistics
Sub-search returns indexer list
Measures each pipeline individually
© 2019 SPLUNK INC.
Count the events per indexer via job inspector
For any search, open the job inspector and
see how many events each indexer
returned.
This should be approximately the same for
each indexer.
In a multi-site or multi-cluster environment we expect that groups of indexers are likely to have different numbers of events.
Within a single site the number of events should be about the same for each indexer.
© 2019 SPLUNK INC.
Count the events per indexer and index via search
The event distribution
score per index
© 2019 SPLUNK INC.
Calculate event distribution per index for -5mins
| tstats count where
index=*
splunk_server=*
host=* earliest=-5min latest=now
by splunk_server index
| stats
sum(count) as total_events
stdev(count) as stdev
avg(count) as average
by index
| eval normalized_stdev=stdev/average
Event distribution can be measured per index and per host
Narrow indexes
Narrow to site or cluster
Calculate score
Calculate per index,
or host
© 2019 SPLUNK INC.
http://bit.ly/2WxRXvI
Shameless self-promotion
Use my event distribution measurement dashboard to assess your stack
© 2019 SPLUNK INC.
How to configure the event distribution dashboard
1. Specify indexes to measure
2. Select the time power function (exponentially growing steps)
3. Specify the set of indexers to measure, scoped per site or cluster
4. Click the link to run
© 2019 SPLUNK INC.
How to read the first panel
[Chart: variation of events across the indexers; log scale of events received, one column per indexer]
Each series is a time range, exponentially growing from the data received by an indexer in the last second up to the last 54 minutes. The stdev improves as time increases.
© 2019 SPLUNK INC.
How to read the second panel
[Chart: events scanned in each step; the number of events received (log scale), with each time range plotted on the x-axis]
Each series is an index. In this example, after ~8 mins an indexer had received 100k events into an index.
© 2019 SPLUNK INC.
How to read the third panel
[Chart: how many indexers received data in each time range]
In this example, all 9 indexers received data into every index within one second.
© 2019 SPLUNK INC.
How to read the final panel
How event distribution is improving over time
[Chart callouts: 10% variation after 5 mins for one index, 3% variation after 5 mins for another, > 1% variation after 54 mins]
© 2019 SPLUNK INC.
Example: it takes three hours before all indexers have data, so we need to accelerate switching to improve entropy. Randomization is not happening fast enough, and incoming data rates for the indexers are very different across the different indexes.
© 2019 SPLUNK INC.
Example: randomization is slow for most indexes and plateaus for some. Some indexers are not getting data, likely due to misconfigured forwarders.
© 2019 SPLUNK INC.
Example: near-perfect event distribution using intermediate forwarders; one index is not improving fast enough.
© 2019 SPLUNK INC.
Measuring
event delay
© 2019 SPLUNK INC.
Use TSTATS to compute event delay at scale
© 2019 SPLUNK INC.
| tstats max(_indextime) as indexed_time count where index=*
latest=+1day earliest=-1day _index_latest=-1sec _index_earliest=-2sec
by index host splunk_server _time span=1s
| eval _time=round(_time), delay=indexed_time-_time,
delay_str=tostring(delay,"duration")
| eventstats max(delay) as max_delay
max(_time) as max_time
count as eps
by host index
| where max_delay = delay
| eval max_time=_time
| sort - delay
Delay per index and host
Use tstats because _indextime is an indexed field!
© 2019 SPLUNK INC.
| tstats max(_indextime) as indexed_time count where index=*
latest=+1day earliest=-1day _index_latest=-1sec _index_earliest=-2sec
by index host splunk_server _time span=1s
| eval _time=round(_time), delay=indexed_time-_time,
delay_str=tostring(delay,"duration")
| eventstats max(delay) as max_delay
max(_time) as max_time
count as eps
by host index
| where max_delay = delay
| eval max_time=_time
| sort - delay
Delay per index and host
Get events received in the
last second, irrespective of
the delay
max(_indextime)
is latest event
Split by _time to a
resolution of a
second
Drop subseconds
Compute the delta
© 2019 SPLUNK INC.
http://bit.ly/2R9i4Yx
http://bit.ly/2I8qtIS
More shameless self-promotion
I have dashboards that use this method to measure delay
© 2019 SPLUNK INC.
How to use the event delay dashboard
Select the indexes to measure. Keep the time range short, but extend it to capture periodic data; the time range translates into _indextime, not _time. The dashboard drills down to per-host delay.
© 2019 SPLUNK INC.
How to read the main panels
By default the results are all for the last second. The panels show the average delay per index, the maximum delay per index, the number of events received per index, the number of hosts per index, and how events are distributed across the cluster.
© 2019 SPLUNK INC.
Select an index by clicking on a series
Click on any chart to select an index, select the period for the drill-down, then click on a host to drill down. Results are sorted by descending delay.
© 2019 SPLUNK INC.
Drill down to host to understand reasons for delay
The drill-down panels show the number of events generated at a given time, per index and by delay, and the number of events received at a given time, per index and by delay.
© 2019 SPLUNK INC.
The causes of
event delay
© 2019 SPLUNK INC.
Congestion
network latency, IO contention, pipeline congestion, CPU saturation, rate limiting
Clock skew
Time zones wrong, clock drift, parsing issue
Timeliness
Historical load, polling APIs, scripted inputs, component restarts
© 2019 SPLUNK INC.
Congestion: Rate limiting
Rate limiting slows transmission and true event delay occurs
The default forwarder limit is maxKBps=256 (limits.conf)
[Diagram: App → log file → UF → IUF → indexer → search head]
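A minimal limits.conf sketch for the forwarder (the value shown is illustrative; pick one appropriate for your network rather than removing the limit everywhere):
[thruput]
# 0 removes the limit; the UF ships with 256 KB/s
maxKBps = 0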
© 2019 SPLUNK INC.
Congestion: Network
Network saturation acts like rate limiting and causes true event delay
[Diagram: network congestion between the UF and the IUF in the App → log file → UF → IUF → indexer → search head chain]
© 2019 SPLUNK INC.
Congestion: Indexer
Excessive ingestion rates, filesystem IO problems, inefficient regexes and inefficient line breaking all cause true event delay
[Diagram: pipeline congestion on the indexer in the App → log file → UF → IUF → indexer → search head chain]
© 2019 SPLUNK INC.
Timeliness: Scripted inputs
Increase polling frequency to reduce event delay
Polling every x mins means delays of up to x mins for each poll
[Diagram: App → periodic API read → UF → IUF → indexer → search head]
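For example, a scripted input's polling interval is set in inputs.conf (the script path is a placeholder):
[script://./bin/poll_api.sh]
# poll every 60 seconds rather than every few minutes to reduce delay
interval = 60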
© 2019 SPLUNK INC.
Timeliness: Offline components
Restarting forwarders causes brief event delay
The forwarder must be running or events queue
[Diagram: App → log file → UF → IUF → indexer → search head, with the forwarder restarting]
© 2019 SPLUNK INC.
Timeliness: Historical data
Loading historical data creates fake event delay
Backfilling data from the past
[Diagram: App → log file → UF → IUF → indexer → search head]
© 2019 SPLUNK INC.
Clock Skew: Time zones
When time zones aren’t configured correctly event delay measurement is
shifted into the past or future
[Diagram: the forwarder side of the chain sits in time zone A while the indexer side sits in time zone B]
© 2019 SPLUNK INC.
Clock Skew: Drift
Use NTP to align all clocks across your estate to maximize the usefulness of
Splunk
[Diagram: two hosts in the chain whose clocks read 8:00 pm and 7:59 pm]
When events appear to arrive slightly from the future or past, this makes time-based correlation hard
© 2019 SPLUNK INC.
Clock Skew: Date time parsing problems
Automatic source typing assumes American date format when ambiguous
Always explicitly set the date/time format per sourcetype
[Diagram: App → log file → UF → IUF → indexer → search head]
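A minimal props.conf sketch (the sourcetype name and format string are illustrative) that removes the ambiguity:
[my:sourcetype]
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S.%3N %z
MAX_TIMESTAMP_LOOKAHEAD = 30
TZ = Europe/London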
© 2019 SPLUNK INC.
Reasons for poor
event distribution
© 2019 SPLUNK INC.
Sticky forwarders
Super giant forwarders
Badly configured intermediate forwarders
Indexer abandonment
Indexer starvation
Network connectivity problems
Single target
Forwarder bugs
TCP back off
maxKBps
Channel saturation
HEC ingestion
© 2019 SPLUNK INC.
Sticky
Forwarders
© 2019 SPLUNK INC.
The UF uses “natural breaks” to chunk up logs
[Diagram: a logging application writes chunks of 10 events/50 KB, 20 events/80 KB, 10k events/10 MB and 10 events/50 KB over time, with a natural break (EOF) after each chunk; each chunk is sent to a different target indexer]
Logging semantics: 1. Open file 2. Append to file 3. Close file. We know we have a natural break each time the file is closed.
When an application bursts, the forwarder is forced to "stick" to the target until the application generates a natural break
© 2019 SPLUNK INC.
The problem is exacerbated with IUFs
[Diagram: UF 1, UF 2, UF 3 and HWF 1 send to IUF 1 and IUF 2, which forward to Indexer 1 and Indexer 2; one incoming stream is stuck]
An intermediate universal forwarder (IUF) works with unparsed streams and can only switch away when the incoming stream contains a break. Connections can last for hours and this causes bad event distribution.
IUFs cannot afford to be sticky and must switch on time
© 2019 SPLUNK INC.
The problem doesn’t exist with an intermediate HWF
[Diagram: UF 1, UF 2 and UF 3 send to intermediate HWF 1 and HWF 2, which forward to Indexer 1 and Indexer 2]
The HWF parses data and forwards events, not streams. HWFs can receive connections from sticky forwarders, causing throughput problems.
Heavy forwarders are generally considered a bad practice
© 2019 SPLUNK INC.
forceTimebasedAutoLB (outputs.conf)
forceTimebasedAutoLB = <boolean>
* Forces existing data streams to switch to a newly elected indexer every
  auto load balancing cycle.
* On universal forwarders, use the 'EVENT_BREAKER_ENABLE' and
  'EVENT_BREAKER' settings in props.conf rather than 'forceTimebasedAutoLB'
  for improved load balancing, line breaking, and distribution of events.
* Default: false
Forcing a UF to switch can create broken events and generate parsing errors
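On an intermediate forwarder the setting lives in outputs.conf; a minimal sketch (group name and servers are placeholders):
[tcpout:my_indexers]
server = idx1.example.com:9997, idx2.example.com:9997
forceTimebasedAutoLB = true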
© 2019 SPLUNK INC.
How forceTimeBasedAutoLB works
Splunk to Splunk protocol (s2s) uses datagrams of 64KB
[Diagram: a stream of 64 KB s2s datagrams from the log sources. At the 30 sec switch point there is no natural break, so the boundary datagram is sent to both the current indexer and the next indexer, assuming it contains multiple events and an event boundary will be found. The current indexer reads the 1st event and stops; the next indexer ignores the 1st event and reads the remaining events.]
Provided that an event doesn’t exceed an s2s datagram, the algorithm works perfectly
© 2019 SPLUNK INC.
Applying forceTimeBasedAutoLB to IUF
[Diagram: UF 1, UF 2, UF 3 and HWF 1 send to IUF 1 and IUF 2, which forward to Indexer 1 and Indexer 2]
The intermediate universal forwarder will force switching without a natural break. The indexers no longer get sticky sessions!
The intermediate forwarders still receive sticky sessions
© 2019 SPLUNK INC.
EVENT_BREAKER (props.conf)
# Use the following settings to handle better load balancing from UF.
# Please note the EVENT_BREAKER properties are applicable for Splunk Universal
# Forwarder instances only.
EVENT_BREAKER_ENABLE = [true|false]
* When set to true, Splunk software will split incoming data with a
light-weight chunked line breaking processor so that data is distributed
fairly evenly amongst multiple indexers. Use this setting on the UF to
indicate that data should be split on event boundaries across indexers
especially for large files.
* Defaults to false
# Use the following to define event boundaries for multi-line events
# For single-line events, the default settings should suffice
EVENT_BREAKER = <regular expression>
* When set, Splunk software will use the setting to define an event boundary at
the end of the first matching group instance.
EVENT_BREAKER is configured with a regex per sourcetype
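A minimal props.conf sketch on the UF (the sourcetype name is illustrative; the regex shown is the common single-line boundary):
[my:sourcetype]
EVENT_BREAKER_ENABLE = true
EVENT_BREAKER = ([\r\n]+)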
© 2019 SPLUNK INC.
EVENT_BREAKER is complicated to maintain on IUF
[Diagram: six UFs with between 5 and 20 source types each feed IUF1, which aggregates 65 source types]
Configure each sourcetype with EVENT_BREAKER so the forwarder doesn’t get stuck to the IUF. Trying to maintain EVENT_BREAKER on an IUF can be impractical due to the aggregation of streams and source types. If the endpoints all implement EVENT_BREAKER, it won’t be triggered on the IUF.
forceTimeBasedAutoLB is a universal algorithm, EVENT_BREAKER is not
© 2019 SPLUNK INC.
The final solution to sticky forwarders
[Diagram: UF 1, UF 2, UF 3 and HWF 1 send to IUF 1 and IUF 2, which forward to Indexer 1 and Indexer 2. On the endpoints: configure EVENT_BREAKER per sourcetype and use autoLBVolume. On the intermediate forwarders: increase autoLBFrequency, enable forceTimeBasedAutoLB and use autoLBVolume.]
Removal of sticky forwarders will lower event delay, improve event distribution, and improve search execution times.
Configured correctly, intermediate forwarders are great for improving event distribution.
© 2019 SPLUNK INC.
Find all sticky forwarders by host name
index=_internal sourcetype=splunkd
TERM(eventType=connect_done) OR
TERM(eventType=connect_close)
| transaction
startswith=eventType=connect_done
endswith=eventType=connect_close
sourceHost sourcePort host
| stats stdev(duration) median(duration) avg(duration) max(duration)
by sourceHost
| sort - max(duration)
Intermediate LB will invalidate results
© 2019 SPLUNK INC.
Find all sticky forwarders by hostname
index=_internal sourcetype=splunkd
(TERM(eventType=connect_done) OR
TERM(eventType=connect_close) OR
TERM(group=tcpin_connections))
| transaction
startswith=eventType=connect_done
endswith=eventType=connect_close
sourceHost sourcePort host
| stats stdev(duration) median(duration) avg(duration) max(duration)
by hostname
| sort - max(duration)
Error prone as it requires that hostname = host
© 2019 SPLUNK INC.
Super Giant Forwarders
a.k.a.
“laser beams of death”
© 2019 SPLUNK INC.
Not all Forwarders are born equal
[Diagram: one HWF aggregates AWS, Unix hosts, network syslog + UF and databases, while individual Windows hosts each send around 5 KB/s]
Super giant forwarders make the others look like rounding errors
© 2019 SPLUNK INC.
Understanding forwarder weight distribution
[Diagram: five forwarders (Fwd1-Fwd5) sending 10 MB/s, 3 MB/s, 1 MB/s, 500 KB/s and 500 KB/s; total ingestion is 15 MB/s]
Reading from left to right
• 0% forwarders = 0% data
• 20% forwarders = 66% data
• 40% forwarders = 73% data
• 60% forwarders = 90% data
• 80% forwarders = 93% data
• 100% forwarders = 100% data
We can plot these pairs as a chart to normalize and compare stacks.
© 2019 SPLUNK INC.
Plotting normalized weight distribution
© 2019 SPLUNK INC.
Examples of forwarder weight distribution
Data imbalance issues can be found on very large stacks and must be addressed
[Chart: normalized weight distribution curves plotted from 10% to 100% of forwarders]
© 2019 SPLUNK INC.
Forwarder weight distribution search
index=_internal Metrics sourcetype=splunkd TERM(group=tcpin_connections) earliest=-4hr latest=now
[| dbinspect index=_*
| stats values(splunk_server) as indexer
| eval search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"]
| stats sum(kb) as throughput
by hostname
| sort - throughput
| eventstats sum(throughput) as total_throughput
dc(hostname) as all_forwarders
| streamstats sum(throughput) as accumulated_throughput count
by all_forwarders
| eval coverage=accumulated_throughput/total_throughput,
progress_through_forwarders=count/all_forwarders
| bin progress_through_forwarders bins=100
| stats max(coverage) as coverage
by progress_through_forwarders all_forwarders
| fields all_forwarders stack coverage progress_through_forwarders
© 2019 SPLUNK INC.
Find super giant forwarders and reconfigure
Super giant forwarders need careful configuration
1. Configure EVENT_BREAKER and / or forceTimeBasedAutoLB
2. Configure multiple pipelines (validate that they are being used)
3. Configure autoLBVolume and / or increase switching speed (keeping an eye on throughput)
4. Use INGEST_EVAL and random() to shard output data flows (see the sketch below)
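A hypothetical sketch of item 4 (stanza and group names are illustrative, and writing _TCP_ROUTING from INGEST_EVAL should be validated on your Splunk version before relying on it):
# transforms.conf on the instance that parses the data
[shard_by_random]
INGEST_EVAL = _TCP_ROUTING=if(random() % 2 == 0, "indexers_a", "indexers_b")
# props.conf
[my:big:sourcetype]
TRANSFORMS-shard = shard_by_random
# outputs.conf defines the two target groups referenced above
[tcpout:indexers_a]
server = idx1.example.com:9997, idx2.example.com:9997
[tcpout:indexers_b]
server = idx3.example.com:9997, idx4.example.com:9997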
© 2019 SPLUNK INC.
Indexer
abandonment
© 2019 SPLUNK INC.
Firewalls can block connections
[Diagram: a forwarder configured with outputs A+B+C, but the network says "no" and blocks the connection to Indexer C; the indexers replicate among themselves]
Forwarders generate errors when they cannot connect
A common cause of starvation is forwarders only being able to connect to a subset of indexers due to network problems.
Normally a firewall or LB is blocking the connections.
This is very common when the indexer cluster is increased in size.
© 2019 SPLUNK INC.
Find forwarders suffering network problems
index=_internal earliest=-24hrs latest=now sourcetype=splunkd
TERM(statusee=TcpOutputProcessor) TERM(eventType=*)
| stats count
count(eval(eventType="connect_try")) as try
count(eval(eventType="connect_fail")) as fail
count(eval(eventType="connect_done")) as done
by destHost destIp
| eval bad_output=if(try==fail,"yes","no")
Search forwarder logs to find those that fail to connect to a target
© 2019 SPLUNK INC.
Incomplete output lists create no errors
[Diagram: a forwarder configured with outputs B+C only, so Indexer A never receives its data; the indexers replicate among themselves]
We must search for the absence of connections to find this problem.
Forwarders can only send data to targets that are in their list.
This is very common when the indexer cluster is increased in size.
Encourage customers to use indexer discovery on forwarders so this never happens.
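A minimal outputs.conf sketch for indexer discovery (the cluster master URI, key and group name are placeholders):
[indexer_discovery:cm1]
master_uri = https://cluster-master.example.com:8089
pass4SymmKey = <your key>
[tcpout:discovered_peers]
indexerDiscovery = cm1
[tcpout]
defaultGroup = discovered_peers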
© 2019 SPLUNK INC.
Do all indexers have the same forwarders?
index=_internal earliest=-24hrs latest=now sourcetype=splunkd TERM(eventType=connect_done)
TERM(group=tcpin_connections)
[| dbinspect index=*
| stats values(splunk_server) as indexer
| eval host_count=mvcount(indexer),
search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"]
| stats count
by host sourceHost sourceIp
| stats
dc(host) as indexer_target_count
values(host) as indexers_connected_to
by sourceHost sourceIp
| eventstats max(indexer_target_count) as total_pool
| eval missing_indexer_count=total_pool-indexer_target_count
| where missing_indexer_count != 0
Search receiver logs to find forwarders that do not connect to every indexer
This search assumes that all
indexers have the same
forwarders. With multiple
clusters and sites, this might
not be true
© 2019 SPLUNK INC.
Site aware forwarder connections
index=_internal earliest=-24hrs latest=now sourcetype=splunkd TERM(eventType=connect_done) TERM(group=tcpin_connections)
[| dbinspect index=*
| stats values(splunk_server) as indexer
| eval host_count=mvcount(indexer),
search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"]
| eval remote_forwarder=sourceHost." & ".if(sourceIp!=sourceHost,"(".sourceIp.")","")
| stats count as total_connections
by host remote_forwarder
| join host
[ search index=_internal earliest=-1hrs latest=now sourcetype=splunkd CMMaster status=success site*
| rex field=message max_match=64 "(?<site_pair>site\d+,\"[^\"]+)"
| eval cluster_master=host
| fields + site_pair cluster_master
| fields - _*
| dedup site_pair
| mvexpand site_pair
| dedup site_pair
| rex field=site_pair "^(?<site_id>site\d+),\"(?<host>.*)"
| eventstats count as site_size by site_id cluster_master
| table site_id cluster_master host site_size]
| eval unique_site=site_id." & ".cluster_master
| chart
values(site_size) as site_size
values(host) as indexers_connected_to
by remote_forwarder unique_site
| foreach site*
[| eval "length_<<FIELD>>"=mvcount('<<FIELD>>')]
Which forwarders connect to which site, and what is the coverage for that site?
Sub-search returns
indexer list
Sub-search
computes
indexer to site
mapping
© 2019 SPLUNK INC.
Indexer
Starvation
© 2019 SPLUNK INC.
What is indexer starvation?
[Diagram: a forwarder sending data to only some of the indexers; the indexers replicate among themselves]
Indexers must receive data to participate in search
Indexer starvation is when an indexer or an indexer pipeline has periods when it is starved of data.
It will continue to receive replicated data and be assigned primaries by the CM to search replicated cold buckets.
It will not search hot buckets without site affinity.
© 2019 SPLUNK INC.
Smoke test: count connections to indexers
index=_internal earliest=-1hrs latest=now sourcetype=splunkd
TERM(eventType=connect_done) TERM(group=tcpin_connections)
[| dbinspect index=*
| stats values(splunk_server) as indexer
| eval host_count=mvcount(indexer),
search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"]
| timechart limit=0 count minspan=31sec by host
This quick smoke test shows if the indexers in the cluster have obviously varying numbers of incoming connections.
When you see banding it is either a smoking gun, or different forwarder groups per site.
Indexer starvation is guaranteed if there are not enough incoming connections.
Fix by increasing connections: increase the switching frequency and the number of pipelines on forwarders
[Chart: an example that is nice and healthy]
© 2019 SPLUNK INC.
Funnel effect reduces connections
[Diagram: many forwarders funnel into a single intermediate forwarder, which feeds the indexers]
When an intermediate forwarder aggregates multiple streams, it creates a super giant forwarder.
You need to add more pipelines, as each pipeline creates a new TCP output to the indexers.
Note that time-division multiplexing doesn’t mean 1 CPU = 1 pipeline, unless it is running at its maximum rate of about 20 MB/s.
Apply configuration with care.
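As a sketch, additional ingestion pipelines are enabled in server.conf on the intermediate forwarder (the value is illustrative; watch CPU utilization as you raise it):
[general]
parallelIngestionPipelines = 4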
© 2019 SPLUNK INC.
How to instrument and tune IUFs
[Diagram: many forwarders funnel into the intermediate forwarder tier, which feeds the indexers]
• Monitor outgoing connections to understand cluster coverage
• Monitor CPU, load average, pipeline utilization and pipeline blocking
• Find and reconfigure sticky forwarders to use EVENT_BREAKER
• Monitor active channels to avoid s2s protocol churn
• Monitor incoming connections to ensure even distribution of workload over pipelines
• Use autoLBVolume
• Enable forceTimeBasedAutoLB
• Lengthen autoLBFrequency
• Add up to 16 pipelines but don’t breach ~30% CPU
• Monitor event delay to make sure it doesn’t increase
A well-tuned intermediate forwarder can achieve 5% event distribution across 100 indexers within 5 mins
© 2019 SPLUNK INC.
Forwarder bugs
When a forwarder forgets about an indexer
© 2019 SPLUNK INC.
Some forwarders were just born bad
Affected versions: 6.6.0 - 6.6.7 and 7.0.0 - 7.0.2
They just give up on targets after a timeout.
Rolling restarts cause them to abandon targets
You can work around the problem with static IP addresses
Remove where possible
© 2019 SPLUNK INC.
Prerequisite: Sub-searches
How to write searches that write searches
© 2019 SPLUNK INC.
Consider the following search
It returns a single field “search” that is a simple search string
© 2019 SPLUNK INC.
Sub-searches that return a field named "search" have that value substituted into the outer search
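The search from the slide's screenshot is not reproduced here; as an illustration in the same spirit as the sub-searches used earlier in this deck, a search returning a single "search" field might look like:
| dbinspect index=*
| stats values(splunk_server) as indexer
| eval search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"
| fields search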
Gen AI in Business - Global Trends Report 2024.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 

  • 11. © 2019 SPLUNK INC. Bad Event Distribution affects search Search time becomes unbalanced, searches take longer to complete, reducing throughput. (Diagram: for a search over a fixed time range, one indexer takes 90 sec while the other three take 10 sec each, so the search head waits 91 sec.)
  • 12. © 2019 SPLUNK INC. What is event delay?
  • 13. © 2019 SPLUNK INC. What is good event delay? It should take between 3 and 5 seconds from event generation to that event being searchable. (Diagram: App writes a log file, the UF reads it and forwards over s2s via an IUF to the Indexer, which the Search head then searches.)
  • 14. © 2019 SPLUNK INC. How do you calculate event delay? | eval event_delay=_indextime - _time The timestamp of the event is _time, the time of indexing is _indextime, and the time of search is now().
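  A minimal SPL sketch of that calculation, assuming a quick per-host view over recent internal data is enough (the index and time range are illustrative):

    index=_internal earliest=-15m latest=now
    | eval event_delay=_indextime - _time
    | stats avg(event_delay) as avg_delay max(event_delay) as max_delay by host sourcetype
    | sort - max_delay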
  • 15. © 2019 SPLUNK INC. How time affects event distribution
  • 16. © 2019 SPLUNK INC. autoLBFrequency (outputs.conf) autoLBFrequency = <integer> * The amount of time, in seconds, that a forwarder sends data to an indexer before redirecting outputs to another indexer in the pool. * Use this setting when you are using automatic load balancing of outputs from universal forwarders (UFs). * Every 'autoLBFrequency' seconds, a new indexer is selected randomly from the list of indexers provided in the server setting of the target group stanza. * Default: 30 30 seconds of 1 MB/s is 30 MB for each connection!
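  As a hedged sketch, a forwarder's outputs.conf sets this per target group (the group name and server list are placeholders):

    # outputs.conf on the forwarder
    [tcpout:primary_indexers]
    server = idx1.example.com:9997, idx2.example.com:9997
    autoLBFrequency = 30   # the default; lower it to spread data faster, raise it to protect throughput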
  • 17. © 2019 SPLUNK INC. Event distribution improves over time Switching at 30 seconds 5 mins = 10 connections 1 hour = 120 connections 1 day = 2880 connections How long before all indexers have data? 10 indexers = 5 mins 50 indexers = 25 mins 100 indexers = 50 mins Larger clusters take longer to get “good” event distribution
  • 18. © 2019 SPLUNK INC. Data is distributed across the indexers over time Each indexer is allocated 30s of data via Randomized Round Robin
  • 19. © 2019 SPLUNK INC. earliest=-90s latest=now Indexer 4 has no data to process
  • 20. © 2019 SPLUNK INC. earliest=-180s latest=now Indexer 4 has 2x the data to process
  • 21. © 2019 SPLUNK INC. It’s not Round Robin (RR), it’s Randomized RR Round robin: Round 1 order: 1,2,3,4 Round 2 order: 1,2,3,4 Round 3 order: 1,2,3,4 Round 4 order: 1,2,3,4 Randomized Round Robin: Round 1 order: 1,4,3,2 Round 2 order: 4,1,2,3 Round 3 order: 2,4,3,1 Round 4 order: 1,3,2,4 Splunk’s randomized round robin algorithm speeds up event distribution
  • 22. © 2019 SPLUNK INC. Event distribution improves over time Initially we start by connecting to just one indexer and the standard deviation across indexers is high; as time goes to infinity the standard deviation should trend to zero. Most searches execute over the last 15 mins. (Chart: standard deviation across indexers against time, improving as time passes.)
  • 23. © 2019 SPLUNK INC. Types of data flows
  • 24. © 2019 SPLUNK INC. Types of data flow coming into Splunk Periodic data, typically a scripted input, very spikey. Think AWS cloudwatch, think event delay. Constant data rates, nice and smooth, think metrics. Variable data rates, typically driven by usage, think web logs One shot, much like periodic data It can be difficult to optimize for every type of data flow
  • 25. © 2019 SPLUNK INC. Time based LB does not work well on its own timebasedAutoLB is the default and is set to 30s
  • 26. © 2019 SPLUNK INC. autoLBVolume (outputs.conf) autoLBVolume = <integer> * The volume of data, in bytes, to send to an indexer before a new indexer is randomly selected from the list of indexers provided in the server setting of the target group stanza. * This setting is closely related to the 'autoLBFrequency' setting. The forwarder first uses 'autoLBVolume' to determine if it needs to switch to another indexer. If the 'autoLBVolume' is not reached, but the 'autoLBFrequency' is, the forwarder switches to another indexer as the forwarding target. * A non-zero value means that volume-based forwarding is active. * 0 means the volume-based forwarding is not active. * Default: 0 Switching too fast can result in lower throughput
  • 27. © 2019 SPLUNK INC. autoLBVolume vs timeBasedAutoLB With variable data rates timeBasedAutoLB will keep the switching rate constant and vary the volume of data. For variable data rates autoLBVolume will keep data volumes constant and vary the switching rate. For constant data rates they are the same. Use both options together for best effect
  • 28. © 2019 SPLUNK INC. autoLBVolume + time based is much better autoLBVolume is better for variable and bursty data flows
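  A sketch of combining the two settings for a bursty source, assuming roughly 1 MB per connection is an acceptable switching unit (all names and values are illustrative and should be tuned against your own throughput):

    # outputs.conf on the forwarder
    [tcpout:primary_indexers]
    server = idx1.example.com:9997, idx2.example.com:9997
    autoLBFrequency = 30       # time-based switching still applies during quiet periods
    autoLBVolume = 1048576     # also switch after ~1 MB so bursts are spread across indexers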
  • 29. © 2019 SPLUNK INC. How to measure event distribution
  • 30. © 2019 SPLUNK INC. Call the REST API to get RT ingestion | rest /services/server/introspection/indexer | eventstats stdev(average_KBps) avg(average_KBps)
  • 31. © 2019 SPLUNK INC. Ingestion over time index=_internal Metrics TERM(group=thruput) TERM(name=thruput) sourcetype=splunkd [| dbinspect index=* | stats values(splunk_server) as indexer | eval host_count=mvcount(indexer), search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"] | eval host_pipeline=host."-".ingest_pipe | timechart minspan=30sec limit=0 per_second(kb) by host_pipeline Too much data being ingested by a single indexer
  • 32. © 2019 SPLUNK INC. index=_internal Metrics TERM(group=thruput) TERM(name=thruput) sourcetype=splunkd [| dbinspect index=* | stats values(splunk_server) as indexer | eval host_count=mvcount(indexer), search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"] | eval host_pipeline=host."-".ingest_pipe | timechart minspan=30sec limit=0 per_second(kb) by host_pipeline Indexer thruput search explained Use TERM to speed up search execution Use metrics to get pipeline statistics Sub-search returns indexer list Measures each pipeline individually
  • 33. © 2019 SPLUNK INC. Count the events per indexer via job inspector For any search, open the job inspector and see how many events each indexer returned. This should be approximately the same for each indexer. In a multi-site or multi-cluster environment we expect that groups of indexers are likely to have different numbers of events. Within a single site the number of events should be about the same for each indexer.
  • 34. © 2019 SPLUNK INC. Count the events per indexer and index via search The event distribution score per index
  • 35. © 2019 SPLUNK INC. Calculate event distribution per index for -5mins | tstats count where index=* splunk_server=* host=* earliest=-5min latest=now by splunk_server index | stats sum(count) as total_events stdev(count) as stdev avg(count) as average by index | eval normalized_stdev=stdev/average Event distribution can be measured per index and per host Narrow indexes Narrow to site or cluster Calculate score Calculate per index, or host
  • 36. © 2019 SPLUNK INC. http://bit.ly/2WxRXvI Shameless self promotion Use my event distribution measurement dashboard to assess your stack
  • 37. © 2019 SPLUNK INC. How to configure the event distribution dashboard 3. Specify set of indexers to measure, scope per site or cluster 1. Specify indexes to measure 4. Click link to run Exponentially growing steps 2WxRXvI 2. Select time power function
  • 38. © 2019 SPLUNK INC. How to read the first panel Variation of events across the indexers: each series is a time range, growing exponentially, each column is an indexer, and the y-axis is the log scale of events received. The first series is data received by an indexer in the last 1 sec, the last is data received in the last 54 mins; the stdev improves as the time range increases. 2WxRXvI
  • 39. © 2019 SPLUNK INC. How to read the second panel Events scanned in each step: each series is an index, each time range is plotted on the x-axis, and the y-axis is the number of events received (log scale). For example, after ~8 mins an indexer has received 100k events into an index. 2WxRXvI
  • 40. © 2019 SPLUNK INC. How to read the third panel How many indexers received data in time range All 9 indexers received data into every index within one second 2WxRXvI
  • 41. © 2019 SPLUNK INC. How to read the final panel How event distribution is improving over time 3% variation after 5 mins 10% variation after 5 mins 2WxRXvI > 1% variation after 54 mins
  • 42. © 2019 SPLUNK INC. Three hours before all indexers have data; switching needs to accelerate to improve entropy. Randomization is not happening fast enough. Incoming data rates across indexers are very different for the different indexes. 2WxRXvI
  • 43. © 2019 SPLUNK INC. Randomization is slow for most indexes and plateaus for some Some indexers are not getting data, likely misconfigured forwarders 2WxRXvI
  • 44. © 2019 SPLUNK INC. Near perfect event distribution using intermediate forwarders One index is not improving fast enough 2WxRXvI
  • 45. © 2019 SPLUNK INC. Measuring event delay
  • 46. © 2019 SPLUNK INC. Use TSTATS to compute event delay at scale
  • 47. © 2019 SPLUNK INC. | tstats max(_indextime) as indexed_time count where index=* latest=+1day earliest=-1day _index_latest=-1sec _index_earliest=-2sec by index host splunk_server _time span=1s | eval _time=round(_time), delay=indexed_time-_time, delay_str=tostring(delay,"duration") | eventstats max(delay) as max_delay max(_time) as max_time count as eps by host index | where max_delay = delay | eval max_time=_time | sort - delay Delay per index and host Use tstats because _indextime is an indexed field!
  • 48. © 2019 SPLUNK INC. | tstats max(_indextime) as indexed_time count where index=* latest=+1day earliest=-1day _index_latest=-1sec _index_earliest=-2sec by index host splunk_server _time span=1s | eval _time=round(_time), delay=indexed_time-_time, delay_str=tostring(delay,"duration") | eventstats max(delay) as max_delay max(_time) as max_time count as eps by host index | where max_delay = delay | eval max_time=_time | sort - delay Delay per index and host Get events received in the last second, irrespective of the delay max(_indextime) is latest event Split by _time to a resolution of a second Drop subseconds Compute the delta
  • 49. © 2019 SPLUNK INC. http://bit.ly/2R9i4Yx http://bit.ly/2I8qtIS More shameless self promotion I have dashboards that use this method to measure delay
  • 50. © 2019 SPLUNK INC. How to use the event delay dashboard Drills down to per host delay Select indexes to measure Keep the time range short, but extend it to capture periodic data The time range translates into _indextime, not _time 2R9i4Yx 2I8qtIS
  • 51. © 2019 SPLUNK INC. How to read the main panels By default the results are all for the last second The average delay per index The number of events received per index The maximum delay per index The number of hosts per index How events are distributed across the cluster 2R9i4Yx 2I8qtIS
  • 52. © 2019 SPLUNK INC. Select an index by clicking on a series Click on any chart to select an index Click on a host to drill down Select period for drill down Sorted by descending delay 2R9i4Yx 2I8qtIS
  • 53. © 2019 SPLUNK INC. Drill down to host to understand reasons for delay The number of events generated at time, per index The number of events generated at time, by delay The number of events received at time, per index The number of events received at time, by delay 2R9i4Yx 2I8qtIS
  • 54. © 2019 SPLUNK INC. The causes of event delay
  • 55. © 2019 SPLUNK INC. Congestion: network latency, IO contention, pipeline congestion, CPU saturation, rate limiting. Clock skew: time zones wrong, clock drift, parsing issues. Timeliness: historical load, polling APIs, scripted inputs, component restarts.
  • 56. © 2019 SPLUNK INC. Congestion: Rate limiting Rate limiting slows transmission and true event delay occurs. The default is maxKBps=256.
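  The forwarder-side cap lives in limits.conf; a hedged sketch of raising it when the 256 KB/s default is the cause of delay (size the value to what your network and indexers can absorb, and note that 0 removes the cap entirely):

    # limits.conf on the universal forwarder
    [thruput]
    maxKBps = 1024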
  • 57. © 2019 SPLUNK INC. Congestion: Network Network saturation acts like rate limiting and causes true event delay.
  • 58. © 2019 SPLUNK INC. Congestion: Indexer Excessive ingestion rates, FS IO problems, inefficient regexes and inefficient line breaking all cause pipeline congestion and true event delay.
  • 59. © 2019 SPLUNK INC. Timeliness: Scripted inputs Polling a periodic API every x mins means delays of up to x mins for each event; increase polling frequency to reduce event delay.
  • 60. © 2019 SPLUNK INC. Timeliness: Offline components Restarting forwarders causes brief event delay; the forwarder must be running or events queue.
  • 61. © 2019 SPLUNK INC. Timeliness: Historical data Backfilling data from the past creates fake event delay.
  • 62. © 2019 SPLUNK INC. Clock Skew: Time zones When time zones aren’t configured correctly (the forwarder in time zone A, the indexer in time zone B), event delay measurement is shifted into the past or future.
  • 63. © 2019 SPLUNK INC. Clock Skew: Drift When clocks drift (one host at 8:00 pm, another at 7:59 pm) events appear to arrive slightly from the future or past, which makes time-based correlation hard. Use NTP to align all clocks across your estate to maximize the usefulness of Splunk.
  • 64. © 2019 SPLUNK INC. Clock Skew: Date time parsing problems Automatic source typing assumes the American date format when ambiguous. Always explicitly set the date time format for each sourcetype.
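  A hedged props.conf sketch of pinning the timestamp per sourcetype (the sourcetype name and format string are illustrative and must match your own logs):

    # props.conf
    [my_app:access]
    TIME_PREFIX = ^
    TIME_FORMAT = %d/%m/%Y %H:%M:%S.%3N
    MAX_TIMESTAMP_LOOKAHEAD = 24
    TZ = Europe/London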
  • 65. © 2019 SPLUNK INC. Reasons for poor event distribution
  • 66. © 2019 SPLUNK INC. Sticky forwarders Super giant forwarders Badly configured intermediate forwarders Indexer abandonment Indexer starvation Network connectivity problems Single target Forwarder bugs TCP back off maxKBps Channel saturation HEC ingestion
  • 67. © 2019 SPLUNK INC. Sticky Forwarders
  • 68. © 2019 SPLUNK INC. The UF uses “natural breaks” to chunk up logs My logging application Logging Semantics: 1. Open file 2. Append to file 3. Close file We know we have a natural break each time the file is closed. (Diagram: chunks of 10 events/50kb, 20 events/80kb, 10k events/10MB and 10 events/50kb are sent to targets 1-4 over time, switching at each natural break until EOF.) When an application bursts, the forwarder is forced to “stick” to the target until the application generates a natural break
  • 69. © 2019 SPLUNK INC. The problem is exacerbated with IUFs An intermediate universal forwarder (IUF) works with unparsed streams and can only switch away when the incoming stream contains a break. Connections can last for hours and this causes bad event distribution. IUFs cannot afford to be sticky and must switch on time. (Diagram: UF 1, UF 2 and UF 3 feed IUF 1, IUF 2 and HWF 1, which feed Indexer 1 and Indexer 2; one connection is shown stuck.)
  • 70. © 2019 SPLUNK INC. The problem doesn’t exist with an intermediate HWF The HWF parses data and forwards events, not streams. A HWF can receive connections from sticky forwarders, causing throughput problems. Heavy forwarders are generally considered a bad practice
  • 71. © 2019 SPLUNK INC. forceTimebasedAutoLB (outputs.conf) forceTimebasedAutoLB = <boolean> * Forces existing data streams to switch to a newly elected indexer every auto load balancing cycle. * On universal forwarders, use the 'EVENT_BREAKER_ENABLE' and 'EVENT_BREAKER' settings in props.conf rather than 'forceTimebasedAutoLB' for improved load balancing, line breaking, and distribution of events. * Default: false Forcing a UF to switch can create broken events and generate parsing errors
  • 72. © 2019 SPLUNK INC. How forceTimeBasedAutoLB works The Splunk to Splunk protocol (s2s) uses datagrams of 64KB. After 30 sec the forwarder needs to switch from the current indexer to the next, but there is no natural break, so it sends the boundary packet to both, assuming the packet contains multiple events and an event boundary will be found: the current indexer reads up to the first event boundary and the next indexer ignores everything before it. Provided that an event doesn’t exceed an s2s datagram, the algorithm works perfectly.
  • 73. © 2019 SPLUNK INC. Applying forceTimeBasedAutoLB to the IUF The intermediate universal forwarder will force switching without a natural break, so the indexers no longer get sticky sessions! The intermediate forwarders themselves still receive sticky sessions
  • 74. © 2019 SPLUNK INC. EVENT_BREAKER (props.conf) # Use the following settings to handle better load balancing from UF. # Please note the EVENT_BREAKER properties are applicable for Splunk Universal # Forwarder instances only. EVENT_BREAKER_ENABLE = [true|false] * When set to true, Splunk software will split incoming data with a light-weight chunked line breaking processor so that data is distributed fairly evenly amongst multiple indexers. Use this setting on the UF to indicate that data should be split on event boundaries across indexers especially for large files. * Defaults to false # Use the following to define event boundaries for multi-line events # For single-line events, the default settings should suffice EVENT_BREAKER = <regular expression> * When set, Splunk software will use the setting to define an event boundary at the end of the first matching group instance. EVENT_BREAKER is configured with a regex per sourcetype
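  A sketch of what this looks like for a simple newline-delimited sourcetype (the sourcetype name is a placeholder; multi-line events need a more specific regex):

    # props.conf on the universal forwarder
    [my_app:access]
    EVENT_BREAKER_ENABLE = true
    EVENT_BREAKER = ([\r\n]+)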
  • 75. © 2019 SPLUNK INC. EVENT_BREAKER is complicated to maintain on an IUF (Diagram: six UFs with 5, 10, 20, 15, 10 and 5 source types feed IUF1, which sees 65 source types.) Configure each sourcetype with EVENT_BREAKER so the forwarder doesn’t get stuck to the IUF. Trying to maintain EVENT_BREAKER on an IUF can be impractical due to the aggregation of streams and source types. If the endpoints all implement EVENT_BREAKER, it won’t be triggered on the IUF. forceTimeBasedAutoLB is a universal algorithm, EVENT_BREAKER is not
  • 76. © 2019 SPLUNK INC. The final solution to sticky forwarders Removal of sticky forwarders will lower event delay, improve event distribution, and improve search execution times. Configured correctly, intermediate forwarders are great for improving event distribution. On the sending forwarders: configure EVENT_BREAKER per sourcetype and use autoLBVolume. On the intermediate forwarders: increase autoLBFrequency, enable forceTimeBasedAutoLB, and use autoLBVolume (a configuration sketch follows below).
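  One hedged shape for the intermediate universal forwarder side of that advice (group name, servers and values are illustrative and should be validated against throughput):

    # outputs.conf on the intermediate universal forwarder
    [tcpout:primary_indexers]
    server = idx1.example.com:9997, idx2.example.com:9997
    autoLBFrequency = 60           # longer cycles protect throughput on aggregated streams
    forceTimebasedAutoLB = true    # break away from sticky incoming streams at each cycle
    autoLBVolume = 10485760        # also rotate after ~10 MB so heavy bursts keep moving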
  • 77. © 2019 SPLUNK INC. Find all sticky forwarders by host name index=_internal sourcetype=splunkd TERM(eventType=connect_done) OR TERM(eventType=connect_close) | transaction startswith=eventType=connect_done endswith=eventType=connect_close sourceHost sourcePort host | stats stdev(duration) median(duration) avg(duration) max(duration) by sourceHost | sort - max(duration) Intermediate LB will invalidate results
  • 78. © 2019 SPLUNK INC. Find all sticky forwarders by hostname index=_internal sourcetype=splunkd (TERM(eventType=connect_done) OR TERM(eventType=connect_close) OR TERM(group=tcpin_connections)) | transaction startswith=eventType=connect_done endswith=eventType=connect_close sourceHost sourcePort host | stats stdev(duration) median(duration) avg(duration) max(duration) by hostname | sort - max(duration) Error prone as it requires that hostname = host
  • 79. © 2019 SPLUNK INC. Super Giant Forwarders a.k.a. “laser beams of death”
  • 80. © 2019 SPLUNK INC. Not all Forwarders are born equal AWS Unix Hosts Network Syslog + UF HWF Databases Windows host 5 KB/s Windows host Windows host Super giant forwarders make others look like rounding errors
  • 81. © 2019 SPLUNK INC. Understanding forwarder weight distribution Fwd1 Fwd2 Fwd3 Fwd4 Fwd5 Reading from left to right • 0% forwarders = 0% data • 20% forwarders = 66% data • 40% forwarders = 73% data • 60% forwarders = 90% data • 80% forwarders = 93% data • 100% forwarders = 100% data We can plot these pairs as a chart to normalize and compare stacks. 10 MB/s 3 MB/s 1 MB/s 500 KB/s 500 KB/s Total ingestion is 15 MB/s
  • 82. © 2019 SPLUNK INC. Plotting normalized weight distribution
  • 83. © 2019 SPLUNK INC. Examples of forwarder weight distribution Data imbalance issues can be found on very large stacks and must be addressed 10% 100%
  • 84. © 2019 SPLUNK INC. Forwarder weight distribution search index=_internal Metrics sourcetype=splunkd TERM(group=tcpin_connections) earliest=-4hr latest=now [| dbinspect index=_* | stats values(splunk_server) as indexer | eval search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"] | stats sum(kb) as throughput by hostname | sort - throughput | eventstats sum(throughput) as total_throughput dc(hostname) as all_forwarders | streamstats sum(throughput) as accumlated_throughput count by all_forwarders | eval coverage=accumlated_throughput/total_throughput, progress_through_forwarders=count/all_forwarders | bin progress_through_forwarders bins=100 | stats max(coverage) as coverage by progress_through_forwarders all_forwarders | fields all_forwarders stack coverage progress_through_forwarders
  • 85. © 2019 SPLUNK INC. Find super giant forwarders and reconfigure Super giant forwarders need careful configuration 1. Configure EVENT_BREAKER and / or forceTimeBasedAutoLB 2. Configure multiple pipelines (validate that they are being used) 3. Configure autoLBVolume and / or increase switching speed (keeping an eye on throughput) 4. Use INGEST_EVAL and random() to shard output data flows (see the sketch after this list)
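  A possible shape for point 4, assuming your Splunk version allows INGEST_EVAL (which runs where data is parsed, so on a heavy forwarder or indexer) to rewrite the _TCP_ROUTING metadata key; check the transforms.conf spec for your release before relying on this, and treat every name below as a placeholder:

    # transforms.conf
    [shard_output_groups]
    INGEST_EVAL = _TCP_ROUTING=if(random() % 2 == 0, "indexers_a", "indexers_b")

    # props.conf
    [my_giant_sourcetype]
    TRANSFORMS-shard = shard_output_groups

    # outputs.conf then defines [tcpout:indexers_a] and [tcpout:indexers_b] as two halves of the indexer pool.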
  • 86. © 2019 SPLUNK INC. Indexer abandonment
  • 87. © 2019 SPLUNK INC. Firewalls can block connections Indexer A Indexer B Indexer C Forwarder Forwarders generate errors when they cannot connect A common cause for starvation is forwarders only able to connect to a subset of indexers due to network problems. Normally a firewall or LB is blocking the connections This is very common when the indexer cluster is increased in size. replication Network says ”no” Outputs: A+B+C
  • 88. © 2019 SPLUNK INC. Find forwarders suffering network problems index=_internal earliest=-24hrs latest=now sourcetype=splunkd TERM(statusee=TcpOutputProcessor) TERM(eventType=*) | stats count count(eval(eventType="connect_try")) as try count(eval(eventType="connect_fail")) as fail count(eval(eventType="connect_done")) as done by destHost destIp | eval bad_output=if(try=fail,"yes","no") Search forwarder logs to find those that fail to connect to a target
  • 89. © 2019 SPLUNK INC. Incomplete output lists create no errors Indexer A Indexer B Indexer C Forwarder We must search for the absence of connections to find this problem Forwarders can only send data to targets that are in their list. This is very common when the indexer cluster is increased in size. Encourage customers to use indexer discovery on forwarders so this never happens. replication Outputs: B+C
  • 90. © 2019 SPLUNK INC. Do all indexers have the same forwarders? index=_internal earliest=-24hrs latest=now sourcetype=splunkd TERM(eventType=connect_done) TERM(group=tcpin_connections) [| dbinspect index=* | stats values(splunk_server) as indexer | eval host_count=mvcount(indexer), search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"] | stats count by host sourceHost sourceIp | stats dc(host) as indexer_target_count values(host) as indexers_connected_to by sourceHost sourceIp | eventstats max(indexer_target_count) as total_pool | eval missing_indexer_count=total_pool-indexer_target_count | where missing_indexer_count != 0 Search forwarder logs to find those that fail to connect to a target This search assumes that all indexers have the same forwarders. With multiple clusters and sites, this might not be true
  • 91. © 2019 SPLUNK INC. Site aware forwarder connections index=_internal earliest=-24hrs latest=now sourcetype=splunkd TERM(eventType=connect_done) TERM(group=tcpin_connections) [| dbinspect index=* | stats values(splunk_server) as indexer | eval host_count=mvcount(indexer), search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"] | eval remote_forwarder=sourceHost." & ".if(sourceIp!=sourceHost,"(".sourceIp.")","") | stats count as total_connections by host remote_forwarder | join host [ search index=_internal earliest=-1hrs latest=now sourcetype=splunkd CMMaster status=success site* | rex field=message max_match=64 "(?<site_pair>site\d+,\"[^\"]+)" | eval cluster_master=host | fields + site_pair cluster_master | fields - _* | dedup site_pair | mvexpand site_pair | dedup site_pair | rex field=site_pair "^(?<site_id>site\d+),\"(?<host>.*)" | eventstats count as site_size by site_id cluster_master | table site_id cluster_master host site_size] | eval unique_site=site_id." & ".cluster_master | chart values(site_size) as site_size values(host) as indexers_connected_to by remote_forwarder unique_site | foreach site* [| eval "length_<<FIELD>>"=mvcount('<<FIELD>>')] Which forwarders connect to which site, and what is the coverage for that site? Sub-search returns indexer list Sub-search computes indexer to site mapping
  • 92. © 2019 SPLUNK INC. Indexer Starvation
  • 93. © 2019 SPLUNK INC. What is indexer starvation? Indexers must receive data to participate in search. Indexer starvation is when an indexer or an indexer pipeline has periods when it is starved of data. It will continue to receive replicated data and be assigned primaries by the CM to search replicated cold buckets, but it will not search hot buckets without site affinity.
  • 94. © 2019 SPLUNK INC. Smoke test: count connections to indexers index=_internal earliest=-1hrs latest=now sourcetype=splunkd TERM(eventType=connect_done) TERM(group=tcpin_connections) [| dbinspect index=* | stats values(splunk_server) as indexer | eval host_count=mvcount(indexer), search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"] | timechart limit=0 count minspan=31sec by host This quick smoke test shows if the indexers in the cluster have obviously varying numbers of incoming connections. When you see banding it is either a smoking gun or different forwarder groups per site. Indexer starvation is guaranteed if there are not enough incoming connections. Fix by increasing the number of connections: increase switching frequency and the number of pipelines on forwarders. Nice and healthy
  • 95. © 2019 SPLUNK INC. Funnel effect reduces connections When an intermediate forwarder aggregates multiple streams, it creates a super giant forwarder. You need to add more pipelines as each pipeline creates a new TCP output to the indexers. Note that time division multiplexing doesn’t mean 1 CPU = 1 pipeline, unless it is running at the maximum rate of about 20 MB/s. Apply configuration with care.
  • 96. © 2019 SPLUNK INC. How to instrument and tune IUFs Monitor outgoing connections to understand cluster coverage. Monitor CPU, load average, pipeline utilization and pipeline blocking. Find and reconfigure sticky forwarders to use EVENT_BREAKER. Monitor active channels to avoid s2s protocol churn. Monitor incoming connections to ensure even distribution of workload over pipelines. Use autoLBVolume. Enable forceTimeBasedAutoLB. Lengthen autoLBFrequency. Add up to 16 pipelines but don’t breach ~30% CPU (see the sketch after this slide). Monitor event delay to make sure it doesn’t increase. A well tuned intermediate forwarder can achieve 5% event distribution across 100 indexers within 5 mins
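  A hedged sketch of the pipeline half of that tuning (the value is illustrative; add pipelines gradually and watch CPU, load average and pipeline utilization as the slide suggests):

    # server.conf on the intermediate forwarder
    [general]
    parallelIngestionPipelines = 4   # each pipeline opens its own outbound connection to an indexer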
  • 97. © 2019 SPLUNK INC. Forwarder bugs When a forwarder forgets about an indexer
  • 98. © 2019 SPLUNK INC. Some forwarders were just born bad 6.6.0 - 6.6.7 7.0.0 – 7.0.2 They just give up on targets after a timeout. Rolling restarts cause them to abandon targets You can work around the problem with static IP addresses Remove where possible
  • 99. © 2019 SPLUNK INC. Prerequisite: Sub-searches How to write searches that write searches
  • 100. © 2019 SPLUNK INC. Consider the following search: it returns a single field “search” that is a simple search string
  • 101. © 2019 SPLUNK INC. Sub-searches that return “search” are executed as part of the outer search
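  As an illustration of the pattern, the dbinspect sub-search used throughout this deck returns exactly one field named search, and the outer search splices its value in as a host filter:

    index=_internal sourcetype=splunkd TERM(group=tcpin_connections)
        [| dbinspect index=_internal
         | stats values(splunk_server) as indexer
         | eval search="host IN (".mvjoin(mvfilter(indexer!=""), ", ").")"
         | fields search]
    | stats count by host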