Presented at SplunkLive! Frankfurt 2018:
Splunk Data Collection Architecture
Apps and Technology Add-ons
Demos / Examples
Best Practices
Resources and Q&A
5. Basic Architecture Refresh
How Splunk works at a high level
distributed search
auto-load balanced indexing
change tickets | web access logs | windows event logs / perfmon | linux logs | vmware logs | configs and metrics | firewall data
app server logs | jmx and jvm metrics | database logs and metrics | product pricing
Search Head - Splunk’s UI
Indexer – Data Store/Processing
Forwarder - Collect & Send
Agentless
6. What can Splunk Ingest?
Agent-Less and Forwarder Approach for Flexibility and Optimization
syslog
TCP/UDP
Event Logs, Active Directory, OS Stats
Unix, Linux and Windows hosts
Universal Forwarder
syslog hosts
and network devices
Local File Monitoring
Universal Forwarder
Aggregation
host Windows
Aggregated/API Data Sources
Pre-filtering, API subscriptions
Heavy Forwarder
Mainframes / *nix
Wire Data
Splunk Stream
Universal Forwarder or
HTTP Event Collector
DevOps, IoT,
Containers
HTTP Event Collector
(Agentless)
shell
API
perf
7. Collects Data From Remote Sources
• Splunk Universal Forwarders collect data from local data sources and send it to
one or more Splunk indexers.
Scalable
• Thousands of universal forwarders can be installed with little impact on network
and host performance.
Broad Platform Support
• Available for installation on diverse computing platforms and architectures. Small
computing/disk/memory footprint.
Splunk Universal Forwarder
The Splunk Universal Forwarder is a Separate Download
8. Also Collects Data From Remote Sources...
• ...but is typically used for data aggregation for passage through firewalls, data
routing and/or filtering, scripted/modular inputs, or for HEC endpoints (more on this
in a bit).
Often run as a “data collection node” for API/scripted data access
• A heavy forwarder is typically run as a “data collection node” for technologies
requiring access via API, and not for collection of data from the node itself
Platform Support limited to that of Splunk Enterprise
• Being standalone, Heavy Forwarders are typically run on Linux VMs...
Splunk Heavy Forwarder
Configured via the regular Splunk Enterprise download
9. Large-Scale Data Collection Directly from Applications
• Provides a simple, load-balancer-friendly, secure way (token-based JSON or RAW
API) to send data at scale from applications directly to Splunk
Agentless
• Data at scale can be sent directly to indexer tier, bypassing forwarder layer
Broad Development Platform Support
• Logging drivers available for many platforms (docker, AWS Lambda, etc.) and
simple HTTP endpoint compatible with all development environments
Splunk HTTP Event Collector (HEC)
The Newest Way to Collect Data at Scale
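To make the token-based JSON API concrete, here is a minimal Python sketch that assembles a HEC event POST (URL, headers, body). The endpoint path `/services/collector/event` and the `Authorization: Splunk <token>` header follow the HEC event API; the hostname, port, token, and field values below are placeholders for your environment.

```python
import json

def build_hec_request(token, event, sourcetype="demo:app", host="app01",
                      hec_url="https://splunk.example.com:8088"):
    """Assemble URL, headers, and JSON body for a HEC event POST.

    hec_url and token are placeholders; swap in your indexer/load-balancer
    address and a token created in Splunk's HEC settings.
    """
    url = hec_url + "/services/collector/event"
    headers = {"Authorization": "Splunk " + token}
    body = json.dumps({
        "event": event,            # the payload Splunk will index
        "sourcetype": sourcetype,  # index-time sourcetype assignment
        "host": host,              # index-time host assignment
    })
    return url, headers, body

url, headers, body = build_hec_request(
    "00000000-0000-0000-0000-000000000000",
    {"action": "login", "user": "alice"})
```

Any HTTP client can then POST `body` to `url` with those headers; because the API is plain HTTPS, it sits comfortably behind a load balancer.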
11. App vs. Add-on
▶ Your first choice when onboarding
new data
• Clean and ready to go out-of-the-box
▶ App is a complete solution
• Typically uses one or more TAs
▶ Add-on
• Abstracts collection methodology (log file, API,
scripted input, HEC)
• Typically includes relevant field extractions
(schema-on-the-fly)
• Includes relevant config files (props/transforms)
and ancillary scripts/binaries
15. ▶ Using the Data Previewer
• Upload a File (You did this in the Getting Started Hands-on Session!)
▶ Installing and using Apps and Add-ons
▶ Continuous Local File Monitoring (Universal Forwarder)
• Monitor a directory and multiple files in real-time
• Most common architecture for syslog-based sourcetypes
What You Will See
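For the continuous local file monitoring piece, the Universal Forwarder is driven by a monitor stanza in inputs.conf. A minimal sketch (the path, index, and sourcetype values below are illustrative, not from the demo):

```ini
# inputs.conf on the Universal Forwarder -- example values only
[monitor:///var/log/remote/*/syslog.log]
sourcetype = syslog_raw
index = test
disabled = false
```

New files matching the path are picked up and tailed in real time, which is why this is the most common architecture for syslog-based sourcetypes.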
17. Components of a Splunk Success Program
Architecture
&
Infrastructure
Operations
& Supporting
Tools
Staffing
Data
On-
Boarding
User
On-Boarding
Inform
18. ▶ Architect
• Design and optimize Splunk architecture for large-scale/distributed
deployments.
▶ System Administrator
• Implement and maintain Splunk infrastructure and configuration
▶ Search Expert
▶ App Developer
▶ Knowledge Manager
• Perform data interpretation, classification and enrichment
• Work with System Administrator to properly onboard data
Typical Splunk Staffing Roles
19. ▶ Define on-boarding process for
new data sources / apps
▶ Repeatable, documented
process
▶ Provide customer interview
forum or survey
▶ Integrate with service workflow
Data Onboarding Tasks
New Data Source Request
Provide a data sample
Describe the data’s structure
timestamp | timezone | single-/multi-line
sourcetype | interesting fields
Describe initial uses for the data
searches | alerts | reports | dashboards
How to collect the data?
UF | syslog | API
How long to retain the data?
Who should have access?
Apply the Common Information Model
Are there TA’s available?
Validate
20. Ladies and Gentlemen, We’ll be Boarding Soon!
Six Things to Get Right at Index Time
Source
Event Boundary / Line Breaking
Host
Index
Sourcetype
Date / Timestamp
21. ▶ Gather info (New Data Source Request):
• Where does this data originate/reside? How will Splunk collect it?
• Which users/groups will need access to this data? Access controls?
• Determine the indexing volume and data retention requirements
• Will this data need to drive existing dashboards (ES, PCI, etc.)?
• Who is the Owner/SME for this data?
▶ Map it out:
• Get a "big enough" sample of the event data
• Identify and map out fields (ensure CIM compliance)
• Assign sourcetype and TA names according to CIM conventions
Pre-Board Essentials
22. ▶ Identify the specific sourcetype(s) - onboard each separately
• Important – syslog is not a sourcetype!
• More on this later
▶ Check for pre-existing app/add-on on splunk.com – don't
reinvent the wheel!
▶ Start with a “Test” index, Verify index-time settings correct
(previous slide)
• Try the Data Previewer first
• tweak props/transforms “by hand” only if absolutely necessary
Pre-Board Essentials (cont.)
23. ▶ Find and fix index-time problems BEFORE
polluting your index
▶ A try-it-before-you-fry-it interface for figuring out
• Event breaking
• Timestamp recognition
• Timezone assignment
▶ Provides most necessary props.conf parameter settings
Your Friend, the Data Previewer
24. If you have to get into the weeds...
Always set these six parameters in props.conf
# SL17
[SL17]
TIME_PREFIX = ^
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 19
SHOULD_LINEMERGE = False
LINE_BREAKER = ([\r\n]+)\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2}
TRUNCATE = 10000
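To see what the breaking and timestamp settings buy you, here is a small Python sketch that mimics their effect on a two-event buffer. Note this is an approximation for illustration: Splunk's parser uses the LINE_BREAKER capture group as the event boundary, whereas the sketch uses `re.split` with a lookahead; the sample log lines are invented.

```python
import re
from datetime import datetime

# Invented raw buffer: two events, the second spans multiple lines.
raw = ("2018-03-06 09:15:01 action=login user=alice\n"
       "2018-03-06 09:15:07 action=error\n"
       "  Traceback: widget failed\n")

# Approximates LINE_BREAKER: break only where a newline run is
# immediately followed by a leading timestamp.
events = re.split(r"[\r\n]+(?=\d{4}-\d{2}-\d{2}\s\d{2}:\d{2}:\d{2})",
                  raw.strip("\r\n"))

# Approximates TIME_PREFIX = ^, TIME_FORMAT, MAX_TIMESTAMP_LOOKAHEAD = 19:
# parse the first 19 characters of each event as its timestamp.
stamps = [datetime.strptime(e[:19], "%Y-%m-%d %H:%M:%S") for e in events]

print(len(events))  # 2 -- the traceback line stays glued to its event
```

With SHOULD_LINEMERGE disabled and an explicit LINE_BREAKER, Splunk skips the expensive line-merging pass, which is also a throughput win on the indexers.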
25. ▶ The Common Information Model (CIM) defines relationships in
the underlying data, while leaving the raw machine data intact
▶ A naming convention for fields, eventtypes & tags
▶ More advanced reporting and correlation requires that the data
be normalized, categorized and parsed
▶ CIM-compliant data sources can drive CIM-based dashboards
(ES, PCI, others)
What Is the CIM and Why Should I Care?
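As a concrete (hypothetical) illustration of CIM normalization, a firewall TA might map vendor field names onto the CIM names that ES/PCI dashboards expect, using aliases and calculated fields in props.conf. The sourcetype and vendor field names below are made up; the FIELDALIAS/EVAL mechanisms are standard props.conf features.

```ini
# props.conf -- hypothetical sourcetype and vendor field names
[vendor:firewall]
FIELDALIAS-cim_src  = src_ip AS src
FIELDALIAS-cim_dest = dst_ip AS dest
EVAL-action = if(act="deny", "blocked", "allowed")
```

Because these are search-time knowledge objects, the raw machine data stays intact; only the schema-on-the-fly layer changes.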
26. ▶ Syslog is a protocol – not a sourcetype
▶ Syslog typically carries multiple sourcetypes
▶ Best to pre-filter syslog traffic using syslog-ng or rsyslog
• Do not send syslog data directly to Splunk over a network port (514)
▶ Use a UF or HEC to transport data to Splunk (next slide)
• Ensures proper load balancing and data distribution
• Secure and efficient
• Insulates against Splunk component failures
▶ See https://www.splunk.com/blog/2017/03/30/syslog-ng-and-hec-scalable-aggregated-data-collection-in-splunk.html for more info on this topic
A special note on Syslog
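A minimal syslog-ng sketch of the recommended pattern: listen on 514, write one file tree per sending host, and let a Universal Forwarder (or HEC shipper) pick the files up from there. Ports and paths are examples, not a tuned production config.

```
# syslog-ng.conf -- example values only
source s_network {
    udp(port(514));
    tcp(port(514));
};
destination d_per_host {
    file("/var/log/remote/${HOST}/messages.log" create-dirs(yes));
};
log { source(s_network); destination(d_per_host); };
```

Splitting by `${HOST}` (and optionally by program or facility) is what makes it possible to assign the correct originating sourcetype per device class instead of a generic "syslog".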
29. ▶ https://splunkbase.splunk.com/app/2962/
▶ For creating REST API, Scripted or Modular Inputs through a GUI
▶ Helps your Add-ons get Certified
▶ Can also use on sample data to build out configs as well
Check Out the New Add-on Builder!
30. ▶ Videos!
• http://www.splunk.com/view/education-videos/SP-CAAAGB6
▶ Getting Data In – Splunk Docs
• http://docs.splunk.com/Documentation/Splunk/latest/Data/WhatSplunkcanmonitor
▶ Date and time format variables
• http://docs.splunk.com/Documentation/Splunk/latest/SearchReference/Commontimeformatvariables
▶ Getting Data In – Dev Manual (very thorough!)
• http://dev.splunk.com/view/dev-guide/SP-CAAAE3A
▶ HTTP Event Collector
• http://docs.splunk.com/Documentation/Splunk/latest/Data/UsetheHTTPEventCollector
▶ .conf Sessions
• https://conf.splunk.com/session/2015/conf2015_Aduca_Splunk_Delpoying_OnboardingDataIntoSplunk.pdf
▶ GOOGLE!
Where to Go to Learn More
31. ORLANDO FLORIDA
Walt Disney World Swan and Dolphin Hotels
.conf18:
Monday, October 1 – Thursday, October 4
Splunk University:
Saturday, September 29 – Monday, October 1
Today’s goal is to talk about Data Onboarding or “Getting Data Into Splunk” from a ”New to Splunk” perspective. More specifically we’ll talk about the following and then do a little bit of demo.
1. Splunk Platform – a refresher
You’ve seen the Splunk Overview, but I want to quickly go through a few overview slides and relate why data onboarding is important to them
2. What can Splunk Eat
Then we’ll identify not only the data sources that Splunk can collect, but the methods of collection as well
3. Apps and Add-ons
Next we’ll discuss how Apps and Add-ons from the ecosystem play a role
4. Data Onboarding Examples/Demos
We’ll get into a few demos
5. Data Onboarding Best Practices and Next Steps
And finally we’ll get into some common best practices and what to do from here!
1. Explain the different components at a high level
2. The forwarder is one of the many ways to collect data in Splunk – we will discuss setting up and using a forwarder in more detail later in the presentation
1. Spend some time talking about each collection method
2. Today we will concentrate on and demo the ones highlighted in blue
Universal Forwarders provide reliable, secure data collection from remote sources and forward that data into Splunk software for indexing and consolidation. They can scale to tens of thousands of remote systems, collecting terabytes of data.
Heavy forwarders allow for the aggregation, filtering and routing of data, as well as serving as a “data collection node” for applications such as DB Connect and other API-driven data sources. They are typically *not* used for local data collection.
HTTP Event Collector (HEC, pronounced H-E-C) is a new, robust, token-based JSON/raw API for sending events to Splunk from anywhere, without requiring a forwarder. It is designed for performance and scale. With a load balancer in front, it can be deployed to handle millions of events per second. It is highly available and it is secure. It is easy to configure, easy to use, and best of all it works out of the box. A few other cool tidbits: it supports gzip compression, batching, HTTP keep-alive, and HTTP/HTTPS.
Splunk apps and add-ons: what & why?
Splunk apps allow developers to extend data ingestion and processing capabilities of Splunk Enterprise for your specific needs. Apps facilitate more efficient completion of domain-specific tasks by the end user.
High-level perspective
A Splunk app is a prebuilt collection of additional capabilities packaged for a specific technology, or use cases, which allows a more effective usage of Splunk Enterprise. You can use Splunk apps to gain the specific insights you need from your machine data.
Depending on the type and complexity of those use cases, and also whether the developer wants certain app parts to be configured or distributed separately (potentially by a third party), an app may rely on various add-ons.
An add-on is a technical component that can be re-used across a number of different use cases and packaged with one or more Splunk apps. Add-ons may contain one or more knowledge objects, which encapsulate a specific functionality focused on a single concern and its configuration. Using an add-on should help to reduce the technical risk and cost of building an app.
Additionally we have the community!
The community provides thousands of apps and add-ons that can help you onboard and ingest thousands of different data types, and new content is added every day!
Let’s look at how we would use an Add-on from Splunkbase to get data in.
Use an example that you are comfortable with and showcases using an add-on to get data in and mapped properly.
< If you have another data source or want to improvise a little here feel free – otherwise you can use the following demo flow below >
< Support files can be found here: LINK >
1. Install an instance of Splunk on your laptop.
2. Create an inputs.conf that monitors a directory that will contain the PANW log files, using the PANW sourcetype from the TA. Leave the directory empty for now.
3. Show the data preview wizard with the apache data. Show how Splunk understands the data (and assigns an appropriate sourcetype). Show proper field extractions when ingest is complete.
4. Use the wizard to upload one of the 5 PANW data files. Show how the sourcetype is *not* automatically set, and that there are no relevant choices in the sourcetype picker in the Wizard. Set the sourcetype to some arbitrary value. Show that there are no relevant field extractions after ingest.
5. Now, install the PANW app. Make sure to RESTART.
6. Use the wizard to import the next PANW data file. Now show that there *is* a relevant sourcetype in the picker. Select it.
7. Show how fields are extracted properly in the data. *HOWEVER* -- note that the original sourcetype is automatically changed by the TA, and you will get no results when jumping from the wizard into the search window. Instead, show the 5 or 6 new sourcetypes that get generated as a result of the TA doing its thing.
8. Lastly, deposit the 3rd PANW data file into the monitor directory set up earlier. Show the data in search, correctly sourcetyped.
9. Move the file to a “backup” filename in the monitored directory. Show how Splunk does *not* reingest the data.
10. Add the 4th PANW data sample. Show how the UF handles this.
In this next section we are only looking at the tip of the iceberg. Data Onboarding can quickly become an advanced topic so the point of this next session is to introduce you to some of the most important/key points to get you started. After that you’ll need to do some research and learn the specifics yourself.
These are the components that make up a successful Splunk program – both large and small. In a very large deployment, individual people (or more) can be dedicated to each of these components.
Appropriate staffing will ensure these components are properly addressed. The person responsible for data onboarding from an architectural perspective is the Knowledge Manager.
It is important to have a defined, documented, and repeatable process for data onboarding.
Explain Index Time
Spend some time saying why these are so important for Splunk. Mention there will be references and resources at the end of the presentation to help dive deeper into these topics.
It is important to not only get the technical details right, but also the data stewardship issues: Who owns the data, who can see it, and how long to keep it?
It is important to “do the homework” prior to onboarding, not only to get the index-time parameters correct (previous slide) but also to ensure the resultant data in Splunk will be of value to the widest variety of people and use cases
Make sure to show this in the demo, this slide is just a follow up reminder
These are the minimum parameters that should be set when creating a new data source. Again, like I said when I flashed up the Splunk Apps site: find something similar to your source and re-work it. But make sure it includes these parameters.
Normalizes data from different sources – Host and hostname discussion
Syslog represents almost 50% of a typical Splunk installation’s data. And yet syslog itself is simply the protocol over which a number of devices’ log data flows. Be sure to *not* use syslog as the sourcetype, but rather that of the originating data. Use appropriate syslog tools to pre-filter data. They’re good at it, they’re free, they’re well-documented, and they integrate well with Splunk.
HEC is the newest, and most scalable, way to collect syslog-based data.
In addition to SplunkLive!, .conf, docs, Answers, meetups, etc.