Hunk - Unlocking The Power of Big Data Breakout Session
1. Copyright © 2015 Splunk Inc.
Splunk / Hunk Big Data
Analytics
Raanan Dagan
Sr. SE, Hadoop DE
2. SPLUNK TODAY
Platform for Machine Data
[Diagram: data sources – Forwarders, Syslog/TCP/Other, DB Connect, Stream, Mobile, Mainframe Data, VMware, Sensors & Control Systems – feed the platform, with apps such as Exchange and PCI Security on top]
600+ Ecosystem of Apps
3. Splunk – Big Data Technologies
• Relational Database (highly structured): Oracle, MySQL, IBM DB2, Teradata – RDBMS, SQL & MapReduce
• Distributed File System (semi-structured): Hadoop – HDFS Storage + MapReduce
• Key/Value, Columnar or Other (semi-structured): Cassandra, Accumulo, MongoDB – NoSQL
• Splunk: Temporal, Unstructured, Heterogeneous – Real-Time Indexing
5. Splunk and Hadoop
Hunk:
– Main use case = Analyze Hadoop data using Hadoop processing
Splunk Hadoop Connect:
– Main use case = Real-time export of data from Splunk to Hadoop
Hunk Archive:
– Main use case = Archive Splunk indexers to Hadoop
Splunk Monitor Hadoop:
– Main use case = Monitor Hadoop
7. Hunk – Unique
1. Run Natively in Hadoop:
– Uses Hadoop MapReduce
2. Mixed Mode:
– Allows for data preview
3. Auto-deploy splunkd to DataNodes:
– On-the-fly indexing
4. Access Control:
– Allows many users / many Hadoop directories / supports Kerberos
5. Schema on the Fly
9. Run Natively in Hadoop
[Diagram: the Hunk search head submits MapReduce jobs to an external resource (e.g. hadoop.prod); the NameNode and JobTracker (YARN) schedule tasks on DataNode/TaskTracker machines; indexing happens on the data nodes; tasks use a working directory in HDFS and results flow back to the search head]
10. Mixed-mode Search
[Diagram: over time, a Splunk stream phase returns previews until a switch-over to Hadoop MR / Splunk index]
• Data preview
• Allows users to search interactively by pausing and refining queries
11. Indexing on the Fly – Hunk Data Processing
[Diagram: the search head's ERP search process submits MapReduce jobs; search processes on the TaskTrackers read raw data from HDFS, preprocess it, and stream remote results back to the search head, which merges them into the final search results]
12. Role-based Security for Shared Clusters
• Provide role-based security for Hadoop clusters
• Access Hadoop resources under security and compliance
• Integrates with Kerberos for Hadoop security
[Diagram: pass-through authentication maps each user to a cluster queue – Business Analyst to "Biz Analytics", Marketing Analyst to "Marketing", Sys Admin to "Prod"]
13. 13
We added these in Hunk 6.*
13
1. Report Acceleration: Get results in seconds
2. Hive Schema: Expose User Created Schema, Parquet, Sequence,
ORC, RC
3. Data Exploration: UI to navigate Hadoop
4. Hunk on EMR (Amazon): Hunk by the Hour
5. Search Head Clustering: Unlimited number of end-users
6. Archive Splunk Indexers to HDFS: Search through years of data
15. Archiving Splunk Enterprise to Hunk-HDFS
• Archive buckets to Hadoop (HDFS) instead of freezing buckets or throwing data away
• Store old data at up to 1/10 the cost in cheap Hadoop batch storage instead of on SANs
• Optimize Splunk Enterprise search head performance for real-time monitoring, alerting and dashboarding with short-term historical context
• Use Hunk to search, analyze and visualize months or years of historical data in Hadoop
• Run federated queries and dashboards across Splunk Enterprise and Hunk
[Diagram: WARM and COLD buckets roll to FROZEN copies archived on Hadoop clusters]
17. Yahoo – Visualizing Hadoop
New Search:
index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue
Last 7 days – 1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)
[Visualization: sum of gb_hours per 6h span, Wed May 21 through Mon May 26, 2014, split by queue – apg_dailyhigh_p3, apg_dailymedium_p5, apg_hourlyhigh_p1, apg_hourlylow_p4, apg_hourlymedium_p2, apg_p7, curveball_large, curveball_med, slingshot, slingstone, OTHER]
• 600PB of data
• Very large clusters used by many groups across the enterprise
• 35,000 individual DataNodes
• Hadoop is provided as a self-service
18. Vantrix – Mobile Media Optimization
Analytics application: 144 Hadoop nodes, 69 TB SSD storage
10 million subscribers generate:
• 80GB of raw session log data / day
• 26 million video data session records
Hunk query:
• 20 sec – search through 27M events
• Returning 4.7M events
Hunk as indexer – automatically indexed and counted field value occurrences
Hunk as self-service – proved invaluable for identifying and exploring use cases
Hunk business value – helped identify when subscribers abandon video
20. Hunk – Connect to NoSQL & SQL Databases
• Build custom streaming resource libraries
• Search and analyze data from other data stores in Hunk
• In partnership with leading NoSQL vendors
• Use in conjunction with DB Connect for relational database lookups
22. MongoDB-Specific Integration Highlights
index=mongodb foo=xyz | timechart avg(bar) by baz
Predicate pushdown: filtering terms are processed on the MongoDB side, so only results where the field foo matches xyz are returned.
Projections: only the fields mentioned in the particular search are returned – in this case _time, bar and baz.
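A minimal pure-Python model of these two optimizations may help. The real connector pushes the query down to MongoDB itself; this sketch only simulates the filter-then-project behavior over in-memory documents, with invented field values:

```python
def pushdown_query(docs, predicate, fields):
    """Simulate predicate pushdown and projection: the data store applies
    the filter and returns only the requested fields, so the search head
    never sees non-matching documents or unused fields."""
    for doc in docs:
        if all(doc.get(k) == v for k, v in predicate.items()):
            yield {k: doc[k] for k in fields if k in doc}

docs = [
    {"_time": 1, "foo": "xyz", "bar": 10, "baz": "a", "unused": "x"},
    {"_time": 2, "foo": "abc", "bar": 20, "baz": "b", "unused": "y"},
]
# Mirrors: index=mongodb foo=xyz | timechart avg(bar) by baz
results = list(pushdown_query(docs, {"foo": "xyz"}, ["_time", "bar", "baz"]))
```

Only the first document survives the predicate, and the `unused` field is dropped by the projection, which is exactly the traffic saving the slide describes.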
23. Splunk DB Connect
Reliable, scalable, real-time integration between Splunk and traditional relational databases
• Enrich search results with additional business context
• Easily import data into Splunk for deeper analysis
• Integrate multiple DBs concurrently
• Simple set-up, non-invasive and secure
[Diagram: a Java Bridge Server with connection pooling provides database lookup and database query over JDBC to Microsoft SQL Server, Oracle and other databases]
24. The 6th Annual Splunk Worldwide Users' Conference
September 21-24, 2015 – The MGM Grand Hotel, Las Vegas
Did you like this session on Splunk for Big Data? You should check out these sessions at .conf2015:
• Splunk Hunk – Performance, Best Practices, and Troubleshooting
• Archive Splunk Data and Access Using Hadoop Tools
• Hunk and Elastic MapReduce (Amazon EMR)
• Real World Big Data Architecture (Splunk, Hunk, DB Connect)
• Splunk Distributed Processing with Spark
Register at: conf.splunk.com
25. The 6th Annual Splunk Worldwide Users' Conference
September 21-24, 2015 – The MGM Grand Hotel, Las Vegas
• 50+ Customer Speakers
• 50+ Splunk Speakers
• 35+ Apps in Splunk Apps Showcase
• 65 Technology Partners
• 4,000+ IT & Business Professionals
• 2 Keynote Sessions
• 3 days of technical content (150+ Sessions)
• 3 days of Splunk University
– Get Splunk Certified
– Get CPE credits for CISSP, CAP, SSCP, etc.
– Save thousands on Splunk education!
Register at: conf.splunk.com
27. We Want to Hear Your Feedback!
After the breakout sessions conclude, text "Splunk" to 878787 and be entered for a chance to win a $100 AMEX gift card!
Since then, Splunk has invested significantly to expand from a search tool to a mission-critical platform. The platform includes hundreds of data types and can scale to massive volumes.
Today, it's more than Splunk Enterprise: we've added Splunk Cloud, Hunk, and Splunk MINT for mobile intelligence, and have more than 600 apps.
Machine data is more than logs! It's wire data, mainframe data, mobile device data, sensor data, and metrics.
Your use cases have evolved well beyond troubleshooting, so we're investing in solutions that leverage the power of Splunk Enterprise to provide you with packaged views into your data for faster, deeper insights.
Our most well-known solution is Splunk Enterprise Security, and if you aren't using it yet, we encourage you to find out why it's turning the traditional SIEM market upside down.
How has big data evolved over time? For a long time, "big data" was simply a large database.
The database industry, in order to handle large data, moved to many smaller databases. Horizontal partitioning (also known as sharding) is a database design principle whereby rows of a database table are held separately (for example, A–D in one database, E–H in a second database, and so on).
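The range-based sharding just described can be sketched in a few lines. The shard names and letter ranges here are invented for illustration, not any particular product's layout:

```python
# Illustrative range-based sharding: a row is routed to a database by the
# first letter of its key, as in the A-D / E-H example above.
SHARDS = {
    "shard_1": ("A", "D"),
    "shard_2": ("E", "H"),
    "shard_3": ("I", "Z"),
}

def route(key: str) -> str:
    """Return the shard whose letter range covers the key's first letter."""
    first = key[0].upper()
    for shard, (lo, hi) in SHARDS.items():
        if lo <= first <= hi:
            return shard
    raise KeyError(f"no shard covers key {key!r}")
```

Each shard holds a disjoint slice of the rows, so queries that know the key only touch one database; queries that don't must fan out to all of them, which is the scaling pain Hadoop later addressed differently.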
Hadoop, which grew out of Google's MapReduce and GFS papers, was adopted as the de facto big data system. Hadoop is an open source project from Apache that has evolved rapidly into a major technology movement. It has emerged as a popular way to handle massive amounts of data, including structured and complex unstructured data. Its popularity is due in part to its ability to store and process large amounts of data effectively across clusters of commodity hardware. Apache Hadoop is not actually a single product but rather a collection of several components. For the most part, Hadoop is a batch-oriented system.
** Teradata Aster Data and SQL-on-Hadoop systems provide SQL interfaces that can talk to Hadoop.
** Cassandra and HBase are NoSQL databases that can process data using a key/value model in real time.
Splunk = a temporal, unstructured, heterogeneous, real-time analytics platform.
Quick to set-up, scales to multiple concurrent databases
Enrich machine data with structured data from relational databases
Execute database queries directly from the Splunk user interface
Browse and navigate database schemas and tables
Combine machine data with structured data from relational databases
Search execution:
The Hunk search head takes the list of contents of the directories in the virtual index. The search head filters directories and files based on the search and time range (partition pruning).
The NameNode and JobTracker (the MapReduce Resource Manager in YARN) feed data from the MapReduce framework to the search process. The process computes file splits, then constructs and submits the MapReduce jobs.
Hunk streams a few file splits from HDFS and processes them in the search head to provide quick previews. The search head consumes and merges the MapReduce results (providing incremental previews) while the MapReduce jobs kick off.
The data nodes run a copy of splunkd to process the jobs and write them to a working directory in HDFS.
Final results are stored in the Hunk search head.
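The partition-pruning step above can be sketched as a time-range filter over date-partitioned directories. The /YYYY/MM/DD/ layout and helper name are assumptions for illustration, not Hunk's actual internals:

```python
from datetime import date

def prune_partitions(paths, start, end):
    """Keep only paths whose trailing /YYYY/MM/DD/ component falls inside
    [start, end]; everything else can be skipped without reading it."""
    kept = []
    for p in paths:
        parts = p.strip("/").split("/")
        y, m, d = (int(x) for x in parts[-3:])  # last 3 components = date
        if start <= date(y, m, d) <= end:
            kept.append(p)
    return kept

paths = ["/data/2014/05/20/", "/data/2014/05/25/", "/data/2014/06/01/"]
pruned = prune_partitions(paths, date(2014, 5, 21), date(2014, 5, 27))
```

Pruning before submitting MapReduce jobs is what keeps a narrow time-range search from scanning the whole virtual index.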
Hunk utilizes the Splunk Search Processing Language, the industry-leading method to enable interactive data exploration across large, diverse data sets. There is no requirement to "understand" data up front. Customers of Splunk Enterprise can reuse their Search Processing Language knowledge and skill set for data stored in Hadoop. Any command whose output depends on the event input order may yield different results: Splunk Enterprise guarantees that events are delivered in descending time order, but Hunk doesn't. This is why transaction and localize do not work.
We can see the results from the intermediate Hadoop Map jobs being streamed into the Splunk UI even before all the Map jobs are finished; once all the Hadoop Maps are done processing, Splunk displays the full results.
In essence, Splunk acts as the Hadoop Reduce phase, so there is no need to use Hadoop for that phase.
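The "Splunk as the Reduce phase" idea can be sketched as the search head merging partial aggregates streamed from each Map task. The data and names below are invented for illustration:

```python
from collections import Counter

def merge_partials(partials):
    """Merge streamed partial aggregates into one running total, the way a
    search head can reduce Map outputs as they arrive (enabling previews)."""
    total = Counter()
    for partial in partials:   # each element is one Map task's output
        total.update(partial)  # a preview could be rendered after each merge
    return dict(total)

# Three Map tasks emit partial per-queue event counts:
map_outputs = [{"prod": 3, "marketing": 1}, {"prod": 2}, {"marketing": 4}]
merged = merge_partials(map_outputs)
```

Because the merge is incremental, partial results can be shown as soon as the first Map task finishes, without waiting for a Hadoop Reduce stage.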
Hunk starts the streaming and reporting modes concurrently. Streaming results show until the reporting results come in, allowing users to search interactively by pausing and refining queries.
This is a major, unique advantage of Hunk compared to alternative approaches such as Hive or other SQL-on-Hadoop systems, which require a fixed schema in an effort to speed up searches; Hunk retains the combination of schema on the fly with results preview.
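"Schema on the fly" means fields are extracted at search time rather than declared before loading. A minimal sketch, with an invented log line and regex (not a Hunk format):

```python
import re

def extract(line, pattern):
    """Apply a search-time field extraction to one raw event; fields exist
    only because this particular search's pattern asks for them."""
    m = re.search(pattern, line)
    return m.groupdict() if m else {}

# No schema was declared for this raw event; the search supplies one ad hoc.
LINE = "10.0.0.1 - GET /index.html 200 512"
fields = extract(LINE, r"(?P<method>GET|POST) (?P<uri>\S+) (?P<status>\d{3})")
```

A different search could apply a different pattern to the same raw data, which is exactly what fixed-schema systems like Hive cannot do without redefining tables.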
In this new feature, planned for the next Hunk release (version 6.2.1), you archive buckets to Hadoop (the Hadoop Distributed File System, or HDFS) instead of freezing buckets or throwing data away. This significantly lowers the total cost of ownership (TCO) for Splunk Enterprise installations while giving security analysts, risk managers and marketers access to months or years of historical data integral to their job success.
Store old data at up to 1/10 the cost in cheap Hadoop batch storage instead of on SANs.
Optimize Splunk Enterprise search head performance for real-time monitoring, alerting and dashboarding with short-term historical context.
Use Hunk to search, analyze and visualize months or years of historical data in Hadoop.
Run federated queries and dashboards across Splunk Enterprise and Hunk.
Search execution:
The Hunk search head receives a search from the end user and splits it into multiple queries against multiple indexes.
Each query spawns a new search process. Each search is processed depending on whether it's a native Splunk distributed search or whether it uses an External Results Provider. MongoDB and Hadoop are implemented via External Results Providers.
The MongoDB provider receives JSON config via stdin, translates and executes the Hunk query against MongoDB, and returns results via stdout.
Hunk receives the results from multiple providers and runs a reduction to merge them into a single set of results.
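The stdin/stdout contract just described can be sketched as follows. The config field names (`query`, `sample_rows`) are assumptions for this sketch, not the actual ERP protocol, and the real provider would query MongoDB instead of filtering canned rows:

```python
import io
import json
import sys

def run_provider(stdin=sys.stdin, stdout=sys.stdout):
    """Read a JSON search config on stdin; emit matching rows as
    newline-delimited JSON on stdout, like an External Results Provider."""
    config = json.load(stdin)
    query = config.get("query", {})  # the translated predicate
    for row in config.get("sample_rows", []):
        if all(row.get(k) == v for k, v in query.items()):
            stdout.write(json.dumps(row) + "\n")

# Demo: one matching and one non-matching row.
cfg = {"query": {"foo": "xyz"},
       "sample_rows": [{"foo": "xyz", "bar": 1}, {"foo": "abc", "bar": 2}]}
out = io.StringIO()
run_provider(io.StringIO(json.dumps(cfg)), out)
```

Keeping the provider a plain stdin/stdout process is what lets Hunk plug in new data stores without changing the search head itself.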
Splunk DB Connect delivers reliable, scalable, real-time integration between Splunk Enterprise and traditional relational databases. With Splunk DB Connect, structured data from relational databases can be easily integrated into Splunk Enterprise, driving deeper levels of operational intelligence and richer business analytics across the organization.
Organizations can drive more meaningful insights for IT operations, security and business users. For example, IT operations teams can track performance, outage and usage by department, location and business entities. Security professionals can correlate machine data with critical assets and watch-lists for: incident investigations, real-time correlations and advanced threat detection using the award-winning Splunk Enterprise. Business users can analyze service levels and user experience by customer in real-time to make more informed decisions.
And finally, I would like to encourage all of you to attend our user conference in September.
The energy level and passion that our customers bring to this event is simply electrifying.
Combined with inspirational keynotes and 150+ breakout sessions across all areas of operational intelligence, it is simply the best forum to bring our Splunk community together, to learn about new and advanced Splunk offerings, and most of all to learn from one another.