3. Agenda
1. What is Hunk?
2. Powerful Developer Platform
3. Preparation
4. Connect Hunk to HDFS and MapReduce
5. Create Virtual Indexes
6. MapReduce as the Orchestration Framework
7. Search Data in Hadoop
8. Flexible, Iterative Workflow for Business Users
4. Explore, Analyze, Visualize Data in Hadoop
No fixed schema to search unstructured data
Preview results while MapReduce jobs start
Easier app development than in raw Hadoop
Unlock business value of data in Hadoop
Fast to learn instead of scarce skills
Integrated – explore, analyze and visualize
6. Connect to HDFS and MapReduce
Connect to Apache HDFS and MapReduce
or your choice of Hadoop distribution
Hadoop Cluster 1
7. Unmet Needs for Hadoop Analytics

OPTION 1: "Do it yourself" Hadoop / Pig
Problems:
• Scarce skill sets to hire
• Need to know MapReduce
• Wait for slow jobs to finish
• No results preview
• No built-in visualization
• No granular authentication
• Slow time to value

OPTION 2: Hive or SQL on Hadoop
Problems:
• Pre-defined fixed schema
• Need knowledge of data
• Miss data that "doesn't fit"
• No results preview
• No built-in visualization
• Scarce skill sets to hire
• Slow time to value

OPTION 3: Extract to in-memory store
Problems:
• Data too big to move
• Limited drill down to raw data
• No results preview
• Another data mart
• Expensive hardware
8. Hadoop in Real Life

MapReduce job for Hadoop (Java), abridged:

public class WordCount extends Configured implements Tool {
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    static enum Counters { INPUT_WORDS }

    private Text word = new Text();
    private boolean caseSensitive = true;
    private Set<String> patternsToSkip = new HashSet<String>();
    private long numRecords = 0;
    private String inputFile;

    public void configure(JobConf job) {
      caseSensitive = job.getBoolean("wordcount.case.sensitive", true);
      inputFile = job.get("map.input.file");
      if (job.getBoolean("wordcount.skip.patterns", false)) {
        Path[] patternsFiles = new Path[0];
        try {
          patternsFiles = DistributedCache.getLocalCacheFiles(job);
        } catch (IOException ioe) {
          System.err.println("Caught exception while getting cached files: "
              + StringUtils.stringifyException(ioe));
        }
        for (Path patternsFile : patternsFiles) {
          parseSkipFile(patternsFile);
        }
      }
    }
    // ... map(), parseSkipFile(), the Reduce class, and run() continue for many more lines

Using Hunk:

index=hadoop | wc usestopwords=f | stats sum(count) by word
9. Integrated Analytics Platform for Hadoop Data
Full-featured, integrated product
Insights for everyone
Works with what you have today
Explore, Analyze, Visualize, Dashboards, Share
Hadoop (MapReduce & HDFS)
10. What Hunk Does Not Do
Hunk does not replace your Hadoop distribution
Hunk does not replace or require Splunk Enterprise
Interactive but not real time
No data ingest management (that’s Flume or Sqoop)
No Hadoop operations management
11. Product Portfolio
Splunk Enterprise: real-time indexing, real-time search
Splunk Apps: vibrant and passionate developer community
IT Ops. | Security & Compliance | Web Intelligence | App Dev & App Mgmt. | Business Analytics
Splunk Hadoop Connect | DB Connect
Hunk: ad hoc analytics of historical data in Hadoop
Developers building big data apps on top of Hadoop
360° Customer View | Complete Security Analytics | Product and Service Analytics
12. Powerful Developer Platform with Familiar Tools
API and SDKs: JavaScript, Java, Python, PHP, C#, Ruby
Add new UI components
Integrate into existing systems
With known languages and frameworks
14. MapReduce as the Orchestration Framework
1. The Hunk Search Head copies the splunkd binary (.tgz) to HDFS
2. Each TaskTracker copies the .tgz
3. The binary expands in the specified location on each TaskTracker
4. TaskTrackers not involved in the first search receive the binary in subsequent searches
15. Data Processing Pipeline
Raw data (HDFS) → Custom processing (MapReduce/Java) → stdin → Indexing pipeline → Search pipeline (splunkd/C++)
You can plug in data preprocessors, e.g. Apache Avro or format readers
Indexing pipeline: event breaking, timestamping
Search pipeline: event typing, lookups, tagging, search processors
16. Hunk Applies Schema on the Fly
Hunk applies schema for all fields, including transactions, at search time
• Structure applied at search time
• No brittle schema to work around
• Automatically find patterns and trends
17. Mixed-mode Search
Streaming: transfers the first several blocks from HDFS to the Hunk Search Head for immediate processing
Reporting: pushes computation to the DataNodes and TaskTrackers for the complete search
• Hunk starts the streaming and reporting modes concurrently
• Streaming results show until the reporting results come in
• Allows users to search interactively by pausing and refining queries
18. Flexible, Iterative Workflow for Business Users
Explore → Analyze → Model → Pivot → Visualize → Share
Interactive Analytics
• Preview results
• Normalization as it's needed
• Faster implementation and flexibility
• Easy search language + data models & pivot
• Multiple views into the same data
This session is designed for audiences who have seen an introduction to Hunk and would like a more comprehensive understanding of how Hunk works. I’ll cover each of these eight topics.
Hunk is a new product for organizations deploying Hadoop and is priced and packaged separately from Splunk Enterprise. A Splunk Enterprise license is not required to run Hunk. Hunk is the integrated analytics platform for data in Hadoop. It supports business use cases that unlock the value of data stored in Hadoop:
– Data analytics to launch and optimize products and services
– Synthesis of data from all customer touch points
– Comprehensive security analytics for modern threats
– Easier app development than in raw Hadoop, with tools and frameworks that developers already know
Easy to use for any business or IT user, versus the scarce skills needed to manually write MapReduce jobs or define Hive data schemas. Fully integrated analytics product: explore, analyze, visualize, create dashboards, create data models, pivot, and share. No fixed schema to search raw and unstructured data. Preview results while MapReduce jobs start. Easier app development than in raw Hadoop.
Hunk is essentially the Splunk Enterprise technology stack sitting on top of Hadoop, with some limitations (no real time, and several functions in the Splunk processing language that do not apply to virtual indexes). Hunk is a high-performance, scalable software server written in C/C++ and Python. It indexes and searches logs and other big data stored in the Hadoop Distributed File System (HDFS) or MapR's proprietary variant of HDFS. Hunk works with machine data generated by any application, server or device. The Splunk Developer API is accessible via REST, SOAP or the command line. After downloading Hunk, installing it on your choice of 64-bit Linux operating system, and starting it, you'll find two Hunk server processes running on your host: splunkd and splunkweb. splunkweb is a Python-based application server providing the Splunk Web user interface; it allows users to search and navigate machine data virtually indexed by Hunk servers and to manage your Hunk deployment through the browser. splunkd is a distributed C/C++ server that creates a virtual index from machine data and handles search requests. An ODBC driver (in beta as of September 2013) will provide integration with third-party data visualization software.
Connect Hunk to your Hadoop cluster as an external results provider. The external results provider is a search-time helper process responsible for: accessing the external system (Hadoop); translating or interpreting the search request; and pushing as much of the computation as possible to the external system. Connect to the Hadoop Distributed File System (HDFS) and MapReduce from the Apache downloads or from your choice of Hadoop distribution, including Cloudera, Hortonworks, MapR or Pivotal. Hunk requires only basic Hadoop: MapReduce and HDFS (or MapR's proprietary variant of HDFS). You can continue to use additional projects and subprojects with your Hadoop cluster.
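For concreteness, a virtual-index connection like this is declared in indexes.conf with a provider stanza and a virtual index that references it. The sketch below is hypothetical: host names, paths, and the index name are placeholders, and the vix.* property names follow Hunk 6.x-era documentation, so verify them against the docs for your version.

```ini
# Hypothetical indexes.conf sketch (verify property names against your Hunk version)
[provider:my-hadoop]
vix.family = hadoop
vix.env.JAVA_HOME = /usr/lib/jvm/java-6            # Java 1.6+ required
vix.env.HADOOP_HOME = /opt/hadoop                  # Hadoop client libraries
vix.fs.default.name = hdfs://namenode.example.com:8020
vix.splunk.home.hdfs = /user/hunk/workdir          # HDFS scratch space for Hunk

[weblogs_vix]
vix.provider = my-hadoop
vix.input.1.path = /data/weblogs/...
```

Searches against `index=weblogs_vix` would then be translated into MapReduce jobs against the paths the virtual index covers.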
Connect Hunk to multiple Hadoop clusters.
There are significant challenges with these approaches to asking and answering questions of data in Hadoop. Not shown is a less common option, spreadsheet-like interfaces, which raises its own problems: these tools are batch job builders with no interactive engine, and their "spreadsheet-like" interfaces are not actually Microsoft Excel or Apple Numbers.
Hunk (Splunk Analytics for Hadoop) is a full-featured, integrated product that delivers interactive data exploration, analysis and visualization for Hadoop.
Full-featured, integrated product: delivers interactive data exploration, analysis and visualization for Hadoop.
Insights for everyone: empowers broader user groups to derive actionable insights from raw data in Hadoop.
Works with what you have today: works with leading Hadoop distributions to maximize enterprise technology investments.
Hunk does not replace your Hadoop distribution: Hunk coexists with your Apache HDFS & MapReduce downloads or your Hortonworks, Cloudera, or MapR distribution.
Hunk does not replace or require Splunk Enterprise: Hunk is a separate product designed for new use cases involving data in Hadoop.
Iterative search but not real time or needle-in-the-haystack searches; that's Splunk Enterprise.
No data ingest management; for that, use tools from Apache Hadoop or from your Hadoop distribution vendor, or Hadoop connectors from enterprise software or business intelligence vendors.
Notes: needle in a haystack = one-in-a-million searches.
Splunk Enterprise is a standalone solution and the industry-leading platform for machine data with all of Splunk's core use cases. For customers who are storing historical data in Hadoop, we offer Hunk to run analytics on data stored natively in Hadoop. Hunk targets new use cases, including:
– Data analytics for new product and service launches
– Synthesis of data from all customer touch points
– Comprehensive security analytics for modern threats
– Easier big data app development than in raw Hadoop
Furthermore, you can use Splunk Hadoop Connect to send data between Splunk Enterprise and Hadoop. Many accounts may decide to buy both Splunk Enterprise, for real-time monitoring and real-time search, together with Hunk, for exploratory analytics of historical data stored in Hadoop. With this combination, you can run searches across native indexes in Splunk Enterprise and Hunk virtual indexes for data in Hadoop.
Hunk offers a rich developer platform and tool chain that includes a robust API and software developer kits in Java, JavaScript, Python, PHP, C# and Ruby, enabling developer teams to rapidly build powerful big data applications. Activity on DEV.SPLUNK.COM highlights a strong developer community.
What you'll need to get started:
– Data in Hadoop to analyze
– Hadoop client libraries, from your Hadoop distribution vendor or from http://archive.apache.org/dist/hadoop/core/
– Hadoop access rights: Hunk requires permission to read from HDFS and run MapReduce jobs
– Java 1.6+
– HDFS scratch space: the amount depends on the size of the interim results; between 10 and 20 GB is common
– DataNode local temp disk space: at most 5 GB per DataNode
On the first search, MapReduce auto-populates the Splunk binaries. The orchestration process begins when Hunk copies the Hunk binary .tgz file to HDFS. Hunk supports both the MapReduce JobTracker and the YARN MapReduce Resource Manager. Each TaskTracker (called ApplicationContainer in YARN) fetches the binary. The binary files expand in the specified location on each TaskTracker; the default location is configurable. TaskTrackers not involved in the first search will receive the Hunk binary in a subsequent search that involves them. This process is one example of why Hunk needs some scratch space in HDFS and in the local file system (TaskTrackers / DataNodes).
Background on Hadoop: typically a Hadoop cluster has a single master and multiple worker nodes. The master node (also referred to as the NameNode) coordinates reads and writes to worker nodes (also referred to as DataNodes). HDFS reliability is achieved by replicating data across multiple machines; by default the replication factor is 3 and the chunk size is 64 MB. The JobTracker dispatches tasks to worker nodes (TaskTrackers) in the cluster. Priority is given to nodes that host the data upon which the task will operate. If the task cannot be run on that node, next priority is given to neighboring nodes (to minimize network traffic). Upon job completion, each worker node writes its own results locally, and HDFS ensures replication across the cluster.
HDFS = NameNode + DataNodes. MapReduce engine = JobTracker + TaskTrackers.
Search execution: the Hunk Search Head takes the list of contents of directories in the virtual index. The Search Head filters directories and files based on the search and time range (partition pruning). The search process reads metadata from the NameNode and JobTracker (MapReduce Resource Manager in YARN), computes File Splits, and constructs and submits the MapReduce jobs. Hunk streams a few File Splits from HDFS and processes them in the Search Head to provide quick previews. The Search Head consumes and merges the MapReduce results (providing incremental previews) while the MapReduce jobs run. The DataNodes run a copy of splunkd to process the jobs and write results to a working directory in HDFS. Final results are stored in the Hunk Search Head.
Hunk uses the Splunk Search Processing Language, the industry-leading method for interactive data exploration across large, diverse data sets. There is no requirement to "understand" data up front. Customers of Splunk Enterprise can reuse their Search Processing Language knowledge and skills for data stored in Hadoop. Any command whose output depends on the event input order can yield different results: Splunk Enterprise guarantees that events are delivered in descending time order, while Hunk does not. This is why transaction and localize do not work.
We can see results from the intermediate Hadoop Map jobs streamed into the Splunk UI even before all the Map jobs are finished, and once all the Hadoop Maps are done processing, Splunk displays the full results. In essence, Splunk acts as the Hadoop Reduce phase, and there is no need to use Hadoop for that phase.
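The partition-pruning step can be sketched in plain Java. This is an illustration only, assuming a hypothetical directory layout in which the fourth path segment is a date (e.g. /data/weblogs/2013-09-30/part-00000); Hunk's real pruning logic is internal and driven by the virtual-index configuration.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch: keep only the virtual-index paths whose date segment
// falls inside the search time range, so MapReduce never touches the rest.
public class PartitionPruning {
    public static List<String> prune(List<String> paths, LocalDate from, LocalDate to) {
        return paths.stream()
                .filter(p -> {
                    // In this hypothetical layout, segment 3 is the date directory.
                    LocalDate d = LocalDate.parse(p.split("/")[3]);
                    return !d.isBefore(from) && !d.isAfter(to);
                })
                .collect(Collectors.toList());
    }
}
```

Pruning whole directories before computing File Splits is what keeps a time-bounded search from scanning the entire cluster.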
Before data is processed by Hunk, you can plug in your own data preprocessor. Preprocessors have to be written in Java and can transform the data before Hunk gets a chance to. Data preprocessors vary in complexity from simple translators (say, Avro to JSON) to image, video or document processing. Hunk translates Avro to JSON; these translations happen on the fly and are not persisted.
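To make the preprocessor idea concrete, here is a toy translator in the same spirit: it maps a delimited record to a JSON string before downstream parsing. This is not the real Hunk preprocessor interface; the class and method names are hypothetical.

```java
import java.util.StringJoiner;

// Toy record translator illustrating what a Java preprocessor might do:
// rewrite each raw record into a form Hunk can parse (here, CSV -> JSON).
public class RecordPreprocessor {
    public static String toJson(String[] fieldNames, String csvLine) {
        String[] values = csvLine.split(",");
        StringJoiner json = new StringJoiner(",", "{", "}");
        for (int i = 0; i < fieldNames.length; i++) {
            json.add("\"" + fieldNames[i] + "\":\"" + values[i] + "\"");
        }
        return json.toString();
    }
}
```

As in Hunk's Avro-to-JSON case, such a translation would run on the fly per record, with nothing persisted.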
Hunk applies structure at search time.
– Designed for data exploration across large datasets: preview data and iterate quickly
– No requirement to understand the data up front
– No limit to the number of results returned by Hadoop or the number of searches
– No brittle schema to maintain or update
– Find patterns and trends across disparate data sets in a "grab bag" Hadoop cluster
– Use the Search Processing Language or create data models and pivot
Unlike Splunk Enterprise, Hunk applies schema for all fields, including transactions and localizations, at search time.
MapReduce considerations: stats/chart/timechart/top/etc. commands work well in a distributed environment; they MapReduce well. Time- and order-dependent commands don't work well in a distributed environment; they don't MapReduce well.
For large summary indexes, consider a dedicated "summarizer" instance with plenty of CPU to execute search jobs. Summary jobs won't interfere with user searches, and the summarizer aggregates and stores the results away from the indexers.
Report acceleration is not supported by Hunk 6.0 but may be supported in a future release.
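A small sketch of why `stats sum(count) by word` MapReduces well: per-split partial sums can be merged in any order, because summation is associative and commutative. Each mapper aggregates its own split locally and the search head only merges small partial results; the class below is illustrative, not Hunk code.

```java
import java.util.HashMap;
import java.util.Map;

// Merging two per-split partial aggregates for `stats sum(count) by word`.
// Because the merge is order-independent, splits can finish in any order.
public class PartialStats {
    public static Map<String, Long> merge(Map<String, Long> a, Map<String, Long> b) {
        Map<String, Long> out = new HashMap<>(a);
        b.forEach((word, count) -> out.merge(word, count, Long::sum));
        return out;
    }
}
```

Order-dependent commands have no such merge step, which is why they don't distribute cleanly.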
Hunk starts the streaming and reporting modes concurrently. Streaming results show until the reporting results come in. This allows users to search interactively by pausing and refining queries. It is a major, unique advantage of Hunk compared to alternative approaches such as Hive or SQL on Hadoop, which require a fixed schema in an effort to speed up searches, while Hunk retains the combination of schema on the fly with results preview.
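The mixed-mode behavior can be sketched as a simple result-selection rule: accumulate streaming previews as splits arrive, then switch to the complete reporting results once the MapReduce jobs finish. The names below are illustrative, not Hunk's internals.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of mixed-mode search: previews accumulate while MapReduce runs,
// and the UI swaps to the full reporting results once they are complete.
public class MixedModeSearch {
    private final List<String> streamingPreview = new ArrayList<>();
    private List<String> reportingResults = null;

    public void onStreamedEvents(List<String> events) {
        streamingPreview.addAll(events);   // streaming mode: partial, immediate
    }

    public void onReportingComplete(List<String> finalResults) {
        reportingResults = finalResults;   // reporting mode: complete, slower
    }

    public List<String> resultsToShow() {
        return reportingResults != null ? reportingResults : streamingPreview;
    }
}
```

The key property is that the user always has something to look at, so a query can be paused and refined before the full job completes.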
Pause or stop jobs in progress and revise queries interactively. We're mindful of the resources we use in Hadoop. Pause in Hunk pauses in the Search Head; the Hadoop jobs keep running until the TCP buffer runs out. If you abandon a search for more than 30 seconds, Hunk will kill the search.
There's no one path to explore data. Preview results and refine your queries. Hunk applies normalization as it's needed, for faster implementation and flexibility. Hunk supports the easy-to-use Splunk search processing language along with data models and pivot to provide multiple views into the same data. Find insights following a flexible, iterative workflow. I'll touch on each component of the data workflow; there is no one set way to explore data, so go back and forth across components at the speed of thought.
Explore: search and explore data from one place. Powerful Search Processing Language (SPL). Designed for data exploration across large datasets. Preview data, iterate quickly. No fixed schema. No requirement to "understand" data up front. Easy-to-use interactive analytics.
Analyze: deep analysis. Pattern detection. Find anomalies. Over 100 statistical commands.
Model: make unstructured data more valuable. Describes how underlying machine data is represented and accessed. Defines hierarchical relationships. Enables a single authoritative view of underlying raw data.
Pivot: powerful analytics anyone can use. Drag-and-drop interface. Easily build complex queries and reports. Click to visualize chart types. Reports dynamically update.
Visualize: interactive reporting and visualization of data. Interactive reports view. Rapidly build advanced graphs and charts. Generate visualizations on the fly. Drill down to raw data in Hadoop. ODBC connector to 3rd-party data visualization software.
Share: build, personalize and share custom dashboards and PDFs. Combine multiple charts, views, reports and external data. Set role and group access security for web dashboards. View and edit on any desktop, tablet or mobile device.
And do all of this from one integrated platform for data in Hadoop.