Hunk - Unlocking the Power of Big Data



Explore, Analyze and Visualize Data in Hadoop and NoSQL. Make massive quantities of machine data accessible, usable and valuable for the people who need it, at the speed they need it. Use Hunk to turn underutilized data into valuable insights in minutes, not weeks or months.



Hunk - Unlocking the Power of Big Data

  1. Copyright © 2015 Splunk Inc. Hunk – Unlocking the Power of Big Data
  2. Splunk's Disruptive Approach to Unstructured Data: structured (RDBMS, SQL, ETL, schema at write; 1980-2010) versus unstructured (search, universal indexing, schema at read; 2010+) – volume, velocity, variety.
  3. Splunk Today: a platform for machine data – inputs from forwarders, syslog, TCP, sensors and control systems, mobile, mainframe data, VMware, Exchange, PCI, security, DB Connect, and Stream, with a 600+ ecosystem of apps.
  4. Copyright © 2015 Splunk Inc. Splunk – Big Data Engine
  5. Splunk – Big Data Technologies: relational databases (highly structured; SQL and MapReduce – Oracle, MySQL, IBM DB2, Teradata), Hadoop (distributed file system, semi-structured; HDFS storage + MapReduce), NoSQL (key/value, columnar, or other semi-structured stores – Cassandra, Accumulo, MongoDB), and Splunk (temporal, unstructured, heterogeneous; real-time indexing).
  6. Massive Linear Scalability to Tens of TBs/Day: send data from 1000s of servers using a combination of Splunk Forwarders, syslog, WMI, message queues, or other remote protocols; auto load-balanced forwarding to as many Splunk Indexers as you need to index terabytes/day; offload search load to Splunk Search Heads. Automatic load balancing linearly scales indexing; distributed search and MapReduce linearly scale search and reporting.
  7. Splunk Real-Time Analytics: data arrives via monitor, TCP/UDP, and scripted inputs, flows through the parsing queue into the parsing pipeline (source and event typing, character set normalization, line breaking, timestamp identification, regex transforms), then through the index queue into the indexing pipeline, which writes raw data and index files to the Splunk Index; a real-time buffer feeds the real-time search process.
  8. Search Head Clustering: the ability to group search heads into a cluster in order to provide highly available and scalable search services to thousands of users in mission-critical enterprise deployments.
  9. Splunk Index Replication – High Availability (default is 3x replication): (1) the master auto-detects that a peer is down; (2) the master asks the redundant peer to act as primary; (3) the peers copy the search files, index files, and raw data.
  10. Copyright © 2015 Splunk Inc. Hunk – Hadoop
  11. Splunk and Hadoop: Hunk – main use case: analyze Hadoop data using Hadoop processing. Splunk Hadoop Connect – main use case: real-time export of data from Splunk to Hadoop. Hunk Archive – main use case: archive Splunk indexes to Hadoop. Splunk HadoopOps – main use case: monitor Hadoop.
  12. Integrated Analytics Platform: a full-featured, integrated product; insights for everyone; works with what you have today. Explore, analyze, visualize, dashboard, and share data from Hadoop clusters and NoSQL and other data stores, via Hadoop client libraries and streaming resource libraries for diverse data stores.
  13. Hunk – Unique: 1. runs natively in Hadoop (uses Hadoop MapReduce); 2. mixed mode (allows data preview); 3. auto-deploys splunkd to DataNodes (on-the-fly indexing); 4. access control (many users, many Hadoop directories, Kerberos support); 5. schema on the fly.
  14. Run Natively in Hadoop: diagram of the Hunk search head dispatching Hadoop MR jobs – external resource (e.g. hadoop.prod), MapReduce jobs, tasks/working directory, index on the data nodes, NameNode, JobTracker (YARN), DataNodes/TaskTrackers, HDFS.
  15. Mixed-Mode Search: data preview – allows users to search interactively by pausing and refining queries; results stream from the Splunk index immediately while the Hadoop MR jobs run, with a switch-over to the MapReduce results over time.
  16. Indexing On the Fly – Hunk Data Processing: the search process on the search head (ERP) dispatches MapReduce search processes on the TaskTrackers, which preprocess raw data from HDFS; remote results are merged by the search head into the final search results.
  17. Role-Based Security for Shared Clusters – Pass-through Authentication: provide role-based security for Hadoop clusters; access Hadoop resources under security and compliance; integrates with Kerberos for Hadoop security. Example mappings: Business Analyst → queue Biz Analytics, Marketing Analyst → queue Marketing, Sys Admin → queue Prod.
  18. Managed Archiving – Splunk Enterprise to Hunk/HDFS (warm → cold → frozen): archive buckets to Hadoop (HDFS) instead of freezing buckets or throwing data away; store old data at up to 1/10 the cost in cheap Hadoop batch storage instead of SANs; optimize Splunk Enterprise search head performance for real-time monitoring, alerting, and dashboarding with short-term historical context; use Hunk to search, analyze, and visualize months or years of historical data in Hadoop; run federated queries and dashboards across Splunk Enterprise and Hunk.
  19. Hunk Enables Hadoop as Self Service
  20. Yahoo – Visualizing Hadoop: 600PB of data; very large clusters used by many groups across the enterprise; 35,000 individual DataNodes; Hadoop is provided as a self service. Example dashboard search (1,175,726 events, last 7 days, 5/20/14 to 5/27/14), charting GB-hours by queue:
      index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue
  21. Vantrix – Mobile Media Optimization: 144 Hadoop nodes, 69 TB SSD storage; an analytics application for 10 million subscribers generating 80GB of raw session log data per day and 26 million video data session records. Hunk query: 20 seconds to search through 27M events, returning 4.7M events. Hunk as indexer: automatically indexed and counted field value occurrences. Hunk as self service: proved invaluable for identifying and exploring use cases. Hunk business value: helps identify when subscribers abandon video.
  22. Thank You

Editor's Notes

  • But listening to your machine data isn’t as easy as it sounds.

    Machine Data is different:
    It is voluminous unstructured time series data with no predefined schema
    It is generated by all IT systems – from servers and applications to RFIDs and wire data.
    It is non-standard data and characterized by many unpredictable and changing formats
     
    Because of this, machine data cannot be managed using traditional approaches.
    Traditional approaches require you to transform your data and force fit it into a brittle schema – They aren’t designed to handle the inconsistent machine data formats
    Traditional approaches are designed with specific use cases and queries in mind – they limit the problems that you can solve
    Traditional approaches rely on siloed tools that are designed for structured data approaches and legacy computing environments – They are inherently limited in their ability to scale
     
    To listen to your machine data, you need a solution with no limits:
    No limits on the formats of data
    No limits on where you can collect the data from
    No limits on the questions that you can ask and the use cases you can solve.
    And no limits on scale.
     
    You need a solution that can keep up with Machine Data.
     
  • Since then, Splunk has invested significantly to expand from a search tool to a mission-critical platform. The platform includes hundreds of data types and can scale to massive volumes
    Today, it’s more than Splunk Enterprise, we’ve added Splunk Cloud, Hunk, Splunk MINT for mobile intelligence; and have more than 600 Apps.

    Machine data is more than logs! It's wire data, mainframe data, mobile device data, sensor data, and metrics.

    Your use cases have evolved well beyond troubleshooting so we’re investing in solutions that leverage the power of Splunk Enterprise to provide you with packaged views into your data for faster, deeper insights.

    Our most well-known solution is Splunk Enterprise Security and if you aren’t using it yet, we encourage you to find out why it’s turning the traditional SIEM market upside down.
  • How has big data evolved over time? For a long time, "big data" was simply a large database.

    To handle large data, the database industry moved to many smaller databases. Horizontal partitioning (also known as sharding) is a database design principle whereby rows of a database table are held separately (for example, A -> D in one database, E -> H in a second database, and so on).

    Hadoop, an open source Apache project inspired by Google's MapReduce and GFS papers, has been adopted as the de-facto big data system and has evolved rapidly into a major technology movement. It has emerged as a popular way to handle massive amounts of data, including structured and complex unstructured data. Its popularity is due in part to its ability to store and process large amounts of data effectively across clusters of commodity hardware. Apache Hadoop is not actually a single product but a collection of several components. For the most part, Hadoop is a batch-oriented system.
    ** Teradata Aster Data & SQL on Hadoop are SQL interface systems that can talk to Hadoop
    ** Cassandra & HBase are NoSQL databases that can process data using a key/value model in real time.

    Splunk = Temporal, Unstructured, Heterogeneous, real-time analytics platform.
  • Splunk allows you to divide up the work of search and indexing across as many servers as you need to achieve the performance and scale you require. Using work-dividing techniques such as MapReduce, Splunk can take a single search and query as many indexers as you need to complete the job, allowing you to use inexpensive commodity hardware in massively parallel clusters.

    For example, if you had 1 million events to search, one indexer can easily complete that search, but it will take a little time – let's say 30 seconds. However, if the same million events were spread across 10 indexers, the same search would complete in 3 seconds. How fast your searches run is yours to control by adding indexers as desired (see the sketch below).
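
    As a hedged illustration of this distribution (the search assumes only default Splunk metadata fields, nothing specific to any deployment), tstats can show how events are spread across the indexers:

      | tstats count where index=* by splunk_server

    Each row of the result is one indexer and the number of events it holds, which makes the even spread produced by automatic load balancing visible.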
  • For the most part, you can use monitor to add nearly all your data sources from files and directories. However, you might want to use upload to add one-time inputs, such as an archive of historical data.

    You can enable Splunk to accept an input on any TCP or UDP port; Splunk consumes any data sent on these ports. Use this method for syslog (default port is UDP 514), or set up netcat and bind to a port. TCP is the protocol underlying Splunk's data distribution and is the recommended method for sending data from any remote machine to your Splunk server. Splunk can index remote data from syslog-ng or any other application that transmits via TCP.

    However, there are times when you want to use scripts to feed data to Splunk for indexing, or to prepare data from a non-standard source so Splunk can properly parse events and extract fields. You can use shell scripts, Python scripts, Windows batch files, PowerShell, or any other utility that can format and stream the data that you want Splunk to index. You can stream the data to Splunk or write the data from a script to a file.

    All data that comes into Splunk enters through the parsing pipeline as large chunks. During parsing, Splunk breaks these chunks into events which it hands off to the indexing pipeline, where final processing occurs. During both parsing and indexing, Splunk acts on the data, transforming it in various ways. Most of these processes are configurable, so you have the ability to adapt them to your needs.

    To kick off a real-time search in Splunk Web, use the time range menu to select a preset real-time time range window, such as 30 seconds or 1 minute. You can also specify a sliding time range window to apply to your real-time search. This defines a real-time buffer (a sketch of such a search follows this note).

    The Splunk Index is the repository for Splunk Enterprise data. Splunk Enterprise transforms incoming data into events, which it stores in indexes.
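
    A minimal sketch of such a real-time search (it reads Splunk's own _internal index, used here only as a convenient data source); the rt- time modifiers define the sliding 30-second window described above:

      search earliest=rt-30s latest=rt index=_internal
      | stats count by sourcetype

    The buffer keeps only the last 30 seconds of events, and the counts update continuously as new events arrive.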


  •  
    Faster Recovery II –
     
    If you look at the screen: the two indexers on the left with green cylinders hold searchable copies of the data; the two indexers on the right hold only raw data.
    When a peer goes down, the master waits for the heartbeat timeout and marks the peer down.
    It then reassigns primaries to another peer and tries to enforce the replication policy by making copies of the raw data and search files.
    In 5.0, search files are generated on each peer from the raw data; in 6.0, the search files are copied over from a peer that already has them instead of being regenerated.
    These statistics are from our internal tests...
    Another point to note: generating search files from the raw data is CPU-intensive compared to copying search files.
     
  • Quick to set up; scales to multiple concurrent databases
    Enrich machine data with structured data from relational databases
    Execute database queries directly from the Splunk user interface (a hedged example follows this note)
    Browse and navigate database schemas and tables
    Combine machine data with structured data from relational databases
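
    A hedged sketch of executing such a query from the Splunk search bar (the connection name and SQL are hypothetical, and the exact command name depends on the DB Connect version):

      | dbxquery connection="orders_db" query="SELECT customer_id, status FROM orders"

    The rows returned by the database become searchable events that can be joined with or used to enrich machine data in the same search.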
  • Search execution:
    The Hunk search head takes the list of directory contents in the virtual index, then filters directories and files based on the search and time range (partition pruning).
    The NameNode and JobTracker (the MapReduce Resource Manager in YARN) read data from the MapReduce framework and feed it to the search process, which computes file splits and constructs and submits the MapReduce jobs.
    Hunk streams a few file splits from HDFS and processes them in the search head to provide quick previews; the search head consumes and merges the MapReduce results (providing incremental previews) while the MapReduce jobs kick off.
    The data nodes run a copy of splunkd to process the jobs and write results to a working directory in HDFS.
    Final results are stored on the Hunk search head.

    Hunk utilizes the Splunk Search Processing Language (SPL), the industry-leading method for interactive data exploration across large, diverse data sets. There is no requirement to "understand" data up front. Customers of Splunk Enterprise can reuse their SPL knowledge and skill set for data stored in Hadoop. Any command whose output depends on the event input order would yield different results, because Splunk Enterprise guarantees events are delivered in descending time order and Hunk does not. This is why transaction and localize do not work (an order-independent alternative is sketched after this note).

    We can see the results from the intermediate Hadoop Map jobs being streamed into the Splunk UI even before all the Map jobs are finished; once all of them are done processing, Splunk displays the full results.

    In essence, Splunk acts as the Hadoop Reduce phase and there is no need to use Hadoop for that phase.
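
    As a sketch of what this means in practice (the virtual index and field names are hypothetical), a Hunk search reads like any other SPL search, with stats serving as an order-independent alternative to transaction:

      index=weblogs_vix sourcetype=access_combined
      | stats min(_time) as session_start max(_time) as session_end count by clientip

    Because stats does not depend on the order in which events arrive from the MapReduce tasks, it gives a session-style summary of the kind transaction is often used for, without relying on the ordering guarantees Hunk cannot provide.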

  • Hunk starts the streaming and reporting modes concurrently; streaming results are shown until the reporting results come in. This allows users to search interactively by pausing and refining queries.

    This is a major, unique advantage of Hunk compared to alternative approaches such as Hive or SQL-on-Hadoop, which require a fixed schema in an effort to speed up searches, while Hunk retains the combination of schema on the fly and results preview.
  • In this new feature, planned for the next Hunk release (version 6.2.1), buckets are archived to Hadoop (the Hadoop Distributed File System, or HDFS) instead of being frozen or thrown away. This significantly lowers the total cost of ownership (TCO) for Splunk Enterprise installations while giving security analysts, risk managers, and marketers access to the months or years of historical data integral to their job success.
    Store old data at up to 1/10 the cost in cheap Hadoop batch storage instead of SANs
    Optimize Splunk Enterprise search head performance for real-time monitoring, alerting and dashboarding with short-term historical context
    Use Hunk to search, analyze and visualize months or years of historical data in Hadoop
    Run federated queries and dashboards across Splunk Enterprise and Hunk (a hedged sketch follows this note)
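
    A hedged sketch of such a federated query (both index names are hypothetical): a native Splunk Enterprise index holding recent events and an HDFS-backed archive virtual index are unioned in a single search:

      index=web OR index=web_archive_vix status=500
      | timechart span=1d count

    The same dashboard panel can thus chart recent data served by Splunk Enterprise alongside months of archived history served by Hunk.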
  • Indexing
