SlideShare ist ein Scribd-Unternehmen logo
1 von 33
Downloaden Sie, um offline zu lesen
1
Adding Search to the
Hadoop Ecosystem
Gregory Chanan (gchanan AT cloudera.com)
SF HUG August 2013
Agenda
• Big Data and Search – setting the stage
• Cloudera Search Architecture
• Component deep dive
• Security
• Conclusion
Why Search?
• Hadoop for everyone
• Typical case:
• Ingest data to storage engine (HDFS, HBase, etc)
• Process data (MapReduce, Hive, Impala)
• Experts know MapReduce
• Savvy people know SQL
• Everyone knows Search!
Why Search?
An Integrated Part of
the Hadoop System
One pool of data
One security framework
One set of system resources
One management interface
Benefits of Search
• Improved Big Data ROI
• An interactive experience without technical knowledge
• Single data set for multiple computing frameworks
• Faster time to insight
• Exploratory analysis, esp. unstructured data
• Broad range of indexing options to accommodate needs
• Cost efficiency
• Single scalable platform; no incremental investment
• No need for separate systems, storage
What is Cloudera Search?
• Full-text, interactive search with faceted navigation
• Batch, near real-time, and on-demand indexing
• Apache Solr integrated with CDH
• Established, mature search with vibrant community
• In production environments for years
• Open Source
• 100% Apache, 100% Solr
• Standard Solr APIs
• In public beta (version 0.9.3)
Cloudera Search Components
• HDFS/MR/Lucene/Solr/SolrCloud
• Indexing
• Near Real Time (NRT) indexing
• Batch
• ETL – Cloudera Morphlines
• Querying
Apache Hadoop
• Apache HDFS
• Distributed file system
• High reliability
• High throughput
• Apache MapReduce
• Parallel, distributed programming model
• Allows processing of large datasets
• Fault tolerant
Apache Lucene
• Full text search
• Indexing
• Query
• Traditional inverted index
• Batch and Incremental indexing
• We are using version 4.3 in current release
Apache Solr
• Search service built using Lucene
• Ships with Lucene (same TLP at Apache)
• Provides XML/HTTP/JSON/Python/Ruby/… APIs
• Indexing
• Query
• Administrative interface
• Also rich web admin GUI via HTTP
Apache SolrCloud
• Provides distributed Search capability
• Part of Solr (not a separate library/codebase)
• Shards – provide scalability
• partition index for size
• replicate for query performance
• Uses ZooKeeper for coordination
• No split-brain issues
• Simplifies operations
Distributed Search on Hadoop
Flume
Hue UI
Custom
UI
Custom
App
Solr
Solr
Solr
SolrCloud
query
query
query
index
Hadoop Cluster
MR
HDFS
index
HBase
index
ZK
Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch (MR)
Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch (MR)
Near Real Time Indexing with Flume
Log File
Solr and Flume
• Data ingest at scale
• Flexible extraction and
mapping
• Indexing at data ingest
HDFS
Flume
Agent
Indexer
Other
Log File
Flume
Agent
Indexer
15
Apache Flume - MorphlineSolrSink
• A Flume Source…
• Receives/gathers events
• A Flume Channel…
• Carries the event – MemoryChannel or reliable FileChannel
• A Flume Sink…
• Sends the events on to the next location
• Flume MorphlineSolrSink
• Integrates Cloudera Morphlines library
• ETL, more on that in a bit
• Does batching
• Results sent to Solr for indexing
Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch (MR)
Near Real Time Indexing of Apache HBase
HDFS
HBase
interactiveload
HBase
Indexer(s)
Trigger Solr server
Solr server
Solr server
Solr server
Solr server
Search
+ =
planet-sized tabular data
immediate access & updates
fast & flexible information
discovery
BIG DATA DATAMANAGEMENT
Lily HBase Indexer
• Collaboration between NGData & Cloudera
• NGData are creators of the Lily data management platform
• Lily HBase Indexer
• Service which acts as a HBase replication listener
• HBase replication features, such as filtering, supported
• Replication updates trigger indexing of updates (rows)
• Integrates Cloudera Morphlines library for ETL of rows
• AL2 licensed on github https://github.com/ngdata
Indexing
• Near Real Time (NRT)
• Flume
• HBase Indexer
• Batch (MR)
Scalable Batch Indexing
Index
shard
Files
Index
shard
Indexer
Files
Solr
server
Indexer
Solr
server
21
HDFS
Solr and MapReduce
• Flexible, scalable batch
indexing
• Start serving new indices
with no downtime
• On-demand indexing, cost-
efficient re-indexing
MapReduce Indexer
MapReduce Job with two parts
1) Scan HDFS for files to be indexed
• Much like Unix “find” – see HADOOP-8989
• Output is NLineInputFormat’ed file
2) Mapper/Reducer indexing step
• Mapper extracts content via Cloudera Morphlines
• Reducer indexes documents via embedded Solr server
• Originally based on SOLR-1301
• Many modifications to enable linear scalability
MapReduce Indexer “golive”
• Cloudera created this to bridge the gap between NRT
(low latency, expensive) and Batch (high latency,
cheap at scale) indexing
• Results of MR indexing operation are immediately
merged into a live SolrCloud serving cluster
• No downtime for users
• No NRT expense
• Linear scale out to the size of your MR cluster
Cloudera Morphlines
• Open Source framework for simple ETL
• Ships as part Cloudera Developer Kit (CDK)
• It’s a Java library
• AL2 licensed on github https://github.com/cloudera/cdk
• Simplify ETL
• Built-in commands and library support (Avro format, Hadoop
SequenceFiles, grok for syslog messages)
• Configuration over coding
• Standardize ETL
Cloudera Morphlines Architecture
Solr
Solr
Solr
SolrCloud
Logs, tweets, social
media, html,
images, pdf, text….
Anything you want
to index
Flume, MR Indexer, HBase indexer, etc...
Or your application!
Morphline Library
Morphlines can be embedded in any application…
Extraction and Mapping
• Modeled after Unix
pipelines
• Simple and flexible data
transformation
• Reusable across multiple
index workloads
• Over time, extend and re-
use across platform
workloads
syslog Flume
Agent
Solr sink
Command: readLine
Command: grok
Command: loadSolr
Solr
Event
Record
Record
Record
Document
MorphlineLibrary
Morphline Example – syslog with grok
morphlines : [
{
id : morphline1
importCommands : ["com.cloudera.**", "org.apache.solr.**"]
commands : [
{ readLine {} }
{
grok {
dictionaryFiles : [/tmp/grok-dictionaries]
expressions : {
message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp}
%{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:[%{POSINT:syslog_pid}])?:
%{GREEDYDATA:syslog_message}"""
}
}
}
{ loadSolr {} }
]
}
]
Example Input
<164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22
Output Record
syslog_pri:164
syslog_timestamp:Feb 4 10:46:14
syslog_hostname:syslog
syslog_program:sshd
syslog_pid:607
syslog_message:listening on 0.0.0.0 port 22.
Current Command Library
• Integrate with and load into Apache Solr
• Flexible log file analysis
• Single-line record, multi-line records, CSV files
• Regex based pattern matching and extraction
• Integration with Avro
• Integration with Apache Hadoop Sequence Files
• Integration with SolrCell and all Apache Tika parsers
• Auto-detection of MIME types from binary data using
Apache Tika
Current Command Library (cont)
• Scripting support for dynamic java code
• Operations on fields for assignment and comparison
• Operations on fields with list and set semantics
• if-then-else conditionals
• A small rules engine (tryRules)
• String and timestamp conversions
• slf4j logging
• Yammer metrics and counters
• Decompression and unpacking of arbitrarily nested
container file formats
• Etc…
Querying
• Built-in solr web UI
• Write your own
• Hue
Simple, Customizable Search Interface
Hue
• Simple UI
• Navigated, faceted drill
down
• Customizable display
• Full text search,
standard Solr API and
query language
Security
• Upstream Solr doesn’t really deal with security
• Goal: use kerberos, like other CDH components
• Current release: Support for kerberos authentication
• Actively working on Index-level authorization
• Future: more granular authorization
Conclusion
• Cloudera Search now in public beta
• Free Download
• Extensive documentation
• Send your questions and feedback to search-
user@cloudera.org
• Take the Search online training
• Cloudera Manager Standard (i.e. the free version)
• Simple management of Search
• Free Download
• QuickStart VM also available!

Weitere ähnliche Inhalte

Was ist angesagt?

Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemCloudera, Inc.
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and SparkLucidworks
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingDatabricks
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineDataWorks Summit
 
HBaseConEast2016: How yarn timeline service v.2 unlocks 360 degree platform i...
HBaseConEast2016: How yarn timeline service v.2 unlocks 360 degree platform i...HBaseConEast2016: How yarn timeline service v.2 unlocks 360 degree platform i...
HBaseConEast2016: How yarn timeline service v.2 unlocks 360 degree platform i...Michael Stack
 
#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on MesosPaco Nathan
 
Lambda Architecture in Practice
Lambda Architecture in PracticeLambda Architecture in Practice
Lambda Architecture in PracticeNavneet kumar
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesAnand Narayanan
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Spark Summit
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Apex
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrRahul Jain
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark PresentationStephen Borg
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCCloudera, Inc.
 
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...Lucidworks
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Rahul Jain
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveDataWorks Summit/Hadoop Summit
 
Searching The Enterprise Data Lake With Solr - Watch Us Do It!: Presented by...
Searching The Enterprise Data Lake With Solr  - Watch Us Do It!: Presented by...Searching The Enterprise Data Lake With Solr  - Watch Us Do It!: Presented by...
Searching The Enterprise Data Lake With Solr - Watch Us Do It!: Presented by...Lucidworks
 

Was ist angesagt? (20)

Adding Search to the Hadoop Ecosystem
Adding Search to the Hadoop EcosystemAdding Search to the Hadoop Ecosystem
Adding Search to the Hadoop Ecosystem
 
Data Engineering with Solr and Spark
Data Engineering with Solr and SparkData Engineering with Solr and Spark
Data Engineering with Solr and Spark
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
 
Bullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query EngineBullet: A Real Time Data Query Engine
Bullet: A Real Time Data Query Engine
 
HBaseConEast2016: How yarn timeline service v.2 unlocks 360 degree platform i...
HBaseConEast2016: How yarn timeline service v.2 unlocks 360 degree platform i...HBaseConEast2016: How yarn timeline service v.2 unlocks 360 degree platform i...
HBaseConEast2016: How yarn timeline service v.2 unlocks 360 degree platform i...
 
#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos#MesosCon 2014: Spark on Mesos
#MesosCon 2014: Spark on Mesos
 
Lambda Architecture in Practice
Lambda Architecture in PracticeLambda Architecture in Practice
Lambda Architecture in Practice
 
Spark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika TechnologiesSpark Internals Training | Apache Spark | Spark | Anika Technologies
Spark Internals Training | Apache Spark | Spark | Anika Technologies
 
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
Sqoop on Spark for Data Ingestion-(Veena Basavaraj and Vinoth Chandar, Uber)
 
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache ApexApache Big Data EU 2016: Building Streaming Applications with Apache Apex
Apache Big Data EU 2016: Building Streaming Applications with Apache Apex
 
Building a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache SolrBuilding a Large Scale SEO/SEM Application with Apache Solr
Building a Large Scale SEO/SEM Application with Apache Solr
 
Tech Spark Presentation
Tech Spark PresentationTech Spark Presentation
Tech Spark Presentation
 
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLCHBaseCon 2012 | HBase for the Worlds Libraries - OCLC
HBaseCon 2012 | HBase for the Worlds Libraries - OCLC
 
spark-kafka_mod
spark-kafka_modspark-kafka_mod
spark-kafka_mod
 
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
Building a Large Scale SEO/SEM Application with Apache Solr: Presented by Rah...
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 
Searching The Enterprise Data Lake With Solr - Watch Us Do It!: Presented by...
Searching The Enterprise Data Lake With Solr  - Watch Us Do It!: Presented by...Searching The Enterprise Data Lake With Solr  - Watch Us Do It!: Presented by...
Searching The Enterprise Data Lake With Solr - Watch Us Do It!: Presented by...
 
Presto - SQL on anything
Presto  - SQL on anythingPresto  - SQL on anything
Presto - SQL on anything
 

Andere mochten auch

Extracting twitter data using apache flume
Extracting twitter data using apache flumeExtracting twitter data using apache flume
Extracting twitter data using apache flumeBharat Khanna
 
Milap Thaker - Biology Powerpoint: Harvard University - DNA Damage Checkpoints
Milap Thaker - Biology Powerpoint: Harvard University -  DNA Damage CheckpointsMilap Thaker - Biology Powerpoint: Harvard University -  DNA Damage Checkpoints
Milap Thaker - Biology Powerpoint: Harvard University - DNA Damage Checkpointsmilapthaker
 
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014Steve Hoffman
 
Large scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloudLarge scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloudDataWorks Summit
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutesdwmclary
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Data Con LA
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingRapheephan Thongkham-Uan
 
Hadoop cluster setup by using cloudera manager
Hadoop cluster setup by using cloudera managerHadoop cluster setup by using cloudera manager
Hadoop cluster setup by using cloudera managerCo-graph Inc.
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLCloudera, Inc.
 
Analyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveAnalyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveIMC Institute
 
Advanced Analytics in Hadoop
Advanced Analytics in HadoopAdvanced Analytics in Hadoop
Advanced Analytics in HadoopAnalyticsWeek
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook
 
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...Cloudera, Inc.
 
Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesIsheeta Sanghi
 

Andere mochten auch (18)

'Flume' Essay
'Flume' Essay 'Flume' Essay
'Flume' Essay
 
Extracting twitter data using apache flume
Extracting twitter data using apache flumeExtracting twitter data using apache flume
Extracting twitter data using apache flume
 
'Flume' Case Study
'Flume' Case Study'Flume' Case Study
'Flume' Case Study
 
Apache flume
Apache flumeApache flume
Apache flume
 
Milap Thaker - Biology Powerpoint: Harvard University - DNA Damage Checkpoints
Milap Thaker - Biology Powerpoint: Harvard University -  DNA Damage CheckpointsMilap Thaker - Biology Powerpoint: Harvard University -  DNA Damage Checkpoints
Milap Thaker - Biology Powerpoint: Harvard University - DNA Damage Checkpoints
 
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
Chicago Hadoop User Group (CHUG) Presentation on Apache Flume - April 9, 2014
 
Large scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloudLarge scale near real-time log indexing with Flume and SolrCloud
Large scale near real-time log indexing with Flume and SolrCloud
 
Flume in 10minutes
Flume in 10minutesFlume in 10minutes
Flume in 10minutes
 
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
Spark Streaming& Kafka-The Future of Stream Processing by Hari Shreedharan of...
 
Apache Flume and its use case in Manufacturing
Apache Flume and its use case in ManufacturingApache Flume and its use case in Manufacturing
Apache Flume and its use case in Manufacturing
 
Hadoop cluster setup by using cloudera manager
Hadoop cluster setup by using cloudera managerHadoop cluster setup by using cloudera manager
Hadoop cluster setup by using cloudera manager
 
Using Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETLUsing Morphlines for On-the-Fly ETL
Using Morphlines for On-the-Fly ETL
 
Analyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and HiveAnalyse Tweets using Flume, Hadoop and Hive
Analyse Tweets using Flume, Hadoop and Hive
 
Advanced Analytics in Hadoop
Advanced Analytics in HadoopAdvanced Analytics in Hadoop
Advanced Analytics in Hadoop
 
Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014Application architectures with Hadoop – Big Data TechCon 2014
Application architectures with Hadoop – Big Data TechCon 2014
 
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
HBaseCon 2013: Streaming Data into Apache HBase using Apache Flume: Experienc...
 
Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup Slides
 
Apache Flume
Apache FlumeApache Flume
Apache Flume
 

Ähnlich wie Search onhadoopsfhug081413

Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoopgregchanan
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1Sperasoft
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalogmarkgrover
 
Webinar: Solr & Fusion for Big Data
Webinar: Solr & Fusion for Big DataWebinar: Solr & Fusion for Big Data
Webinar: Solr & Fusion for Big DataLucidworks
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Cask Data
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerhdhappy001
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with HadoopCloudera, Inc.
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageSATOSHI TAGOMORI
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadooplucenerevolution
 
Overview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceOverview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceSATOSHI TAGOMORI
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackDataWorks Summit/Hadoop Summit
 

Ähnlich wie Search onhadoopsfhug081413 (20)

Solr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for HadoopSolr + Hadoop: Interactive Search for Hadoop
Solr + Hadoop: Interactive Search for Hadoop
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
SolrCloud on Hadoop
SolrCloud on HadoopSolrCloud on Hadoop
SolrCloud on Hadoop
 
Apache Hadoop 1.1
Apache Hadoop 1.1Apache Hadoop 1.1
Apache Hadoop 1.1
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
Introduction to Hive and HCatalog
Introduction to Hive and HCatalogIntroduction to Hive and HCatalog
Introduction to Hive and HCatalog
 
Webinar: Solr & Fusion for Big Data
Webinar: Solr & Fusion for Big DataWebinar: Solr & Fusion for Big Data
Webinar: Solr & Fusion for Big Data
 
Big Data Developers Moscow Meetup 1 - sql on hadoop
Big Data Developers Moscow Meetup 1  - sql on hadoopBig Data Developers Moscow Meetup 1  - sql on hadoop
Big Data Developers Moscow Meetup 1 - sql on hadoop
 
An intro to Azure Data Lake
An intro to Azure Data LakeAn intro to Azure Data Lake
An intro to Azure Data Lake
 
Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?Webinar: What's new in CDAP 3.5?
Webinar: What's new in CDAP 3.5?
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Gunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stingerGunther hagleitner:apache hive & stinger
Gunther hagleitner:apache hive & stinger
 
Cloudera search
Cloudera searchCloudera search
Cloudera search
 
Getting Started with Hadoop
Getting Started with HadoopGetting Started with Hadoop
Getting Started with Hadoop
 
Data Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby UsageData Analytics Service Company and Its Ruby Usage
Data Analytics Service Company and Its Ruby Usage
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
The First Class Integration of Solr with Hadoop
The First Class Integration of Solr with HadoopThe First Class Integration of Solr with Hadoop
The First Class Integration of Solr with Hadoop
 
Overview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data ServiceOverview of data analytics service: Treasure Data Service
Overview of data analytics service: Treasure Data Service
 
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014
 
Real time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stackReal time fraud detection at 1+M scale on hadoop stack
Real time fraud detection at 1+M scale on hadoop stack
 

Kürzlich hochgeladen

Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1DianaGray10
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesMd Hossain Ali
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 

Kürzlich hochgeladen (20)

Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
201610817 - edge part1
201610817 - edge part1201610817 - edge part1
201610817 - edge part1
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 

Search onhadoopsfhug081413

  • 1. 1 Adding Search to the Hadoop Ecosystem Gregory Chanan (gchanan AT cloudera.com) SF HUG August 2013
  • 2. Agenda • Big Data and Search – setting the stage • Cloudera Search Architecture • Component deep dive • Security • Conclusion
  • 3. Why Search? • Hadoop for everyone • Typical case: • Ingest data to storage engine (HDFS, HBase, etc) • Process data (MapReduce, Hive, Impala) • Experts know MapReduce • Savvy people know SQL • Everyone knows Search!
  • 4. Why Search? An Integrated Part of the Hadoop System One pool of data One security framework One set of system resources One management interface
  • 5. Benefits of Search • Improved Big Data ROI • An interactive experience without technical knowledge • Single data set for multiple computing frameworks • Faster time to insight • Exploratory analysis, esp. unstructured data • Broad range of indexing options to accommodate needs • Cost efficiency • Single scalable platform; no incremental investment • No need for separate systems, storage
  • 6. What is Cloudera Search? • Full-text, interactive search with faceted navigation • Batch, near real-time, and on-demand indexing • Apache Solr integrated with CDH • Established, mature search with vibrant community • In production environments for years • Open Source • 100% Apache, 100% Solr • Standard Solr APIs • In public beta (version 0.9.3)
  • 7. Cloudera Search Components • HDFS/MR/Lucene/Solr/SolrCloud • Indexing • Near Real Time (NRT) indexing • Batch • ETL – Cloudera Morphlines • Querying
  • 8. Apache Hadoop • Apache HDFS • Distributed file system • High reliability • High throughput • Apache MapReduce • Parallel, distributed programming model • Allows processing of large datasets • Fault tolerant
  • 9. Apache Lucene • Full text search • Indexing • Query • Traditional inverted index • Batch and Incremental indexing • We are using version 4.3 in current release
  • 10. Apache Solr • Search service built using Lucene • Ships with Lucene (same TLP at Apache) • Provides XML/HTTP/JSON/Python/Ruby/… APIs • Indexing • Query • Administrative interface • Also rich web admin GUI via HTTP
  • 11. Apache SolrCloud • Provides distributed Search capability • Part of Solr (not a separate library/codebase) • Shards – provide scalability • partition index for size • replicate for query performance • Uses ZooKeeper for coordination • No split-brain issues • Simplifies operations
  • 12. Distributed Search on Hadoop Flume Hue UI Custom UI Custom App Solr Solr Solr SolrCloud query query query index Hadoop Cluster MR HDFS index HBase index ZK
  • 13. Indexing • Near Real Time (NRT) • Flume • HBase Indexer • Batch (MR)
  • 14. Indexing • Near Real Time (NRT) • Flume • HBase Indexer • Batch (MR)
  • 15. Near Real Time Indexing with Flume Log File Solr and Flume • Data ingest at scale • Flexible extraction and mapping • Indexing at data ingest HDFS Flume Agent Indexer Other Log File Flume Agent Indexer 15
  • 16. Apache Flume - MorphlineSolrSink • A Flume Source… • Receives/gathers events • A Flume Channel… • Carries the event – MemoryChannel or reliable FileChannel • A Flume Sink… • Sends the events on to the next location • Flume MorphlineSolrSink • Integrates Cloudera Morphlines library • ETL, more on that in a bit • Does batching • Results sent to Solr for indexing
  • 17. Indexing • Near Real Time (NRT) • Flume • HBase Indexer • Batch (MR)
  • 18. Near Real Time Indexing of Apache HBase HDFS HBase interactiveload HBase Indexer(s) Trigger Solr server Solr server Solr server Solr server Solr server Search + = planet-sized tabular data immediate access & updates fast & flexible information discovery BIG DATA DATAMANAGEMENT
  • 19. Lily HBase Indexer • Collaboration between NGData & Cloudera • NGData are creators of the Lily data management platform • Lily HBase Indexer • Service which acts as a HBase replication listener • HBase replication features, such as filtering, supported • Replication updates trigger indexing of updates (rows) • Integrates Cloudera Morphlines library for ETL of rows • AL2 licensed on github https://github.com/ngdata
  • 20. Indexing • Near Real Time (NRT) • Flume • HBase Indexer • Batch (MR)
  • 21. Scalable Batch Indexing Index shard Files Index shard Indexer Files Solr server Indexer Solr server 21 HDFS Solr and MapReduce • Flexible, scalable batch indexing • Start serving new indices with no downtime • On-demand indexing, cost- efficient re-indexing
  • 22. MapReduce Indexer MapReduce Job with two parts 1) Scan HDFS for files to be indexed • Much like Unix “find” – see HADOOP-8989 • Output is NLineInputFormat’ed file 2) Mapper/Reducer indexing step • Mapper extracts content via Cloudera Morphlines • Reducer indexes documents via embedded Solr server • Originally based on SOLR-1301 • Many modifications to enable linear scalability
  • 23. MapReduce Indexer “golive” • Cloudera created this to bridge the gap between NRT (low latency, expensive) and Batch (high latency, cheap at scale) indexing • Results of MR indexing operation are immediately merged into a live SolrCloud serving cluster • No downtime for users • No NRT expense • Linear scale out to the size of your MR cluster
  • 24. Cloudera Morphlines • Open Source framework for simple ETL • Ships as part Cloudera Developer Kit (CDK) • It’s a Java library • AL2 licensed on github https://github.com/cloudera/cdk • Simplify ETL • Built-in commands and library support (Avro format, Hadoop SequenceFiles, grok for syslog messages) • Configuration over coding • Standardize ETL
  • 25. Cloudera Morphlines Architecture Solr Solr Solr SolrCloud Logs, tweets, social media, html, images, pdf, text…. Anything you want to index Flume, MR Indexer, HBase indexer, etc... Or your application! Morphline Library Morphlines can be embedded in any application…
  • 26. Extraction and Mapping • Modeled after Unix pipelines • Simple and flexible data transformation • Reusable across multiple index workloads • Over time, extend and re- use across platform workloads syslog Flume Agent Solr sink Command: readLine Command: grok Command: loadSolr Solr Event Record Record Record Document MorphlineLibrary
  • 27. Morphline Example – syslog with grok morphlines : [ { id : morphline1 importCommands : ["com.cloudera.**", "org.apache.solr.**"] commands : [ { readLine {} } { grok { dictionaryFiles : [/tmp/grok-dictionaries] expressions : { message : """<%{POSINT:syslog_pri}>%{SYSLOGTIMESTAMP:syslog_timestamp} %{SYSLOGHOST:syslog_hostname} %{DATA:syslog_program}(?:[%{POSINT:syslog_pid}])?: %{GREEDYDATA:syslog_message}""" } } } { loadSolr {} } ] } ] Example Input <164>Feb 4 10:46:14 syslog sshd[607]: listening on 0.0.0.0 port 22 Output Record syslog_pri:164 syslog_timestamp:Feb 4 10:46:14 syslog_hostname:syslog syslog_program:sshd syslog_pid:607 syslog_message:listening on 0.0.0.0 port 22.
  • 28. Current Command Library • Integrate with and load into Apache Solr • Flexible log file analysis • Single-line record, multi-line records, CSV files • Regex based pattern matching and extraction • Integration with Avro • Integration with Apache Hadoop Sequence Files • Integration with SolrCell and all Apache Tika parsers • Auto-detection of MIME types from binary data using Apache Tika
  • 29. Current Command Library (cont) • Scripting support for dynamic java code • Operations on fields for assignment and comparison • Operations on fields with list and set semantics • if-then-else conditionals • A small rules engine (tryRules) • String and timestamp conversions • slf4j logging • Yammer metrics and counters • Decompression and unpacking of arbitrarily nested container file formats • Etc…
  • 30. Querying • Built-in solr web UI • Write your own • Hue
  • 31. Simple, Customizable Search Interface Hue • Simple UI • Navigated, faceted drill down • Customizable display • Full text search, standard Solr API and query language
  • 32. Security • Upstream Solr doesn’t really deal with security • Goal: use kerberos, like other CDH components • Current release: Support for kerberos authentication • Actively working on Index-level authorization • Future: more granular authorization
  • 33. Conclusion • Cloudera Search now in public beta • Free Download • Extensive documentation • Send your questions and feedback to search- user@cloudera.org • Take the Search online training • Cloudera Manager Standard (i.e. the free version) • Simple management of Search • Free Download • QuickStart VM also available!