Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved
Building Data Pipelines for Solr with Apache NiFi
Bryan Bende – Member of Technical Staff
Outline
• Introduction to Apache NiFi
• Solr Indexing & Update Handlers
• NiFi/Solr Integration
• Use Cases
About Me
• Member of Technical Staff at Hortonworks
• Apache NiFi Committer & PMC Member since June 2015
• Solr/Lucene user for several years
• Developed Solr integration for Apache NiFi 0.1.0 release
• Twitter: @bbende / Blog: bryanbende.com
Introduction
Installing Solr and getting started - easy (extract, bin/solr start)
Defining a schema and configuring Solr - easy
Getting all of your incoming data into Solr - not as easy
A lot of time spent…
• Cleaning and parsing data
• Writing custom code/scripts
• Building approaches for monitoring and debugging
• Deploying updates to code/scripts for small changes
Need something to make this easier…
Introduction to Apache NiFi
Apache NiFi
• Powerful and reliable system to process and distribute data
• Directed graphs of data routing and transformation
• Web-based User Interface for creating, monitoring, & controlling data flows
• Highly configurable - modify data flow at runtime, dynamically prioritize data
• Data Provenance tracks data through entire system
• Easily extensible through development of custom components
[1] https://nifi.apache.org/
NiFi - Terminology
FlowFile
• Unit of data moving through the system
• Content + Attributes (key/value pairs)
Processor
• Performs the work, can access FlowFiles
Connection
• Links between processors
• Queues that can be dynamically prioritized
Process Group
• Set of processors and their connections
• Receive data via input ports, send data via output ports
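The terminology above can be pictured with a small conceptual model. This is only a sketch: NiFi itself is written in Java, and the `FlowFile`, `Connection`, and `uppercase_processor` names below are hypothetical stand-ins, not NiFi's API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Unit of data: opaque content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Link between two processors, backed by a simple FIFO queue."""

    def __init__(self):
        self._queue = deque()

    def put(self, flow_file):
        self._queue.append(flow_file)

    def poll(self):
        """Return the next FlowFile, or None when the queue is empty."""
        return self._queue.popleft() if self._queue else None


def uppercase_processor(incoming, outgoing):
    """Toy processor: drains `incoming`, transforms content, tags an attribute."""
    while (flow_file := incoming.poll()) is not None:
        attributes = dict(flow_file.attributes, transformed="true")
        outgoing.put(FlowFile(flow_file.content.upper(), attributes))
```

A processor only ever sees FlowFiles through its connections, which is what lets a flow be rewired without touching processor code.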
NiFi - User Interface
• Drag and drop processors to build a flow
• Start, stop, and configure components in real time
• View errors and corresponding error messages
• View statistics and health of data flow
• Create templates of common processors & connections
NiFi - Provenance
• Tracks data at each point as it flows through the system
• Records, indexes, and makes events available for display
• Handles fan-in/fan-out, i.e. merging and splitting data
• View attributes and content at given points in time
NiFi - Queue Prioritization
• Configure a prioritizer per connection
• Determine what is important for your data – time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
• Develop your own prioritizer if needed
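As a rough illustration of a queue with a pluggable prioritizer, the sketch below orders items with a heap and falls back to arrival order on ties. The class, the `priority` attribute name, and its default of 10 are assumptions made for the example; NiFi's real prioritizers are Java extensions.

```python
import heapq
import itertools


class PrioritizedQueue:
    """Queue whose ordering is decided by a pluggable prioritizer function."""

    def __init__(self, prioritizer):
        self._prioritizer = prioritizer    # lower key is dequeued first
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: arrival order

    def put(self, item):
        entry = (self._prioritizer(item), next(self._arrival), item)
        heapq.heappush(self._heap, entry)

    def poll(self):
        """Return the highest-priority item, or None when empty."""
        return heapq.heappop(self._heap)[2] if self._heap else None


def by_priority_attribute(flow_file):
    """Hypothetical prioritizer: a smaller 'priority' attribute goes first."""
    return int(flow_file["attributes"].get("priority", 10))
```

Funnelling several connections into one such queue is what allows prioritizing across data sets rather than within a single one.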
NiFi - Extensibility
Built from the ground up with extensions in mind
Service-loader pattern for…
• Processors
• Controller Services
• Reporting Tasks
• Prioritizers
Extensions packaged as NiFi Archives (NARs)
• Deploy to NiFi lib directory and restart
• Provides ClassLoader isolation
• Same model as standard components
NiFi - Architecture
[Figure: NiFi architecture. Each NiFi instance runs on an OS/host inside a JVM that contains a Web Server and a Flow Controller hosting processors and extensions (Processor 1 … Extension N), backed by FlowFile, Content, and Provenance repositories on local storage. In a cluster, a master NiFi Cluster Manager (NCM), acting as request replicator, sits in front of slave NiFi nodes that each have this layout.]
Solr Indexing & Update Handlers
Solr – Indexing Data
Update Handlers
• XML, JSON, CSV
• https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers
Clients
• Java, PHP, Python, Ruby, Scala, Perl, and more
• https://wiki.apache.org/solr/IntegratingSolr
Solr Update Handlers - XML
Adding documents
<add>
  <doc>
    <field name="foo">bar</field>
  </doc>
</add>
Deleting documents
<delete>
<id>1234567</id>
<query>foo:bar</query>
</delete>
Other Operations
<commit waitSearcher="false"/>
<commit waitSearcher="false" expungeDeletes="true"/>
<optimize waitSearcher="false"/>
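The XML messages above can also be generated rather than written by hand. The following is a minimal sketch using only the Python standard library; the helper names `add_message` and `delete_message` are made up for illustration, and the resulting strings would still need to be POSTed to an update handler.

```python
import xml.etree.ElementTree as ET


def add_message(docs):
    """Build an <add> message from a list of {field name: value} dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            ET.SubElement(doc, "field", name=name).text = str(value)
    return ET.tostring(add, encoding="unicode")


def delete_message(ids=(), queries=()):
    """Build a <delete> message from document ids and/or delete queries."""
    delete = ET.Element("delete")
    for doc_id in ids:
        ET.SubElement(delete, "id").text = str(doc_id)
    for query in queries:
        ET.SubElement(delete, "query").text = query
    return ET.tostring(delete, encoding="unicode")
```

Building the message through an XML library means field values containing `<` or `&` are escaped automatically.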
Solr Update Handlers - JSON
Solr-Style JSON…
Add Documents
[
  {
    "id": "1",
    "title": "Doc 1"
  },
  {
    "id": "2",
    "title": "Doc 2"
  }
]
Commands
{
  "add": {
    "doc": {
      "id": "1",
      "title": {
        "boost": 2.3,
        "value": "Doc1"
      }
    }
  }
}
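For a sense of what sending Solr-style JSON looks like from a client, here is a hedged sketch that prepares, but does not send, an HTTP POST. The host, port, and collection name in `SOLR_UPDATE_URL` are assumptions, as is the helper name.

```python
import json
import urllib.request

# Assumed location of a Solr collection's update handler.
SOLR_UPDATE_URL = "http://localhost:8983/solr/example/update"


def json_update_request(docs, url=SOLR_UPDATE_URL):
    """Prepare (but do not send) a POST carrying a JSON array of documents."""
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = json_update_request([
    {"id": "1", "title": "Doc 1"},
    {"id": "2", "title": "Doc 2"},
])
# urllib.request.urlopen(request) would send it to a running Solr instance.
```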
Solr Update Handlers - JSON
Custom JSON
• Transform custom JSON based on Solr
schema
• Define paths to split JSON into multiple Solr
documents
• Field mappings from JSON field name to
Solr field name
Produces two Solr documents:
- John, Math, term1, 90
- John, Biology, term1, 86
split=/exams&
f=name:/name&
f=subject:/exams/subject&
f=test:/exams/test&
f=marks:/exams/marks
{
"name": "John",
"exams": [
{
"subject": "Math",
"test" : "term1",
"marks" : 90},
{
"subject": "Biology",
"test" : "term1",
"marks" : 86}
]
}
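To see why the parameters above yield two documents, the split and field mappings can be simulated locally. This is a simplified model written for this example, not Solr's implementation, and it does not attempt to cover the handler's full path syntax; `resolve` and `split_and_map` are hypothetical helpers.

```python
def resolve(node, path):
    """Follow a simple /a/b path through nested dicts; None if absent."""
    for key in path.strip("/").split("/"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def split_and_map(record, split, mappings):
    """Emit one flat document per element found at the `split` path.

    `mappings` maps a Solr field name to a JSON path. Paths under `split`
    are read from the current child; other paths are read from the parent.
    """
    docs = []
    for child in resolve(record, split) or []:
        doc = {}
        for field_name, path in mappings.items():
            if path.startswith(split + "/"):
                doc[field_name] = resolve(child, path[len(split):])
            else:
                doc[field_name] = resolve(record, path)
        docs.append(doc)
    return docs


record = {
    "name": "John",
    "exams": [
        {"subject": "Math", "test": "term1", "marks": 90},
        {"subject": "Biology", "test": "term1", "marks": 86},
    ],
}
docs = split_and_map(record, "/exams", {
    "name": "/name",
    "subject": "/exams/subject",
    "test": "/exams/test",
    "marks": "/exams/marks",
})
```

Each element of `exams` becomes its own document, and the parent-level `name` is copied into both.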
Solr Update Handlers - CSV
/update with Content-Type:application/csv
Important parameters:
• separator
• trim
• header
• fieldnames
• skip
• rowid
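These parameters travel as query-string arguments on the update request. A small sketch of assembling such a URL follows; the base URL, the field names, and the `csv_update_url` helper are assumptions chosen for illustration.

```python
from urllib.parse import urlencode

# Assumed location of a Solr collection's update handler.
BASE_URL = "http://localhost:8983/solr/example/update"


def csv_update_url(params, base_url=BASE_URL):
    """Append URL-encoded CSV handler parameters to the update URL."""
    return base_url + "?" + urlencode(params)


url = csv_update_url({
    "separator": ";",                  # fields are separated by semicolons
    "trim": "true",                    # strip surrounding whitespace from values
    "header": "false",                 # the first line is data, not field names
    "fieldnames": "id,title,author",   # so the names are supplied here instead
    "skip": "author",                  # do not index this column
})
# The CSV body would be POSTed to `url` with Content-Type: application/csv.
```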
SolrJ Client
SolrDocument Update
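// Needs org.apache.solr.common.SolrInputDocument; solrClient is an existing SolrClient.
SolrInputDocument doc = new SolrInputDocument();
doc.addField("first", "bob");
doc.addField("last", "smith");
solrClient.add(doc);
ContentStream Update
// Needs org.apache.solr.client.solrj.request.ContentStreamUpdateRequest.
// Streams raw JSON to the custom JSON handler with a split path and field mappings.
ContentStreamUpdateRequest request =
    new ContentStreamUpdateRequest("/update/json/docs");
request.setParam("json.command", "false");
request.setParam("split", "/exams");
// "f" is repeated once per field mapping, so add() is used instead of setParam().
request.getParams().add("f", "name:/name");
request.getParams().add("f", "subject:/exams/subject");
request.getParams().add("f", "test:/exams/test");
request.getParams().add("f", "marks:/exams/marks");
request.addContentStream(new ContentStream...);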
NiFi/Solr Integration
NiFi Solr Processors
• Support Solr Cloud and stand-alone Solr instances
• Leverage SolrJ – CloudSolrClient & HttpSolrClient
• Extract new documents based on a date/time field – GetSolr
• Stream FlowFile content to an update handler - PutSolrContentStream
PutSolrContentStream
• Choose Solr Type - Cloud or Standard
• Specify ZooKeeper hosts, or the Solr URL
• Specify a collection if using Solr Cloud
• Specify the Solr path for the ContentStream
• Dynamic properties sent as key/value pairs on the request
• Relationships for success, failure, and connection_failure
GetSolr
• Solr Type, Solr Location, and Collection are the same as PutSolrContentStream
• Specify a query to run on each execution of the processor
• Specify a sort clause and a date field used to filter results
• Schedule processor to run on a cron, or timer
• Retrieves documents with ‘Date Field’ greater than time of last execution
• Produces output in SolrJ XML
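The incremental behaviour described above amounts to adding a date-range filter to each query. The sketch below shows one way such parameters could be built; it is an illustration under assumed names (`solr_timestamp`, `incremental_query`) and is not taken from the GetSolr source.

```python
from datetime import datetime, timezone


def solr_timestamp(moment):
    """Format an aware datetime as a UTC ISO 8601 instant, as Solr dates use."""
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def incremental_query(query, date_field, last_run, sort):
    """Parameters selecting only documents newer than the previous run."""
    return {
        "q": query,
        # '{' makes the lower bound exclusive: strictly greater than last_run.
        "fq": f"{date_field}:{{{solr_timestamp(last_run)} TO *]",
        "sort": sort,
    }


params = incremental_query(
    "*:*",
    "timestamp",
    datetime(2015, 10, 1, 12, 0, 0, tzinfo=timezone.utc),
    "timestamp asc",
)
```

Sorting ascending on the same date field keeps results in the order they were indexed, which makes it straightforward to record where the next run should start.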
Use Cases
Use Cases – Index JSON
1. Pull in Tweets using Twitter API
2. Extract language and text into FlowFile attributes
3. Get non-empty English tweets
${twitter.text:isEmpty():not():and(${twitter.lang:equals("en")})}
4. Merge together JSON documents based on quantity, or time
5. Use dynamic field mappings to select fields for indexing
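The filter in step 3 reads more easily as an ordinary predicate. The function below is a Python paraphrase of that expression over a dict of FlowFile attributes; the function name is made up, and whitespace-only text is treated as empty on the assumption that this matches the expression language's `isEmpty()`.

```python
def is_nonempty_english(attributes):
    """True when twitter.text is not empty AND twitter.lang equals "en"."""
    text = attributes.get("twitter.text") or ""
    return bool(text.strip()) and attributes.get("twitter.lang") == "en"
```

In the flow, tweets for which this holds continue on to merging and indexing, and everything else is dropped.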
Use Cases – Issue Commands
1. Generate a FlowFile on a cron, or timer, to initiate an action
2. Replace the contents of the FlowFile with a Solr command
<delete>
  <query>timestamp:[* TO NOW-1HOUR]</query>
</delete>
3. Send the command to the appropriate update handler
Use Cases – Multiple Collections
1. Set a FlowFile attribute containing the name of a Solr collection
2. Use expression language when setting the Collection property on the Solr processor:
${solr.collection}
Note:
• If merging documents, merge per collection in this case
• Current bug preventing this scenario from working: https://issues.apache.org/jira/browse/NIFI-959
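The note about merging per collection can be illustrated with a small grouping step. This is a sketch only: FlowFiles are modelled as plain dicts, and `group_by_collection` is a hypothetical helper rather than a NiFi processor.

```python
from collections import defaultdict


def group_by_collection(flow_files, attribute="solr.collection"):
    """Batch document contents by the collection named in each FlowFile."""
    batches = defaultdict(list)
    for flow_file in flow_files:
        batches[flow_file["attributes"][attribute]].append(flow_file["content"])
    return dict(batches)


batches = group_by_collection([
    {"attributes": {"solr.collection": "tweets"}, "content": {"id": "1"}},
    {"attributes": {"solr.collection": "logs"}, "content": {"id": "2"}},
    {"attributes": {"solr.collection": "tweets"}, "content": {"id": "3"}},
])
```

Grouping first ensures that a merged batch never mixes documents bound for different collections.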
Use Cases – Log Aggregation
1. Listen for log events over UDP on a given port
• Set ‘Flow File Per Datagram’ to true
2. Send JSON log events
• Syslog UDP forwarding
• Logback/log4j UDP appenders
3. Merge JSON events together based on size, or time
4. Stream JSON update to Solr
http://bryanbende.com/development/2015/05/17/collecting-logs-with-apache-nifi/
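Merging events by size or time, as in step 3, can be sketched as a small batcher that releases a batch when either threshold is reached. The `Batcher` class and its parameters are assumptions for illustration; in this simplified version the age check only happens when a new event arrives.

```python
import time


class Batcher:
    """Collect events and release a batch on size or age, whichever is first."""

    def __init__(self, max_events, max_age_seconds, clock=time.monotonic):
        self.max_events = max_events
        self.max_age_seconds = max_age_seconds
        self._clock = clock          # injectable so the logic can be tested
        self._events = []
        self._started = None         # time the current batch was opened

    def add(self, event):
        """Add an event; return the batch if a threshold was hit, else None."""
        if not self._events:
            self._started = self._clock()
        self._events.append(event)
        full = len(self._events) >= self.max_events
        stale = self._clock() - self._started >= self.max_age_seconds
        if full or stale:
            batch, self._events = self._events, []
            return batch
        return None
```

Larger batches mean fewer update requests to Solr, while the age limit bounds how long a quiet source can delay indexing.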
Use Cases – Index Avro
1. Receive an Avro datafile with binary encoding
2. Convert Avro to JSON using the built-in ConvertAvroToJSON processor
3. Stream JSON documents to Solr
Use Cases – Index a Relational Database
1. GenerateFlowFile acts as a timer to trigger ExecuteSQL
(Future plans to not require an incoming FlowFile for ExecuteSQL: NIFI-932)
2. ExecuteSQL performs a SQL query and streams the results as an Avro datafile
Use expression language to construct a dynamic date range:
${now():toNumber():minus(60000):format('yyyy-MM-dd')}
3. Convert Avro to JSON using the built-in ConvertAvroToJSON processor
4. Stream JSON update to Solr
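The date-range expression in step 2 subtracts 60000 milliseconds (one minute) from the current time and formats the result as a calendar date. A Python equivalent, with a hypothetical function name and the current time passed in so the result is reproducible, might look like this:

```python
from datetime import datetime, timedelta


def one_minute_ago_date(now):
    """Subtract 60000 ms from `now` and format as year-month-day."""
    return (now - timedelta(milliseconds=60000)).strftime("%Y-%m-%d")


# Thirty seconds past midnight, so one minute earlier is the previous day.
lower_bound = one_minute_ago_date(datetime(2015, 10, 1, 0, 0, 30))
```

A value like this can be placed in the WHERE clause of the SQL query so that each run only selects recently changed rows.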
Use Cases – Extraction in a Cluster
1. Schedule GetSolr to run on Primary Node
2. Send results to a Remote Process Group pointing back to self
3. Data gets redistributed to “Solr XML Docs” Input Ports across cluster
4. Perform further processing on each node
Future Work
Unofficial ideas…
PutSolrDocument
• Parse FlowFile InputStream into one or more SolrDocuments
• Allow developers to provide “FlowFile to SolrDocument” converter
PutSolrAttributes
• Create a SolrDocument from FlowFile attributes
• Processor properties specify attributes to include/exclude
Distribute & Execute Solr Commands
• DistributeSolrCommand learns about Solr shards and produces commands per shard
• ExecuteSolrCommand performs action based on the incoming command
Summary
Resources
• Apache NiFi Mailing Lists
– https://nifi.apache.org/mailing_lists.html
• Apache NiFi Documentation
– https://nifi.apache.org/docs.html
• Getting started developing extensions
– https://cwiki.apache.org/confluence/display/NIFI/Maven+Projects+for+Extensions
– https://nifi.apache.org/developer-guide.html
Contact Info:
• Email: bbende@hortonworks.com
• Twitter: @bbende
Sources
[1] https://nifi.apache.org/
[2] https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers
[3] https://wiki.apache.org/solr/IntegratingSolr
[4] http://lucidworks.com/blog/indexing-custom-json-data/
Thank you

Weitere ähnliche Inhalte

Was ist angesagt?

Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
DataWorks Summit
 
Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFi
DataWorks Summit
 
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...
Timothy Spann
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
DataWorks Summit
 

Was ist angesagt? (20)

NiFi Developer Guide
NiFi Developer GuideNiFi Developer Guide
NiFi Developer Guide
 
File Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and ParquetFile Format Benchmark - Avro, JSON, ORC and Parquet
File Format Benchmark - Avro, JSON, ORC and Parquet
 
Introduction to data flow management using apache nifi
Introduction to data flow management using apache nifiIntroduction to data flow management using apache nifi
Introduction to data flow management using apache nifi
 
Integrating NiFi and Flink
Integrating NiFi and FlinkIntegrating NiFi and Flink
Integrating NiFi and Flink
 
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterpriseUsing Spark Streaming and NiFi for the next generation of ETL in the enterprise
Using Spark Streaming and NiFi for the next generation of ETL in the enterprise
 
Drone Data Flowing Through Apache NiFi
Drone Data Flowing Through Apache NiFiDrone Data Flowing Through Apache NiFi
Drone Data Flowing Through Apache NiFi
 
Introduction to Apache NiFi dws19 DWS - DC 2019
Introduction to Apache NiFi   dws19 DWS - DC 2019Introduction to Apache NiFi   dws19 DWS - DC 2019
Introduction to Apache NiFi dws19 DWS - DC 2019
 
Dataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFiDataflow Management From Edge to Core with Apache NiFi
Dataflow Management From Edge to Core with Apache NiFi
 
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
Hortonworks Data in Motion Webinar Series Part 7 Apache Kafka Nifi Better Tog...
 
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...
Designing Event-Driven Applications with Apache NiFi, Apache Flink, Apache Sp...
 
Flink Streaming
Flink StreamingFlink Streaming
Flink Streaming
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 
Apache NiFi Record Processing
Apache NiFi Record ProcessingApache NiFi Record Processing
Apache NiFi Record Processing
 
Apache Nifi Crash Course
Apache Nifi Crash CourseApache Nifi Crash Course
Apache Nifi Crash Course
 
Integrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data LakesIntegrating Apache Spark and NiFi for Data Lakes
Integrating Apache Spark and NiFi for Data Lakes
 
Apache Nifi Crash Course
Apache Nifi Crash CourseApache Nifi Crash Course
Apache Nifi Crash Course
 
Alfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en españolAlfresco y SOLR, presentación en español
Alfresco y SOLR, presentación en español
 
Unlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDCUnlock Value from Big Data with Apache NiFi and Streaming CDC
Unlock Value from Big Data with Apache NiFi and Streaming CDC
 
Hadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox GatewayHadoop REST API Security with Apache Knox Gateway
Hadoop REST API Security with Apache Knox Gateway
 
Apache Flink and what it is used for
Apache Flink and what it is used forApache Flink and what it is used for
Apache Flink and what it is used for
 

Andere mochten auch

Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
Tommaso Teofili
 

Andere mochten auch (20)

Real-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFiReal-Time Data Flows with Apache NiFi
Real-Time Data Flows with Apache NiFi
 
Apache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup SlidesApache NiFi- MiNiFi meetup Slides
Apache NiFi- MiNiFi meetup Slides
 
NJ Hadoop Meetup - Apache NiFi Deep Dive
NJ Hadoop Meetup - Apache NiFi Deep DiveNJ Hadoop Meetup - Apache NiFi Deep Dive
NJ Hadoop Meetup - Apache NiFi Deep Dive
 
Apache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop EcosystemApache NiFi in the Hadoop Ecosystem
Apache NiFi in the Hadoop Ecosystem
 
The Avant-garde of Apache NiFi
The Avant-garde of Apache NiFiThe Avant-garde of Apache NiFi
The Avant-garde of Apache NiFi
 
IOT, Streaming Analytics and Machine Learning
IOT, Streaming Analytics and Machine Learning IOT, Streaming Analytics and Machine Learning
IOT, Streaming Analytics and Machine Learning
 
Integrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applicationsIntegrate Solr with real-time stream processing applications
Integrate Solr with real-time stream processing applications
 
Tuning Solr & Pipeline for Logs
Tuning Solr & Pipeline for LogsTuning Solr & Pipeline for Logs
Tuning Solr & Pipeline for Logs
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
 
The Elephant in the Clouds
The Elephant in the CloudsThe Elephant in the Clouds
The Elephant in the Clouds
 
A Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache FlinkA Comparative Performance Evaluation of Apache Flink
A Comparative Performance Evaluation of Apache Flink
 
Monitoring as code
Monitoring as codeMonitoring as code
Monitoring as code
 
Building a Smarter Home with Apache NiFi and Spark
Building a Smarter Home with Apache NiFi and SparkBuilding a Smarter Home with Apache NiFi and Spark
Building a Smarter Home with Apache NiFi and Spark
 
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)Integrating Spark and Solr-(Timothy Potter, Lucidworks)
Integrating Spark and Solr-(Timothy Potter, Lucidworks)
 
From Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFiFrom Zero to Data Flow in Hours with Apache NiFi
From Zero to Data Flow in Hours with Apache NiFi
 
How to choose the right Integration Framework - Apache Camel (JBoss, Talend),...
How to choose the right Integration Framework - Apache Camel (JBoss, Talend),...How to choose the right Integration Framework - Apache Camel (JBoss, Talend),...
How to choose the right Integration Framework - Apache Camel (JBoss, Talend),...
 
Apache Solr crash course
Apache Solr crash courseApache Solr crash course
Apache Solr crash course
 
Dataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJDataflow with Apache NiFi - Crash Course - HS16SJ
Dataflow with Apache NiFi - Crash Course - HS16SJ
 
Stream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache FlinkStream Analytics with SQL on Apache Flink
Stream Analytics with SQL on Apache Flink
 
Joe Witt presentation on Apache NiFi
Joe Witt presentation on Apache NiFiJoe Witt presentation on Apache NiFi
Joe Witt presentation on Apache NiFi
 

Ähnlich wie Building Data Pipelines for Solr with Apache NiFi

Ähnlich wie Building Data Pipelines for Solr with Apache NiFi (20)

Integrating Apache NiFi and Apache Apex
Integrating Apache NiFi and Apache Apex Integrating Apache NiFi and Apache Apex
Integrating Apache NiFi and Apache Apex
 
Integrating NiFi and Apex
Integrating NiFi and ApexIntegrating NiFi and Apex
Integrating NiFi and Apex
 
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache FalconDriving Enterprise Data Governance for Big Data Systems through Apache Falcon
Driving Enterprise Data Governance for Big Data Systems through Apache Falcon
 
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
 
Integrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache FlinkIntegrating Apache NiFi and Apache Flink
Integrating Apache NiFi and Apache Flink
 
Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search Hortonworks Technical Workshop - HDP Search
Hortonworks Technical Workshop - HDP Search
 
WebSocket in Enterprise Applications 2015
WebSocket in Enterprise Applications 2015WebSocket in Enterprise Applications 2015
WebSocket in Enterprise Applications 2015
 
State of the Apache NiFi Ecosystem & Community
State of the Apache NiFi Ecosystem & CommunityState of the Apache NiFi Ecosystem & Community
State of the Apache NiFi Ecosystem & Community
 
You Can't Search Without Data
You Can't Search Without DataYou Can't Search Without Data
You Can't Search Without Data
 
[253] apache ni fi
[253] apache ni fi[253] apache ni fi
[253] apache ni fi
 
Apache NiFi Crash Course Intro
Apache NiFi Crash Course IntroApache NiFi Crash Course Intro
Apache NiFi Crash Course Intro
 
Apache NiFi - Flow Based Programming Meetup
Apache NiFi - Flow Based Programming MeetupApache NiFi - Flow Based Programming Meetup
Apache NiFi - Flow Based Programming Meetup
 
BigData Techcon - Beyond Messaging with Apache NiFi
BigData Techcon - Beyond Messaging with Apache NiFiBigData Techcon - Beyond Messaging with Apache NiFi
BigData Techcon - Beyond Messaging with Apache NiFi
 
HDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical WorkshopHDF: Hortonworks DataFlow: Technical Workshop
HDF: Hortonworks DataFlow: Technical Workshop
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Mission to NARs with Apache NiFi
Mission to NARs with Apache NiFiMission to NARs with Apache NiFi
Mission to NARs with Apache NiFi
 
Falcon Meetup
Falcon Meetup Falcon Meetup
Falcon Meetup
 
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
Big Data Day LA 2016/ Big Data Track - Building scalable enterprise data flow...
 

Mehr von Bryan Bende

Mehr von Bryan Bende (6)

Apache NiFi SDLC Improvements
Apache NiFi SDLC ImprovementsApache NiFi SDLC Improvements
Apache NiFi SDLC Improvements
 
Apache NiFi Meetup - Introduction to NiFi Registry
Apache NiFi Meetup - Introduction to NiFi RegistryApache NiFi Meetup - Introduction to NiFi Registry
Apache NiFi Meetup - Introduction to NiFi Registry
 
Devnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFiDevnexus 2018 - Let Your Data Flow with Apache NiFi
Devnexus 2018 - Let Your Data Flow with Apache NiFi
 
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFiTaking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
 
Document Similarity with Cloud Computing
Document Similarity with Cloud ComputingDocument Similarity with Cloud Computing
Document Similarity with Cloud Computing
 
Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014Real-Time Inverted Search NYC ASLUG Oct 2014
Real-Time Inverted Search NYC ASLUG Oct 2014
 

Kürzlich hochgeladen

AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
VictorSzoltysek
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
mohitmore19
 

Kürzlich hochgeladen (20)

%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park %in kempton park+277-882-255-28 abortion pills for sale in kempton park
%in kempton park+277-882-255-28 abortion pills for sale in kempton park
 
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM TechniquesAI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
AI Mastery 201: Elevating Your Workflow with Advanced LLM Techniques
 
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
%in Stilfontein+277-882-255-28 abortion pills for sale in Stilfontein
 
%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban%in Durban+277-882-255-28 abortion pills for sale in Durban
%in Durban+277-882-255-28 abortion pills for sale in Durban
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
%in Lydenburg+277-882-255-28 abortion pills for sale in Lydenburg
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Microsoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdfMicrosoft AI Transformation Partner Playbook.pdf
Microsoft AI Transformation Partner Playbook.pdf
 
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdfThe Ultimate Test Automation Guide_ Best Practices and Tips.pdf
The Ultimate Test Automation Guide_ Best Practices and Tips.pdf
 
Generic or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisionsGeneric or specific? Making sensible software design decisions
Generic or specific? Making sensible software design decisions
 
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Vancouver Psychic Readings, Attraction spells,Br...
 
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa%in tembisa+277-882-255-28 abortion pills for sale in tembisa
%in tembisa+277-882-255-28 abortion pills for sale in tembisa
 
SHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions PresentationSHRMPro HRMS Software Solutions Presentation
SHRMPro HRMS Software Solutions Presentation
 
Exploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdfExploring the Best Video Editing App.pdf
Exploring the Best Video Editing App.pdf
 
Define the academic and professional writing..pdf
Define the academic and professional writing..pdfDefine the academic and professional writing..pdf
Define the academic and professional writing..pdf
 
10 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 202410 Trends Likely to Shape Enterprise Technology in 2024
10 Trends Likely to Shape Enterprise Technology in 2024
 

Building Data Pipelines for Solr with Apache NiFi

  • 1. Page1 © Hortonworks Inc. 2011 – 2015. All Rights Reserved Building Data Pipelines for Solr with Apache NiFi Bryan Bende – Member of Technical Staff
  • 2. Outline • Introduction to Apache NiFi • Solr Indexing & Update Handlers • NiFi/Solr Integration • Use Cases
  • 3. About Me • Member of Technical Staff at Hortonworks • Apache NiFi Committer & PMC Member since June 2015 • Solr/Lucene user for several years • Developed Solr integration for Apache NiFi 0.1.0 release • Twitter: @bbende / Blog: bryanbende.com
  • 4. Introduction Installing Solr and getting started - easy (extract, bin/solr start) Defining a schema and configuring Solr - easy Getting all of your incoming data into Solr - not as easy A lot of time spent… • Cleaning and parsing data • Writing custom code/scripts • Building approaches for monitoring and debugging • Deploying updates to code/scripts for small changes Need something to make this easier…
  • 5. Introduction to Apache NiFi
  • 6. Apache NiFi • Powerful and reliable system to process and distribute data • Directed graphs of data routing and transformation • Web-based User Interface for creating, monitoring, & controlling data flows • Highly configurable - modify data flow at runtime, dynamically prioritize data • Data Provenance tracks data through entire system • Easily extensible through development of custom components [1] https://nifi.apache.org/
  • 7. NiFi - Terminology FlowFile • Unit of data moving through the system • Content + Attributes (key/value pairs) Processor • Performs the work, can access FlowFiles Connection • Links between processors • Queues that can be dynamically prioritized Process Group • Set of processors and their connections • Receive data via input ports, send data via output ports
  • 8. NiFi - User Interface • Drag and drop processors to build a flow • Start, stop, and configure components in real time • View errors and corresponding error messages • View statistics and health of data flow • Create templates of common processor & connections
  • 9. NiFi - Provenance • Tracks data at each point as it flows through the system • Records, indexes, and makes events available for display • Handles fan-in/fan-out, i.e. merging and splitting data • View attributes and content at given points in time
  • 10. NiFi - Queue Prioritization • Configure a prioritizer per connection • Determine what is important for your data – time based, arrival order, importance of a data set • Funnel many connections down to a single connection to prioritize across data sets • Develop your own prioritizer if needed
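The prioritization idea on this slide can be illustrated with a small, self-contained sketch. This is not NiFi code: the `PrioritizedQueue` class, the numeric `priority` argument, and the "lower number wins" rule are all assumptions made for illustration. It only shows how a single queue can order items by importance first and arrival order second.

```python
import heapq
import itertools


class PrioritizedQueue:
    """Toy stand-in for a connection queue with a prioritizer (hypothetical)."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: arrival order

    def put(self, flowfile, priority):
        # Assumption of this sketch: a lower number means more important.
        heapq.heappush(self._heap, (priority, next(self._arrival), flowfile))

    def get(self):
        # Pop the most important item; equal priorities come out first-in, first-out.
        return heapq.heappop(self._heap)[2]


queue = PrioritizedQueue()
queue.put("bulk-archive-record", priority=5)
queue.put("alert-1", priority=1)
queue.put("alert-2", priority=1)
order = [queue.get() for _ in range(3)]  # alerts first, in arrival order
```

Funnelling several data sets into one such queue is what lets the ordering apply across data sets rather than within each one.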
  • 11. NiFi - Extensibility Built from the ground up with extensions in mind Service-loader pattern for… • Processors • Controller Services • Reporting Tasks • Prioritizers Extensions packaged as NiFi Archives (NARs) • Deploy to NiFi lib directory and restart • Provides ClassLoader isolation • Same model as standard components
  • 12. NiFi - Architecture [Diagram: each NiFi node is an OS/Host running a JVM that contains a Web Server and a Flow Controller (Processor 1 … Extension N), backed by a FlowFile Repository, Content Repository, and Provenance Repository on local storage. In a cluster, a master NiFi Cluster Manager (NCM), with a Web Server and Request Replicator in its own JVM, is shown alongside the slave NiFi nodes.]
  • 13. Solr Indexing & Update Handlers
  • 14. Solr – Indexing Data Update Handlers • XML, JSON, CSV • https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers Clients • Java, PHP, Python, Ruby, Scala, Perl, and more • https://wiki.apache.org/solr/IntegratingSolr
  • 15. Solr Update Handlers - XML Adding documents <add> <doc> <field name="foo">bad</field> </doc> </add> Deleting documents <delete> <id>1234567</id> <query>foo:bar</query> </delete> Other Operations <commit waitSearcher="false"/> <commit waitSearcher="false" expungeDeletes="true"/> <optimize waitSearcher="false"/>
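An `<add>` message like the one on this slide can be generated rather than hand-written, which avoids quoting and escaping mistakes. The following is a minimal sketch using only Python's standard library; the helper name `build_add_xml` is hypothetical, and the field name and value are taken from the slide.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of {field: value} dicts as a Solr XML <add> message."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, <, > on output
    return ET.tostring(add, encoding="unicode")


message = build_add_xml([{"foo": "bad"}])
# message == '<add><doc><field name="foo">bad</field></doc></add>'
```

The resulting string is what would be sent as the request body to the XML update handler.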
  • 16. Solr Update Handlers - JSON Solr-Style JSON… Add Documents [ { "id": "1", "title": "Doc 1" }, { "id": "2", "title": "Doc 2" } ] Commands { "add": { "doc": { "id": "1", "title": { "boost": 2.3, "value": "Doc1" } } } }
  • 17. Solr Update Handlers - JSON Custom JSON • Transform custom JSON based on Solr schema • Define paths to split JSON into multiple Solr documents • Field mappings from JSON field name to Solr field name Request parameters: split=/exams&f=name:/name&f=subject:/exams/subject&f=test:/exams/test&f=marks:/exams/marks Input: { "name": "John", "exams": [ { "subject": "Math", "test" : "term1", "marks" : 90}, { "subject": "Biology", "test" : "term1", "marks" : 86} ] } Produces two Solr documents: - John, Math, term1, 90 - John, Biology, term1, 86
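To make the effect of `split` and the `f` mappings concrete, here is a small sketch that reproduces the slide's example in plain Python. It is an illustration of the idea only, not Solr's implementation: the function `split_and_map` is hypothetical and handles just one level of nesting, which is all the slide's example needs.

```python
import json


def split_and_map(record, split_path, mappings):
    """Emulate split=/exams plus f=field:/path mappings for one nesting level."""
    split_key = split_path.strip("/")
    docs = []
    for child in record[split_key]:
        doc = {}
        for field, path in mappings.items():
            parts = path.strip("/").split("/")
            if parts[0] == split_key:
                doc[field] = child[parts[1]]   # value from the split element
            else:
                doc[field] = record[parts[0]]  # parent value copied into each doc
        docs.append(doc)
    return docs


record = json.loads("""
{ "name": "John",
  "exams": [ {"subject": "Math",    "test": "term1", "marks": 90},
             {"subject": "Biology", "test": "term1", "marks": 86} ] }
""")
mappings = {"name": "/name", "subject": "/exams/subject",
            "test": "/exams/test", "marks": "/exams/marks"}
docs = split_and_map(record, "/exams", mappings)
```

Each element under the split path becomes its own document, and fields mapped from outside that path (here `name`) are repeated in every document.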
  • 18. Solr Update Handlers - CSV /update with Content-Type:application/csv Important parameters: • separator • trim • header • fieldnames • skip • rowid
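A CSV update is an ordinary POST to `/update` with the parameters above on the query string. The sketch below only builds the request and does not send it. The host, port, collection name (`mycollection`), parameter values, and sample row are assumptions for illustration; the parameter names and the content type come from the slide.

```python
from urllib.parse import urlencode
from urllib.request import Request


def csv_update_request(base_url, csv_bytes, **params):
    """Build (but do not send) a POST to the update handler for CSV content."""
    url = f"{base_url}/update?{urlencode(params)}"
    return Request(url, data=csv_bytes,
                   headers={"Content-Type": "application/csv"}, method="POST")


request = csv_update_request(
    "http://localhost:8983/solr/mycollection",  # hypothetical Solr URL
    b"1|Doc 1\n2|Doc 2\n",
    separator="|", trim="true", header="false", fieldnames="id,title",
)
# request.full_url ends with:
# /update?separator=%7C&trim=true&header=false&fieldnames=id%2Ctitle
```

Passing the request to `urllib.request.urlopen` would submit it to a running Solr instance.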
  • 19. SolrJ Client SolrDocument Update SolrInputDocument doc = new SolrInputDocument(); doc.addField("first", "bob"); doc.addField("last", "smith"); solrClient.add(doc); ContentStream Update ContentStreamUpdateRequest request = new ContentStreamUpdateRequest( "/update/json/docs"); request.setParam("json.command", "false"); request.setParam("split", "/exams"); request.getParams().add("f", "name:/name"); request.getParams().add("f", "subject:/exams/subject"); request.getParams().add("f","test:/exams/test"); request.getParams().add("f","marks:/exams/marks"); request.addContentStream(new ContentStream...);
  • 20. NiFi/Solr Integration
  • 21. NiFi Solr Processors • Support Solr Cloud and stand-alone Solr instances • Leverage SolrJ – CloudSolrClient & HttpSolrClient • Extract new documents based on a date/time field – GetSolr • Stream FlowFile content to an update handler - PutSolrContentStream
  • 22. PutSolrContentStream • Choose Solr Type - Cloud or Standard • Specify ZooKeeper hosts, or the Solr URL • Specify a collection if using Solr Cloud • Specify the Solr path for the ContentStream • Dynamic properties sent as key/value pairs on the request • Relationships for success, failure, and connection_failure
  • 23. GetSolr • Solr Type, Solr Location, and Collection are the same as PutSolr • Specify a query to run on each execution of the processor • Specify a sort clause and a date field used to filter results • Schedule processor to run on a cron, or timer • Retrieves documents with ‘Date Field’ greater than time of last execution • Produces output in SolrJ XML
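The incremental extraction described here amounts to adding a date-range filter whose lower bound is the previous run time. The sketch below shows one way such a filter could be built; it is not taken from the GetSolr source, and the function name `date_filter` and the field name `created` are hypothetical. The `{… TO *]` syntax is Solr's range query with an exclusive lower bound.

```python
from datetime import datetime, timezone


def date_filter(date_field, last_run):
    """Build a Solr range filter matching documents strictly newer than last_run."""
    stamp = last_run.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # '{' makes the lower bound exclusive, ']' leaves the upper bound open-inclusive.
    return f"{date_field}:{{{stamp} TO *]"


last_run = datetime(2015, 10, 1, 12, 0, 0, tzinfo=timezone.utc)
fq = date_filter("created", last_run)
# fq == "created:{2015-10-01T12:00:00Z TO *]"
```

After each successful run the stored timestamp would be advanced, so the next execution only sees documents added since.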
  • 24. Use Cases
  • 25. Use Cases – Index JSON 1. Pull in Tweets using Twitter API 2. Extract language and text into FlowFile attributes 3. Get non-empty English tweets ${twitter.text:isEmpty():not():and( ${twitter.lang:equals("en")})} 4. Merge together JSON documents based on quantity, or time 5. Use dynamic field mappings to select fields for indexing.
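The routing expression in step 3 reads as "text is not empty AND language equals en". A plain-Python equivalent is sketched below to make the logic explicit. The attribute names come from the slide; the function name is hypothetical, and treating missing or whitespace-only text as empty is an assumption about how `isEmpty()` behaves.

```python
def is_nonempty_english(attributes):
    """Python equivalent of the routing expression on the slide."""
    text = attributes.get("twitter.text") or ""
    lang = attributes.get("twitter.lang") or ""
    # Assumption: isEmpty() treats missing or whitespace-only text as empty.
    return bool(text.strip()) and lang == "en"


tweets = [
    {"twitter.text": "Indexing with NiFi", "twitter.lang": "en"},
    {"twitter.text": "", "twitter.lang": "en"},
    {"twitter.text": "Bonjour", "twitter.lang": "fr"},
]
kept = [t for t in tweets if is_nonempty_english(t)]  # only the first tweet
```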
  • 26. Use Cases – Issue Commands 1. Generate a FlowFile on a cron, or timer, to initiate an action 2. Replace the contents of the FlowFile with a Solr command <delete> <query> timestamp:[* TO NOW-1HOUR] </query> </delete> 3. Send the command to the appropriate update handler
  • 27. Use Cases – Multiple Collections 1. Set a FlowFile attribute containing the name of a Solr collection 2. Use expression language when setting the Collection property on the Solr processor: ${solr.collection} Note: • If merging documents, merge per collection in this case • Current bug preventing this scenario from working: https://issues.apache.org/jira/browse/NIFI-959
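The note about merging per collection matters because a merged batch is sent to a single collection. The grouping step can be sketched as follows; this is a plain-Python illustration rather than NiFi code. The `solr.collection` attribute name comes from the slide, while the function name and the representation of a FlowFile as an `(attributes, content)` pair are assumptions.

```python
from collections import defaultdict


def batch_by_collection(flowfiles):
    """Group FlowFile contents by their 'solr.collection' attribute."""
    batches = defaultdict(list)
    for attributes, content in flowfiles:
        batches[attributes["solr.collection"]].append(content)
    return dict(batches)


flowfiles = [
    ({"solr.collection": "tweets"}, {"id": "1"}),
    ({"solr.collection": "logs"}, {"id": "2"}),
    ({"solr.collection": "tweets"}, {"id": "3"}),
]
batches = batch_by_collection(flowfiles)
# {"tweets": [{"id": "1"}, {"id": "3"}], "logs": [{"id": "2"}]}
```

Merging without this grouping would mix documents for different collections into one update.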
  • 28. Use Cases – Log Aggregation 1. Listen for log events over UDP on a given port • Set ‘Flow File Per Datagram’ to true 2. Send JSON log events • Syslog UDP forwarding • Logback/log4j UDP appenders 3. Merge JSON events together based on size, or time 4. Stream JSON update to Solr http://bryanbende.com/development/2015/05/17/collecting-logs-with-apache-nifi/
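Step 3 trades latency for fewer, larger update requests. The sketch below illustrates a "count or age, whichever comes first" batching rule over timestamped events. It is a simplification and not how NiFi's merge processor is implemented: the function name, the two thresholds, and the use of event timestamps instead of a wall clock are all assumptions.

```python
import json


def merge_events(timed_events, max_count, max_age_seconds):
    """Group (timestamp, event) pairs into JSON array batches.

    A batch is closed when it holds max_count events, or when the next
    event arrives more than max_age_seconds after the batch's first event.
    """
    batches, current, started = [], [], None
    for ts, event in timed_events:
        if current and ts - started > max_age_seconds:
            batches.append(current)  # too old: close the batch before adding
            current = []
        if not current:
            started = ts             # first event of a new batch
        current.append(event)
        if len(current) >= max_count:
            batches.append(current)  # full: close the batch
            current = []
    if current:
        batches.append(current)
    return [json.dumps(batch) for batch in batches]


events = [(0, {"msg": "a"}), (1, {"msg": "b"}), (2, {"msg": "c"}),
          (30, {"msg": "d"}), (31, {"msg": "e"})]
payloads = merge_events(events, max_count=2, max_age_seconds=10)
```

Each payload is a JSON array, which matches the "add documents" form accepted by the JSON update handler.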
  • 29. Use Cases – Index Avro 1. Receive an Avro datafile with binary encoding 2. Convert Avro to JSON using built in ConvertAvroToJSON processor 3. Stream JSON documents to Solr
  • 30. Use Cases – Index a Relational Database 1. GenerateFlowFile acts as a timer to trigger ExecuteSQL (future plans to not require an incoming FlowFile for ExecuteSQL: NIFI-932) 2. ExecuteSQL performs a SQL query and streams the results as an Avro datafile Use expression language to construct a dynamic date range: ${now():toNumber():minus(60000):format('yyyy-MM-dd')} 3. Convert Avro to JSON using built in ConvertAvroToJSON processor 4. Stream JSON update to Solr
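The date expression in step 2 takes the current time, subtracts 60000 milliseconds, and formats the result as a calendar date. The same arithmetic in plain Python is sketched below, with a fixed "now" so the result is reproducible; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def one_minute_ago_as_date(now):
    """Subtract 60000 ms from 'now' and format the result as year-month-day."""
    return (now - timedelta(milliseconds=60000)).strftime("%Y-%m-%d")


# 30 seconds past midnight: one minute earlier falls on the previous day.
now = datetime(2015, 10, 1, 0, 0, 30, tzinfo=timezone.utc)
lower_bound = one_minute_ago_as_date(now)
# lower_bound == "2015-09-30"
```

The example shows why the subtraction has to happen before formatting: near midnight the offset changes the date itself.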
  • 31. Use Case – Extraction in a Cluster 1. Schedule GetSolr to run on Primary Node 2. Send results to a Remote Process Group pointing back to self 3. Data gets redistributed to “Solr XML Docs” Input Ports across cluster 4. Perform further processing on each node
  • 32. Future Work Unofficial ideas… PutSolrDocument • Parse FlowFile InputStream into one or more SolrDocuments • Allow developers to provide “FlowFile to SolrDocument” converter PutSolrAttributes • Create a SolrDocument from FlowFile attributes • Processor properties specify attributes to include/exclude Distribute & Execute Solr Commands • DistributeSolrCommand learns about Solr shards and produces commands per shard • ExecuteSolrCommand performs action based on the incoming command
  • 33. Summary Resources • Apache NiFi Mailing Lists – https://nifi.apache.org/mailing_lists.html • Apache NiFi Documentation – https://nifi.apache.org/docs.html • Getting started developing extensions – https://cwiki.apache.org/confluence/display/NIFI/Maven+Projects+for+Extensions – https://nifi.apache.org/developer-guide.html Contact Info: • Email: bbende@hortonworks.com • Twitter: @bbende
  • 34. Sources [1] https://nifi.apache.org/ [2] https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Index+Handlers [3] https://wiki.apache.org/solr/IntegratingSolr [4] http://lucidworks.com/blog/indexing-custom-json-data/
  • 35. Thank you