A quick introduction to Apache NiFi and its ecosystem, plus a hands-on demo of using processors, examining provenance, ingesting REST feeds, XML, cameras, and files, running TensorFlow and Apache MXNet, integrating with Spark and Kafka, and storing to HDFS, HBase, Phoenix, Hive, and S3.
2. © Cloudera, Inc. All rights reserved.
FLOW MANAGEMENT POWERED BY APACHE NIFI
• Ingestion: connectors to read/write data from/to several data sources
• Transformation:
  - Format conversion
  - Compression/decompression, merge, split, encryption, etc.
• Data enrichment
  - Attribute, content, rules, etc.
• Routing
  - Priority, dynamic/static, based on content or metadata, etc.
• Parsing
3. APACHE NIFI HIGH LEVEL CAPABILITIES
• Web-based user interface
  - Design, control, feedback & monitoring
• Highly configurable
  - Loss tolerant vs guaranteed delivery
  - Low latency vs high throughput
  - Dynamic prioritization
  - Flow can be modified at runtime
  - Back pressure
• Data provenance
  - Track dataflow from beginning to end
• Designed for extension
  - Build your own processors
• Secure
  - SSL, SSH, HTTPS, etc.
4. Why Apache NiFi?
• Guaranteed delivery
• Data buffering
  - Backpressure
  - Pressure release
• Prioritized queuing
• Flow specific QoS
  - Latency vs. throughput
  - Loss tolerance
• Data provenance
• Supports push and pull models
• Hundreds of processors
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
• Version Control
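The buffering ideas on this slide (backpressure, pressure release, prioritized queuing) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not NiFi code: the `Connection` class, its parameters, and the choice to model pressure release as dropping items older than a maximum age are all hypothetical.

```python
import heapq
import itertools


class Connection:
    """Toy model of a queue between two processing steps (hypothetical, not NiFi code)."""

    def __init__(self, backpressure_threshold, max_age_seconds):
        self.backpressure_threshold = backpressure_threshold
        self.max_age_seconds = max_age_seconds
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def is_full(self):
        return len(self._heap) >= self.backpressure_threshold

    def offer(self, item, priority, now):
        """Backpressure: refuse new data (return False) while the queue is full."""
        if self.is_full():
            return False
        heapq.heappush(self._heap, (priority, next(self._counter), now, item))
        return True

    def release_pressure(self, now):
        """Pressure release (assumed here to mean age-off): drop items that are too old."""
        kept = [entry for entry in self._heap if now - entry[2] <= self.max_age_seconds]
        dropped = len(self._heap) - len(kept)
        heapq.heapify(kept)
        self._heap = kept
        return dropped

    def poll(self):
        """Prioritized queuing: return the item with the lowest priority number, or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[3]
```

An upstream step that receives `False` from `offer` would stop producing until the queue drains, which is the essence of backpressure; `release_pressure` trades completeness for freshness when the queue cannot drain quickly enough.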
5. 285+ PROCESSORS FOR DEEPER ECOSYSTEM INTEGRATION
Hash
Extract
Merge
Duplicate
Scan
GeoEnrich
Replace
Convert
Split
Translate
Route Content
Route Context
Route Text
Control Rate
Distribute Load
Generate Table Fetch
Jolt Transform JSON
Prioritized Delivery
Encrypt
Tail
Evaluate
Execute
All Apache project logos are trademarks of the ASF and the respective projects.
Fetch
HTTP
Syslog
Email
HTML
Image
HL7
FTP
UDP
XML
SFTP
AMQP
WebSocket
6. Apache NiFi 1.9 Features
Key New Features
• Apache Kafka 2.0 support
• Apache Hive 3.1.0 support
• Connection load balancing
• MQTT performance improvements
Updated to 1.9.0
7. FLOW FILES ARE LIKE HTTP DATA
HTTP Data
Header:
HTTP/1.1 200 OK
Date: Sun, 10 Oct 2010 23:26:07 GMT
Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
ETag: "45b6-834-49130cc1182c0"
Accept-Ranges: bytes
Content-Length: 13
Connection: close
Content-Type: text/html
Content:
Hello world!
FlowFile
Header:
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'lineageStartDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'fileSize' Value: '23609'
FlowFile Attribute Map Content
Key: 'filename' Value: '15650246997242'
Key: 'path' Value: './'
Content:
Binary Content *
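The analogy on this slide (attributes are to a FlowFile what headers are to an HTTP message, and content is the body) can be sketched as a tiny data model. This is an illustrative sketch, not NiFi's actual FlowFile class: the `FlowFile` dataclass, the `from_http_response` helper, and the `http.status.line` attribute name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Toy model: attributes play the role of HTTP headers, content the role of the body."""
    attributes: dict = field(default_factory=dict)
    content: bytes = b""


def from_http_response(raw: bytes) -> FlowFile:
    """Split a raw HTTP response into header fields (attributes) and body (content)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    # The status line has no field name, so it is stored under a made-up key.
    attributes = {"http.status.line": lines[0]}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        attributes[name.strip().lower()] = value.strip()
    return FlowFile(attributes=attributes, content=body)
```

Keeping the key/value metadata separate from the opaque byte payload is what lets routing decisions be made on attributes without reading or parsing the content itself.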
8. SQL BASED ROUTING WITH NiFi's QueryRecord Processor
• QueryRecord processor: executes a SQL statement against records and writes the results to the flow file content.
• CSVReader: looks up the schema from the schema registry (SR) and converts CSV records into ProcessRecords.
• SQL execution via Apache Calcite: executes the configured SQL against the ProcessRecords for routing.
• CSVRecordSetWriter: converts the query result from ProcessRecords back into CSV for the flow file content.
Why should you care?
Do routing (for example, routing geo and speed streams) using standard SQL as opposed to complex regular expressions.
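The read, query, and write steps described on this slide can be imitated outside NiFi to show the idea of SQL-based routing. The sketch below is an assumption-laden stand-in: it uses Python's built-in `sqlite3` and `csv` modules rather than Apache Calcite and NiFi's record readers and writers, and the `route_csv` function, the sample columns, and the `speeding` route name are hypothetical. The table name `FLOWFILE` follows the convention QueryRecord uses for the incoming records.

```python
import csv
import io
import sqlite3


def route_csv(csv_text, schema, routes):
    """Run each named SQL query over the CSV records; return one CSV string per route."""
    columns = list(schema)
    conn = sqlite3.connect(":memory:")
    # Declared column types stand in for the schema a reader would look up.
    column_defs = ", ".join('"%s" %s' % (name, sql_type) for name, sql_type in schema.items())
    # The incoming records are exposed as a table named FLOWFILE.
    conn.execute("CREATE TABLE FLOWFILE (%s)" % column_defs)
    placeholders = ", ".join("?" for _ in columns)
    for record in csv.DictReader(io.StringIO(csv_text)):
        conn.execute(
            "INSERT INTO FLOWFILE VALUES (%s)" % placeholders,
            [record[name] for name in columns],
        )
    results = {}
    for route_name, sql in routes.items():
        cursor = conn.execute(sql)
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow([col[0] for col in cursor.description])  # header row
        writer.writerows(cursor.fetchall())
        results[route_name] = out.getvalue()
    conn.close()
    return results


csv_text = "driver,speed\nalice,95\nbob,60\n"
schema = {"driver": "TEXT", "speed": "INTEGER"}
routes = {"speeding": "SELECT driver, speed FROM FLOWFILE WHERE speed > 80"}
routed = route_csv(csv_text, schema, routes)
```

Each entry in `routes` plays the role of one outgoing relationship: the query both selects which records go there and shapes the columns they carry, which is the part that would otherwise need regular expressions over raw text.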
16. NiFi Positioning
Apache NiFi / MiNiFi
ETL (Informatica, etc.)
Enterprise Service Bus (Fuse, Mule, etc.)
Messaging Bus (Kafka, MQ, etc.)
Processing Framework (Storm, Spark, etc.)
17. Apache NiFi / Processing Frameworks
NiFi
Simple event processing
• Primarily feeds data into processing frameworks; can process data, with a focus on simple event processing
• Operates on a single piece of data, or in correlation with an enrichment dataset (enrichment, parsing, splitting, and transformations)
• Can scale out, but scales up better to take full advantage of hardware resources by running concurrent processing tasks/threads (processing terabytes of data per day on a single node)
✗ Not another distributed processing framework, but a way to feed data into those frameworks
Processing Frameworks (Flink, Kafka Streams, Storm, Spark, etc.)
Complex and distributed processing
• Complex processing from multiple streams (JOIN operations)
• Analyzing data across time windows (rolling window aggregation, standard deviation, etc.)
• Scale out to thousands of nodes if needed
✗ Not designed to collect data or manage dataflow
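The contrast drawn on this slide between simple event processing and windowed analysis can be made concrete with a small sketch. It is illustrative only and makes several assumptions: the `enrich` function, the `RollingWindow` class, and the field names are hypothetical, and the window is count-based for brevity where a real processing framework would typically use time-based windows.

```python
from collections import deque
from statistics import mean, pstdev


def enrich(event, lookup):
    """Simple event processing: handle one event using only a small enrichment table."""
    return {**event, "region": lookup.get(event["sensor"], "unknown")}


class RollingWindow:
    """Windowed processing: each result depends on the last `size` values seen."""

    def __init__(self, size):
        self._values = deque(maxlen=size)  # older values fall out automatically

    def add(self, value):
        """Add a value and return (mean, population standard deviation) of the window."""
        self._values.append(value)
        return mean(self._values), pstdev(self._values)
```

`enrich` needs no state between events, so it parallelizes trivially on one node; `RollingWindow` must remember earlier events, which is the kind of stateful work the slide assigns to processing frameworks.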
18. Apache NiFi / Messaging Bus Services
NiFi
Provide dataflow solution
• Centralized management, from edge to core
• Great traceability, event level data provenance starting when data is born
• Interactive command and control, real time operational visibility
• Dataflow management, including prioritization, back pressure, and edge intelligence
• Visual representation of global dataflow
✗ Not a messaging bus, flow maintenance needed when you have frequent consumer side updates
Messaging Bus (Kafka, JMS, etc.)
Provide messaging bus service
• Low latency
• Great data durability
• Decentralized management (producers & consumers)
• Low broker maintenance for dynamic consumer side updates
✗ Not designed to solve dataflow problems (prioritization, edge intelligence, etc.)
✗ Traceability limited to in/out of topics, no lineage
✗ Lack of global view of components/connectivities
19. Apache NiFi / Integration, or Ingestion, Frameworks
NiFi
End user facing dataflow management tool
• Out of the box solution for dataflow management
• Interactive command and control in the core, design and deploy on the edge
• Flexible failure handling at each point of the flow
• Visual representation of global dataflow and connectivities
• Native cross data center communication
• Data provenance for traceability
✗ Not a library to be embedded in other applications
Integration framework (Spring Integration, Camel, etc.), ingestion framework (Flume, etc.)
Developer facing integration tool with a focus on data ingestion
• A set of tools to orchestrate workflow
• A fixed design and deploy pattern
• Leverage messaging bus across disconnected networks
✗ Developer facing, custom coding needed to optimize
✗ Pre-built failure handling, lack of flexibility
✗ No holistic view of global dataflow
✗ No built-in data traceability
20. Apache NiFi / ETL Tools
NiFi
NOT schema dependent
• Dataflow management for both structured and unstructured data, powered by separation of metadata and payload
• Schema is not required, but you can have a schema
• Minimum modeling effort, just enough to manage dataflows
• Does the plumbing job, freeing developers' brainpower for creative work
✗ Not designed to do heavy lifting transformation work for DB tables (JOIN datasets, etc.). You can create custom processors to do that, but there is a long way to go to catch up with existing ETL tools from a user experience perspective (GUI for data wrangling, cleansing, etc.)
ETL (Informatica, etc.)
Schema dependent
• Tailored for databases/WH
• ETL operations based on schema/data modeling
• Highly efficient, optimized performance
✗ Must prepare your data in advance; time consuming to build data models and maintain schemas
✗ Not geared towards handling unstructured data, PDF, audio, video, etc.
✗ Not designed to solve dataflow problems
21. NiFi and Kafka Are Complementary
NiFi
Provide dataflow solution
• Centralized management, from edge to core
• Great traceability, event level data provenance starting when data is born
• Interactive command and control, real time operational visibility
• Dataflow management, including prioritization, back pressure, and edge intelligence
• Visual representation of global dataflow
✗ Requires adding/removing processors according to consumer-side updates
+
Kafka
Provide durable stream store
• Low latency
• Distributed data durability
• Decentralized management of producers & consumers
✗ Not optimized to manage dataflows (prioritization, enrichment, protocols, formats, event level authorizations, objects with various sizes, etc.)