2. Topics we will cover
DataFlow and its problems.
What is Apache NiFi – History, key features, core components
Architecture to start with NiFi (single-server setup)
Architecture to scale with NiFi (NiFi cluster setup)
Fundamentals of NiFi Web UI
Building a NiFi DataFlow Processor
Live demo
Testing
Deployment and automation
What next?
Q&A
3. DataFlow
The term “DataFlow” can be used in a variety of contexts.
In our context it is the flow of information between systems.
It is crucial to have a robust platform to create, manage and automate the
flow of enterprise data.
There are many tools for data gathering and data flow, but more often
than not we lack an integrated platform for them.
Ideally, we would have a single platform that handles the entire flow seamlessly.
4. What enterprises look for
To be able to get data from any source
… To the systems that perform analytics
… And to those that make the data available to users
5. Common DataFlow challenges
System failure
Mismatch between the rates of data production and consumption
Dynamically changing data priorities
Protocols and format changes; new systems, new protocols
Need of bidirectional data flow
Transparency and control
Security and privacy
6. Brief history of Apache NiFi
Developed at the NSA (National Security Agency, USA) over a period of 8 years.
Onyara’s engineers, while at the NSA, developed a project called “Niagara
Files”, which later went on to become NiFi.
Through the NSA Technology Transfer Program it was made available as the open
source Apache project “Apache NiFi” in the year 2014.
Hortonworks has a partnership with Onyara on their “Hortonworks DataFlow
powered by Apache NiFi”
7. What is Apache NiFi
Holistically Apache NiFi is an integrated platform to collect, conduct and
curate real-time data (data in motion).
Provides end-to-end DataFlow management from any source* to any
destination*.
Provides data logistics – real-time operational visibility and control of
DataFlow.
Supports powerful and scalable directed graphs of data routing and data
transformation.
All these in a reliable and secure manner.
*complete list of source and destination on official documentation
8. Key features
Guaranteed data delivery – “at least once” semantics
Data buffering and Back pressure
Data prioritization in queue
Flow specific setting for “latency vs. throughput”
Data provenance
Visual control
Flow templates
Recovery/ Recording through content repository
Clustering to scale-out
Security
Classloader Isolation
9. Core components of NiFi
At its core, NiFi follows the concept of Flow-Based Programming.
Core components of NiFi are
FlowFile – the unit of information packet
FlowFile Processor – the processing engine; black box.
Connection – the link between Processors; acts as a bounded buffer.
Flow Controller – the scheduler; allocates threads and resources to Processors.
Process Group – the compact function or subnet
18. Building a DataFlow Processor
Setting up mandatory and optional ‘PROPERTIES’
19. Building a DataFlow Processor
Auto alert mechanism
If there is a configuration error, it will not allow the processor to start
20. Building a DataFlow Processor
If everything is set, we are ready to initiate/ start the processor
21. Demo 1
In this demo, we will go through a NiFi DataFlow that deals with the
following steps
Connect to Kafka and consume from a topic.
Store consumed data in local storage (optional).
Anonymize IP address.
Merge content before writing to HDFS (small file issues).
Finally store Kafka data onto HDFS
Look into error handling.
Look into use of expression language.
23. Demo 2
In this demo, we will go through a NiFi DataFlow that deals with the
following steps
Collect/ fetch data files from a local location.
Update/ add attributes.
Parse JSON strings to DB Insert statements.
Connect to PostgreSQL and Insert.
Error handling.
25. Unit testing components
For component testing, the nifi-mock module can be used with JUnit.
The TestRunner interface allows us to test Processors and Controller Services.
We need to instantiate a new TestRunner (org.apache.nifi.util)
Add Controller Services and configure them
Set Processor properties with setProperty(PropertyDescriptor, String)
Enqueue FlowFiles using the enqueue methods of the TestRunner class.
The Processor can be started by calling the run() method of TestRunner.
Validate output using the TestRunner’s assertAllFlowFilesTransferred and
assertTransferCount methods.
More details can be found here – https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html#testing
26. Add Maven dependency
Call static newTestRunner method of the TestRunners class
Call addControllerService method to add controller
Set properties by setProperty(ControllerService, PropertyDescriptor, String)
Enable services by enableControllerService(ControllerService)
Set processor property setProperty(PropertyDescriptor, String)
Enqueue input using the overloaded enqueue methods (byte[], InputStream, or Path).
Call run(int); this invokes any @OnScheduled methods, then the Processor’s
onTrigger method, and finally the @OnUnscheduled and @OnStopped methods.
Validate result by assertAllFlowFilesTransferred and assertTransferCount methods.
Access FlowFiles by calling getFlowFilesForRelationship() method
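Putting the steps above together, a minimal JUnit test could look like the following sketch. It uses the nifi-mock TestRunner API described above, with the standard ReplaceText processor standing in as an illustrative processor under test (the nifi-mock, nifi-standard-processors, and JUnit Maven dependencies are assumed).

```java
import org.apache.nifi.processors.standard.ReplaceText;
import org.apache.nifi.util.TestRunner;
import org.apache.nifi.util.TestRunners;
import org.junit.Test;

public class ReplaceTextProcessorTest {

    @Test
    public void testSimpleReplacement() {
        // 1. Instantiate a TestRunner for the processor under test
        final TestRunner runner = TestRunners.newTestRunner(ReplaceText.class);

        // 2. Set processor properties
        runner.setProperty(ReplaceText.SEARCH_VALUE, "hello");
        runner.setProperty(ReplaceText.REPLACEMENT_VALUE, "world");

        // 3. Enqueue an input FlowFile (overloaded for byte[], InputStream, Path)
        runner.enqueue("hello nifi".getBytes());

        // 4. Trigger the processor once
        runner.run(1);

        // 5. Validate that one FlowFile went to the success relationship
        runner.assertAllFlowFilesTransferred(ReplaceText.REL_SUCCESS, 1);
    }
}
```

The same pattern extends to Controller Services: call addControllerService and enableControllerService on the runner before the first run, as listed in the steps above.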
27. Error handling
The following failures can occur:
Unexpected data format
Network connection or disk failure
Bug in a processor
Exceptions fall into two classes – ProcessException, and all others (like a null pointer):
ProcessException – rollback and penalize the FlowFiles
All others – rollback, penalize the FlowFiles, and yield the Processor
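As a sketch, this convention looks like the fragment below inside a processor’s onTrigger method; the surrounding class and relationship definitions are omitted, and the IOException catch is just an illustrative “expected failure”.

```java
// Fragment of a hypothetical Processor's onTrigger (class omitted).
@Override
public void onTrigger(final ProcessContext context, final ProcessSession session)
        throws ProcessException {
    FlowFile flowFile = session.get();
    if (flowFile == null) {
        return;
    }
    try {
        // ... read and transform the FlowFile content ...
        session.transfer(flowFile, REL_SUCCESS);
    } catch (final IOException e) {
        // Expected failure: wrap it in ProcessException so the framework
        // rolls back the session and penalizes the FlowFile.
        throw new ProcessException("Could not process " + flowFile, e);
    }
    // Any other RuntimeException (e.g. NullPointerException) propagates
    // uncaught: the framework rolls back, penalizes the FlowFiles, and
    // additionally yields the Processor.
}
```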
28. Testing automation, Deployment
NiFi provides a REST API for all components, and the entire documentation can
be found here – https://nifi.apache.org/docs/nifi-docs/rest-api/index.html
The Apache NiFi community is working to improve this area.
We can set up the deployment in the following way:
Build the application, i.e. the entire DataFlow, on your local machine and test it.
Create a process group around it (optional).
Create a template. (Can be done from the Web UI/ REST API)
Download the template. (Can be done from the Web UI/ REST API)
Use a REST API call to import the template into the new environment.
Use REST API calls to update Processors (properties, scheduling, settings, etc.)
Use a REST API call to instantiate the template
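The final template-instantiation call from the list above can be sketched with the JDK’s built-in HTTP client (Java 11+); the host, process-group id, and template id are placeholders for your environment, and the endpoint path follows the NiFi REST API documentation linked above.

```java
import java.net.URI;
import java.net.http.HttpRequest;

public class InstantiateTemplate {

    // Builds the REST call that instantiates an imported template inside a
    // process group. Host, process-group id, and template id are placeholders.
    static HttpRequest buildRequest(String host, String pgId, String templateId) {
        // Body shape per the NiFi REST API's InstantiateTemplateRequestEntity:
        // the template id plus the canvas origin where it should be placed.
        String body = String.format(
                "{\"templateId\":\"%s\",\"originX\":0.0,\"originY\":0.0}", templateId);
        return HttpRequest.newBuilder()
                .uri(URI.create("http://" + host + "/nifi-api/process-groups/"
                        + pgId + "/template-instance"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest req = buildRequest("localhost:8080", "root", "my-template-id");
        // prints: POST http://localhost:8080/nifi-api/process-groups/root/template-instance
        System.out.println(req.method() + " " + req.uri());
        // To actually send it (requires a running NiFi instance):
        // java.net.http.HttpClient.newHttpClient()
        //         .send(req, java.net.http.HttpResponse.BodyHandlers.ofString());
    }
}
```

A secured cluster would additionally need authentication (e.g. a bearer token header); the sketch assumes an unsecured single-node setup.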
29. Deployment
There is one more option:
Copy the whole flow (flow.xml.gz) from one environment to another.
The entire canvas has to be copied.
Care must be taken with sensitive-property encryption.
30. What is next
We are planning to keep working on the testing and deployment side and will update this material.
Please read more on NiFi development here –
https://nifi.apache.org/docs/nifi-docs/html/developer-guide.html
And the user guide – https://nifi.apache.org/docs/nifi-docs/html/user-guide.html
We have carried out POCs on some of our real use cases; please find them
here
Link HDFS data ingestion using Apache
Link How to setup Apache NiFi
Link Expression Language Guide
For any questions and/ or suggestions, please come by or write to us.