I will talk about the experience of building a big-data system on top of the open-source Apache NiFi and Kubernetes, using the analysis of news sources with NLP as an example.
DevOps Fest 2020. Даніель Яворович. Data pipelines: building an efficient instrument to make custom workflow
1. Data pipelines: building an efficient instrument to create a custom workflow
Speaker: Daniel Yavorovych
DevOpsFest2020
2. Daniel Yavorovych
CTO & Co-Founder at Dysnix
10+ years of *nix systems administration;
5+ years of DevOps, SRE;
7+ years in the development of cloud solution
architectures and HL/HA infrastructures;
7+ years in the development of high-performance servers
(Python / Golang).
3. Real-Time Data Pipelines
Processing
When is it needed?
Why is this a problem?
Real-time processing is needed for continuously arriving data -
for example from Twitter, media news, email, etc.
Most solutions for working with Data Pipelines imply working
in Batch mode. There are only a few alternatives which will be
discussed further.
4. Data Pipeline Solutions
Google Cloud Dataflow
Batch and Stream modes!
Fully integrated with AutoML, Pub/Sub and other
GCP components
Vendor lock-in
It is not expensive, but it costs more than
self-hosted solutions
5. Data Pipeline Solutions
Apache Airflow
Open Source & No vendor lock-in
User Interface for visualizing Data Pipelines and
Processing
Support of various executors (Apache Spark,
Celery, Kubernetes)
No Stream Mode
6. Data Pipeline Solutions
Luigi
Open Source & No vendor lock-in
Not very scalable: you need to split tasks into
projects for parallel execution
User Interface
Hard to use: DAG tasks cannot be viewed before
execution, and viewing logs is difficult
7. Data Pipeline Solutions
argoproj/argo-events
Open source & No vendor lock-in
No User Interface
Real-time mode
Kubernetes-native solution
20+ event sources
Argo workflow support:
- container-native
- workflow engine
New and immature
8. Data Pipeline Solutions
Apache NiFi
Open Source & No vendor lock-in
Difficult integration with Kubernetes
Real-time mode
Flexible & User-friendly Interface for
viewing Data Pipeline and Processing
Highly scalable
Lots of native Processors available
9. We chose NiFi because:
the number of native Processors
available
NiFi provides many ready-made Processors -
from the Twitter API and Slack to TCP and HTTP
servers, S3, GCS, and Google Pub/Sub (there
are about 300 of them)
10. We chose NiFi because:
Custom Scripts
Have you ever run out of Processors? Write your
own Processor in one of several convenient
languages: Clojure, ECMAScript, Groovy, Lua,
Python, Ruby.
Will it work faster?
I rewrote some logic in Python to replace
several chained NiFi Processors, and it
actually worked faster...
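NiFi's ExecuteScript processor runs such scripts with `session` / `flowFile` bindings that only exist inside NiFi, so as a standalone sketch here is the kind of content transform you might embed - a toy function (all field names and the sample payload are illustrative) that strips a news item down to the fields downstream Processors need:

```python
import json

def transform_news_item(raw: bytes) -> bytes:
    """Toy transform in the spirit of a custom NiFi Processor:
    keep only the fields that downstream Processors need."""
    item = json.loads(raw)
    slim = {
        "title": item.get("title", ""),
        "source": item.get("source", ""),
        "published": item.get("published", ""),
    }
    return json.dumps(slim).encode("utf-8")

# Inside NiFi's ExecuteScript the same logic would read the FlowFile
# content via session.read(...) and write it back via session.write(...).
raw = b'{"title": "K8s 1.18 released", "source": "cncf", "published": "2020-03-25", "body": "..."}'
print(transform_news_item(raw))
```

Collapsing several small Processors into one script like this is often where the speedup mentioned above comes from: fewer FlowFile hops, fewer queues.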
11. We chose NiFi because:
the possibility to change data flows
& queues in real time
You can stop a Processor or a group of
Processors at any time, make your changes,
and start it again.
Meanwhile, all other Processors that do
not depend on the stopped ones will continue
working. This lets you stop exactly those
Processors that have errors or simply
need changes.
All incoming messages are held in the NiFi queue in the meantime
12. We chose NiFi because:
NiFi Registry
NiFi Registry is a central location for the
storage and management of shared resources
across one or more instances of NiFi and/or
MiNiFi.
This allows you not only to share NiFi
Processors and Process Groups between
instances but also to version your
work (similar to Git), so you can always roll
back to one of the previous versions.
13. We chose NiFi because:
Templates
NiFi templates allow you to export your entire
data flow to an XML file with a few
clicks - as a backup, or to hand it off to another
developer. Templates can also be used as a base for
presets (we'll talk about this later)
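Because a template is plain XML, it is also scriptable. A minimal sketch - the snippet below uses a heavily simplified stand-in for a real export (actual template files carry much more metadata per processor) - lists the processor types a template contains:

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for an exported NiFi template (real exports are
# much larger, but processors appear under <snippet> like this).
TEMPLATE_XML = """
<template>
  <name>news-ingest</name>
  <snippet>
    <processors><name>GetTwitter</name><type>org.apache.nifi.processors.twitter.GetTwitter</type></processors>
    <processors><name>PutSlack</name><type>org.apache.nifi.processors.slack.PutSlack</type></processors>
  </snippet>
</template>
"""

def list_processors(template_xml: str):
    """Return the processor class names referenced by a template."""
    root = ET.fromstring(template_xml)
    return [p.findtext("type") for p in root.iter("processors")]

print(list_processors(TEMPLATE_XML))
```

This kind of inspection is handy when reviewing a template someone hands you before importing it into a shared canvas.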
14. We chose NiFi because:
External Auth & Users/Groups
NiFi has flexible permission management
for Users / Groups.
Permissions can be set both for operations
(viewing / editing the Flow) and for specific objects
(Processors / Process Groups).
NiFi also supports external authentication:
OpenID Connect, LDAP, Kerberos. For example,
we integrated Keycloak to store user data
in one place.
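Wiring NiFi to an OpenID provider such as Keycloak comes down to a few `nifi.properties` entries. A sketch, assuming a Keycloak realm named `nifi` (host, realm, and secret below are illustrative):

```
# nifi.properties - OpenID Connect login via Keycloak
# (hostname and realm are examples, not real endpoints)
nifi.security.user.oidc.discovery.url=https://keycloak.example.com/auth/realms/nifi/.well-known/openid-configuration
nifi.security.user.oidc.client.id=nifi
nifi.security.user.oidc.client.secret=<client-secret>
```

The client id and secret come from the client you register for NiFi in Keycloak's admin console.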
18. NiFi Scalability: Multiple
Clusters
If even 10 nodes are not enough because
you are limited by network bandwidth,
you can build several NiFi clusters and
connect them through Remote Process
Groups.
19. NiFi & Kubernetes
Existing solutions:
https://medium.com/swlh/operationalising-nifi-on-kubernetes-1a8e0ae16a6c
https://hub.helm.sh/charts/cetic/nifi
https://community.cloudera.com/t5/Community-Articles/Deploy-NiFi-On-Kubernetes/ta-p/269758
The last Helm chart was the most relevant, so we took it as a basis
21. Tips & Tricks
Use Kafka or any message bus in front of NiFi: if NiFi fails, data safety
must still be guaranteed.
Although NiFi has a visual editor and a bunch of Processors, flows must
still be built by a technically competent engineer; otherwise the data flow
can be destabilized.
For unpredictable inputs, use a rate-limiting Processor.
Use NiFi Registry - it will always allow you to roll back!
Don’t try to use only native NiFi Processors: sometimes it's too complicated,
and it's easier to write a couple of lines of Python.
Don’t gloss over mistakes! In NiFi you can handle errors the same way as
regular data - route them to Slack or use them for your own
purposes.
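The rate-limiting tip above (NiFi ships a ControlRate processor for this) boils down to a token bucket. A minimal standalone sketch of the idea - the class and its parameters are illustrative, not NiFi code:

```python
import time

class TokenBucket:
    """Token-bucket limiter: allow roughly `rate` events per second,
    with bursts of up to `capacity` events."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        # Refill tokens proportionally to elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# With a frozen clock, a capacity-3 bucket passes 3 events and rejects the 4th.
bucket = TokenBucket(rate=1.0, capacity=3.0, clock=lambda: 0.0)
print([bucket.allow() for _ in range(4)])  # [True, True, True, False]
```

In NiFi the same effect is achieved declaratively by placing the rate-limiting processor in front of the unpredictable input, with rejected FlowFiles simply waiting in the queue.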
23. Conclusion
NiFi proved to be not only good for rapid prototyping of Data Pipeline flows
but also a good basis for scalable, high-load ELT systems
Among free self-hosted solutions, NiFi is the most modern and the most
actively developed
Configuring a NiFi cluster in Kubernetes did not look like a trivial task,
but after overcoming some difficulties, the resulting ready-to-use solution
meets all the requirements
NiFi is flexible - it does not try to do everything itself, and used properly
it delivers very good results even on really big projects of a similar
kind
24. Dysnix Open Source
github.com/dysnix
Helm charts
Cryptocurrency nodes docker images
Prometheus exporters
Grafana dashboards
Terraform for Blockchain-ETL (project for Google Cloud Platform)
25. Daniel Yavorovych
CTO & Co-Founder at Dysnix
daniel@dysnix.com
https://www.linkedin.com/in/daniel-yavorovych/
Questions?