Data pipelines: building an efficient instrument to create a custom workflow
Speaker: Daniel Yavorovych
DevOps Fest 2020
Daniel Yavorovych
CTO & Co-Founder at Dysnix
10+ years of *nix systems administration;
5+ years of DevOps, SRE;
7+ years developing cloud solution architectures and high-load / high-availability (HL / HA) infrastructures;
7+ years developing high-performance servers (Python / Golang).
Real-Time Data Pipelines Processing
When is it needed?
Why is this a problem?
Real-time processing is needed when data arrives continuously,
for example from Twitter, news media, email, etc.
Most Data Pipeline solutions work only in Batch mode.
There are just a few alternatives, which are discussed below.
Data Pipeline Solutions
Google Cloud Dataflow
Batch and Stream modes!
Fully integrated with AutoML, Pub/Sub and other
GCP components
Vendor lock-in
It is not expensive, but it costs more than
self-hosted solutions
Data Pipeline Solutions
Apache Airflow
Open Source & No vendor lock-in
User Interface for visualizing Data Pipelines and
Processing
Support of various executors (Apache Spark,
Celery, Kubernetes)
No Stream Mode
Data Pipeline Solutions
Luigi
Open Source & No vendor lock-in
Not very scalable: you need to split tasks into
projects for parallel execution
User Interface
Hard to use: DAG tasks cannot be viewed before
execution, and viewing logs is difficult
Data Pipeline Solutions
argoproj/argo-events
Open source & No vendor lock-in
No User Interface
Real-time mode
Kubernetes-native solution
20+ event sources
Argo Workflows support (a container-native workflow engine)
New and still immature
Data Pipeline Solutions
Apache NiFi
Open Source & No vendor lock-in
Difficult integration with Kubernetes
Real-time mode
Flexible & User-friendly Interface for
viewing Data Pipeline and Processing
Highly scalable
Lots of native Processors available
We choose NiFi because: the number of native Processors available
NiFi provides many ready-made Processors - from the
Twitter API and Slack to TCP and HTTP servers, S3, GCS
and Google Pub/Sub (there are about 300 of them).
We choose NiFi because: Custom Scripts
Not enough Processors? Write your own in one of the
supported scripting languages: Clojure, ECMAScript,
Groovy, Lua, Python, Ruby.

Will it work faster?

I rewrote some flows in Python, replacing several NiFi
Processors with a single script, and they began working
even faster...
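As a rough illustration of replacing several Processors with one script, the sketch below does three things that might otherwise be separate steps in a flow: it parses a JSON event, keeps only the needed fields, and adds a derived field. The field names (`user`, `text`, `lang`, `length`) and the function name `transform` are assumptions made up for this example, not taken from the talk. Inside NiFi such a function would be called from a scripting Processor such as ExecuteScript; that wiring is not shown here. The code avoids type annotations so that it would also run under a Python 2.7-compatible engine such as Jython.

```python
import json


def transform(payload):
    """Parse a JSON event, drop unneeded fields, and add a derived field.

    Hypothetical example: the field names are illustrative assumptions.
    """
    event = json.loads(payload)
    slim = {
        "user": event.get("user", "unknown"),
        "text": event.get("text", "").strip(),
        "lang": event.get("lang", "und"),
    }
    # Derived field that a separate attribute-update step might otherwise compute.
    slim["length"] = len(slim["text"])
    return json.dumps(slim, sort_keys=True)


if __name__ == "__main__":
    print(transform('{"user": "alice", "text": "  hello NiFi  ", "extra": 1}'))
```

Keeping the logic in a plain function like this also makes it easy to unit-test outside NiFi before it is pasted into a flow.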
We choose NiFi because: the possibility to change data flows & queues in real time

You can stop a Processor or a group of Processors at any
time, make some changes, and start it again.

Meanwhile, all other Processors that do not depend on the
stopped ones continue working. This lets you stop only
those Processors that have errors or need changes.

While a Processor is stopped, all incoming messages are
held in the NiFi queue.
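The queueing behaviour described above can be modelled with a toy example. This is not the NiFi API: `Stage` is a hypothetical stand-in for a Processor with an input queue, and it only shows the idea that upstream stages keep running while a stopped stage accumulates messages until it is started again.

```python
from collections import deque


class Stage:
    """A toy stand-in for a Processor with an input queue (not the NiFi API)."""

    def __init__(self, name, func):
        self.name = name
        self.func = func
        self.queue = deque()   # messages waiting for this stage
        self.running = True
        self.downstream = None

    def offer(self, message):
        # Upstream always enqueues, whether or not this stage is running.
        self.queue.append(message)

    def run_once(self):
        # A stopped stage leaves its queue untouched.
        if not self.running:
            return
        while self.queue:
            result = self.func(self.queue.popleft())
            if self.downstream is not None:
                self.downstream.offer(result)


if __name__ == "__main__":
    parse = Stage("parse", str.strip)
    enrich = Stage("enrich", str.upper)
    sink = Stage("sink", lambda m: m)
    parse.downstream, enrich.downstream = enrich, sink
    sink.running = False              # keep results in sink.queue for inspection

    enrich.running = False            # stop one stage to change it
    for msg in [" a ", " b "]:
        parse.offer(msg)
    parse.run_once()                  # upstream keeps working
    print(len(enrich.queue))          # messages wait in the stopped stage's queue

    enrich.running = True             # start it again
    enrich.run_once()
    print(list(sink.queue))
```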
We choose NiFi because: NiFi Registry

NiFi Registry is a central location for the
storage and management of shared resources
across one or more instances of NiFi and/or
MiNiFi.

This allows you not only to switch between
versions of NiFi Processors and Process
Groups, but also to version your work
(similar to Git) and always be able to roll
back to one of the previous versions.
We choose NiFi because: Templates

NiFi templates let you export your whole data
flow to an XML file with a few keystrokes, as a
backup or to hand off to another developer.
A template can also be used as a base for
presets (we'll talk about this later).
We choose NiFi because: External Auth & Users/Groups
NiFi has flexible support for assigning
permissions to Users and Groups.

Permissions can be set both for operations
(viewing / editing a Flow) and for specific objects
(Processors / Process Groups).

NiFi also supports external authentication
(there is even support for OpenID Connect).
For example, we integrated
Keycloak to store user data in one place.
LDAP
Kerberos
NiFi Arch
NiFi Arch: Cluster Mode
NiFi Scalability
Source: bit.ly/nifi-limits
Horizontal scaling
There is no limit on the number of nodes in a
single cluster (only node hardware limits
and limits of network performance).
It's easy to join a new node to a
running cluster.
NiFi Scalability: Multiple Clusters
If 10 nodes are not enough because you are
limited by network bandwidth, you can build
several NiFi clusters and connect them through
Remote Process Groups.
NiFi & Kubernetes
Existing solutions:
https://medium.com/swlh/operationalising-nifi-on-kubernetes-1a8e0ae16a6c
https://hub.helm.sh/charts/cetic/nifi
https://community.cloudera.com/t5/Community-Articles/Deploy-NiFi-On-Kubernetes/ta-p/269758
The last Helm Chart was the most relevant and we took it as a basis
Helm chart
NiFi Registry
Grafana dashboard & Prometheus metrics (Grafana Dashboard ID: 12375)
Predefined NiFi Flow
Tips & Tricks
Use Kafka or any other message bus. If NiFi fails, your data must still be
safe.
Although NiFi has a visual editor and a bunch of Processors, flows must be
built by a technically competent engineer; otherwise the data flow can be
destabilized.
For unpredictable inputs, use a rate-limiting Processor.
Use NiFi Registry - it will always allow you to roll back!
Don't try to use only native NiFi Processors: sometimes that is too complicated,
and it is easier to write a couple of lines in Python.
Don't gloss over errors! In NiFi you can handle errors the same way as
regular data: send them to Slack or use them for your own
purposes.
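To make the rate-limiting tip concrete, here is a minimal token-bucket sketch in Python. It illustrates the general technique only and is not how NiFi implements it; in a real flow you would normally use a built-in Processor such as ControlRate. The class name `TokenBucket` and its parameters are assumptions for this example.

```python
import time


class TokenBucket:
    """Allow at most `rate` messages per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = float(rate)
        self.capacity = float(capacity)
        self.clock = clock            # injectable so the logic can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=2)
    # A sudden burst of 10 messages: only about `capacity` of them pass at once.
    passed = sum(1 for _ in range(10) if bucket.allow())
    print(passed)
```

Messages that are not allowed through would typically stay queued and be retried later rather than dropped, which matches the queue-based behaviour described earlier in the talk.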
Production Architecture Example
Conclusion
NiFi proved to be good not only for rapid prototyping of data pipeline flows
but also as a basis for scalable, heavily loaded ELT systems.
Of all the free self-hosted solutions, NiFi is the most
modern and actively developed.
Configuring a NiFi cluster in Kubernetes was not a trivial task,
but after working through some difficulties, this ready-to-use solution meets all the
requirements.
NiFi is flexible - it does not tie everything to itself, and used properly
it gives very good results in supporting really big projects of this
kind.
Dysnix Open Source
github.com/dysnix

Helm charts
Cryptocurrency nodes docker images
Prometheus exporters
Grafana dashboards
Terraform for Blockchain-ETL (project for Google Cloud Platform)
Daniel Yavorovych
CTO & Co-Founder at Dysnix
daniel@dysnix.com
https://www.linkedin.com/in/daniel-yavorovych/
Questions?
