SlideShare ist ein Scribd-Unternehmen logo
1 von 68
Downloaden Sie, um offline zu lesen
The art of data
engineering
Andrii Soldatenko
TVTime
The art of
Data engineering
@a_soldatenko
14 December 2019,
Remote :)
Andrii Soldatenko
• Senior Data Engineer TVTime
• Apache Airflow contributor
• Public Speaker at many conferences, committee
member of different PyCons
• blogger at https://asoldatenko.com
@a_soldatenko
@a_soldatenko
“…The Web was done
by amateurs” 🎉

- Alan Kay
@a_soldatenko
TL;DR
Data Scientist
Data Engineer
Data Engineering:
A quick definition
Twitter: 600 million tweets per day.
Facebook: 600 terabytes of
incoming data each day, from 1.6
billion active users.
Google: 3.5 billion search
queries per day.
Instagram: 52 million
new photos per day
What are data
engineers focused on?
How to Become a Data
Engineer? 👀
What skills do data
engineers need?
Possible architecture
for a data system
Cron + bash script 🔥
Simple data pipeline 🔥
TL;DR
0 * * * * my_bash_pipeline.sh
#!/bin/bash
psql -U user -d database_name -c 
"COPY country FROM '/tmp/
country_data';"
I’m sorry Cron, I’ve met
Airflow :)
Data Pipelins:
Open source ToolZZZ
https://docs.google.com/spreadsheets/d/
1wfKibn4yS7AshPzY8ch2Dq_VmQVKnGSXGj08A2u1YQI/edit#gid=0
OS workflow tools:
Airflow 💫
Airflow Concepts
Airflow Core Ideas:
DAGS
Airflow Core Ideas
Airflow Core Ideas:
Operators %
• BashOperator
• PythonOperator
• MySqlOperator
• S3FileTransformOperator … you get the idea
https://airflow.apache.org/concepts.html
Return to real world
ML Dependency hell:
AWS: the good, the bad and the ugly
ML Dependency hell:
https://www.instagram.com/natgeo/
Case #1 Amazon Batch:
AWS Batch and Airflow
AWS Batch and Airflow
AWS Batch and Airflow:
Example
FROM amazonlinux:latest
RUN yum -y groupinstall 'Development Tools'
...
WORKDIR /scratch
RUN git clone https://github.com/cjlin1/
libmf.git /tmp/libmf && cd /tmp/libmf && make
ENTRYPOINT ["/usr/local/bin/entrypoint.sh"]
LIBMF is a library for large-scale sparse matrix factorization. For the
optimization problem it solves and the overall framework, please refer to [3].
AWS Batch and Airflow
Submit job from Airflow
Return results from AWS Batch
AWS Batch and Airflow
22G volume mounted by
default , ~ 10Gb
https://forums.aws.amazon.com/thread.jspa?threadID=250705
• You need to create EC2 based on ECS optimized
AMIs
• Attach another volume separate from your root
volume, you will want to modify the /etc/fstab 
• Create image (AMI) Or use (ami-XXXXX with extra
20Gb space)
• Create AWS Batch compute environment based on
AMI
• useful link https://aws.amazon.com/blogs/compute/building-high-throughput-
genomic-batch-workflows-on-aws-batch-layer-part-3-of-4/)
Solution =)
Case #2 EMR Spark and
Airflow
EMR and Airflow
airflow/contrib/operators/emr_add_steps_operator.py
airflow/contrib/operators/emr_create_job_flow_operator.py
airflow/contrib/operators/emr_terminate_job_flow_operator.py
Debug EMR cluster:
As a final step
load data to warehouse service:
As a final step
load data to warehouse service:
COPY INTO table (colA, colB)
from (SELECT (colA, colB)
FROM @s3_stage_parquet)
ON_ERROR = CONTINUE
Case #3 Kubernetes Pod
Executor and Airflow
CI/CD pipelines
CI/CD pipelines
• unit tests/integration tests   🍸
• code quality checks   🔫
• automatic deployment after merge into master 🌀
• Pull Request code review flow 🍒
unit tests/integration tests🍸
@pytest.mark.parametrize('dag_file', DAG_FILES)
def test_import_dag_files(dag_file):
"""Import dag files and check for DAG."""
module_name, _ = os.path.splitext(dag_file)
module_path = os.path.join(DAG_PATH, dag_file)
mod_spec = importlib.util.spec_from_file_location(module_name, module_path)
module = importlib.util.module_from_spec(mod_spec)
mod_spec.loader.exec_module(module)
assert any(
isinstance(var, af_models.DAG)
for var in vars(module).values())
Data quality checkers
COPY DATA to NEW_TABLE
APPLY DATA QUALITY CHECKS IF ERROR -> SEND ALERTS EXIT(1)
BACKUP ORIGINAL TABLE
RENAME NEW_TABLE to ORIGINAL_TABLE
Data quality checkers
Learn more
Learn more
- Awesome Data Engineering (igorbarinov/awesome-data-engineering)
- Awesome Apache airflow (jghoman/awesome-apache-airflow)
Conclusion:
• Data engineering new trend with old roots 🌴
• Data intensive apps is fun 📈
• ETL code base is just yet another source code
🛠
• You must know better than others you data
storages ﷽
Thank You
Thank You
andrii.soldatenko@gmail.com
@a_soldatenko
Questions
? @a_soldatenko

Weitere ähnliche Inhalte

Was ist angesagt?

Map Analytics in Starcraft II
Map Analytics in Starcraft IIMap Analytics in Starcraft II
Map Analytics in Starcraft II
gy8
 

Was ist angesagt? (20)

Ansible party in the [Google] clouds
Ansible party in the [Google] cloudsAnsible party in the [Google] clouds
Ansible party in the [Google] clouds
 
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
TensorFlow London 14: Ben Hall 'Machine Learning Workloads with Kubernetes an...
 
APIfying the Web with import.io (at APIdays mediterranea)
APIfying the Web with import.io (at APIdays mediterranea)APIfying the Web with import.io (at APIdays mediterranea)
APIfying the Web with import.io (at APIdays mediterranea)
 
Scalable Event Tracking
Scalable Event TrackingScalable Event Tracking
Scalable Event Tracking
 
Presentation kyushu-2018
Presentation kyushu-2018Presentation kyushu-2018
Presentation kyushu-2018
 
An Introduction to Dashing and Smashing
An Introduction to Dashing and SmashingAn Introduction to Dashing and Smashing
An Introduction to Dashing and Smashing
 
Graphql Realtime Pagination
Graphql Realtime PaginationGraphql Realtime Pagination
Graphql Realtime Pagination
 
Terraform day 2
Terraform day 2Terraform day 2
Terraform day 2
 
How to make GAE adapt the Great Firewall
How to make GAE adapt the Great FirewallHow to make GAE adapt the Great Firewall
How to make GAE adapt the Great Firewall
 
Introduction to Cosmos db
Introduction to Cosmos dbIntroduction to Cosmos db
Introduction to Cosmos db
 
Hands on App Engine
Hands on App EngineHands on App Engine
Hands on App Engine
 
AWS Customer Presentation - Zynga
AWS Customer Presentation - ZyngaAWS Customer Presentation - Zynga
AWS Customer Presentation - Zynga
 
RightScale Minneapolis Lightning Talk
RightScale Minneapolis Lightning TalkRightScale Minneapolis Lightning Talk
RightScale Minneapolis Lightning Talk
 
Nyc kubernetes Meetup - Kubeflow Lightning talk
Nyc kubernetes Meetup - Kubeflow Lightning talkNyc kubernetes Meetup - Kubeflow Lightning talk
Nyc kubernetes Meetup - Kubeflow Lightning talk
 
Talent42 2017: Maisha Cannon 1-Pager Brewing Boolean
Talent42 2017: Maisha Cannon 1-Pager Brewing BooleanTalent42 2017: Maisha Cannon 1-Pager Brewing Boolean
Talent42 2017: Maisha Cannon 1-Pager Brewing Boolean
 
How to Automate API Discovery
How to Automate API DiscoveryHow to Automate API Discovery
How to Automate API Discovery
 
[Jfokus] Riding the Jet Streams
[Jfokus] Riding the Jet Streams[Jfokus] Riding the Jet Streams
[Jfokus] Riding the Jet Streams
 
Map Analytics in Starcraft II
Map Analytics in Starcraft IIMap Analytics in Starcraft II
Map Analytics in Starcraft II
 
RightScale CloudCamp Austin Slides
RightScale CloudCamp Austin SlidesRightScale CloudCamp Austin Slides
RightScale CloudCamp Austin Slides
 
Search Like A Boss
Search Like A BossSearch Like A Boss
Search Like A Boss
 

Ähnlich wie Andrii Soldatenko "The art of data engineering"

Cloud computing infrastructure
Cloud computing infrastructureCloud computing infrastructure
Cloud computing infrastructure
sinhhn
 

Ähnlich wie Andrii Soldatenko "The art of data engineering" (20)

"It’s not only Lambda! Economics behind Serverless" at JAX Conference in Mai ...
"It’s not only Lambda! Economics behind Serverless" at JAX Conference in Mai ..."It’s not only Lambda! Economics behind Serverless" at JAX Conference in Mai ...
"It’s not only Lambda! Economics behind Serverless" at JAX Conference in Mai ...
 
Loading Data into Redshift with Lab
Loading Data into Redshift with LabLoading Data into Redshift with Lab
Loading Data into Redshift with Lab
 
FaaS or not to FaaS AWS Community Day Hamburg 2019 Bannes Kazulkin
FaaS or not to FaaS  AWS Community Day Hamburg 2019 Bannes KazulkinFaaS or not to FaaS  AWS Community Day Hamburg 2019 Bannes Kazulkin
FaaS or not to FaaS AWS Community Day Hamburg 2019 Bannes Kazulkin
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Azure tales: a real world CQRS and ES Deep Dive - Andrea Saltarello
Azure tales: a real world CQRS and ES Deep Dive - Andrea SaltarelloAzure tales: a real world CQRS and ES Deep Dive - Andrea Saltarello
Azure tales: a real world CQRS and ES Deep Dive - Andrea Saltarello
 
Loading Data into Amazon Redshift
Loading Data into Amazon RedshiftLoading Data into Amazon Redshift
Loading Data into Amazon Redshift
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
Cloud Computing and HTML5, 2010
Cloud Computing and HTML5, 2010Cloud Computing and HTML5, 2010
Cloud Computing and HTML5, 2010
 
Loading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SFLoading Data into Redshift: Data Analytics Week SF
Loading Data into Redshift: Data Analytics Week SF
 
Castles in the Cloud: Developing with Google App Engine
Castles in the Cloud: Developing with Google App EngineCastles in the Cloud: Developing with Google App Engine
Castles in the Cloud: Developing with Google App Engine
 
Loading Data into Redshift
Loading Data into RedshiftLoading Data into Redshift
Loading Data into Redshift
 
DW on AWS
DW on AWSDW on AWS
DW on AWS
 
Cloud Computing ...changes everything
Cloud Computing ...changes everythingCloud Computing ...changes everything
Cloud Computing ...changes everything
 
"It’s not only Lambda! Economics behind Serverless" at Serverless Architectur...
"It’s not only Lambda! Economics behind Serverless" at Serverless Architectur..."It’s not only Lambda! Economics behind Serverless" at Serverless Architectur...
"It’s not only Lambda! Economics behind Serverless" at Serverless Architectur...
 
Loading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF LoftLoading Data into Redshift: Data Analytics Week at the SF Loft
Loading Data into Redshift: Data Analytics Week at the SF Loft
 
Cloud computing infrastructure
Cloud computing infrastructureCloud computing infrastructure
Cloud computing infrastructure
 
GDG Heraklion - Architecting for the Google Cloud Platform
GDG Heraklion - Architecting for the Google Cloud PlatformGDG Heraklion - Architecting for the Google Cloud Platform
GDG Heraklion - Architecting for the Google Cloud Platform
 
Where should I run my code? Serverless, Containers, Virtual Machines and more
Where should I run my code? Serverless, Containers, Virtual Machines and moreWhere should I run my code? Serverless, Containers, Virtual Machines and more
Where should I run my code? Serverless, Containers, Virtual Machines and more
 
Design Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
 
Big Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI MobileBig Data Analytics from Azure Cloud to Power BI Mobile
Big Data Analytics from Azure Cloud to Power BI Mobile
 

Mehr von Fwdays

Mehr von Fwdays (20)

"How Preply reduced ML model development time from 1 month to 1 day",Yevhen Y...
"How Preply reduced ML model development time from 1 month to 1 day",Yevhen Y..."How Preply reduced ML model development time from 1 month to 1 day",Yevhen Y...
"How Preply reduced ML model development time from 1 month to 1 day",Yevhen Y...
 
"GenAI Apps: Our Journey from Ideas to Production Excellence",Danil Topchii
"GenAI Apps: Our Journey from Ideas to Production Excellence",Danil Topchii"GenAI Apps: Our Journey from Ideas to Production Excellence",Danil Topchii
"GenAI Apps: Our Journey from Ideas to Production Excellence",Danil Topchii
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
"What is a RAG system and how to build it",Dmytro Spodarets
"What is a RAG system and how to build it",Dmytro Spodarets"What is a RAG system and how to build it",Dmytro Spodarets
"What is a RAG system and how to build it",Dmytro Spodarets
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"Distributed graphs and microservices in Prom.ua", Maksym Kindritskyi
"Distributed graphs and microservices in Prom.ua",  Maksym Kindritskyi"Distributed graphs and microservices in Prom.ua",  Maksym Kindritskyi
"Distributed graphs and microservices in Prom.ua", Maksym Kindritskyi
 
"Rethinking the existing data loading and processing process as an ETL exampl...
"Rethinking the existing data loading and processing process as an ETL exampl..."Rethinking the existing data loading and processing process as an ETL exampl...
"Rethinking the existing data loading and processing process as an ETL exampl...
 
"How Ukrainian IT specialist can go on vacation abroad without crossing the T...
"How Ukrainian IT specialist can go on vacation abroad without crossing the T..."How Ukrainian IT specialist can go on vacation abroad without crossing the T...
"How Ukrainian IT specialist can go on vacation abroad without crossing the T...
 
"The Strength of Being Vulnerable: the experience from CIA, Tesla and Uber", ...
"The Strength of Being Vulnerable: the experience from CIA, Tesla and Uber", ..."The Strength of Being Vulnerable: the experience from CIA, Tesla and Uber", ...
"The Strength of Being Vulnerable: the experience from CIA, Tesla and Uber", ...
 
"[QUICK TALK] Radical candor: how to achieve results faster thanks to a cultu...
"[QUICK TALK] Radical candor: how to achieve results faster thanks to a cultu..."[QUICK TALK] Radical candor: how to achieve results faster thanks to a cultu...
"[QUICK TALK] Radical candor: how to achieve results faster thanks to a cultu...
 
"[QUICK TALK] PDP Plan, the only one door to raise your salary and boost care...
"[QUICK TALK] PDP Plan, the only one door to raise your salary and boost care..."[QUICK TALK] PDP Plan, the only one door to raise your salary and boost care...
"[QUICK TALK] PDP Plan, the only one door to raise your salary and boost care...
 
"4 horsemen of the apocalypse of working relationships (+ antidotes to them)"...
"4 horsemen of the apocalypse of working relationships (+ antidotes to them)"..."4 horsemen of the apocalypse of working relationships (+ antidotes to them)"...
"4 horsemen of the apocalypse of working relationships (+ antidotes to them)"...
 
"Reconnecting with Purpose: Rediscovering Job Interest after Burnout", Anast...
"Reconnecting with Purpose: Rediscovering Job Interest after Burnout",  Anast..."Reconnecting with Purpose: Rediscovering Job Interest after Burnout",  Anast...
"Reconnecting with Purpose: Rediscovering Job Interest after Burnout", Anast...
 
"Mentoring 101: How to effectively invest experience in the success of others...
"Mentoring 101: How to effectively invest experience in the success of others..."Mentoring 101: How to effectively invest experience in the success of others...
"Mentoring 101: How to effectively invest experience in the success of others...
 
"Mission (im) possible: How to get an offer in 2024?", Oleksandra Myronova
"Mission (im) possible: How to get an offer in 2024?",  Oleksandra Myronova"Mission (im) possible: How to get an offer in 2024?",  Oleksandra Myronova
"Mission (im) possible: How to get an offer in 2024?", Oleksandra Myronova
 
"Why have we learned how to package products, but not how to 'package ourselv...
"Why have we learned how to package products, but not how to 'package ourselv..."Why have we learned how to package products, but not how to 'package ourselv...
"Why have we learned how to package products, but not how to 'package ourselv...
 
"How to tame the dragon, or leadership with imposter syndrome", Oleksandr Zin...
"How to tame the dragon, or leadership with imposter syndrome", Oleksandr Zin..."How to tame the dragon, or leadership with imposter syndrome", Oleksandr Zin...
"How to tame the dragon, or leadership with imposter syndrome", Oleksandr Zin...
 

Kürzlich hochgeladen

Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
vu2urc
 

Kürzlich hochgeladen (20)

Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 

Andrii Soldatenko "The art of data engineering"