Data Pipelines Powered by Open Source and Fun
Sayantika Banik (she/her)
Agenda for today
Overview
Intro to open source data engineering tools
Recipe to build data pipelines
Short demo (time check)
Takeaways
Data Engineer @Quansight
Organiser @PyLadies Bangalore, India
Project Incubator founding member @NumFOCUS
D&I workgroup member @PSF
Open source contributor @SciPy & NumPy
Crazy about abstract art, nature & notes
sayantikabanik.com
@sayabanik
The internet in a few seconds
Source: everysecond.io
In the background 🤫
find the sources
keep pinging the source
is it big enough, to be called big data?
Volume, Velocity, Variety and Veracity
load in batch
stream it
orchestration
add partitioning
let's use this, this, this and more of that
clean the data
ETL
Data Engineering
lakes and warehouses
It's broken again!!!!
We forgot to add tests
metadata
hadoop
dask
Note: The colours are chosen randomly
with no meaning attached
missing data
spark
Building data pipelines in five simple steps
1. Ingestion from source
2. Quality control
3. Business logic
4. Orchestrate
5. Update, break & continue
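The five steps can be sketched in plain Python. This is a minimal illustration, not code from the talk: the function names, the sample records and the "drop rows with missing values" rule are all assumptions made for the example.

```python
from statistics import mean


def ingest():
    """Step 1: ingestion from source (a hard-coded list stands in for a real source)."""
    return [
        {"city": "Bangalore", "temp_c": 27.5},
        {"city": "Berlin", "temp_c": None},  # a record with missing data
        {"city": "Austin", "temp_c": 31.0},
    ]


def quality_control(records):
    """Step 2: quality control (here: drop records with a missing reading)."""
    return [r for r in records if r["temp_c"] is not None]


def business_logic(records):
    """Step 3: business logic (here: derive a small summary from the clean records)."""
    return {
        "count": len(records),
        "avg_temp_c": mean(r["temp_c"] for r in records),
    }


def run_pipeline():
    """Step 4: orchestrate (run the steps in order, passing data along)."""
    raw = ingest()
    clean = quality_control(raw)
    return business_logic(clean)


if __name__ == "__main__":
    # Step 5 (update, break & continue) is the loop around this script:
    # change a step, rerun, fix what broke, repeat.
    print(run_pipeline())
```

Keeping each step as its own small function makes it easier to test one step at a time and to hand the same functions to an orchestration tool later.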
Tools that exist
Handy open source tools
Package, dependency and environment management for any language: Python, R, Ruby, Lua, Scala, Java, JavaScript, C/C++, Fortran, and more.
https://docs.conda.io/en/latest/
Intake is a lightweight package for finding, investigating, loading and disseminating data.
https://github.com/intake/intake
GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD.
Data orchestration platform that comes with a nice UI: Dagit
https://dagster.io/
Two main components (covered as part of this session)
- Ops are the core unit of computation in Dagster. An individual op should perform relatively simple tasks, such as deriving a dataset from other datasets.
- Jobs are the main unit of execution and monitoring in Dagster. The core of a job is a graph of ops connected via data dependencies.
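To make the "graph of ops connected via data dependencies" idea concrete, here is a standard-library sketch. It deliberately does not use Dagster's API (in Dagster itself, ops and jobs are declared with the `@op` and `@job` decorators); the `JOB` dictionary, `run_job` and the three toy ops are hypothetical names invented for this illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Each "op" is a small function that does one simple task.
def load_numbers():
    return [1, 2, 3, 4]


def square(numbers):
    # Derive a dataset from another dataset.
    return [n * n for n in numbers]


def total(squares):
    return sum(squares)


# The "job": every op name maps to (function, names of the upstream ops it needs).
JOB = {
    "load_numbers": (load_numbers, []),
    "square": (square, ["load_numbers"]),
    "total": (total, ["square"]),
}


def run_job(job):
    """Run every op after the ops it depends on, feeding outputs downstream."""
    graph = {name: set(deps) for name, (_, deps) in job.items()}
    outputs = {}
    for name in TopologicalSorter(graph).static_order():
        fn, deps = job[name]
        outputs[name] = fn(*(outputs[d] for d in deps))
    return outputs


if __name__ == "__main__":
    print(run_job(JOB)["total"])
```

An orchestrator adds much more on top of this ordering (scheduling, retries, logging, a UI), but the core stays the same: small ops, wired together by the data they pass along.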
Launching Dagster Run
Repository
https://github.com/sayantikabanik/DataJourney
@sayabanik sayantikabanik.com
Have fun and keep
exploring

stackconf 2022: Data pipelines powered by Open source and fun