The problem
Data ingestion
Data parsing
Data cleansing
Scaling out
Building a data processing pipeline in Python
Joe Cabrera
https://github.com/greedo
@greedoshotlast
jcabrera@eminorlabs.com
PyGotham, 2015
Joe Cabrera Building a data processing pipeline in Python
Outline
1 The problem
2 Data ingestion
3 Data parsing
4 Data cleansing
5 Scaling out
Poorly formatted data
Largely dispersed across the web
No standard data processing library
Pandas
Bubbles
Data processing
Requests and Futures
Requests makes it easy to send the required parameters
Concurrent Futures allows for the asynchronous execution of download requests
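A minimal sketch of how Requests and `concurrent.futures` might be combined for concurrent downloads. The slides do not show the actual code, so the helper names (`fetch`, `download_all`), the `fetch_one` parameter, the worker count, and the timeout are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch(url, params=None, timeout=10):
    """Download one URL with Requests and return the response body as text."""
    # Requests is a third-party package; it is imported here so that
    # download_all() below can also be exercised with a stub fetcher.
    import requests

    response = requests.get(url, params=params, timeout=timeout)
    response.raise_for_status()
    return response.text


def download_all(urls, fetch_one=fetch, max_workers=8):
    """Run fetch_one(url) for every URL on a thread pool.

    Returns (results, errors): two dicts keyed by URL.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to the URL it is downloading.
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:  # keep going when one download fails
                errors[url] = exc
    return results, errors
```

Threads suit this step because downloads spend most of their time waiting on the network. Collecting failures in a separate dict, rather than raising on the first one, is a design choice of this sketch and not something the slides prescribe.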
Parsers
Python tokenize
BeautifulSoup
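For reference, a small sketch of what the standard-library `tokenize` module produces. The slides do not say how it was applied, so the input line and the `tokens_of` helper are hypothetical; note that `tokenize` expects text that follows Python's lexical rules and raises on input such as unterminated strings.

```python
import io
import tokenize

# Token types that carry no content and are dropped from the output.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER}


def tokens_of(text):
    """Return (token type name, token text) pairs for a line of Python-like text."""
    readline = io.StringIO(text).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
        if tok.type not in SKIP
    ]
```

For example, `tokens_of("price = 42.50")` returns `[('NAME', 'price'), ('OP', '='), ('NUMBER', '42.50')]`.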
Why BeautifulSoup
More forgiving than standard XML or HTML libraries
Supports regex
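Both points can be illustrated with a short sketch. The markup, the class names, and the `extract_prices` helper are made up for this example; the only BeautifulSoup features used are the built-in `html.parser` backend and passing a compiled regex as a `find_all` filter.

```python
import re

from bs4 import BeautifulSoup  # third-party package: beautifulsoup4

# Deliberately sloppy markup: unquoted attributes, and the last
# <td>, <tr> and <table> are never closed.
HTML = """
<table>
<tr><td class=price-usd>9.99</td><td class=name>Widget</td></tr>
<tr><td class=price-eur>12.50</td><td class=name>Gadget
"""


def extract_prices(html):
    """Return (css class, price) pairs for every cell whose class starts with 'price-'."""
    soup = BeautifulSoup(html, "html.parser")
    # A compiled regex can be used wherever find_all accepts a filter.
    cells = soup.find_all("td", class_=re.compile(r"^price-"))
    return [(cell["class"][0], float(cell.get_text(strip=True))) for cell in cells]
```

A strict XML parser would reject this input outright, while BeautifulSoup still builds a tree that can be searched.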
Celery job scheduling
Each download job is a task
Each parse job is a task
Each cleanse job is a task
Re-insert cleansed data
Clean up data after raw ingest
Separate stores for raw and clean data
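A minimal sketch of the raw/clean split, assuming SQLite for both stores. The table names, the columns, and the cleansing rule (collapse whitespace, lowercase) are hypothetical; the slides do not name the databases used.

```python
import sqlite3


def open_store(path, table):
    """Open a SQLite store and make sure its single table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} "
        "(id INTEGER PRIMARY KEY, source TEXT, body TEXT)"
    )
    return conn


def ingest_raw(raw, source, body):
    """Store a document exactly as downloaded and return its row id."""
    cursor = raw.execute(
        "INSERT INTO raw_docs (source, body) VALUES (?, ?)", (source, body)
    )
    raw.commit()
    return cursor.lastrowid


def cleanse(body):
    """Hypothetical cleansing rule: collapse whitespace and lowercase."""
    return " ".join(body.split()).lower()


def promote(raw, clean):
    """Copy every raw row into the clean store in cleansed form."""
    rows = raw.execute("SELECT id, source, body FROM raw_docs").fetchall()
    clean.executemany(
        "INSERT OR REPLACE INTO clean_docs (id, source, body) VALUES (?, ?, ?)",
        [(row_id, source, cleanse(body)) for row_id, source, body in rows],
    )
    clean.commit()
    return len(rows)
```

Because the raw store is never modified, the cleansing rules can change later and `promote` can simply be run again.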
Distributed task queue
Distribute data processing jobs to many machines
Distribute jobs on a given machine across many CPUs
SQLAlchemy basic sharding API
Each database has a shard id
We query for data based on which shard contains the data
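The routing idea can be shown without SQLAlchemy. The sketch below is plain `sqlite3`, not the SQLAlchemy sharding API, and the shard ids, the table layout, and the choice of a CRC32 hash of the URL as the routing rule are all assumptions.

```python
import sqlite3
import zlib

SHARD_IDS = ("shard_0", "shard_1", "shard_2")  # hypothetical shard ids


def open_shards(shard_ids=SHARD_IDS):
    """Create one in-memory database per shard id."""
    shards = {}
    for shard_id in shard_ids:
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE docs (url TEXT PRIMARY KEY, body TEXT)")
        shards[shard_id] = conn
    return shards


def shard_for(url, shard_ids):
    """Pick the shard that owns a URL.

    crc32 is stable across processes, unlike the built-in hash() for strings.
    """
    return shard_ids[zlib.crc32(url.encode("utf-8")) % len(shard_ids)]


def save(shards, url, body):
    """Write a document to the shard that owns its URL."""
    conn = shards[shard_for(url, sorted(shards))]
    conn.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)", (url, body))
    conn.commit()


def load(shards, url):
    """Query only the shard that can contain the URL."""
    conn = shards[shard_for(url, sorted(shards))]
    row = conn.execute("SELECT body FROM docs WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None
```

Reads and writes use the same `shard_for` rule, so a lookup touches exactly one database instead of all of them.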
Questions
Thanks!
https://github.com/greedo
@greedoshotlast
jcabrera@eminorlabs.com