SlideShare ist ein Scribd-Unternehmen logo
1 von 41
GoDataDriven
PROUDLY PART OF THE XEBIA GROUP
@fzk
frisovanvollenhoven@godatadriven.com
Dirty Data
Friso van Vollenhoven
It’s a mess. It’s your problem.
'februari-22 2013'
A: Yes, sometimes as
often as 1 in every 10K
calls. Or about once a
week at 3K files / day.
þ
þ
TSV ==
thorn separated values?
þ == 0xFE
or -2, in Hive
CREATE TABLE browsers (
browser_id STRING,
browser STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-2';
• The format will change
• Faulty deliveries will occur
• Your parser will break
• Records will be mistakingly produced (over-logging)
• Other people test in production too (and you get the
data from it)
• Etc., etc.
•Simple deployment of ETL code
•Scheduling
•Scalable
•Independent jobs
•Fixable data store
•Incremental where possible
•Metrics
EXTRACT
TRANSFORM
LOAD
• No JVM startup overhead for Hadoop API usage
• Relatively concise syntax (Python)
• Mix Python standard library with any Java libs
• Flexible scheduling with dependencies
• Saves output
• E-mails on errors
• Scales to multiple nodes
• REST API
• Status monitor
• Integrates with version control
Deployment
git push jenkins master
Independent jobs
source (external)
staging (HDFS)
hive-staging (HDFS)
Hive
HDFS upload + move in place
MapReduce + HDFS move
Hive map external table + SELECT INTO
Out of order jobs
• At any point, you don’t really know what ‘made it’
to Hive
• Will happen anyway, because some days the data
delivery is going to be three hours late
• Or you get half in the morning and the other half
later in the day
• It really depends on what you do with the data
• This is where metrics + fixable data store help...
Fixable data store
• Using Hive partitions
• Jobs that move data from staging create partitions
• When new data / insight about the data arrives,
drop the partition and re-insert
• Be careful to reset any metrics in this case
• Basically: instead of trying to make everything
transactional, repair afterwards
• Use metrics to determine whether data is fit for
purpose
Metrics
Metrics service
• Job ran, so may units processed, took so much
time
• e.g. 10GB imported, took 1 hr
• e.g. 60M records transformed, took 10 minutes
• Dropped partition
• Inserted X records into partition
GoDataDriven
We’re hiring / Questions? / Thank you!
@fzk
frisovanvollenhoven@godatadriven.com
Friso van Vollenhoven

Weitere ähnliche Inhalte

Mehr von GoDataDriven

DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
GoDataDriven
 

Mehr von GoDataDriven (20)

Building a Scalable and reliable open source ML Platform with MLFlow
Building a Scalable and reliable open source ML Platform with MLFlowBuilding a Scalable and reliable open source ML Platform with MLFlow
Building a Scalable and reliable open source ML Platform with MLFlow
 
Training Taster: Leading the way to become a data-driven organization
Training Taster: Leading the way to become a data-driven organizationTraining Taster: Leading the way to become a data-driven organization
Training Taster: Leading the way to become a data-driven organization
 
My Path From Data Engineer to Analytics Engineer
My Path From Data Engineer to Analytics EngineerMy Path From Data Engineer to Analytics Engineer
My Path From Data Engineer to Analytics Engineer
 
dbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchezdbt Python models - GoDataFest by Guillermo Sanchez
dbt Python models - GoDataFest by Guillermo Sanchez
 
Workshop on Google Cloud Data Platform
Workshop on Google Cloud Data PlatformWorkshop on Google Cloud Data Platform
Workshop on Google Cloud Data Platform
 
How to create a Devcontainer for your Python project
How to create a Devcontainer for your Python projectHow to create a Devcontainer for your Python project
How to create a Devcontainer for your Python project
 
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...
Using Graph Neural Networks To Embrace The Dependency In Your Data by Usman Z...
 
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022
Common Issues With Time Series by Vadim Nelidov - GoDataFest 2022
 
MLOps CodeBreakfast on AWS - GoDataFest 2022
MLOps CodeBreakfast on AWS - GoDataFest 2022MLOps CodeBreakfast on AWS - GoDataFest 2022
MLOps CodeBreakfast on AWS - GoDataFest 2022
 
MLOps CodeBreakfast on Azure - GoDataFest 2022
MLOps CodeBreakfast on Azure - GoDataFest 2022MLOps CodeBreakfast on Azure - GoDataFest 2022
MLOps CodeBreakfast on Azure - GoDataFest 2022
 
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022
Tableau vs. Power BI by Juan Manuel Perafan - GoDataFest 2022
 
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
Deploying a Modern Data Stack by Lasse Benninga - GoDataFest 2022
 
AWS Well-Architected Webinar Security - Ben de Haan
AWS Well-Architected Webinar Security - Ben de HaanAWS Well-Architected Webinar Security - Ben de Haan
AWS Well-Architected Webinar Security - Ben de Haan
 
The 7 Habits of Effective Data Driven Companies
The 7 Habits of Effective Data Driven CompaniesThe 7 Habits of Effective Data Driven Companies
The 7 Habits of Effective Data Driven Companies
 
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
DevOps for Data Science on Azure - Marcel de Vries (Xpirit) and Niels Zeilema...
 
Artificial intelligence in actions: delivering a new experience to Formula 1 ...
Artificial intelligence in actions: delivering a new experience to Formula 1 ...Artificial intelligence in actions: delivering a new experience to Formula 1 ...
Artificial intelligence in actions: delivering a new experience to Formula 1 ...
 
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't HofSmart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof
Smart application on Azure at Vattenfall - Rens Weijers & Peter van 't Hof
 
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019
Democratizing AI/ML with GCP - Abishay Rao (Google) at GoDataFest 2019
 
The world runs on AI - Tony Krijnen (Microsoft) at GoDataFest 2019
The world runs on AI - Tony Krijnen (Microsoft) at GoDataFest 2019The world runs on AI - Tony Krijnen (Microsoft) at GoDataFest 2019
The world runs on AI - Tony Krijnen (Microsoft) at GoDataFest 2019
 
CI/CD with Azure DevOps and Azure Databricks
CI/CD with Azure DevOps and Azure DatabricksCI/CD with Azure DevOps and Azure Databricks
CI/CD with Azure DevOps and Azure Databricks
 

Kürzlich hochgeladen

Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Kürzlich hochgeladen (20)

Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 

Dirty data by friso van vollenhoven

  • 1. GoDataDriven PROUDLY PART OF THE XEBIA GROUP @fzk frisovanvollenhoven@godatadriven.com Dirty Data Friso van Vollenhoven It’s a mess. It’s your problem.
  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 8.
  • 9. A: Yes, sometimes as often as 1 in every 10K calls. Or about once a week at 3K files / day.
  • 10.
  • 11.
  • 12. þ
  • 13. þ
  • 16. or -2, in Hive CREATE TABLE browsers ( browser_id STRING, browser STRING ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '-2';
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
  • 22.
  • 23. • The format will change • Faulty deliveries will occur • Your parser will break • Records will be mistakingly produced (over-logging) • Other people test in production too (and you get the data from it) • Etc., etc.
  • 24. •Simple deployment of ETL code •Scheduling •Scalable •Independent jobs •Fixable data store •Incremental where possible •Metrics
  • 26.
  • 27.
  • 28.
  • 29.
  • 30.
  • 31. • No JVM startup overhead for Hadoop API usage • Relatively concise syntax (Python) • Mix Python standard library with any Java libs
  • 32.
  • 33. • Flexible scheduling with dependencies • Saves output • E-mails on errors • Scales to multiple nodes • REST API • Status monitor • Integrates with version control
  • 34.
  • 36. Independent jobs source (external) staging (HDFS) hive-staging (HDFS) Hive HDFS upload + move in place MapReduce + HDFS move Hive map external table + SELECT INTO
  • 37. Out of order jobs • At any point, you don’t really know what ‘made it’ to Hive • Will happen anyway, because some days the data delivery is going to be three hours late • Or you get half in the morning and the other half later in the day • It really depends on what you do with the data • This is where metrics + fixable data store help...
  • 38. Fixable data store • Using Hive partitions • Jobs that move data from staging create partitions • When new data / insight about the data arrives, drop the partition and re-insert • Be careful to reset any metrics in this case • Basically: instead of trying to make everything transactional, repair afterwards • Use metrics to determine whether data is fit for purpose
  • 40. Metrics service • Job ran, so may units processed, took so much time • e.g. 10GB imported, took 1 hr • e.g. 60M records transformed, took 10 minutes • Dropped partition • Inserted X records into partition
  • 41. GoDataDriven We’re hiring / Questions? / Thank you! @fzk frisovanvollenhoven@godatadriven.com Friso van Vollenhoven