SlideShare ist ein Scribd-Unternehmen logo
1 von 12
Build data quality rules and data
cleansing into your data pipelines
Principal Product Manager
Microsoft Data Integration
Data Quality for Data Warehouse Scenarios
Verify data types
and lengths
How should I
handle NULLs?
Domain value
constraints (ie. US
states)
Single source of
truth (master
data)
Late arriving
dimensions
Reference data
lookups
Data Quality for Data Science Scenarios
De-Duplicate
Data
Descriptive data
statistics (Length,
type, mean, median,
average, stddev …)
Frequency
distribution
How do I handle
missing values?
Enumerations
(turn codes into
categorical data)
Value
replacement
Replacing Values
• iif (length(title) == 0,toString(null()),title)
Splitting Data Based on Values
Data Profiling
• Summary statistics describing the shape, size,
content of your data
• Helps you to understand what is inside your data
• As a Data Engineer, you have the responsibility to
provide proper data for analytics and models
• For big data sets, use sampling / good enough
approach
• View data statistics at any step in your data
transformation
• Here is how to persist your data profile stats:
https://techcommunity.microsoft.com/t5/azure-
data-factory/how-to-save-your-data-profiler-
summary-stats-in-adf-data-flows/ba-p/1243251
Pattern Matching
With very wide datasets or datasets where you cannot
anticipate all column names, generate pattern-matching rules
to apply data quality rules en masse without the need to name
each column and describe each rule individually.
Enumerations /
Lookups
• Replace coded values with categorical
strings, or vice-versa
• Hard code rules inside Case() statement or
use a Join or Lookup method against a
reference file or table
• Models and analytics will sometimes require
codes or string literal values
Data De-Duplication
and Distinct Rows
• Use this pattern to eliminate common rows
from your data
• You pick a heuristic to use during duplicate
matching
• You can tag rows and/or remove duplicate
rows
• Use exact matching and/or fuzzy matching
• Available as pipeline template Dedupe
Pipeline
Fuzzy Lookups &
Joins
• Sometimes when performing
inline lookups, you don’t
have exact matches when
looking for references
• Fuzzy Lookups with Soundex
helps find matches based on
phonetic algorithms
• Very useful in data lake
scenarios where joins and
lookups are against data that
is not normalized or cleaned
Metadata
Validation Rules
• Transform data conditionally based on metadata traits
• Create and manage metadata quality rules
• Manipulate column properties of source data
• https://www.youtube.com/watch?v=E_UD3R-VpYE
Assertions
• Useful for building data quality pipelines
• Log failed assertions
• Fail fast
• Built-in expectations
• Expect true
• Expect unique
• Expect exists

Weitere ähnliche Inhalte

Was ist angesagt?

Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
MDM for Customer data with Talend
MDM for Customer data with Talend MDM for Customer data with Talend
MDM for Customer data with Talend Jean-Michel Franco
 
Basic Introduction of Data Warehousing from Adiva Consulting
Basic Introduction of  Data Warehousing from Adiva ConsultingBasic Introduction of  Data Warehousing from Adiva Consulting
Basic Introduction of Data Warehousing from Adiva Consultingadivasoft
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data MeshLibbySchulze
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouseJames Serra
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture DesignKujambu Murugesan
 
Data Catalog as a Business Enabler
Data Catalog as a Business EnablerData Catalog as a Business Enabler
Data Catalog as a Business EnablerSrinivasan Sankar
 
Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005Mark Kromer
 
Data warehouse presentaion
Data warehouse presentaionData warehouse presentaion
Data warehouse presentaionsridhark1981
 
Linking Data Governance to Business Goals
Linking Data Governance to Business GoalsLinking Data Governance to Business Goals
Linking Data Governance to Business GoalsPrecisely
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerDatabricks
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureDATAVERSITY
 
The Business Glossary, Data Dictionary, Data Catalog Trifecta
The Business Glossary, Data Dictionary, Data Catalog TrifectaThe Business Glossary, Data Dictionary, Data Catalog Trifecta
The Business Glossary, Data Dictionary, Data Catalog Trifectageorgefirican
 
Chapter 3: Data Governance
Chapter 3: Data Governance Chapter 3: Data Governance
Chapter 3: Data Governance Ahmed Alorage
 
Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22Mark Kromer
 
Data cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsData cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsMark Kromer
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks
 

Was ist angesagt? (20)

Data Vault Overview
Data Vault OverviewData Vault Overview
Data Vault Overview
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
MDM for Customer data with Talend
MDM for Customer data with Talend MDM for Customer data with Talend
MDM for Customer data with Talend
 
Basic Introduction of Data Warehousing from Adiva Consulting
Basic Introduction of  Data Warehousing from Adiva ConsultingBasic Introduction of  Data Warehousing from Adiva Consulting
Basic Introduction of Data Warehousing from Adiva Consulting
 
Time to Talk about Data Mesh
Time to Talk about Data MeshTime to Talk about Data Mesh
Time to Talk about Data Mesh
 
Building a modern data warehouse
Building a modern data warehouseBuilding a modern data warehouse
Building a modern data warehouse
 
Modern Data architecture Design
Modern Data architecture DesignModern Data architecture Design
Modern Data architecture Design
 
Data Catalog as a Business Enabler
Data Catalog as a Business EnablerData Catalog as a Business Enabler
Data Catalog as a Business Enabler
 
Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005Azure Data Factory Data Flows Training v005
Azure Data Factory Data Flows Training v005
 
Data warehouse presentaion
Data warehouse presentaionData warehouse presentaion
Data warehouse presentaion
 
Linking Data Governance to Business Goals
Linking Data Governance to Business GoalsLinking Data Governance to Business Goals
Linking Data Governance to Business Goals
 
Building Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics PrimerBuilding Lakehouses on Delta Lake with SQL Analytics Primer
Building Lakehouses on Delta Lake with SQL Analytics Primer
 
Enterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data ArchitectureEnterprise Architecture vs. Data Architecture
Enterprise Architecture vs. Data Architecture
 
The Business Glossary, Data Dictionary, Data Catalog Trifecta
The Business Glossary, Data Dictionary, Data Catalog TrifectaThe Business Glossary, Data Dictionary, Data Catalog Trifecta
The Business Glossary, Data Dictionary, Data Catalog Trifecta
 
Chapter 3: Data Governance
Chapter 3: Data Governance Chapter 3: Data Governance
Chapter 3: Data Governance
 
Data modelling 101
Data modelling 101Data modelling 101
Data modelling 101
 
Dimensional Modelling
Dimensional ModellingDimensional Modelling
Dimensional Modelling
 
Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22
 
Data cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flowsData cleansing and prep with synapse data flows
Data cleansing and prep with synapse data flows
 
Databricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With DataDatabricks: A Tool That Empowers You To Do More With Data
Databricks: A Tool That Empowers You To Do More With Data
 

Ă„hnlich wie Build data quality rules and data cleansing into your data pipelines

Data Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernData Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernAmin Chowdhury
 
Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...
Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...
Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...SaiM947604
 
Big Data LDN 2018: TIPS AND TRICKS TO WRANGLE BIG, DIRTY DATA
Big Data LDN 2018: TIPS AND TRICKS TO WRANGLE BIG, DIRTY DATABig Data LDN 2018: TIPS AND TRICKS TO WRANGLE BIG, DIRTY DATA
Big Data LDN 2018: TIPS AND TRICKS TO WRANGLE BIG, DIRTY DATAMatt Stubbs
 
Data warehouse 17 dimensional data model
Data warehouse 17 dimensional data modelData warehouse 17 dimensional data model
Data warehouse 17 dimensional data modelVaibhav Khanna
 
Creating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyCreating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyRTTS
 
Datawarehousing Terminology
Datawarehousing TerminologyDatawarehousing Terminology
Datawarehousing TerminologyDev EngineersSaathi
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Caserta
 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architectureCosta Pissaris
 
Intro to Data Management
Intro to Data ManagementIntro to Data Management
Intro to Data ManagementChristopher Eaker
 
Data Profiling, Data Catalogs and Metadata Harmonisation
Data Profiling, Data Catalogs and Metadata HarmonisationData Profiling, Data Catalogs and Metadata Harmonisation
Data Profiling, Data Catalogs and Metadata HarmonisationAlan McSweeney
 
Data pre processing
Data pre processingData pre processing
Data pre processingkalavathisugan
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJordan Open Source Association
 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best PracticesChristopher Eaker
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data miningDhilsath Fathima
 
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data
Combining Human+Machine Intelligence to Successfully Integrate Biomedical DataCombining Human+Machine Intelligence to Successfully Integrate Biomedical Data
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Datarusselltamr
 
Using ca e rwin modeling to asure data 09162010
Using ca e rwin modeling to asure data 09162010Using ca e rwin modeling to asure data 09162010
Using ca e rwin modeling to asure data 09162010ERwin Modeling
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning Gopal Sakarkar
 
Data, Text and Web Mining
Data, Text and Web Mining Data, Text and Web Mining
Data, Text and Web Mining Jeremiah Fadugba
 

Ă„hnlich wie Build data quality rules and data cleansing into your data pipelines (20)

Data Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing ConcernData Quality: A Raising Data Warehousing Concern
Data Quality: A Raising Data Warehousing Concern
 
Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...
Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...
Apply Raw Data Set And Implement The Different Data Warngliing Functionalitie...
 
Big Data LDN 2018: TIPS AND TRICKS TO WRANGLE BIG, DIRTY DATA
Big Data LDN 2018: TIPS AND TRICKS TO WRANGLE BIG, DIRTY DATABig Data LDN 2018: TIPS AND TRICKS TO WRANGLE BIG, DIRTY DATA
Big Data LDN 2018: TIPS AND TRICKS TO WRANGLE BIG, DIRTY DATA
 
Data warehouse 17 dimensional data model
Data warehouse 17 dimensional data modelData warehouse 17 dimensional data model
Data warehouse 17 dimensional data model
 
Creating a Data validation and Testing Strategy
Creating a Data validation and Testing StrategyCreating a Data validation and Testing Strategy
Creating a Data validation and Testing Strategy
 
Datawarehousing Terminology
Datawarehousing TerminologyDatawarehousing Terminology
Datawarehousing Terminology
 
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
Big Data Warehousing Meetup: Dimensional Modeling Still Matters!!!
 
Building the enterprise data architecture
Building the enterprise data architectureBuilding the enterprise data architecture
Building the enterprise data architecture
 
Digital data
Digital dataDigital data
Digital data
 
Digital Types
Digital TypesDigital Types
Digital Types
 
Intro to Data Management
Intro to Data ManagementIntro to Data Management
Intro to Data Management
 
Data Profiling, Data Catalogs and Metadata Harmonisation
Data Profiling, Data Catalogs and Metadata HarmonisationData Profiling, Data Catalogs and Metadata Harmonisation
Data Profiling, Data Catalogs and Metadata Harmonisation
 
Data pre processing
Data pre processingData pre processing
Data pre processing
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
 
Data Management Best Practices
Data Management Best PracticesData Management Best Practices
Data Management Best Practices
 
Unit 3 part i Data mining
Unit 3 part i Data miningUnit 3 part i Data mining
Unit 3 part i Data mining
 
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data
Combining Human+Machine Intelligence to Successfully Integrate Biomedical DataCombining Human+Machine Intelligence to Successfully Integrate Biomedical Data
Combining Human+Machine Intelligence to Successfully Integrate Biomedical Data
 
Using ca e rwin modeling to asure data 09162010
Using ca e rwin modeling to asure data 09162010Using ca e rwin modeling to asure data 09162010
Using ca e rwin modeling to asure data 09162010
 
Data preprocessing using Machine Learning
Data  preprocessing using Machine Learning Data  preprocessing using Machine Learning
Data preprocessing using Machine Learning
 
Data, Text and Web Mining
Data, Text and Web Mining Data, Text and Web Mining
Data, Text and Web Mining
 

Mehr von Mark Kromer

Fabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptxFabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptxMark Kromer
 
Data cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsData cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsMark Kromer
 
Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mark Kromer
 
Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mark Kromer
 
Data Lake ETL in the Cloud with ADF
Data Lake ETL in the Cloud with ADFData Lake ETL in the Cloud with ADF
Data Lake ETL in the Cloud with ADFMark Kromer
 
Azure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power QueryAzure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power QueryMark Kromer
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Mark Kromer
 
Data Quality Patterns in the Cloud with ADF
Data Quality Patterns in the Cloud with ADFData Quality Patterns in the Cloud with ADF
Data Quality Patterns in the Cloud with ADFMark Kromer
 
Data quality patterns in the cloud with ADF
Data quality patterns in the cloud with ADFData quality patterns in the cloud with ADF
Data quality patterns in the cloud with ADFMark Kromer
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryMark Kromer
 
ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300Mark Kromer
 
ADF Mapping Data Flows Training V2
ADF Mapping Data Flows Training V2ADF Mapping Data Flows Training V2
ADF Mapping Data Flows Training V2Mark Kromer
 
ADF Mapping Data Flow Private Preview Migration
ADF Mapping Data Flow Private Preview MigrationADF Mapping Data Flow Private Preview Migration
ADF Mapping Data Flow Private Preview MigrationMark Kromer
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudMark Kromer
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudMark Kromer
 
Azure Data Factory Data Flow
Azure Data Factory Data FlowAzure Data Factory Data Flow
Azure Data Factory Data FlowMark Kromer
 
Azure Data Factory Data Flow Limited Preview for January 2019
Azure Data Factory Data Flow Limited Preview for January 2019Azure Data Factory Data Flow Limited Preview for January 2019
Azure Data Factory Data Flow Limited Preview for January 2019Mark Kromer
 
Microsoft Azure Data Factory Data Flow Scenarios
Microsoft Azure Data Factory Data Flow ScenariosMicrosoft Azure Data Factory Data Flow Scenarios
Microsoft Azure Data Factory Data Flow ScenariosMark Kromer
 
Azure Data Factory Data Flow Preview December 2019
Azure Data Factory Data Flow Preview December 2019Azure Data Factory Data Flow Preview December 2019
Azure Data Factory Data Flow Preview December 2019Mark Kromer
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekMark Kromer
 

Mehr von Mark Kromer (20)

Fabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptxFabric Data Factory Pipeline Copy Perf Tips.pptx
Fabric Data Factory Pipeline Copy Perf Tips.pptx
 
Data cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flowsData cleansing and data prep with synapse data flows
Data cleansing and data prep with synapse data flows
 
Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021Mapping Data Flows Training April 2021
Mapping Data Flows Training April 2021
 
Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021Mapping Data Flows Perf Tuning April 2021
Mapping Data Flows Perf Tuning April 2021
 
Data Lake ETL in the Cloud with ADF
Data Lake ETL in the Cloud with ADFData Lake ETL in the Cloud with ADF
Data Lake ETL in the Cloud with ADF
 
Azure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power QueryAzure Data Factory Data Wrangling with Power Query
Azure Data Factory Data Wrangling with Power Query
 
Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101Azure Data Factory Data Flow Performance Tuning 101
Azure Data Factory Data Flow Performance Tuning 101
 
Data Quality Patterns in the Cloud with ADF
Data Quality Patterns in the Cloud with ADFData Quality Patterns in the Cloud with ADF
Data Quality Patterns in the Cloud with ADF
 
Data quality patterns in the cloud with ADF
Data quality patterns in the cloud with ADFData quality patterns in the cloud with ADF
Data quality patterns in the cloud with ADF
 
Data Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data FactoryData Quality Patterns in the Cloud with Azure Data Factory
Data Quality Patterns in the Cloud with Azure Data Factory
 
ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300ADF Mapping Data Flows Level 300
ADF Mapping Data Flows Level 300
 
ADF Mapping Data Flows Training V2
ADF Mapping Data Flows Training V2ADF Mapping Data Flows Training V2
ADF Mapping Data Flows Training V2
 
ADF Mapping Data Flow Private Preview Migration
ADF Mapping Data Flow Private Preview MigrationADF Mapping Data Flow Private Preview Migration
ADF Mapping Data Flow Private Preview Migration
 
Azure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the CloudAzure Data Factory ETL Patterns in the Cloud
Azure Data Factory ETL Patterns in the Cloud
 
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the CloudSQL Saturday Redmond 2019 ETL Patterns in the Cloud
SQL Saturday Redmond 2019 ETL Patterns in the Cloud
 
Azure Data Factory Data Flow
Azure Data Factory Data FlowAzure Data Factory Data Flow
Azure Data Factory Data Flow
 
Azure Data Factory Data Flow Limited Preview for January 2019
Azure Data Factory Data Flow Limited Preview for January 2019Azure Data Factory Data Flow Limited Preview for January 2019
Azure Data Factory Data Flow Limited Preview for January 2019
 
Microsoft Azure Data Factory Data Flow Scenarios
Microsoft Azure Data Factory Data Flow ScenariosMicrosoft Azure Data Factory Data Flow Scenarios
Microsoft Azure Data Factory Data Flow Scenarios
 
Azure Data Factory Data Flow Preview December 2019
Azure Data Factory Data Flow Preview December 2019Azure Data Factory Data Flow Preview December 2019
Azure Data Factory Data Flow Preview December 2019
 
Azure Data Factory for Azure Data Week
Azure Data Factory for Azure Data WeekAzure Data Factory for Azure Data Week
Azure Data Factory for Azure Data Week
 

KĂĽrzlich hochgeladen

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontologyjohnbeverley2021
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Christopher Logan Kennedy
 

KĂĽrzlich hochgeladen (20)

Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Six Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal OntologySix Myths about Ontologies: The Basics of Formal Ontology
Six Myths about Ontologies: The Basics of Formal Ontology
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 

Build data quality rules and data cleansing into your data pipelines

  • 1. Build data quality rules and data cleansing into your data pipelines Principal Product Manager Microsoft Data Integration
  • 2. Data Quality for Data Warehouse Scenarios Verify data types and lengths How should I handle NULLs? Domain value constraints (ie. US states) Single source of truth (master data) Late arriving dimensions Reference data lookups
  • 3. Data Quality for Data Science Scenarios De-Duplicate Data Descriptive data statistics (Length, type, mean, median, average, stddev …) Frequency distribution How do I handle missing values? Enumerations (turn codes into categorical data) Value replacement
  • 4. Replacing Values • iif (length(title) == 0,toString(null()),title)
  • 6. Data Profiling • Summary statistics describing the shape, size, content of your data • Helps you to understand what is inside your data • As a Data Engineer, you have the responsibility to provide proper data for analytics and models • For big data sets, use sampling / good enough approach • View data statistics at any step in your data transformation • Here is how to persist your data profile stats: https://techcommunity.microsoft.com/t5/azure- data-factory/how-to-save-your-data-profiler- summary-stats-in-adf-data-flows/ba-p/1243251
  • 7. Pattern Matching With very wide datasets or datasets where you cannot anticipate all column names, generate pattern-matching rules to apply data quality rules en masse without the need to name each column and describe each rule individually.
  • 8. Enumerations / Lookups • Replace coded values with categorical strings, or vice-versa • Hard code rules inside Case() statement or use a Join or Lookup method against a reference file or table • Models and analytics will sometimes require codes or string literal values
  • 9. Data De-Duplication and Distinct Rows • Use this pattern to eliminate common rows from your data • You pick a heuristic to use during duplicate matching • You can tag rows and/or remove duplicate rows • Use exact matching and/or fuzzy matching • Available as pipeline template Dedupe Pipeline
  • 10. Fuzzy Lookups & Joins • Sometimes when performing inline lookups, you don’t have exact matches when looking for references • Fuzzy Lookups with Soundex helps find matches based on phonetic algorithms • Very useful in data lake scenarios where joins and lookups are against data that is not normalized or cleaned
  • 11. Metadata Validation Rules • Transform data conditionally based on metadata traits • Create and manage metadata quality rules • Manipulate column properties of source data • https://www.youtube.com/watch?v=E_UD3R-VpYE
  • 12. Assertions • Useful for building data quality pipelines • Log failed assertions • Fail fast • Built-in expectations • Expect true • Expect unique • Expect exists