Deduplication
  Bouvet BigOne, 2011-04-13
  Lars Marius Garshol, <larsga@bouvet.no>
  http://twitter.com/larsga
Getting started
  Baby steps
The problem
  • The suppliers table
  • Real-world data is very, very messy
The problem – take 2
  • [Diagram: Suppliers, Customers and Companies tables spread across the CRM, Billing and ERP systems]
  • Each of these has internal duplicates, plus duplicates across the tables
  • No easy fix
But ... what about identifiers?
  • No, there are no system IDs across these tables
  • Yes, there are outside identifiers
    - organization number for companies
    - personal number for people
  • But, these are problematic
    - many records don't have them
    - they are inconsistently formatted
    - sometimes they are misspelled
    - some parts of huge organizations have the same org number, but need to be treated as separate
First attempt at solution
  • I wrote a simple Python script in ~2 hours
  • It does the following:
    - load all records
    - normalize the data (strip extra whitespace, lowercase, remove letters from org codes, ...)
    - use Bayesian inference for matching
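The normalization step described above might look roughly like the sketch below. The original script is not shown in the slides, so the field names (`name`, `org_code`) and the helper functions are hypothetical, and stripping every non-digit character is only one reading of "remove letters from org codes".

```python
import re


def normalize_text(value):
    """Trim the value, collapse runs of whitespace, and lowercase it."""
    return re.sub(r"\s+", " ", value.strip()).lower()


def normalize_org_code(value):
    """Keep only the digits of an organization number (assumed rule)."""
    return re.sub(r"\D", "", value)


def normalize_record(record):
    """Return a cleaned copy of a record; field names are hypothetical."""
    return {
        "name": normalize_text(record.get("name", "")),
        "org_code": normalize_org_code(record.get("org_code", "")),
    }


if __name__ == "__main__":
    raw = {"name": "  Acme   Trading AS ", "org_code": "NO 987 654 321 MVA"}
    print(normalize_record(raw))
```

Normalizing before comparison means that purely cosmetic differences no longer count against a match, which leaves the matching step to deal with real differences only.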
Configuration
Matching
  • The per-property probabilities combine to an overall match probability of 0.93
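One common way to combine per-property probabilities under a naïve Bayes assumption is sketched below. The slide does not show the individual probabilities or the exact formula used, so both the `combine` function and the example numbers are assumptions, and they are not meant to reproduce the 0.93 figure.

```python
from functools import reduce


def combine(p1, p2):
    """Combine two match probabilities, treating them as independent evidence.

    Assumes the pair is not (0.0, 1.0), which would divide by zero.
    """
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def match_probability(probabilities):
    """Fold all per-property probabilities together, starting from a neutral 0.5."""
    return reduce(combine, probabilities, 0.5)


if __name__ == "__main__":
    # Invented example: name and address agree fairly well, phone number does not.
    print(match_probability([0.8, 0.8, 0.4]))
```

A value above 0.5 pushes the result towards "same entity", a value below 0.5 pushes it away, and 0.5 leaves it unchanged.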
Problems
  • The functions comparing values are still pretty primitive
  • Performance is abysmal
    - 90 minutes to process 14,500 records
    - performance is O(n²)
    - total number of records is ~2.5 million
    - time to process all records: 1 year 10 months
  • Now what?
An idea
  • Well, we don't necessarily need to compare each record with all others if we have indexes
    - we can look up the records which have matching values
  • Use DBM for the indexes, for example
  • Unfortunately, these only allow exact matching
  • But, we can break up complex values into tokens, and index those
  • Hang on, isn't this rather like a search engine?
  • Bing! Let's try Lucene!
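The token-index idea can be sketched with a plain in-memory dictionary. This is an illustration only: it uses neither DBM nor Lucene, and the record layout, the `name` field and the whitespace tokenizer are assumptions.

```python
from collections import defaultdict


def tokenize(value):
    """Split a value into lowercase tokens on whitespace."""
    return value.lower().split()


def build_index(records, field):
    """Map each token to the set of record ids whose field contains it."""
    index = defaultdict(set)
    for record_id, record in records.items():
        for token in tokenize(record.get(field, "")):
            index[token].add(record_id)
    return index


def candidates(index, records, record_id, field):
    """Return ids of other records sharing at least one token with this one."""
    found = set()
    for token in tokenize(records[record_id].get(field, "")):
        found |= index.get(token, set())
    found.discard(record_id)
    return found


records = {
    1: {"name": "Acme Trading AS"},
    2: {"name": "ACME Trading"},
    3: {"name": "Nordic Fish AS"},
    4: {"name": "Bergen Bakery"},
}
index = build_index(records, "name")

if __name__ == "__main__":
    for record_id in records:
        print(record_id, sorted(candidates(index, records, record_id, "name")))
```

Only the candidates returned by the lookup need the expensive comparison, so records with nothing in common are never compared at all. Very common tokens (such as a company suffix) still pull in many candidates, which is one reason a search engine with relevance ranking is attractive here.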
Lucene-based prototype
  • I whip out Jython and try it
  • New script first builds Lucene index
  • Then searches all records against the index
  • Time to process 14,500 records: 1 minute
  • Now we're talking...
Reality sets in
  A splash of cold water to the face
Prior art
  • It turns out people have been doing this before
  • They call it
    - entity resolution
    - identity resolution
    - merge/purge
    - deduplication
    - record linkage
    - ...
  • This makes Googling for information an absolute nightmare
Existing tools
  • Several commercial tools
    - they look big and expensive: we skip those
  • Stian found some open source tools
    - Oyster: slow, bad architecture, primitive matching
    - SERF: slow, bad architecture
  • I’ve later found more, but was not impressed
  • So, it seems we still have to do it ourselves
Finds in the research literature
  • General problem is well-understood
    - "naïve Bayes" is naïve
    - lots of interesting work on value comparisons
    - performance problem 'solved' with "blocking"
        build a key from parts of the data
        sort records by key
        compare each record with m nearest neighbours
        performance goes from O(n²) to O(n m)
    - parallel processing widely used
  • Swoosh paper
    - compare and merge should have ICAR properties (idempotence, commutativity, associativity, representativity)
    - optimal algorithms for general merge found
    - run-time for 14,000 records ~1.5 hours...
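The blocking steps listed above (build a key, sort, compare within a window) correspond to what the literature calls the sorted-neighbourhood method. A minimal sketch follows; the key recipe (first three letters of the name plus the first two characters of a `zip` field) and the sample records are hypothetical.

```python
def blocking_key(record):
    """Build a sort key from parts of the data (hypothetical recipe)."""
    name = "".join(ch for ch in record["name"].lower() if ch.isalnum())
    return name[:3] + record.get("zip", "")[:2]


def candidate_pairs(records, window):
    """Yield each record paired with the next `window` records in key order."""
    ordered = sorted(records, key=blocking_key)
    for i, record in enumerate(ordered):
        for other in ordered[i + 1 : i + 1 + window]:
            yield record, other


records = [
    {"id": 1, "name": "Acme Trading AS", "zip": "0150"},
    {"id": 2, "name": "ACME Trading", "zip": "0151"},
    {"id": 3, "name": "Nordic Fish AS", "zip": "5003"},
    {"id": 4, "name": "Bergen Bakery", "zip": "5004"},
    {"id": 5, "name": "Nordic Fisk", "zip": "5003"},
]

if __name__ == "__main__":
    for a, b in candidate_pairs(records, window=1):
        print(a["id"], b["id"])
```

With n records and a window of m, at most n·m pairs are produced instead of n(n-1)/2. The price is that two duplicates whose keys sort far apart are never compared, so the choice of key matters a great deal.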
Good research papers
  • Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
    http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
  • Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
  • Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al
    http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf
Duke (DUplicate KillEr)
Java deduplication engine
  • Work in progress
    - so far spent only ~20 hours on it
    - only command-line batch client built so far
  • Based on Lucene 3.1
  • Open source (on Google Code)
    - http://code.google.com/p/duke/
  • Blazingly fast
    - 960,000 records in 11 minutes on this laptop
Architecture
  • data in, equivalences out
  • [Diagram components: SDshare client, SDshare server, RDF frontend, Datastore API, Duke engine, Lucene, H2 database]
Architecture #2
  • data in, link file out
  • [Diagram components: Command-line client, CSV frontend, Datastore API, Duke engine, Lucene]
  • More frontends: SPARQL, RDF file, ...

Architecture #3
  • data in, equivalences out
  • [Diagram components: REST interface, X frontend, Datastore API, Duke engine, Lucene, H2 database]
Weaknesses
  • Tied to naïve Bayes model
    - research shows more sophisticated models perform better
    - non-trivial to reconcile these with index lookup
  • Value comparison sophistication limited
    - Lucene does support Levenshtein queries (these are slow, though; will be fast in 4.x)
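For readers unfamiliar with the Levenshtein comparison mentioned above, here is a minimal standalone sketch of the edit distance and a similarity score derived from it. It is not Lucene's implementation, and the `similarity` normalization (dividing by the longer string's length) is an assumption chosen for illustration.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a character from a
                current[j - 1] + 1,      # insert a character into a
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance into a 0.0 to 1.0 score (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("garshol", "garsjol"), similarity("garshol", "garsjol"))
```

Each call costs time proportional to the product of the two string lengths, which is cheap for a single pair but adds up when applied to every candidate the index returns.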