Deduplication


Lars Marius Garshol

Describes the basic issues in detecting duplicates in messy data and proposes an open source Java engine for solving the problem.


Deduplication. Bouvet BigOne, 2011-04-13. Lars Marius Garshol, <larsga@bouvet.no>, http://twitter.com/larsga



1. Deduplication. Bouvet BigOne, 2011-04-13. Lars Marius Garshol, <larsga@bouvet.no>, http://twitter.com/larsga

2. Getting started: baby steps

3. The problem: the suppliers table. Real-world data is very, very messy.

4. The problem, take 2. Tables shown: Suppliers, Customers, Customers, Customers, Companies. Systems shown: CRM, Billing, ERP. Each of these has internal duplicates, plus duplicates across the tables. No easy fix.

5. But ... what about identifiers? No, there are no system IDs across these tables. Yes, there are outside identifiers: organization numbers for companies and personal numbers for people. But these are problematic: many records don't have them; they are inconsistently formatted; sometimes they are misspelled; and some parts of huge organizations share one org number but need to be treated as separate.

6. First attempt at a solution. I wrote a simple Python script in ~2 hours. It does the following: loads all records; normalizes the data (strips extra whitespace, lowercases, removes letters from org codes, ...); uses Bayesian inference for matching.
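The normalization step on this slide could look roughly like the sketch below. The original script is not shown, so the field name `org_code` and the helper names are hypothetical, and reducing an org code to its digits only is an assumption that is slightly stricter than "remove letters".

```python
import re


def normalize_text(value: str) -> str:
    """Collapse runs of whitespace, trim the ends, and lowercase."""
    return re.sub(r"\s+", " ", value).strip().lower()


def normalize_org_code(value: str) -> str:
    """Keep only the digits of an organization code (assumed rule)."""
    return re.sub(r"\D", "", value)


def normalize_record(record: dict) -> dict:
    """Normalize every field; 'org_code' is a hypothetical field name."""
    cleaned = {}
    for field, value in record.items():
        if field == "org_code":
            cleaned[field] = normalize_org_code(value)
        else:
            cleaned[field] = normalize_text(value)
    return cleaned


raw = {"name": "  Acme   Trading AS ", "org_code": "NO 987 654 321 MVA"}
print(normalize_record(raw))
```

Normalizing before comparison means that records differing only in case, spacing, or org-code decoration compare as equal on those fields.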

7. Configuration

8. Matching. This works out to a probability of 0.93.
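The slide's worked example is not preserved in the transcript, so the sketch below only illustrates one common way to combine per-field match probabilities under the naive Bayes independence assumption. The function names and the input probabilities are hypothetical, and this is not claimed to be the exact calculation behind the 0.93 figure.

```python
from functools import reduce


def combine(p1: float, p2: float) -> float:
    """Combine two match probabilities, assuming they are independent.

    Undefined (division by zero) when one input is 1.0 and the other 0.0.
    """
    return (p1 * p2) / (p1 * p2 + (1.0 - p1) * (1.0 - p2))


def match_probability(field_probs: list[float]) -> float:
    """Fold per-field probabilities together, starting from a neutral 0.5."""
    return reduce(combine, field_probs, 0.5)


# Hypothetical per-field probabilities: a matching name is strong evidence,
# a differing postal code is weak evidence against a match.
print(match_probability([0.9, 0.4]))
```

A probability of 0.5 is neutral in this scheme: combining it with any other value leaves that value unchanged, so fields that carry no evidence do not affect the result.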

9. Problems. The functions comparing values are still pretty primitive. Performance is abysmal: 90 minutes to process 14,500 records; performance is O(n²); the total number of records is ~2.5 million; time to process all records: 1 year 10 months. Now what?

10. An idea. Well, we don't necessarily need to compare each record with all others: if we have indexes, we can look up the records which have matching values. Use DBM for the indexes, for example. Unfortunately, these only allow exact matching. But we can break up complex values into tokens and index those. Hang on, isn't this rather like a search engine? Bing! Let's try Lucene!
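The token-index idea can be sketched with an in-memory inverted index. This is a simplified stand-in for a DBM file or a Lucene index, not the prototype's actual code; the record layout and function names are assumptions.

```python
from collections import defaultdict


def tokenize(value: str) -> list[str]:
    """Split a field value into lowercase whitespace-separated tokens."""
    return value.lower().split()


def build_index(records: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Map each token to the ids of the records that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for record_id, fields in records.items():
        for value in fields.values():
            for token in tokenize(value):
                index[token].add(record_id)
    return index


def candidates(index: dict[str, set[str]], record_id: str,
               fields: dict[str, str]) -> set[str]:
    """Ids of other records sharing at least one token with this record."""
    found: set[str] = set()
    for value in fields.values():
        for token in tokenize(value):
            found |= index.get(token, set())
    found.discard(record_id)
    return found


records = {
    "a": {"name": "Acme Trading AS"},
    "b": {"name": "ACME Trading"},
    "c": {"name": "Bouvet ASA"},
}
index = build_index(records)
print(candidates(index, "a", records["a"]))
```

Only the candidates returned by the lookup need the expensive field-by-field comparison, which is what avoids comparing every record with every other record.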

11. Lucene-based prototype. I whip out Jython and try it. The new script first builds a Lucene index, then searches all records against the index. Time to process 14,500 records: 1 minute. Now we're talking...

12. Reality sets in: a splash of cold water to the face

13. Prior art. It turns out people have been doing this before. They call it entity resolution, identity resolution, merge/purge, deduplication, record linkage, ... This makes Googling for information an absolute nightmare.

14. Existing tools. Several commercial tools: they look big and expensive, so we skip those. Stian found some open source tools. Oyster: slow, bad architecture, primitive matching. SERF: slow, bad architecture. I've later found more, but was not impressed. So it seems we still have to do it ourselves.

15. Finds in the research literature. The general problem is well understood: "naïve Bayes" is naïve; there is lots of interesting work on value comparisons; the performance problem is 'solved' with "blocking" (build a key from parts of the data, sort records by key, compare each record with its m nearest neighbours), which takes performance from O(n²) to O(n·m); parallel processing is widely used. The Swoosh paper: compare and merge should have the ICAR properties (idempotence, commutativity, associativity, representativity); optimal algorithms for general merge were found; run-time for 14,000 records is ~1.5 hours...
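The blocking steps listed on this slide (build a key, sort by it, compare within a window of m neighbours) can be sketched as follows. The key recipe and the field names `name` and `city` are hypothetical; real keys are chosen per dataset.

```python
from typing import Iterator


def blocking_key(record: dict) -> str:
    """Hypothetical key: first three letters of the name plus first two of the city."""
    return record["name"][:3].lower() + record["city"][:2].lower()


def sorted_neighbourhood_pairs(records: list[dict], m: int) -> Iterator[tuple[dict, dict]]:
    """Yield each record paired with the next m records in key order.

    Looking only forwards yields every pair inside the window exactly once,
    so at most n * m pairs are produced instead of n * (n - 1) / 2.
    """
    ordered = sorted(records, key=blocking_key)
    for i, record in enumerate(ordered):
        for neighbour in ordered[i + 1 : i + 1 + m]:
            yield record, neighbour


sample = [
    {"name": "Zeta Shipping", "city": "Bergen"},
    {"name": "Acme AS", "city": "Oslo"},
    {"name": "Bouvet", "city": "Oslo"},
    {"name": "Acme", "city": "Oslo"},
    {"name": "Bouvet ASA", "city": "Oslo"},
]
for left, right in sorted_neighbourhood_pairs(sample, m=1):
    print(left["name"], "<->", right["name"])
```

The trade-off is that two duplicates whose keys sort far apart (for example because of a typo in the first letters of the name) never end up in the same window and are missed.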

16. Good research papers. "Threat and Fraud Intelligence, Las Vegas Style", Jeff Jonas, http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf ; "Real-world data is dirty: Data Cleansing and the Merge/Purge Problem", Hernandez & Stolfo, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf ; "Swoosh: a generic approach to entity resolution", Benjelloun, Garcia-Molina et al., http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf

17. Duke: DUplicate KillEr

18. Java deduplication engine. Work in progress: so far spent only ~20 hours on it; only a command-line batch client built so far. Based on Lucene 3.1. Open source, on Google Code (http://code.google.com/p/duke/). Blazingly fast: 960,000 records in 11 minutes on this laptop.

19. Architecture. Data in, equivalences out. Components: SDshare client, SDshare server, RDF frontend, Datastore API, Duke engine, Lucene, H2 database.

20. Architecture #2. Data in, link file out. Components: command-line client; more frontends (SPARQL, RDF file, ...); CSV frontend; Datastore API; Duke engine; Lucene.

24. Architecture #3. Data in, equivalences out. Components: REST interface, X frontend, Datastore API, Duke engine, Lucene, H2 database.

25. Weaknesses. Tied to the naïve Bayes model: research shows more sophisticated models perform better, and it is non-trivial to reconcile these with index lookup. Value comparison sophistication is limited: Lucene does support Levenshtein queries (these are slow, though; they will be fast in 4.x).
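For readers unfamiliar with the Levenshtein comparison mentioned on this slide, here is a minimal pure-Python sketch of the edit distance and a similarity score derived from it. It illustrates the metric only; it is not how Lucene evaluates such queries, and the `similarity` scaling is one possible assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to the range 0.0 (nothing shared) to 1.0 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(similarity("bouvet asa", "bouvet as"))
```

Each comparison costs time proportional to the product of the two string lengths, which is one reason fuzzy comparison is best applied only to candidates already narrowed down by an index lookup.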

26. Comments/questions?
