Duke is an open source tool for deduplicating and linking records across different data sources without common identifiers. It indexes data using Lucene and performs searches to find potential matches. Duke was used in a real-world project linking data from Mondial and DBpedia, where it correctly linked 94.9% of records while avoiding wrong links. Duke is flexible, scalable, and incremental, making it suitable for ongoing use at Hafslund to integrate customer records from multiple systems and remove duplicates. Future work may include improving comparators, adding a web service interface, and exploring parallelism.
1. Linking data without common identifiers
Semantic Web Meetup, 2011-08-23
Lars Marius Garshol, <larsga@bouvet.no>
http://twitter.com/larsga
2. About me
Lars Marius Garshol, Bouvet consultant
Worked with semantic technologies since 1999, mostly Topic Maps
Before that I worked with XML
Also background from Opera Software (Unicode support) and open source development (Java/Python)
3. Agenda
Linking data – the problem
Record linkage theory
Duke
A real-world example
Usage at Hafslund
4. The problem
Data sets from different sources generally don’t share identifiers
names tend to be spelled every which way
So how can they be linked?
6. A difficult problem
It requires n² comparisons for n records
a million comparisons for 1,000 records, 100 million for 10,000, ...
Exact string comparison is not enough
must handle misspellings, name variants, etc.
Interpreting the data can be difficult even for a human being
is the address different because there are two different people, or because the person moved?
...
7. Help from an unexpected quarter Statisticians to the rescue!
8. Record linkage
Statisticians very often must connect data sets from different sources
They call it “record linkage”
term coined in 1946 [1]
mathematical foundations laid in 1959 [2]
formalized in 1969 as the “Fellegi-Sunter” model [3]
A whole subject area has been developed with well-known techniques, methods, and tools
these can of course be applied outside of statistics
[1] http://ajph.aphapublications.org/cgi/reprint/36/12/1412
[2] http://www.sciencemag.org/content/130/3381/954.citation
[3] http://www.jstor.org/pss/2286061
9. Other terms for the same thing
Has been independently invented many times, under many names
entity resolution
identity resolution
merge/purge
deduplication
...
This makes Googling for information an absolute nightmare
10. Application areas
Statistics (obviously)
Data cleaning
Data integration
Conversion
Fraud detection / intelligence / surveillance
12. Model, simplified
We work with records, each of which has fields with values
before doing record linkage we’ve processed the data so that they have the same set of fields
(fields which exist in only one of the sources get discarded, as they are no help to us)
We compare values for the same fields one by one, then estimate the probability that the records represent the same entity given these values
probability depends on which field and what values
estimation can be very complex
Finally, we compute the overall probability (see the sketch below)
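A minimal Java sketch of this model follows. The class and method names are illustrative, not Duke's API, and the translation from similarity to probability is deliberately simplified; combining the per-field probabilities is shown later.

import java.util.List;
import java.util.Map;

// Sketch of the simplified model: both sources have been mapped to the same set of
// fields, records are field -> value maps, and each shared field contributes one
// piece of evidence towards the overall match probability.
public class RecordModelSketch {

  // Similarity of two field values, in the range 0.0 (different) to 1.0 (identical).
  interface FieldComparator {
    double compare(String v1, String v2);
  }

  // Compare two records field by field and return one probability per field.
  static double[] fieldProbabilities(Map<String, String> r1,
                                     Map<String, String> r2,
                                     List<String> fields,
                                     Map<String, FieldComparator> comparators) {
    double[] probs = new double[fields.size()];
    for (int ix = 0; ix < fields.size(); ix++) {
      String field = fields.get(ix);
      double similarity = comparators.get(field).compare(r1.get(field), r2.get(field));
      // In practice the similarity is translated into a probability that depends on
      // how much identity evidence the field carries and on the values themselves.
      probs[ix] = similarity;
    }
    return probs;
  }
}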
14. String comparisons
Must handle spelling errors and name variants
must estimate the probability that the values belong to the same entity despite the differences
Examples
John Smith ≈ Jonh Smith
J. Random Hacker ≈ James Random Hacker
...
Many well-known algorithms have been developed
no one best choice, unfortunately
some are eager to merge, others less so
Many are computationally expensive
O(n²), where n is the length of the string, is common
this gets very costly when the number of record pairs to compare is already high
15. Examples of algorithms
Levenshtein edit distance
i.e. the number of edits needed to change string 1 into string 2
not so eager to merge
Jaro-Winkler
specifically for short person names
very promiscuous
Soundex
essentially a phonetic hashing algorithm
very fast, but also very crude
Longest common subsequence / q-grams
haven’t tried these yet
Monge-Elkan
one of the best, but computationally expensive
TFIDF
based on term frequency
studies often find this performs best, but it’s expensive
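For illustration, here is a compact Java implementation of Levenshtein edit distance (Duke ships its own comparators; this is only to make the idea, and the O(n²) cost mentioned above, concrete):

// Levenshtein distance: the number of single-character insertions, deletions,
// and substitutions needed to turn s1 into s2. Uses two rows of the DP table,
// so memory is O(|s2|) while time is O(|s1| * |s2|).
public class LevenshteinSketch {

  static int distance(String s1, String s2) {
    int[] prev = new int[s2.length() + 1];
    int[] curr = new int[s2.length() + 1];
    for (int j = 0; j <= s2.length(); j++)
      prev[j] = j; // turning the empty string into s2[0..j) takes j insertions

    for (int i = 1; i <= s1.length(); i++) {
      curr[0] = i; // turning s1[0..i) into the empty string takes i deletions
      for (int j = 1; j <= s2.length(); j++) {
        int cost = (s1.charAt(i - 1) == s2.charAt(j - 1)) ? 0 : 1;
        curr[j] = Math.min(Math.min(curr[j - 1] + 1,   // insertion
                                    prev[j] + 1),      // deletion
                           prev[j - 1] + cost);        // substitution (or match)
      }
      int[] tmp = prev; prev = curr; curr = tmp;
    }
    return prev[s2.length()];
  }

  public static void main(String[] args) {
    // prints 2: a swapped character pair counts as two edits here
    System.out.println(distance("John Smith", "Jonh Smith"));
  }
}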
16. Existing record linkage tools
Commercial tools
big, sophisticated, and expensive
have found little information on what they actually do
presumably also effective
Open source tools
generally made by and for statisticians
nice user interfaces and rich configurability
architecture often not as flexible as it could be
17. Standard algorithm
n² comparisons for n records is unacceptable
must reduce the number of direct comparisons
Solution (a sketch follows below)
produce a key from field values
sort records by the key
for each record, compare with the n nearest neighbours
sometimes several different keys are produced, to increase the chances of finding matches
Downsides
requires coming up with a key
difficult to apply incrementally
sorting is expensive
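A rough Java sketch of this key-based approach (often called the sorted neighbourhood method). This is not what Duke does, and the key function here is made up purely for illustration:

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

public class SortedNeighbourhoodSketch {

  // Build a sorting key from field values; this example key (first letters of
  // the name plus the zip code) is an assumption, chosen only to show the idea.
  static String key(Map<String, String> record) {
    String name = record.getOrDefault("NAME", "");
    String zip  = record.getOrDefault("ZIP", "");
    return (name.length() >= 3 ? name.substring(0, 3) : name).toUpperCase() + zip;
  }

  // Sort by the key, then compare each record only with its w nearest neighbours,
  // reducing the work from n^2 comparisons to roughly n log n + n*w.
  static void findMatches(List<Map<String, String>> records, int w) {
    List<Map<String, String>> sorted = new ArrayList<>(records);
    sorted.sort(Comparator.comparing(SortedNeighbourhoodSketch::key));

    for (int i = 0; i < sorted.size(); i++)
      for (int j = i + 1; j < Math.min(i + 1 + w, sorted.size()); j++)
        compareInDetail(sorted.get(i), sorted.get(j));
  }

  // Detailed field-by-field comparison as described earlier (omitted here).
  static void compareInDetail(Map<String, String> r1, Map<String, String> r2) {
  }
}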
18. Good research papers
Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas
http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf
Real-world Data is Dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf
Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf
20. Context
Doing a project for Hafslund where we integrate data from many sources
Entities are duplicated both inside systems and across systems
[Diagram: suppliers, customers, and companies duplicated across the CRM, Billing, and ERP systems]
21. Requirements
Must be flexible and configurable
no way to know in advance exactly what data we will need to deduplicate
Must scale to large data sets
CRM alone has 1.4 million customer records
that’s 2 trillion comparisons with the naïve approach
Must have an API
project uses the SDshare protocol everywhere
must therefore be able to build connectors
Must be able to work incrementally
process data as it changes and update conclusions on the fly
23. Duke
Java deduplication engine, released as open source
http://code.google.com/p/duke/
Does not use the key approach
instead indexes the data with Lucene
does Lucene searches to find potential matches
Still a work in progress, but
high performance (1M records in 10 minutes)
several data sources and comparators
used for real in real projects
flexible architecture
24. How it works
XML configuration file defines the setup
lists data sources, properties, probabilities, etc.
Data sources provide streams of records
Steps (the index-and-search idea is sketched below):
normalize values so they are ready for comparison
collect a batch of records, then index it with Lucene
compare records one by one against the Lucene index
do detailed comparisons against the search results
pairs with a probability above the configurable threshold are considered matches
For now, only API and command-line
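To make the indexing and candidate search concrete, here is a rough sketch using the plain Lucene API. This is not Duke's internal code: the class names are from recent Lucene releases, and the records and field values are made up.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class CandidateSearchSketch {

  public static void main(String[] args) throws Exception {
    Directory dir = new ByteBuffersDirectory();      // in-memory index for the sketch
    StandardAnalyzer analyzer = new StandardAnalyzer();

    // 1. index a batch of cleaned records as Lucene documents
    IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer));
    Document doc = new Document();
    doc.add(new TextField("NAME", "john smith", Field.Store.YES));
    doc.add(new TextField("ADDRESS", "main street 1", Field.Store.YES));
    writer.addDocument(doc);
    writer.close();

    // 2. for each incoming record, search the index to find candidate matches
    IndexSearcher searcher = new IndexSearcher(DirectoryReader.open(dir));
    QueryParser parser = new QueryParser("NAME", analyzer);
    TopDocs candidates = searcher.search(parser.parse("jon smith"), 10);

    // 3. only these candidates are then compared in detail, field by field
    for (ScoreDoc hit : candidates.scoreDocs)
      System.out.println(searcher.doc(hit.doc).get("NAME"));
  }
}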
25. Record comparisons in detail
Comparator compares field values
returns a number in the range 0.0 – 1.0
the probability for the field is the configured high probability if the number is 1.0, the low probability if it is 0.0, and somewhere in between otherwise
The probability for the entire record is computed using Bayes’s formula
approach known as “naïve Bayes” in the research literature
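A sketch of how such a combination can be computed. The linear mapping from comparator result to field probability and the 0.5 prior are simplifying assumptions, not necessarily Duke's exact formula; the low/high values in the example are taken from the configuration shown later.

public class NaiveBayesSketch {

  // Map a comparator result in [0.0, 1.0] to a probability between the field's
  // configured low (result 0.0) and high (result 1.0) probabilities.
  // Assumption: a simple linear interpolation.
  static double fieldProbability(double similarity, double low, double high) {
    return low + similarity * (high - low);
  }

  // Fold the per-field probabilities together with Bayes's formula,
  // treating the fields as independent evidence (naive Bayes).
  static double recordProbability(double[] fieldProbs) {
    double prob = 0.5;                     // uninformative prior
    for (double p : fieldProbs)
      prob = (prob * p) / (prob * p + (1.0 - prob) * (1.0 - p));
    return prob;
  }

  public static void main(String[] args) {
    // e.g. NAME matched well (similarity 0.9, low 0.3 / high 0.88) and
    // CAPITAL matched exactly (low 0.4 / high 0.88); result is well above 0.65
    double name    = fieldProbability(0.9, 0.3, 0.88);
    double capital = fieldProbability(1.0, 0.4, 0.88);
    System.out.println(recordProbability(new double[] { name, capital }));
  }
}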
26. Components
Data sources
CSV
JDBC
SPARQL
NTriples
<plug in your own>
Comparators
ExactComparator
NumericComparator
SoundexComparator
TokenizedComparator
Levenshtein
JaroWinkler
Dice coefficient
27. Features
Fairly complete command-line tool with debugging support
API for embedding the tool
Two modes:
deduplication (all records matched against each other)
linking (only matching across groups of records)
Pluggable data sources, comparators, and cleaners
Framework for composing cleaners
Incremental processing
High performance
29. Finding properties to match
Need properties providing identity evidence
Matching on the properties in bold below
Extracted the data to CSV for ease of use
31. Configuration – matching

<schema>
  <threshold>0.65</threshold>

  <property type="id">
    <name>ID</name>
  </property>

  <property>
    <name>NAME</name>
    <comparator>no.priv.garshol.duke.Levenshtein</comparator>
    <low>0.3</low>
    <high>0.88</high>
  </property>

  <property>
    <name>AREA</name>
    <comparator>AreaComparator</comparator>
    <low>0.2</low>
    <high>0.6</high>
  </property>

  <property>
    <name>CAPITAL</name>
    <comparator>no.priv.garshol.duke.Levenshtein</comparator>
    <low>0.4</low>
    <high>0.88</high>
  </property>
</schema>

The AreaComparator used above is defined separately:

<object class="no.priv.garshol.duke.NumericComparator" name="AreaComparator">
  <param name="min-ratio" value="0.7"/>
</object>

Duke analyzes this setup and decides only NAME and CAPITAL need to be searched on in Lucene.
32. Result
Correct links found: 206 / 217 (94.9%)
Wrong links found: 0 / 12 (0.0%)
Unknown links found: 0
Percent of links correct 100.0%, wrong 0.0%, unknown 0.0%
Records with no link: 25
Precision 100.0%, recall 94.9%, f-number 0.974
35. An example of failure
Duke doesn’t find this match
no tokens matching exactly
the Lucene search finds nothing
the detailed comparison gives the correct result
so the only problem is the Lucene search
Lucene does have Levenshtein search, but in Lucene 3.x it’s very slow
therefore not enabled now
thinking of adding an option to enable it where needed
Lucene 4.x will fix the performance problem
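For illustration, this is roughly what a Levenshtein-based search looks like with Lucene's FuzzyQuery in later Lucene versions; the field name and value here are made up:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

// match NAME values within two edits of "oslo" (hypothetical example)
FuzzyQuery query = new FuzzyQuery(new Term("NAME", "oslo"), 2);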
38. The SESAM project
Building a new archival system
does automatic tagging of documents based on knowledge about the data
knowledge extracted from backend systems
Search engine-based frontend
using Recommind for this
Very flexible architecture
extracted data stored in a triple store (Virtuoso)
all data transfer based on the SDshare protocol
data extracted from RDBMSs with Ontopia’s DB2TM
39. The big picture
[Architecture diagram: data flows over SDshare between the source systems (360, ERP, CRM, Billing), the Virtuoso triple store, Recommind, and Duke; the duplicates Duke finds are published as owl:sameAs and haf:possiblySameAs links]
40. Experiences so far
Incremental processing works fine
links are added and retracted as the data changes
Performance not an issue at all
but then only ~50,000 records so far...
Matching works well, but is not perfect
the data are very noisy and messy
also, matching is hard
41. Duke roadmap
0.3
clean up the public API and document it properly
maybe some more comparators
support for writing owl:sameAs to a SPARQL endpoint
0.4
add a web service interface
0.5 and onwards
more comparators
maybe some parallelism