SlideShare ist ein Scribd-Unternehmen logo
1 von 42
Linking data without common identifiers Semantic Web Meetup, 2011-08-23 Lars Marius Garshol, <larsga@bouvet.no> http://twitter.com/larsga
About me Lars Marius Garshol, Bouvet consultant Worked with semantic technologies since 1999 mostly Topic Maps Before that I worked with XML Also background from Opera Software (Unicode support) open source development (Java/Python)
Agenda Linking data – the problem Record linkage theory Duke A real-world example Usage at Hafslund
The problem Data sets from different sources generally don’t share identifiers names tend to be spelled every which way So how can they be linked?
A real-world example
A difficult problem It requires n2 comparisons for n records a million comparisons for 1000 records, 100 million for 10,000, ... Exact string comparison is not enough must handle misspellings, name variants, etc Interpreting the data can be difficult even for a human being is the address different because there are two different people, or because the person moved? ...
Help from an unexpected quarter Statisticians to the rescue!
Record linkage Statisticians very often must connect data sets from different sources They call it “record linkage” term coined in 19461) mathematical foundations laid in 19592) formalized in 1969 as “Fellegi-Sunter” model3) A whole subject area has been developed with well-known techniques, methods, and tools these can of course be applied outside of statistics http://ajph.aphapublications.org/cgi/reprint/36/12/1412 http://www.sciencemag.org/content/130/3381/954.citation http://www.jstor.org/pss/2286061
Other terms for the same thing Has been independently invented many times Under many names entity resolution identity resolution merge/purge deduplication ... This makes Googling for information an absolute nightmare
Application areas Statistics (obviously) Data cleaning Data integration Conversion Fraud detection / intelligence / surveillance
Mathematical model
Model, simplified We work with records, each of which has fields with values before doing record linkage we’ve processed the data so that they have the same set of fields (fields which exist in only one of the sources get discarded, as they are no help to us) We compare values for the same fields one by one, then estimate the probability that records represent the same entity given these values probability depends on which field and what values estimation can be very complex Finally, we can compute overall probability
Example
String comparisons Must handle spelling errors and name variants must estimate probability that values belong to same entity despite differences Examples John Smith ≈ Jonh Smith J. Random Hacker ≈ James Random Hacker ... Many well-known algorithms have been developed no one best choice, unfortunately some are eager to merge, others less so Many are computationally expensive O(n2) where n is length of string is common this gets very costly when number of record pairs to compare is already high
Examples of algorithms Levenshtein edit distance ie: number of edits needed to change string 1 into 2 not so eager to merge Jaro-Winkler specifically for short person names very promiscuous Soundex essentially a phonetic hashing algorithm very fast, but also very crude Longest common subsequence / q-grams haven’t tried this yet Monge-Elkan one of the best, but computationally expensive TFIDF based on term frequency studies often find this performs best, but it’s expensive
Existing record linkage tools Commercial tools big, sophisticated, and expensive have found little information on what they actually do presumably also effective Open source tools generally made by and for statisticians nice user interfaces and rich configurability architecture often not as flexible as it could be
Standard algorithm n2comparisons for n records is unacceptable must reduce number of direct comparisons Solution produce a key from field values, sort records by the key, for each record, compare with n nearest neighbours sometimes several different keys are produced, to increase chances of finding matches Downsides requires coming up with a key difficult to apply incrementally sorting is expensive
Good research papers Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf
DUplicate KillEr Duke
Context Doing a project for Hafslund where we integrate data from many sources Entities are duplicated both inside systems and across systems Suppliers Customers Customers Customers Companies CRM Billing ERP
Requirements Must be flexible and configurable no way to know in advance exactly what data we will need to deduplicate Must scale to large data sets CRM alone has 1.4 million customer records that’s 2 trillion comparisons with naïve approach Must have an API project uses SDshare protocol everywhere must therefore be able to build connectors Must be able to work incrementally process data as it changes and update conclusions on the fly
Reviewed existing tools... ...but didn’t find anything matching the criteria Therefore...
Duke Java deduplication engine released as open source http://code.google.com/p/duke/ Does not use key approach instead indexes data with Lucene does Lucene searches to find potential matches Still a work in progress, but high performance (1M records in 10 minutes), several data sources and comparators, being used for real in real projects, flexible architecture
How it works XML configuration file defines setup lists data sources, properties, probabilities, etc Data sources provide streams of records Steps: normalize values so they are ready for comparison collect a batch of records, then index it with Lucene compare records one by one against Lucene index do detailed comparisons against search results pairs with probability above configurable threshold are considered matches For now, only API and command-line
Record comparisons in detail Comparator compares field values return number in range 1.0 – 0.0 probability for field is high probability if number is 1.0, or low probability if 0.0, otherwise in between the two Probability for entire record is computed using Bayes’s formula approach known as “naïve Bayes” in research literature
Components Data sources CSV JDBC Sparql NTriples <plug in your own> Comparators ExactComparator NumericComparator SoundexComparator TokenizedComparator Levenshtein JaroWinkler Dice coefficient
Features Fairly complete command-line tool with debugging support API for embedding the tool Two modes: deduplication (all records matched against each other) linking (only matching across groups of records) Pluggable data sources, comparators, and cleaners Framework for composing cleaners Incremental processing High performance
A real-world example Matching Mondial and DBpedia
Finding properties to match Need properties providing identity evidence Matching on the properties in bold below Extracted data to CSV for ease of use
Configuration – data sources  <group>     <csv>       <param name="input-file" value="dbpedia.csv"/>       <param name="header-line" value="false"/>       <column name="1" property="ID"/>       <column name="2"               cleaner="no.priv...CountryNameCleaner"               property="NAME"/>       <column name="3"               property="AREA"/>       <column name="4"               cleaner="no.priv...CapitalCleaner"               property="CAPITAL"/>     </csv>   </group> <group>     <csv>       <param name="input-file" value="mondial.csv"/>       <column name="id" property="ID"/>       <column name="country"               cleaner="no.priv...examples.CountryNameCleaner"               property="NAME"/>       <column name="capital"               cleaner="no.priv...LowerCaseNormalizeCleaner"               property="CAPITAL"/>       <column name="area"               property="AREA"/>     </csv>   </group>
Configuration – matching  <schema>     <threshold>0.65</threshold>     <property type="id">         <name>ID</name>     </property>     <property>         <name>NAME</name>          <comparator>no.priv.garshol.duke.Levenshtein</comparator>         <low>0.3</low>         <high>0.88</high>     </property>         <property>         <name>AREA</name>          <comparator>AreaComparator</comparator>         <low>0.2</low>         <high>0.6</high>     </property>     <property>         <name>CAPITAL</name>          <comparator>no.priv.garshol.duke.Levenshtein</comparator>         <low>0.4</low>         <high>0.88</high>     </property>       </schema> Duke analyzes this setup and decides only NAME and CAPITAL need to be  searched on in Lucene.  <object class="no.priv.garshol.duke.NumericComparator"           name="AreaComparator">     <param name="min-ratio" value="0.7"/>   </object>
Result Correct links found: 206 / 217 (94.9%) Wrong links found: 0 / 12 (0.0%) Unknown links found: 0 Percent of links correct 100.0%, wrong 0.0%, unknown 0.0% Records with no link: 25 Precision 100.0%, recall 94.9%, f-number 0.974
Examples
Choosing the right match
An example of failure Duke doesn’t find this match no tokens matching exactly Lucene search finds nothing Detailed comparison gives correct result so only problem is Lucene search Lucene does have Levenshtein search, but in Lucene 3.x it’s very slow therefore not enabled now thinking of adding option to enable where needed Lucene 4.x will fix the performance problem
The effects of value matching
Duke in real life Usage at Hafslund
The SESAM project Building a new archival system does automatic tagging of documents based on knowledge about the data knowledge extracted from backend systems Search engine-based frontend using Recommind for this Very flexible architecture extracted data stored in triple store (Virtuoso) all data transfer based on SDshare protocol data extracted from RDBMSs with Ontopia’s DB2TM
The big picture DUPLICATES! 360 SDshare SDshare Virtuoso ERP Recommind SDshare SDshare CRM Billing Duke SDshare SDshare contains owl:sameAs and haf:possiblySameAs
Experiences so far Incremental processing works fine links added and retracted as data changes Performance not an issue at all but then only ~50,000 records so far... Matching works well, but not perfect data are very noisy and messy also, matching is hard
Duke roadmap 0.3 clean up the public API and document properly maybe some more comparators support for writing owl:sameAs to Sparql endpoint 0.4 add a web service interface 0.5 and onwards more comparators maybe some parallelism
Slides will appear on http://slideshare.net/larsga Comments/questions?

Weitere ähnliche Inhalte

Was ist angesagt?

Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data ArchitectureGuido Schmutz
 
Content Management with MongoDB by Mark Helmstetter
 Content Management with MongoDB by Mark Helmstetter Content Management with MongoDB by Mark Helmstetter
Content Management with MongoDB by Mark HelmstetterMongoDB
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic SearchPaul Wlodarczyk
 
Introduction to python programming
Introduction to python programmingIntroduction to python programming
Introduction to python programmingSrinivas Narasegouda
 
Credit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning AlgorithmsCredit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning Algorithmsankit panigrahy
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachinePulse
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryNeo4j
 
Customer Segmentation using Clustering
Customer Segmentation using ClusteringCustomer Segmentation using Clustering
Customer Segmentation using ClusteringDessy Amirudin
 
Introduction to XGboost
Introduction to XGboostIntroduction to XGboost
Introduction to XGboostShuai Zhang
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyMarina Santini
 
Python Data Structures and Algorithms.pptx
Python Data Structures and Algorithms.pptxPython Data Structures and Algorithms.pptx
Python Data Structures and Algorithms.pptxShreyasLawand
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginnersNeil Baker
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with PythonBenjamin Bengfort
 
Natural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry HamonNatural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry HamonGrammarly
 
Python Sequence | Python Lists | Python Sets & Dictionary | Python Strings | ...
Python Sequence | Python Lists | Python Sets & Dictionary | Python Strings | ...Python Sequence | Python Lists | Python Sets & Dictionary | Python Strings | ...
Python Sequence | Python Lists | Python Sets & Dictionary | Python Strings | ...Edureka!
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahoutGaurav Kasliwal
 
Education data mining presentation
Education data mining presentationEducation data mining presentation
Education data mining presentationNishabhanot1
 

Was ist angesagt? (20)

Big Data Architecture
Big Data ArchitectureBig Data Architecture
Big Data Architecture
 
Content Management with MongoDB by Mark Helmstetter
 Content Management with MongoDB by Mark Helmstetter Content Management with MongoDB by Mark Helmstetter
Content Management with MongoDB by Mark Helmstetter
 
Encodings
EncodingsEncodings
Encodings
 
Implementing Semantic Search
Implementing Semantic SearchImplementing Semantic Search
Implementing Semantic Search
 
Introduction to python programming
Introduction to python programmingIntroduction to python programming
Introduction to python programming
 
Credit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning AlgorithmsCredit card fraud detection using machine learning Algorithms
Credit card fraud detection using machine learning Algorithms
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
 
What Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS LibraryWhat Is GDS and Neo4j’s GDS Library
What Is GDS and Neo4j’s GDS Library
 
Customer Segmentation using Clustering
Customer Segmentation using ClusteringCustomer Segmentation using Clustering
Customer Segmentation using Clustering
 
Introduction to XGboost
Introduction to XGboostIntroduction to XGboost
Introduction to XGboost
 
Lecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language TechnologyLecture 1: Semantic Analysis in Language Technology
Lecture 1: Semantic Analysis in Language Technology
 
Intro to Jupyter Notebooks
Intro to Jupyter NotebooksIntro to Jupyter Notebooks
Intro to Jupyter Notebooks
 
Python Data Structures and Algorithms.pptx
Python Data Structures and Algorithms.pptxPython Data Structures and Algorithms.pptx
Python Data Structures and Algorithms.pptx
 
Elasticsearch for beginners
Elasticsearch for beginnersElasticsearch for beginners
Elasticsearch for beginners
 
Natural Language Processing with Python
Natural Language Processing with PythonNatural Language Processing with Python
Natural Language Processing with Python
 
Natural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry HamonNatural Language Processing for biomedical text mining - Thierry Hamon
Natural Language Processing for biomedical text mining - Thierry Hamon
 
Python Sequence | Python Lists | Python Sets & Dictionary | Python Strings | ...
Python Sequence | Python Lists | Python Sets & Dictionary | Python Strings | ...Python Sequence | Python Lists | Python Sets & Dictionary | Python Strings | ...
Python Sequence | Python Lists | Python Sets & Dictionary | Python Strings | ...
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahout
 
Education data mining presentation
Education data mining presentationEducation data mining presentation
Education data mining presentation
 
Elasticsearch
ElasticsearchElasticsearch
Elasticsearch
 

Andere mochten auch

Prescription event monitorig
Prescription event monitorigPrescription event monitorig
Prescription event monitorignagpharma
 
Record matching over query results from Web Databases
Record matching over query results from Web DatabasesRecord matching over query results from Web Databases
Record matching over query results from Web Databasestusharjadhav2611
 
An adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate recordsAn adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate recordsLikan Patra
 
online Record Linkage
online Record Linkageonline Record Linkage
online Record LinkagePriya Pandian
 
The Duplicitous Duplicate
The Duplicitous DuplicateThe Duplicitous Duplicate
The Duplicitous DuplicateAnish Raivadera
 
novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingVipin Kp
 
Indexing Techniques for Scalable Record Linkage and Deduplication
Indexing Techniques for Scalable Record Linkage and DeduplicationIndexing Techniques for Scalable Record Linkage and Deduplication
Indexing Techniques for Scalable Record Linkage and DeduplicationPradeeban Kathiravelu, Ph.D.
 
Prescription event monitoring and record linkage system
Prescription event monitoring and record linkage systemPrescription event monitoring and record linkage system
Prescription event monitoring and record linkage systemVineetha Menon
 
Duplicate detection
Duplicate detectionDuplicate detection
Duplicate detectionjonecx
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismseSAT Journals
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data DeduplicationRedWireServices
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Kira
 
Progressive duplicate detection
Progressive duplicate detectionProgressive duplicate detection
Progressive duplicate detectionieeepondy
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsPradeeban Kathiravelu, Ph.D.
 
Visualising Multi Dimensional Data
Visualising Multi Dimensional DataVisualising Multi Dimensional Data
Visualising Multi Dimensional DataAmit Kapoor
 
Data vs. information
Data vs. informationData vs. information
Data vs. informationBesar Limani
 

Andere mochten auch (20)

Prescription event monitorig
Prescription event monitorigPrescription event monitorig
Prescription event monitorig
 
Progressive Texture
Progressive TextureProgressive Texture
Progressive Texture
 
Record matching over query results from Web Databases
Record matching over query results from Web DatabasesRecord matching over query results from Web Databases
Record matching over query results from Web Databases
 
An adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate recordsAn adaptive algorithm for detection of duplicate records
An adaptive algorithm for detection of duplicate records
 
online Record Linkage
online Record Linkageonline Record Linkage
online Record Linkage
 
The Duplicitous Duplicate
The Duplicitous DuplicateThe Duplicitous Duplicate
The Duplicitous Duplicate
 
novel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawlingnovel and efficient approch for detection of duplicate pages in web crawling
novel and efficient approch for detection of duplicate pages in web crawling
 
Indexing Techniques for Scalable Record Linkage and Deduplication
Indexing Techniques for Scalable Record Linkage and DeduplicationIndexing Techniques for Scalable Record Linkage and Deduplication
Indexing Techniques for Scalable Record Linkage and Deduplication
 
A Case Study in Record Linkage_PVER Conf_May2011
A Case Study in Record Linkage_PVER Conf_May2011A Case Study in Record Linkage_PVER Conf_May2011
A Case Study in Record Linkage_PVER Conf_May2011
 
Prescription event monitoring and record linkage system
Prescription event monitoring and record linkage systemPrescription event monitoring and record linkage system
Prescription event monitoring and record linkage system
 
Duplicate detection
Duplicate detectionDuplicate detection
Duplicate detection
 
A study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanismsA study and survey on various progressive duplicate detection mechanisms
A study and survey on various progressive duplicate detection mechanisms
 
Open Source Data Deduplication
Open Source Data DeduplicationOpen Source Data Deduplication
Open Source Data Deduplication
 
Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)Tutorial 4 (duplicate detection)
Tutorial 4 (duplicate detection)
 
Progressive duplicate detection
Progressive duplicate detectionProgressive duplicate detection
Progressive duplicate detection
 
Efficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data SetsEfficient Duplicate Detection Over Massive Data Sets
Efficient Duplicate Detection Over Massive Data Sets
 
Spontaneous reporting
Spontaneous reporting Spontaneous reporting
Spontaneous reporting
 
Record linkage methods applied to population data deduplication
Record linkage methods applied to population data deduplicationRecord linkage methods applied to population data deduplication
Record linkage methods applied to population data deduplication
 
Visualising Multi Dimensional Data
Visualising Multi Dimensional DataVisualising Multi Dimensional Data
Visualising Multi Dimensional Data
 
Data vs. information
Data vs. informationData vs. information
Data vs. information
 

Ähnlich wie Linking data without common identifiers

Swertz Molgenis Bosc2009
Swertz Molgenis Bosc2009Swertz Molgenis Bosc2009
Swertz Molgenis Bosc2009bosc
 
Defense Against the Dark Arts: Protecting Your Data from ORMs
Defense Against the Dark Arts: Protecting Your Data from ORMsDefense Against the Dark Arts: Protecting Your Data from ORMs
Defense Against the Dark Arts: Protecting Your Data from ORMsVanessa Hurst
 
New Directions in Metadata
New Directions in MetadataNew Directions in Metadata
New Directions in Metadatasuyu22
 
Advance Data Mining Project Report
Advance Data Mining Project ReportAdvance Data Mining Project Report
Advance Data Mining Project ReportArnab Mukhopadhyay
 
Biomart Update
Biomart UpdateBiomart Update
Biomart Updatebosc
 
Introduction To Programming
Introduction To ProgrammingIntroduction To Programming
Introduction To Programmingcwarren
 
Comparing DOM XSS Tools On Real World Bug
Comparing DOM XSS Tools On Real World BugComparing DOM XSS Tools On Real World Bug
Comparing DOM XSS Tools On Real World BugStefano Di Paola
 
Automatic Metadata Generation using Associative Networks
Automatic Metadata Generation using Associative NetworksAutomatic Metadata Generation using Associative Networks
Automatic Metadata Generation using Associative NetworksMarko Rodriguez
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchAndrew Lowe
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R NotesLakshmiSarvani6
 
Nuts and bolts
Nuts and boltsNuts and bolts
Nuts and boltsNBER
 

Ähnlich wie Linking data without common identifiers (20)

Swertz Molgenis Bosc2009
Swertz Molgenis Bosc2009Swertz Molgenis Bosc2009
Swertz Molgenis Bosc2009
 
Defense Against the Dark Arts: Protecting Your Data from ORMs
Defense Against the Dark Arts: Protecting Your Data from ORMsDefense Against the Dark Arts: Protecting Your Data from ORMs
Defense Against the Dark Arts: Protecting Your Data from ORMs
 
New Directions in Metadata
New Directions in MetadataNew Directions in Metadata
New Directions in Metadata
 
Sweo talk
Sweo talkSweo talk
Sweo talk
 
Classification
ClassificationClassification
Classification
 
Advance Data Mining Project Report
Advance Data Mining Project ReportAdvance Data Mining Project Report
Advance Data Mining Project Report
 
Biomart Update
Biomart UpdateBiomart Update
Biomart Update
 
Javascript2839
Javascript2839Javascript2839
Javascript2839
 
Json
JsonJson
Json
 
Xml
XmlXml
Xml
 
Struts2
Struts2Struts2
Struts2
 
Lucene And Solr Intro
Lucene And Solr IntroLucene And Solr Intro
Lucene And Solr Intro
 
Introduction To Programming
Introduction To ProgrammingIntroduction To Programming
Introduction To Programming
 
NHibernate
NHibernateNHibernate
NHibernate
 
Comparing DOM XSS Tools On Real World Bug
Comparing DOM XSS Tools On Real World BugComparing DOM XSS Tools On Real World Bug
Comparing DOM XSS Tools On Real World Bug
 
Automatic Metadata Generation using Associative Networks
Automatic Metadata Generation using Associative NetworksAutomatic Metadata Generation using Associative Networks
Automatic Metadata Generation using Associative Networks
 
Language-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible researchLanguage-agnostic data analysis workflows and reproducible research
Language-agnostic data analysis workflows and reproducible research
 
Introduction to Data Science With R Notes
Introduction to Data Science With R NotesIntroduction to Data Science With R Notes
Introduction to Data Science With R Notes
 
Eurolan 2005 Pedersen
Eurolan 2005 PedersenEurolan 2005 Pedersen
Eurolan 2005 Pedersen
 
Nuts and bolts
Nuts and boltsNuts and bolts
Nuts and bolts
 

Mehr von Lars Marius Garshol

JSLT: JSON querying and transformation
JSLT: JSON querying and transformationJSLT: JSON querying and transformation
JSLT: JSON querying and transformationLars Marius Garshol
 
Data collection in AWS at Schibsted
Data collection in AWS at SchibstedData collection in AWS at Schibsted
Data collection in AWS at SchibstedLars Marius Garshol
 
NoSQL and Einstein's theory of relativity
NoSQL and Einstein's theory of relativityNoSQL and Einstein's theory of relativity
NoSQL and Einstein's theory of relativityLars Marius Garshol
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engineLars Marius Garshol
 
Linked Open Data for the Cultural Sector
Linked Open Data for the Cultural SectorLinked Open Data for the Cultural Sector
Linked Open Data for the Cultural SectorLars Marius Garshol
 
NoSQL databases, the CAP theorem, and the theory of relativity
NoSQL databases, the CAP theorem, and the theory of relativityNoSQL databases, the CAP theorem, and the theory of relativity
NoSQL databases, the CAP theorem, and the theory of relativityLars Marius Garshol
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningLars Marius Garshol
 
Hafslund SESAM - Semantic integration in practice
Hafslund SESAM - Semantic integration in practiceHafslund SESAM - Semantic integration in practice
Hafslund SESAM - Semantic integration in practiceLars Marius Garshol
 

Mehr von Lars Marius Garshol (20)

JSLT: JSON querying and transformation
JSLT: JSON querying and transformationJSLT: JSON querying and transformation
JSLT: JSON querying and transformation
 
Data collection in AWS at Schibsted
Data collection in AWS at SchibstedData collection in AWS at Schibsted
Data collection in AWS at Schibsted
 
Kveik - what is it?
Kveik - what is it?Kveik - what is it?
Kveik - what is it?
 
Nature-inspired algorithms
Nature-inspired algorithmsNature-inspired algorithms
Nature-inspired algorithms
 
Collecting 600M events/day
Collecting 600M events/dayCollecting 600M events/day
Collecting 600M events/day
 
History of writing
History of writingHistory of writing
History of writing
 
NoSQL and Einstein's theory of relativity
NoSQL and Einstein's theory of relativityNoSQL and Einstein's theory of relativity
NoSQL and Einstein's theory of relativity
 
Norwegian farmhouse ale
Norwegian farmhouse aleNorwegian farmhouse ale
Norwegian farmhouse ale
 
Archive integration with RDF
Archive integration with RDFArchive integration with RDF
Archive integration with RDF
 
The Euro crisis in 10 minutes
The Euro crisis in 10 minutesThe Euro crisis in 10 minutes
The Euro crisis in 10 minutes
 
Using the search engine as recommendation engine
Using the search engine as recommendation engineUsing the search engine as recommendation engine
Using the search engine as recommendation engine
 
Linked Open Data for the Cultural Sector
Linked Open Data for the Cultural SectorLinked Open Data for the Cultural Sector
Linked Open Data for the Cultural Sector
 
NoSQL databases, the CAP theorem, and the theory of relativity
NoSQL databases, the CAP theorem, and the theory of relativityNoSQL databases, the CAP theorem, and the theory of relativity
NoSQL databases, the CAP theorem, and the theory of relativity
 
Bitcoin - digital gold
Bitcoin - digital goldBitcoin - digital gold
Bitcoin - digital gold
 
Introduction to Big Data/Machine Learning
Introduction to Big Data/Machine LearningIntroduction to Big Data/Machine Learning
Introduction to Big Data/Machine Learning
 
Hops - the green gold
Hops - the green goldHops - the green gold
Hops - the green gold
 
Big data 101
Big data 101Big data 101
Big data 101
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
 
Hafslund SESAM - Semantic integration in practice
Hafslund SESAM - Semantic integration in practiceHafslund SESAM - Semantic integration in practice
Hafslund SESAM - Semantic integration in practice
 
Approximate string comparators
Approximate string comparatorsApproximate string comparators
Approximate string comparators
 

Kürzlich hochgeladen

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Kürzlich hochgeladen (20)

Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 

Linking data without common identifiers

  • 1. Linking data without common identifiers Semantic Web Meetup, 2011-08-23 Lars Marius Garshol, <larsga@bouvet.no> http://twitter.com/larsga
  • 2. About me Lars Marius Garshol, Bouvet consultant Worked with semantic technologies since 1999 mostly Topic Maps Before that I worked with XML Also background from Opera Software (Unicode support) open source development (Java/Python)
  • 3. Agenda Linking data – the problem Record linkage theory Duke A real-world example Usage at Hafslund
  • 4. The problem Data sets from different sources generally don’t share identifiers names tend to be spelled every which way So how can they be linked?
  • 6. A difficult problem It requires n2 comparisons for n records a million comparisons for 1000 records, 100 million for 10,000, ... Exact string comparison is not enough must handle misspellings, name variants, etc Interpreting the data can be difficult even for a human being is the address different because there are two different people, or because the person moved? ...
  • 7. Help from an unexpected quarter Statisticians to the rescue!
  • 8. Record linkage Statisticians very often must connect data sets from different sources They call it “record linkage” term coined in 19461) mathematical foundations laid in 19592) formalized in 1969 as “Fellegi-Sunter” model3) A whole subject area has been developed with well-known techniques, methods, and tools these can of course be applied outside of statistics http://ajph.aphapublications.org/cgi/reprint/36/12/1412 http://www.sciencemag.org/content/130/3381/954.citation http://www.jstor.org/pss/2286061
  • 9. Other terms for the same thing Has been independently invented many times Under many names entity resolution identity resolution merge/purge deduplication ... This makes Googling for information an absolute nightmare
  • 10. Application areas Statistics (obviously) Data cleaning Data integration Conversion Fraud detection / intelligence / surveillance
  • 12. Model, simplified We work with records, each of which has fields with values before doing record linkage we’ve processed the data so that they have the same set of fields (fields which exist in only one of the sources get discarded, as they are no help to us) We compare values for the same fields one by one, then estimate the probability that records represent the same entity given these values probability depends on which field and what values estimation can be very complex Finally, we can compute overall probability
  • 14. String comparisons Must handle spelling errors and name variants must estimate probability that values belong to same entity despite differences Examples John Smith ≈ Jonh Smith J. Random Hacker ≈ James Random Hacker ... Many well-known algorithms have been developed no one best choice, unfortunately some are eager to merge, others less so Many are computationally expensive O(n2) where n is length of string is common this gets very costly when number of record pairs to compare is already high
  • 15. Examples of algorithms Levenshtein edit distance ie: number of edits needed to change string 1 into 2 not so eager to merge Jaro-Winkler specifically for short person names very promiscuous Soundex essentially a phonetic hashing algorithm very fast, but also very crude Longest common subsequence / q-grams haven’t tried this yet Monge-Elkan one of the best, but computationally expensive TFIDF based on term frequency studies often find this performs best, but it’s expensive
  • 16. Existing record linkage tools Commercial tools big, sophisticated, and expensive have found little information on what they actually do presumably also effective Open source tools generally made by and for statisticians nice user interfaces and rich configurability architecture often not as flexible as it could be
  • 17. Standard algorithm n2comparisons for n records is unacceptable must reduce number of direct comparisons Solution produce a key from field values, sort records by the key, for each record, compare with n nearest neighbours sometimes several different keys are produced, to increase chances of finding matches Downsides requires coming up with a key difficult to apply incrementally sorting is expensive
  • 18. Good research papers Threat and Fraud Intelligence, Las Vegas Style, Jeff Jonas http://jeffjonas.typepad.com/IEEE.Identity.Resolution.pdf Real-world data is dirty: Data Cleansing and the Merge/Purge Problem, Hernandez & Stolfo http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.30.3496&rep=rep1&type=pdf Swoosh: a generic approach to entity resolution, Benjelloun, Garcia-Molina et al http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.122.5696&rep=rep1&type=pdf
  • 20. Context Doing a project for Hafslund where we integrate data from many sources Entities are duplicated both inside systems and across systems Suppliers Customers Customers Customers Companies CRM Billing ERP
  • 21. Requirements Must be flexible and configurable no way to know in advance exactly what data we will need to deduplicate Must scale to large data sets CRM alone has 1.4 million customer records that’s 2 trillion comparisons with naïve approach Must have an API project uses SDshare protocol everywhere must therefore be able to build connectors Must be able to work incrementally process data as it changes and update conclusions on the fly
  • 22. Reviewed existing tools... ...but didn’t find anything matching the criteria Therefore...
  • 23. Duke Java deduplication engine released as open source http://code.google.com/p/duke/ Does not use key approach instead indexes data with Lucene does Lucene searches to find potential matches Still a work in progress, but high performance (1M records in 10 minutes), several data sources and comparators, being used for real in real projects, flexible architecture
  • 24. How it works XML configuration file defines setup lists data sources, properties, probabilities, etc Data sources provide streams of records Steps: normalize values so they are ready for comparison collect a batch of records, then index it with Lucene compare records one by one against Lucene index do detailed comparisons against search results pairs with probability above configurable threshold are considered matches For now, only API and command-line
  • 25. Record comparisons in detail Comparator compares field values return number in range 1.0 – 0.0 probability for field is high probability if number is 1.0, or low probability if 0.0, otherwise in between the two Probability for entire record is computed using Bayes’s formula approach known as “naïve Bayes” in research literature
  • 26. Components Data sources CSV JDBC Sparql NTriples <plug in your own> Comparators ExactComparator NumericComparator SoundexComparator TokenizedComparator Levenshtein JaroWinkler Dice coefficient
  • 27. Features Fairly complete command-line tool with debugging support API for embedding the tool Two modes: deduplication (all records matched against each other) linking (only matching across groups of records) Pluggable data sources, comparators, and cleaners Framework for composing cleaners Incremental processing High performance
  • 28. A real-world example Matching Mondial and DBpedia
  • 29. Finding properties to match Need properties providing identity evidence Matching on the properties in bold below Extracted data to CSV for ease of use
  • 30. Configuration – data sources <group> <csv> <param name="input-file" value="dbpedia.csv"/> <param name="header-line" value="false"/> <column name="1" property="ID"/> <column name="2" cleaner="no.priv...CountryNameCleaner" property="NAME"/> <column name="3" property="AREA"/> <column name="4" cleaner="no.priv...CapitalCleaner" property="CAPITAL"/> </csv> </group> <group> <csv> <param name="input-file" value="mondial.csv"/> <column name="id" property="ID"/> <column name="country" cleaner="no.priv...examples.CountryNameCleaner" property="NAME"/> <column name="capital" cleaner="no.priv...LowerCaseNormalizeCleaner" property="CAPITAL"/> <column name="area" property="AREA"/> </csv> </group>
  • 31. Configuration – matching <schema> <threshold>0.65</threshold> <property type="id"> <name>ID</name> </property> <property> <name>NAME</name> <comparator>no.priv.garshol.duke.Levenshtein</comparator> <low>0.3</low> <high>0.88</high> </property> <property> <name>AREA</name> <comparator>AreaComparator</comparator> <low>0.2</low> <high>0.6</high> </property> <property> <name>CAPITAL</name> <comparator>no.priv.garshol.duke.Levenshtein</comparator> <low>0.4</low> <high>0.88</high> </property> </schema> Duke analyzes this setup and decides only NAME and CAPITAL need to be searched on in Lucene. <object class="no.priv.garshol.duke.NumericComparator" name="AreaComparator"> <param name="min-ratio" value="0.7"/> </object>
  • 32. Result Correct links found: 206 / 217 (94.9%) Wrong links found: 0 / 12 (0.0%) Unknown links found: 0 Percent of links correct 100.0%, wrong 0.0%, unknown 0.0% Records with no link: 25 Precision 100.0%, recall 94.9%, f-number 0.974
  • 35. An example of failure Duke doesn’t find this match no tokens matching exactly Lucene search finds nothing Detailed comparison gives correct result so only problem is Lucene search Lucene does have Levenshtein search, but in Lucene 3.x it’s very slow therefore not enabled now thinking of adding option to enable where needed Lucene 4.x will fix the performance problem
  • 36. The effects of value matching
  • 37. Duke in real life Usage at Hafslund
  • 38. The SESAM project Building a new archival system does automatic tagging of documents based on knowledge about the data knowledge extracted from backend systems Search engine-based frontend using Recommind for this Very flexible architecture extracted data stored in triple store (Virtuoso) all data transfer based on SDshare protocol data extracted from RDBMSs with Ontopia’s DB2TM
  • 39. The big picture DUPLICATES! 360 SDshare SDshare Virtuoso ERP Recommind SDshare SDshare CRM Billing Duke SDshare SDshare contains owl:sameAs and haf:possiblySameAs
  • 40. Experiences so far Incremental processing works fine links added and retracted as data changes Performance not an issue at all but then only ~50,000 records so far... Matching works well, but not perfect data are very noisy and messy also, matching is hard
  • 41. Duke roadmap 0.3 clean up the public API and document properly maybe some more comparators support for writing owl:sameAs to Sparql endpoint 0.4 add a web service interface 0.5 and onwards more comparators maybe some parallelism
  • 42. Slides will appear on http://slideshare.net/larsga Comments/questions?