Links between datasets are an essential ingredient of Linked Open Data. Because creating links manually is expensive at large scale, link sets are often created using heuristics, which may lead to errors. In this paper, we propose an unsupervised approach for finding erroneous links. We represent each link as a feature vector in a high-dimensional vector space, and find wrong links by means of different multi-dimensional outlier detection methods. We show how the approach can be implemented in the RapidMiner platform using only off-the-shelf components, and present a first evaluation with real-world datasets from the Linked Open Data cloud showing promising results, with an F-measure of up to 0.54 and an area under the ROC curve of up to 0.86.
Identifying Wrong Links between Datasets by Multi-dimensional Outlier Detection
1. 05/26/14 Heiko Paulheim 1
Identifying Wrong Links between Datasets
by Multi-dimensional Outlier Detection
Heiko Paulheim
Motivation
• Dataset interlinks can be wrong for many reasons
– Oversimplified heuristic generation (e.g., label equality)
– owl:sameAs abuse (a Starbucks coffee shop ↔ Starbucks Inc.)
– Concept drift of link targets
• e.g., dbpedia:Prong used to denote a band until DBpedia 3.1
• now it's a disambiguation page
<http://dbtune.org/bbc/peel/artist/1495> owl:sameAs <http://dbpedia.org/resource/Prong> .
Overall Idea
• Links between datasets follow certain patterns
– e.g., linking a mo:MusicArtist to a dbo:Artist,
and a mo:MusicalWork to a dbo:Album or a dbo:Song
• Wrong links violate those patterns
• Hence, outlier detection should find wrong links
– Definition: “finding patterns in data that do not conform to the expected
normal behavior” (Chandola et al., 2009)
• Differences from related approaches
– does not require the same schema used in both datasets
– nor schema mappings
– no external/human knowledge required
Projection of Links into Vector Space
• Represent each link as a point in an n-dimensional vector space
– e.g., using their direct types
• Outliers are found in sparse areas
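The idea that outliers sit in sparse areas can be illustrated with a minimal sketch. It uses a generic k-nearest-neighbour distance score, which is an assumption made for illustration and not one of the specific RapidMiner operators used in the experiments; the function name and the toy vectors are hypothetical.

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbours.

    A point in a sparse region has distant neighbours and thus a high score.
    """
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Toy binary feature vectors (hypothetical): most links share one pattern,
# the last link has a completely different feature profile.
links = [
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 0),
    (1, 0, 1, 1),
    (1, 0, 1, 0),
    (0, 1, 0, 1),
]
scores = knn_outlier_scores(links, k=2)
most_suspicious = max(range(len(links)), key=scores.__getitem__)
```

Links whose vectors coincide with many others get a score of zero here, while the link with the unusual profile gets the highest score and would be inspected first.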
Projection of Links into Vector Space
• Types
– each type of LHS and RHS resource becomes a binary (0/1) feature
– types on both sides are treated separately
• i.e., LHS_foaf:person and RHS_foaf:person
are distinct features
• Properties
– each ingoing/outgoing property of LHS and RHS resource
becomes a binary (0/1) feature
– properties on both sides are treated separately
– ingoing and outgoing properties are treated separately
• i.e., LHS_foaf:based_near, RHS_foaf:based_near,
foaf:based_near_LHS and foaf:based_near_RHS
are all distinct features
• Joint feature set of types and properties
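The feature construction described above might be sketched as follows. The dictionary layout, the function names, and the toy resources are hypothetical, and the sketch assumes that a `LHS_`/`RHS_` prefix marks outgoing properties while a `_LHS`/`_RHS` suffix marks ingoing ones; the slides list the four feature names but do not say which is which.

```python
def link_feature_names(lhs, rhs):
    """Return the names of all features that are 1 for a single link.

    lhs and rhs are dicts with the keys 'types', 'outgoing' and 'ingoing'.
    """
    names = set()
    for side, resource in (("LHS", lhs), ("RHS", rhs)):
        names.update(f"{side}_{t}" for t in resource["types"])
        # Assumption: prefix form = outgoing property, suffix form = ingoing.
        names.update(f"{side}_{p}" for p in resource["outgoing"])
        names.update(f"{p}_{side}" for p in resource["ingoing"])
    return names

def to_binary_matrix(links):
    """Turn a list of (lhs, rhs) pairs into column names and 0/1 rows."""
    active = [link_feature_names(lhs, rhs) for lhs, rhs in links]
    columns = sorted(set().union(*active))
    rows = [[1 if c in a else 0 for c in columns] for a in active]
    return columns, rows

# Toy input (hypothetical): one plausible link and one to a disambiguation page.
links = [
    ({"types": ["mo:MusicArtist"], "outgoing": ["foaf:name"], "ingoing": []},
     {"types": ["dbo:Artist"], "outgoing": ["rdfs:label"], "ingoing": ["dbo:artist"]}),
    ({"types": ["mo:MusicArtist"], "outgoing": ["foaf:name"], "ingoing": []},
     {"types": [], "outgoing": ["dbo:wikiPageDisambiguates"], "ingoing": []}),
]
columns, rows = to_binary_matrix(links)
```

Restricting the input to types only, or to properties only, yields the separate feature sets; passing both yields the joint set.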
Experiments
• Datasets: link sets between
– BBC Peel Sessions and DBpedia (2,087 links)
– DBTropes and DBpedia (4,229 links)
• Gold standard
– 100 randomly sampled links from each set, manually evaluated
– Peel: 90 out of 100 are correct
– Tropes: 76 out of 100 are correct
• We run outlier detection on the whole link set
– and validate the output only on the gold standard
Experiments
• Outlier Detection Approaches
– assign a score (or label) to each data point
– the higher the score, the more likely the point is an outlier
• Evaluation
– Ordering descending by outlier score
– Ideally, all outliers are above all non-outliers
– Plot a ROC curve to measure the quality
• i.e., AUC
– F-Measure
• with best possible threshold
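The two evaluation measures can be sketched in plain Python as follows. The scores and gold-standard labels are toy values, not results from the experiments, and the function names are hypothetical. AUC is computed here as the fraction of (wrong, correct) link pairs in which the wrong link receives the higher outlier score, which is equivalent to the area under the ROC curve.

```python
def roc_auc(scores, is_wrong):
    """Probability that a random wrong link outscores a random correct one.

    Ties between a wrong and a correct link count as half a win.
    """
    pos = [s for s, w in zip(scores, is_wrong) if w]
    neg = [s for s, w in zip(scores, is_wrong) if not w]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_f1(scores, is_wrong):
    """F-measure at the best possible threshold on the outlier score."""
    best = 0.0
    for threshold in sorted(set(scores)):
        flagged = [s >= threshold for s in scores]
        tp = sum(f and w for f, w in zip(flagged, is_wrong))
        fp = sum(f and not w for f, w in zip(flagged, is_wrong))
        fn = sum((not f) and w for f, w in zip(flagged, is_wrong))
        if tp:
            best = max(best, 2 * tp / (2 * tp + fp + fn))
    return best

# Toy gold standard (hypothetical): True marks a link judged wrong by hand.
scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
is_wrong = [True, False, True, False, False, False]

auc = roc_auc(scores, is_wrong)
f1 = best_f1(scores, is_wrong)
```

Because the threshold is chosen with knowledge of the gold standard, the F-measure obtained this way is an upper bound on what a fixed, pre-selected threshold would achieve.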
Results
• Type features work better than property features
• LoOP delivers reliably good results
– though not the best
• Best performance on Peel dataset
– CBLOF (F1 = 0.537), 1-class SVM (AUC = 0.857)
• Best performance on DBTropes dataset
– LOF (F1 = 0.5, AUC = 0.619)
Results
• ROC curves for Peel dataset
[Figure: ROC curves, both axes from 0 to 1, for GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, and 1-class SVM]
Note: GAS k=10,25,50 identical, LoOP k=25,50 identical
Results
• ROC curves for DBTropes dataset
[Figure: ROC curves, both axes from 0 to 1, for GAS (k=10, 25, 50), LOF, LoOP (k=10, 25, 50), CBLOF, LDCOF, and 1-class SVM]
Note: GAS k=25,50 mostly identical; LoOP k=25,50 identical,
CBLOF and LDCOF mostly identical
Runtimes
• Most outlier detection algorithms are reasonably fast
– both linksets processed in less than 10 seconds on a normal laptop
• Exceptions:
– clustering (for CBLOF/LDCOF) takes up to 30 seconds
– 1-class SVM takes up to 15 minutes
• ...but creating the feature vector representation
takes much more time
– some hours against public SPARQL endpoint(s)
– reasonably fast with downloaded dumps
Discussion of Results
• Results on Peel dataset better than on DBTropes dataset
• Projection based on types better than on properties
– most likely due to lower dimensionality of the vector space
• Peel: #types = 34, #properties = 60
• DBTropes: #types = 81, #properties = 142
• Variation of outlier detection algorithms across datasets
– also observed in other experiments
– general rules of thumb are hard to come up with
Possible Improvements & Future Work
• Other projection methods
– e.g., using numeric counts of relations
• Other outlier detection algorithms
– e.g., Replicating Neural Networks and their generalizations
• Preprocessing
– e.g., Feature Subset Selection
– caveat: the valuable features are often sparse
Possible Improvements & Future Work
• So far, we have looked at owl:sameAs links
• The approach is not limited to that
– should work for other link predicates as well
– e.g., a dataset of persons and a dataset of places
– linked by foaf:based_near
• It is not even limited to linksets
– also for debugging statements inside a knowledge base
– e.g., dbpedia-owl:deathPlace