A Comparison of Supervised Learning Classifiers for Link Discovery
Tommaso Soru and Axel-Cyrille Ngonga Ngomo
Agile Knowledge Engineering and Semantic Web (AKSW), Department of Computer Science, University of Leipzig
SEMANTiCS 2014 | The 10th International Conference on Semantic Systems

Slides for the paper "A Comparison of Supervised Learning Classifiers for Link Discovery" by Tommaso Soru and Axel-Cyrille Ngonga Ngomo (AKSW, University of Leipzig), presented on September 4, 2014 at the 10th International Conference on Semantic Systems (SEMANTiCS) in Leipzig, Germany.

Published in: Science

A Comparison of Supervised Learning Classifiers for Link Discovery

A Comparison of Supervised Learning Classifiers for Link Discovery

Tommaso Soru and Axel-Cyrille Ngonga Ngomo
Agile Knowledge Engineering and Semantic Web
Department of Computer Science, University of Leipzig
Augustusplatz 10, 04109 Leipzig
{tsoru,ngonga}@informatik.uni-leipzig.de
http://aksw.org

September 4, 2014

Introduction/1
The 4th Linked Data Web Principle: "Include links to other URIs, so that they can discover more things." (Tim Berners-Lee)
31B triples in 2011, of which only 3% link different datasets
71B triples expected in 2014

Introduction/2
Link Discovery
What? Discover new links among resources.
How? Using supervised and unsupervised methods.
Why? Links are important for data integration, question answering, knowledge extraction.
We will focus on supervised machine-learning algorithms.

Preliminaries
Link Discovery. Given two datasets S and T, the general aim of link discovery is to find the set of resource pairs (s, t) ∈ S × T such that R(s, t) holds, where R is a given relation such as owl:sameAs or dbp:near.
Link Specification. A link specification is a rule composed of a complex similarity function sim and a threshold θ that defines which pairs (s, t) should be linked together:
sim(s, t) ≥ θ
Main problems
1. Naïve approaches demand quadratic time complexity.
2. Efficient algorithms ⇏ accurate link specifications.

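To make the link specification above concrete, here is a minimal Java sketch, assuming an unweighted trigram similarity as the measure sim; the class name, the similarity variant, and the example threshold are illustrative and not taken from the evaluated implementation.

import java.util.HashSet;
import java.util.Set;

/** Minimal sketch of a link specification: link (s, t) iff sim(s, t) >= theta.
 *  The trigram similarity below is an illustrative, unweighted variant;
 *  the evaluated system uses weighted measures (see Evaluation Setup/1). */
public class LinkSpecificationSketch {

    /** Character trigrams of a string (no padding, for brevity). */
    static Set<String> trigrams(String s) {
        Set<String> grams = new HashSet<>();
        String x = s.toLowerCase();
        for (int i = 0; i + 3 <= x.length(); i++) {
            grams.add(x.substring(i, i + 3));
        }
        return grams;
    }

    /** Unweighted trigram (Jaccard) similarity in [0, 1]. */
    static double trigramSimilarity(String s, String t) {
        Set<String> a = trigrams(s), b = trigrams(t);
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> inter = new HashSet<>(a);
        inter.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) inter.size() / union.size();
    }

    /** The link specification sim(s, t) >= theta. */
    static boolean link(String s, String t, double theta) {
        return trigramSimilarity(s, t) >= theta;
    }

    public static void main(String[] args) {
        // Hypothetical resource labels and threshold.
        System.out.println(link("Jack Smith", "Jack A. Smith", 0.5));
    }
}
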
Motivation
We want to answer these questions.
Q1: Which of the paradigms achieves the best F-measures?
Q2: Which of the paradigms is most robust against noise?
Q3: Which of the methods is the most time-efficient?

Overview/1
Evaluation pipeline
Alignment between properties is carried out manually.
Perfect mapping (i.e., labels): (s, t) is a positive example iff R(s, t) holds.

Overview/2
Assumptions
The complex similarity function sim compares property values. In case of
- datatype properties: it uses text/numerical/date similarities;
- object properties: it applies the similarities iteratively.
Graph structure has not been considered as a feature per se.
Cross-validation has been preferred over semi-supervised learning because it yields more accurate results.

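Under these assumptions, each candidate pair is reduced to a vector of property-wise similarity scores that the classifiers consume, labeled positive or negative according to the perfect mapping. The sketch below shows one plausible way to assemble such an instance; the AlignedProperty interface and its methods are hypothetical names, not the paper's API.

import java.util.Arrays;
import java.util.List;

/** Sketch: turn a candidate pair into a similarity-feature vector.
 *  The property alignment (which source property maps to which target
 *  property) is assumed to be given manually, as stated on the slide. */
public class FeatureVectorSketch {

    /** One manually aligned property pair, e.g. (rdfs:label, rdfs:label). */
    interface AlignedProperty {
        String sourceValue(String s);           // property value on resource s
        String targetValue(String t);           // property value on resource t
        double similarity(String a, String b);  // text/numerical/date similarity
    }

    /** One feature per aligned property; the label (not shown) is positive
     *  iff R(s, t) holds in the perfect mapping. */
    static double[] features(String s, String t, List<AlignedProperty> alignment) {
        double[] x = new double[alignment.size()];
        for (int i = 0; i < alignment.size(); i++) {
            AlignedProperty p = alignment.get(i);
            x[i] = p.similarity(p.sourceValue(s), p.targetValue(t));
        }
        return x;
    }

    public static void main(String[] args) {
        // Hypothetical single aligned property: compare resource names directly.
        AlignedProperty name = new AlignedProperty() {
            public String sourceValue(String s) { return s; }
            public String targetValue(String t) { return t; }
            public double similarity(String a, String b) {
                return a.equalsIgnoreCase(b) ? 1.0 : 0.0; // placeholder measure
            }
        };
        double[] x = features("Jack Smith", "jack smith", List.of(name));
        System.out.println(Arrays.toString(x));
    }
}
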
Evaluation Setup/1
Similarities for string values:
- Weighted trigram similarity, setting tf-idf scores as weights
- Weighted edit distance, setting confusion matrices as weights
Cosine similarity for numerical values.
Logarithmic similarity for date values: a day-based date similarity.

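The slide gives no formulas, so as an illustration only, the sketch below shows one plausible day-based, logarithmically decaying date similarity; the exact measures used in the evaluation (and their tf-idf and confusion-matrix weighting) may differ.

import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

/** Sketch of a day-based, logarithmically decaying date similarity.
 *  This is one plausible functional form only; the exact measure used in
 *  the evaluation is not specified on the slide and may differ. */
public class DateSimilaritySketch {

    /** Similarity in (0, 1]: 1 for identical dates, decaying with the
     *  logarithm of the difference in days. */
    static double dateSimilarity(LocalDate a, LocalDate b) {
        long days = Math.abs(ChronoUnit.DAYS.between(a, b));
        return 1.0 / (1.0 + Math.log1p(days));
    }

    public static void main(String[] args) {
        System.out.println(dateSimilarity(
                LocalDate.of(2014, 9, 4), LocalDate.of(2014, 9, 10)));
    }
}
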
Evaluation Setup/2
Linear non-probabilistic classifiers
- Linear SVM*
- Polynomial SVM*
- Linear SVM with Sequential Minimal Optimization
- Linear Regression
Probabilistic classifiers
- Logistic Regression
- Naïve Bayes
- Random Tree
- J48
Neural networks
- Multilayer Perceptron
Rule-based classifiers
- Decision Table
We used classifiers from the Weka library, except (*) from LibSVM.

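With Weka supplying the classifiers, an evaluation run could look like the following sketch: load the similarity features, cross-validate one classifier, and read off the F-measure of the positive class. The ARFF file name, the choice of 10 folds, and the positive-class index are assumptions for illustration, not details from the slides.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

/** Sketch: cross-validate one of the compared classifiers with Weka.
 *  "links.arff" (similarity features + link/no-link class) and the use of
 *  10 folds are illustrative assumptions. */
public class WekaEvaluationSketch {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("links.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // last attribute = class

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(mlp, data, 10, new Random(1));

        // Assuming class index 1 is the positive ("link") class.
        System.out.printf("F-measure (positive class): %.4f%n", eval.fMeasure(1));
        System.out.println(eval.toSummaryString());
    }
}
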
Evaluation Setup/3
Datasets
- D1-D3: synthetic datasets from the Ontology Alignment Evaluation Initiative (OAEI) 2010 Benchmark
- D4-D6: real datasets from the Benchmark for Entity Resolution, DBS Leipzig
- D5-D6: datasets having a high level of noise

#   Dataset                Domain         Size
D1  OAEI-Persons1          personal data  250k
D2  OAEI-Persons2          personal data  240k
D3  OAEI-Restaurants       places         72k
D4  DBLP-ACM               bibliographic  6M
D5  Amazon-GoogleProducts  e-commerce     10M
D6  ABT-Buy                e-commerce     1M

Results/1
F-measure

Classifier             D1       D2       D3       D4      D5      D6
Linear SVM             99.40%   98.99%   97.75%   97.81%  27.06%  39.18%
Linear SMO             100.00%  98.73%   100.00%  92.58%  46.63%  31.39%
Polynomial-3 SVM       99.40%   93.76%   98.29%   97.67%  37.28%  31.69%
Multilayer Perceptron  99.50%   99.50%   100.00%  97.43%  35.58%  43.49%
Logistic Regression    99.90%   98.12%   96.67%   97.71%  40.64%  41.92%
Linear Regression      99.30%   96.92%   100.00%  96.36%  37.06%  36.84%
Naïve Bayes            97.75%   35.05%   95.19%   29.47%  2.92%   11.90%
Decision Table         97.98%   100.00%  100.00%  97.66%  42.44%  29.66%
Random Tree            97.45%   99.24%   89.89%   96.82%  39.38%  41.03%
J48                    99.50%   95.56%   98.29%   97.66%  44.28%  31.53%
State of the Art       100.00%  100.00%  100.00%  98.20%  62.10%  71.30%

F-measure calculated on the class of positive examples.

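For reference, the reported F-measure is assumed to be the usual balanced F1 on the positive (link) class:

F1 = 2 · P · R / (P + R)

where P (precision) is the fraction of predicted links that are correct and R (recall) is the fraction of true links that are found.
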
Results/2
Computation runtimes

Classifier             D1     D2     D3    D4      D5        D6
Linear SVM             7.16   6.93   2.67  63.94   484.29    75.44
Linear SMO             17.07  12.93  3.77  113.40  369.20    37.16
Polynomial-3 SVM       5.67   6.18   2.63  162.82  1,091.10  103.89
Multilayer Perceptron  15.13  16.10  3.40  96.96   376.26    41.68
Logistic Regression    16.11  14.91  4.61  110.12  275.94    38.48
Linear Regression      16.04  16.21  5.02  120.54  497.43    44.50
Naïve Bayes            17.34  17.09  4.39  105.31  375.91    43.79
Decision Table         16.68  16.44  3.78  90.99   389.35    48.87
Random Tree            12.02  11.16  2.24  53.67   347.36    34.11
J48                    21.31  15.96  6.99  131.57  98.27     38.46

All values in seconds.

Results/3
Considerations
- Some average trends can be suggested, yet no algorithm outperforms all others significantly.
- Multilayer Perceptrons performed best both including and excluding the noisy datasets.
- Random Trees seem the fastest approach overall.
- The different approaches seem complementary in their behaviour.
- Naïve Bayes might fail as it considers all features as independent from each other.

Results/4
Answers
Q1: Which of the paradigms achieves the best F-measures?
A1: Multilayer Perceptrons, Linear SVMs, Decision Tables.
Q2: Which of the paradigms is most robust against noise?
A2: Logistic Regression, Random Trees, Multilayer Perceptrons.
Q3: Which of the methods is the most time-efficient?
A3: Random Trees; however, all approaches scale well.

Related Work
- Time-efficient deduplication algorithms (PPJoin+, EDJoin, PassJoin, TrieJoin)
- LIMES, a Link Discovery Framework for Metric Spaces
- Approaches for learning link specifications (HYPPO, HR3, EAGLE, ACIDS)
- Dedicated efficient methods (RDF-AI, REEDED)
- LinkLion, a Link Repository for the Web of Data
- The SAIM interface
- Other link discovery frameworks (SILK, LDIF)
- Other machine learning frameworks (MARLIN, FEBRL, RAVEN)
- Other blocking techniques (MultiBlock, KnoFuss)

Future Work
1. Integration of Multilayer Perceptrons into the LIMES framework.
2. Use of ensemble learning techniques.
3. Evaluation in a semi-supervised learning setting with little training data.
4. Evaluation using a larger number of similarity measures.
5. Incorporation of a component based on Statistical Relational Learning.

Web resources
- Source code, Batch Learners Evaluation for Link Discovery: http://github.com/mommi84/BALLAD
- Technical report, Batch Learners Evaluation for Link Discovery: http://mommi84.github.io/BALLAD
- The OAEI 2010 Benchmark: http://oaei.ontologymatching.org/2010/benchmarks
- The Benchmark for Entity Resolution, DBS Leipzig: http://goo.gl/bvWBjA
- Weka, Data Mining Software in Java: http://www.cs.waikato.ac.nz/ml/weka
- LibSVM, A Library for Support Vector Machines: http://www.csie.ntu.edu.tw/~cjlin/libsvm
- LIMES, Link Discovery Framework for Metric Spaces: http://aksw.org/Projects/LIMES
- LinkLion, A Link Repository for the Web of Data: http://www.linklion.org

Thank you for your attention.

T. Soru, A. Ngonga Ngomo, September 4, 2014. A Comparison of Supervised Learning Classifiers for Link Discovery.
