Introduction   Mapping method: classification based on instance similarity   Experiments and results   Summary




               Learning Concept Mappings from Instance
                              Similarity

          Shenghui Wang1                   Gwenn Englebienne2               Stefan Schlobach1
                                      1    Vrije Universiteit Amsterdam
                                       2   Universiteit van Amsterdam


                                                 ISWC 2008




Outline

       1       Introduction
                  Thesaurus mapping
                  Instance-based techniques

       2       Mapping method: classification based on instance similarity
                Representing concepts and the similarity between them
                Classification based on instance similarity

       3       Experiments and results
                 Experiment setup
                 Results

       4       Summary
Thesaurus mapping

               SemanTic Interoperability To access Cultural Heritage
               (STITCH) through mappings between thesauri
                        e.g. “plankzeilen” vs. “surfsport”
                        e.g. “griep” vs. “influenza”
               Scope of the problem:
                       Big thesauri with tens of thousands of concepts
                       Huge collections (e.g., KB: 80km of books in one collection)
                       Heterogeneous (e.g., books, manuscripts, illustrations, etc.)
                       Multi-lingual problem
Instance-based techniques: common instance based
Pros and cons




               Advantages
                       Simple to implement
                       Interesting results
               Disadvantages
                        Requires a sufficient number of common instances
                       Only uses part of the available information
Instance-based techniques: Instance similarity based
Representing concepts and the similarity between them

           [Diagram: each concept's instances are described by metadata
           fields (Creator, Title, Publisher, ...). Per field, the text of
           all of a concept's instances is aggregated into a bag of words
           with term counts (instance features → concept features). For a
           pair of concepts, the cosine distance between corresponding
           field vectors (Creator vs. Creator, Title vs. Title, ...) yields
           one similarity value per field pair: the pair features
           f1, f2, f3, ...]
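The feature construction in the diagram can be sketched as follows. This is a minimal sketch: the field names, the whitespace tokenisation, and the use of cosine similarity over raw term counts are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from collections import Counter

def bag_of_words(texts):
    """Aggregate the text of all instances of one concept, for one
    metadata field, into term counts (the 'concept features')."""
    counts = Counter()
    for t in texts:
        counts.update(t.lower().split())
    return counts

def cosine_similarity(a, b):
    """Cosine similarity between two bags of words."""
    vocab = sorted(set(a) | set(b))
    va = np.array([a.get(w, 0) for w in vocab], dtype=float)
    vb = np.array([b.get(w, 0) for w in vocab], dtype=float)
    denom = np.linalg.norm(va) * np.linalg.norm(vb)
    return float(va @ vb / denom) if denom else 0.0

def pair_features(concept1, concept2, fields=("creator", "title", "publisher")):
    """One similarity value per corresponding field pair: f1, f2, ..."""
    return [cosine_similarity(bag_of_words(concept1[f]), bag_of_words(concept2[f]))
            for f in fields]
```

Each concept here is a dict mapping a field name to the list of instance texts for that field; cosine distance, as in the diagram, is simply one minus this similarity.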
Classification based on instance similarity

               Each pair of concepts is treated as a point in a “similarity
               space”:
                        its position is defined by the features of the pair;
                        the features of the pair are the different measures of
                        similarity between the concepts’ instances.
               Hypothesis: the label of a point (whether the pair is a
               positive or a negative mapping) is correlated with the
               position of the point in this space.
               Given a set of already labelled points, a new point can then
               be classified, i.e., assigned the right label, based on its
               position as given by its actual similarity values.
The classifier used: Markov Random Field


               Let T = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N} be the training set
                        x^{(i)} \in \mathbb{R}^K, the features
                        y^{(i)} \in Y = \{positive, negative\}, the label
               The conditional probability of a label given the input is
               modelled as

                   p(y^{(i)} \mid x^{(i)}, \theta)
                       = \frac{1}{Z(x^{(i)}, \theta)}
                         \exp\Big(\sum_{j=1}^{K} \lambda_j \phi_j(y^{(i)}, x^{(i)})\Big),   (1)

               where \theta = \{\lambda_j\}_{j=1}^{K} are the weights
               associated with the feature functions \phi_j and
               Z(x^{(i)}, \theta) is a normalisation constant.
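Equation (1) can be evaluated directly. The sketch below assumes binary labels encoded as y in {+1, -1} and feature functions phi_j(y, x) = y * x_j; the slides do not specify phi, so this parameterisation is an assumption.

```python
import numpy as np

def conditional_prob(x, lambdas):
    """P(y | x, theta) for the log-linear model of Eq. (1).

    Assumption: phi_j(y, x) = y * x_j with y in {+1, -1}
    (positive / negative mapping)."""
    scores = {y: np.exp(y * (lambdas @ x)) for y in (+1, -1)}
    z = sum(scores.values())                 # normalisation constant Z(x, theta)
    return {y: s / z for y, s in scores.items()}
```

Under this choice the two label probabilities always sum to one, and the label with the larger weighted feature sum gets the higher probability.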
The classifier used: Markov Random Field (cont.)

               The likelihood of the data set for given model parameters
               \theta is given by:

                   p(T \mid \theta) = \prod_{i=1}^{N} p(y^{(i)} \mid x^{(i)}, \theta)   (2)

               During learning, our objective is to find the most likely
               values of \theta for the given training data.
               The decision criterion for assigning a label y^{(i)} to a new
               pair of concepts i is then simply given by:

                   y^{(i)} = \operatorname{argmax}_{y} p(y \mid x^{(i)})   (3)
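Learning (Eq. 2) and prediction (Eq. 3) can be sketched as follows, under the same assumed feature functions phi_j(y, x) = y * x_j. The slides do not fix a training algorithm; plain gradient ascent on the log-likelihood is used here purely for illustration.

```python
import numpy as np

def train(X, y, steps=2000, lr=0.1):
    """Maximum-likelihood fit of the weights (Eq. 2) by gradient ascent.

    Assumption: phi_j(y, x) = y * x_j with y in {+1, -1}, under which
    p(y | x) = sigmoid(2 * y * (lam @ x)) and the model reduces to
    logistic regression."""
    lam = np.zeros(X.shape[1])
    for _ in range(steps):
        p_wrong = 1.0 / (1.0 + np.exp(2.0 * y * (X @ lam)))   # P(-y | x, lam)
        grad = (2.0 * X * (y * p_wrong)[:, None]).mean(axis=0)  # d mean log-lik / d lam
        lam += lr * grad
    return lam

def predict(X, lam):
    """Decision rule (Eq. 3): argmax over y of p(y | x)."""
    return np.where(X @ lam >= 0.0, 1, -1)
```

On linearly separable similarity features this recovers the labels exactly; in practice a regulariser would be added to keep the weights bounded.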
Experiment setup


               Two cases:
                      mapping GTT and Brinkman used in Koninklijke Bibliotheek
                      (KB) — Homogeneous collections
                      mapping GTT/Brinkman and GTAA used in Beeld en Geluid
                      (BG) — Heterogeneous collections
               Evaluation
                       Measure: misclassification (error) rate
                       10-fold cross-validation
                       Testing on special data sets
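The 10-fold cross-validated error rate can be computed as sketched below; the fold-splitting details (shuffling, equal-sized folds) are assumptions, since the slides only name the measure and k = 10.

```python
import numpy as np

def cv_error_rate(X, y, fit, predict, k=10, seed=0):
    """k-fold cross-validated misclassification (error) rate.

    fit/predict are any classifier callables; returns the mean error
    over folds plus its standard deviation, matching the
    'mean +/- std' figures reported in the result tables."""
    idx = np.random.default_rng(seed).permutation(len(y))
    fold_errors = []
    for fold in np.array_split(idx, k):
        train_idx = np.setdiff1d(idx, fold)
        model = fit(X[train_idx], y[train_idx])
        fold_errors.append(np.mean(predict(X[fold], model) != y[fold]))
    return float(np.mean(fold_errors)), float(np.std(fold_errors))
```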
Results

Experiment I: Feature-similarity based mapping versus
existing methods

          Does feature similarity of instances yield significant benefits in
          extensional mapping compared to existing methods?

                            Mapping method                                    Error rate
                                  Falcon                                       0.28895
                                   Slex                                  0.42620 ± 0.049685
                                  Sjacc80                                0.44643 ± 0.059524
                                   Sbag                                  0.57380 ± 0.049685
               {f1 , . . . f28 } (our new approach)                     0.20491 ± 0.026158

          Table: Comparison between existing methods and similarity-based
          mapping in the KB case
Experiment II: Extending to corpora without joint instances

          Can our approach be applied to corpora for which there are no
          doubly annotated instances, i.e., for which there are no joint
          instances?
                   Collections                         Testing set                  Error rate
                 Joint instances                     golden standard           0.20491 ± 0.026158
              (original KB corpus)                     lexical only                 0.137871
               No joint instances                    golden standard           0.28378 ± 0.026265
           (double instances removed)                  lexical only                 0.161867

          Table: Comparison between classifiers using joint and disjoint
          instances in the KB case
Experiment III: Extending to heterogeneous collections



          Can our approach be applied to corpora in which instances are
          described in a heterogeneous way?
               Feature selection
                    exhaustive combination by calculating the similarity between
                    all possible pairs of fields
                             requires more training data to avoid over-fitting
                    manual selection of corresponding metadata field pairs
                    mutual information to select the most informative field pairs
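The mutual-information option above can be sketched as follows: score each candidate field-pair similarity feature by I(feature; label) and keep the highest-scoring pairs. The discretisation of similarity values into histogram bins is an assumption, as the slides do not detail the estimator.

```python
import numpy as np

def mutual_information(similarities, labels, bins=10):
    """I(feature; label) for one candidate field-pair feature.

    `similarities` holds that feature's value for each training pair,
    `labels` the pair's mapping label; continuous similarities are
    discretised into `bins` histogram bins (an assumption)."""
    edges = np.histogram_bin_edges(similarities, bins)
    f = np.digitize(similarities, edges)
    mi = 0.0
    for fv in np.unique(f):
        for lv in np.unique(labels):
            p_joint = np.mean((f == fv) & (labels == lv))
            if p_joint > 0:
                p_f, p_l = np.mean(f == fv), np.mean(labels == lv)
                mi += p_joint * np.log(p_joint / (p_f * p_l))
    return mi
```

A feature that varies with the label scores higher than one that carries no label information, so sorting field pairs by this score selects the most informative ones.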
Feature selection
          Can we maintain high mapping quality when features are selected
          (semi)-automatically?


                 Thesaurus                     Feature selection                     Error rate
                                               manual selection                 0.11290 ± 0.025217
           GTAA vs. Brinkman                  mutual information               0.09355 ± 0.044204
                                                  exhaustive                    0.10323 ± 0.031533
                                               manual selection                 0.10000 ± 0.050413
               GTAA vs. GTT                   mutual information               0.07826 ± 0.044904
                                                  exhaustive                    0.11304 ± 0.046738


          Table: Comparison of performance with different methods of feature
          selection, using the non-lexical dataset
Training set


               manually built golden standard (751)
               lexical seeding
               background seeding

                             Thesauri                          lexical         non-lexical
                           GTAA vs. GTT                         2720              116
                         GTAA vs. Brinkman                      1372              323

                  Table: Numbers of positive examples in the training sets
Bias in the training sets

               Thesauri       Training set              Test set                    Error rate
                              non-lexical              non-lexical             0.09355 ± 0.044204
               GTAA vs.          lexical               non-lexical                   0.11501
               Brinkman       non-lexical                lexical                     0.07124
                                 lexical                 lexical               0.04871 ± 0.029911
                              non-lexical              non-lexical             0.07826 ± 0.044904
               GTAA vs.          lexical               non-lexical                  0.098712
                 GTT          non-lexical                lexical                    0.088603
                                 lexical                 lexical               0.06195 ± 0.008038

          Table: Comparison using different datasets (feature selected using mutual
          information)
Positive-negative ratios in the training sets


          [Two plots of error rate, F-measure, Precision_pos and Recall_pos
          (y-axis: Measures, 0–1) against the ratio between positive and
          negative examples in the training set (x-axis: 0–1000):
          (a) testing on 1:1 data, (b) testing on 1:1000 data.]
          Figure: The influence of positive-negative ratios — Brinkman vs. GTAA
Positive-negative ratios in the training sets (cont.)




          In practice, the training data should be chosen so as to contain a
          representative ratio of positive and negative examples, while still
          providing enough material for the classifier to have good
          predictive capacity on both types of examples.
Experiment IV: Qualitative use of feature weights
          The learned weight λj reflects the importance of feature fj in
          determining the similarity (mappings) between concepts.

                        KB fields                           BG fields
                        kb:title                           bg:subject
                        kb:abstract                        bg:subject
                        kb:annotation                      bg:LOCATIES
                        kb:annotation                      bg:SUBSIDIE
                        kb:creator                         bg:contributor
                        kb:creator                         bg:PERSOONSNAMEN
                        kb:Date                            bg:OPNAMEDATUM
                        kb:dateCopyrighted                 bg:date
                        kb:description                     bg:subject
                        kb:publisher                       bg:NAMEN
                        kb:temporal                        bg:date
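The qualitative use of the weights, ranking field pairs such as those in the table by the magnitude of their learned λj, can be sketched as:

```python
import numpy as np

def top_field_pairs(lambdas, field_pairs, k=5):
    """Return the k field pairs with the largest |lambda_j|.

    `field_pairs` is a parallel list of (KB field, BG field) names;
    the example values below are illustrative, not learned weights."""
    order = np.argsort(-np.abs(np.asarray(lambdas, dtype=float)))
    return [field_pairs[i] for i in order[:k]]
```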
Summary

               We use a machine learning method that exploits the similarity
               between instances to automatically determine mappings between
               concepts from different thesauri/ontologies.
                    Enables mappings between thesauri used for very
                    heterogeneous collections
                    Does not require doubly annotated instances
                    Is not limited by the language barrier
                    A contribution to the field of metadata mapping

               In the future
                    More heterogeneous collections
                    Smarter measures of similarity between instance metadata
                    More similarity dimensions between concepts, e.g., lexical,
                    structural
       Thank you
