Università degli studi di Bari “Aldo Moro”
                                  Dipartimento di Informatica




                        A Domain Based Approach
                to Information Retrieval in Digital Libraries
                                   F. Rotella, S. Ferilli, F. Leuzzi
L.A.C.A.M.                 ferilli@di.uniba.it, {fabio.leuzzi, rotella.fulvio}@gmail.com
http://lacam.di.uniba.it:8000

                              8th Italian Research Conference on Digital Libraries
                                         Bari, Italy, February 9-10, 2012
Overview
          ●    Introduction & Objectives
          ●    Keyword Extraction
          ●    Word Sense Disambiguation
          ●    Synset Clustering
          ●    A Multistrategy Similarity Measure
          ●    Document Partitioning
          ●    User Query Processing
          ●    A Preliminary Evaluation
          ●    Conclusions & Future Works


A Domain Based Approach to Information Retrieval in Digital Libraries - F. Rotella, S. Ferilli, F. Leuzzi   2
Introduction

      Some repositories leave the responsibility of quality to the authors.


                                                      +
                    Anybody can produce and distribute documents.


                                                      =
               Possible low average quality of the repository contents.




     Users are often overwhelmed by documents that only appear
                      suitable for satisfying their information needs.


Introduction
          ●       Possible way out: Information Retrieval systems
          ●       Numerical/statistical manipulation of (key)words has
                  been widely explored in the literature
              ●    Still unable to fully solve the problem
          ●       Achieving better retrieval performance requires going
                  beyond a simple lexical interpretation of the user
                  queries
              ●    This passes through an understanding of their semantic
                   content and aims
          ●       Ontological taxonomy
              ●    WordNet
              ●    WordNet Domains
Objectives

                                    Improving access to and use of a Digital Library (DL)


          ●       Use of advanced techniques for document retrieval
          ●       Try to overcome the ambiguity of natural language
          ●       Inspired by the typical behavior of humans:
              ●    take into account the possible meanings of words
              ●    select the most appropriate one according to the
                   context of the discourse




Keyword Extraction
    ●       Each document in the digital library is progressively split into paragraphs,
            sentences, and single words
        ●       Integrated in the DOMINUS framework
    ●       The syntactic structure of sentences and the lemmas are then obtained
        ●       Using the Stanford Parser
    ●       Classical VSM
        ●       TF*IDF weighting
    ●       Two filters:
        ●       Only nouns considered
            ●    The representation of adverbs, verbs and adjectives in WordNet is
                 different
        ●       Only the top 10% keywords for each document
            ●    To be noise-tolerant
            ●    To limit the possibility of including non-discriminative and very general
                 words in the representation of a document
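
The two filters above can be sketched as follows. This is a minimal illustration, not the DOMINUS implementation: it assumes the nouns have already been extracted and lemmatized, and it assumes one common TF*IDF variant (term frequency normalized by document length, natural-log IDF), which the slides do not specify. The name `top_keywords` and the toy data are hypothetical.

```python
import math
from collections import Counter

def top_keywords(docs, fraction=0.10):
    """docs maps a document id to its list of noun lemmas.

    Returns, for each document, the top `fraction` of its distinct nouns
    by TF*IDF, as (noun, weight) pairs sorted by descending weight.
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each noun occurs.
    df = Counter()
    for nouns in docs.values():
        df.update(set(nouns))

    result = {}
    for doc_id, nouns in docs.items():
        tf = Counter(nouns)
        # Assumed weighting: (count / doc length) * ln(N / df).
        weights = {
            noun: (count / len(nouns)) * math.log(n_docs / df[noun])
            for noun, count in tf.items()
        }
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        # Keep the top fraction, but always at least one keyword.
        keep = max(1, math.ceil(len(ranked) * fraction))
        result[doc_id] = ranked[:keep]
    return result

# Hypothetical toy corpus (nouns only).
toy_docs = {
    "a": ["melody", "melody", "melody", "music"],
    "b": ["music", "doctrine"],
}
keywords = top_keywords(toy_docs)
```

Here "music" occurs in every document, so its IDF is zero and it is filtered out, which is the kind of non-discriminative word the 10% cut is meant to exclude.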
Word Sense Disambiguation
                     Domain Driven
          One Domain per Discourse assumption: many uses of a word
          in a coherent portion of text tend to share the same domain.
          ●    Prevalent domain identification
          ●    Extraction of all synsets for each term
          ●    Extraction of all domains for each synset
          ●    Choice of the prevalent domain synset
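
One possible reading of this pipeline is sketched below. The sense inventory and domain labels are hypothetical toy data standing in for WordNet and WordNet Domains; counting one vote per synset-domain pair and falling back to the first listed synset are assumptions, since the slides do not give these details.

```python
from collections import Counter

# Hypothetical toy inventory: term -> candidate synsets, synset -> domains.
SYNSETS = {
    "mass": ["mass.physics", "mass.music"],
    "opera": ["opera.music"],
    "melody": ["melody.music"],
}
DOMAINS = {
    "mass.physics": ["physics"],
    "mass.music": ["music", "religion"],
    "opera.music": ["music"],
    "melody.music": ["music"],
}

def prevalent_domain(terms):
    """Return the domain most often attached to the candidate synsets."""
    counts = Counter()
    for term in terms:
        for synset in SYNSETS.get(term, []):
            counts.update(DOMAINS.get(synset, []))
    if not counts:
        return None
    # Highest count wins; ties are broken alphabetically for determinism.
    return min(counts, key=lambda d: (-counts[d], d))

def disambiguate(terms):
    """Pick, for each term, a synset belonging to the prevalent domain."""
    domain = prevalent_domain(terms)
    chosen = {}
    for term in terms:
        candidates = SYNSETS.get(term, [])
        in_domain = [s for s in candidates if domain in DOMAINS.get(s, [])]
        # Assumed fallback: first candidate when none is in the domain.
        pick = in_domain or candidates
        if pick:
            chosen[term] = pick[0]
    return domain, chosen

domain, senses = disambiguate(["mass", "opera", "melody"])
```

With this toy data the prevalent domain is "music", so the ambiguous term "mass" is resolved to its musical sense rather than the physical one.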


Synset Clustering
                       Pairwise complete link agglomerative strategy

●       Each synset generates a singleton cluster
●       For each pair of clusters
    ●       If the complete link property holds
        ●    Merge the involved clusters
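
A minimal sketch of this strategy follows. It assumes that the "complete link property" means every cross-cluster pair of synsets has similarity at or above a threshold, and that merging repeats until no pair of clusters qualifies; the threshold value, the function name and the toy similarity table are hypothetical.

```python
from itertools import combinations

def complete_link_clusters(items, sim, threshold):
    """Agglomerative clustering with the complete-link merge condition."""
    # Each item starts in its own singleton cluster.
    clusters = [frozenset([x]) for x in items]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(clusters, 2):
            # Complete link: ALL cross pairs must be similar enough.
            if all(sim(x, y) >= threshold for x in a for y in b):
                clusters = [c for c in clusters if c not in (a, b)] + [a | b]
                merged = True
                break  # restart the scan on the updated cluster list
    return clusters

# Hypothetical pairwise similarities; unlisted pairs default to 0.1.
TOY_SIM = {
    frozenset(("opera", "mass")): 0.8,
    frozenset(("opera", "rock")): 0.7,
    frozenset(("mass", "rock")): 0.6,
}

def toy_sim(x, y):
    return TOY_SIM.get(frozenset((x, y)), 0.1)

clusters = complete_link_clusters(
    ["opera", "mass", "rock", "belief"], toy_sim, threshold=0.5
)
```

In this toy run the three music-related items end up in one cluster, while "belief" stays a singleton because it is not similar enough to every member of that cluster.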




A Multistrategy Similarity Measure

Three components are summed and normalized into ]0,1[:
●    depth (ancestors)
●    breadth (direct neighbors)
●    breadth (inverse neighbors)


WordNet relationships are considered




A Multistrategy Similarity Measure
                       Considered Relationships
    member meronymy: the latter synset is a member meronym of the former;
    substance meronymy: the latter synset is a substance meronym of the former;
    part meronymy: the latter synset is a part meronym of the former;
    similarity: the latter synset is similar in meaning to the former;
    antonym: specifies an antonymous word;
    attribute: defines the attribute relation between noun and adjective synset
    pairs in which the adjective is a value of the noun;
    additional information: additional information about the first word can be
    obtained by seeing the second word;
    part of speech based: specifies two different relations based on the parts of
    speech involved;
    participle: the adjective first word is a participle of the verb second word;
    hypernymy: the latter synset is a hypernym of the former.



Document Partitioning

       ●       SynsetWord structure:
           ●       Original word
           ●       TF*IDF weight
           ●       Synset
       ●       The Pairwise Clustering step returned a set of synset clusters
       ●       For each document in the collection
           ●       Each of its SynsetWord votes with its TF*IDF weight
           ●       The first three clusters are chosen from the ranked list
               ●    They represent the intensional description of the document
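
The voting step can be sketched as below. The `SynsetWord` fields follow the structure listed above; the function name, the cluster identifiers and the toy weights are hypothetical, and the sketch assumes a SynsetWord votes for every cluster that contains its synset.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class SynsetWord:
    word: str      # original word
    weight: float  # TF*IDF weight
    synset: str    # disambiguated synset id

def describe_document(synset_words, clusters, top_k=3):
    """clusters maps a cluster id to the set of synset ids it contains.

    Returns the ids of the `top_k` clusters with the highest total vote.
    """
    votes = defaultdict(float)
    for sw in synset_words:
        for cid, members in clusters.items():
            if sw.synset in members:
                votes[cid] += sw.weight  # vote with the TF*IDF weight
    ranked = sorted(votes, key=lambda cid: (-votes[cid], cid))
    return ranked[:top_k]

# Hypothetical toy clusters and document.
toy_clusters = {
    "c_music": {"s_opera", "s_melody"},
    "c_religion": {"s_mass", "s_belief"},
    "c_physics": {"s_energy"},
    "c_politics": {"s_vote"},
}
toy_doc = [
    SynsetWord("opera", 0.5, "s_opera"),
    SynsetWord("melody", 0.3, "s_melody"),
    SynsetWord("mass", 0.4, "s_mass"),
    SynsetWord("energy", 0.1, "s_energy"),
    SynsetWord("vote", 0.05, "s_vote"),
]
description = describe_document(toy_doc, toy_clusters)
```

The returned cluster ids play the role of the intensional description of the document.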




User Query Elaboration
                                           Overview

      ●       Same grammatical preprocessing as in the previous phase
      ●       Query usually very short
          ●       No keyword extraction: all nouns retained for the next
                  operations
          ●       Domain-driven WSD is unreliable on such short text
              ●       For each word, all corresponding synsets in WordNet are kept
              ●       A single lexical query yields many semantic queries
                  ●
                       All possible combinations of synsets
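
Enumerating the semantic queries amounts to a Cartesian product over the candidate synsets of each noun. A minimal sketch, with hypothetical synset identifiers and function name:

```python
from itertools import product

def semantic_queries(nouns, synsets_of):
    """Return every combination that picks one synset per query noun.

    Nouns with no known synset are skipped (an assumption of this sketch).
    """
    candidate_lists = [synsets_of[n] for n in nouns if synsets_of.get(n)]
    if not candidate_lists:
        return []
    return [tuple(combo) for combo in product(*candidate_lists)]

# Hypothetical candidate senses for a two-noun query.
toy_synsets = {
    "ornament": ["ornament.decoration", "ornament.music"],
    "melody": ["melody.tune", "melody.perception"],
}
queries = semantic_queries(["ornament", "melody"], toy_synsets)
```

The number of semantic queries is the product of the per-noun sense counts (here 2 x 2 = 4), which stays manageable only because queries are usually very short.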




User Query Elaboration
                            A Brute Force WSD

          For each combination:
             ●    a similarity evaluated against each cluster that has at
                  least one associated document
             ●    using the same similarity function as for clustering


          Twofold objective:
             ●    finding the combination of synsets that represents the
                  best word sense disambiguation
             ●    obtaining the most similar cluster to the involved words
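
The brute-force search can be sketched as a double loop over combinations and non-empty clusters. The slides do not say how a combination is scored against a cluster, so this sketch assumes the average pairwise similarity; `toy_sim` is a hypothetical stand-in for the multistrategy measure, and all identifiers are illustrative.

```python
def best_sense_and_cluster(combos, clusters, docs_of, sim):
    """Return (score, combination, cluster id) with the highest score.

    Only clusters with at least one associated document are considered.
    """
    best = None
    for combo in combos:
        for cid, members in clusters.items():
            if not docs_of.get(cid):
                continue  # skip clusters without documents
            # Assumed aggregation: mean similarity over all cross pairs.
            total = sum(sim(q, m) for q in combo for m in members)
            score = total / (len(combo) * len(members))
            if best is None or score > best[0]:
                best = (score, combo, cid)
    return best

def toy_sim(a, b):
    """Hypothetical similarity on ids of the form 'word.domain'."""
    if a == b:
        return 1.0
    return 0.5 if a.split(".")[1] == b.split(".")[1] else 0.0

toy_combos = [
    ("ornament.art", "melody.music"),
    ("ornament.music", "melody.music"),
]
toy_clusters = {
    "c1": {"opera.music", "mass.music"},
    "c2": {"belief.religion"},
    "c3": {"tune.music"},
}
toy_docs_of = {"c1": ["d1", "d2"], "c2": ["d3"], "c3": []}

best = best_sense_and_cluster(toy_combos, toy_clusters, toy_docs_of, toy_sim)
```

A single pass yields both outcomes of the twofold objective: the winning combination is the chosen disambiguation, and the winning cluster is the one most similar to the query.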




User Query Elaboration
                                    Query Results

          The best combination is used to obtain the list of clusters
          ranked by descending relevance, which can be used as an
          answer to the user search.


          The results are then displayed to the user: the first n sets of
          documents are shown, where n is the minimum value that
          yields at least 10 results.
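
The cut-off rule can be sketched as a running count over the ranked clusters. The function name and toy data are hypothetical, and the sketch assumes that the document sets attached to different clusters are simply counted together.

```python
def clusters_to_show(ranked_clusters, docs_of, min_results=10):
    """Take clusters in rank order until at least `min_results` documents.

    Returns the selected cluster ids and the number of documents shown.
    """
    shown, total = [], 0
    for cid in ranked_clusters:
        if total >= min_results:
            break
        shown.append(cid)
        total += len(docs_of.get(cid, []))
    return shown, total

# Hypothetical ranked clusters with 6, 5 and 4 documents.
toy_docs_of = {
    "c1": ["d%d" % i for i in range(6)],
    "c2": ["d%d" % i for i in range(6, 11)],
    "c3": ["d%d" % i for i in range(11, 15)],
}
shown, total = clusters_to_show(["c1", "c2", "c3"], toy_docs_of)
```

Because whole sets are shown, the number of results can exceed 10: here n = 2 and 11 documents are displayed.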




A Preliminary Evaluation
                       The Quality of Clusters
 86 documents, 4 topics:
 27 general science and physics; 21 music; 15 politics; 23 religion.


 Query: Reincarnation and eternal life
 Best combination:
 ●     synset: 106191212; lemmas: reincarnation; gloss: the Hindu or Buddhist doctrine that a
       person may be reborn successively into one of five classes of living beings (god or human or
       animal or hungry ghost or denizen of Hell) depending on the person’s own actions;
 ●     synset: 100006269; lemmas: life; gloss: living things collectively.
 Most similar cluster:
 ●     synset: 106191212; lemmas: reincarnation; gloss: the Hindu or Buddhist doctrine that a
       person may be reborn successively into one of five classes of living beings (god or human or
       animal or hungry ghost or denizen of Hell) depending on the person’s own actions;
 ●     synset: 105943300; lemmas: doctrine, philosophical system, philosophy and school of
       thought; gloss: a belief (or system of beliefs) accepted as authoritative by some group or
       school;
 ●     synset: 105941423; lemmas: belief; gloss: any cognitive content held as true.

A Preliminary Evaluation
                         The Quality of Clusters
 Query: Ornaments and melodies
 Best combination:
 ●     synset: 103169390; lemmas: decoration, ornament and ornamentation; gloss: something used to
       beautify;
 ●     synset: 107028373; lemmas: air, line, melodic line, melodic phrase, melody, strain and tune; gloss: a
       succession of notes forming a distinctive sequence.
 Most similar cluster:
 ●     synset: 107025900; lemmas: classical, classical music and serious music; gloss: traditional genre of
       music conforming to an established form and appealing to critical interest and developed musical
       taste;
 ●     synset: 107033753; lemmas: mass; gloss: a musical setting for a Mass;
 ●     synset: 107026352; lemmas: opera; gloss: a drama set to music, consists of singing with orchestral
       accompaniment and an orchestral overture and interludes;
 ●     synset: 107071942; lemmas: genre, music genre, musical genre and musical style; gloss: an
       expressive style of music;
 ●     synset: 107064715; lemmas: rock, rock ’n’ roll, rock and roll, rock music, rock’n’roll and rock-and-
       roll; gloss: a genre of popular music originating in the 1950s, a blend of black rhythm-and-blues with
       white country-and-western.

A Preliminary Evaluation
                       Synthesis of Outcomes
  #   Query                                           Precision     Recall         Outcomes
  1   Ornaments and melodies                          0.82 (1.0)    0.43 (9/21)    [1 to 9] music; [10 to 11] religion
  2   Reincarnation and eternal life                  0.9 (1.0)     0.39 (9/23)    [1 to 9] religion; [10] science
  3   Traditions and folks                            0.8 (1.0)     0.38 (8/21)    [1 to 4] music; [5 to 6] religion; [7 to 10] music
  4   Limits of theory of relativity                  0.8           0.44 (12/27)   [1 to 2] science; [3] politics; [4 to 5] religion; [6 to 15] science
  5   Capitalism vs communism                         0.61 (0.77)   0.53 (8/15)    [1 to 3] politics; [4] science; [5 to 6] religion; [7 to 11] politics; [12] science; [13] music
  6   Markets and new economy                         0.6 (0.7)     0.4 (6/15)     [1] politics; [2] music; [3] science; [4 to 8] politics; [9 to 10] religion
  7   Relationship between democracy and parliament   0.5 (0.6)     0.33 (5/15)    [1 to 3] politics; [4] science; [5 to 6] politics; [7 to 10] religion
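
The unparenthesized precision and recall figures are consistent with treating a retrieved document as relevant when its topic matches the topic expected for the query. The sketch below recomputes the values for queries 1 and 5 under that assumption; the function name is hypothetical, and the parenthesized precision values are not reproduced because the slide does not explain them.

```python
def precision_recall(ranked_topics, target_topic, topic_size):
    """ranked_topics lists the topic of each retrieved document, in rank order.

    precision = relevant retrieved / retrieved
    recall    = relevant retrieved / documents of the target topic
    """
    relevant = sum(1 for t in ranked_topics if t == target_topic)
    return relevant / len(ranked_topics), relevant / topic_size

# Query 1: ranks 1 to 9 are music, 10 to 11 religion; 21 music documents.
q1 = ["music"] * 9 + ["religion"] * 2
p1, r1 = precision_recall(q1, "music", 21)

# Query 5: politics at ranks 1 to 3 and 7 to 11; 15 politics documents.
q5 = (["politics"] * 3 + ["science"] + ["religion"] * 2
      + ["politics"] * 5 + ["science", "music"])
p5, r5 = precision_recall(q5, "politics", 15)

print(round(p1, 2), round(r1, 2))  # 0.82 0.43
print(round(p5, 2), round(r5, 2))  # 0.62 0.53
```

The slide reports 0.61 for the precision of query 5, which corresponds to truncating rather than rounding 8/13 = 0.615.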

Conclusions
  Proposed an approach to extract information from digital libraries
     ●       Go beyond simple lexical matching, toward the semantic content
             underlying queries


  The approach consists of:
     ●       An off-line preprocessing on the entire corpus
         ●       Find sets of synsets as intensional descriptions for the documents
     ●       An on-line phase on the queries
         ●       Find the most suitable sense, evaluating all possible combinations
                 of synsets against each intensional description of the documents
             ●    In order to propose the most relevant ones as results


  Preliminary experiments show that this approach can be viable.
Future Works

     ●       Replacing the One Domain per Discourse (ODD) assumption with a
             more elaborate strategy for WSD
     ●       Avoiding the pre-processing step
         ●    To handle cases when new documents are progressively
              included in the collection
     ●       Including adverbs, verbs and adjectives
         ●    To improve the quality of the semantic representatives of the
              documents
         ●    To explore other approaches to choose better intensional
              descriptions of each document



Unlocking the Power of ChatGPT and AI in Testing - A Real-World Look, present...
 
12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work12 Ways to Increase Your Influence at Work
12 Ways to Increase Your Influence at Work
 
ChatGPT webinar slides
ChatGPT webinar slidesChatGPT webinar slides
ChatGPT webinar slides
 

A Domain Based Approach to Information Retrieval in Digital Libraries

  • 1. A Domain Based Approach to Information Retrieval in Digital Libraries
      F. Rotella, S. Ferilli, F. Leuzzi (ferilli@di.uniba.it, {fabio.leuzzi, rotella.fulvio}@gmail.com)
      L.A.C.A.M., Dipartimento di Informatica, Università degli studi di Bari “Aldo Moro”, http://lacam.di.uniba.it:8000
      8th Italian Research Conference on Digital Libraries, Bari, Italy, February 9-10, 2012
  • 2. Overview
      ● Introduction & Objectives
      ● Keyword Extraction
      ● Word Sense Disambiguation
      ● Synset Clustering
      ● A Multistrategy Similarity Measure
      ● Document Partitioning
      ● User Query Processing
      ● A Preliminary Evaluation
      ● Conclusions & Future Works
  • 3. Introduction
      Some repositories leave the responsibility of quality to the authors.
      + Anybody can produce and distribute documents.
      = Possible low average quality of the repository contents.
      Users are often overwhelmed by documents that only apparently suit their information needs.
  • 4. Introduction
      ● Possible way out: Information Retrieval systems
      ● Numerical/statistical manipulation of (key)words has been widely explored in the literature
          ● Still unable to fully solve the problem
      ● Achieving better retrieval performance requires going beyond simple lexical interpretation of the user queries
          ● Pass through an understanding of their semantic content and aims
      ● Ontological taxonomy
      ● WordNet
      ● WordNet Domains
  • 5. Objectives
      Improving fruition of a DL
      ● Use of advanced techniques for document retrieval
      ● Try to overcome the ambiguity of natural language
      ● Inspired by the typical behavior of humans:
          ● take into account the possible meanings of words
          ● select the most appropriate one according to the context of the discourse
  • 6. Keyword Extraction
      ● Each document in the digital library is progressively split into paragraphs, sentences, and single words
          ● Integrated in the DOMINUS framework
      ● Obtained the syntactic structure of sentences, and the lemmas
          ● Integrated in the Stanford Parser
      ● Classical VSM
          ● TF*IDF weighting
      ● Two filters:
          ● Only nouns considered
              ● The representation of adverbs, verbs and adjectives in WordNet is different
          ● Only the top 10% keywords for each document
              ● To be noise-tolerant
              ● To limit the possibility of including non-discriminative and very general words in the representation of a document
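The weighting and the top-10% filter on the slide above can be illustrated with a minimal sketch. It assumes the nouns have already been extracted and lemmatized, uses one common TF*IDF variant (relative term frequency times log of inverse document frequency), and rounds the 10% cut upward so that every document keeps at least one keyword. The function name `tfidf_keywords` and these choices are illustrative assumptions, not details taken from the presentation.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_fraction=0.10):
    """Return, for each document, its top-weighted terms.

    docs: dict mapping a document id to a list of noun lemmas
          (assumed to be already extracted by the preprocessing step).
    """
    n_docs = len(docs)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for terms in docs.values() for term in set(terms))
    result = {}
    for doc_id, terms in docs.items():
        if not terms:
            result[doc_id] = []
            continue
        tf = Counter(terms)
        weights = {
            term: (count / len(terms)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending weight, breaking ties alphabetically.
        ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        # Assumption: round the cut upward and keep at least one term.
        keep = max(1, math.ceil(len(ranked) * top_fraction))
        result[doc_id] = ranked[:keep]
    return result
```

Terms that occur in every document get weight 0 under this variant, which matches the stated goal of keeping very general words out of a document's representation.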
  • 7. Word Sense Disambiguation: Domain Driven
      One Domain per Discourse assumption: many uses of a word in a coherent portion of text tend to share the same domain.
      ● Prevalent domain identification
      ● Extraction of all synsets for each term
      ● Extraction of all domains for each synset
      ● Choice of prevalent domain synset
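As a rough illustration of the domain-driven steps listed above, the sketch below counts domain occurrences over all candidate synsets of all terms, takes the most frequent domain as the prevalent one, and then picks for each term a synset belonging to that domain. The sense inventory is a toy dictionary with made-up identifiers, not WordNet or WordNet Domains data. The counting scheme and the fallback to the first listed synset are assumptions.

```python
from collections import Counter


def disambiguate(terms, senses):
    """Pick one synset per term under the One Domain per Discourse assumption.

    senses: dict mapping a term to a list of (synset_id, set_of_domains).
    Returns (prevalent_domain, {term: chosen_synset_id}).
    """
    # Every candidate synset of every term votes for each of its domains.
    domain_votes = Counter()
    for term in terms:
        for _synset, domains in senses.get(term, []):
            domain_votes.update(domains)
    if not domain_votes:
        return None, {}
    # Most voted domain; ties broken alphabetically for determinism.
    prevalent = min(domain_votes, key=lambda d: (-domain_votes[d], d))

    chosen = {}
    for term in terms:
        candidates = senses.get(term, [])
        if not candidates:
            continue
        in_domain = [s for s, doms in candidates if prevalent in doms]
        # Assumption: fall back to the first listed sense if none matches.
        chosen[term] = in_domain[0] if in_domain else candidates[0][0]
    return prevalent, chosen


# Hypothetical sense inventory with made-up synset identifiers.
TOY_SENSES = {
    "mass": [("mass.physics", {"physics"}),
             ("mass.music", {"music"}),
             ("mass.religion", {"religion"})],
    "opera": [("opera.music", {"music"})],
    "melody": [("melody.music", {"music"})],
}
```

With these toy senses, the ambiguous term "mass" is resolved to its musical sense because the surrounding terms make "music" the prevalent domain.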
  • 8. Synset Clustering
      Pairwise complete link agglomerative strategy
      ● Each synset generates a singleton cluster
      ● For each pair of clusters
          ● If the complete link property holds
              ● Merge the involved clusters
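The clustering loop above can be sketched as follows. The slide does not say how the complete link property is tested, so this version assumes a similarity threshold: two clusters are merged only if every cross-cluster pair of synsets is at least that similar. The `similarity` argument and the `threshold` parameter are placeholders, not the presentation's actual measure.

```python
from itertools import combinations


def complete_link_clustering(items, similarity, threshold):
    """Agglomerative clustering with a complete-link merge test.

    items: list of synset identifiers.
    similarity: function (a, b) -> float, assumed symmetric.
    threshold: assumed minimum similarity required for ALL cross pairs.
    """
    # Each synset starts in its own singleton cluster.
    clusters = [[item] for item in items]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            # Complete link: every pair across the two clusters must pass.
            if all(similarity(a, b) >= threshold
                   for a in clusters[i] for b in clusters[j]):
                clusters[i] = clusters[i] + clusters[j]
                del clusters[j]
                merged = True
                break  # Restart the scan, since indices have shifted.
    return clusters


def toy_similarity(a, b):
    """Toy measure: 1.0 if the 'domain:' prefixes match, else 0.1."""
    return 1.0 if a.split(":")[0] == b.split(":")[0] else 0.1
```

Restarting the scan after each merge keeps the code simple at the cost of extra comparisons, which is acceptable for a sketch but would need revisiting for a large synset set.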
  • 9. A Multistrategy Similarity Measure
      3 components are summed and normalized, in ]0,1[
      ● depth (ancestors)
      ● breadth (direct neighbors)
      ● breadth (inverse neighbors)
      WordNet relationships are considered
  • 10. A Multistrategy Similarity Measure: Considered Relationships
      ● member meronymy: the latter synset is a member meronym of the former;
      ● substance meronymy: the latter synset is a substance meronym of the former;
      ● part meronymy: the latter synset is a part meronym of the former;
      ● similarity: the latter synset is similar in meaning to the former;
      ● antonym: specifies antonymous word;
      ● attribute: defines the attribute relation between noun and adjective synset pairs in which the adjective is a value of the noun;
      ● additional information: additional information about the first word can be obtained by seeing the second word;
      ● part of speech based: specifies two different relations based on the parts of speech involved;
      ● participle: the adjective first word is a participle of the verb second word;
      ● hypernymy: the latter synset is a hypernym of the former.
  • 11. Document Partitioning
      ● SynsetWord structure:
          ● Original word
          ● TF*IDF weight
          ● Synset
      ● The Pairwise Clustering step returned a set of synset clusters
      ● For each document in the collection
          ● Each of its SynsetWords votes with its TF*IDF weight
          ● The first three clusters are chosen from the ranked list
          ● They represent the intensional description of the document
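The voting step above can be sketched like this. It assumes that a SynsetWord votes for every cluster that contains its synset and that votes are simply summed per cluster. The `SynsetWord` tuple mirrors the three fields listed on the slide, while the function name and the tie-breaking rule are illustrative.

```python
from collections import defaultdict, namedtuple

# The three fields listed on the slide.
SynsetWord = namedtuple("SynsetWord", ["word", "weight", "synset"])


def describe_document(synset_words, clusters, top_k=3):
    """Return the indices of the top_k most voted clusters for one document.

    synset_words: list of SynsetWord for the document.
    clusters: list of sets of synset identifiers.
    """
    votes = defaultdict(float)
    for sw in synset_words:
        for idx, cluster in enumerate(clusters):
            if sw.synset in cluster:
                # Each SynsetWord votes with its TF*IDF weight.
                votes[idx] += sw.weight
    # Rank by descending vote; ties broken by cluster index (an assumption).
    ranked = sorted(votes, key=lambda idx: (-votes[idx], idx))
    return ranked[:top_k]
```

The returned cluster indices play the role of the document's intensional description. Clusters that receive no vote never appear, so a document may end up with fewer than three.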
  • 12. User Query Elaboration: Overview
      ● Same grammatical preprocessing as in the previous phase
      ● Query usually very short
          ● No keyword extraction: all nouns retained for the next operations
          ● Domain-driven WSD unreliable
      ● For each word, all corresponding synsets in WordNet are kept
      ● A single lexical query yields many semantic queries
          ● All possible combinations of synsets
  • 13. User Query Elaboration: A Brute Force WSD
      For each combination:
      ● a similarity is evaluated against each cluster that has at least one associated document
      ● using the same similarity function as for clustering
      Twofold objective:
      ● finding the combination of synsets that represents the best word sense disambiguation
      ● obtaining the most similar cluster to the involved words
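The brute-force search described on the two slides above can be sketched with a Cartesian product over the candidate synsets of each query word. The slides do not state how a combination is scored against a cluster, so this sketch assumes the average pairwise similarity. The data layout, the function names and the toy similarity are all illustrative.

```python
from itertools import product


def best_combination(query_senses, clusters, similarity):
    """Try every combination of synsets, one per query word.

    query_senses: dict word -> list of candidate synset ids.
    clusters: dict cluster_id -> {"synsets": [...], "docs": [...]}.
    Returns (score, combination, cluster_id) for the best match.
    """
    best = (float("-inf"), None, None)
    for combo in product(*query_senses.values()):
        for cluster_id, cluster in clusters.items():
            # Only clusters with at least one associated document count.
            if not cluster["docs"] or not cluster["synsets"]:
                continue
            pairs = [(q, c) for q in combo for c in cluster["synsets"]]
            # Assumption: combination-to-cluster score is the mean
            # pairwise similarity.
            score = sum(similarity(q, c) for q, c in pairs) / len(pairs)
            if score > best[0]:
                best = (score, combo, cluster_id)
    return best


def prefix_similarity(a, b):
    """Toy measure: 1.0 if the 'domain:' prefixes match, else 0.0."""
    return 1.0 if a.split(":")[0] == b.split(":")[0] else 0.0
```

The number of combinations is the product of the sense counts of the query words, which stays manageable only because queries are usually very short.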
  • 14. User Query Elaboration: Query Results
      The best combination is used to obtain the list of clusters ranked by descending relevance, which can be used as an answer to the user search.
      The results are then displayed to the user: the first n sets of documents are shown, where n is the minimum value that yields at least 10 results.
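The cut-off rule above (the smallest n whose first n document sets yield at least 10 results) can be sketched as a simple prefix scan. Counting distinct documents, so that a document associated with several clusters is counted once, is an assumption, and the function name is illustrative.

```python
def clusters_to_show(ranked_clusters, min_results=10):
    """Return the shortest prefix of ranked clusters with enough results.

    ranked_clusters: list of lists of document ids, already ordered by
    descending relevance to the query.
    """
    shown = []
    seen = set()
    for doc_ids in ranked_clusters:
        shown.append(doc_ids)
        # Assumption: a document shared by several clusters counts once.
        seen.update(doc_ids)
        if len(seen) >= min_results:
            break
    # If the whole list holds fewer than min_results documents,
    # every cluster is returned.
    return shown
```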
  • 15. A Preliminary Evaluation: The Quality of Clusters
      86 documents, 4 topics: 27 general science and physics; 21 music; 15 politics; 23 religion.
      Query: Reincarnation and eternal life
      Best combination:
      ● synset: 106191212; lemmas: reincarnation; gloss: the Hindu or Buddhist doctrine that a person may be reborn successively into one of five classes of living beings (god or human or animal or hungry ghost or denizen of Hell) depending on the person’s own actions;
      ● synset: 100006269; lemmas: life; gloss: living things collectively.
      Most similar cluster:
      ● synset: 106191212; lemmas: reincarnation; gloss: the Hindu or Buddhist doctrine that a person may be reborn successively into one of five classes of living beings (god or human or animal or hungry ghost or denizen of Hell) depending on the person’s own actions;
      ● synset: 105943300; lemmas: doctrine, philosophical system, philosophy and school of thought; gloss: a belief (or system of beliefs) accepted as authoritative by some group or school;
      ● synset: 105941423; lemmas: belief; gloss: any cognitive content held as true.
  • 16. A Preliminary Evaluation: The Quality of Clusters
      Query: Ornaments and melodies
      Best combination:
      ● synset: 103169390; lemmas: decoration, ornament and ornamentation; gloss: something used to beautify;
      ● synset: 107028373; lemmas: air, line, melodic line, melodic phrase, melody, strain and tune; gloss: a succession of notes forming a distinctive sequence.
      Most similar cluster:
      ● synset: 107025900; lemmas: classical, classical music and serious music; gloss: traditional genre of music conforming to an established form and appealing to critical interest and developed musical taste;
      ● synset: 107033753; lemmas: mass; gloss: a musical setting for a Mass;
      ● synset: 107026352; lemmas: opera; gloss: a drama set to music, consists of singing with orchestral accompaniment and an orchestral overture and interludes;
      ● synset: 107071942; lemmas: genre, music genre, musical genre and musical style; gloss: an expressive style of music;
      ● synset: 107064715; lemmas: rock, rock ’n’ roll, rock and roll, rock music, rock’n’roll and rock-and-roll; gloss: a genre of popular music originating in the 1950s, a blend of black rhythm-and-blues with white country-and-western.
  • 17. A Preliminary Evaluation: Synthesis of Outcomes
      (for each query: outcomes, precision, recall)
      ● Query 1, Ornaments and melodies: outcomes [1 to 9] music, [10 to 11] religion; precision 0.82 (1.0); recall 0.43 (9/21)
      ● Query 2, Reincarnation and eternal life: outcomes [1 to 9] religion, [10] science; precision 0.9 (1.0); recall 0.39 (9/23)
      ● Query 3, Traditions and folks: outcomes [1 to 4] music, [5 to 6] religion, [7 to 10] music; precision 0.8 (1.0); recall 0.38 (8/21)
      ● Query 4, Limits of theory of relativity: outcomes [1 to 2] science, [3] politics, [4 to 5] religion, [6 to 15] science; precision 0.8; recall 0.44 (12/27)
      ● Query 5, Capitalism vs communism: outcomes [1 to 3] politics, [4] science, [5 to 6] religion, [7 to 11] politics, [12] science, [13] music; precision 0.61 (0.77); recall 0.53 (8/15)
      ● Query 6, Markets and new economy: outcomes [1] politics, [2] music, [3] science, [4 to 8] politics, [9 to 10] religion; precision 0.6 (0.7); recall 0.4 (6/15)
      ● Query 7, Relationship between democracy and parliament: outcomes [1 to 3] politics, [4] science, [5 to 6] politics, [7 to 10] religion; precision 0.5 (0.6); recall 0.33 (5/15)
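The first precision figure and the recall figure of each query can be checked with a few lines of arithmetic. The sketch below reads each outcome such as "[1 to 9] music" as a range of result positions occupied by documents of that topic, and takes the intended topic of each query from the recall denominator (for example 21 music documents for query 1). Both readings are assumptions. The parenthesized second precision value is not explained on the slide and is not reproduced here.

```python
def precision_recall(outcomes, target_topic, relevant_in_corpus):
    """Compute precision and recall from ranked outcome ranges.

    outcomes: list of (first_position, last_position, topic).
    target_topic: topic assumed to be the relevant one for the query.
    relevant_in_corpus: number of documents of that topic in the corpus.
    """
    retrieved = sum(last - first + 1 for first, last, _ in outcomes)
    hits = sum(last - first + 1
               for first, last, topic in outcomes if topic == target_topic)
    return hits / retrieved, hits / relevant_in_corpus


# Query 1: 9 music documents among 11 results, 21 music documents overall.
p1, r1 = precision_recall([(1, 9, "music"), (10, 11, "religion")], "music", 21)

# Query 4: 12 science documents among 15 results, 27 science documents overall.
p4, r4 = precision_recall(
    [(1, 2, "science"), (3, 3, "politics"), (4, 5, "religion"), (6, 15, "science")],
    "science", 27)
```

Rounded to two decimals these give 0.82 and 0.43 for query 1 and 0.8 and 0.44 for query 4, in line with the figures on the slide.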
  • 18. Conclusions
      Proposed an approach to extract information from digital libraries
      ● Go beyond simple lexical matching, toward the semantic content underlying queries
      The approach consists of:
      ● An off-line preprocessing on the entire corpus
          ● Find sets of synsets as intensional descriptions for the documents
      ● An on-line phase on the queries
          ● Find the most suitable sense, evaluating all possible combinations of synsets against each intensional description of the documents
          ● In order to propose as result the most relevant ones
      Preliminary experiments show that this approach can be viable.
  • 19. Future Works
      ● Substitution of the ODD assumption with a more elaborate strategy for WSD
      ● Avoiding the pre-processing step
          ● To handle cases when new documents are progressively included in the collection
      ● Including adverbs, verbs and adjectives
          ● To improve the quality of the semantic representatives of the documents
      ● To explore other approaches to choose better intensional descriptions of each document