Vivo Search
Presentation on search ranking improvement at the 2011 VIVO Conference, Washington, DC.


Vivo Search: Presentation Transcript

  • Improving VIVO search results through Semantic Ranking. Anup Sawant, Deepak Konidena
  • VIVO Search up to Release 1.2.1
      • Lucene keyword-based search.
      • Score based on textual relevance.
      • The importance of a node was not taken into consideration.
      • Additional data describing a relationship was not searched.
  • Adding knowledge from semantic relationships
      • VIVO 1.2 Search kept only restricted information about an individual in the index. This led people to ask questions like:
      • "Hey, I work for 'USDA', but when I search for 'USDA' my profile doesn't show up in the search results, and vice versa."
      • "Hey, information related to my educational background, awards, the roles I assumed, etc. appears on my profile but doesn't show up in the search results when I search for it directly or when I search for my name."
  • What does the semantic graph look like in the presence of context nodes?
  • Intermediate nodes were overlooked.
      • Traditionally, semantic relationships of an Individual such as Roles, Educational Training, Awards, Authorship, etc. were not stored in the index.
      • Individuals were connected to these properties through intermediate nodes called "Context Nodes", and the information hiding behind these context nodes was not captured. Reaching through such a node might look like the sketch below.
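    A minimal sketch, assuming a simplified VIVO-style graph, of a SPARQL query that walks through a context node to reach the information hidden behind it. The property URIs are illustrative stand-ins, not necessarily the exact VIVO core ontology terms:

    // ContextNodeQuery.java - builds a SPARQL query that reaches through a
    // context node (?role) to the label stored behind it.
    public class ContextNodeQuery {
        public static void main(String[] args) {
            String sparql =
                "PREFIX core: <http://vivoweb.org/ontology/core#> \n" +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> \n" +
                "SELECT ?roleLabel WHERE { \n" +
                "  ?person core:hasRole ?role . \n" +   // ?role is the context node
                "  ?role rdfs:label ?roleLabel . \n" +  // the information hiding behind it
                "}";
            System.out.println(sparql); // would be run against the VIVO model, e.g. with Jena
        }
    }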
  • Lucene field for an Individual. And here's why.
  • VIVO Search in 1.3
      • Transition from Lucene to SOLR.
      • Provides a base for distributed search capabilities.
      • Individuals enriched by descriptions of their semantic relationships.
      • Score enhanced by an Individual's connectivity.
      • Improved precision and recall of search results.
  • Influence of PageRank
      • Introduced by Larry Page & Sergey Brin.
      • Every node relies on every other node for its ranking.
      • Intuitive understanding: a node's importance is calculated from its incoming connections and the contributions of highly ranked nodes.
  • Some parameters based on PageRank (a sketch of the first two follows below)
    • β
      • The number of nodes connected to a particular node.
      • Intuition: a node probably deserves a high rank because it is connected to a lot of individuals.
    • Φ
      • The average over the β values of all the nodes to which a node is connected.
      • Intuition: a node probably deserves a high rank because it is connected to some important individuals.
    • Γ
      • The average strength of uniqueness of the properties through which a node is connected.
      • Intuition: a node probably deserves a high rank based on the strength of its connections to other nodes.
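    A minimal sketch (not the VIVO implementation) of β and Φ over a toy undirected graph; Γ is omitted because it would need per-property uniqueness weights that the slides do not specify:

    import java.util.*;

    // Connectivity.java - toy illustration of the beta and phi parameters.
    public class Connectivity {

        // beta(v): the number of nodes connected to v.
        static int beta(Map<String, Set<String>> graph, String v) {
            return graph.getOrDefault(v, Collections.emptySet()).size();
        }

        // phi(v): the average of the beta values of v's neighbours.
        static double phi(Map<String, Set<String>> graph, String v) {
            Set<String> neighbours = graph.getOrDefault(v, Collections.emptySet());
            if (neighbours.isEmpty()) return 0.0;
            double sum = 0.0;
            for (String n : neighbours) sum += beta(graph, n);
            return sum / neighbours.size();
        }

        public static void main(String[] args) {
            Map<String, Set<String>> graph = new HashMap<>();
            graph.put("A", new HashSet<>(Arrays.asList("B", "C", "D")));
            graph.put("B", new HashSet<>(Arrays.asList("A")));
            graph.put("C", new HashSet<>(Arrays.asList("A", "D")));
            graph.put("D", new HashSet<>(Arrays.asList("A", "C")));
            System.out.println("beta(A) = " + beta(graph, "A")); // 3
            System.out.println("phi(A)  = " + phi(graph, "A"));  // (1 + 2 + 2) / 3 = 1.67
        }
    }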
  • Search Index Architecture: Enriching with Semantic Relations. [Diagram: a multithreaded indexing phase runs SPARQL queries to compute an Individual's overall connectivity (β) and apply proper boosts to the Apache Solr index; the searching phase routes queries through the Dismax query handler, which returns relevant documents.]
  • Real-time Indexing: Enriching with Semantic Relations. [Same diagram, extended: an ADD/EDIT/DELETE of an Individual or its properties updates the index in real time, and the changes propagate beyond intermediate nodes.] A minimal searching-phase client is sketched below.
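    A minimal searching-phase sketch using SolrJ, assuming a local Solr core at a hypothetical URL; the field names and boost values are illustrative, not VIVO's actual schema:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    // DismaxSearch.java - issues a dismax query with per-field boosts.
    public class DismaxSearch {
        public static void main(String[] args) throws Exception {
            HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/vivo").build();
            SolrQuery query = new SolrQuery("scripps");
            query.set("defType", "dismax");              // route through the dismax query handler
            query.set("qf", "name^3.0 description^1.0"); // hypothetical fields: boost name over description
            QueryResponse response = solr.query(query);
            System.out.println(response.getResults().getNumFound() + " results");
            solr.close();
        }
    }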
  • Cluster Analysis of Search Results
    • Intuition
      • Treat the search results from Release 1.2.1 and Release 1.3 as two different clusters.
    • Expectation
      • Results from Release 1.3 should have their mean vector closer to the query vector.
    • Results
      • Text-to-vector conversion using the "bag of words" technique (sketched below).
      • Tanimoto distance measure used.
      • Code at: https://github.com/anupsavvy/Cluster_Analysis

    Query               Distance from Release 1.2.1 mean vector   Distance from Release 1.3 mean vector
    Scripps             0.27286328362357193                       0.004277746256068157
    Paulson James       0.009907336493786136                      0.004650133621323327
    Genome Sequencing   9.185463752863598E-4                      8.154498815206635E-4
    Kenny Paul          0.007610235640599918                      0.003984303949283425
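    A minimal sketch of the measurement, assuming simple whitespace tokenization; the full code is in the Cluster_Analysis repository linked above:

    import java.util.*;

    // TanimotoSketch.java - bag-of-words vectors over a shared vocabulary,
    // compared with the Tanimoto distance.
    public class TanimotoSketch {

        // Turn a text into term counts over a fixed vocabulary ("bag of words").
        static double[] toVector(String text, List<String> vocabulary) {
            Map<String, Integer> counts = new HashMap<>();
            for (String token : text.toLowerCase().split("\\s+"))
                counts.merge(token, 1, Integer::sum);
            double[] v = new double[vocabulary.size()];
            for (int i = 0; i < vocabulary.size(); i++)
                v[i] = counts.getOrDefault(vocabulary.get(i), 0);
            return v;
        }

        // Tanimoto distance = 1 - (a.b) / (|a|^2 + |b|^2 - a.b).
        static double tanimotoDistance(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i];
                na  += a[i] * a[i];
                nb  += b[i] * b[i];
            }
            return 1.0 - dot / (na + nb - dot);
        }

        public static void main(String[] args) {
            List<String> vocab = Arrays.asList("scripps", "institute", "florida", "cornell");
            double[] query  = toVector("scripps", vocab);
            double[] result = toVector("scripps institute florida", vocab);
            System.out.println(tanimotoDistance(query, result)); // 1 - 1/3 = 0.666...
        }
    }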
  • Understanding how it happens: [Diagram: each search result R1, R2, R3, ... is a document built from textual fields such as name, location, description, research, and articles.]
  • Understanding how it happens: [Diagram: the vocabulary (scripps, loring, jeanne, institute, cornell, florida, ...) defines the vector dimensions; the query Q and each result R1, R2, R3, ... become term-count vectors over that vocabulary, forming a term-document matrix.]
  • Understanding how it happens: [Diagram: two vectors V1 and V2 in the space spanned by institute, cornell, and loring, separated by angle θ. When a vector is scaled, the Euclidean distance increases while the cosine distance remains the same, as the snippet below demonstrates.]
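    A small self-contained demonstration of that claim:

    // CosineVsEuclidean.java - scaling a vector changes the Euclidean
    // distance but leaves the cosine distance unchanged.
    public class CosineVsEuclidean {

        static double euclidean(double[] a, double[] b) {
            double s = 0;
            for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
            return Math.sqrt(s);
        }

        static double cosineDistance(double[] a, double[] b) {
            double dot = 0, na = 0, nb = 0;
            for (int i = 0; i < a.length; i++) {
                dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
            }
            return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
        }

        public static void main(String[] args) {
            double[] v1 = {1, 2, 0};
            double[] v2 = {2, 1, 1};
            double[] v2scaled = {4, 2, 2}; // same direction as v2, twice the length
            System.out.println(euclidean(v1, v2) + " vs " + euclidean(v1, v2scaled));           // grows
            System.out.println(cosineDistance(v1, v2) + " vs " + cosineDistance(v1, v2scaled)); // identical
        }
    }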
  • Query vector distance from Cluster Mean vectors
  • User testing for Relevance
  • Precision and Recall. With X the number of retrieved documents that are also relevant: Precision = X / (Total Retrieved), Recall = X / (Total Relevant).
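    The two formulas, translated directly, with illustrative numbers:

    // PrecisionRecall.java - x is the number of retrieved documents
    // that are actually relevant.
    public class PrecisionRecall {
        static double precision(int x, int totalRetrieved) { return (double) x / totalRetrieved; }
        static double recall(int x, int totalRelevant)     { return (double) x / totalRelevant; }

        public static void main(String[] args) {
            // e.g. 8 of 10 retrieved results are relevant, out of 20 relevant overall
            System.out.println(precision(8, 10)); // 0.8
            System.out.println(recall(8, 20));    // 0.4
        }
    }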
  • Precision-Recall graphs based on User Analysis.
  • Cluster Analysis for Relevance
  • Precision-Recall graphs based on Cluster Analysis
  • Query vector distance from individual search result vectors
  • Experiments: SOLR
      • Search query expansion can be done using the SOLR synonym analyzer.
      • Princeton WordNet (http://wordnet.princeton.edu/) is frequently used with the SOLR synonym analyzer.
      • A gist by Bradford on GitHub (https://gist.github.com/562776) was used to convert the WordNet flat file into a SOLR-compatible synonyms file.
      • Pros
        • High recall.
        • Documents can be matched to well-known acronyms and words not present in the SOLR index. For instance, a query containing the term 'fl' would retrieve documents related to 'Florida' as well (see the sketch below).
      • Cons
        • Documents matching only the synonym part of the query could be ranked higher.
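    An illustrative, standalone sketch of the effect the synonym analyzer produces; in a real deployment this is configured in schema.xml through solr.SynonymFilterFactory and a synonyms.txt file, and this version only mimics that behavior:

    import java.util.*;

    // SynonymExpansion.java - expands query terms through a synonyms table.
    public class SynonymExpansion {
        public static void main(String[] args) {
            Map<String, List<String>> synonyms = new HashMap<>();
            synonyms.put("fl", Arrays.asList("florida")); // acronym -> expansion

            List<String> expanded = new ArrayList<>();
            for (String term : "university fl".split("\\s+")) {
                expanded.add(term);
                expanded.addAll(synonyms.getOrDefault(term, Collections.emptyList()));
            }
            System.out.println(expanded); // [university, fl, florida]
        }
    }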
  • Experiments: SOLR (cont.)
      • A certain degree of spelling-correction-like behavior can be achieved through the SOLR Phonetic Analyzer.
      • The Phonetic Analyzer uses Apache Commons Codec for its phonetic implementations.
      • Pros
        • High recall.
        • Helps in catching spelling mistakes in the search query. For instance, a query like 'scrips' would be matched to the similar-sounding word 'scripps', which is actually present in the index (see the sketch below). A misspelled name like 'Polex Frank' in the query could be matched to the correct name 'Polleux Franck'.
      • Cons
        • Results matched purely on phonetics could decrease the precision of the engine.
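    A small sketch of the underlying phonetic matching, using the DoubleMetaphone encoder from Apache Commons Codec (one of the encoders the Phonetic Analyzer can delegate to); with the default settings both spellings reduce to the same key:

    import org.apache.commons.codec.language.DoubleMetaphone;

    // PhoneticMatch.java - two spellings collide on one phonetic key.
    public class PhoneticMatch {
        public static void main(String[] args) {
            DoubleMetaphone encoder = new DoubleMetaphone();
            System.out.println(encoder.encode("scrips"));  // phonetic key of the misspelling
            System.out.println(encoder.encode("scripps")); // same key -> the two terms match
            System.out.println(encoder.isDoubleMetaphoneEqual("scrips", "scripps")); // true
        }
    }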
  • Experiments: Ontology provides a good base for Factoid Questioning.
      • Properties of Individuals give direct reference to the information.
      • Natural language techniques and machine learning algorithms could help us understand the search query better.
      • A query like "What is Brian Lowe's email id?" should probably return just the email id on top, and a query like "Who are the co-authors of Brian Lowe?" should return just the list of co-authors of Brian Lowe.
      • We can train an algorithm to recognize the type of question or search query that has been fired. The Cognitive Computation Group at the University of Illinois at Urbana-Champaign provides a corpus of tagged questions to be used as a training set: http://cogcomp.cs.illinois.edu/page/resources/data
  • Experiments: Ontology provides a good base for Factoid Questioning. (cont.)
      • Once the question type is determined, we could grammatically parse the question using the Stanford Lexparser: http://nlp.stanford.edu/software/lex-parser.shtml
      • The question type tells us whether we should look for a datatype property or an object property; the Lexparser helps us form a SPARQL query.
    [Diagram: Search Query -> Kmeans/SVM classifier (trained on the corpora) -> Question type; Stanford Lexparser -> Terms; Question type + Terms -> SPARQL Query.] A hypothetical sketch follows below.
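    A hypothetical end-to-end sketch of that pipeline with the classifier and parser stubbed out; the question patterns, SPARQL templates, and property URIs (core:email, core:coAuthorWith) are all illustrative, since the slides propose training a real Kmeans/SVM classifier on the UIUC corpus and extracting terms with the Stanford Lexparser:

    // FactoidPipeline.java - stubbed sketch of: question -> type -> SPARQL.
    public class FactoidPipeline {

        enum QuestionType { DATATYPE_PROPERTY, OBJECT_PROPERTY, UNKNOWN }

        // Stub standing in for the trained Kmeans/SVM classifier.
        static QuestionType classify(String question) {
            if (question.startsWith("What is")) return QuestionType.DATATYPE_PROPERTY;
            if (question.startsWith("Who"))     return QuestionType.OBJECT_PROPERTY;
            return QuestionType.UNKNOWN;
        }

        // Stub standing in for term extraction with the Stanford Lexparser.
        static String buildSparql(QuestionType type, String personLabel) {
            switch (type) {
                case DATATYPE_PROPERTY: // e.g. "What is Brian Lowe's email id?"
                    return "SELECT ?email WHERE { ?p rdfs:label \"" + personLabel + "\" . "
                         + "?p core:email ?email }";
                case OBJECT_PROPERTY:   // e.g. "Who are the co-authors of Brian Lowe?"
                    return "SELECT ?coauthor WHERE { ?p rdfs:label \"" + personLabel + "\" . "
                         + "?p core:coAuthorWith ?coauthor }";
                default:
                    return null;
            }
        }

        public static void main(String[] args) {
            String question = "What is Brian Lowe's email id?";
            System.out.println(buildSparql(classify(question), "Brian Lowe"));
        }
    }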
  • Summary
      • Transition from Lucene to SOLR.
      • Additional information about semantic relationships and interconnectivity in the index.
      • More relevant results and better ranking compared to VIVO 1.2.1.
      • Improvements in indexing time due to multithreading.
  • Team Work…