Related Entity Finding on the Web


  1. Related Entity Finding on the Web
     Peter Mika, Senior Research Scientist, Yahoo! Research
     Joint work with B. Barla Cambazoglu and Roi Blanco
  2. Search is really fast, without necessarily being intelligent
  3. Why Semantic Search? Part I
     • Improvements in IR are harder and harder to come by
       – Machine learning using hundreds of features
         • Text-based features for matching
         • Graph-based features provide authority
       – Heavy investment in computational power, e.g. real-time indexing and instant search
     • Remaining challenges are not computational, but in modeling user cognition
       – Need a deeper understanding of the query, the content and/or the world at large
       – Could Watson explain why the answer is Toronto?
  4. What is it like to be a machine? Roi Blanco
  5. What is it like to be a machine? ✜Θ♬♬ţğ ✜Θ♬♬ţğ√∞®ÇĤĪ✜★♬☐✓✓ ţğ★✜ ✪✚✜ΔΤΟŨŸÏĞÊϖυτρ℠≠⅛⌫Γ ≠=⅚©§★✓♪ΒΓΕ℠ ✖Γ ±♫⅜ ⏎↵⏏☐ģğğğμλκσςτ ⏎⌥°¶§ΥΦΦΦ✗✕☐
  6. Ambiguity
  7. Why Semantic Search? Part II
     • The Semantic Web is here
       – Data
         • Large amounts of RDF data
         • Heterogeneous schemas
         • Diverse quality
       – End users
         • Not skilled in writing complex queries (e.g. SPARQL)
         • Not familiar with the data
     • Novel applications
       – Complementing document search
         • Rich Snippets, related entities, direct answers
       – Other novel search tasks
  8. Semantic Web data
     • Linked Data
       – Data published as RDF documents linked to other RDF documents and/or using SPARQL end-points
       – Community effort to re-publish large public datasets (e.g. DBpedia, open government data)
     • RDFa and microdata
       – Data embedded inside HTML pages
       – Schema.org collaboration among Bing, Google, Yahoo and Yandex
       – Facebook Open Graph Protocol (OGP)
  9. Other novel applications
     • Aggregation of search results
       – e.g. price comparison across websites
     • Analysis and prediction
       – e.g. world temperature by 2020
     • Semantic profiling
       – Ontology-based modeling of user interests
     • Semantic log analysis
       – Linking query and navigation logs to ontologies
     • Task completion
       – e.g. booking a vacation using a combination of services
     • Conversational search
       – e.g. PARLANCE EU FP7 project
     Web usage mining with Semantic Analysis, Fri 3pm
  10. Interactive search and task completion
  11. Semantic Search
  12. Semantic Search: a definition
      • Semantic search is a retrieval paradigm that
        – Makes use of the structure of the data or explicit schemas to understand user intent and the meaning of content
        – Exploits this understanding at some part of the search process
      • Emerging field of research
        – Exploiting Semantic Annotations in Information Retrieval (2008-2012)
        – Semantic Search (SemSearch) workshop series (2008-2011)
        – Entity-oriented search workshop (2010-2011)
        – Joint Intl. Workshop on Semantic and Entity-oriented Search (2012)
        – SIGIR 2012 tracks on Structured Data and Entities
      • Related fields:
        – XML retrieval, Keyword search in databases, NL retrieval
  13. Search is required in the presence of ambiguity
      [Diagram, query side: Keywords; NL Questions; Form-/facet-based Inputs; Structured Queries (SPARQL)]
      [Diagram, data side: OWL ontologies with rich, formal semantics; Structured RDF data; Semi-Structured RDF data; RDF data embedded in text (RDFa)]
      Ambiguities: interpretation, extraction errors, data quality, confidence/trust
  14. Common tasks in Semantic Search
      [Diagram labels: semantic search; entity search; list search; list completion; related entity finding; SemSearch 2010/11; SemSearch 2011; TREC ELC task; TREC REF-LOD task]
  15. Related entity ranking in web search
  16. Motivation
      • Some users are short on time
        – Need for direct answers
        – Query expansion, question-answering, information boxes, rich results…
      • Other users have time on their hands
        – Long term interests such as sports, celebrities, movies and music
        – Long running tasks such as travel planning
  17. Example user sessions
  18. Spark: related entity recommendations in web search
      • A search assistance tool for exploration
      • Recommend related entities given the user’s current query
        – Cf. Entity Search at SemSearch, TREC Entity Track
      • Ranking explicit relations in a Knowledge Base
        – Cf. TREC Related Entity Finding in LOD (REF-LOD) task
      • A previous version of the system has been live since 2010
        – van Zwol et al.: Faceted exploration of image search results. WWW 2010: 961-970
  19. Spark example I.
  20. Spark example II.
  21. How does it work?
      [Diagram: Map/Reduce job graph of the Spark pipeline, listing job names, mapper/reducer classes and grid paths. Stages as labelled on the slide:]
      • Step 1: Feed parsing pipeline
      • Step 2: Preprocessing pipeline (before feature extraction and ranking); inputs: query logs, Flickr, Twitter
      • Steps 3a, 3c, 3d: Feature extraction (query terms, Flickr tags, tweets); per-source features: probability, conditional probability, joint probability, joint user probability, conditional user probability, entropy, PMI, unary KL divergence, cosine similarity
      • Steps 4a to 4d: Feature merging (query terms, query sessions, Flickr tags, tweets)
      • Step 5a: Feature extraction and merging (combined features)
      • Step 5b: Feature extraction and merging (graph)
      • Step 5c: Feature merging (popularity)
      • Step 6: Ranking pipeline (MLR scoring, disambiguation, ranking formatting)
      • Step 7: Datapack generation pipeline
  22. High-Level Architecture View
      [Diagram components: Entity graph; Data preprocessing; Feature extraction; Model learning; Feature sources; Editorial judgements; Datapack; Ranking model; Ranking and disambiguation; Entity data; Features]
  23. Spark Architecture [architecture diagram repeated]
  24. Data Preprocessing [architecture diagram repeated]
  25. Entity graph
      • 3.4 million entities, 160 million relations
        – Locations: Internet Locality, Wikipedia, Yahoo! Travel
        – Athletes, teams: Yahoo! Sports
        – People, characters, movies, TV shows, albums: DBpedia
      • Example entities
          DBpedia | Brad_Pitt | Brad Pitt | Movie_Actor
          DBpedia | Brad_Pitt | Brad Pitt | Movie_Producer
          DBpedia | Brad_Pitt | Brad Pitt | Person
          DBpedia | Brad_Pitt | Brad Pitt | TV_Actor
          DBpedia | Brad_Pitt_(boxer) | Brad Pitt | Person
      • Example relations
          DBpedia | DBpedia | Brad_Pitt | Angelina_Jolie | Person_IsPartnerOf_Person
          DBpedia | DBpedia | Brad_Pitt | Angelina_Jolie | MovieActor_CoCastsWith_MovieActor
          DBpedia | DBpedia | Brad_Pitt | Angelina_Jolie | MovieProducer_ProducesMovieCastedBy_MovieActor
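The example rows on the entity graph slide suggest a graph in which one entity can carry several types and a pair of entities can be linked by several typed relations. A minimal in-memory sketch of that shape follows; the class and method names (`EntityGraph`, `add_entity`, `add_relation`, `related`) are hypothetical and chosen for illustration, not taken from the Spark system.

```python
from collections import defaultdict
from typing import NamedTuple


class Relation(NamedTuple):
    source: str    # entity identifier, e.g. "Brad_Pitt"
    target: str    # entity identifier, e.g. "Angelina_Jolie"
    rel_type: str  # typed relation, e.g. "Person_IsPartnerOf_Person"


class EntityGraph:
    """Toy graph (illustrative only): entities hold a set of types, edges hold a relation type."""

    def __init__(self) -> None:
        self.types = defaultdict(set)       # entity id -> set of type labels
        self.out_edges = defaultdict(list)  # entity id -> list of Relation

    def add_entity(self, entity_id: str, entity_type: str) -> None:
        self.types[entity_id].add(entity_type)

    def add_relation(self, source: str, target: str, rel_type: str) -> None:
        self.out_edges[source].append(Relation(source, target, rel_type))

    def related(self, entity_id: str) -> dict:
        """Map each related entity to the list of relation types linking it to entity_id."""
        grouped = {}
        for rel in self.out_edges.get(entity_id, []):
            grouped.setdefault(rel.target, []).append(rel.rel_type)
        return grouped


# Rows copied from the slide's examples.
graph = EntityGraph()
for entity_type in ("Movie_Actor", "Movie_Producer", "Person", "TV_Actor"):
    graph.add_entity("Brad_Pitt", entity_type)
graph.add_entity("Brad_Pitt_(boxer)", "Person")
for rel_type in (
    "Person_IsPartnerOf_Person",
    "MovieActor_CoCastsWith_MovieActor",
    "MovieProducer_ProducesMovieCastedBy_MovieActor",
):
    graph.add_relation("Brad_Pitt", "Angelina_Jolie", rel_type)

print(graph.related("Brad_Pitt"))
```

Keeping the relation type on the edge matters for ranking: the same pair of entities can be related in several ways, and relation type is listed later in the deck as a ranking feature.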
  26. Entity graph challenges
      • Coverage of the query volume
        – New entities and entity types
        – Additional inference
        – International data
        – Aliases, e.g. jlo, big apple, thomas cruise mapother iv
      • Freshness
        – People query for a movie long before it’s released
      • Irrelevant entity and relation types
        – E.g. voice actors who co-acted in a movie, cities in a continent
      • Data quality
        – United States Senate career of Barack Obama is not a person
        – Andy Lau has never acted in Iron Man 3
  27. Feature extraction [architecture diagram repeated]
  28. Feature extraction from text
      • Text sources
        – Query terms
        – Query sessions
        – Flickr tags
        – Tweets
      • Common representation
          Input tweet: Brad Pitt married to Angelina Jolie in Las Vegas
          Output events: Brad Pitt + Angelina Jolie; Brad Pitt + Las Vegas; Angelina Jolie + Las Vegas
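The common representation on this slide turns one piece of text into all pairs of entities mentioned in it. A small sketch under stated assumptions: the tagger below is a naive case-insensitive substring match against a given entity list, a stand-in for whatever tagging the real pipeline uses, and the function names are hypothetical.

```python
from itertools import combinations


def tag_entities(text, known_entities):
    """Naive tagger (an assumption): known entity names found in the text, in order of appearance."""
    lowered = text.lower()
    found = [
        (lowered.find(name.lower()), name)
        for name in known_entities
        if name.lower() in lowered
    ]
    return [name for _, name in sorted(found)]


def cooccurrence_events(text, known_entities):
    """Emit one event per unordered pair of entities mentioned in the same text."""
    return list(combinations(tag_entities(text, known_entities), 2))


tweet = "Brad Pitt married to Angelina Jolie in Las Vegas"
entities = ["Las Vegas", "Angelina Jolie", "Brad Pitt"]
events = cooccurrence_events(tweet, entities)
for left, right in events:
    print(f"{left} + {right}")
```

The same function would apply unchanged to a query, a query session, or a set of Flickr tags, which is the point of a common representation: every text source reduces to the same stream of pair events.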
  29. Features
      • Unary
        – Popularity features from text: probability, entropy, wiki id popularity …
        – Graph features: PageRank on the entity graph, Wikipedia, web graph
        – Type features: entity type
      • Binary
        – Co-occurrence features from text: conditional probability, joint probability …
        – Graph features: common neighbors …
        – Type features: relation type
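As a rough illustration of the text-derived features named on this slide, the sketch below estimates probability, joint probability, conditional probability and pointwise mutual information from a list of events, where each event is the set of entities seen together in one text. The estimators are the plain relative-frequency ones and are an assumption; the deck does not say how the production features are smoothed or normalized.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_features(events):
    """Relative-frequency estimates over events (each event: iterable of entity names)."""
    n = len(events)
    unary = Counter()
    pair = Counter()
    for entities in events:
        unique = sorted(set(entities))         # count each entity once per event
        unary.update(unique)
        pair.update(combinations(unique, 2))   # pairs are keyed in sorted order

    prob = {entity: count / n for entity, count in unary.items()}
    features = {}
    for (a, b), count in pair.items():
        joint = count / n
        features[(a, b)] = {
            "joint_probability": joint,
            "cond_probability_b_given_a": joint / prob[a],
            "cond_probability_a_given_b": joint / prob[b],
            "pmi": math.log2(joint / (prob[a] * prob[b])),
        }
    return prob, features


# Made-up toy events for illustration only.
toy_events = [
    ["Brad Pitt", "Angelina Jolie"],
    ["Brad Pitt", "Fight Club"],
    ["Brad Pitt", "Angelina Jolie", "Las Vegas"],
    ["Las Vegas"],
]
prob, features = cooccurrence_features(toy_events)
print(prob["Brad Pitt"])
print(features[("Angelina Jolie", "Brad Pitt")])
```

Conditional probability is asymmetric while joint probability and PMI are symmetric, which matches the pipeline's separate handling of symmetric and asymmetric features.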
  30. Feature extraction challenges
      • Efficiency of text tagging
        – Hadoop Map/Reduce
      • More features are not always better
        – Can lead to over-fitting without sufficient training data
  31. Model Learning [architecture diagram repeated]
  32. Model Learning
      • Training data created by editors (five grades)
          400 | Brandi | adriana lima | Brad Pitt | person | Embarrassing
          1397 | David H. | andy garcia | Brad Pitt | person | Mostly Related
          3037 | Jennifer | benicio del toro | Brad Pitt | person | Somewhat Related
          4615 | Sarah | burn after reading | Brad Pitt | person | Excellent
          9853 | Jennifer | fight club movie | Brad Pitt | person | Perfect
      • Join between the editorial data and the feature file
      • Trained a regression model using GBDT
        – Gradient Boosted Decision Trees
      • 10-fold cross validation optimizing NDCG and tuning
        – number of trees
        – number of nodes per tree
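The slide names NDCG as the metric optimized during cross validation. The sketch below shows one common formulation of NDCG over graded judgements; it covers only the metric, not GBDT training. The numeric gains assigned to the five grades, and the ordering of "Mostly Related" above "Somewhat Related", are assumptions made for illustration, since the deck does not give them.

```python
import math

# Hypothetical grade-to-gain mapping (an assumption, not from the slides).
GRADE_GAIN = {
    "Perfect": 4,
    "Excellent": 3,
    "Mostly Related": 2,
    "Somewhat Related": 1,
    "Embarrassing": 0,
}


def dcg(gains):
    """Discounted cumulative gain with exponential gain and log2 position discount."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains))


def ndcg(ranked_grades, k=None):
    """NDCG of a ranked list of editorial grades, optionally truncated at rank k."""
    gains = [GRADE_GAIN[grade] for grade in ranked_grades]
    ideal = sorted(gains, reverse=True)  # best possible ordering of the same items
    if k is not None:
        gains, ideal = gains[:k], ideal[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0


ideal_order = ["Perfect", "Excellent", "Mostly Related", "Somewhat Related", "Embarrassing"]
print(ndcg(ideal_order))
print(ndcg(list(reversed(ideal_order))))
```

In a cross-validation loop, a value like this would be averaged over the held-out query entities of each fold and compared across settings of the number of trees and nodes per tree.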
  33. Feature importance
      RANK | FEATURE | IMPORTANCE
      1 | Relation type | 100
      2 | PageRank (Related entity) | 99.6075
      3 | Entropy – Flickr | 94.7832
      4 | Probability – Flickr | 82.6172
      5 | Probability – Query terms | 78.9377
      6 | Shared connections | 68.296
      7 | Cond. Probability – Flickr | 68.0496
      8 | PageRank (Entity) | 57.6078
      9 | KL divergence – Flickr | 55.4604
      10 | KL divergence – Query terms | 55.0662
  34. Impact of training data
      [Chart; x-axis: number of training instances (judged relations)]
  35. Performance by query-entity type
      • High overall performance but some types are more difficult
      • Locations
        – Editors downgrade popular entities such as businesses
      [Chart: NDCG by type of the query entity]
  36. 36. - 37 - Model Learning challenges • Editorial preferences do not necessarily coincide with usage – Users click a lot more on people than expected – Image bias? • Alternative: optimize for usage data – Clicks turned into labels or preferences – Size of the data is not a concern – Gains are computed from normalized CTR/COEC – See van Zwol et al. Ranking Entity Facets Based on User Click Feedback. ICSC 2010: 192-199.
  [Excerpt from the cited paper, shown as a cropped image on the slide; "…" marks text that cannot be recovered.]
  … shown in the search engine … the ground truth for creating … set used by the gradient … Conditional Probabilities … of the facets … function rank(e, f) that is … conditional probabilities extracted …
  \[ \mathrm{rank}(e,f) = \ldots + w_{qs} \times P_{qs}(f \mid e) + w_{ft} \times P_{ft}(f,e) \qquad (1) \]
  … are the conditional probabilities … the weights for the different … (qt), query session (qs) and … judgements collected for a couple of hundred entities and their facets we find that linear combination of the conditional probabilities gives … performance on the collected judgements using \(w_{qt} = 2\), \(w_{qs} = 0.5\), and \(w_{ft} = 1\). However, the editorial data was not substantial enough to learn a ranking with GBDT.
  Click-through Rate versus Click over Expected Click. From the image search query logs, we collect the user click data that is related to the facets. This allows us to compute the click-through rate (CTR) on a facet for a given entity that is detected in a user query and for which the facets were shown to the user. Let \(clicks_{e,f}\) be the number of clicks on a facet entity f shown in relation to entity e, and \(views_{e,f}\) the number of times the facet f is shown to a user for a related entity e; then the probability of a click on a facet entity f for a given entity e can be modelled as \(ctr_{e,f}\):
  \[ ctr_{e,f} = \frac{clicks_{e,f}}{views_{e,f}} \qquad (2) \]
  In Figure 3 the conditional click-through rate is shown for the first ten positions. It shows the CTR per position for every page view where one of the facets is clicked, aggregated over all entities. Observe that the CTR declines when the position at which a facet is shown increases. We introduce a second click model, based on the notion of clicks over expected clicks (COEC). This allows us to deal with the so-called position bias, where facets appearing in lower positions are less likely to be clicked even if they are relevant [2]. This phenomenon is often observed in Web search and we adopt the COEC model proposed by Chapelle and Zhang [11]. In that model, we estimate \(ctr_p\) as the aggregated CTR (over all queries and sessions) in position p for all positions P. Let then \(clicks_{e,f}\) be the number of clicks on a facet entity f shown in relation to entity e, and \(views_{e,f,p}\) the number of times the facet f is shown to a user for a related entity e at position p. The probability of a click over expected click on a facet entity f for a given entity e can then be modelled as \(coec_{e,f}\):
  \[ coec_{e,f} = \frac{clicks_{e,f}}{\sum_{p=1}^{P} views_{e,f,p} \times ctr_p} \qquad (3) \]
  Zhang and Jones [3] refer to this method as clicks over expected clicks, based on the denominator that includes the expected clicks given the positions that the url appeared in.
  C. Gradient Boosted Decision Trees. Stochastic gradient boosted decision trees (GBDT) is one of the most widely used learning algorithms today. Gradient tree boosting co… …sion model, utilizing decision trees … One advantage over other learn… trees in general is that the feat… are highly interpretable. GBDT … different loss functions can be used … research presented here we used … our loss function. In related work, pairwise and ranking specific loss functions … well at improving search relevance … shallow decision trees, trees in s… on a randomly selected subset of … prone to over-fitting [14].
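  The two click models in the excerpt, CTR per (entity, facet) pair and clicks over expected clicks (COEC), can be sketched as follows. This is a minimal illustration and not the authors' implementation: the impression-log layout, the function names, and the choice to estimate the per-position CTR from the same toy log are all assumptions.

```python
from collections import defaultdict


def position_ctr(impressions):
    """Aggregate CTR per display position over the whole log (ctr_p)."""
    views = defaultdict(int)
    clicks = defaultdict(int)
    for _entity, _facet, position, clicked in impressions:
        views[position] += 1
        clicks[position] += int(clicked)
    return {p: clicks[p] / views[p] for p in views}


def ctr_and_coec(impressions):
    """impressions: iterable of (entity, facet, position, clicked).

    Returns two dicts keyed by (entity, facet):
      ctr  = clicks / views
      coec = clicks / sum over impressions of ctr_p at the shown position
    """
    impressions = list(impressions)
    ctr_p = position_ctr(impressions)
    clicks = defaultdict(int)
    views = defaultdict(int)
    expected = defaultdict(float)
    for entity, facet, position, clicked in impressions:
        key = (entity, facet)
        clicks[key] += int(clicked)
        views[key] += 1
        expected[key] += ctr_p[position]  # expected clicks at this position
    ctr = {k: clicks[k] / views[k] for k in views}
    coec = {k: clicks[k] / expected[k] for k in views if expected[k] > 0}
    return ctr, coec


# Made-up log: facets A and C are shown at position 1, B and D at position 2.
log = [
    ("e", "A", 1, True), ("e", "A", 1, False),
    ("e", "B", 2, True), ("e", "B", 2, False),
    ("e", "C", 1, True), ("e", "C", 1, True),
    ("e", "D", 2, False), ("e", "D", 2, False),
]
ctr, coec = ctr_and_coec(log)
```

  In this toy log, facets A and B have the same raw CTR, but B earns its clicks in the less-clicked second position, so its COEC is higher. That is the position-bias correction the excerpt describes.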
  37. 37. - 38 - Ranking and Disambiguation [Pipeline diagram with the components: entity graph, data preprocessing, feature sources, feature extraction, features, editorial judgements, model learning, ranking model, entity data, ranking and disambiguation, datapack]
  38. 38. - 39 - Ranking and Disambiguation • We apply the ranking function offline to the data • Disambiguation – How many times was a given wiki ID retrieved for queries containing the entity name? (counts below) – PageRank for disambiguating locations (wiki IDs are not available) • Expansion to query patterns – Entity name + context, e.g. brad pitt actor
      Entity name | Wiki ID                 | Retrievals
      Brad Pitt   | Brad_Pitt               | 21158
      Brad Pitt   | Brad_Pitt_(boxer)       | 247
      XXX         | XXX_(movie)             | 1775
      XXX         | XXX_(Asia_album)        | 89
      XXX         | XXX_(ZZ_Top_album)      | 87
      XXX         | XXX_(Danny_Brown_album) | 67
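  The count-based disambiguation on this slide amounts to picking, for each entity name, the wiki ID that was retrieved most often. A minimal sketch using the counts from the slide follows; the function names and the in-memory index are hypothetical, and the talk does not describe how the production system stores or looks up these counts.

```python
from collections import defaultdict

# (entity name, wiki ID, retrieval count), copied from the slide.
COUNTS = [
    ("Brad Pitt", "Brad_Pitt", 21158),
    ("Brad Pitt", "Brad_Pitt_(boxer)", 247),
    ("XXX", "XXX_(movie)", 1775),
    ("XXX", "XXX_(Asia_album)", 89),
    ("XXX", "XXX_(ZZ_Top_album)", 87),
    ("XXX", "XXX_(Danny_Brown_album)", 67),
]


def build_index(rows):
    """Group candidates by lower-cased name, most retrieved first."""
    index = defaultdict(list)
    for name, wiki_id, count in rows:
        index[name.lower()].append((wiki_id, count))
    for candidates in index.values():
        candidates.sort(key=lambda c: c[1], reverse=True)
    return index


def disambiguate(index, name):
    """Return the most frequently retrieved wiki ID, or None if unknown."""
    candidates = index.get(name.lower())
    return candidates[0][0] if candidates else None


INDEX = build_index(COUNTS)
best_for_brad_pitt = disambiguate(INDEX, "brad pitt")
```

  Lower-casing the name is a simplification for the example; real query normalization would need more than that.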
  39. 39. - 40 - Ranking and Disambiguation challenges • Disambiguation cases that are too close to call – Fargo Fargo_(film) 3969 – Fargo Fargo,_North_Dakota 4578 • Disambiguation across Wikipedia and other sources
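  One hypothetical way to flag cases that are "too close to call" is to accept the top candidate only when it holds a minimum share of all retrievals for that name. The sketch below uses the Fargo counts from the slide; the `min_share` threshold of 0.6 is an arbitrary assumption for illustration, and the talk does not say how such cases are actually handled.

```python
def resolve(candidates, min_share=0.6):
    """candidates: dict mapping wiki ID to retrieval count.

    Returns (wiki_id, share) when the top candidate's share of retrievals
    reaches min_share, and (None, share) when the case is too close to call.
    """
    total = sum(candidates.values())
    if total == 0:
        return None, 0.0
    best = max(candidates, key=candidates.get)
    share = candidates[best] / total
    return (best if share >= min_share else None), share


fargo = {"Fargo_(film)": 3969, "Fargo,_North_Dakota": 4578}
brad = {"Brad_Pitt": 21158, "Brad_Pitt_(boxer)": 247}

fargo_choice, fargo_share = resolve(fargo)  # top share is about 0.54
brad_choice, brad_share = resolve(brad)     # top share is about 0.99
```

  With these counts the Fargo case is left unresolved while Brad Pitt resolves to the actor, matching the intuition on the slide.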
  40. 40. - 41 - Evaluation #2: Side-by-side testing • Comparing two systems – A/B comparison, e.g. current system under development and production system – Scale: A is better, B is better • Separate tests for relevance and image quality – Image quality can significantly influence user perceptions – Images can violate safe search rules • Classification of errors – Results: missing important results/contains irrelevant results, too few results, entities are not fresh, more/less diverse, should not have triggered – Images: bad photo choice, blurry, group shots, nude/racy etc. • Notes – Borderline, set one entities relate to the movie Psy but the query is most likely about Gangnam style – Blondie and Mickey Gilley are 70’s performers and do not belong on a list of 60’s musicians. – There is absolutely no relation between Finland and California.
  41. 41. - 42 - Evaluation #3: Bucket testing • Also called online evaluation – Comparing against baseline version of the system – Baseline does not change during the test • Small % of search traffic redirected to test system, another small % to the baseline system • Data collection over at least a week, looking for statistically significant differences that are also stable over time • Metrics in web search – Coverage and Click-through Rate (CTR) – Searches per browser-cookie (SPBC) – Other key metrics should not be impacted negatively, e.g. Abandonment and retry rate, Daily Active Users (DAU), Revenue Per Search (RPS), etc.
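  To make "statistically significant differences" concrete, here is a minimal sketch of a two-proportion z-test comparing the CTR of the baseline bucket with that of the test bucket. The choice of test, the function name, and the traffic numbers are assumptions for illustration; the talk does not state which significance test was used.

```python
import math


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference in CTR between buckets A and B.

    Assumes both buckets have views and the pooled CTR is strictly
    between 0 and 1 (otherwise the standard error is zero).
    Returns (z, p_value); z > 0 means bucket B has the higher CTR.
    """
    p_a = clicks_a / views_a
    p_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: baseline CTR 1.0 %, test CTR 1.1 %, 100,000 views each.
z, p_value = two_proportion_z_test(1000, 100_000, 1100, 100_000)
```

  A single test like this covers only the significance half of the slide's criterion. Checking that the difference is also stable over time would mean repeating the comparison per day or per week across the collection period.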
  42. 42. - 43 - Coverage before and after the new system • Before release: flat, lower • After release: flat, higher [Chart: coverage over days 0 to 80; series: coverage and trend before Spark, coverage and trend after Spark; the point where Spark is deployed in production is marked]
  43. 43. - 44 - Click-through rate (CTR) before and after the new system • Before release: gradually degrading performance due to lack of fresh data • After release: learning effect, users are starting to use the tool again [Chart: CTR over days 0 to 80; series: CTR and trend before Spark, CTR and trend after Spark; the point where Spark is deployed in production is marked]
  44. 44. - 45 - Summary • Spark – System for related entity recommendations • Knowledge base • Extraction of signals from query logs and other user-generated content • Machine learned ranking • Evaluation • Other applications – Recommendations on topic-entity pages
  45. 45. - 46 - Future work • New query types – Queries with multiple entities • adele skyfall – Question-answering on keyword queries • brad pitt movies • brad pitt movies 2010 • Extending coverage – Spark now live in CA, UK, AU, NZ, TW, HK, ES • Even fresher data – Stream processing of query log data • Data quality improvements • Online ranking with post-retrieval features
  46. 46. - 47 - The End • Many thanks to – Barla Cambazoglu and Roi Blanco (Barcelona) – Nicolas Torzec (US) – Libby Lin (Product Manager, US) – Search engineering (Taiwan) • Contact – pmika@yahoo-inc.com – @pmika

Editor's notes

  • This is how a human sees the world.
  • This is how a machine sees the world… Machines are not ‘intelligent’ and cannot ‘read’… they just see a string of symbols and try to match the user's input to that stream.
  • In fact, some of these searches are so hard that users don't even try them anymore.
  • With ads, the situation is even worse due to the sparsity problem. Note how poor the ads are…
  • Semantic search can be seen as a retrieval paradigm that is centered on the use of semantics: if a system incorporates the semantics entailed by the query and/or the resources into the matching process, it essentially performs semantic search.