From Queries to Answers in the Web

  1. From queries to answers in the Web. Roi Blanco, Sr. Research Scientist, Yahoo Labs
  2. Web Search today
  3. Web Search in 2001
  4. Search now answers queries
  5. Answers arrive even before finishing the query!
  6. Mobile shift
     Query length by device:
       Device    Avg. words   Avg. chars
       Desktop   2.73         17.44
       Tablet    2.88         18.02
       Mobile    3.05         18.93
     Song, Ma, Wang, Wang. Exploring and exploiting user search behavior on mobile and tablet devices to improve search relevance. WWW 2013
     Mobile categories are less skewed (Image 42%, Adult 23.5%, Navigational 15%) vs Desktop (Navigational 37%, Image 19.9%, Commerce 7.7%)
     There’s also a difference between top-level domains:
       Mobile               Desktop
       youtube.com          facebook.com
       wikipedia.org        yahoo.com
       answers.yahoo.com    wikipedia.org
       ehow.com             youtube.com
       imdb.com             walmart.com
  7. Q&A in search engines?
     Fagni, Perego, Silvestri, Orlando. Caching and prefetching query results by exploiting historical usage data. TOIS 2006
  8. The web search perspective
     Web search today is really fast, without necessarily being intelligent
       › A search engine without any understanding
     Trends
       › Convergence of search and online media
         • End of the 10 blue links
       › Personal, social search
         • Search over my world
         • Search using my profile
       › New interfaces
         • Contextual, interactive
       › Search that anticipates
       › Solve tasks not queries
  9. Search is really fast, without necessarily being intelligent. Could Watson explain why the answer is Toronto?
  10. We came to bury the 10 blue links: meaningless query
  11. We came to bury the 10 blue links: meaningful query
  12. Facebook is a search engine
  13. Personalized search: Yahoo news feed is a personalized search engine
  14. Search that anticipates
      Google Now
      Star Trek computer
        • Jason Douglas: Structured Data at Google, SemTechBiz SF 2013
  15. Interactive Voice Search
      Apple’s Siri
        › Question-Answering
          • Variety of backend sources including Wolfram Alpha and various Yahoo! services
        › Task completion
          • E.g. schedule an event
      Google Now
      Facebook’s M
  16. Facebook’s M mobile assistant
  17. Semantic Search
  18. Web search by 2009
      Large classes of queries are solved to perfection
      Improvements in web search are harder and harder to come by
        › Relevance models, hyperlink structure and interaction data
        › Combination of features using machine learning
        › Heavy investment in computational power
          • Real-time indexing, instant search, datacenters and edge services
      Search ranking features
        › Text matching (including anchor text)
        › Page authority (PageRank)
        › User behavior signals
        › Other features: context, history (still not very well understood)
  19. Poorly solved information needs remain
      Language issues
        › Multiple interpretations
          • jaguar
          • paris hilton
        › Secondary meaning
          • george bush (and I mean the beer brewer in Arizona)
        › Subjectivity
          • reliable digital camera
          • paris hilton sexy
        › Imprecise or overly precise searches
          • jim hendler
      Complex needs
        › Missing information
          • brad pitt zombie
          • florida man with 115 guns
          • 35 year old computer scientist living in barcelona
        › Category queries
          • countries in africa
          • barcelona nightlife
        › Transactional or computational queries
          • 120 dollars in euros
          • digital camera under 300 dollars
          • world temperature in 2020
      Many of these queries would not be asked by users, who have learned over time what search technology can and cannot do.
  20. Semantic Search: a definition
      Semantic search is a retrieval paradigm where
        › User intent and resources are represented using semantic models
          • Not just symbolic representations
        › Semantic models are exploited in the matching and ranking of resources
      Often a hybrid of document and data retrieval
        › Documents with metadata
          • Metadata may be embedded inside the document
          • I’m looking for documents that mention countries in Africa.
        › Data retrieval
          • Structured data, but searchable text fields
          • I’m looking for directors who have directed movies where the synopsis mentions dinosaurs.
      Wide range of semantic search systems
        › Employ different semantic models, possibly at different steps of the search process and in order to support different tasks
  21. Semantic Search – a process view
      Query Construction
        • Keywords
        • Forms
        • NL
        • Formal language
      Query Processing
        • IR-style matching & ranking
        • DB-style precise matching
        • KB-style matching & inferences
      Result Presentation
        • Query visualization
        • Document and data presentation
        • Summarization
      Query Refinement
        • Implicit feedback
        • Explicit feedback
        • Incentives
      [Diagram labels: Document Representation, Knowledge Representation, Semantic Models, Resources, Documents]
  22. Yahoo’s Knowledge Graph
      [Figure: example graph. Entities shown: Chicago Cubs, Chicago, Barack Obama, Carlos Zambrano, Brad Pitt, Angelina Jolie, Steven Soderbergh, George Clooney, Ocean’s Twelve, E/R, Fight Club, Dust Brothers. Relations shown: plays for, plays in, lives in, partner, directs, casts in, takes place in, music by. One node carries a "10% off tickets" offer.]
      Nicolas Torzec: Making knowledge reusable at Yahoo!: a Look at the Yahoo! Knowledge Base (SemTech 2013)
  23. The role of Information Extraction in Semantic Search
      Making sense of
        › Content
          • Web, News, Twitter, email, etc.
        › User behavior
          • Not just queries, also interaction
        › NER, NEC, NEL, time expressions, topic, event and relation extraction
      Mapping to an abstract representation
        › Linguistic models
          • Taxonomies, thesauri, dictionaries of entity names
          • Natural language structures extracted from text, e.g. using dependency parsing
          • Inference along linguistic relations, e.g. broader/narrower terms, textual entailment
        › Conceptual models
          • Ontologies capture entities in the world and their relationships
          • Words and phrases in text or records in a database are identified as representations of ontological elements
          • Inference along ontological relations, e.g. logical entailment
  24. Linguistic Representations of Text
      Example sentence: Pablo Picasso was born in Málaga, Spain.
      [Figure: the sentence shown at four levels: IR text (a bag of tokens), part-of-speech tagging (tags shown: NNP, VBD, VBN, IN), dependency parsing (tree with a Root node), and word-sense disambiguation.]
      Word-sense disambiguation example: born, S#2: (v) give birth, deliver, bear, birth, have (cause to be born) "My wife had twins yesterday!"
  25. Conceptual Representations of Text
      Example sentence: Pablo Picasso was born in Málaga, Spain.
      [Figure: the sentence shown as IR text, then with NER labels (PER, LOC, LOC), then mapped to an ontology (NED) with the relations born-in and city-in.]
  26. Document processing
      Goal
        › Provide a higher level representation of information in some conceptual space
        › Conceptual space is different for Semantic Web and NLP based search engines
      Limited document understanding in traditional search
        › Page structure such as fields, templates
        › Understanding of anchors, other HTML elements
        › Limited NLP
      In Semantic Search, more advanced text processing and/or reliance on explicit metadata
        › Information sources are not only text but also databases and web services
  27. Example: microformats and RDFa
      Microformat (hCard):
          <div class="vcard">
            <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
            <div class="tel">+1-919-555-7878</div>
            <div class="title">Area Administrator, Assistant</div>
          </div>
      RDFa:
          <p typeof="contact:Info" about="http://example.org/staff/jo">
            <span property="contact:fn">Jo Smith</span>.
            <span property="contact:title">Web hacker</span> at
            <a rel="contact:org" href="http://example.org">Example.org</a>.
            You can contact me
            <a rel="contact:email" href="mailto:jo@example.org">via email</a>.
          </p>
          ...
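To make the hCard example concrete, here is a minimal sketch of how a crawler might pull the properties out of that markup. It uses only Python's standard `html.parser`; the class `HCardParser`, the function `parse_hcard` and the small property set are illustrative names chosen here, not part of any microformats library, and the sketch ignores void elements and nested cards.

```python
from html.parser import HTMLParser

HCARD = """<div class="vcard">
  <a class="email fn" href="mailto:jfriday@host.com">Joe Friday</a>
  <div class="tel">+1-919-555-7878</div>
  <div class="title">Area Administrator, Assistant</div>
</div>"""


class HCardParser(HTMLParser):
    """Collects values of elements whose class attribute names an hCard property."""

    PROPERTIES = {"fn", "email", "tel", "title"}  # assumption: only these four

    def __init__(self):
        super().__init__()
        self.card = {}
        self._open = []  # one list of property names per currently open element

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        props = [c for c in classes if c in self.PROPERTIES]
        self._open.append(props)
        href = attrs.get("href") or ""
        # For "email" the value lives in the mailto: link, not in the text.
        if "email" in props and href.startswith("mailto:"):
            self.card["email"] = href[len("mailto:"):]

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text or not self._open:
            return
        for prop in self._open[-1]:
            if prop != "email":
                self.card.setdefault(prop, text)


def parse_hcard(html):
    parser = HCardParser()
    parser.feed(html)
    return parser.card


card = parse_hcard(HCARD)
print(card)
```

The point of the sketch is that class names alone carry the semantics: the parser needs no knowledge of the page layout, only of the shared vocabulary.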
  28. schema.org
      Agreement on a shared set of schemas for common types of web content
        › Bing, Google, and Yahoo! as initial founders (June 2011)
          • Yandex joins schema.org in Nov 2011
        › Similar in intent to sitemaps.org
          • Use a single format to communicate the same information to all three search engines
      schema.org covers areas of interest to all search engines
        › Business listings (local), creative works (video), recipes, reviews and more
        › Microdata, RDFa, JSON-LD syntax
      Collaborative effort
        › Growing number of 3rd party contributions
        › schema.org discussions at public-vocabs@w3.org
  29. Summary
      If we want to…
        › Answer queries, not just show links
        › Personalize search
        › Take context into account
        › Anticipate user needs
      … we need to understand users, content and the world at large!
      Search engines have changed considerably
        › Queries have changed
          • Users seek more information
          • Vertical search (travel, local, images, videos, news, etc.)
          • Will move towards a more task-oriented scenario (mobile context shift)
      Semantics help tail queries
        › Head queries solved mostly by clickthrough data
  30. Semantic Search at Yahoo
  31. Search over graph data
      Unstructured or hybrid search over RDF/graph data
        › Supporting end-users
          • Users who cannot express their need in SPARQL
        › Dealing with large-scale data
          • Giving up query expressivity for scale
        › Dealing with heterogeneity
          • Users who are unaware of the schema of the data
          • No single schema to the data
            – Example: 2.6m classes and 33k properties in Billion Triples 2009
      Entity search
        › Queries where the user is looking for a single entity named or described in the query
        › e.g. kaz vaporizer, hospice of cincinnati, mst3000
      Elbassuoni, Blanco. Keyword Search over RDF graphs. CIKM 2011
      Blanco, Mika, Vigna. Effective and Efficient entity search in RDF data. ISWC 2011
  32. Application: entity displays in web search
      Entity-seeking queries make up 40-50% of the query volume
        › Jeffrey Pound, Peter Mika, Hugo Zaragoza: Ad-hoc object retrieval in the web of data. WWW 2010
        › Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, Ariel Fuxman: Active objects: actions for entity-centric search. WWW 2012
      Entity displays
        › Show a summary of the most likely information needs
        › Include related entities for navigation
        › Roi Blanco, Berkant Barla Cambazoglu, Peter Mika, Nicolas Torzec: Entity Recommendations in Web Search. ISWC 2013
  33. Semantic understanding of queries
      Entities play an important role
        › [Pound et al, WWW 2010], [Lin et al, WWW 2012]
        › ~70% of queries contain a named entity (entity mention queries)
          • brad pitt height
        › ~50% of queries have an entity focus (entity seeking queries)
          • brad pitt attacked by fans
        › ~10% of queries are looking for a class of entities
          • brad pitt movies
      Entity mention query = <entity> {+ <intent>}
        › Intent is typically an additional word or phrase to
          • Disambiguate, most often by type, e.g. brad pitt actor
          • Specify action or aspect, e.g. brad pitt net worth, toy story trailer
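The decomposition "entity mention query = <entity> {+ <intent>}" can be sketched with a toy dictionary lookup. The dictionary `ENTITIES` and the function `split_entity_intent` are hypothetical names introduced for illustration; a real system would use a large alias dictionary and a scoring model rather than exact longest-span matching.

```python
# Hypothetical toy dictionary: surface form -> entity type.
ENTITIES = {
    "brad pitt": "Actor",
    "toy story": "Movie",
    "moneyball": "Movie",
}


def split_entity_intent(query):
    """Split a query into (entity, type, intent) using the longest known span.

    Returns None when no span of the query is a known entity.
    """
    tokens = query.lower().split()
    n = len(tokens)
    for length in range(n, 0, -1):            # try the longest spans first
        for start in range(n - length + 1):
            span = " ".join(tokens[start:start + length])
            if span in ENTITIES:
                rest = tokens[:start] + tokens[start + length:]
                intent = " ".join(rest) or None  # None when the query is only the entity
                return span, ENTITIES[span], intent
    return None


for q in ["brad pitt net worth", "toy story trailer", "moneyball", "map of africa"]:
    print(q, "->", split_entity_intent(q))
```

Whatever remains after removing the entity span is treated as the intent, which matches the slide's reading of "net worth" or "trailer" as the aspect or action the user is after.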
  34. Patterns in logs are hard to see
      Sample of sessions from June 2011 containing the term “moneyball”
        › What are users trying to do?
      [Figure: raw session log of queries and clicked hosts. Queries include "oakland as", "bradd pitt movie", "moneyball", "moneyball trailer", "moneyball the movie", "money ball movie trailer", "brad pitt moneyball oscar", "peter brand oakland", "captain america", "brad pitt news", "map of africa" and "relay for life calvert ocunty"; clicked hosts include movies.yahoo.com, en.wikipedia.org, www.imdb.com, nymag.com, moneyball.movie-trailer.com, www.zimbio.com, www.usaweekend.com, www.ivillage.com, news.search.yahoo.com, www.relayforlife.org and www.africaguide.com.]
  35. Semantic annotations help to generalize…
      [Figure: the session "oakland as", "bradd pitt movie", "moneyball trailer" (clicks on movies.yahoo.com and wikipedia.org), with its queries annotated by the entity types Sports team, Actor and Movie.]
  36. … and understand user needs
      [Figure: the query "moneyball trailer" split into the object of the query ("moneyball", a Movie) and what the user wants to do with it ("trailer").]
  37. Semantic analysis of query logs
      Multiple approaches
        › Dictionary tagging
          • Match entities in a fixed dictionary
          • Scalable, high recall, not very precise
        › Entity retrieval
          • Retrieve from an index of the knowledge base
        › Post-retrieval methods
          • Annotate a document corpus with entities
          • Retrieve documents and aggregate annotations
      Applications
        › Usage mining
          • L. Hollink, P. Mika and R. Blanco. Web Usage Mining with Semantic Analysis. WWW 2013
        › Related-entity recommendations
          • R. Blanco, B. Cambazoglu, P. Mika, N. Torzec: Entity Recommendations in Web Search. ISWC 2013
  38. Usage mining
      Site owners would like to find usage patterns
        › Reducing abandonment
        › Competitive analysis
      Problem: patterns are lost in the data
        › 64% of queries are unique within a year
        › Even the most frequent patterns have low support
  39. Solving the sparseness problem through annotations
      Frequent patterns of annotations are more general and less noisy
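A small sketch of why annotation helps with sparseness: raw queries rarely repeat, but once entity names are replaced by their types, distinct queries collapse into the same pattern. The dictionary `TYPES`, the functions `annotate` and `pattern_counts`, and the three sample sessions are made up for illustration (the queries echo the moneyball log sample); substring matching stands in for a real entity tagger.

```python
from collections import Counter

# Hypothetical dictionary: entity name -> type.
TYPES = {
    "oakland as": "Sports team",
    "brad pitt": "Actor",
    "moneyball": "Movie",
    "captain america": "Movie",
}


def annotate(query):
    """Replace the first known entity name in the query with its type."""
    q = query.lower()
    for name, etype in TYPES.items():
        if name in q:
            return q.replace(name, f"<{etype}>")
    return q


def pattern_counts(sessions, annotated=True):
    """Count how often each session pattern (a tuple of queries) occurs."""
    transform = annotate if annotated else str.lower
    return Counter(tuple(transform(q) for q in session) for session in sessions)


sessions = [
    ["moneyball trailer"],
    ["captain america trailer"],
    ["oakland as"],
]

raw = pattern_counts(sessions, annotated=False)
generalized = pattern_counts(sessions, annotated=True)
print(raw.most_common(1))          # every raw pattern occurs once
print(generalized.most_common(1))  # "<Movie> trailer" now has support 2
```

On real logs the same idea would be applied to multi-query sessions and click targets, so that a pattern such as "Movie, then Movie trailer" gathers support from sessions about many different films.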
  40. Two matching approaches
      Match by keywords
        › Closer to text retrieval
          • Match individual keywords
          • Score and aggregate
          • https://github.com/yahoo/Glimmer/
      Match by aliases
        › Closer to entity linking
          • Find potential mentions of entities (spots) in query
          • Score candidates for each spot
      [Figure: for the query "brad pitt", keyword matching produces separate candidate lists for "brad" ((actor), (boxer), (city)) and "pitt" ((actor), (boxer), (lake)); alias matching produces candidates for the whole spot "brad pitt" ((actor), (boxer)).]
  41. … back to query understanding
      [Figure: the query "moneyball trailer" split into the object of the query ("moneyball", a Movie) and what the user wants to do with it ("trailer").]
  42. Fast Entity Linking in Queries
      Use aliases to “entity pages” (Wikipedia, IMDB, local, etc.) as source of information for entity-query aliases
      Chunk the query into the most likely segmentation
      Be fast by avoiding entity-to-entity decisions when scoring
      Add context externally using semantic relatedness of keywords and entities
      Compression:
        › Minimal perfect hashes + Golomb coding
        › All Wiki + 1 year of query logs of aliases + 1 year of query sessions w2v model < 3GB
      Blanco, Ottaviano, Meij. Fast and space-efficient entity linking for queries. WSDM 2015
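The slide names Golomb coding as one of the compression ingredients but does not say how it is applied, so the following is only a generic sketch of the code itself: a non-negative integer n is split by a parameter m into a quotient (written in unary) and a remainder (written in truncated binary). The function names are chosen here for illustration, and bit strings are used instead of packed bytes to keep the example readable.

```python
def _bits(value, width):
    """Binary representation of value, left-padded to width (empty if width is 0)."""
    return format(value, "b").zfill(width) if width > 0 else ""


def golomb_encode(n, m):
    """Golomb codeword for integer n >= 0 with parameter m >= 1, as a bit string."""
    q, r = divmod(n, m)
    b = (m - 1).bit_length()        # ceil(log2(m))
    cutoff = (1 << b) - m           # remainders below cutoff use b - 1 bits
    if r < cutoff:
        tail = _bits(r, b - 1)
    else:
        tail = _bits(r + cutoff, b)
    return "1" * q + "0" + tail     # unary quotient, then truncated-binary remainder


def golomb_decode(bits, m):
    """Decode a concatenation of Golomb codewords into a list of integers."""
    b = (m - 1).bit_length()
    cutoff = (1 << b) - m
    out, i = [], 0
    while i < len(bits):
        q = 0
        while bits[i] == "1":       # unary part
            q += 1
            i += 1
        i += 1                      # skip the terminating 0
        r = 0
        if b > 0:
            if b > 1:
                r = int(bits[i:i + b - 1], 2)
                i += b - 1
            if r >= cutoff:         # long form: one more bit follows
                r = ((r << 1) | int(bits[i])) - cutoff
                i += 1
        out.append(q * m + r)
    return out


values = [0, 4, 7, 2, 11]
encoded = "".join(golomb_encode(v, 3) for v in values)
print(encoded, golomb_decode(encoded, 3))
```

Small values get short codewords, which is why such codes suit skewed distributions, for example gaps between sorted identifiers or quantized counts.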
  43. Problem definition
      Given
        › Query q consisting of an ordered list of tokens t_i
        › Segment s from a segmentation (a list of segments) drawn from all possible segmentations S_q
        › Entity e from a set of candidate entities drawn from the complete set E
      Find
        › For all possible segmentations and candidate entities
        › Select best entity for segment independently of other segments
  44. Intuitions
      Assume: also given annotated collections c_i with segments of text linked to entities from E.
      Keyphraseness
        › How likely is a segment to be an entity mention?
        › e.g. how common is “in” (unlinked) vs. “in” (linked) in the text
      Commonness
        › How likely is it that a linked segment refers to a particular entity?
        › e.g. how often does “brad pitt” refer to Brad Pitt (actor) vs. Brad Pitt (boxer)
  45. Ranking function
      [Equation not preserved in this transcript. Its labelled terms are: the probability of the segment being generated by a given collection, commonness, and keyphraseness.]
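The two named ingredients, keyphraseness and commonness, can be illustrated with a toy linker. This is a sketch under stated assumptions, not the formula from the slide: the counts in `SEEN` and `LINKS` are invented, the per-segment score is simply keyphraseness times commonness, and a segmentation is scored by the sum of its segment scores. What it does keep from the problem definition is that every segmentation is considered and the best entity is chosen for each segment independently of the others.

```python
# Hypothetical statistics from an annotated collection.
# SEEN: segment -> (times it appears as a link, times it appears at all)
SEEN = {
    "brad": (20, 1000),
    "pitt": (10, 500),
    "brad pitt": (900, 1000),
    "in": (1, 100000),
}
# LINKS: segment -> {entity: number of links pointing to that entity}
LINKS = {
    "brad": {"Brad_(name)": 20},
    "pitt": {"Pitt_(lake)": 10},
    "brad pitt": {"Brad_Pitt_(actor)": 880, "Brad_Pitt_(boxer)": 20},
    "in": {"Indiana": 1},
}


def keyphraseness(segment):
    """Fraction of occurrences of the segment that are linked to some entity."""
    linked, seen = SEEN.get(segment, (0, 1))
    return linked / seen


def commonness(segment, entity):
    """Fraction of the segment's links that point to this entity."""
    targets = LINKS[segment]
    return targets[entity] / sum(targets.values())


def best_entity(segment):
    """Best entity for one segment, chosen independently of other segments."""
    if segment not in LINKS:
        return None, 0.0
    entity = max(LINKS[segment], key=lambda e: LINKS[segment][e])
    return entity, keyphraseness(segment) * commonness(segment, entity)


def segmentations(tokens):
    """Yield every way of cutting the token list into contiguous segments."""
    if not tokens:
        yield []
        return
    for i in range(1, len(tokens) + 1):
        head = " ".join(tokens[:i])
        for rest in segmentations(tokens[i:]):
            yield [head] + rest


def link(query, min_score=0.01):
    """Return (segment, entity) pairs from the highest-scoring segmentation."""
    best_score, best_links = -1.0, []
    for segs in segmentations(query.lower().split()):
        scored = [(seg,) + best_entity(seg) for seg in segs]
        total = sum(score for _, _, score in scored)  # assumption: additive score
        if total > best_score:
            best_score = total
            best_links = [(s, e) for s, e, score in scored if score >= min_score]
    return best_links


print(link("brad pitt"))
print(link("brad pitt in"))
```

In the second query the token "in" is left unlinked because its keyphraseness is tiny, which is the behavior the "in (unlinked) vs. in (linked)" intuition describes. Enumerating all segmentations is exponential in query length; it is acceptable here only because queries are short.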
  46. Context-aware extension
      [Equation not preserved in this transcript. Its annotations are: one term is estimated by a word2vec representation; one step assumes that segment and query are independent of each other given the entity; another step assumes that segment and query are independent of each other.]
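One simple way to picture "context estimated by a word2vec representation" is to compare each candidate entity's vector with the centroid of the remaining query words. The sketch below is an assumption about the general idea, not the estimator used on the slide: the three-dimensional vectors in `VEC` are invented (real embeddings have hundreds of dimensions), and `context_score` and `pick_by_context` are illustrative names.

```python
from math import sqrt

# Hypothetical toy embeddings shared by words and entities.
VEC = {
    "movie": [0.9, 0.1, 0.0],
    "fight": [0.1, 0.9, 0.0],
    "Brad_Pitt_(actor)": [0.8, 0.2, 0.1],
    "Brad_Pitt_(boxer)": [0.1, 0.8, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def context_score(entity, context_words):
    """Similarity between an entity and the centroid of the known context words."""
    vectors = [VEC[w] for w in context_words if w in VEC]
    if not vectors or entity not in VEC:
        return 0.0
    centroid = [sum(column) / len(vectors) for column in zip(*vectors)]
    return cosine(VEC[entity], centroid)


def pick_by_context(candidates, context_words):
    """Choose the candidate entity that best matches the query context."""
    return max(candidates, key=lambda e: context_score(e, context_words))


candidates = ["Brad_Pitt_(actor)", "Brad_Pitt_(boxer)"]
print(pick_by_context(candidates, ["movie"]))
print(pick_by_context(candidates, ["fight"]))
```

Because the context score is computed per entity, without looking at other entities in the query, it can be added on top of a context-free score without changing how segments are scored.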
  47. Results: effectiveness
      Significant improvement over external baselines and internal system
        › Measured on public Webscope dataset Yahoo Search Query Log to Entities
      [Chart not preserved in this transcript. Systems compared:]
        • Search over Bing, top Wikipedia result
        • State-of-the-art in literature
        • A trivial search engine over Wikipedia
        • Our method: Fast Entity Linker (FEL)
        • FEL + context
  48. Results: efficiency
      Two orders of magnitude faster than state-of-the-art
        › Simplifying assumptions at scoring time
        › Adding context independently
        › Dynamic pruning
      Small memory footprint
        › Compression techniques, e.g. 10x reduction in word2vec storage
  49. Wrap-up
  50. Mobile search challenges and opportunities
      Interaction
        › Question-answering
        › Support for interactive retrieval
        › Spoken-language access
        › Task completion
      Contextualization
        › Personalization
        › Geo
        › Context (work/home/travel)
          • Try getaviate.com
  51. Task completion
      We would like to help our users in task completion
        › But we have trained our users to talk in nouns
          • Retrieval performance decreases by adding verbs to queries
        › We need to understand what the available actions are
      Modeling actions
        › Understand what actions can be taken on a page
        › Help users in mapping their query to potential actions
        › Applications in web search, email etc.
      Schema.org v1.2 including Actions published April 16, 2014
  52. The end
      Many thanks to the Semantic Search team in London
        › Peter Mika
        › Edgar Meij
        › Hugues Bouchard
      Joint work with many collaborators: Sebastiano Vigna, Laura Hollink, Giuseppe Ottaviano, Nicolas Torzec, among others.
      roi@yahoo-inc.com