Relevancy and synonyms - ApacheCon NA 2013 - Portland, Oregon, USA

A presentation on how to produce better semantic relevancy when combining stemmers, synonyms, and raw content in a DisMax field-weighting configuration.

Published in: Technology, Education


  1. About us
     • Founded in 2007
     • B2B startup
     • Semantic services started in 2011
     • Big Data projects in 2012
     • Training, consultancy, and support on Solr and Hadoop
  2. About me
     • Leo Oliveira
     • 15 years of experience with websites and search engines
     • Specialized in relevancy and semantics
     • Graduated in Business Management & IT Innovation
  3. The Importance of Relevancy
     • The game changer in search
       • Google became what it is thanks to relevancy algorithms and PageRank
     • Can be achieved through many different types of data
       • Cross-referencing, log analysis, social media, market research, new ideas, etc.
     • It’s about the user
       • And not about what we think of the user
     • If you don’t understand your data, find what is relevant through research
       • Research user behavior and needs, and you will find either a way to identify relevant information yourself or a way to discover what is relevant automatically.
     • If the user can’t find what they want using your search… they’ll leave.
  4. The Importance of Having a Result Set
     • Sometimes you have little data and users get a lot of zero-result queries.
     • Then it’s time to find the right words for the job
       • Synonym discovery, stemming
     • Analysis: understand what the user is searching for
     • Suggest other options
       • Sometimes a zero-result query is just a typo. Use suggestions to find the right results.
       • Create an interface that makes it easy for the user to try different keywords to get results
       • Analyze words that the same users search together to create special suggestions or second searches that bring back some results
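The typo-suggestion idea on this slide can be sketched in a few lines. This is a minimal illustration using Python's standard `difflib`, not the mechanism the presenters used; the `suggest` helper, the sample vocabulary, and the `cutoff` value are assumptions made for the example.

```python
import difflib


def suggest(query, vocabulary, max_suggestions=3, cutoff=0.75):
    """Return known terms that closely resemble a query that found nothing."""
    # Compare in lower case, but return the terms as they appear in the vocabulary.
    lowered = {term.lower(): term for term in vocabulary}
    matches = difflib.get_close_matches(
        query.lower(), list(lowered), n=max_suggestions, cutoff=cutoff
    )
    return [lowered[match] for match in matches]


# Hypothetical vocabulary, e.g. terms harvested from the index or from query logs.
vocabulary = ["engineer", "engineering", "computer", "notebook", "server"]

print(suggest("enginer", vocabulary))  # best match is listed first
print(suggest("compter", vocabulary))
```

In a real deployment the vocabulary would more likely come from the index itself, so that every suggestion is guaranteed to return results.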
  5. Relevancy and Semantics
     THE RELEVANCY RULES (the most forgotten ones)
     • Phrases are more relevant than single words
     • “Pure” words, that is, words the way they were typed, are more relevant than stemmed words or even synonyms
     • In general, newer docs are more relevant than older docs, but this rule can be reversed.
     • Closer venues and locations are more relevant than distant ones.
     • Give the right weights to the right fields (seems simple, but it’s not)
     • When you have a mix of relevant things…
       • For instance, on an e-commerce site where you must consider the freshest docs plus the lowest prices and the best sellers, you sometimes need to create a math formula to cope with all these items for relevancy. And it’s tricky.
       • You can boost each parameter separately, but then you must test your formulas very well so that no parameter is unintentionally prioritized over the others.
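One possible shape for the kind of "math formula" mentioned on this slide is sketched below. The function name, the squashing functions, and the default weights are all assumptions chosen for illustration; the slide does not give a formula. The point is only that each signal is normalized to the same range before weighting, so that no parameter dominates simply because its raw values are larger.

```python
import math


def combined_score(age_days, price, units_sold,
                   w_fresh=1.0, w_price=1.0, w_sales=1.0):
    """Blend freshness, price and sales into a single score in [0, 1]."""
    total_weight = w_fresh + w_price + w_sales
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")

    # Each signal is squashed into [0, 1] before weighting.
    freshness = 1.0 / (1.0 + age_days / 30.0)        # 1.0 when new, 0.5 after 30 days
    cheapness = 1.0 / (1.0 + math.log1p(price))      # lower price gives a higher value
    sales = math.log1p(units_sold)
    popularity = sales / (1.0 + sales)               # 0.0 with no sales, approaches 1.0

    weighted = w_fresh * freshness + w_price * cheapness + w_sales * popularity
    return weighted / total_weight


# Hypothetical products: (age in days, price, units sold).
print(combined_score(1, 20.0, 500))
print(combined_score(365, 20.0, 500))
```

Testing then amounts to checking, on real data, that changing one weight reorders results the way the business expects without burying the other signals.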
  6. Relevancy and Weighting
     USING DISMAX (an example of relevancy configuration)
     • With DisMax, you have two main options: qf and pf.
     • Phrase Fields (pf) weights should be bigger than the corresponding Query Fields (qf) weights
     • Field weights must be decided carefully. You can use a scale, such as Fibonacci numbers, so as not to lose control of what is more relevant and what is less relevant in your “search formula”
     EXAMPLE OF pf for a movies website:
       MovieName^34 MovieActors^21 MovieDirector^13 MovieYear^8
     EXAMPLE OF qf:
       MovieName^21 MovieActors^13 MovieDirector^8 MovieYear^5
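The Fibonacci scale in the movie example can be generated rather than typed by hand. The sketch below reproduces the slide's `qf` and `pf` values; the helper names and the choice of starting positions in the sequence are assumptions made for illustration.

```python
def fibonacci(start_index, count):
    """Return `count` consecutive Fibonacci numbers starting at F(start_index), with F(1) = F(2) = 1."""
    a, b = 1, 1  # F(1), F(2)
    for _ in range(start_index - 1):
        a, b = b, a + b
    values = []
    for _ in range(count):
        values.append(a)
        a, b = b, a + b
    return values


def weighted_fields(fields_by_importance, start_index):
    """Give the most important field the largest weight, the least important the smallest."""
    weights = fibonacci(start_index, len(fields_by_importance))
    weights.reverse()
    return " ".join(f"{field}^{weight}"
                    for field, weight in zip(fields_by_importance, weights))


fields = ["MovieName", "MovieActors", "MovieDirector", "MovieYear"]
qf = weighted_fields(fields, start_index=5)  # weights 21, 13, 8, 5
pf = weighted_fields(fields, start_index=6)  # one step further along the scale
print("qf =", qf)
print("pf =", pf)
```

Starting `pf` one position later than `qf` keeps every phrase weight above the single-word weight of the same field.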
  7. Synonym Theory vs. Synonyms in Search
  8. Synonym Theory vs. Synonyms in Search
     • In search, synonyms can get complicated
     • You don’t need to use only “real” synonyms
       • You must focus on user needs rather than on a thesaurus
       • You can use synonyms for other useful things, such as correcting the most common typos, or even mapping unrelated words that do the job for you
     • Sometimes you just need to get creative.
  9. 1-way transformations
     • SynonymFilter
     • E.g. Compter => computer
     • Very useful for common typos or misconceptions
     • Sometimes requires expansion.
     • In our setup, we’ll set a fieldtype that will do the job.
  10. 2-way transformations
      • SynonymFilter
      • E.g. computer, pc, mac, notebook, server
      • Expansion of meanings
      • Sometimes you don’t use exact synonyms. Brand names and other “real world” terms can be used.
      • It’s hard to configure and maintain 1-way and 2-way rules in a single synonyms.txt file.
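The difference between the two rule styles can be illustrated with a toy interpreter. This is a simplified Python sketch of how `a => b` rules and comma-separated lists behave with `expand` on or off, not Solr's implementation: it handles single-word entries only, and the function names are invented for the example.

```python
def parse_synonyms(lines, expand=True):
    """Build a token -> replacement-list mapping from synonym rules.

    "a => b"   is a 1-way rule: a is replaced by b.
    "a, b, c"  is a 2-way rule: with expand=True every term maps to all terms,
               with expand=False every term maps to the first one.
    """
    mapping = {}
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=>" in line:
            left, right = line.split("=>", 1)
            sources = [t.strip().lower() for t in left.split(",") if t.strip()]
            targets = [t.strip().lower() for t in right.split(",") if t.strip()]
        else:
            sources = [t.strip().lower() for t in line.split(",") if t.strip()]
            targets = sources if expand else sources[:1]
        for source in sources:
            replacements = mapping.setdefault(source, [])
            for target in targets:
                if target not in replacements:
                    replacements.append(target)
    return mapping


def apply_synonyms(tokens, mapping):
    """Replace each token by its mapped terms; unknown tokens pass through lower-cased."""
    result = []
    for token in tokens:
        result.extend(mapping.get(token.lower(), [token.lower()]))
    return result


corrections = parse_synonyms(["Compter => computer"], expand=False)
expansions = parse_synonyms(["computer, pc, mac, notebook, server"], expand=True)

print(apply_synonyms(["Compter"], corrections))
print(apply_synonyms(["pc"], expansions))
```

Keeping the two rule sets in separate mappings mirrors the slide's point that corrections and expansions are easier to maintain apart.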
  11. Index Time vs. Query Time
      • Most synonyms should work at query time only
      • But there are exceptions. E.g.: photoalbum, photo album
        • A phrase synonym might require an index-time synonym to work correctly
        • In our fieldtype, we’ll discuss how to achieve this.
      • The index-time synonym file is also useful when some terms contain symbols that could be dropped by the tokenizer. E.g. C++ => cpp or .NET => dotnet
      • Keep in mind that this index-time file holds the exceptions that don’t work well at query time.
        • Finding them sometimes requires testing; sometimes they are easy to identify, as with phrase synonyms.
  12. Some examples
      Query time
      • Enginer => Engineer (1-way)
      • Engineer, engineering (2-way with expand=true)
      • Computer, pc, mac
      Index time
      • C# => csharp
      • .NET => dotnet
      • Human resources, HR
      • Business intelligence, BI
      • CRM, Customer Relationship Management
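Why entries such as `C# => csharp` and `.NET => dotnet` belong in the index-time file can be seen in a toy analysis chain. The Python sketch below is a deliberately crude stand-in for a real analyzer: it assumes whitespace tokenization followed by a step that discards punctuation, and the function names are hypothetical.

```python
import re

# Index-time rules from the slide, applied to whole whitespace-separated tokens.
INDEX_TIME_SYNONYMS = {"c#": "csharp", ".net": "dotnet"}


def strip_symbols(token):
    """Crude stand-in for a later analysis step that discards punctuation."""
    return re.sub(r"[^0-9a-z]", "", token)


def analyze(text, synonyms=None):
    """Lower-case, split on whitespace, optionally map synonyms, then strip symbols."""
    tokens = text.lower().split()
    if synonyms:
        tokens = [synonyms.get(token, token) for token in tokens]
    stripped = (strip_symbols(token) for token in tokens)
    return [token for token in stripped if token]


text = "C# and .NET developer"
print(analyze(text))                       # symbols lost: "c#" collapses to "c"
print(analyze(text, INDEX_TIME_SYNONYMS))  # symbols protected by the mapping
```

Without the mapping, a search for "C#" becomes indistinguishable from a search for "C"; with it, the term survives as a distinct token.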
  13. Creating a fieldtype with semantic relevancy
      <fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/>
          <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.BrazilianStemFilterFactory"/>
          <filter class="solr.ASCIIFoldingFilterFactory"/>
        </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/>
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/>
          <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.BrazilianStemFilterFactory"/>
          <filter class="solr.ASCIIFoldingFilterFactory"/>
        </analyzer>
      </fieldType>
  14. Creating a fieldtype with semantic relevancy
      <fieldType class="solr.TextField" name="text_pt" positionIncrementGap="100">
        <analyzer type="index">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <!-- 1 --> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
          <!-- 2 --> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_indextime.txt" ignoreCase="true" expand="true"/>
          <!-- 3 --> <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
          <!-- 4 --> <filter class="solr.LowerCaseFilterFactory"/>
          <!-- 5 --> <filter class="solr.BrazilianStemFilterFactory"/>
          <!-- 6 --> <filter class="solr.ASCIIFoldingFilterFactory"/>
        </analyzer>
  15. Creating a fieldtype with semantic relevancy
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <!-- 1 --> <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
          <!-- 2 --> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_corrections.txt" ignoreCase="true" expand="false"/>
          <!-- 3 --> <filter class="solr.SynonymFilterFactory" synonyms="synonyms_expansions.txt" ignoreCase="true" expand="true"/>
          <!-- 4 --> <filter class="solr.WordDelimiterFilterFactory" catenateAll="0" catenateNumbers="0" catenateWords="0" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
          <!-- 5 --> <filter class="solr.LowerCaseFilterFactory"/>
          <!-- 6 --> <filter class="solr.BrazilianStemFilterFactory"/>
          <!-- 7 --> <filter class="solr.ASCIIFoldingFilterFactory"/>
        </analyzer>
      </fieldType>
  17. Creating a fieldtype with semantic relevancy
      <fieldType class="solr.TextField" name="text_pt_pure" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.WhitespaceTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" enablePositionIncrements="true" ignoreCase="true" words="stopwords_pt.txt"/>
          <filter catenateAll="0" catenateNumbers="1" catenateWords="1" class="solr.WordDelimiterFilterFactory" generateNumberParts="1" generateWordParts="1" splitOnCaseChange="1"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.ASCIIFoldingFilterFactory"/>
        </analyzer>
      </fieldType>
  18. Creating a relevant formula for search
      • Imagine you have a field named “Product_name”
      • To keep search results semantically relevant, you’ll need a “Product_name” field and also a “Product_name_pure” field
      • Keep in mind that phrases are also more relevant than individual words
      • In that case, here’s how solrconfig.xml would look:
        QF (query fields, single words)
          Product_name_pure^5 Product_name^3
        PF (phrase fields)
          Product_name_pure^8 Product_name^5
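The same weights can also be passed per request instead of being fixed in the configuration. The Python sketch below only builds the request URL; `defType`, `q`, `qf`, and `pf` are standard Solr query parameters, while the host, port, and the core name `products` are assumptions made for the example.

```python
from urllib.parse import urlencode, parse_qs


def build_select_url(base_url, user_query):
    """Build a DisMax select URL that favors exact ("pure") matches and phrases."""
    params = {
        "defType": "dismax",
        "q": user_query,
        # Single-word matches: the pure field outweighs the stemmed/synonym field.
        "qf": "Product_name_pure^5 Product_name^3",
        # Phrase matches: same ordering, with larger weights than qf.
        "pf": "Product_name_pure^8 Product_name^5",
    }
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical Solr location and core name.
url = build_select_url("http://localhost:8983/solr/products", "notebook bag")
print(url)
```

A document that contains the typed phrase in `Product_name_pure` collects both the `qf` and the `pf` boost, which is what pushes exact phrase matches above stemmed or synonym matches.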
  19. What will you get?
      • Semantically relevant results
        • Users will understand the results
        • Synonyms or stemmed tokens won’t disturb the results or create noise
      • Behavior similar to what the major search websites do for many different things, such as document search, e-commerce, intranet search, library search, etc.
      • The ability to find relevant results even in a scenario with millions or billions of documents.
  20. INFINITE POSSIBILITIES
  21. THANK YOU!
      Get in touch!
      Email: loliveira@semantix.com.br
      Twitter: @SemantixBR
