SlideShare ist ein Scribd-Unternehmen logo
1 von 53
Downloaden Sie, um offline zu lesen
Encores? - Going beyond
matching and ranking of
search results
Berlin Buzzwords 2021
Eric Pugh, René Kriegler
Who we are
@renekrie @dep4b
Combined 30 years of
experience in search
Open Source enthusiasts
ASF member, Committers on:
Solr, Querqy, SMUI, Quepid,
https://mices.co
Search - beyond matching and ranking
We tend to focus on matching and ranking. Other search features are almost
treated like an afterthought, like ‘encores’ that follow the main performance:
● Facets
● Query auto-completion
● Spelling correction
● Query relaxation
=> BUT: These are essential features that help the user formulate the query,
understand and narrow down the results
Our main act today (not encores!)
● Facets
● Query auto-completion
● Spelling correction
● Query relaxation
Learn about solutions that come out of the box (in Solr)
Typical challenges and how to overcome them
Advanced solutions: understand the concepts, create your own
The Art of Facets
Facets help the user ...
● understand the search results (see ‘what is there’, learn about the domain)
● narrow down search results
Chorus Electronics Project: https://github.com/querqy/chorus
Try the Demo Ecommerce Shop: http://chorus.dev.o19s.com:4000/
Facets help the user ...
● understand the search results (see ‘what is there’, learn about the domain)
● narrow down search results
Challenges
Getting the counts right in e-commerce search
Showing the best facets in the best order
Selecting the facet values to show
“Qui numerare incipit errare incipit”
Facet Counts
Facets and filters
A trivial example:
query=t-shirts
filter=color:black
Challenge: We still need to count all colours in the facets, even if the search result
contains only black t-shirts
Solution: Tagging and exclusion of filters
Facets and filters: tagging and exclusion
Tagging:
fq={!tag=f_color}color:black
Exclusion:
Facet param
facet.field={!ex=f_color}color
JSON facets
"facet": {
"color": {
"type": "terms",
"field": "color",
"domain": {
"excludeTags":"f_color"
}
}
}
Challenge: product variants
Product ID: 9739, brand: “inteemate”
Size: XS
Price: 11.99
Size: XL
Price: 11.99
Size: S
Price: 12.99
Size: S
Price: 12.99
Size: M
Price: 13.99
Size: L
Price: 13.99
Challenge: product variants
Product ID: 9739, brand: “inteemate”
Size: XS
Price: 11.99
Size: XL
Price: 11.99
Size: S
Price: 12.99
Size: S
Price: 12.99
Size: M
Price: 13.99
Size: L
Price: 13.99
color: [green, yellow, blue]
size: [XS, S, M, L, XL]
price: [11.99, 12.99, 13.99]
Merge into single document??
Facets would work great but
false matches for filter color:green AND size:M
Challenge: product variants
Best solution (in our opinion):
● Index one document for each variant
● Group variants at query time using the collapse query parser:
fq={!collapse field=productId}
=> Boolean filters work as expected
=> Great flexibility for counting facets
=> Fast enough
Challenge: product variants
Size: XS
Price: 11.99
Product: 9739
Brand: inteemate
Size: XL
Price: 11.99
Product: 9739
Brand: inteemate
Size: S
Price: 12.99
Product: 9739
Brand: inteemate
Size: S
Price: 12.99
Product: 9739
Brand: inteemate
Size: M
Price: 13.99
Product: 9739
Brand: inteemate
Size: L
Price: 13.99
Product: 9739
Brand: inteemate
filter query
fq=brand:inteemate
Challenge: product variants
Size: XS
Price: 11.99
Product: 9739
Brand: inteemate
Size: XL
Price: 11.99
Product: 9739
Brand: inteemate
Size: S
Price: 12.99
Product: 9739
Brand: inteemate
Size: S
Price: 12.99
Product: 9739
Brand: inteemate
Size: M
Price: 13.99
Product: 9739
Brand: inteemate
Size: L
Price: 13.99
Product: 9739
Brand: inteemate
filter query
fq=brand:inteemate
filter query
fq=color:blue
Challenge: product variants
Size: XS
Price: 11.99
Product: 9739
Brand: inteemate
Size: XL
Price: 11.99
Product: 9739
Brand: inteemate
Size: S
Price: 12.99
Product: 9739
Brand: inteemate
Size: S
Price: 12.99
Product: 9739
Brand: inteemate
Size: M
Price: 13.99
Product: 9739
Brand: inteemate
Size: L
Price: 13.99
Product: 9739
Brand: inteemate
filter query
fq=brand:inteemate
filter query
fq=color:blue
query
q=t-shirt
Challenge: product variants
Size: XS
Price: 11.99
Product: 9739
Brand: inteemate
Size: XL
Price: 11.99
Product: 9739
Brand: inteemate
Size: S
Price: 12.99
Product: 9739
Brand: inteemate
Size: S
Price: 12.99
Product: 9739
Brand: inteemate
Size: M
Price: 13.99
Product: 9739
Brand: inteemate
Size: L
Price: 13.99
Product: 9739
Brand: inteemate
filter query
fq=brand:inteemate
filter query
fq=color:blue
query
q=t-shirt
post filter query
fq={!collapse
field=productId}
Challenge: product variants in facets
Size: S
Price: 12.99
Product: 9739
Brand: inteemate
Size: M
Price: 13.99
Product: 9739
Brand: inteemate
Size: L
Price: 13.99
Product: 9739
Brand: inteemate
...
post filter query
fq={!collapse
field=productId}
Facet counts will be correct for
product attributes (“brand”)
Challenge: product variants in facets
Size: S
Price: 12.99
Product: 9739
Brand: inteemate
Size: M
Price: 13.99
Product: 9739
Brand: inteemate
Size: L
Price: 13.99
Product: 9739
Brand: inteemate
...
post filter query
fq={!collapse
field=productId
tag=coll}
For facet counts of variant
attributes we’ll have to tag and
exclude collapse filter:
"facet": {
"size": {
"type": "terms",
"field": "size",
"domain": {
"excludeTags":"coll"
}
}
}
Sizes S, M, L all shown as ‘1 result’ in facets ✅
Challenge: product variants in facets
Size: S
Price: 12.99
Product: 9739
Brand: inteemate
Size: M
Price: 13.99
Product: 9739
Brand: inteemate
Size: L
Price: 13.99
Product: 9739
Brand: inteemate
...
post filter query
fq={!collapse
field=productId
tag=coll}
For facet counts of variant
attributes we’ll have to tag and
exclude collapse filter:
"facet": {
"color": {
"type": "terms",
"field": "color",
"domain": {
"excludeTags":"coll"
}
}
}
Sizes S, M, L all shown as ‘1 result’ in facets ✅
Color ‘Blue’ shown as ‘3 results’ in facets ❌
Challenge: product variants in facets
Size: S
Price: 12.99
Product: 9739
Brand: inteemate
Size: M
Price: 13.99
Product: 9739
Brand: inteemate
Size: L
Price: 13.99
Product: 9739
Brand: inteemate
...
post filter query
fq={!collapse
field=productId
tag=coll}
For facet counts of variant
attributes we’ll have to tag and
exclude collapse filter:
"facet": {
"color": {
"type": "terms",
"field": "color",
"domain": {
"excludeTags":"coll"
},
facet: {
"numProducts":"unique(productId)"
}
}
}
Sizes S, M, L all shown as ‘1 result’ in facets ✅
Color ‘Blue’ shown as ‘1 result’ in facets ✅
Collapse query parser - notes on implementation
● Beware of high cardinality of product IDs.
○ If you have 10M different product IDs in your index, the collapse query parser will allocate
heap space for 2 arrays (float/int) x 10M elements (ca. 80 MB) per request!
○ Solution:
■ Many products have just 1 variant. It’s better to leave the productId empty in this case.
■ Combine with nullPolicy=expand, which avoids reserving array space for products
without a productId:
fq={!collapse field=productId nullPolicy=expand}
● All variants of a product must be indexed to the same shard
Facet Selection
Which facets should we show?
Some domains are rich in attributes. For example, electronics could use 10k
different attributes.
Even if we reduced the number of attributes to be used in facets at index time, we
could be left with several hundreds of candidates for facetting.
Building a request for hundreds of facets is not feasible. We’ll show a simple
solution, that will just use the search engine to select facets.
At the other end of the spectrum, you could train a model, that predicts which
facets to show for a given query.
Which facets should we show? - Solution
Index a field multivalued field that holds the names of the facettable fields...
Doc1:
screenSize: 17, ....
facettableFields: [
“screenSize”, “ramGB”, “height”, “width”, ...
]
Doc2:
...
facettableFields: [
“screenSize”, “numHDMIPorts”, “height”, “width”, ...
]
Which facets should we show? - Solution
... and execute an additional, prior facet request on this field. Add the facet values
returned by this request as facet parameters to the main request:
"facet": {
"facettable_fields": {
"type": "terms",
"field": "facettable_fields"
}
}
(query/filter queries are the same like in
the ‘main request’)
"facets":{
"facettable_fields":{
"buckets":[{
"val":"screenSize",
"count":12},
{
"val":"ramGB",
"count":4},
{
"val":"height",
"count":3},
]
}
}
Which facets should we show? - Solution
Index additional information together with the names of facettable fields
facettableFields: [
"00010;screenSize;Screen size ",
"00100;ramGB;Memory (GB) ",
"00005;height;Height ",
"00005;width;Width ",
...]
Importance
(padding makes
values sortable!)
Field name Label
Facet Value Selection
Which facet values?
Category Pills being dynamically included IF the entropy model says they are
meaningful for filtering the data
Shannon’s Entropy Worksheet
https://bit.ly/measure-diversity
(https://en.wikipedia.org/wiki/Entropy_(information_theory))
Auto-completion & spelling correction
Autocompletion - Using a Suggester
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="suggestAnalyzerFieldType">text_general</str>
<str name="buildOnCommit">true</str>
<str name="field">dictionary</str>
</lst>
</searchComponent>
Experiment with combinations of Lookups & Dictionary implementations.
Spellchecking - Using Solr component
Two flavours: “cofffee --> coffee”, collations: “expresso machine”
-->“espresso machine”
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">title</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.extendedResults">true</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.maxCollations">100</str>
<str name="spellcheck.maxCollationTries">5</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.collateParam.mm">100%</str>
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="suggestAnalyzerFieldType">text_general</str>
<str name="buildOnCommit">true</str>
<str name="field">dictionary</str>
</lst>
</searchComponent>
Good collations are
what we want!
Autocompletion - using a query index
User enters: ja
Show the best query completions
in the best order!
Using ideas from Joshua Bacher/Christine Bellstedt, Search
Suggestions - The Underestimated Killer Feature of your Online Shop.
Berlin Buzzwords 2018
Autocompletion - using a query index
User enters: ja
Match prefix in field “match”
q=*:*&fq=match:ja
indexed with EdgeNGramFilter,
lowercase, remove accents/ASCII
folding, ...
Optionally index and match
spelling variants (jacket/jakcet)
Autocompletion - using a query index
Sort by “weight desc”
q=*:*&fq=match:ja
&sort=weight desc
Sorting might get slow for short
prefixes if the query index is large -
tag the top N queries for lengths 1 and
2 and add another filter (fewer
matches to sort, nicely cacheable):
q=*:*&fq=match:ja
&sort=weight desc
&fq=top_len_2:true
Autocompletion - using a query index
If two queries have the same
fingerprint, drop the one with the
lower weight
Fingerprint: concatenated sorted,
normalised query tokens
This increases the diversity of the
suggestions.
Autocompletion - using a query index
Suggest the Labels as query
completions!
Autocompletion - using a query index
You can show the best matching
category for disambiguation and
affirmation:
* jacket
* jacket in Fashion
Spelling correction - using a query index
Structure similar to query index for
autocompletion
Copy of the ‘match’ field indexed
as n-grams
Spelling correction - using a query index
Filters on edit distance and rank
based on n-grams (via TF*IDF)
q=jakc jones
&defType=edismax
&qf=match_ngram
&sow=false
&fq=match:jakc jones~2
Spelling correction - using a query index
Add boost by weight (or a function
of it)
q=jakc jones
&defType=edismax
&qf=match_ngram
&sow=false
&fq=match:jakc jones~2
&boost=weight
General model for spelling correction &
autocompletion
Noisy Channel Model / Bayesian Inference
(Kernighan et. al., 1990; Jurafsky & Martin, 2009)
Our ‘Weight’ field
Edit distance, n-gram
model, keyboard layout
(Symspell!), ... prefix
match for autocompletion
Query relaxation
Query relaxation
Which query term should we drop if we can’t match all of them together?
jacket xs green
jacket xs green
jacket xs green
jacket xs green
iphone 12
iphone 12
iphone 12
Query relaxation - ‘mm’ anti-pattern
Loosening ‘minimum should match’ (mm) constraint to < 100%
iphone 12
iphone 12
You’ll get matches for “12”
She will just see probably imprecise results that don’t match her
query exactly.
You cannot tell the user what happened and which term you
dropped. She wouldn’t know what to do in order to improve the
query.
Don’t do this!
At least not in e-commerce search
Query relaxation - Solutions
René.Kriegler, Query Relaxation - a rewriting technique
between search and recommendations. Haystack
Conference 2019
Query relaxation - Solutions
Try searching with each term individually and drop the one from
the query that yields the fewest results (might require additional
rules to avoid just keeping number terms)
Query relaxation - Solutions
Multi-layer Neural Network,
Word embeddings as input to represent terms
Encores?
Facets, autocompletion, spelling correction, query relaxation are important
features of a search application.
We’ve shown simple out-of-the-box solutions and a path to implement more
advanced approaches.
Thank you!

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/SolrLet's Build an Inverted Index: Introduction to Apache Lucene/Solr
Let's Build an Inverted Index: Introduction to Apache Lucene/Solr
 
Front end-security
Front end-securityFront end-security
Front end-security
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
Role of Cataloger in the 21st Century Academic Library
Role of Cataloger in the 21st Century Academic LibraryRole of Cataloger in the 21st Century Academic Library
Role of Cataloger in the 21st Century Academic Library
 
Time Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy RyzaTime Series Analysis with Spark by Sandy Ryza
Time Series Analysis with Spark by Sandy Ryza
 
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache ArrowSimplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
 
ELK Stack
ELK StackELK Stack
ELK Stack
 
Centralized Logging System Using ELK Stack
Centralized Logging System Using ELK StackCentralized Logging System Using ELK Stack
Centralized Logging System Using ELK Stack
 
Tag based policies using Apache Atlas and Ranger
Tag based policies using Apache Atlas and RangerTag based policies using Apache Atlas and Ranger
Tag based policies using Apache Atlas and Ranger
 
Introduction to DSpace
Introduction to DSpaceIntroduction to DSpace
Introduction to DSpace
 
Query relaxation - A rewriting technique between search and recommendations
Query relaxation - A rewriting technique between search and recommendationsQuery relaxation - A rewriting technique between search and recommendations
Query relaxation - A rewriting technique between search and recommendations
 
Raffi Krikorian, Twitter Timelines at Scale
Raffi Krikorian, Twitter Timelines at ScaleRaffi Krikorian, Twitter Timelines at Scale
Raffi Krikorian, Twitter Timelines at Scale
 
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
The Columnar Era: Leveraging Parquet, Arrow and Kudu for High-Performance Ana...
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Understanding Data Consistency in Apache Cassandra
Understanding Data Consistency in Apache CassandraUnderstanding Data Consistency in Apache Cassandra
Understanding Data Consistency in Apache Cassandra
 
Apache Kafka in the Airline, Aviation and Travel Industry
Apache Kafka in the Airline, Aviation and Travel IndustryApache Kafka in the Airline, Aviation and Travel Industry
Apache Kafka in the Airline, Aviation and Travel Industry
 
Making DSpace XMLUI Your Own
Making DSpace XMLUI Your OwnMaking DSpace XMLUI Your Own
Making DSpace XMLUI Your Own
 
Moving Gigantic Files Into and Out of the Alfresco Repository
Moving Gigantic Files Into and Out of the Alfresco RepositoryMoving Gigantic Files Into and Out of the Alfresco Repository
Moving Gigantic Files Into and Out of the Alfresco Repository
 
Oak, the Architecture of the new Repository
Oak, the Architecture of the new RepositoryOak, the Architecture of the new Repository
Oak, the Architecture of the new Repository
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 

Ähnlich wie Encores

Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
Jean Silva
 
QM-008-Design for Six Sigma 1
QM-008-Design for Six Sigma 1QM-008-Design for Six Sigma 1
QM-008-Design for Six Sigma 1
handbook
 
Samples of Database and Website Design
Samples of Database and Website DesignSamples of Database and Website Design
Samples of Database and Website Design
Sherri Orwick Ogden
 

Ähnlich wie Encores (20)

Strategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero resultsStrategies for using alternative queries to mitigate zero results
Strategies for using alternative queries to mitigate zero results
 
Feature surfacing - meetup
Feature surfacing  - meetupFeature surfacing  - meetup
Feature surfacing - meetup
 
Product Line And Product Development Exercises
Product Line And Product Development ExercisesProduct Line And Product Development Exercises
Product Line And Product Development Exercises
 
The product is not "the product". Who owns it anyway?
The product is not "the product". Who owns it anyway? The product is not "the product". Who owns it anyway?
The product is not "the product". Who owns it anyway?
 
Design handoffs with design systems in mind
Design handoffs with design systems in mindDesign handoffs with design systems in mind
Design handoffs with design systems in mind
 
Make money with industrial design. Jukola7 presentation 2014
Make money with industrial design. Jukola7 presentation 2014Make money with industrial design. Jukola7 presentation 2014
Make money with industrial design. Jukola7 presentation 2014
 
QM-008-Design for Six Sigma 1
QM-008-Design for Six Sigma 1QM-008-Design for Six Sigma 1
QM-008-Design for Six Sigma 1
 
Applying NLP to product comparison at visual meta
Applying NLP to product comparison at visual metaApplying NLP to product comparison at visual meta
Applying NLP to product comparison at visual meta
 
Retail referencearchitecture productcatalog
Retail referencearchitecture productcatalogRetail referencearchitecture productcatalog
Retail referencearchitecture productcatalog
 
Samples of Database and Website Design
Samples of Database and Website DesignSamples of Database and Website Design
Samples of Database and Website Design
 
Writing a specification
Writing a specificationWriting a specification
Writing a specification
 
DC Group
DC Group DC Group
DC Group
 
Boosting Product Categorization with Machine Learning
Boosting Product Categorization with Machine LearningBoosting Product Categorization with Machine Learning
Boosting Product Categorization with Machine Learning
 
Just ask: Designing intent-driven algos
Just ask: Designing intent-driven algosJust ask: Designing intent-driven algos
Just ask: Designing intent-driven algos
 
Vb introduction.
Vb introduction.Vb introduction.
Vb introduction.
 
SXSW After Party
SXSW After PartySXSW After Party
SXSW After Party
 
Building complex UI on Android
Building complex UI on AndroidBuilding complex UI on Android
Building complex UI on Android
 
4.Data-Visualization.pptx
4.Data-Visualization.pptx4.Data-Visualization.pptx
4.Data-Visualization.pptx
 
Marketing Analytics with R Lifting Campaign Success Rates
Marketing Analytics with R Lifting Campaign Success RatesMarketing Analytics with R Lifting Campaign Success Rates
Marketing Analytics with R Lifting Campaign Success Rates
 
New Principles for Digital Experiences That Perform
New Principles for Digital Experiences That PerformNew Principles for Digital Experiences That Perform
New Principles for Digital Experiences That Perform
 

Mehr von OpenSource Connections

Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
OpenSource Connections
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
OpenSource Connections
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...
OpenSource Connections
 

Mehr von OpenSource Connections (20)

Test driven relevancy
Test driven relevancyTest driven relevancy
Test driven relevancy
 
How To Structure Your Search Team for Success
How To Structure Your Search Team for SuccessHow To Structure Your Search Team for Success
How To Structure Your Search Team for Success
 
The right path to making search relevant - Taxonomy Bootcamp London 2019
The right path to making search relevant  - Taxonomy Bootcamp London 2019The right path to making search relevant  - Taxonomy Bootcamp London 2019
The right path to making search relevant - Taxonomy Bootcamp London 2019
 
Payloads and OCR with Solr
Payloads and OCR with SolrPayloads and OCR with Solr
Payloads and OCR with Solr
 
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie HullHaystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
Haystack 2019 Lightning Talk - The Future of Quepid - Charlie Hull
 
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim AllisonHaystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
Haystack 2019 Lightning Talk - State of Apache Tika - Tim Allison
 
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
Haystack 2019 Lightning Talk - Relevance on 17 million full text documents - ...
 
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj BharadwajHaystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
Haystack 2019 Lightning Talk - Solr Cloud on Kubernetes - Manoj Bharadwaj
 
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
Haystack 2019 Lightning Talk - Quaerite a Search relevance evaluation toolkit...
 
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan KohlHaystack 2019 - Search-based recommendations at Politico - Ryan Kohl
Haystack 2019 - Search-based recommendations at Politico - Ryan Kohl
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon Hughes
 
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey GraingerHaystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
Haystack 2019 - Natural Language Search with Knowledge Graphs - Trey Grainger
 
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
Haystack 2019 - Search Logs + Machine Learning = Auto-Tagging Inventory - Joh...
 
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
Haystack 2019 - Improving Search Relevance with Numeric Features in Elasticse...
 
Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...Haystack 2019 - Architectural considerations on search relevancy in the conte...
Haystack 2019 - Architectural considerations on search relevancy in the conte...
 
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
Haystack 2019 - Custom Solr Query Parser Design Option, and Pros & Cons - Ber...
 
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...Haystack 2019 - Establishing a relevance focused culture in a large organizat...
Haystack 2019 - Establishing a relevance focused culture in a large organizat...
 
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
Haystack 2019 - Solving for Satisfaction: Introduction to Click Models - Eliz...
 
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
2019 Haystack - How The New York Times Tackles Relevance - Jeremiah Via
 
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
Haystack 2019 - Addressing variance in AB tests: Interleaved evaluation of ra...
 

Kürzlich hochgeladen

Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
UXDXConf
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Peter Udo Diehl
 

Kürzlich hochgeladen (20)

Structuring Teams and Portfolios for Success
Structuring Teams and Portfolios for SuccessStructuring Teams and Portfolios for Success
Structuring Teams and Portfolios for Success
 
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
Integrating Telephony Systems with Salesforce: Insights and Considerations, B...
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
ASRock Industrial FDO Solutions in Action for Industrial Edge AI _ Kenny at A...
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
Measures in SQL (a talk at SF Distributed Systems meetup, 2024-05-22)
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024What's New in Teams Calling, Meetings and Devices April 2024
What's New in Teams Calling, Meetings and Devices April 2024
 
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi IbrahimzadeFree and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
Free and Effective: Making Flows Publicly Accessible, Yumi Ibrahimzade
 
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdfSimplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
Simplified FDO Manufacturing Flow with TPMs _ Liam at Infineon.pdf
 
ECS 2024 Teams Premium - Pretty Secure
ECS 2024   Teams Premium - Pretty SecureECS 2024   Teams Premium - Pretty Secure
ECS 2024 Teams Premium - Pretty Secure
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
AI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří KarpíšekAI revolution and Salesforce, Jiří Karpíšek
AI revolution and Salesforce, Jiří Karpíšek
 
How we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdfHow we scaled to 80K users by doing nothing!.pdf
How we scaled to 80K users by doing nothing!.pdf
 
Powerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara LaskowskaPowerful Start- the Key to Project Success, Barbara Laskowska
Powerful Start- the Key to Project Success, Barbara Laskowska
 
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo DiehlFuture Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
Future Visions: Predictions to Guide and Time Tech Innovation, Peter Udo Diehl
 
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
TEST BANK For, Information Technology Project Management 9th Edition Kathy Sc...
 
Google I/O Extended 2024 Warsaw
Google I/O Extended 2024 WarsawGoogle I/O Extended 2024 Warsaw
Google I/O Extended 2024 Warsaw
 
Syngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdfSyngulon - Selection technology May 2024.pdf
Syngulon - Selection technology May 2024.pdf
 
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone KomSalesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
Salesforce Adoption – Metrics, Methods, and Motivation, Antone Kom
 

Encores

  • 1. Encores? - Going beyond matching and ranking of search results Berlin Buzzwords 2021 Eric Pugh, René Kriegler
  • 2. Who we are @renekrie @dep4b Combined 30 years of experience in search Open Source enthusiasts ASF member, Committers on: Solr, Querqy, SMUI, Quepid,
  • 4. Search - beyond matching and ranking We tend to focus on matching and ranking. Other search features are almost treated like an afterthought, like ‘encores’ that follow the main performance: ● Facets ● Query auto-completion ● Spelling correction ● Query relaxation => BUT: These are essential features that help the user formulate the query, understand and narrow down the results
  • 5. Our main act today (not encores!) ● Facets ● Query auto-completion ● Spelling correction ● Query relaxation Learn about solutions that come out of the box (in Solr) Typical challenges and how to overcome them Advanced solutions: understand the concepts, create your own
  • 6. The Art of Facets
  • 7. Facets help the user ... ● understand the search results (see ‘what is there’, learn about the domain) ● narrow down search results Chorus Electronics Project: https://github.com/querqy/chorus Try the Demo Ecommerce Shop: http://chorus.dev.o19s.com:4000/
  • 8. Facets help the user ... ● understand the search results (see ‘what is there’, learn about the domain) ● narrow down search results
  • 9. Challenges Getting the counts right in e-commerce search Showing the best facets in the best order Selecting the facet values to show
  • 10. “Qui numerare incipit errare incipit” Facet Counts
  • 11. Facets and filters A trivial example: query=t-shirts filter=color:black Challenge: We still need to count all colours in the facets, even if the search result contains only black t-shirts Solution: Tagging and exclusion of filters
  • 12. Facets and filters: tagging and exclusion Tagging: fq={!tag=f_color}color:black Exclusion: Facet param facet.field={!ex=f_color}color JSON facets "facet": { "color": { "type": "terms", "field": "color", "domain": { "excludeTags":"f_color" } } }
  • 13. Challenge: product variants Product ID: 9739, brand: “inteemate” Size: XS Price: 11.99 Size: XL Price: 11.99 Size: S Price: 12.99 Size: S Price: 12.99 Size: M Price: 13.99 Size: L Price: 13.99
  • 14. Challenge: product variants Product ID: 9739, brand: “inteemate” Size: XS Price: 11.99 Size: XL Price: 11.99 Size: S Price: 12.99 Size: S Price: 12.99 Size: M Price: 13.99 Size: L Price: 13.99 color: [green, yellow, blue] size: [XS, S, M, L, XL] price: [11.99, 12.99, 13.99] Merge into single document?? Facets would work great but false matches for filter color:green AND size:M
  • 15. Challenge: product variants Best solution (in our opinion): ● Index one document for each variant ● Group variants at query time using the collapse query parser: fq={!collapse field=productId} => Boolean filters work as expected => Great flexibility for counting facets => Fast enough
  • 16. Challenge: product variants Size: XS Price: 11.99 Product: 9739 Brand: inteemate Size: XL Price: 11.99 Product: 9739 Brand: inteemate Size: S Price: 12.99 Product: 9739 Brand: inteemate Size: S Price: 12.99 Product: 9739 Brand: inteemate Size: M Price: 13.99 Product: 9739 Brand: inteemate Size: L Price: 13.99 Product: 9739 Brand: inteemate filter query fq=brand:inteemate
  • 17. Challenge: product variants Size: XS Price: 11.99 Product: 9739 Brand: inteemate Size: XL Price: 11.99 Product: 9739 Brand: inteemate Size: S Price: 12.99 Product: 9739 Brand: inteemate Size: S Price: 12.99 Product: 9739 Brand: inteemate Size: M Price: 13.99 Product: 9739 Brand: inteemate Size: L Price: 13.99 Product: 9739 Brand: inteemate filter query fq=brand:inteemate filter query fq=color:blue
  • 18. Challenge: product variants Size: XS Price: 11.99 Product: 9739 Brand: inteemate Size: XL Price: 11.99 Product: 9739 Brand: inteemate Size: S Price: 12.99 Product: 9739 Brand: inteemate Size: S Price: 12.99 Product: 9739 Brand: inteemate Size: M Price: 13.99 Product: 9739 Brand: inteemate Size: L Price: 13.99 Product: 9739 Brand: inteemate filter query fq=brand:inteemate filter query fq=color:blue query q=t-shirt
  • 19. Challenge: product variants Size: XS Price: 11.99 Product: 9739 Brand: inteemate Size: XL Price: 11.99 Product: 9739 Brand: inteemate Size: S Price: 12.99 Product: 9739 Brand: inteemate Size: S Price: 12.99 Product: 9739 Brand: inteemate Size: M Price: 13.99 Product: 9739 Brand: inteemate Size: L Price: 13.99 Product: 9739 Brand: inteemate filter query fq=brand:inteemate filter query fq=color:blue query q=t-shirt post filter query fq={!collapse field=productId}
  • 20. Challenge: product variants in facets Size: S Price: 12.99 Product: 9739 Brand: inteemate Size: M Price: 13.99 Product: 9739 Brand: inteemate Size: L Price: 13.99 Product: 9739 Brand: inteemate ... post filter query fq={!collapse field=productId} Facet counts will be correct for product attributes (“brand”)
  • 21. Challenge: product variants in facets Size: S Price: 12.99 Product: 9739 Brand: inteemate Size: M Price: 13.99 Product: 9739 Brand: inteemate Size: L Price: 13.99 Product: 9739 Brand: inteemate ... post filter query fq={!collapse field=productId tag=coll} For facet counts of variant attributes we’ll have to tag and exclude collapse filter: "facet": { "size": { "type": "terms", "field": "size", "domain": { "excludeTags":"coll" } } } Sizes S, M, L all shown as ‘1 result’ in facets ✅
  • 22. Challenge: product variants in facets Size: S Price: 12.99 Product: 9739 Brand: inteemate Size: M Price: 13.99 Product: 9739 Brand: inteemate Size: L Price: 13.99 Product: 9739 Brand: inteemate ... post filter query fq={!collapse field=productId tag=coll} For facet counts of variant attributes we’ll have to tag and exclude collapse filter: "facet": { "color": { "type": "terms", "field": "color", "domain": { "excludeTags":"coll" } } } Sizes S, M, L all shown as ‘1 result’ in facets ✅ Color ‘Blue’ shown as ‘3 results’ in facets ❌
  • 23. Challenge: product variants in facets Size: S Price: 12.99 Product: 9739 Brand: inteemate Size: M Price: 13.99 Product: 9739 Brand: inteemate Size: L Price: 13.99 Product: 9739 Brand: inteemate ... post filter query fq={!collapse field=productId tag=coll} For facet counts of variant attributes we’ll have to tag and exclude collapse filter: "facet": { "color": { "type": "terms", "field": "color", "domain": { "excludeTags":"coll" }, facet: { "numProducts":"unique(productId)" } } } Sizes S, M, L all shown as ‘1 result’ in facets ✅ Color ‘Blue’ shown as ‘1 result’ in facets ✅
  • 24. Collapse query parser - notes on implementation ● Beware of high cardinality of product IDs. ○ If you have 10M different product IDs in your index, the collapse query parser will allocate heap space for 2 arrays (float/int) x 10M elements (ca. 80 MB) per request! ○ Solution: ■ Many products have just 1 variant. It’s better to leave the productId empty in this case. ■ Combine with nullPolicy=expand, which avoids reserving array space for products without a productId: fq={!collapse field=productId nullPolicy=expand} ● All variants of a product must be indexed to the same shard
  • 26. Which facets should we show? Some domains are rich in attributes. For example, electronics could use 10k different attributes. Even if we reduced the number of attributes to be used in facets at index time, we could be left with several hundreds of candidates for facetting. Building a request for hundreds of facets is not feasible. We’ll show a simple solution, that will just use the search engine to select facets. At the other end of the spectrum, you could train a model, that predicts which facets to show for a given query.
  • 27. Which facets should we show? - Solution Index a field multivalued field that holds the names of the facettable fields... Doc1: screenSize: 17, .... facettableFields: [ “screenSize”, “ramGB”, “height”, “width”, ... ] Doc2: ... facettableFields: [ “screenSize”, “numHDMIPorts”, “height”, “width”, ... ]
  • 28. Which facets should we show? - Solution ... and execute an additional, prior facet request on this field. Add the facet values returned by this request as facet parameters to the main request: "facet": { "facettable_fields": { "type": "terms", "field": "facettable_fields" } } (query/filter queries are the same like in the ‘main request’) "facets":{ "facettable_fields":{ "buckets":[{ "val":"screenSize", "count":12}, { "val":"ramGB", "count":4}, { "val":"height", "count":3}, ] } }
  • 29. Which facets should we show? - Solution Index additional information together with the names of facettable fields facettableFields: [ "00010;screenSize;Screen size ", "00100;ramGB;Memory (GB) ", "00005;height;Height ", "00005;width;Width ", ...] Importance (padding makes values sortable!) Field name Label
  • 31. Which facet values? Category Pills being dynamically included IF the entropy model says they are meaningful for filtering the data
  • 34. Autocompletion - Using a Suggester <searchComponent name="suggest" class="solr.SuggestComponent"> <lst name="suggester"> <str name="name">mySuggester</str> <str name="lookupImpl">FuzzyLookupFactory</str> <str name="suggestAnalyzerFieldType">text_general</str> <str name="buildOnCommit">true</str> <str name="field">dictionary</str> </lst> </searchComponent> Experiment with combinations of Lookups & Dictionary implementations.
  • 35. Spellchecking - Using Solr component Two flavours: “cofffee --> coffee”, collations: “expresso machine” -->“espresso machine” <str name="spellcheck">true</str> <str name="spellcheck.dictionary">title</str> <str name="spellcheck.onlyMorePopular">true</str> <str name="spellcheck.extendedResults">true</str> <str name="spellcheck.collate">true</str> <str name="spellcheck.maxCollations">100</str> <str name="spellcheck.maxCollationTries">5</str> <str name="spellcheck.count">5</str> <str name="spellcheck.collateParam.mm">100%</str> <searchComponent name="suggest" class="solr.SuggestComponent"> <lst name="suggester"> <str name="name">mySuggester</str> <str name="lookupImpl">FuzzyLookupFactory</str> <str name="suggestAnalyzerFieldType">text_general</str> <str name="buildOnCommit">true</str> <str name="field">dictionary</str> </lst> </searchComponent> Good collations are what we want!
  • 36. Autocompletion - using a query index User enters: ja Show the best query completions in the best order! Using ideas from Joshua Bacher/Christine Bellstedt, Search Suggestions - The Underestimated Killer Feature of your Online Shop. Berlin Buzzwords 2018
  • 37. Autocompletion - using a query index User enters: ja Match prefix in field “match” q=*:*&fq=match:ja indexed with EdgeNGramFilter, lowercase, remove accents/ASCII folding, ... Optionally index and match spelling variants (jacket/jakcet)
  • 38. Autocompletion - using a query index Sort by “weight desc” q=*:*&fq=match:ja &sort=weight desc Sorting might get slow for short prefixes if the query index is large - tag the top N queries for lengths 1 and 2 and add another filter (fewer matches to sort, nicely cacheable): q=*:*&fq=match:ja &sort=weight desc &fq=top_len_2:true
  • 39. Autocompletion - using a query index If two queries have the same fingerprint, drop the one with the lower weight Fingerprint: concatenated sorted, normalised query tokens This increases the diversity of the suggestions.
  • 40. Autocompletion - using a query index Suggest the Labels as query completions!
  • 41. Autocompletion - using a query index You can show the best matching category for disambiguation and affirmation: * jacket * jacket in Fashion
  • 42. Spelling correction - using a query index Structure similar to query index for autocompletion Copy of the ‘match’ field indexed as n-grams
  • 43. Spelling correction - using a query index Filters on edit distance and rank based on n-grams (via TF*IDF) q=jakc jones &defType=edismax &qf=match_ngram &sow=false &fq=match:jakc jones~2
  • 44. Spelling correction - using a query index Add boost by weight (or a function of it) q=jakc jones &defType=edismax &qf=match_ngram &sow=false &fq=match:jakc jones~2 &boost=weight
  • 45. General model for spelling correction & autocompletion Noisy Channel Model / Bayesian Inference (Kernighan et. al., 1990; Jurafsky & Martin, 2009) Our ‘Weight’ field Edit distance, n-gram model, keyboard layout (Symspell!), ... prefix match for autocompletion
  • 47. Query relaxation Which query term should we drop if we can’t match all of them together? jacket xs green jacket xs green jacket xs green jacket xs green iphone 12 iphone 12 iphone 12
  • 48. Query relaxation - ‘mm’ anti-pattern Loosening ‘minimum should match’ (mm) constraint to < 100% iphone 12 iphone 12 You’ll get matches for “12” She will just see probably imprecise results that don’t match her query exactly. You cannot tell the user what happened and which term you dropped. She wouldn’t know what to do in order to improve the query. Don’t do this! At least not in e-commerce search
  • 49. Query relaxation - Solutions René.Kriegler, Query Relaxation - a rewriting technique between search and recommendations. Haystack Conference 2019
  • 50. Query relaxation - Solutions Try searching with each term individually and drop the one from the query that yields the fewest results (might require additional rules to avoid just keeping number terms)
  • 51. Query relaxation - Solutions Multi-layer Neural Network, Word embeddings as input to represent terms
  • 52. Encores? Facets, autocompletion, spelling correction, query relaxation are important features of a search application. We’ve shown simple out-of-the-box solutions and a path to implement more advanced approaches.