Search Quality Evaluation to
Help Reproducibility:

An Open-source Approach
Alessandro Benedetti, Software Engineer
1st May 2019
Who I am
▪ Search Consultant
▪ R&D Software Engineer
▪ Master in Computer Science
▪ Apache Lucene/Solr Enthusiast
▪ Passionate about Semantic, NLP and Machine Learning technologies
▪ Beach Volleyball Player & Snowboarder
Alessandro Benedetti
Sease
Search Services
● Open Source Enthusiasts
● Apache Lucene/Solr experts
● Community Contributors
● Active Researchers
● Hot Trends : Learning To Rank, Document Similarity,
Search Quality Evaluation, Relevancy Tuning
✓ Search Quality Evaluation
‣ Context overview
‣ Search System Status
‣ Information Need and Relevancy Ratings
‣ Evaluation Measures
➢ Rated Ranking Evaluator (RRE)
➢ Future Works
Agenda
Search Quality Evaluation is the activity of
assessing how good a search system is.



Defining what good means depends on the interests
of whoever (stakeholder, developer, etc.) is doing the
evaluation.



So it is necessary to measure multiple metrics to
cover all the aspects of the perceived quality and
understand how the system is behaving.



Context Overview
Search Quality Evaluation
Search Quality is shaped by internal and external factors:
Correctness, Robustness, Extendibility, Reusability, Efficiency,
Timeliness, Modularity, Readability, Maintainability, Testability,
Understandability, …
This evaluation work is primarily focused on Correctness.
Search Quality: Correctness
In Information Retrieval, Correctness is the ability of
a system to meet the information needs of its users.
For each internal (gray) and external (red) iteration it
is vital to measure correctness variations.
Evaluation measures are used to assess how well
the search results satisfy the user's query intent.
Correctness
[Slide figure: two timelines. A new system iterates internally (v0.1 … v0.9) until v1.0 is released against the initial requirements; an existing system keeps evolving (v1.1, v1.2, v1.3, … v2.0) driven by change requests, bug reports and users complaining about junk in search results.]
In terms of correctness, how can
we know the system performance
across various versions?
Search Quality: Relevancy Ratings
A key concept in the calculation
of offline search quality
metrics is the relevance of a
document given a user
information need (query).
Before assessing the
correctness of the system it is
necessary to associate a
relevancy rating to each pair
<query, document> involved in
our evaluation.


[Slide figure: ratings are assigned either via explicit feedback (a judgements collector) or implicit feedback (an interactions logger), producing a ratings set. Example: for the query "Queen music", documents such as "Bohemian Rhapsody", "Dancing Queen" and "Queen Albums" receive ratings.]
Search Quality: Measures
Evaluation measures fall into two families:
Offline measures: Precision, Recall, F-Measure, Average Precision, Mean Reciprocal Rank, NDCG, …
Online measures: Click-through rate, Zero result rate, Session abandonment rate, Session success rate, …
We are mainly focused on offline measures here.
Evaluation measures for an information retrieval
system try to formalise how well a search system
satisfies its users' information needs.
Measures are generally split into two categories:
online and offline measures.
In this context we will focus on offline measures.
Evaluation Measures
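To make two of the offline measures concrete, here is a minimal Java sketch (not RRE code; document ids and judgments are made up) computing Precision@k and Average Precision for a single ranked result list:

```java
import java.util.List;
import java.util.Set;

public class OfflineMetrics {

    // Precision@k: fraction of the top-k results that are relevant.
    static double precisionAtK(List<String> ranked, Set<String> relevant, int k) {
        long hits = ranked.stream().limit(k).filter(relevant::contains).count();
        return (double) hits / k;
    }

    // Average Precision: mean of Precision@i over the ranks i where a relevant document appears,
    // divided by the total number of relevant documents for the query.
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        double sum = 0;
        int hits = 0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                sum += (double) hits / (i + 1);
            }
        }
        return relevant.isEmpty() ? 0 : sum / relevant.size();
    }

    public static void main(String[] args) {
        List<String> ranked = List.of("d3", "d7", "d1", "d9");
        Set<String> relevant = Set.of("d3", "d1", "d5");
        System.out.println(precisionAtK(ranked, relevant, 3));   // 2 relevant in top-3 -> 0.666...
        System.out.println(averagePrecision(ranked, relevant));  // (1/1 + 2/3) / 3 ≈ 0.555...
    }
}
```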
Search Quality: Evaluate a System
[Slide figure: evaluating a system. Input: an information need with ratings (e.g. a set of queries with the expected result documents annotated), a metric (e.g. Precision), and the evaluation system with its corpus of information. Evaluating the results produces a metric score between 0 and 1. Reproducibility: keeping these factors locked, I am expecting the same metric score.]
➢ Search Quality Evaluation
✓ An Open Source Approach (RRE)
‣ Apache Solr/ES
‣ Search System Status
‣ Rated Ranking Evaluator
‣ Information Need and Relevancy Ratings
‣ Evaluation Measures
‣ Evaluation and Output
➢ Future Works
Agenda
Open Source Search Engines
Solr is the popular, blazing-fast, open
source enterprise search platform built
on Apache Lucene™
Elasticsearch is a distributed, RESTful search and
analytics engine capable of solving a growing
number of use cases.
Search System Status: Index
- Data 

Documents in input
- Index Time Configurations 

Indexing Application Pipeline

Update Processing Chain

Text Analysis Configuration
Index

(Corpus of Information)
System Status: Query
- Search-API 

Build the client query
- Query Time Configurations 

Query Parser
Query Building
(Information Need)
Search-API
Query Parser
QUERY: The White Tiger
QUERY: ?q=the white tiger&qf=title,content^10&bf=popularity
QUERY: title:the white tiger OR
content:the white tiger …
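As an illustration of the Search-API step above, a client might build that query with SolrJ as sketched below; the endpoint, collection name and the edismax parameters echo the slide example and are purely illustrative, not part of RRE:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SearchApiSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr endpoint and collection, for illustration only.
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/books").build()) {
            SolrQuery query = new SolrQuery("the white tiger");  // the user's information need
            query.set("defType", "edismax");                     // query parser
            query.set("qf", "title content^10");                 // query-time field boosts
            query.set("bf", "popularity");                       // boost function
            QueryResponse response = solr.query(query);
            response.getResults().forEach(doc -> System.out.println(doc.getFieldValue("title")));
        }
    }
}
```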
RRE: What is it?
• A set of search quality evaluation tools
• A search quality evaluation framework
• Multi (search) platform
• Written in Java
• It can be used also in non-Java projects
• Licensed under Apache 2.0
• Open to contributions
• Extremely dynamic!
RRE: What is it?
https://github.com/SeaseLtd/rated-ranking-evaluator
RRE: Ecosystem
The picture illustrates the main modules composing
the RRE ecosystem.
All modules with a dashed border are planned for a
future release.
RRE CLI has a double border because although the
rre-cli module hasn’t been developed, you can run
RRE from a command line using RRE Maven
archetype, which is part of the current release.
As you can see, the current implementation includes
two target search platforms: Apache Solr and
Elasticsearch.
The Search Platform API module provides a search
platform abstraction for plugging in additional
search systems.
[Slide figure: the RRE ecosystem — Core, Search Platform API, search platform plugins, Reporting Plugin, Archetypes, RequestHandler, RRE Server, RRE CLI.]
RRE: Reproducibility in Evaluating a System
[Slide figure: reproducibility in evaluating a system. Input: RRE ratings (e.g. a JSON representation of the information need with related annotated documents), a metric (e.g. Precision), and the Apache Solr / Elasticsearch index (corpus of information), defined by the data, the index time configuration, the query building (Search API) and the query time configuration. Evaluating the results produces a metric score between 0 and 1. Reproducibility: running RRE with the same status, I am expecting the same metric score.]
RRE: Information Need Domain Model
• Rooted tree (the root is the Evaluation)
• Each level enriches the details of the information
need
• The corpus identifies the data collection
• The topic assigns a human-readable semantic
• Query groups expect the same results from their
child queries
The benefit of having a composite structure is clear:
we can see a metric value at different levels (e.g. a
query, all queries belonging to a query group, all
queries belonging to a topic or at corpus level)
[Slide figure: RRE domain model. Evaluation (top level domain entity) → 1..* Corpus (dataset / collection to evaluate) → 1..* Topic (high level information need) → 1..* Query Group (query variants) → 1..* Query (queries).]
RRE: Define Information Need and Ratings
Although the domain model structure is able to
capture complex scenarios, sometimes we want to
model simpler contexts.
In order to avoid verbose and redundant ratings
definitions it’s possible to omit some levels.

Combinations accepted for each corpus are:
• only queries
• query groups and queries
• topics, query groups and queries
[Slide figure: RRE domain model with optionality — Evaluation → Corpus (required) → Topic (optional) → Query Group (optional) → Query (required) → rated documents Doc 1, Doc 2, Doc 3, … Doc N.]
RRE: Json Ratings
Ratings files associate the RRE domain model
entities with relevance judgments. A ratings file
provides the association between queries and
relevant documents.
There must be at least one ratings file (otherwise no
evaluation happens). Usually there’s a 1:1
relationship between a rating file and a dataset.
Judgments, the most important part of this file,
consist of a list of all relevant documents for a
query group.
Each listed document has a corresponding “gain”
which is the relevancy judgment we want to assign
to that document.
[Slide figure: two alternative JSON layouts for the ratings file.]
RRE: Available metrics
These are the RRE built-in metrics which can be
used out of the box.
Most of them are computed at query level
and then aggregated at the upper levels.
However, compound metrics (e.g. MAP or GMAP)
are not explicitly declared or defined, because their
computation doesn’t happen at query level: the
aggregation executed on the upper levels
automatically produces these metrics.
e.g.

the Average Precision computed for Q1, Q2, Q3, Qn
becomes the Mean Average Precision at Query
Group or Topic levels.
Available Metrics
Precision
Recall
Precision at 1 (P@1)
Precision at 2 (P@2)
Precision at 3 (P@3)
Precision at 10 (P@10)
Average Precision (AP)
Reciprocal Rank
Mean Reciprocal Rank
Mean Average Precision (MAP)
Normalised Discounted Cumulative Gain (NDCG)
F-Measure
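A minimal sketch of the aggregation idea described above (not RRE's implementation): the per-query Average Precision values of a query group are simply averaged, which yields the Mean Average Precision without MAP ever being computed at query level.

```java
import java.util.List;
import java.util.Map;

public class MetricAggregation {

    // Mean Average Precision for one query group / topic:
    // the arithmetic mean of the per-query Average Precision values.
    static double meanAveragePrecision(List<Double> perQueryAp) {
        return perQueryAp.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // Hypothetical per-query AP values for queries Q1..Q4 of a query group.
        Map<String, Double> apByQuery = Map.of("Q1", 1.0, "Q2", 0.75, "Q3", 0.5, "Q4", 0.25);
        System.out.println(meanAveragePrecision(List.copyOf(apByQuery.values()))); // 0.625
    }
}
```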
RRE: Reproducibility in Evaluating a System
[Slide figure: reproducibility in evaluating a system. Input: RRE ratings (e.g. a JSON representation of the information need with related annotated documents), a metric (e.g. Precision), and the Apache Solr / Elasticsearch index (corpus of information), defined by the data, the index time configuration, the query building (Search API) and the query time configuration. Evaluating the results produces a metric score between 0 and 1. Reproducibility: running RRE with the same status, I am expecting the same metric score.]
System Status: Init Search Engine
Data
Configuration
Spins up an embedded

Search Platform
INPUT LAYER
EVALUATION LAYER
- an instance of ES/Solr is instantiated from
the input configurations
- Data is populated from the input
- The instance is ready to respond to queries
and be evaluated
N.B. An alternative approach we are working on
is to target a QA instance that is already populated.
In that scenario it is vital to keep the configuration
and data under version control.
Embedded
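A rough sketch of the embedded approach with SolrJ, assuming a hypothetical configuration directory and core name (a simplification, not RRE's internal code): the engine is spun up from the input configuration, data is indexed, and the instance is ready to answer the evaluation queries.

```java
import java.nio.file.Paths;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class EmbeddedSolrSketch {
    public static void main(String[] args) throws Exception {
        // Spin up an embedded Solr instance from a version-controlled configuration directory.
        try (EmbeddedSolrServer solr = new EmbeddedSolrServer(Paths.get("src/etc/solr-home"), "core1")) {
            // Populate the index from the input data.
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "1");
            doc.addField("title", "The White Tiger");
            solr.add(doc);
            solr.commit();
            // The instance is now ready to respond to the evaluation queries.
            System.out.println(solr.query(new SolrQuery("title:tiger")).getResults().getNumFound());
        }
    }
}
```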
System Status: Configuration Sets
- Configurations evolve with time

Reproducibility: track it with version control
systems!
- RRE can take various versions of the configuration as
input and compare them
- The evaluation process allows you to define
inclusion / exclusion rules (e.g. include only
versions 1.0 and 2.0)
Index/Query Time

Configuration
System Status: Feed the Data
An evaluation execution can involve more than one
dataset targeting a given search platform.
A dataset consists of representative domain
data; although a compressed dataset can be
provided, it generally has a small/medium size.
Within RRE, corpus, dataset and collection are
synonyms.
Datasets must be located under a configurable
folder. Each dataset is then referenced in one or
more ratings files.
Corpus Of Information

(Data)
System Status: Build the Queries
For each query (or query group) it’s possible to
define a template, which is a kind of query shape
containing one or more placeholders.
Then, in the ratings file you can reference one of
those defined templates and you can provide a value
for each placeholder.
Templates have been introduced in order to:
• allow a common query management between
search platforms
• define complex queries
• define runtime parameters that cannot be
statically determined (e.g. filters)
Query templates
only_q.json
filter_by_language.json
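A tiny sketch of the template mechanism; the placeholder syntax and file name below are illustrative assumptions, not necessarily RRE's exact conventions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;

public class QueryTemplates {

    // Replace each placeholder occurrence in the template with the value provided by the ratings file.
    static String render(String template, Map<String, String> placeholders) {
        String query = template;
        for (Map.Entry<String, String> entry : placeholders.entrySet()) {
            query = query.replace(entry.getKey(), entry.getValue());
        }
        return query;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical template file, e.g. only_q.json containing {"query": {"match": {"title": "$query"}}}
        String template = Files.readString(Path.of("templates/only_q.json"));
        System.out.println(render(template, Map.of("$query", "the white tiger")));
    }
}
```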
RRE: Reproducibility in Evaluating a System
[Slide figure: reproducibility in evaluating a system. Input: RRE ratings (e.g. a JSON representation of the information need with related annotated documents), a metric (e.g. Precision), and the Apache Solr / Elasticsearch index (corpus of information), defined by the data, the index time configuration, the query building (Search API) and the query time configuration. Evaluating the results produces a metric score between 0 and 1. Reproducibility: running RRE with the same status, I am expecting the same metric score.]
RRE: Evaluation process overview (1/2)
[Slide figure: the input layer (data, configuration, ratings) feeds the evaluation layer, which uses a search platform and produces evaluation data; in the output layer that data, serialised as JSON, is used for generating reports such as the RRE Console, …]
RRE: Evaluation process overview (2/2)
[Slide figure: a runtime container drives the RRE Core, which takes rating files, datasets and queries, then: starts the search platform (init system), creates & configures the index and indexes the data (set status), executes the queries and computes the metrics (set status), outputs the evaluation data, and stops the search platform (stop system).]
RRE: Evaluation Output
The RRE Core itself is a library, so it outputs its
result as a plain Java object that must be
used programmatically.
However, when wrapped within a runtime container,
like the Maven plugin, the evaluation object tree is
marshalled in JSON format.
Being interoperable, the JSON format can be used by
some other component for producing a different kind
of output.
An example of such usage is the RRE Apache
Maven Reporting Plugin which can
• output a spreadsheet
• send the evaluation data to a running RRE Server
Evaluation output
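A hedged sketch of what “marshalled in JSON format” amounts to for a runtime container; the evaluation map below is a stand-in for the real object tree, not RRE's actual classes:

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.List;
import java.util.Map;

public class EvaluationMarshalling {
    public static void main(String[] args) throws Exception {
        // Stand-in for the evaluation object tree produced by the core library.
        Map<String, Object> evaluation = Map.of(
                "corpus", "tmdb",
                "queryGroups", List.of(Map.of("name", "queen music",
                                              "metrics", Map.of("P@10", 0.7, "AP", 0.55))));
        // The runtime container serialises the tree to interoperable JSON for downstream consumers.
        String json = new ObjectMapper().writerWithDefaultPrettyPrinter().writeValueAsString(evaluation);
        System.out.println(json);
    }
}
```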
RRE: Workbook
The RRE domain model (topics, groups and queries)
is on the left and each metric (on the right section)
has a value for each version / entity pair.
In case the evaluation process includes multiple
datasets, there will be a spreadsheet for each of
them.
This output format is useful when
• you want to have (or maintain somewhere) a
snapshot of how the system performed at a
given moment
• the comparison includes a lot of versions
• you want to include all available metrics
Workbook
RRE: RRE Console
• SpringBoot/AngularJS app that shows real-time
information about evaluation results.
• Each time a build happens, the RRE reporting
plugin sends the evaluation result to a RESTful
endpoint provided by the RRE Console.
• The received data immediately updates the web
dashboard.
• Useful during the development / tuning
iterations (you don’t have to open the Excel
report again and again)
RRE Console
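In the same spirit, a minimal sketch of pushing evaluation JSON to a running console over HTTP; the endpoint URL and payload are hypothetical, and in practice the reporting plugin does this for you:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PushEvaluation {
    public static void main(String[] args) throws Exception {
        String evaluationJson = "{\"corpus\":\"tmdb\",\"metrics\":{\"P@10\":0.7}}"; // toy payload
        HttpRequest request = HttpRequest
                .newBuilder(URI.create("http://localhost:8080/evaluation")) // hypothetical endpoint
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(evaluationJson))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
    }
}
```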
RRE: Iterative development & tuning
Dev, tune & Build
Check evaluation results
We are thinking about how
to fill a third monitor
RRE: We are working on…
“I think if we could create a simplified
pass/fail report for the business team,
that would be ideal. So they could
understand the tradeoffs of the new
search.”
“Many search engines process the user
query heavily before it's submitted to the
search engine in whatever DSL is required,
and if you don't retain some idea of the
original query in the system how can you
relate the test results back to user
behaviour?”
Do I have to write all judgments
manually??
How can I use RRE if I have a custom
search platform?
Java is not in my stack
Can I persist the evaluation data?
RRE: Github Repository and Resources
• A sample RRE-enabled project
• No Java code, only configuration
• Search Platform: Elasticsearch 6.3.2
• Seven example iterations
• Index shapes & queries from Relevant Search [1]
• Dataset: TMDB (extract)
Demo Project
https://github.com/SeaseLtd/rre-demo-walkthrough
[1] https://www.manning.com/books/relevant-search
https://github.com/SeaseLtd/rated-ranking-evaluator
Github Repo
https://sease.io/2018/07/rated-ranking-evaluator.html
Blog article
Thanks!
