Neural Search Comes to Apache Solr:
Approximate Nearest Neighbor, BERT and More
(Buzzwords)!


Alessandro Benedetti, CEO
13th June 2022
‣ Born in Tarquinia (ancient Etruscan city in Italy)
‣ R&D Software Engineer
‣ Director
‣ Master in Computer Science
‣ PC member for ECIR, SIGIR and Desires
‣ Apache Lucene/Solr PMC member/committer
‣ Elasticsearch expert
‣ Passionate about semantic technologies, NLP and machine learning
‣ Beach Volleyball player and Snowboarder
Who We Are
Alessandro Benedetti
‣ Headquartered in London / distributed team
‣ Open Source Enthusiasts
‣ Apache Lucene/Solr experts
‣ Elasticsearch experts
‣ Community Contributors
‣ Active Researchers
‣ Hot Trends: Neural Search,
Natural Language Processing,
Learning To Rank,
Document Similarity,
Search Quality Evaluation,
Relevancy Tuning
SEArch SErvices
www.sease.io
Semantic Search Problems
Neural (Vector-based) Search
Apache Solr Implementation
BERT to the rescue!
Future Works
Overview
Semantic Search Problems
Neural (Vector-based) Search
Apache Solr Implementation
BERT to the rescue!
Future Works
Overview
How many people live in Rome?
Rome’s population is 4.3 million
Hundreds of people queuing for live music in Rome
Vocabulary mismatch problem: False Positive
How big is a tiger?
The tiger is the biggest member of the Felidae family
Panthera Tigris can reach 390cm nose to tail
Vocabulary mismatch problem: False Negative
Semantic Similarity
https://ai.googleblog.com/2018/05/advances-in-semantic-textual-similarity.html
Semantic Search Problems
Neural (Vector-based) Search
Apache Solr Implementation
BERT to the rescue!
Future Works
Overview
Vector representation for query/documents
Sparse
● e.g. Bag-of-words approach
● each term in the corpus dictionary -> one vector dimension
● number of dimensions -> term dictionary cardinality
● the vector for any given document contains mostly zeroes

Dense
● Fixed number of dimensions
● Normally much lower than term dictionary cardinality
● the vector for any given document contains mostly non-zeroes

How can you generate such dense vectors?
D=[0, 0, 0, 0, 1, 0, 0, 0, 1, 0] D=[0.7, 0.9, 0.4, 0.6, 1, 0.4, 0.7, 0.8, 1, 0.9]
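To make the contrast concrete, here is a small illustrative Java sketch (not part of the original deck; the toy dictionary and the hard-coded dense values are invented): a sparse bag-of-words vector has one dimension per dictionary term and is mostly zeroes, while a dense vector has a small, fixed number of mostly non-zero dimensions produced by a model.

import java.util.Arrays;
import java.util.List;

public class SparseVsDense {
  public static void main(String[] args) {
    // Toy dictionary: one dimension per term in the corpus vocabulary.
    List<String> dictionary = Arrays.asList(
        "rome", "population", "million", "people", "live", "music", "tiger", "felidae", "big", "city");

    // Sparse bag-of-words vector: dimension = dictionary size, mostly zeroes.
    String document = "rome population million";
    float[] sparse = new float[dictionary.size()];
    for (String term : document.split(" ")) {
      int dim = dictionary.indexOf(term);
      if (dim >= 0) {
        sparse[dim] = 1f; // term presence (could also be a tf or tf-idf weight)
      }
    }

    // Dense vector: fixed, much smaller dimensionality, mostly non-zero values.
    // In practice these values come from a trained model, not hand-written literals.
    float[] dense = {0.7f, 0.9f, 0.4f, 0.6f};

    System.out.println("sparse: " + Arrays.toString(sparse));
    System.out.println("dense : " + Arrays.toString(dense));
  }
}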
Neural Search
Training: Labeled Samples
Indexing: Text to Vectors
Searching: Query to Vector, Lookup in Index
Neural Search Workflow
Similarity between a Query and a Document is translated to distance in a vector space
Distance Between Vectors
https://www.pinecone.io/learn/what-is-similarity-search/
https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d
● Specifies the relationship metric between elements in the dataset
● use-case dependent
○ experiment to see which one works better for you!
● In Information Retrieval, cosine similarity has proved to work quite well (it is a normalised inner product; see the sketch below)
https://www.baeldung.com/cs/euclidean-distance-vs-cosine-similarity
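As a rough illustration of what these metrics compute (plain Java, not library code; the example vectors are invented), here are dot product, Euclidean distance and cosine similarity, with cosine being a normalised inner product:

public class VectorDistances {

  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  static float euclideanDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return (float) Math.sqrt(sum);
  }

  // Cosine similarity: the dot product of the two vectors divided by their norms, in [-1, 1].
  static float cosineSimilarity(float[] a, float[] b) {
    float norms = (float) (Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b)));
    return dotProduct(a, b) / norms;
  }

  public static void main(String[] args) {
    float[] query = {1.0f, 2.0f, 3.0f, 4.0f};
    float[] doc = {1.5f, 5.5f, 6.7f, 65.1f};
    System.out.println("dot       = " + dotProduct(query, doc));
    System.out.println("euclidean = " + euclideanDistance(query, doc));
    System.out.println("cosine    = " + cosineSimilarity(query, doc));
  }
}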
Nearest Neighbour Retrieval (KNN)
The relevance score is calculated with the vector similarity (distance) metric.
Closer vectors mean higher semantic similarity.
ANN - Approximate Nearest Neighbor
● Exact Nearest Neighbor is expensive! (a one-vs-one vector distance computation against every document)
● it is often acceptable to lose some accuracy to get a massive performance gain
● pre-process the dataset to build index data structures
● Generally vectors are quantized (compressed) and then modelled in:
○ Trees - partitioning of the vector space (k-d tree)
https://en.wikipedia.org/wiki/K-d_tree#Nearest_neighbour_search
○ Hashes - reducing high dimensionality while preserving differences and grouping similar objects
https://towardsdatascience.com/locality-sensitive-hashing-for-music-search-f2f1940ace23
○ Graphs - HNSW
HNSW - Hierarchical Navigable Small World graphs
Hierarchical Navigable Small World (HNSW) graphs are among the top-performing index-time data structures for approximate nearest neighbor search (ANN).
References
https://doi.org/10.1016/j.is.2013.10.006
https://arxiv.org/abs/1603.09320
HNSW - How it works in a nutshell
● Proximity graph
● Vertices are vectors; closer vertices are linked
● Hierarchical layers based on skip lists
○ longer edges in higher layers (fast retrieval)
○ shorter edges in lower layers (accuracy)
● Each layer is a Navigable Small World graph (see the sketch below)
○ greedy search for the closest friend (local minimum)
○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the more expensive the search)
○ move down a layer to refine the minimum (closest friend)
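A simplified sketch of the greedy "closest friend" search on a single layer (illustrative only, not the Lucene implementation; the adjacency-map graph representation is an assumption):

import java.util.List;
import java.util.Map;

public class GreedyLayerSearch {

  // Greedy search on one Navigable Small World layer: from an entry point,
  // repeatedly move to the neighbour closest to the query until no neighbour
  // improves the distance (a local minimum, the "closest friend").
  static int greedySearch(Map<Integer, List<Integer>> neighbours,
                          float[][] vectors,
                          float[] query,
                          int entryPoint) {
    int current = entryPoint;
    float best = distance(vectors[current], query);
    boolean improved = true;
    while (improved) {
      improved = false;
      for (int candidate : neighbours.getOrDefault(current, List.of())) {
        float d = distance(vectors[candidate], query);
        if (d < best) {
          current = candidate;
          best = d;
          improved = true; // keep moving towards the query
        }
      }
    }
    // In HNSW the result becomes the entry point of the next (lower) layer.
    return current;
  }

  static float distance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum; // squared Euclidean distance is enough for comparisons
  }
}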
Semantic Search Problems
Neural (Vector-based) Search
Apache Solr Implementation
BERT to the rescue!
Future Works
Overview
Nov 2020 - Apache Lucene 9.0
Dedicated File Format for Navigable Small World Graphs
https://issues.apache.org/jira/browse/LUCENE-9004
Jan 2022 - Apache Lucene 9.0
Handle Document Deletions
https://issues.apache.org/jira/browse/LUCENE-10040
Feb 2022 - Apache Lucene 9.1
Introduced Hierarchy in HNSW
https://issues.apache.org/jira/browse/LUCENE-10054
Mar 2022 - Apache Lucene 9.1
Re-use data structures across HNSW Graph
https://issues.apache.org/jira/browse/LUCENE-10391
Mar 2022 - Apache Lucene 9.1
Pre filters with KNN queries
https://issues.apache.org/jira/browse/LUCENE-10382
JIRA ISSUES
https://issues.apache.org/jira/issues/?jql=project%20%3D%20LUCENE%20AND%20labels%20%3D%20vector-based-search
Apache Lucene implementation
org.apache.lucene.index.VectorSimilarityFunction
/** Euclidean distance */
EUCLIDEAN
/**
 * Dot product. NOTE: this similarity is intended as an optimized way to perform cosine
 * similarity. In order to use it, all vectors must be of unit length, including both document and
 * query vectors. Using dot product with vectors that are not unit length can result in errors or
 * poor search results.
 */
DOT_PRODUCT

/**
 * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all
 * vectors to unit length, and instead use {@link VectorSimilarityFunction#DOT_PRODUCT}. You
 * should only use this function if you need to preserve the original vectors and cannot normalize
 * them in advance.
 */
COSINE
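The enum can also be used directly to score a pair of vectors. A small usage sketch (the vectors are invented; compare() returns a similarity score where higher means more similar):

import org.apache.lucene.index.VectorSimilarityFunction;

public class SimilarityFunctionExample {
  public static void main(String[] args) {
    float[] document = {0.6f, 0.8f}; // unit length, so DOT_PRODUCT is safe here
    float[] query = {0.8f, 0.6f};    // unit length as well

    // compare(v1, v2) returns a similarity score: higher means more similar.
    float cosine = VectorSimilarityFunction.COSINE.compare(document, query);
    float dot = VectorSimilarityFunction.DOT_PRODUCT.compare(document, query);
    float euclidean = VectorSimilarityFunction.EUCLIDEAN.compare(document, query);

    System.out.println("cosine=" + cosine + " dot=" + dot + " euclidean=" + euclidean);
  }
}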
Apache Lucene - indexing
org.apache.lucene.document.KnnVectorField
private static Field randomKnnVectorField(Random random, String fieldName) {
  VectorSimilarityFunction similarityFunction =
      RandomPicks.randomFrom(random, VectorSimilarityFunction.values());
  float[] values = new float[randomIntBetween(1, 10)];
  for (int i = 0; i < values.length; i++) {
    values[i] = randomFloat();
  }
  return new KnnVectorField(fieldName, values, similarityFunction);
}

Document doc = new Document();
doc.add(new KnnVectorField("field", new float[] {j, j}, VectorSimilarityFunction.EUCLIDEAN));
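Putting it together, a minimal, illustrative indexing sketch (the index path and field values are invented) that writes a document carrying a KnnVectorField:

import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.KnnVectorField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.VectorSimilarityFunction;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class KnnIndexingExample {
  public static void main(String[] args) throws Exception {
    try (Directory directory = FSDirectory.open(Paths.get("/tmp/knn-index"));
         IndexWriter writer = new IndexWriter(directory, new IndexWriterConfig())) {
      Document doc = new Document();
      doc.add(new StringField("id", "1", Field.Store.YES));
      // In a real application the vector comes from an embedding model.
      doc.add(new KnnVectorField("vector",
          new float[] {1.0f, 2.5f, 3.7f, 4.1f},
          VectorSimilarityFunction.COSINE));
      writer.addDocument(doc);
      writer.commit();
    }
  }
}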
Apache Lucene - indexing
org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsWriter
Lucene91HnswVectorsReader.OffHeapVectorValues offHeapVectors =
    new Lucene91HnswVectorsReader.OffHeapVectorValues(
        vectors.dimension(), docsWithField.cardinality(), null, vectorDataInput);

OnHeapHnswGraph graph =
    offHeapVectors.size() == 0
        ? null
        : writeGraph(offHeapVectors, fieldInfo.getVectorSimilarityFunction());

org.apache.lucene.util.hnsw.HnswGraphBuilder

public Lucene91Codec(Mode mode) {
  super("Lucene91");
  this.storedFieldsFormat =
      new Lucene90StoredFieldsFormat(Objects.requireNonNull(mode).storedMode);
  this.defaultPostingsFormat = new Lucene90PostingsFormat();
  this.defaultDVFormat = new Lucene90DocValuesFormat();
  this.defaultKnnVectorsFormat = new Lucene91HnswVectorsFormat();
}
Apache Lucene - indexing
org.apache.lucene.search.KnnVectorQuery
/**
 * Find the <code>k</code> nearest documents to the target vector according to the vectors in the
 * given field.
 *
 * @param field a field that has been indexed as a {@link KnnVectorField}.
 * @param target the target of the search
 * @param k the number of documents to find
 * @param filter a filter applied before the vector search
 * @throws IllegalArgumentException if <code>k</code> is less than 1
 */
public KnnVectorQuery(String field, float[] target, int k, Query filter) {
  this.field = field;
  this.target = target;
  this.k = k;
  if (k < 1) {
    throw new IllegalArgumentException("k must be at least 1, got: " + k);
  }
  this.filter = filter;
}
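A minimal, illustrative search sketch (the index path and query vector are invented) that runs the query through an IndexSearcher:

import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class KnnSearchExample {
  public static void main(String[] args) throws Exception {
    try (Directory directory = FSDirectory.open(Paths.get("/tmp/knn-index"));
         DirectoryReader reader = DirectoryReader.open(directory)) {
      IndexSearcher searcher = new IndexSearcher(reader);
      float[] queryVector = {1.0f, 2.0f, 3.0f, 4.0f};
      // Retrieve the (approximate) 10 nearest documents to the query vector.
      TopDocs topDocs = searcher.search(new KnnVectorQuery("vector", queryVector, 10), 10);
      for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        System.out.println("doc=" + scoreDoc.doc + " score=" + scoreDoc.score);
      }
    }
  }
}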
Apache Lucene - searching
May 2022 - Apache Solr 9.0
Sease introduced support for KNN search (HNSW)
https://issues.apache.org/jira/browse/SOLR-15880
JIRA ISSUES
https://issues.apache.org/jira/browse/SOLR-15880?jql=labels%20%3D%20vector-based-search
Apache Solr implementation
Apache Solr 9.0 - Schema
DenseVectorField
The dense vector field allows indexing and searching dense vectors of float elements.
For example:
[1.0, 2.5, 3.7, 4.1]
Here’s how DenseVectorField should be configured in the schema:
<fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/>
<field name="vector" type="knn_vector" indexed="true" stored="true"/>
vectorDimension
The dimension of the dense vector to pass in.
Accepted values: any integer <= 1024.

similarityFunction
Vector similarity function; used in search to return the top K most similar vectors to a target vector.
Accepted values: euclidean, dot_product or cosine.
● euclidean: Euclidean distance
● dot_product: Dot product
● cosine: Cosine similarity
solrconfig.xml
<config>
<codecFactory class="solr.SchemaCodecFactory"/>

schema.xml
<fieldType name="knn_vector"
    class="solr.DenseVectorField"
    vectorDimension="4"
    similarityFunction="cosine"
    codecFormat="Lucene90HnswVectorsFormat"
    hnswMaxConnections="10"
    hnswBeamWidth="40"/>
hnswMaxConnections
Optional. Default: 16
(advanced) This parameter is specific to the Lucene90HnswVectorsFormat codec format:
it controls how many of the nearest neighbor candidates are connected to the new node.
It has the same meaning as M from the 2018 paper.
Accepted values: any integer.

hnswBeamWidth
Optional. Default: 100
(advanced) This parameter is specific to the Lucene90HnswVectorsFormat codec format:
it is the number of nearest neighbor candidates to track while searching the graph for each newly inserted node.
It has the same meaning as efConstruction from the 2018 paper.
Accepted values: any integer.
Apache Solr 9.0 - Schema
Apache Solr 9.0 - Indexing
JSON
[
  { "id": "1", "vector": [1.0, 2.5, 3.7, 4.1] },
  { "id": "2", "vector": [1.5, 5.5, 6.7, 65.1] }
]
XML
<add>
<doc>
<field name="id">1</field>
<field name="vector">1.0</field>
<field name="vector">2.5</field>
<field name="vector">3.7</field>
<field name="vector">4.1</field>
</doc>
<doc>
<field name="id">2</field>
<field name="vector">1.5</field>
<field name="vector">5.5</field>
<field name="vector">6.7</field>
<field name="vector">65.1</field>
</doc>
</add>
SolrJ
final SolrClient client = getSolrClient();

final SolrInputDocument d1 = new SolrInputDocument();
d1.setField("id", "1");
d1.setField("vector", Arrays.asList(1.0f, 2.5f, 3.7f, 4.1f));

final SolrInputDocument d2 = new SolrInputDocument();
d2.setField("id", "2");
d2.setField("vector", Arrays.asList(1.5f, 5.5f, 6.7f, 65.1f));

client.add(Arrays.asList(d1, d2));
N.B. from an indexing and storing perspective, a dense vector field is no different from an array of float elements.
Apache Solr 9.0 - Searching
knn Query Parser
The knn (k-nearest neighbors) query parser finds the k nearest documents to the target vector, according to the indexed dense vectors in the given field.
f (required): The DenseVectorField to search in.
topK (optional, default: 10): How many k-nearest results to return.
Here’s how to run a KNN search:
e.g.
&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
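The same query can be issued from SolrJ. A hedged sketch (the Solr URL, client setup and the "dense" collection name are assumptions):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class KnnQueryExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
      // Same query string that would be sent as the q parameter over HTTP.
      SolrQuery query = new SolrQuery("{!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]");
      query.setFields("id", "score");
      QueryResponse response = client.query("dense", query); // the "dense" collection is an assumption
      response.getResults().forEach(System.out::println);
    }
  }
}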
Apache Solr 9.0 - Searching with Filter Queries (fq)
When using knn in these scenarios, make sure you have a clear understanding of how filter queries work in Apache Solr:
the Ranked List of document IDs resulting from the main query q is intersected with the set of document IDs deriving from each filter query fq.
e.g.
Ranked List from q=[ID1, ID4, ID2, ID10] <intersects> Set from fq={ID3, ID2, ID9, ID4} = [ID4, ID2]
Usage with Filter Queries
The knn query parser can be used in filter queries:
&q=id:(1 2 3)&fq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
The knn query parser can be used with filter queries:
&q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]&fq=id:(1 2 3)
N.B. this can be called 'post-filtering'; 'pre-filtering' is coming in a future release
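For instance, post-filtering the knn results from SolrJ might look like this (a sketch; the client setup and collection name are assumptions, as above):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class KnnWithFilterQueryExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
      // knn as the main query q; the fq set is intersected with its ranked list (post-filtering).
      SolrQuery query = new SolrQuery("{!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]");
      query.addFilterQuery("id:(1 2 3)");
      QueryResponse response = client.query("dense", query);
      System.out.println("matches: " + response.getResults().getNumFound());
    }
  }
}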
Apache Solr 9.0 - Reranking
When using knn in re-ranking, pay attention to the topK parameter.
The second pass score (deriving from knn) is calculated only if the document d from the first pass is within the k nearest neighbors (in the whole index) of the target vector to search.
This means the second pass knn is executed on the whole index anyway, which is a current limitation.
The final ranked list of results will have the first pass score (main query q) added to the second pass score (the approximated similarityFunction distance to the target vector to search) multiplied by a multiplicative factor (reRankWeight).
Details about using the ReRank Query Parser can be found in the Query Re-Ranking section.
Usage as Re-Ranking Query
The knn query parser can be used to rerank first pass query results:
&q=id:(3 4 9 2)&rq={!rerank reRankQuery=$rqq reRankDocs=4
reRankWeight=1}&rqq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
N.B. pure re-scoring is coming in a later release
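A SolrJ sketch of the re-ranking request above (the rq/rqq parameters mirror the HTTP example; client setup and collection name are assumptions):

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class KnnRerankExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient client = new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {
      SolrQuery query = new SolrQuery("id:(3 4 9 2)"); // first pass: lexical query
      // second pass: re-rank the first pass results by knn vector similarity
      query.set("rq", "{!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}");
      query.set("rqq", "{!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]");
      QueryResponse response = client.query("dense", query);
      response.getResults().forEach(System.out::println);
    }
  }
}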
Apache Solr 9.0 - Hybrid Dense/Sparse search
…/solr/dense/select?indent=true&q={!bool should=$clause1 should=$clause2}
&clause1={!type=field f=id v='901'}
&clause2={!knn f=vector topK=10}[0.4,0.5,0.3,0.6,0.8]
&fl=id,score,vector
"debug":{
"rawquerystring":"{!bool should=$clause1 should=$clause2}",
"querystring":"{!bool should=$clause1 should=$clause2}",
"parsedquery":"id:901
KnnVectorQuery(KnnVectorQuery:vector[0.4,...][10])",
"parsedquery_toString":"id:901 KnnVectorQuery:vector[0.4,...][10]",
Apache Solr 9.0 - Reranking
Apache Solr 9.0 - An Initial Benchmark
N.B. this is a quick benchmark; it doesn't necessarily reflect bigger volumes linearly.
[Benchmark chart comparing knn and text queries]
Semantic Search Problems
Neural (Vector-based) Search
Apache Solr Implementation
BERT to the rescue!
Future Works
Overview
● Large language models need to be trained on a lot of data to work well
● ... which is normally difficult for the average enterprise/project
● Transformers are pre-trained in an unsupervised way on large corpora (Wikipedia, the web, ...)
● Pre-training captures meaning, topics, and word patterns in the language (rather than the specific domain)
● Shares similarities with word2vec (one vector per word) -> the main difference is that words/sentences are encoded in context
● A pre-trained model can be fine-tuned for specific tasks (dense retrieval, essay generation, translation, text summarization, ...)
How to produce vectors?
Masked Language Modeling
Models are pre-trained using the masked language modeling technique:
take a text, mask random words, and try to predict the masked words.
This technique captures interactions between words within their context.
Bidirectional Encoder Representations from Transformers
Large model (many dimensions and layers; base: 12 layers and 768 dimensions)
Special tokens:
[CLS] Classification token, used as a pooling operator to get a single vector per sequence
[MASK] Used in the masked language model, to predict this word
[SEP] Used to separate sentences
Devlin, Chang, Lee, Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019.
BERT to the rescue!
https://sease.io/2021/12/using-bert-to-improve-search-relevance.html
Training / Fine-Tuning
Training: self-supervised on ∞ training data
Fine-Tuning: supervised on few labeled examples
From Text to Vectors
from sentence_transformers import SentenceTransformer
import torch
import sys

BATCH_SIZE = 50
MODEL_NAME = 'msmarco-distilbert-base-dot-prod-v3'

model = SentenceTransformer(MODEL_NAME)
if torch.cuda.is_available():
    model = model.to(torch.device("cuda"))

def main():
    input_filename = sys.argv[1]
    output_filename = sys.argv[2]
    create_embeddings(input_filename, output_filename)
https://pytorch.org
https://pytorch.org/tutorials/beginne
r/basics/quickstart_tutorial.html
https://www.sbert.net
https://www.sbert.net/docs/pretrain
ed_models.html
From Text to Vectors
def create_embeddings(input_filename, output_filename):
    with open(input_filename, 'r', encoding="utf-8") as f:
        with open(output_filename, 'w') as out:
            processed = 0
            sentences = []
            for line in f:
                sentences.append(line)
                if len(sentences) == BATCH_SIZE:
                    processed += 1
                    if processed % 1000 == 0:
                        print("processed {} documents".format(processed))
                    vectors = process(sentences)
                    for v in vectors:
                        out.write(','.join([str(i) for i in v]))
                        out.write('\n')
                    sentences = []
            # flush the last (partial) batch, which the original snippet left unprocessed
            if sentences:
                vectors = process(sentences)
                for v in vectors:
                    out.write(','.join([str(i) for i in v]))
                    out.write('\n')
Each line in the input file is a sentence.
In this example we consider each sentence a separate document to be indexed in Apache Solr.
From Text to Vectors
def process(sentences):
    embeddings = model.encode(sentences, show_progress_bar=True)
    return embeddings
embeddings is an array of vectors; each vector represents a sentence.
Semantic Search Problems
Neural (Vector-based) Search
Apache Solr Implementation
BERT to the rescue!
Future Works
Overview
Future Works: [Solr] Codec Agnostic
May 2022 - Apache Lucene 9.1 in Apache Solr
https://issues.apache.org/jira/browse/SOLR-16204
https://issues.apache.org/jira/browse/SOLR-16245
Currently it is possible to configure the codec to use for the HNSW implementation in Lucene.
The proposal is to change this so that the user can only configure the algorithm (HNSW for now, and the default; maybe IVFFlat in the future? https://issues.apache.org/jira/browse/LUCENE-9136)
<fieldType name="knn_vector"
    class="solr.DenseVectorField"
    vectorDimension="4"
    similarityFunction="cosine"
    codecFormat="Lucene90HnswVectorsFormat"
    hnswMaxConnections="10"
    hnswBeamWidth="40"/>
Future Works: [Solr] Pre-filtering
Mar 2022 - Apache Lucene 9.1
Pre filters with KNN queries
https://issues.apache.org/jira/browse/LUCENE-10382
https://issues.apache.org/jira/browse/SOLR-16246
This PR adds support for a query filter in KnnVectorQuery. First, we gather the
query results for each leaf as a bit set. Then the HNSW search skips over the
non-matching documents (using the same approach as for live docs). To prevent
HNSW search from visiting too many documents when the filter is very selective,
we short-circuit if HNSW has already visited more than the number of documents
that match the filter, and execute an exact search instead. This bounds the
number of visited documents at roughly 2x the cost of just running the exact
filter, while in most cases HNSW completes successfully and does a lot better.
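In pseudo-Java, the decision described above looks roughly like this (a conceptual sketch only, not the actual Lucene code; approximateSearch and exactSearch are hypothetical helpers):

import java.util.BitSet;

// Conceptual sketch of the strategy; approximateSearch and exactSearch are hypothetical helpers.
final class FilteredKnnSketch {

  static int[] filteredKnn(float[] queryVector, int k, BitSet filterBits) {
    int filterCost = filterBits.cardinality(); // number of documents matching the filter

    // Run HNSW, but make it give up once it has visited more documents than the filter matches.
    int[] approximate = approximateSearch(queryVector, k, filterBits, filterCost);
    if (approximate != null) {
      return approximate; // HNSW completed within the budget (the common case)
    }
    // Very selective filter: fall back to an exact, brute-force scan over the filtered documents.
    return exactSearch(queryVector, k, filterBits);
  }

  // Hypothetical helpers, shown only to illustrate the control flow.
  static int[] approximateSearch(float[] query, int k, BitSet filter, int visitLimit) {
    throw new UnsupportedOperationException("illustrative only");
  }

  static int[] exactSearch(float[] query, int k, BitSet filter) {
    throw new UnsupportedOperationException("illustrative only");
  }
}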
Future Works: [Lucene] VectorSimilarityFunction simplification
https://issues.apache.org/jira/browse/LUCENE-10593
The proposal in this Pull Request aims to:
1) make the Euclidean similarity just return the score, in line with the other similarities, with the formula currently used
2) simplify the code, removing the bound checker that is no longer necessary
3) refactor here and there to be in line with the simplification
4) refactor NeighborQueue to clearly state when it is a MIN_HEAP or MAX_HEAP; debugging is now much easier and understanding the HNSW code is much more intuitive
Future Works: [Solr] BERT Integration
We are planning to contribute:
1) an update request processor that takes a BERT model as input and performs the inference (vectorization) at indexing time
2) a query parser that takes a BERT model as input and performs the inference (vectorization) at query time
Apache Solr 9.0 - Additional Resources
Additional Resources
● Blog: https://sease.io/2022/01/apache-solr-neural-search.html
● Blog: https://sease.io/2022/01/apache-solr-neural-search-knn-benchmark.html
● Blog: https://sease.io/2021/07/artificial-intelligence-applied-to-search-introduction.html
● Blog: https://sease.io/2021/12/using-bert-to-improve-search-relevance.html
● Blog: https://sease.io/2022/01/tackling-vocabulary-mismatch-with-document-expansion.html
● Blog: https://sease.io/2022/04/have-neural-networks-killed-the-inverted-index.html
Thanks!
Special thanks to:
● Apache Lucene community for all the HNSW goodies
● Elia Porciani - active contributor of code for Apache Solr Neural Search components
● Christine Poerschke for the accurate review
● Cassandra Targett for all the documentation corrections
● Michael Gibney for the discussion on dense vectors, and everyone involved in the review process!

More Related Content

What's hot

Scalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDBScalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDBAlluxio, Inc.
 
Elasticsearch 한글 형태소 분석기 Nori 노리
Elasticsearch 한글 형태소 분석기 Nori 노리Elasticsearch 한글 형태소 분석기 Nori 노리
Elasticsearch 한글 형태소 분석기 Nori 노리종민 김
 
Case Study: MySQL migration from latin1 to UTF-8
Case Study: MySQL migration from latin1 to UTF-8Case Study: MySQL migration from latin1 to UTF-8
Case Study: MySQL migration from latin1 to UTF-8Olivier DASINI
 
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
user Behavior Analysis with Session Windows and Apache Kafka's Streams APIuser Behavior Analysis with Session Windows and Apache Kafka's Streams API
user Behavior Analysis with Session Windows and Apache Kafka's Streams APIconfluent
 
Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-bas...
Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-bas...Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-bas...
Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-bas...Willy Lulciuc
 
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오PgDay.Seoul
 
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Spark Summit
 
MySQL InnoDB Cluster - Advanced Configuration & Operations
MySQL InnoDB Cluster - Advanced Configuration & OperationsMySQL InnoDB Cluster - Advanced Configuration & Operations
MySQL InnoDB Cluster - Advanced Configuration & OperationsFrederic Descamps
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLModeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLScyllaDB
 
2021 10-13 i ox query processing
2021 10-13 i ox query processing2021 10-13 i ox query processing
2021 10-13 i ox query processingAndrew Lamb
 
B+Tree Indexes and InnoDB
B+Tree Indexes and InnoDBB+Tree Indexes and InnoDB
B+Tree Indexes and InnoDBOvais Tariq
 
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw...
InfluxDB IOx Tech Talks: The Impossible Dream:  Easy-to-Use, Super Fast Softw...InfluxDB IOx Tech Talks: The Impossible Dream:  Easy-to-Use, Super Fast Softw...
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw...InfluxData
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberXiang Fu
 
Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Willy Lulciuc
 
Working with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrWorking with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrAnshum Gupta
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Icebergkbajda
 
Tanel Poder - Scripts and Tools short
Tanel Poder - Scripts and Tools shortTanel Poder - Scripts and Tools short
Tanel Poder - Scripts and Tools shortTanel Poder
 

What's hot (20)

Scalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDBScalable Filesystem Metadata Services with RocksDB
Scalable Filesystem Metadata Services with RocksDB
 
Elasticsearch 한글 형태소 분석기 Nori 노리
Elasticsearch 한글 형태소 분석기 Nori 노리Elasticsearch 한글 형태소 분석기 Nori 노리
Elasticsearch 한글 형태소 분석기 Nori 노리
 
Case Study: MySQL migration from latin1 to UTF-8
Case Study: MySQL migration from latin1 to UTF-8Case Study: MySQL migration from latin1 to UTF-8
Case Study: MySQL migration from latin1 to UTF-8
 
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
user Behavior Analysis with Session Windows and Apache Kafka's Streams APIuser Behavior Analysis with Session Windows and Apache Kafka's Streams API
user Behavior Analysis with Session Windows and Apache Kafka's Streams API
 
Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-bas...
Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-bas...Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-bas...
Marquez: A Metadata Service for Data Abstraction, Data Lineage, and Event-bas...
 
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
[Pgday.Seoul 2017] 3. PostgreSQL WAL Buffers, Clog Buffers Deep Dive - 이근오
 
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
Sparkler—Crawler on Apache Spark: Spark Summit East talk by Karanjeet Singh a...
 
MySQL InnoDB Cluster - Advanced Configuration & Operations
MySQL InnoDB Cluster - Advanced Configuration & OperationsMySQL InnoDB Cluster - Advanced Configuration & Operations
MySQL InnoDB Cluster - Advanced Configuration & Operations
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Modeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQLModeling Data and Queries for Wide Column NoSQL
Modeling Data and Queries for Wide Column NoSQL
 
2021 10-13 i ox query processing
2021 10-13 i ox query processing2021 10-13 i ox query processing
2021 10-13 i ox query processing
 
B+Tree Indexes and InnoDB
B+Tree Indexes and InnoDBB+Tree Indexes and InnoDB
B+Tree Indexes and InnoDB
 
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw...
InfluxDB IOx Tech Talks: The Impossible Dream:  Easy-to-Use, Super Fast Softw...InfluxDB IOx Tech Talks: The Impossible Dream:  Easy-to-Use, Super Fast Softw...
InfluxDB IOx Tech Talks: The Impossible Dream: Easy-to-Use, Super Fast Softw...
 
Lena Application Server
Lena  Application ServerLena  Application Server
Lena Application Server
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
 
Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez Data Lineage with Apache Airflow using Marquez
Data Lineage with Apache Airflow using Marquez
 
Vector database
Vector databaseVector database
Vector database
 
Working with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache SolrWorking with deeply nested documents in Apache Solr
Working with deeply nested documents in Apache Solr
 
Presto Summit 2018 - 09 - Netflix Iceberg
Presto Summit 2018  - 09 - Netflix IcebergPresto Summit 2018  - 09 - Netflix Iceberg
Presto Summit 2018 - 09 - Netflix Iceberg
 
Tanel Poder - Scripts and Tools short
Tanel Poder - Scripts and Tools shortTanel Poder - Scripts and Tools short
Tanel Poder - Scripts and Tools short
 

Similar to Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and More (Buzzwords).pdf

Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfSease
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneRahul Jain
 
Using Elastiknn for exact and approximate nearest neighbor search
Using Elastiknn for exact and approximate nearest neighbor searchUsing Elastiknn for exact and approximate nearest neighbor search
Using Elastiknn for exact and approximate nearest neighbor searchFaithWestdorp
 
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Data Con LA
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkMatt Ingenthron
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit
 
The life of a query (oracle edition)
The life of a query (oracle edition)The life of a query (oracle edition)
The life of a query (oracle edition)maclean liu
 
AWS Community Day Bangkok 2019 - How AWS Parallel Cluster can accelerate high...
AWS Community Day Bangkok 2019 - How AWS Parallel Cluster can accelerate high...AWS Community Day Bangkok 2019 - How AWS Parallel Cluster can accelerate high...
AWS Community Day Bangkok 2019 - How AWS Parallel Cluster can accelerate high...AWS User Group - Thailand
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache SolrEdureka!
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Databricks
 
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo..."Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...Lucidworks
 
Dynamic Resource Allocation Algorithm using Containers
Dynamic Resource Allocation Algorithm using ContainersDynamic Resource Allocation Algorithm using Containers
Dynamic Resource Allocation Algorithm using ContainersIRJET Journal
 
A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
A General Purpose Extensible Scanning Query Architecture for Ad Hoc AnalyticsA General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
A General Purpose Extensible Scanning Query Architecture for Ad Hoc AnalyticsFlurry, Inc.
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)dnaber
 
OpenStack Collaboration made in heaven with Heat, Mistral, Neutron and more..
OpenStack Collaboration made in heaven with Heat, Mistral, Neutron and more..OpenStack Collaboration made in heaven with Heat, Mistral, Neutron and more..
OpenStack Collaboration made in heaven with Heat, Mistral, Neutron and more..Trinath Somanchi
 
Modern DevOps with Spinnaker/Concourse and Micrometer
Modern DevOps with Spinnaker/Concourse and MicrometerModern DevOps with Spinnaker/Concourse and Micrometer
Modern DevOps with Spinnaker/Concourse and MicrometerJesse Tate Pulfer
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonBenjamin Bengfort
 

Similar to Neural Search Comes to Apache Solr_ Approximate Nearest Neighbor, BERT and More (Buzzwords).pdf (20)

Dense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdfDense Retrieval with Apache Solr Neural Search.pdf
Dense Retrieval with Apache Solr Neural Search.pdf
 
LarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - IntroductionLarKC Tutorial at ISWC 2009 - Introduction
LarKC Tutorial at ISWC 2009 - Introduction
 
Introduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of LuceneIntroduction to Elasticsearch with basics of Lucene
Introduction to Elasticsearch with basics of Lucene
 
Using Elastiknn for exact and approximate nearest neighbor search
Using Elastiknn for exact and approximate nearest neighbor searchUsing Elastiknn for exact and approximate nearest neighbor search
Using Elastiknn for exact and approximate nearest neighbor search
 
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
Big Data Day LA 2016/ NoSQL track - Spark And Couchbase: Augmenting The Opera...
 
Spark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with SparkSpark and Couchbase– Augmenting the Operational Database with Spark
Spark and Couchbase– Augmenting the Operational Database with Spark
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
The life of a query (oracle edition)
The life of a query (oracle edition)The life of a query (oracle edition)
The life of a query (oracle edition)
 
AWS Community Day Bangkok 2019 - How AWS Parallel Cluster can accelerate high...
AWS Community Day Bangkok 2019 - How AWS Parallel Cluster can accelerate high...AWS Community Day Bangkok 2019 - How AWS Parallel Cluster can accelerate high...
AWS Community Day Bangkok 2019 - How AWS Parallel Cluster can accelerate high...
 
New-Age Search through Apache Solr
New-Age Search through Apache SolrNew-Age Search through Apache Solr
New-Age Search through Apache Solr
 
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
Microservices and Teraflops: Effortlessly Scaling Data Science with PyWren wi...
 
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo..."Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
"Spark Search" - In-memory, Distributed Search with Lucene, Spark, and Tachyo...
 
Dynamic Resource Allocation Algorithm using Containers
Dynamic Resource Allocation Algorithm using ContainersDynamic Resource Allocation Algorithm using Containers
Dynamic Resource Allocation Algorithm using Containers
 
A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
A General Purpose Extensible Scanning Query Architecture for Ad Hoc AnalyticsA General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
A General Purpose Extensible Scanning Query Architecture for Ad Hoc Analytics
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Apache Lucene Searching The Web
Apache Lucene Searching The WebApache Lucene Searching The Web
Apache Lucene Searching The Web
 
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)Apache Lucene: Searching the Web and Everything Else (Jazoon07)
Apache Lucene: Searching the Web and Everything Else (Jazoon07)
 
OpenStack Collaboration made in heaven with Heat, Mistral, Neutron and more..
OpenStack Collaboration made in heaven with Heat, Mistral, Neutron and more..OpenStack Collaboration made in heaven with Heat, Mistral, Neutron and more..
OpenStack Collaboration made in heaven with Heat, Mistral, Neutron and more..
 
Modern DevOps with Spinnaker/Concourse and Micrometer
Modern DevOps with Spinnaker/Concourse and MicrometerModern DevOps with Spinnaker/Concourse and Micrometer
Modern DevOps with Spinnaker/Concourse and Micrometer
 
Fast Data Analytics with Spark and Python
Fast Data Analytics with Spark and PythonFast Data Analytics with Spark and Python
Fast Data Analytics with Spark and Python
 

More from Sease

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors LuceneSease
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...Sease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneSease
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Sease
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveSease
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaSease
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale IndexingSease
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfSease
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxSease
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingSease
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Sease
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationSease
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneSease
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSease
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information RetrievalSease
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationSease
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to RankSease
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusSease
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupSease
 

More from Sease (20)

Multi Valued Vectors Lucene
Multi Valued Vectors LuceneMulti Valued Vectors Lucene
Multi Valued Vectors Lucene
 
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
When SDMX meets AI-Leveraging Open Source LLMs To Make Official Statistics Mo...
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Introducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache LuceneIntroducing Multi Valued Vectors Fields in Apache Lucene
Introducing Multi Valued Vectors Fields in Apache Lucene
 
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
Stat-weight Improving the Estimator of Interleaved Methods Outcomes with Stat...
 
How does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspectiveHow does ChatGPT work: an Information Retrieval perspective
How does ChatGPT work: an Information Retrieval perspective
 
How To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With KibanaHow To Implement Your Online Search Quality Evaluation With Kibana
How To Implement Your Online Search Quality Evaluation With Kibana
 
Large Scale Indexing
Large Scale IndexingLarge Scale Indexing
Large Scale Indexing
 
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdfWord2Vec model to generate synonyms on the fly in Apache Lucene.pdf
Word2Vec model to generate synonyms on the fly in Apache Lucene.pdf
 
How to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptxHow to cache your searches_ an open source implementation.pptx
How to cache your searches_ an open source implementation.pptx
 
Online Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr InterleavingOnline Testing Learning to Rank with Solr Interleaving
Online Testing Learning to Rank with Solr Interleaving
 
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
Rated Ranking Evaluator Enterprise: the next generation of free Search Qualit...
 
Apache Lucene/Solr Document Classification
Apache Lucene/Solr Document ClassificationApache Lucene/Solr Document Classification
Apache Lucene/Solr Document Classification
 
Advanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache LuceneAdvanced Document Similarity with Apache Lucene
Advanced Document Similarity with Apache Lucene
 
Search Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer PerspectiveSearch Quality Evaluation: a Developer Perspective
Search Quality Evaluation: a Developer Perspective
 
Introduction to Music Information Retrieval
Introduction to Music Information RetrievalIntroduction to Music Information Retrieval
Introduction to Music Information Retrieval
 
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality EvaluationRated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
Rated Ranking Evaluator: an Open Source Approach for Search Quality Evaluation
 
Explainability for Learning to Rank
Explainability for Learning to RankExplainability for Learning to Rank
Explainability for Learning to Rank
 
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @ChorusRated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
Rated Ranking Evaluator (RRE) Hands-on Relevance Testing @Chorus
 
Interactive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval MeetupInteractive Questions and Answers - London Information Retrieval Meetup
Interactive Questions and Answers - London Information Retrieval Meetup
 

Recently uploaded

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 

Recently uploaded (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Neural Search Comes to Apache Solr: Approximate Nearest Neighbor, BERT and More (Buzzwords)

  • 1. Neural Search Comes to Apache Solr: Approximate Nearest Neighbor, BERT and More (Buzzwords)! 
 Alessandro Benedetti, CEO 13th June 2022
  • 2. ‣ Born in Tarquinia(ancient Etruscan city in Italy) ‣ R&D Software Engineer ‣ Director ‣ Master in Computer Science ‣ PC member for ECIR, SIGIR and Desires ‣ Apache Lucene/Solr PMC member/committer ‣ Elasticsearch expert ‣ Semantic, NLP, Machine Learning technologies passionate ‣ Beach Volleyball player and Snowboarder Who We Are Alessandro Benedetti
  • 3. ‣ Headquarter in London/distributed ‣ Open Source Enthusiasts ‣ Apache Lucene/Solr experts ‣ Elasticsearch experts ‣ Community Contributors ‣ Active Researchers ‣ Hot Trends : Neural Search, Natural Language Processing Learning To Rank, Document Similarity, Search Quality Evaluation, Relevancy Tuning SEArch SErvices www.sease.io
  • 4. Semantic Search Problems Neural (Vector-based) Search Apache Solr Implementation BERT to the rescue! Future Works Overview
  • 5. Semantic Search Problems Neural (Vector-based) Search Apache Solr Implementation BERT to the rescue! Future Works Overview
  • 6. How many people live in Rome? Rome’s population is 4.3 million Hundreds of people queuing for live music in Rome Vocabulary mismatch problem False Positive
  • 7. How big is a tiger? The tiger is the biggest member of the Felidae family Panthera Tigris can reach 390cm nose to tail Vocabulary mismatch problem False Negative
  • 9. Semantic Search Problems Neural (Vector-based) Search Apache Solr Implementation BERT to the rescue! Future Works Overview
  • 10. Vector representation for query/documents Sparse Dense ● e.g. Bag-of-words approach ● each term in the corpus dictionary -> one vector dimension ● number of dimensions -> term dictionary cardinality ● the vector for any given document contains mostly zeroes ● Fixed number of dimensions ● Normally much lower than term dictionary cardinality ● the vector for any given document contains mostly non-zeroes How can you generate such dense vectors? D=[0, 0, 0, 0, 1, 0, 0, 0, 1, 0] D=[0.7, 0.9, 0.4, 0.6, 1, 0.4, 0.7, 0.8, 1, 0.9]
  • 11. Neural Search Training Indexing Searching Labeled Samples Text to Vectors Query to Vector Lookup in Index
  • 12. Neural Search Workflow Similarity between a Query and a Document is translated to distance in a vector space
  • 13. Distance Between Vectors https://www.pinecone.io/learn/what-is-similarity-search/ https://towardsdatascience.com/importance-of-distance-metrics-in-machine-learning-modelling-e51395ffe60d ● Specify the relationship metric between elements in the dataset ● use-case dependant ○ experiment which one works better for you! ● In Information Retrieval Cosine similarity proved to work quite well (it’s a normalised inner product) https://www.baeldung.com/cs/euclidean-distance-vs-cosine-similarity
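To make the choice of metric concrete, here is a minimal plain-Java sketch (class and method names are ours, not Lucene's) of the three usual options computed on a pair of float vectors:

```java
/**
 * Minimal illustration of common vector distance/similarity metrics.
 * Not library code; for experimentation only.
 */
public final class VectorDistances {

  static float dotProduct(float[] a, float[] b) {
    float dot = 0f;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
    }
    return dot;
  }

  static float euclideanDistance(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return (float) Math.sqrt(sum);
  }

  /** Cosine similarity is a normalised dot product. */
  static float cosine(float[] a, float[] b) {
    return dotProduct(a, b)
        / (float) (Math.sqrt(dotProduct(a, a)) * Math.sqrt(dotProduct(b, b)));
  }

  public static void main(String[] args) {
    float[] query = {0.7f, 0.9f, 0.4f, 0.6f};
    float[] document = {0.6f, 0.8f, 0.5f, 0.7f};
    System.out.println("euclidean = " + euclideanDistance(query, document));
    System.out.println("dot       = " + dotProduct(query, document));
    System.out.println("cosine    = " + cosine(query, document));
  }
}
```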
  • 14. Nearest Neighbour Retrieval (KNN) The Relevance score is calculated with vector similarity distance metric. Closer vectors means higher semantic similarity.
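Exact KNN retrieval can be sketched as a brute-force scan: score every indexed vector against the query and keep the k best. The toy example below (hypothetical names, dot product as the similarity) also shows why this gets expensive: every document vector is visited.

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/**
 * Toy exact (brute-force) k-nearest-neighbour search: score every document
 * vector against the query and keep the k highest-scoring ones.
 */
public final class ExactKnn {

  /** Dot product as the similarity; any of the metrics above could be plugged in. */
  static float similarity(float[] a, float[] b) {
    float dot = 0f;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
    }
    return dot;
  }

  /** Returns the indices of the k most similar document vectors. */
  static List<Integer> topK(float[] query, float[][] documents, int k) {
    return IntStream.range(0, documents.length)
        .boxed()
        .sorted(Comparator.comparingDouble((Integer i) -> similarity(query, documents[i])).reversed())
        .limit(k)
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    float[][] documents = {
        {0.9f, 0.1f, 0.0f},
        {0.1f, 0.8f, 0.1f},
        {0.7f, 0.2f, 0.1f}
    };
    // Every document is scored: O(number of documents) similarity computations
    System.out.println(topK(new float[] {1.0f, 0.0f, 0.0f}, documents, 2));
  }
}
```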
  • 15. ANN - Approximate Nearest Neighbor ● Exact Nearest Neighbor is expensive! (1vs1 vector distance) ● it’s fine to lose accuracy to get a massive performance gain ● pre-process the dataset to build index data structures ● Generally vectors are quantized(compressed) and then modelled in: ○ Trees - partitioning of the vector space (k-d tree) https://en.wikipedia.org/wiki/K-d_tree#Nearest_neighbour_search ○ Hashes - reducing high dimensionality preserving differences and grouping similar objects https://towardsdatascience.com/locality-sensitive-hashing-for-music-search-f2f1940ace23 ○ Graphs - HNSW
  • 16. HNSW - Hierarchical Navigable Small World graphs Hierarchical Navigable Small World (HNSW) graphs are among the top-performing index-time data structures for approximate nearest neighbor search (ANN). References https://doi.org/10.1016/j.is.2013.10.006 https://arxiv.org/abs/1603.09320
  • 17. HNSW - How it works in a nutshell ● Proximity graph ● Vertices are vectors; closer vertices are linked ● Hierarchical layers based on skip lists ○ longer edges in higher layers (fast retrieval) ○ shorter edges in lower layers (accuracy) ● Each layer is a Navigable Small World graph ○ greedy search for the closest friend (local minimum) ○ the higher the degree of the vertices (number of connections), the lower the probability of hitting a local minimum (but the search is more expensive) ○ move down a layer to refine the minimum (closest friend)
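A toy sketch of the greedy search performed within a single layer, under simplified assumptions (an in-memory adjacency list, squared Euclidean distance, a single entry point); this is illustrative only, not Lucene's HNSW code:

```java
import java.util.List;
import java.util.Map;

/**
 * Greedy search inside one layer of a navigable small world graph:
 * start from an entry point and keep moving to the neighbour closest to the
 * query until no neighbour improves the distance (a local minimum).
 */
public final class GreedyLayerSearch {

  static float squaredEuclidean(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      float d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /**
   * @param query      target vector
   * @param vectors    vector of each vertex, indexed by vertex id
   * @param neighbours adjacency list of the layer
   * @param entryPoint vertex id to start from (comes from the layer above)
   * @return the vertex id of the local minimum found in this layer
   */
  static int greedySearch(float[] query, float[][] vectors,
                          Map<Integer, List<Integer>> neighbours, int entryPoint) {
    int current = entryPoint;
    float currentDistance = squaredEuclidean(query, vectors[current]);
    boolean improved = true;
    while (improved) {
      improved = false;
      for (int candidate : neighbours.getOrDefault(current, List.of())) {
        float candidateDistance = squaredEuclidean(query, vectors[candidate]);
        if (candidateDistance < currentDistance) {
          current = candidate;
          currentDistance = candidateDistance;
          improved = true;
        }
      }
    }
    return current; // passed as the entry point to the next (lower) layer
  }

  public static void main(String[] args) {
    float[][] vectors = {{0.0f, 0.0f}, {1.0f, 0.0f}, {0.0f, 1.0f}, {2.0f, 2.0f}};
    Map<Integer, List<Integer>> neighbours = Map.of(
        0, List.of(1, 2),
        1, List.of(0, 3),
        2, List.of(0, 3),
        3, List.of(1, 2));
    // Nearest vertex to the query, starting the walk from vertex 0
    System.out.println(greedySearch(new float[] {1.9f, 2.1f}, vectors, neighbours, 0));
  }
}
```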
  • 18. Semantic Search Problems Neural (Vector-based) Search Apache Solr Implementation BERT to the rescue! Future Works Overview
  • 19. Nov 2020 - Apache Lucene 9.0 Dedicated File Format for Navigable Small World Graphs https://issues.apache.org/jira/browse/LUCENE-9004 Jan 2022 - Apache Lucene 9.0 Handle Document Deletions https://issues.apache.org/jira/browse/LUCENE-10040 Feb 2022 - Apache Lucene 9.1 Introduced Hierarchy in HNSW https://issues.apache.org/jira/browse/LUCENE-10054 Mar 2022 - Apache Lucene 9.1 Re-use data structures across HNSW Graph https://issues.apache.org/jira/browse/LUCENE-10391 Mar 2022 - Apache Lucene 9.1 Pre filters with KNN queries https://issues.apache.org/jira/browse/LUCENE-10382 JIRA ISSUES https://issues.apache.org/jira/issues/?jql=project%20%3D%20LUCENE%20AND%20labels%20%3D%20vector-based-search Apache Lucene implementation
  • 20. org.apache.lucene.index.VectorSimilarityFunction /** Euclidean distance */ EUCLIDEAN /** * Dot product. NOTE: this similarity is intended as an optimized way to perform cosine * similarity. In order to use it, all vectors must be of unit length, including both document and * query vectors. Using dot product with vectors that are not unit length can result in errors or * poor search results. */ DOT_PRODUCT /** * Cosine similarity. NOTE: the preferred way to perform cosine similarity is to normalize all * vectors to unit length, and instead use {@link VectorSimilarityFunction#DOT_PRODUCT} . You * should only use this function if you need to preserve the original vectors and cannot normalize * them in advance. */ COSINE Apache Lucene - indexing
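The enum also exposes a compare(float[], float[]) method, which makes it easy to experiment with the three options outside of an index. How the raw distance maps to the returned value differs per function (and has changed across Lucene versions; see the LUCENE-10593 slide later), so treat this as an exploratory sketch; the class name and test vectors are ours:

```java
import org.apache.lucene.index.VectorSimilarityFunction;

public final class SimilarityFunctionDemo {
  public static void main(String[] args) {
    float[] query = {0.7f, 0.9f, 0.4f, 0.6f};
    float[] doc = {0.6f, 0.8f, 0.5f, 0.7f};

    // Same vector pair, evaluated with each available similarity function
    for (VectorSimilarityFunction function : VectorSimilarityFunction.values()) {
      System.out.println(function + " -> " + function.compare(query, doc));
    }
  }
}
```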
  • 21. org.apache.lucene.document.KnnVectorField private static Field randomKnnVectorField(Random random, String fieldName) { VectorSimilarityFunction similarityFunction = RandomPicks.randomFrom(random, VectorSimilarityFunction.values()); float[] values = new float[randomIntBetween(1, 10)]; for (int i = 0; i < values.length; i++) { values[i] = randomFloat(); } return new KnnVectorField(fieldName, values, similarityFunction); } Document doc = new Document(); doc.add(new KnnVectorField("field", new float[] {j, j}, VectorSimilarityFunction.EUCLIDEAN)); Apache Lucene - indexing
  • 22. org.apache.lucene.codecs.lucene92.Lucene92HnswVectorsWriter Lucene91HnswVectorsReader.OffHeapVectorValues offHeapVectors = new Lucene91HnswVectorsReader.OffHeapVectorValues( vectors.dimension(), docsWithField.cardinality(), null, vectorDataInput); OnHeapHnswGraph graph = offHeapVectors.size() == 0 ? null : writeGraph(offHeapVectors, fieldInfo.getVectorSimilarityFunction()); org.apache.lucene.util.hnsw.HnswGraphBuilder public Lucene91Codec(Mode mode) { super("Lucene91"); this.storedFieldsFormat = new Lucene90StoredFieldsFormat(Objects.requireNonNull(mode).storedMode); this.defaultPostingsFormat = new Lucene90PostingsFormat(); this.defaultDVFormat = new Lucene90DocValuesFormat(); this.defaultKnnVectorsFormat = new Lucene91HnswVectorsFormat(); } Apache Lucene - indexing
  • 23. org.apache.lucene.search.KnnVectorQuery /** * Find the <code>k</code> nearest documents to the target vector according to the vectors in the * given field. * * @param field a field that has been indexed as a {@link KnnVectorField}. * @param target the target of the search * @param k the number of documents to find * @param filter a filter applied before the vector search * @throws IllegalArgumentException if <code>k</code> is less than 1 */ public KnnVectorQuery(String field, float[] target, int k, Query filter) { this.field = field; this.target = target; this.k = k; if (k < 1) { throw new IllegalArgumentException("k must be at least 1, got: " + k); } this.filter = filter; } Apache Lucene - searching
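A rough sketch of running such a query with an IndexSearcher; the index path, field name and vector values are placeholders, and the three-argument constructor without a filter is used here:

```java
import java.nio.file.Paths;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public final class KnnSearchExample {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
      IndexSearcher searcher = new IndexSearcher(reader);

      float[] target = {1.0f, 2.0f, 3.0f, 4.0f};
      // Top 10 nearest neighbours of the target vector in the "vector" field
      TopDocs topDocs = searcher.search(new KnnVectorQuery("vector", target, 10), 10);

      for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
        System.out.println(scoreDoc.doc + " -> " + scoreDoc.score);
      }
    }
  }
}
```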
  • 24. May 2022 - Apache Solr 9.0 Sease introduced support for KNN search (HNSW) https://issues.apache.org/jira/browse/SOLR-15880 JIRA ISSUES https://issues.apache.org/jira/browse/SOLR-15880?jql=labels%20%3D%20vector-based-search Apache Solr implementation
  • 25. Apache Solr 9.0 - Schema DenseVectorField The dense vector field supports indexing and searching dense vectors of float elements. For example: [1.0, 2.5, 3.7, 4.1] Here's how DenseVectorField should be configured in the schema: <fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine"/> <field name="vector" type="knn_vector" indexed="true" stored="true"/> vectorDimension The dimension of the dense vector to pass in. Accepted values: any integer <= 1024. similarityFunction Vector similarity function; used at search time to return the top K most similar vectors to a target vector. Accepted values: euclidean, dot_product or cosine. ● euclidean: Euclidean distance ● dot_product: Dot product ● cosine: Cosine similarity
  • 26. <config> <codecFactory class="solr.SchemaCodecFactory"/> <fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine" codecFormat="Lucene90HnswVectorsFormat" hnswMaxConnections="10" hnswBeamWidth="40"/> solrconfig.xml schema.xml Optional Default: 16 hnswMaxConnections (advanced) This parameter is specific to the Lucene90HnswVectorsFormat codec format: it controls how many of the nearest neighbor candidates are connected to the new node. It has the same meaning as M from the 2018 paper. Accepted values: any integer. Optional Default: 100 hnswBeamWidth (advanced) This parameter is specific to the Lucene90HnswVectorsFormat codec format: it is the number of nearest neighbor candidates to track while searching the graph for each newly inserted node. It has the same meaning as efConstruction from the 2018 paper. Accepted values: any integer. Apache Solr 9.0 - Schema
  • 27. Apache Solr 9.0 - Indexing JSON [{ "id": "1", "vector": [1.0, 2.5, 3.7, 4.1] }, { "id": "2", "vector": [1.5, 5.5, 6.7, 65.1] } ] XML <add> <doc> <field name="id">1</field> <field name="vector">1.0</field> <field name="vector">2.5</field> <field name="vector">3.7</field> <field name="vector">4.1</field> </doc> <doc> <field name="id">2</field> <field name="vector">1.5</field> <field name="vector">5.5</field> <field name="vector">6.7</field> <field name="vector">65.1</field> </doc> </add> SolrJ final SolrClient client = getSolrClient(); final SolrInputDocument d1 = new SolrInputDocument(); d1.setField("id", "1"); d1.setField("vector", Arrays.asList(1.0f, 2.5f, 3.7f, 4.1f)); final SolrInputDocument d2 = new SolrInputDocument(); d2.setField("id", "2"); d2.setField("vector", Arrays.asList(1.5f, 5.5f, 6.7f, 65.1f)); client.add(Arrays.asList(d1, d2)); N.B. from an indexing and storing perspective, a dense vector field is no different from an array of float elements
  • 28. Apache Solr 9.0 - Searching knn Query Parser The knn (k-nearest neighbors) query parser finds the k nearest documents to the target vector, according to the indexed dense vectors in the given field. Required Default: none f The DenseVectorField to search in. Optional Default: 10 topK How many k-nearest results to return. Here's how to run a KNN search: e.g. &q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]
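As a sketch of how the same request might be issued from SolrJ; the Solr URL, collection name and field are illustrative, and error handling is omitted:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.Http2SolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public final class KnnQueryExample {
  public static void main(String[] args) throws Exception {
    try (SolrClient client =
             new Http2SolrClient.Builder("http://localhost:8983/solr/dense").build()) {

      // knn query parser: top 10 nearest neighbours of the target vector
      SolrQuery query = new SolrQuery("{!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]");
      query.setFields("id", "score");

      QueryResponse response = client.query(query);
      for (SolrDocument document : response.getResults()) {
        System.out.println(document);
      }
    }
  }
}
```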
  • 29. Apache Solr 9.0 - Searching with Filter Queries (fq) When using knn in these scenarios, make sure you have a clear understanding of how filter queries work in Apache Solr: the ranked list of document IDs resulting from the main query q is intersected with the set of document IDs deriving from each filter query fq. e.g. Ranked List from q=[ID1, ID4, ID2, ID10] <intersects> Set from fq={ID3, ID2, ID9, ID4} = [ID4, ID2] Usage with Filter Queries The knn query parser can be used in filter queries: &q=id:(1 2 3)&fq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0] The knn query parser can be used with filter queries: &q={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]&fq=id:(1 2 3) N.B. this can be called 'post filtering'; 'pre filtering' is coming in a future release
  • 30. Apache Solr 9.0 - Reranking When using knn in re-ranking pay attention to the topK parameter. The second pass score (deriving from knn) is calculated only if the document d from the first pass is within the k-nearest neighbors (in the whole index) of the target vector to search. This means the second pass knn is executed on the whole index anyway, which is a current limitation. The final ranked list of results will have the first pass score (main query q) added to the second pass score (the approximated similarityFunction distance to the target vector to search) multiplied by a multiplicative factor (reRankWeight). Details about using the ReRank Query Parser can be found in the Query Re-Ranking section. Usage as Re-Ranking Query The knn query parser can be used to rerank first pass query results: &q=id:(3 4 9 2)&rq={!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}&rqq={!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0] N.B. pure re-scoring is coming in a later release
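A minimal SolrJ sketch of the same re-ranking request, reusing the parameter values from the example above (everything else is illustrative):

```java
import org.apache.solr.client.solrj.SolrQuery;

public final class KnnRerankExample {
  public static void main(String[] args) {
    // First pass: lexical query; second pass: knn similarity used for re-ranking
    SolrQuery query = new SolrQuery("id:(3 4 9 2)");
    query.set("rq", "{!rerank reRankQuery=$rqq reRankDocs=4 reRankWeight=1}");
    query.set("rqq", "{!knn f=vector topK=10}[1.0, 2.0, 3.0, 4.0]");

    // Prints the URL-encoded parameter string; it would be sent with a
    // SolrClient.query(...) call as in the previous sketch.
    System.out.println(query);
  }
}
```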
  • 31. Apache Solr 9.0 - Hybrid Dense/Sparse search …/solr/dense/select?indent=true&q={!bool should=$clause1 should=$clause2} &clause1={!type=field f=id v='901'} &clause2={!knn f=vector topK=10}[0.4,0.5,0.3,0.6,0.8] &fl=id,score,vector "debug":{ "rawquerystring":"{!bool should=$clause1 should=$clause2}", "querystring":"{!bool should=$clause1 should=$clause2}", "parsedquery":"id:901 KnnVectorQuery(KnnVectorQuery:vector[0.4,...][10])", "parsedquery_toString":"id:901 KnnVectorQuery:vector[0.4,...][10]",
  • 32. Apache Solr 9.0 - An Initial Benchmark N.B. this is a quick benchmark; it doesn't necessarily reflect bigger volumes linearly [benchmark chart comparing KNN vs. text queries]
  • 33. Semantic Search Problems Neural (Vector-based) Search Apache Solr Implementation BERT to the rescue! Future Works Overview
  • 34. ● Large language models need to be trained on a lot of data to work well ● … which is normally difficult for the average enterprise/project ● Transformers are pre-trained in an unsupervised way on large corpora (Wikipedia, web …) ● Pre-training captures meaning, topics and word patterns in the language (rather than the specific domain) ● Shares similarities with word2vec (vectors per word) -> the main difference is encoding words/sentences in a context ● A pre-trained model can be fine-tuned for specific tasks (dense retrieval, essay generation, translation, text summarization…) How to produce vectors?
  • 35. Masked Language Modeling Models are pre-trained using the masked language modeling technique: take a text, mask random words and try to predict the masked words. This technique allows the model to capture interactions between words within the context
  • 36. Bidirectional Encoder Representations from Transformers Large model (many dimensions and layers – base: 12 layers and 768 dim.) Special tokens: [CLS] Classification token, used as pooling operator to get a single vector per sequence [MASK] Used in the masked language model, to predict this word [SEP] Used to separate sentences Devlin, Chang, Lee, Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. NAACL 2019. BERT to the rescue! https://sease.io/2021/12/using-bert-to-improve-search-relevance.html
  • 37. Training / Fine-Tuning Self-supervised on ∞ training data Supervised on few labeled examples
  • 39. From Text to Vectors from sentence_transformers import SentenceTransformer import torch import sys BATCH_SIZE = 50 MODEL_NAME = 'msmarco-distilbert-base-dot-prod-v3' model = SentenceTransformer(MODEL_NAME) if torch.cuda.is_available(): model = model.to(torch.device("cuda")) def main(): input_filename = sys.argv[1] output_filename = sys.argv[2] create_embeddings(input_filename, output_filename) https://pytorch.org https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html https://www.sbert.net https://www.sbert.net/docs/pretrained_models.html
  • 40. From Text to Vectors def create_embeddings(input_filename, output_filename): with open(input_filename, 'r', encoding="utf-8") as f: with open(output_filename, 'w') as out: processed = 0 sentences = [] for line in f: sentences.append(line) if len(sentences) == BATCH_SIZE: processed += 1 if (processed % 1000 == 0): print("processed {} documents".format(processed)) vectors = process(sentences) for v in vectors: out.write(','.join([str(i) for i in v])) out.write('\n') sentences = [] Each line in the input file is a sentence. In this example we consider each sentence a separate document to be indexed in Apache Solr.
  • 41. From Text to Vectors def process(sentences): embeddings = model.encode(sentences, show_progress_bar=True) return embeddings embeddings is an array [] of vectors Each vector represents a sentence
  • 42. Semantic Search Problems Neural (Vector-based) Search Apache Solr Implementation BERT to the rescue! Future Works Overview
  • 43. Future Works: [Solr] Codec Agnostic May 2022 - Apache Lucene 9.1 in Apache Solr https://issues.apache.org/jira/browse/SOLR-16204 https://issues.apache.org/jira/browse/SOLR-16245 Currently it is possible to configure the codec to use for the HNSW implementation in Lucene. The proposal is to change this so that the user only configures the algorithm (HNSW now, as the default; maybe IVFFlat in the future? https://issues.apache.org/jira/browse/LUCENE-9136) <fieldType name="knn_vector" class="solr.DenseVectorField" vectorDimension="4" similarityFunction="cosine" codecFormat="Lucene90HnswVectorsFormat" hnswMaxConnections="10" hnswBeamWidth="40"/>
  • 44. Future Works: [Solr] Pre-filtering Mar 2022 - Apache Lucene 9.1 Pre filters with KNN queries https://issues.apache.org/jira/browse/LUCENE-10382 https://issues.apache.org/jira/browse/SOLR-16246 This PR adds support for a query filter in KnnVectorQuery. First, we gather the query results for each leaf as a bit set. Then the HNSW search skips over the non-matching documents (using the same approach as for live docs). To prevent HNSW search from visiting too many documents when the filter is very selective, we short-circuit if HNSW has already visited more than the number of documents that match the filter, and execute an exact search instead. This bounds the number of visited documents at roughly 2x the cost of just running the exact filter, while in most cases HNSW completes successfully and does a lot better.
  • 45. Future Works: [Lucene] VectorSimilarityFunction simplification https://issues.apache.org/jira/browse/LUCENE-10593 The proposal in this Pull Request aims to: 1) make the Euclidean similarity just return the score, in line with the other similarities, with the formula currently used 2) simplify the code, removing the bound checker that's not necessary anymore 3) refactor here and there to be in line with the simplification 4) refactor NeighborQueue to clearly state when it's a MIN_HEAP or MAX_HEAP, so that debugging is much easier and understanding the HNSW code is much more intuitive
  • 46. Future Works: [Solr] BERT Integration We are planning to contribute: 1) an update request processor that takes a BERT model as input and performs the inference (vectorization) at indexing time 2) a query parser that takes a BERT model as input and performs the inference (vectorization) at query time
  • 47. Apache Solr 9.0 - Additional Resources Additional Resources ● Blog: https://sease.io/2022/01/apache-solr-neural-search.html ● Blog: https://sease.io/2022/01/apache-solr-neural-search-knn-benchmark.html ● Blog: https://sease.io/2021/07/artificial-intelligence-applied-to-search-introduction.html ● Blog: https://sease.io/2021/12/using-bert-to-improve-search-relevance.html ● Blog: https://sease.io/2022/01/tackling-vocabulary-mismatch-with-document-expansion.html ● Blog: https://sease.io/2022/04/have-neural-networks-killed-the-inverted-index.html
  • 48. Thanks! Special thanks to: ● Apache Lucene community for all the HNSW goodies ● Elia Porciani - active contributor of code for Apache Solr Neural Search components ● Christine Poerschke for the accurate review ● Cassandra Targett for all the documentation corrections ● Michael Gibney for the discussion on dense vectors, and everyone involved in the review process!