The Neural Search Frontier
The Relevance Engineer's Deep Learning Toolbelt
About us
Doug Turnbull, CTO, OpenSource Connections (http://o19s.com); author of "Relevant Search"
Tommaso Teofili, Computer Scientist, Adobe; ASF member
Discount Code: ctwactivate18
Relevance: challenge of 'aboutness'
Query: bitcoin regulation
[Diagram: the query and two documents plotted against two axes, 'bitcoin' aboutness and 'regulation' aboutness. Doc 2 is more about the query's terms than Doc 1, so it is ranked 'higher'.]
One hypothesis: TF*IDF vector similarity (and BM25 ...)
Query: bitcoin regulation
[Diagram: the same query and documents plotted against 'bitcoin' TF*IDF and 'regulation' TF*IDF axes. Doc 2 is more 'about' the query because it has a higher concentration of the corpus's 'bitcoin' and 'regulation' terms.]
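To make the TF*IDF hypothesis concrete, here is a minimal sketch (not from the talk; the toy docs, the query, and the use of scikit-learn are illustrative assumptions) that ranks documents by cosine similarity of their TF-IDF vectors to the query vector:

```python
# Minimal sketch: rank two toy docs for "bitcoin regulation" by TF-IDF cosine similarity.
# The documents and query are illustrative, not from the talk.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "bitcoin mentioned once among many other currency topics",           # Doc 1
    "bitcoin regulation debated as regulators study bitcoin exchanges",  # Doc 2
]
query = "bitcoin regulation"

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)   # one TF-IDF vector per doc
query_vector = vectorizer.transform([query])   # the query in the same vector space

scores = cosine_similarity(query_vector, doc_vectors)[0]
for doc_id, score in sorted(enumerate(scores), key=lambda x: -x[1]):
    print(f"Doc {doc_id + 1}: {score:.3f}")    # Doc 2 should rank higher
```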
Sadly, 'aboutness' is hard
[Chart: possible search quality vs. effort. Simple TF*IDF gets you something after minutes; closing the remaining gap takes years of relevance / feature engineering work.]
Why!?
● Searcher vocab mismatches
● Searchers expect context (personal, emotional, ...)
● Different entities to match to
● Language evolves
● Misspellings, typos, etc.
● TF*IDF doesn't always matter
There's Something About Aboutness?
Neural Search Frontier
Hunt for the About October?
The Last Sharknado: It's *About* Time?
Embeddings: another vector sim
A vector storing information about the context a word appears in.
Docs:
“Have you heard the hype about bitcoin currency?”
“Bitcoin a currency for hackers?”
“Have you heard the hype about bitcoin currency?”
“Hacking on my Bitcoin server! It's not hyped”
Encode contexts => [0.6, -0.4] bitcoin
Easier to see as a graph
[0.5,-0.5] cryptocurrency
[0.6,-0.4] bitcoin
These two terms occur in similar contexts, so their vectors land close together.
(Graph from Desmos)
Document embeddings
Encode the full text as the context for the document.
Doc 1234: “Have you heard of cryptocurrency? It's the cool currency being worked on by hackers! Bitcoin is one prominent example. Some say it's all hype ...”
Encode => [0.5, -0.4] doc1234
Is shared 'context' the same as 'aboutness'!?
Embeddings:
[0.5, -0.5] cryptocurrency
[0.6, -0.4] bitcoin
[0.5, -0.4] doc1234
[0.4, -0.2] doc5678
Query: bitcoin. Docs closer to 'bitcoin' are ranked higher:
1. doc1234
2. doc5678
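A minimal sketch of this ranking idea, using the embeddings shown on the slide (the cosine-similarity scoring is an assumption; any vector similarity would illustrate the point):

```python
# Sketch: rank docs by how close their embeddings sit to the 'bitcoin' term embedding.
import numpy as np

embeddings = {
    "bitcoin": np.array([0.6, -0.4]),
    "doc1234": np.array([0.5, -0.4]),
    "doc5678": np.array([0.4, -0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query_vec = embeddings["bitcoin"]
ranked = sorted(
    (name for name in embeddings if name.startswith("doc")),
    key=lambda name: cosine(query_vec, embeddings[name]),
    reverse=True,
)
print(ranked)  # ['doc1234', 'doc5678'], since doc1234 sits closer to 'bitcoin'
```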
Sadly no :(
I want to cancel my reservation
I want to confirm my reservation
Words that occur in similar contexts can have opposite meanings!!
A doc about canceling is not relevant for 'confirm'.
Fundamentally there’s still a mismatch
Crypto expert creates content in nerd speak: “Fiat currencies have a global conspiracy against dogecoin”
Average citizen searches in lay-speak: “bit dollar?”
Embeddings are derived from the corpus, not our searcher’s language.
Embedding Modeling
Can we improve on embeddings?
A word2vec Model
'Bitcoin' embedding: [5.5, 8.1, ... 0.6]
'Energy' embedding: [6.0, 1.7, ... 8.4]
(both pulled from a term embedding table, initialized with random values)
Dot product (‘similarity’): 5.5*6.0 + 8.1*1.7 + ... + 0.6*8.4 = 102.65
Sigmoid (forces it to a 0..1 ‘probability’): 0.67
What we're trying to predict (0 = out of context, 1 = true context):

Term | Other Term | In Same Context? | Model Prediction
bitcoin | energy | 1 | 0.67
Tweak…
(showing the skip-gram / negative sampling method)

Term | Other Term | True Context? | Prediction
bitcoin | energy | 1 | 0.67
bitcoin | dongles | 0 | 0.78
bitcoin | relevance | 0 | 0.56
bitcoin | cables | 0 | 0.34

Goal: for terms sharing context, tweak ‘bitcoin’ to get the prediction closer to 1.
Goal: for random unrelated terms, tweak ‘bitcoin’ to get the prediction closer to 0.
To get into the weeds on the loss function & gradient descent for word2vec: Stanford CS224D Deep Learning for NLP Lecture Notes (Chaubard, Mundra, Socher).
Maximize Me!
Just the model: F is the sigmoid of the dot product of the two term embeddings, e.g. σ([5.5, 8.1, ... 0.6] · [6.0, 1.7, ... 8.4]).
The objective maximizes F over ‘true’ contexts and minimizes F over ‘false’ (negative-sampled) contexts; in the standard skip-gram negative-sampling form from the CS224D notes cited above, maximize log σ(v_w · v_c) + Σ_k log σ(−v_w · v_k).
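A toy sketch of this model and its 'tweak' step, assuming the skip-gram / negative-sampling setup the slides describe; the vocabulary, dimensions, learning rate, and training loop are all illustrative:

```python
# Toy sketch: sigmoid(dot(word_vec, context_vec)) predicts "same context?",
# and gradient steps tweak the embeddings toward the 1/0 label.
import numpy as np

rng = np.random.default_rng(0)
vocab = ["bitcoin", "energy", "dongles", "relevance", "cables"]
dim = 8
embeddings = {w: rng.normal(scale=0.1, size=dim) for w in vocab}  # random init

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(word, other, label, lr=0.1):
    """One negative-sampling style update: push the prediction toward the label."""
    w, c = embeddings[word], embeddings[other]
    pred = sigmoid(w @ c)        # dot product -> 0..1 'probability'
    grad = pred - label          # derivative of the log loss wrt the dot product
    embeddings[word] = w - lr * grad * c
    embeddings[other] = c - lr * grad * w
    return pred

# 'bitcoin'/'energy' share a context (label 1); the rest are random negatives (label 0).
for _ in range(200):
    train_pair("bitcoin", "energy", 1)
    for negative in ("dongles", "relevance", "cables"):
        train_pair("bitcoin", negative, 0)

print(sigmoid(embeddings["bitcoin"] @ embeddings["energy"]))   # close to 1
print(sigmoid(embeddings["bitcoin"] @ embeddings["dongles"]))  # close to 0
```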
Did you just neural network?
[Diagram: the Word1 and Word2 embeddings feed a single sigmoid 'neuron' (the dot product is implicit). F is our fitness function; the partial derivatives dF/dv[0], dF/dv[1], dF/dv[2], ... tell us how to tweak each embedding component to maximize fit.]
Backpropagate tweaks to the weights to reduce error.
More complex / 'deep' neural nets can propagate error back to earlier layers to learn weights.
Doc2Vec: train a paragraph vector with the term vectors
'Bitcoin' embedding: [5.5, 8.1, ... 0.6] (term embedding matrix)
'Energy' embedding: [6.0, 1.7, ... 8.4] (term embedding matrix)
Paragraph embedding: [2.0, 0.5, ... 1.4] (paragraph embedding matrix)
Dot product (‘similarity’): 5.5*6.0*2.0 + 8.1*1.7*0.5 + ... + 0.6*8.4*1.4 = 102.65
Sigmoid (forces to 0..1): 0.67 (out of context ... true context)
Backpropagate into both the term and paragraph embedding matrices.
The same negative sampling can apply here
(I’m continuing the thread of negative sampling, but doc2vec need not use negative sampling.)

Term | Para | True Context? | Prediction
bitcoin | Doc 1234, Para 1 | 1 | 0.67
bitcoin | Doc 5678, Para 5 | 0 | 0.78
bitcoin | Doc 1234, Para 8 | 0 | 0.56
bitcoin | Doc 1537, Para 1 | 0 | 0.34

For terms sharing context, tweak the embeddings to get the prediction closer to 1.
For random unrelated docs, tweak the embeddings to get the prediction closer to 0.
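In practice this is roughly what a doc2vec library does for you. A sketch assuming gensim 4.x is available (the tiny documents and tags are made up):

```python
# Sketch of doc2vec in practice (assuming gensim 4.x); docs and tags are illustrative.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [
    TaggedDocument(words="have you heard the hype about bitcoin currency".split(),
                   tags=["doc1234"]),
    TaggedDocument(words="hacking on my bitcoin server it is not hyped".split(),
                   tags=["doc5678"]),
]

model = Doc2Vec(documents=docs, vector_size=16, window=3, min_count=1, epochs=50)

doc_vector = model.dv["doc1234"]         # paragraph vector trained with the term vectors
word_vector = model.wv["bitcoin"]        # term vector from the same model
print(model.dv.most_similar("doc1234"))  # nearest documents in the shared space
```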
Embedding Modeling:
...For years you've been manipulating this...
[0.5, -0.4] doc1234: your new clay to mold...
Manipulating the construction of embeddings to measure something *domain specific*.
You're *about* to see some fun embedding hacks...
Query-based embeddings?

Query Term | Other Q. Term | Same Session? | Relevance Score
bitcoin | regulation | 1 | 0.67
bitcoin | bananas | 0 | 0.78
bitcoin | headaches | 0 | 0.56
bitcoin | daycare | 0 | 0.34

For query terms that co-occur in the same session, tweak the embeddings to get the score closer to 1.
For random unrelated terms, tweak the embeddings to get the prediction closer to 0.
Embeddings from judgments?
(judgments derived from clicks, etc.)

Query Term | Doc | Relevant? | Relevance Score
bitcoin | Doc 1234 | 1 | 0.67
bitcoin | Doc 5678 | 0 | 0.78
bitcoin | Doc 1234 | 0 | 0.56
bitcoin | Doc 1537 | 0 | 0.34

For a relevant doc for the term, tweak the embeddings to get the query-doc score closer to 1.
For random unrelated docs, tweak the embeddings to get the prediction closer to 0.
Pretrain word2vec with corpus/sessions?
(the corpus model acts as a kind of prior)

Query Term | Doc | Corpus Model | True Relevance | Tweaked Score
bitcoin | Doc 1234 | 0 | 1 | 0.01
bitcoin | Doc 5678 | 1 | 0 | 0.99
bitcoin | Doc 1234 | 0 | 0 | 0.56
bitcoin | Doc 1537 | 0 | 0 | 0.34

It takes a lot to overcome the original unified model, and it's a case-by-case basis.
An online / evolutionary approach to converge on improved embeddings?
Geometrically adjusted query embeddings?
Averaging query vectors by means of some possibly relevant documents.
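A sketch of this averaging idea (the 50/50 mix of the query vector and the document centroid is an illustrative assumption, not a prescription from the talk):

```python
# Sketch: nudge the query embedding toward the centroid of possibly relevant docs
# (clicked docs, top results, etc.). The 0.5/0.5 mix is an illustrative choice.
import numpy as np

query_vec = np.array([0.6, -0.4])            # e.g. 'bitcoin'
relevant_doc_vecs = np.array([
    [0.5, -0.4],                             # doc1234
    [0.4, -0.2],                             # doc5678
])

doc_centroid = relevant_doc_vecs.mean(axis=0)
adjusted_query = 0.5 * query_vec + 0.5 * doc_centroid  # geometrically adjusted query
print(adjusted_query)
```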
Embeddings go beyond language

User | Item | Purchased? | Recommended?
Tom | Jeans | 1 | 0.67
Tom | Khakis | 0 | 0.78
Tom | iPad | 0 | 0.56
Tom | Dress Shoes | 0 | 0.34

For purchased items, tweak the embeddings to get the user-item score closer to 1.
For other items, tweak the embeddings to get the prediction closer to 0.
(Use the resulting embeddings to find similar items / users.)
More features beyond just 'context'

Query Term | Doc | Sentiment | Same Context? | Tweaked Score
cancel | Doc 1234 | Angry | 1 | 0.01
confirm | Doc 5678 | Angry | 0 | 0.99
cancel | Doc 1234 | Happy | 0 | 0.56
confirm | Doc 1537 | Happy | 0 | 0.34

See: https://arxiv.org/abs/1805.07966
Evolutionary / contextual bandit embeddings?
(the corpus model acts as a kind of prior)

Query Term | Doc | Corpus Model | True Relevance | Tweaked Score
bitcoin | Doc 1234 | 0 | 1 | 0.01
bitcoin | Doc 5678 | 1 | 0 | 0.99
bitcoin | Doc 1234 | 0 | 0 | 0.56
bitcoin | Doc 1537 | 0 | 0 | 0.34

Pull the embedding in the direction your KPIs seem to say is successful.
An online / evolutionary approach to converge on improved embeddings?
Embedding Frontier...
There's a lot we could do
- Model specificity: high specificity terms cluster differently than low specificity terms
- You often data-model/clean A LOT before computing your embeddings (just like you do for your search index)
- See Simon Hughes' "Vectors in Search" talk on encoding vectors and using them in Solr
Neural Language Models
Translators grok 'aboutness'
“Serĉa konferenco”
What is in the head of this person that 'groks' Esperanto?
'a search conference?'
Language Model
Can seeing text and guessing what would "come next" help us on our quest for 'aboutness'?
“eat __?_____”
cat: 0.001
...
chair: 0.001
pizza: 0.02 (highest probability)
...
nap: 0.001
Obvious applications: autosuggest, etc.
Markov language model, vocab size V
From the corpus we can build a V x V transition matrix reflecting the frequency of word transitions:

        pizza    chair    nap     ...
eat     0.02     0.0002   0.0001  ...
cat     0.0001   0.0001   0.02    ...
...     ...      ...      ...     ...

The ('eat', 'pizza') entry is the probability of 'pizza' following 'eat' (as in “eat pizza”).
Transition Matrix Math
Let's say we don't 100% know the 'current' word. Can we still guess the 'next' word?
Pcurr, a distribution over the V words: eat 0.25, cat 0.40, ...
Pnext = Pcurr x Transition Matrix
e.g. P('pizza' next) = 0.25*0.02 + 0.40*0.0001 + ..., and likewise for P(chair), P(nap), ...
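The same product in code, using the slide's toy numbers (the matrix is truncated, so the rows are not proper probability distributions):

```python
# The slide's matrix-vector product, sketched with a truncated toy vocabulary.
import numpy as np

#                 pizza    chair    nap
T = np.array([[0.02,    0.0002,  0.0001],   # transitions from 'eat'
              [0.0001,  0.0001,  0.02  ]])  # transitions from 'cat'

p_curr = np.array([0.25, 0.40])             # P(current word = eat), P(current word = cat)
p_next = p_curr @ T                         # distribution over the next word

print(dict(zip(["pizza", "chair", "nap"], p_next)))
# P(pizza next) = 0.25*0.02 + 0.40*0.0001, exactly the sum on the slide
```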
Language model for embeddings?
The ‘eat’ embedding (from the term embedding matrix), [5.5, 8.1, ... 0.6], multiplied by an h x h transition matrix (weights learned through backpropagation), gives [0.5, 6.1, ... -4.8]: the embedding of the next word, which probably clusters near ‘pizza’, etc.
Accuracy requires context...
Word => Next Word: “eat” => “pizza!”
Context + Word => Next Word (and new context): “The race is on! ... eat” => “dust!”
Context + new word => new context / word
The ‘embedding’ markov model from earlier (Word => Next Word) becomes: Old Context + Word => New Context & Word.
[Diagram: the ‘eat’ input embedding times an Input -> New Context transition matrix, plus the previous context vector times a Prev Context -> New Context transition matrix, sums to the New Context vector; the New Context times a Context -> Output transition matrix yields the output embedding, the ‘embedding’ for the next word. Then repeat for the next word...]
Simpler view
[Diagram: an unrolled network reading “race is on ... eat”. Each input word embedding passes through W_xh, the hidden state carries forward through W_hh, and W_hy produces the predicted next word or its embedding (e.g. ‘dust’). Weights are learned through backpropagation, trained on true sentences in our corpus.]
(see the Unbearable Effectiveness of Recurrent Neural Networks)
Psssst… this is a Recurrent Neural Network (RNN)!
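A minimal sketch of the recurrence the diagram describes; the tanh nonlinearity, the random weights, and all dimensions are illustrative assumptions (a real model would learn W_xh, W_hh, W_hy by backpropagation):

```python
# Minimal recurrence: input embedding through W_xh, previous context through W_hh,
# their sum is the new context, and W_hy maps context to an output (next-word) embedding.
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim = 4, 3

W_xh = rng.normal(size=(hidden_dim, embed_dim))   # input -> new context
W_hh = rng.normal(size=(hidden_dim, hidden_dim))  # previous context -> new context
W_hy = rng.normal(size=(embed_dim, hidden_dim))   # context -> output embedding

def rnn_step(x, h_prev):
    h_new = np.tanh(W_xh @ x + W_hh @ h_prev)     # new context vector
    y = W_hy @ h_new                              # 'embedding' of the predicted next word
    return h_new, y

h = np.zeros(hidden_dim)
for word_embedding in [rng.normal(size=embed_dim) for _ in "race is on eat".split()]:
    h, predicted_next = rnn_step(word_embedding, h)
print(predicted_next)  # in a trained model, this would sit near 'dust'
```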
For search...
At any given point scanning our document, we can get a probability distribution of likely terms:
“A dog is loose! Please call the animal control ...” => likely continuations: dog catcher, phone, yodel, walk to, ...
Inject other contextually likely terms.
Not a silver bullet
“A dog is loose! Please call the animal control ...” (corpus language)
“dog catcher” (searcher language)
We won't learn this ideal language model if the corpus never says 'dog catcher'.
One Vector Space To Rule Them All: Seq2Seq
Translation: RNN encoder/decoder
[Diagram: an Encoder RNN reads “the race is on” word by word (each embedding through W_xh, hidden state carried through W_hh); its final hidden state seeds a Decoder RNN, which starts from <START> and emits “la carrera es ...” one word at a time through W_hy. Weights learned by backpropagation.]
Seq2Seq: The Clown Car of Deep Learning
'Translate' documents to queries?
An Encoder RNN reads the doc (“Animal Control Law: Herefore let it be … and therefore… blah blah blah … The End”); a Decoder RNN emits predicted queries.
Using judgments/clicks as training data:

Doc | Query | Relevant?
Animal Control Law | Dog catcher | 1
Littering Law | Dog Catcher | 0
Can we use graded judgments?

Doc | Query | Relevant?
Animal Control Law | Dog catcher | 4
Animal Leash Law | Dog Catcher | 2

Construct the training data to reflect the weighting, as in the sketch below.
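One possible way to construct such training data (an assumption, not the talk's prescription): repeat each doc-to-query pair in proportion to its grade before training the seq2seq model.

```python
# Possible scheme: repeat each doc -> query pair in proportion to its relevance grade.
graded_judgments = [
    ("Animal Control Law", "dog catcher", 4),
    ("Animal Leash Law",   "dog catcher", 2),
]

training_pairs = []
for doc, query, grade in graded_judgments:
    training_pairs.extend([(doc, query)] * grade)  # grade-4 pairs appear twice as often as grade-2

print(len(training_pairs))  # 6 pairs: 4 + 2
```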
Skip-Thought Vectors (a kind of ‘sentence2vec’)
An Encoder RNN reads “Its fleece was white as snow”; one Decoder RNN predicts the sentence before (“Mary had a little lamb”) and another predicts the sentence after (“And everywhere that Mary went, that lamb was sure to go”).
The encoder output becomes an embedding for the sentence, encoding its semantic meaning.
Skip-Thought Vectors for queries? (‘doc2queryVec’)
An Encoder RNN reads the doc (“Animal Control Law: Herefore let it be … The End”); several Decoder RNNs each predict a relevant query: q=dog catcher, q=loose dog, q=dog fight, ...
The encoder output becomes an embedding mapping docs into the queries' semantic space.
My Thoughts on Skip Thoughts
The Frontier
Anything2vec
Deep learning is especially good at learning representations of:
- Images
- User History
- Songs
- Video
- Text
If everything can be ‘tokenized’ into some kind of token space for retrieval, everything can be ‘vectorized’ into some embedding space for retrieval / ranking.
The Lucene community needs to get better first-class vector support
Similarity API and vectors?
long computeNorm(FieldInvertState state)
SimWeight computeWeight(float boost, CollectionStatistics collectionStats, TermStatistics... termStats)
SimScorer simScorer(SimWeight weight, LeafReaderContext context)
Some progress
- Delimited Term Frequency Filter Factory (use tf to encode vector weights)
- Payloads & Payload Score
Check out Simon Hughes' talk: "Vectors in Search"
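A hedged sketch of the first idea: flattening a dense vector into "token|termFrequency" pairs that a delimited term-frequency filter could index (the dimension naming and the scaling of floats into integer term frequencies are illustrative assumptions):

```python
# Hedged sketch: flatten a dense embedding into "token|tf" pairs for indexing.
def vector_to_delimited_tf(vec, scale=100):
    tokens = []
    for i, value in enumerate(vec):
        tf = max(1, int(round(abs(value) * scale)))  # term frequencies must be positive ints
        tokens.append(f"dim{i}|{tf}")
    return " ".join(tokens)

print(vector_to_delimited_tf([0.5, -0.4]))  # "dim0|50 dim1|40", ready to index
```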
It’s not magic, it’s math!
Join the Search Relevancy Slack Community
http://o19s.com/slack
(projects, chat, conferences, help, book authors, and more…)
Matching: vocab mismatch!
(from "Taxonomical Semantical Magical Search", Doug Turnbull, Lucene/Solr Revolution 2017)
Searcher (lay-speak): “Dog catcher” vs. corpus (legalese): “Animal Control Law”
No match, despite relevant results!
A taxonomist / librarian can map “Dog Catcher” and “Animal Control” to Concept 1234, but it's costly to manually maintain a mapping between searcher & corpus vocabulary.
Ranking optimization is hard!
Query “Dog catcher” + doc (“Animal Control Law: Herefore let it be … The End”) => score doc => Ranked Results.
Ranking can be a hard optimization problem: fine-tuning heuristics such as TF*IDF, Solr/ES queries, analyzers, etc...
LTR? LTR is only as good as your features: Solr/ES queries, based on your skill with Solr/ES queries.
How can deep learning help?
(from "Taxonomical Semantical Magical Search", Doug Turnbull, Lucene/Solr Revolution 2017)
Can we learn, quickly and at scale, the semantic relationships between concepts (e.g. “Dog catcher” <=> “Animal control”)?
What other forms of every-day enrichment could deep learning accelerate?
Sessions as documents
Treat the queries in a search session (“Dog catcher”, “Dog bit me”, “Lost dog”) as a document and run word2vec/doc2vec over the sessions: embeddings built for just query terms, e.g. [5.5, 8.1, ... 0.6]. A sketch follows below.
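A sketch of this, assuming gensim 4.x and treating each session's queries as one 'sentence' (the sessions themselves are made up):

```python
# Sketch of "sessions as documents": each search session's queries form one 'sentence'
# for word2vec, so queries that co-occur in sessions land near each other.
from gensim.models import Word2Vec

sessions = [
    ["dog catcher", "animal control", "loose dog"],
    ["dog catcher", "dog bit me", "lost dog"],
    ["bitcoin", "bitcoin regulation", "cryptocurrency"],
]

model = Word2Vec(sentences=sessions, vector_size=16, window=5, min_count=1, sg=1, epochs=100)
print(model.wv.most_similar("dog catcher"))  # queries that co-occur in sessions rank nearby
```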
Can we translate between searcher & corpus?
The search term embedding [5.5, 8.1, ... 0.6] and the doc embedding [2.0, 0.5, ... 1.4] feed into "insert machine learning here", which outputs a ranking prediction.
(classic LTR could go here)
Embeddings
Words with similar contexts share close embeddings.
Docs:
“Cryptocurrency is not hyped! It's the future”
“Hackers invent their own cryptocurrency”
“Cryptocurrency is a digital currency”
“Hyped cryptocurrency is for hackers!”
Encode contexts =>
[0.5, -0.5] cryptocurrency
[0.6, -0.4] bitcoin



Editor's notes

  1. This doesn't sufficiently generalize the queries. We likely don't have as much training data in relevance judgments as in the unlabeled data from the corpus. This is a kind of point-wise learning to rank.
  2. Alternatively, the relevance-based embeddings are a kind of prior.
  3. We can leverage information about possibly relevant documents (clicks, query logs, etc.) with respect to queries while learning embeddings. We can adjust query vectors and mitigate the possible vocabulary mismatch between the doc and query sets.