In 2013 we witnessed an evolutionary change in the NLP field thanks to the introduction of word embeddings which, combined with deep learning architectures, achieved human-level performance in many NLP tasks. With the introduction of the attention mechanism in 2017 the results improved further and, as a result, embeddings are quickly becoming the de facto standard for solving many NLP problems. In this presentation, you will learn how to generate and use embeddings for search purposes, with comparison metrics against more traditional relevance-based search engines. Moreover, I will provide some initial results from a paper currently under review that offers insight into hyperparameter tuning during the generation of embeddings.
AI-SDV 2022: Embedding-based Search Vs. Relevancy Search: comparing the new with the old. Giancarlo Crocetti (Boehringer Ingelheim, USA)
5. Why Embeddings Search
1. Embeddings are good at identifying semantic and syntactic structures
2. A match can occur even without any match in query keywords
   • “what are the side effects of ibuprofen?”
   • “NSAIDs can cause a range of adverse reactions, especially …”
3. A search is nothing more than a dot product between two vectors
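Point 3 can be made concrete with a few lines of Python. The vectors below are toy 3-dimensional values standing in for real query and document embeddings; cosine similarity is simply the dot product of length-normalized vectors.

```python
import math

def dot(u, v):
    """Dot product between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity: dot product of length-normalized vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

# Toy vectors standing in for real query/document embeddings.
query_vec = [0.2, 0.9, 0.1]   # e.g. "side effects of ibuprofen?"
doc_vec   = [0.3, 0.8, 0.2]   # e.g. "NSAIDs can cause adverse reactions ..."

score = cosine(query_vec, doc_vec)   # close to 1.0: a strong match
```

Note how the query and the document share no keywords, yet the vectors can still score high, which is exactly the point of embedding-based search.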
6. From Words to Documents
Given a document 𝑑ᵢ = {𝑤ᵢ₁, 𝑤ᵢ₂, …, 𝑤ᵢₘ}, a possible document embedding for 𝑑ᵢ is:

𝑣(𝑑ᵢ) = Σⱼ₌₁ᵐ 𝑣(𝑤ᵢⱼ)  (sum-pooling), or

𝑣(𝑑ᵢ) = (1/𝑚) Σⱼ₌₁ᵐ 𝑣(𝑤ᵢⱼ)  (mean-pooling).

Note: if 𝑤ᵢⱼ ∉ V, set 𝑣(𝑤ᵢⱼ) = 𝑣('unk'), or use subword embeddings like fastText.
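Mean-pooling with an 'unk' fallback can be sketched in a few lines. The word vectors below are toy 2-dimensional values; in practice they would come from a trained model such as Word2Vec or fastText.

```python
def mean_pool(doc_tokens, vectors, unk="unk"):
    """Document embedding as the mean of its word vectors.
    Out-of-vocabulary tokens fall back to the 'unk' vector."""
    rows = [vectors.get(t, vectors[unk]) for t in doc_tokens]
    dim = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(dim)]

# Toy word vectors; real ones would come from Word2Vec, fastText, etc.
vectors = {
    "ibuprofen": [1.0, 0.0],
    "nsaid":     [0.8, 0.2],
    "unk":       [0.0, 0.0],
}
doc = ["ibuprofen", "nsaid", "zzz"]   # "zzz" is out of vocabulary
emb = mean_pool(doc, vectors)         # mean of the three mapped vectors
```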
7. Smooth Inverse Frequency (SIF) Pooling Method
S. Arora, Y. Liang, and T. Ma, “A Simple but tough-to-beat baseline for sentence embeddings.” ICLR, 2017
SIF weights each word vector by 𝑎/(𝑎 + 𝑝(𝑤)), where 𝑎 is a hyperparameter and 𝑝(𝑤) the word frequency, then averages the weighted vectors.
𝑢𝑢⊺ = projection onto the first singular vector of the Singular Value Decomposition: the strongest component, i.e., the information common to all vectors. Removing it strips the common pattern.
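A minimal sketch of SIF pooling, assuming NumPy is available. The sentences, word vectors, and frequency counts are toy values; the two steps are the ones in the formula above: frequency-weighted averaging, then removal of the 𝑢𝑢⊺ component.

```python
import numpy as np

def sif_embeddings(sentences, vectors, word_freq, a=1e-3):
    """SIF pooling (Arora et al., 2017): average word vectors weighted by
    a / (a + p(w)), then remove the first-singular-vector component."""
    total = sum(word_freq.values())
    embs = []
    for sent in sentences:
        rows = [vectors[w] * (a / (a + word_freq[w] / total)) for w in sent]
        embs.append(np.mean(rows, axis=0))
    X = np.stack(embs)
    # First right singular vector u: the direction common to all sentences.
    u = np.linalg.svd(X, full_matrices=False)[2][0]
    return X - X @ np.outer(u, u)   # remove the uu^T projection from each row

# Toy inputs standing in for real embeddings and corpus frequencies.
vectors = {"drug": np.array([1.0, 0.2, 0.0]),
           "dose": np.array([0.9, 0.1, 0.1]),
           "pain": np.array([0.2, 1.0, 0.3])}
word_freq = {"drug": 50, "dose": 30, "pain": 20}
sents = [["drug", "dose"], ["pain", "drug"], ["dose", "pain"]]
sif = sif_embeddings(sents, vectors, word_freq)
```

After the projection is removed, all sentence embeddings lie in a subspace one dimension smaller, which is what "removing the common pattern" means geometrically.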
9. Different Search Cases
• Classic Search: keyword search
  • Use token-based embeddings
  • Most similar sentence in a document
• Document Search: human-like questions
  • Sentence or paragraph embeddings
  • Best document match based on the question
• Media Search: images, sounds, etc.
  • Object embeddings
  • Best match based on media characteristics
10. What do you Need?
1. A good document embedding
2. Select your measure of similarity
3. Generate the vector for your query
4. Find the most similar items to the query vector
Faiss (Facebook AI)
Non-Metric Space Library (NMSLIB)
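The four steps can be sketched end to end in plain Python; at scale, a library such as Faiss or NMSLIB would replace the brute-force loop in step 4. All vectors here are toy values.

```python
import math

def cosine(u, v):
    """Step 2: cosine similarity as the measure of similarity."""
    num = sum(a * b for a, b in zip(u, v))
    den = (math.sqrt(sum(a * a for a in u)) *
           math.sqrt(sum(b * b for b in v)))
    return num / den

# Step 1: precomputed (toy) document embeddings.
doc_embeddings = {
    "doc_ibuprofen": [0.9, 0.1, 0.2],
    "doc_weather":   [0.1, 0.9, 0.1],
    "doc_nsaids":    [0.8, 0.2, 0.3],
}

# Step 3: the query vector (in practice, the same model embeds the query).
query = [0.85, 0.15, 0.25]

# Step 4: brute-force nearest neighbors; Faiss/NMSLIB do this at scale.
ranked = sorted(doc_embeddings,
                key=lambda d: cosine(query, doc_embeddings[d]),
                reverse=True)
```

The two pharmacology documents rank ahead of the unrelated one, even though the ranking never looks at keywords.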
11. Generating
Good Quality
Embeddings
1. Meditate on your real needs: KISS
2. Use your data: make it domain-specific
3. Refrain from jumping to the latest
technologies
• Might be better, but they are slower
• They require transfer learning, and you
can forget about one-shot learning
• You might not need context awareness with
domain specific data
4. Simpler embeddings are easier to
generate
12. Testing your
Embeddings
1. You must formally test your model
2. Different embeddings for different tasks
3. Extrinsic evaluations are good but …
4. Hyperparameter tuning is the key*
5. Hyperparameter tuning is task-oriented*
O. Levy, Y. Goldberg, and I. Dagan, “Improving Distributional Similarity with Lessons Learned from Word Embeddings”. MIT Press, 2015
A. Gladkova, A. Drozd, and S. Matsuoka, “Analogy-based detection of morphological and semantic relations”. Association for Computational Linguistics, 2016
14. Word2Vec
• The higher the dimension, the better the performance
• Skip-gram performs better than CBOW
• These are very general statements:
  • They do not provide any insight on other hyperparameters
  • How does the corpus size influence the quality?
  • What about specific tasks, like classification?
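The skip-gram vs. CBOW distinction is really a difference in how training examples are built from the same window. A minimal sketch of the two pairing schemes (the sentence and window size are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each center word predicts each context word separately."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """CBOW: the whole context window jointly predicts the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        ctx = [tokens[j]
               for j in range(max(0, i - window), min(len(tokens), i + window + 1))
               if j != i]
        pairs.append((ctx, center))
    return pairs

sentence = ["nsaids", "cause", "adverse", "reactions"]
sg = skipgram_pairs(sentence)   # one (center, context) pair per context word
cb = cbow_pairs(sentence)       # one (context-list, center) example per position
```

Skip-gram generates many more (and sparser) training examples per sentence, one reason it tends to do better on rare words.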
15. Intrinsic Testing
Syntactic Relationships:
• Apple and apples (regular plurals)
• Go and went (verb conjugations)
• Angry and angrier (comparative adjectives)
• Healthy and healthiest (superlative adjectives)
• Many more …
Semantic Relationships:
• DaVinci and Italy (name-nationality)
• Boy and girl (male-female)
• Car and automobile (synonym)
• Many more …
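Intrinsic analogy tests are usually run as vector arithmetic: solve a : b :: c : ? by finding the word closest to 𝑣(b) − 𝑣(a) + 𝑣(c). The toy vectors below are constructed so the analogies hold by design; a real test would use trained embeddings and a large benchmark set.

```python
import math

# Toy vectors built so the analogies hold by construction.
vecs = {
    "boy":   [1.0, 0.0,  1.0],
    "girl":  [1.0, 0.0, -1.0],
    "man":   [0.0, 1.0,  1.0],
    "woman": [0.0, 1.0, -1.0],
    "car":   [0.5, 0.5,  3.0],   # unrelated distractor word
}

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def analogy(a, b, c):
    """Solve a : b :: c : ? via v(b) - v(a) + v(c), excluding the inputs."""
    target = [vb - va + vc for va, vb, vc in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vecs[w]))
```

A model's intrinsic score is simply the fraction of such analogies it gets right across the syntactic and semantic categories listed above.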
17. Findings
• Use skip-gram instead of CBOW
• High dimensionality is not critical (a surprising finding)
  • Start with 128 or 256
• The smaller the corpus, the smaller the embedding (e.g., use 64)
• The larger the embedding size, the larger the window
• Corpus size matters
  • 5,000,000 documents or more
  • With larger corpus sizes use a window size of 10 or more
19. Conclusions
• Look at your use case: is this an embedding-search case?
• Start simple: e.g., Word2Vec
• Use your own data, or transfer-learn using your data
• Create your baseline before employing more complex models
• Hyperparameter tuning is more important than fancy models