1. Latest Trends in AI and Information Retrieval
- Abhay Ratnaparkhi
2. Outline
• Introduction
• Overview of how search engines work
• Crawling, Indexing, Querying, Ranking
• Open-source solutions and products
• Real world problems
• Extracting text from HTML
• Ranking documents – Learning to Rank
• Formulating better query – Relevance Feedback
• Featured Snippets – Automated Question Answer Generation
• Federated Search
• Finding Near duplicates from large set of documents
• Neural Information Retrieval – Trends
• Local vs distributed representations
• Query document matching
• Query Expansion
• Working in software industry
• Job roles
• Software Development processes
• Skills you need
3. What is information retrieval?
• Finding material of an unstructured nature that satisfies
an information need from within large collections.
• Search Engines
• Question Answering systems
• Recommendation systems
4. Expert Systems - IBM Watson DeepQA
https://www.aaai.org/Magazine/Watson/watson.php
The IBM Watson DeepQA system outperformed human champions in the Jeopardy! challenge (2011).
Search is an integral part of such QA systems.
5. Virtual Assistant - Amazon Alexa
Alexa, what’s India’s current score?
Alexa, play a Marathi song.
Search is required to answer questions related to
most of the skills
7. How Search Works?
Given a query `q`, find the matching set of documents `d`
• Open Source
• Web Search
• Insight Engines
• IBM Watson Discovery
8. Web Crawler
• Finding Web pages on the web by recursively visiting linked pages from
some seed URLs.
• Crawling at scale – Needs distributed system
• Apache Nutch, StormCrawler, Scrapy, Sparkler
• Storing crawled content
• Server-side rendering vs Client-side rendering
• Googlebot uses headless chrome to render pages.
• Google Puppeteer
• Link Analysis – finding page importance – PageRank
• Getting features like page speed, mobile friendliness, content quality, etc.
• Deep Web – the portion of the web not accessible to crawlers – roughly 90%
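The crawl loop described above can be sketched in Python. This is a minimal breadth-first sketch, not a production crawler (real systems like Nutch or StormCrawler add politeness, robots.txt handling, deduplication, and distribution); the `fetch` callback is a hypothetical parameter standing in for page download plus link extraction, so the skeleton stays self-contained.

```python
from collections import deque
from urllib.parse import urljoin

def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl: recursively visit linked pages from seed URLs.
    `fetch(url)` must return (page_content, list_of_hrefs) for a page."""
    seen = set(seed_urls)
    frontier = deque(seed_urls)
    pages = {}                      # url -> stored crawled content
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        content, hrefs = fetch(url)
        pages[url] = content
        for href in hrefs:
            link = urljoin(url, href)   # resolve relative links
            if link not in seen:        # never revisit a page
                seen.add(link)
                frontier.append(link)
    return pages
```

Swapping the `deque` for a priority queue ordered by estimated page importance turns this into the focused-crawling variant that large engines use.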
9. Inverted Index
• Ranking functions
• Term Frequency (tf) X Inverse
Document Frequency (idf)
• Okapi BM25
• Details about the Lucene inverted index
Source: https://nlp.stanford.edu/IR-book/html/htmledition/an-example-information-retrieval-problem-1.html#1533
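The inverted index and the tf × idf ranking function from this slide can be sketched in a few lines of Python (a toy illustration; Lucene's on-disk postings format and BM25 scoring are considerably more involved):

```python
import math
from collections import defaultdict

def build_index(docs):
    """Inverted index: term -> {doc_id: term frequency in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def tfidf_search(query, index, n_docs):
    """Score documents by the sum of tf x idf over query terms."""
    scores = defaultdict(float)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))  # rare terms weigh more
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda item: -item[1])
```

Because only terms that occur in the query are looked up, scoring touches just the matching postings lists rather than every document — the core efficiency argument for inverted indexes.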
11. Extracting clean text from a web page
• Remove unnecessary information like
headers, footers, advertisements etc.
• Boilerplate content deteriorates search
precision
• CleanEval – a competitive evaluation on
the topic of cleaning arbitrary web pages
• Boilerplate detection using shallow text features – 2010
• http://www.l3s.de/~kohlschuetter/boilerplate/WSDM2010-Kohlschuetter-slides.pdf
• Web2Text: Deep Structured Boilerplate
Removal
Source - https://arxiv.org/abs/1801.02607
12. Learning to Rank – How to measure relevancy?
• Human annotators – many annotators manually
assign relevancy labels to documents
• Automated ways – observe click patterns
and other metrics on the Search Engine Results
Page (SERP): click models
• Relevancy metrics
• Precision: is the fraction of
results that are relevant
• Recall: is the fraction of
relevant results that are
returned
• nDCG : Normalized
Discounted Cumulative Gain -
This metric asserts that the
highly relevant documents are
more useful than moderately
relevant documents, which are
in turn more useful than
irrelevant documents.
• E.g. if documents are given
labels from 0 to 5, the ranking
{5, 5, 4, 3, 0} has high nDCG.
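The nDCG metric described above is short enough to compute directly. A minimal sketch, using the common exponential-gain formulation (2^rel − 1) with a log2 position discount; other gain/discount choices exist:

```python
import math

def dcg(labels):
    """Discounted Cumulative Gain over graded labels in rank order:
    higher positions are discounted less."""
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(labels))

def ndcg(labels):
    """Normalize by the DCG of the ideal (descending) ordering,
    so a perfect ranking scores 1.0."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0
```

The slide's example {5, 5, 4, 3, 0} is already in ideal order, so its nDCG is 1.0; reversing it would drop the score because the discount pushes the high-gain documents down.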
13. Reranking using - Learning to Rank
• Ranking model
• The model is trained using labels
• Aim is to Maximize nDCG
• Pairwise, pointwise, and listwise approaches
• https://www.cl.cam.ac.uk/teaching/1516/R222/l2r-overview.pdf
• RankNet, LambdaRank, LambdaMART
Query: "ibm products"
Document              Label  Orig score  BM25-title  PageRank  #Visits
www.ibm.com             4       2.3         2.0          3      200K
www.ibm.com/products    5       2.4         3.0          2      10K
www.microsoft.com       2       2.1         1.1          3      300K
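The pairwise idea can be sketched with a perceptron-style update: learn feature weights so that, for every labelled pair, the better document scores higher. This is a simplification of RankNet-style training (which uses a probabilistic cross-entropy loss and gradient descent); the feature values below echo the slide's table and are purely illustrative.

```python
def pairwise_updates(pairs, features, lr=0.1, epochs=50):
    """Pairwise LTR sketch: learn weights w so that for each labelled
    pair (better_doc, worse_doc), w . f(better) > w . f(worse)."""
    n = len(next(iter(features.values())))
    w = [0.0] * n
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [a - b for a, b in zip(features[better], features[worse])]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin <= 0:  # mis-ordered pair: nudge w toward the diff
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

def score(w, f):
    """Ranking score: dot product of weights and document features."""
    return sum(wi * fi for wi, fi in zip(w, f))
```

Pairs are generated from the graded labels (label 5 beats 4 beats 2), which is how pairwise methods reuse the same annotations that pointwise methods fit directly.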
14. Relevance Feedback and Query Expansion
Relevance Feedback (local analysis)
Pseudo Relevance Feedback – automatically reformulates the query
by treating the top retrieved documents as relevant
Query Expansion (Global analysis)
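A common way to realize relevance feedback is the Rocchio update: move the query vector toward the centroid of the documents judged (or assumed) relevant. The sketch below omits the non-relevant term of the full Rocchio formula for brevity; vectors are plain term-weight dicts.

```python
from collections import Counter

def rocchio(query_vec, relevant_vecs, alpha=1.0, beta=0.75):
    """Simplified Rocchio relevance feedback: new query =
    alpha * original query + beta * centroid of relevant documents."""
    new_q = Counter()
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for vec in relevant_vecs:
        for term, w in vec.items():
            new_q[term] += beta * w / len(relevant_vecs)
    return dict(new_q)
```

Feeding in the top-k retrieved documents as `relevant_vecs` (without asking the user) gives exactly the pseudo relevance feedback variant from the slide: terms frequent in the top results get added to the query with weight beta.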
15. Featured Snippets & Automated QA Generation
• Natural Language Generation
• Stanford Question Answering Dataset (SQuAD)
https://www.coursera.org/specializations/natural-language-processing#courses
• Transfer learning – Use the model with little retraining
in other domains.
• Transformer based models – BERT, GPT-3, LaMDA
16. Federated/Aggregated Search
• Resource selection (or query
intent prediction).
• Result aggregation
• If w1, w2, w3, w4, w5 are the
web results, we can constrain
the vertical result blocks to end
up in one of the slots s1, s2, s3,
which are distributed among the
web results in the following way:
s1, w1, s2, w2, w3, w4, w5, s3.
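The fixed slot layout above can be made concrete with a small merge function (an illustrative sketch only; real aggregators score each vertical block against the query before deciding whether and where to place it):

```python
def aggregate(web_results, vertical_blocks):
    """Merge up to three vertical blocks into a five-result web ranking
    using the slide's fixed layout: s1, w1, s2, w2, w3, w4, w5, s3."""
    slots = iter(vertical_blocks[:3])
    layout = []
    layout.append(next(slots, None))   # s1: top slot
    layout.append(web_results[0])      # w1
    layout.append(next(slots, None))   # s2: mid slot
    layout.extend(web_results[1:5])    # w2 .. w5
    layout.append(next(slots, None))   # s3: bottom slot
    return [item for item in layout if item is not None]
```

Unused slots simply collapse, so fewer vertical blocks degrade gracefully to the plain web ranking.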
17. Finding near duplicate documents
• Document similarity
• const a = new Set(["chair", "desk", "rug", "keyboard", "mouse"]);
• const b = new Set(["chair", "rug", "keyboard"]);
• Jaccard Coefficient = |a ∩ b| / |a ∪ b| = 3 / (5 + 3 - 3) = 3/5 = 0.6, or 60%
• MinHash (Locality Sensitive Hashing)
• Intelligent mechanism to reduce big data to smaller
hash values for easy similarity computations
• Mining Massive Datasets
• http://www.mmds.org/#book
18. Neural Information Retrieval
• Neural IR is the application of shallow or deep neural networks to IR tasks.
• Other natural language processing capabilities such as machine translation and named entity linking are
not neural IR but could be used in an IR system.
Neural IR models can be categorized based on whether they influence the query representation,
the document representation, the relevance estimation, or a combination of these steps.
Source – Neural IR
20. Word Embeddings
Learn an embedding from words into vectors.
We need a function W(word) that returns a vector encoding that word.
Relationships between words correspond to differences
between vectors.
Word2vec, GloVe
“a word is characterized by the company it keeps”
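The "relationships correspond to vector differences" claim is what powers word analogies. A toy sketch with hand-made 3-d vectors (real word2vec/GloVe vectors have 100–300 dimensions and are learned from corpora; these numbers are invented purely so the geometry works out):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy vectors: dim 1 ~ "royalty", dim 2 ~ "male", dim 3 ~ "female"
W = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.5, 0.9, 0.1],
    "woman": [0.5, 0.1, 0.9],
    "apple": [0.1, 0.2, 0.1],
}

def analogy(a, b, c):
    """Solve a - b + c: return the vocabulary word whose vector is
    closest (by cosine) to W[a] - W[b] + W[c]."""
    target = [x - y + z for x, y, z in zip(W[a], W[b], W[c])]
    return max((w for w in W if w not in (a, b, c)),
               key=lambda w: cosine(W[w], target))
```

Here `analogy("king", "man", "woman")` recovers "queen" because the king−man difference isolates the royalty component, which woman then combines with.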