Latest trends in AI and
Information Retrieval
- Abhay Ratnaparkhi
Outline
• Introduction
• Overview of how search engines work
• Crawling, Indexing, Querying, Ranking
• Open-source solutions and products
• Real world problems
• Extracting text from HTML
• Ranking documents – Learning to Rank
• Formulating better query – Relevance Feedback
• Featured Snippets - Automated Question-Answer Generation
• Federated Search
• Finding near-duplicates in a large set of documents
• Neural Information Retrieval – Trends
• Local vs distributed representations
• Query document matching
• Query Expansion
• Working in software industry
• Job roles
• Software Development processes
• Skills you need
What is information retrieval?
• Finding material of an unstructured nature that satisfies
an information need from within large collections.
• Search Engines
• Question Answering systems
• Recommendation systems
Expert Systems - IBM Watson DeepQA
https://www.aaai.org/Magazine/Watson/watson.php
The IBM Watson DeepQA system outperformed human champions in the
Jeopardy! Challenge (2011)
Search is an integral part of such QA systems
Virtual Assistant - Amazon Alexa
Alexa, what's India's current score?
Alexa, play a Marathi song.
Search is required to answer questions related to
most of the skills
Search Engines
How Does Search Work?
Given a query `q`, find the matching set of documents `d`.
• Open Source
• Web Search
• Insight Engines: IBM Watson Discovery
Web Crawler
• Finding Web pages on the web by recursively visiting linked pages from
some seed URLs.
• Crawling at scale – Needs distributed system
• Apache Nutch, StormCrawler, Scrapy, Sparkler
• Storing crawled content
• Server-side rendering vs Client-side rendering
• Googlebot uses headless Chrome to render pages.
• Google Puppeteer
• Link analysis – Finding page importance – PageRank
• Getting features like page speed, mobile friendliness, content quality, etc.
• Deep Web – Portion of the web not accessible to crawlers, often estimated at ~90%
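The recursive visiting of linked pages from seed URLs can be sketched as a breadth-first traversal. This is a minimal sketch that walks a hypothetical in-memory link graph (`LINKS`) instead of fetching pages over HTTP; the URLs and the `crawl` helper are illustrative assumptions, not part of the original slides.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> outgoing links.
LINKS = {
    "https://example.org/": ["https://example.org/a", "https://example.org/b"],
    "https://example.org/a": ["https://example.org/b", "https://example.org/c"],
    "https://example.org/b": ["https://example.org/"],
    "https://example.org/c": [],
}

def crawl(seeds, max_pages=100):
    """Breadth-first crawl: visit the seeds, then the pages they link to, and so on."""
    frontier = deque(seeds)  # URLs waiting to be fetched
    seen = set(seeds)        # URLs already queued, so nothing is visited twice
    order = []               # the order in which pages were fetched
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        order.append(url)
        # A real crawler would download the page and parse its links here.
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order

visited = crawl(["https://example.org/"])
```

A crawler at scale would likely distribute the frontier across machines and persist the `seen` set, which is where tools such as Apache Nutch or StormCrawler come in.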
Inverted Index
• Ranking functions
• Term Frequency (tf) × Inverse Document Frequency (idf)
• Okapi BM25
• Details about the Lucene inverted index
Source: https://nlp.stanford.edu/IR-book/html/htmledition/an-example-information-retrieval-problem-1.html#1533
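To make the inverted index and the BM25 ranking function concrete, here is a minimal sketch. The toy corpus `DOCS`, the helper names, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions for illustration; the idf used is a common non-negative variant, `log(1 + (N - df + 0.5) / (df + 0.5))`.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; document ids and texts are illustrative only.
DOCS = {
    1: "ibm watson search engine",
    2: "open source search engine lucene",
    3: "watson question answering system",
}

def build_index(docs):
    """Inverted index: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(text.split()).items():
            index[term][doc_id] = tf
    return index

def bm25(query, docs, index, k1=1.2, b=0.75):
    """Score documents against the query with Okapi BM25, best first."""
    n = len(docs)
    lengths = {d: len(t.split()) for d, t in docs.items()}
    avg_len = sum(lengths.values()) / n
    scores = defaultdict(float)
    for term in query.split():
        postings = index.get(term, {})
        df = len(postings)
        if df == 0:
            continue  # term does not occur in the corpus
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        for doc_id, tf in postings.items():
            # Length normalisation: long documents are penalised via b.
            norm = tf + k1 * (1 - b + b * lengths[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / norm
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

INDEX = build_index(DOCS)
ranking = bm25("watson search", DOCS, INDEX)
```

Only documents that appear in the postings of a query term are scored, which is the main reason an inverted index makes querying fast.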
Real World Problems
Extracting clean text from a web page
• Remove unnecessary information like
headers, footers, advertisements etc.
• Boilerplate content degrades search precision
• CleanEval – Competitive evaluation on the topic of cleaning arbitrary web pages
• Using shallow text features – 2010
• http://www.l3s.de/~kohlschuetter/boilerplate/WSDM2010-Kohlschuetter-slides.pdf
• Web2Text: Deep Structured Boilerplate Removal
Source - https://arxiv.org/abs/1801.02607
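The idea of using shallow text features can be illustrated with a simple heuristic: content blocks tend to be long and contain few linked words, while navigation and footers are short and link-heavy. This is a rough sketch under that assumption, not the classifier from the cited work; the `Block` type, the thresholds, and the sample page are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Block:
    text: str          # visible text of the block
    linked_words: int  # number of words that sit inside anchor tags

def is_content(block, min_words=10, max_link_density=0.33):
    """Heuristic: keep blocks that are long and have a low share of linked words."""
    words = len(block.text.split())
    if words == 0:
        return False
    link_density = block.linked_words / words
    return words >= min_words and link_density <= max_link_density

# Hypothetical page that has already been split into text blocks.
page = [
    Block("Home Products About Contact", linked_words=4),
    Block("Search engines crawl pages, build an inverted index, and rank "
          "documents against the query using functions such as BM25.",
          linked_words=0),
    Block("Copyright 2021 Terms Privacy", linked_words=2),
]
clean_text = "\n".join(b.text for b in page if is_content(b))
```

In practice the thresholds would be learned from labeled pages rather than fixed by hand.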
Learning to Rank – How to measure relevancy?
• Human annotators - Several annotators manually assign relevancy labels to the documents
• Automated ways - Observe click patterns and other metrics on the Search Engine Results Page (SERP); click models
• Relevancy metrics
• Precision: the fraction of returned results that are relevant
• Recall: the fraction of relevant documents that are returned
• nDCG: Normalized Discounted Cumulative Gain - This metric assumes that highly relevant documents are more useful than moderately relevant documents, which are in turn more useful than irrelevant documents, and that a document is more useful the earlier it appears in the ranking.
• E.g., if documents are given labels from 0 to 5, a result list whose labels in ranked order are {5, 5, 4, 3, 0} has a high nDCG.
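The three metrics can be computed in a few lines. This is a minimal sketch; it assumes the common exponential gain `2^label - 1` and a `log2` position discount for DCG, which is one of several formulations in use.

```python
import math

def precision(retrieved, relevant):
    """Fraction of retrieved documents that are relevant."""
    if not retrieved:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(retrieved)

def recall(retrieved, relevant):
    """Fraction of relevant documents that were retrieved."""
    if not relevant:
        return 0.0
    return len(set(retrieved) & set(relevant)) / len(relevant)

def dcg(labels):
    """Discounted cumulative gain of relevance labels given in ranked order."""
    return sum((2 ** rel - 1) / math.log2(rank + 2)
               for rank, rel in enumerate(labels))

def ndcg(labels):
    """DCG divided by the DCG of the ideal (label-sorted) ordering."""
    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal > 0 else 0.0

best = ndcg([5, 5, 4, 3, 0])   # labels already in ideal order
worst = ndcg([0, 3, 4, 5, 5])  # same documents, reversed ranking
```

The ordering {5, 5, 4, 3, 0} is already ideal, so its nDCG is the maximum value of 1; the reversed ordering contains the same documents but scores lower because the relevant ones appear late.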
Reranking using Learning to Rank
• Ranking model
• The model is trained using labels
• The aim is to maximize nDCG
• Pointwise, pairwise, and listwise approaches
• https://www.cl.cam.ac.uk/teaching/1516/R222/l2r-overview.pdf
• RankNet, LambdaRank, LambdaMART
Query: ibm products
Document               Label  Orig score  BM25-title  PageRank  #Visits
www.ibm.com            4      2.3         2.0         3         200K
www.ibm.com/products   5      2.4         3.0         2         10K
www.microsoft.com      2      2.1         1.1         3         300K
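The pairwise approach can be illustrated with a RankNet-style logistic loss over the three example documents for the query "ibm products". This is a sketch only: the linear scoring model and the two weight settings are hypothetical, and a real system would learn the weights by minimizing this loss over many queries.

```python
import math

# Example documents: (document, label, BM25-title, PageRank).
ROWS = [
    ("www.ibm.com", 4, 2.0, 3),
    ("www.ibm.com/products", 5, 3.0, 2),
    ("www.microsoft.com", 2, 1.1, 3),
]

def score(features, weights):
    """Linear scoring model: weighted sum of the features."""
    return sum(f * w for f, w in zip(features, weights))

def pairwise_loss(rows, weights):
    """Logistic loss averaged over all pairs where label_i > label_j."""
    total, pairs = 0.0, 0
    for _, label_i, *feat_i in rows:
        for _, label_j, *feat_j in rows:
            if label_i > label_j:  # document i should rank above document j
                margin = score(feat_i, weights) - score(feat_j, weights)
                total += math.log(1 + math.exp(-margin))
                pairs += 1
    return total / pairs if pairs else 0.0

loss_bm25 = pairwise_loss(ROWS, weights=(1.0, 0.0))      # rank by BM25-title only
loss_pagerank = pairwise_loss(ROWS, weights=(0.0, 1.0))  # rank by PageRank only
```

Ranking by BM25-title orders every pair the same way the labels do, so its loss is lower than ranking by PageRank, which puts the best-labeled document last.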
Relevance Feedback and Query Expansion
Relevance Feedback (local analysis)
Pseudo Relevance Feedback – Automated way to change the query
by assuming that the top retrieved documents are relevant
Query Expansion (Global analysis)
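Pseudo relevance feedback can be sketched as follows: run the query, assume the top results are relevant, and add their most frequent new terms to the query. This is a simplified term-counting sketch; the stop-word list, the `expand_query` helper, and the sample results are assumptions for illustration.

```python
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "in", "to", "is", "for"}

def expand_query(query, ranked_docs, top_k=2, num_terms=2):
    """Add the most frequent new terms from the top_k retrieved documents."""
    query_terms = query.lower().split()
    counts = Counter()
    for text in ranked_docs[:top_k]:  # these are assumed to be relevant
        for term in text.lower().split():
            if term not in STOPWORDS and term not in query_terms:
                counts[term] += 1
    new_terms = [term for term, _ in counts.most_common(num_terms)]
    return query_terms + new_terms

# Hypothetical first-pass results for the query, best first.
results = [
    "jaguar car dealer and jaguar car service",
    "used jaguar car prices",
    "jaguar habitat in the rainforest",
]
expanded = expand_query("jaguar", results)
```

The expanded query picks up "car" because it dominates the top results. The same example shows the risk of the method: if the top results are about the wrong sense of the query, the expansion drifts further in that direction.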
Featured Snippets & Automated QA generation
• Natural Language Generation
• Stanford Question Answering Dataset (SQuAD)
https://www.coursera.org/specializations/natural-language-processing#courses
• Transfer learning – Reuse a pretrained model in other domains with little retraining.
• Transformer-based models – BERT, GPT-3, LaMDA
Federated/Aggregated Search
• Resource selection (or query
intent prediction).
• Result aggregation
• If w1, w2, w3, w4, w5 are the web results, we can constrain
the vertical result blocks to end up in one of the slots s1, s2, s3,
which are distributed among the web results in the following way:
s1, w1, s2, w2, w3, w4, w5, s3.
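The slot layout s1, w1, s2, w2, ..., w5, s3 can be expressed as a small merge function. This is a minimal sketch; the `aggregate` helper and the assignment of verticals to slots are hypothetical, and in a real system that assignment would come from the resource selection step.

```python
def aggregate(web_results, vertical_blocks):
    """Place vertical blocks into the fixed slots s1, s2, s3 around the web results.

    Layout: s1, w1, s2, w2, ..., wN, s3. A slot without a block is skipped.
    """
    page = []
    if "s1" in vertical_blocks:
        page.append(vertical_blocks["s1"])
    page.extend(web_results[:1])   # w1
    if "s2" in vertical_blocks:
        page.append(vertical_blocks["s2"])
    page.extend(web_results[1:])   # w2 .. wN
    if "s3" in vertical_blocks:
        page.append(vertical_blocks["s3"])
    return page

web = ["w1", "w2", "w3", "w4", "w5"]
# Hypothetical output of resource selection: which vertical goes into which slot.
blocks = {"s1": "news", "s3": "images"}
serp = aggregate(web, blocks)
```

Constraining verticals to a few slots keeps the aggregation problem small: the system only has to decide which vertical, if any, fills each slot.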
Finding near duplicate documents
• Document similarity
• Set a = new Set(["chair", "desk", "rug", "keyboard", "mouse"]);
• Set b = new Set(["chair", "rug", "keyboard"]);
• Jaccard coefficient = |a ∩ b| / |a ∪ b| = 3 / (5 + 3 - 3) = 3 / 5 = 0.6, or 60%
• MinHash (Locality Sensitive Hashing)
• Intelligent mechanism to reduce big data to smaller
hash values for easy similarity computations
• Mining Massive Datasets
• http://www.mmds.org/#book
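The Jaccard computation and its MinHash approximation can be sketched together. This is a minimal sketch that builds the hash family by seeding SHA-1, which is an assumption chosen for simplicity; production systems typically use faster hash functions and group signatures into LSH bands to find candidate pairs.

```python
import hashlib

def jaccard(a, b):
    """Exact Jaccard coefficient: size of intersection over size of union."""
    return len(a & b) / len(a | b)

def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items
        ))
    return signature

def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions on which the two sets agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

a = {"chair", "desk", "rug", "keyboard", "mouse"}
b = {"chair", "rug", "keyboard"}
exact = jaccard(a, b)  # 3 / 5
approx = estimate_similarity(minhash_signature(a), minhash_signature(b))
```

For each hash function, the probability that both sets share the same minimum equals their Jaccard coefficient, so the agreement rate of the signatures estimates it. Signatures have a fixed length regardless of document size, which is what makes comparisons cheap on large collections.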
Neural Information Retrieval
• Neural IR is the application of shallow or deep neural networks to IR tasks.
• Other natural language processing capabilities such as machine translation and named entity linking are
not neural IR but could be used in an IR system.
Neural IR models can be categorized based on whether they influence the query representation,
the document representation, the relevance estimation, or a combination of these steps.
Source – Neural IR
Word Embeddings
Learn an embedding that maps words to vectors.
We need a function W(word) that returns a vector encoding that word.
Relationships between words correspond to differences
between vectors.
Word2vec, GloVe
“A word is characterized by the company it keeps”
Vector Search Engines
• Weaviate
• Milvus
• Approximate Nearest Neighbor (ANN) search
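Nearest-neighbor search over embeddings can be sketched with an exact linear scan using cosine similarity. This is the baseline that approximate methods trade a little accuracy to speed up; the three-dimensional vectors and the `nearest` helper are hypothetical and not taken from any real model or engine API.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def nearest(query_vec, vectors, k=2):
    """Exact k nearest neighbors by cosine similarity (linear scan)."""
    ranked = sorted(vectors,
                    key=lambda name: cosine(query_vec, vectors[name]),
                    reverse=True)
    return ranked[:k]

# Hypothetical 3-dimensional embeddings; real models use hundreds of dimensions.
VECTORS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.7, 0.2],
    "apple": [0.1, 0.2, 0.9],
}
hits = nearest([0.9, 0.75, 0.15], VECTORS, k=2)
```

A linear scan costs time proportional to the collection size per query, so vector search engines replace it with approximate index structures that examine only a small part of the collection.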
Working in software industry
Job Roles
• Software Developer
• Full Stack Developer
• Machine Learning Engineer
• Data Scientist
• Site Reliability Engineer
• Software Architect
• Front End Developer
• UX Designer
• Iteration Manager
• Scrum Master
• Product Owner
• Research Staff Member
• People Manager
Agile Software development
