Phrase Based Indexing
By Bala Abirami
•   Introduction to Phrase Based Indexing
•   What is Phrase Based Indexing?
•   Background of the Invention
•   Summary of the Invention
•   Spam Detection
Introduction
• An information retrieval system uses phrases to
  index, retrieve, organize and describe
  documents.
• It is described in a US patent application
  submitted by Google engineer Anna Lynn Patterson.
• Application filed: July, 2004
• Published: January, 2006
Background of Invention
• Information retrieval systems, generally called
  search engines, are now an essential tool for
  finding information in large-scale, diverse, and
  growing corpora such as the Internet.

• A document is retrieved in response to a query
  containing a number of query terms, typically
  because some number of those query terms are
  present in the document.

• The retrieved documents are then ranked
  according to other statistical measures, such as
  frequency of occurrence of the query terms, host
  domain, link analysis, and the like.
Cont…
• Concepts are often expressed in phrases, such
  as "Australian Shepherd," "President of the
  United States," or "Sundance Film Festival".
• Accordingly, there is a need for an information
  retrieval system and methodology that can
  identify phrases, index documents according to
  phrases, and search and rank documents in
  accordance with their phrases.
Summary
  An information retrieval system and
  methodology uses phrases to index, search,
  rank, and describe documents in the document
  collection.

1. Identifying phrases and related phrases
2. Indexing documents with respect to phrases
3. Ranking documents with respect to phrases
4. Creating descriptions for documents
5. Eliminating duplicate documents
Identifying Phrases and Related Phrases
• Identification is based on a phrase's ability to
  predict the presence of other phrases in a document.
• The system looks for phrases that have
  frequent and/or distinctive usage.
• A prediction measure is used to identify related
  phrases.
• The prediction measure relates the actual
  co-occurrence rate of two phrases to their
  expected co-occurrence rate.
• Information gain = actual co-occurrence rate :
  expected co-occurrence rate
Cont…
• Two phrases are related to each other
  when the prediction measure exceeds the
  prediction threshold.
• Example: the phrase “President of the United
  States” predicts related phrases such as
  “White House” and “George Bush”.
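
The prediction measure and threshold test can be sketched in Python as follows. This is only an illustration: it assumes the expected co-occurrence rate is the product of the two phrases' individual occurrence rates, and the function names, the toy documents, and the threshold of 1.5 are hypothetical rather than taken from the slides.

```python
def information_gain(docs, phrase_a, phrase_b):
    """Ratio of the actual to the expected co-occurrence rate of two phrases.

    docs is a list of sets, each holding the phrases found in one document.
    The expected rate assumes the phrases occur independently of each other
    (an assumption made for this sketch).
    """
    total = len(docs)
    count_a = sum(1 for d in docs if phrase_a in d)
    count_b = sum(1 for d in docs if phrase_b in d)
    count_both = sum(1 for d in docs if phrase_a in d and phrase_b in d)
    if total == 0 or count_a == 0 or count_b == 0:
        return 0.0
    actual = count_both / total
    expected = (count_a / total) * (count_b / total)
    return actual / expected


def is_related(docs, phrase_a, phrase_b, threshold=1.5):
    """Two phrases are related when the measure exceeds the threshold.

    The default threshold is a hypothetical value chosen for the example.
    """
    return information_gain(docs, phrase_a, phrase_b) > threshold


docs = [
    {"president of the united states", "white house"},
    {"president of the united states", "white house"},
    {"australian shepherd"},
    {"sundance film festival"},
]
print(is_related(docs, "president of the united states", "white house"))
```

In this toy collection the two phrases co-occur in half of the documents, twice as often as independence would predict, so they count as related under the assumed threshold.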
Indexing Documents Based on Related Phrases
• The information retrieval system indexes
  documents in the document collection by their
  valid ("good") phrases.
• Posting list = the documents that contain a
  given phrase
• Second list = data indicating which of the
  related phrases of the given phrase are also
  present in each document containing the given
  phrase
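
The two lists can be pictured with a small Python sketch. It assumes documents are already reduced to sets of phrases and that a table of related phrases is available; `build_index`, `posting`, and `related_present` are hypothetical names, and plain sets stand in for whatever compact encoding a real system would use.

```python
from collections import defaultdict


def build_index(docs, related):
    """Build a posting list and a related-phrase list for every phrase.

    docs maps a document id to the set of phrases the document contains.
    related maps a phrase to the set of its related phrases.
    """
    posting = defaultdict(list)          # phrase -> [doc_id, ...]
    related_present = defaultdict(dict)  # phrase -> {doc_id: related phrases found}
    for doc_id in sorted(docs):
        phrases = docs[doc_id]
        for phrase in phrases:
            posting[phrase].append(doc_id)
            # Record which related phrases of this phrase occur in the same document.
            related_present[phrase][doc_id] = phrases & related.get(phrase, set())
    return posting, related_present


docs = {
    1: {"president of the united states", "white house"},
    2: {"president of the united states"},
}
related = {"president of the united states": {"white house", "george bush"}}
posting, related_present = build_index(docs, related)
print(posting["president of the united states"])
print(related_present["president of the united states"])
```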
Ranking

•   Ranking documents is based on two factors:
      1. Ranking documents based on contained phrases
      2. Ranking documents based on anchor phrases
•   Document Score = Body Hit Score + Anchor Hit
    Score
•   Example: Body Hit Score = 0.30, Anchor
    Hit Score = 0.70
•   Document Score = 0.30 + 0.70 = 1.00
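
As a small illustration of the score combination, the following Python sketch adds the two scores and sorts documents by the result. The function names and the sample scores for document "b" are hypothetical.

```python
def document_score(body_hit_score, anchor_hit_score):
    """Combine the two hit scores by simple addition, as on the slide."""
    return body_hit_score + anchor_hit_score


def rank(documents):
    """Return document ids ordered from highest to lowest document score.

    documents maps a document id to a (body_hit_score, anchor_hit_score) pair.
    """
    return sorted(documents, key=lambda d: document_score(*documents[d]), reverse=True)


scores = {"a": (0.30, 0.70), "b": (0.10, 0.20)}
print(rank(scores))
```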
Phrase Extension
• The information retrieval system is also adapted
  to use the phrases when searching for
  documents in response to a query.
• A user may enter an incomplete phrase in a
  search query, such as "President of the".
  Incomplete phrases such as these may be
  identified and replaced by a phrase extension,
  such as "President of the United States."
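
One simple way to picture phrase extension is a prefix lookup over known phrases. The Python sketch below is a toy version under stated assumptions: the phrase table, its frequency counts, and the rule of picking the most frequent match are all hypothetical and not described in the slides.

```python
def extend_phrase(incomplete, known_phrases):
    """Return the most frequent known phrase that starts with the incomplete one.

    known_phrases maps a phrase to a frequency count (hypothetical data).
    If no known phrase extends the input, the input is returned unchanged.
    """
    prefix = incomplete.strip().lower() + " "
    candidates = [p for p in known_phrases if p.lower().startswith(prefix)]
    if not candidates:
        return incomplete
    return max(candidates, key=lambda p: known_phrases[p])


known = {
    "President of the United States": 50,
    "President of the Senate": 5,
    "Australian Shepherd": 20,
}
print(extend_phrase("President of the", known))
```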
Descriptions for Documents
• Phrase information is used to create a description
  of a document.
• The system identifies the query phrases,
  related phrases, and phrase extensions present in
  each sentence and keeps a count for each sentence.
• It ranks the sentences based on the count.
• It selects some number of top-ranking sentences
  as the description and includes it in the search
  results.
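
The sentence-counting steps above can be sketched in Python. This is a minimal illustration that assumes case-insensitive substring matching and a fixed number of sentences; `describe` and `top_n` are hypothetical names.

```python
def describe(sentences, phrases, top_n=2):
    """Pick the top_n sentences that contain the most relevant phrases.

    phrases holds the query phrases, related phrases, and phrase extensions.
    Sentences with equal counts keep their original order.
    """
    def count(sentence):
        text = sentence.lower()
        return sum(1 for p in phrases if p.lower() in text)

    ranked = sorted(sentences, key=count, reverse=True)
    return ranked[:top_n]


sentences = [
    "The weather was mild.",
    "The President of the United States lives in the White House.",
    "The White House is in Washington.",
]
phrases = ["President of the United States", "White House"]
print(describe(sentences, phrases))
```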
Eliminating Duplicate documents
• Duplicate documents are identified and eliminated
  while crawling or when processing the search
  query.
• The description of every document is stored in
  association with that document in a hash table.
• For a newly crawled page, the system compares the
  page's description with the values stored in the hash
  table. A match indicates that the current document
  is a duplicate.
• The system keeps the document with the higher page rank
  or greater document significance and removes the
  duplicate, which will not appear in future search
  results for any query.
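
A hash-table duplicate check along these lines might look like the Python sketch below. It assumes the description is hashed with SHA-256 and that page rank is a single number; both choices, and the names `description_hash` and `add_document`, are assumptions made for illustration.

```python
import hashlib


def description_hash(description):
    """Hash a document description (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(description.encode("utf-8")).hexdigest()


def add_document(table, doc_id, description, page_rank):
    """Store a document unless a higher-ranked duplicate is already present.

    table maps a description hash to a (doc_id, page_rank) pair.
    Returns the id of the document kept for this description.
    """
    key = description_hash(description)
    existing = table.get(key)
    if existing is None or page_rank > existing[1]:
        table[key] = (doc_id, page_rank)
    return table[key][0]


table = {}
add_document(table, "a", "same description", 0.4)
add_document(table, "b", "same description", 0.9)
print(table)
```

Only one entry per description survives, so the lower-ranked copy never reaches the search results.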
Functions of Indexing system

• Identifies phrases in documents
• Indexes documents according to their
  phrases by accessing various websites.

Functions of Front End Server

• Receives queries from a user
• Provides those queries to the search system
Functions of Searching System

• Searches for documents relevant to the
  search query
• Identifies the phrases in the search query
• Ranks the documents

Functions of Presentation system

• Modifies the search results, including
  removing duplicate content.
• Generates topical descriptions of
  documents and provides the modified
  search results.
Spam Detection
• “Spam” pages have little meaningful content,
  but may instead be made up of large
  collections of popular words and phrases.
  These are sometimes referred to as “keyword
  stuffed pages”.

• Pages containing specific words and phrases
  that advertisers might be interested in are
  often called “honeypots,” and are created for
  search engines to display along with paid
  advertisements.
Cont…
• A phrase-based indexing system knows the
  number of related phrases in a document.

• A normal, non-spam document will generally
  have a relatively limited number of related
  phrases, typically on the order of 8 to 20,
  depending on the document collection.

• A spam document will have an excessive
  number of related phrases, for example on the
  order of 100 to 1000 related phrases.
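
The related-phrase count can be turned into a simple spam check, sketched here in Python. The cutoff of 100 is taken from the lower end of the spam range quoted above and is an assumption of this sketch; the function names are hypothetical.

```python
def count_related_phrases(doc_phrases, related):
    """Count the distinct related phrases that are present in a document.

    doc_phrases is the set of phrases found in the document.
    related maps a phrase to the set of its related phrases.
    """
    present = set()
    for phrase in doc_phrases:
        present |= related.get(phrase, set()) & doc_phrases
    return len(present)


def looks_like_spam(doc_phrases, related, threshold=100):
    """Flag a document whose related-phrase count reaches the threshold.

    The default threshold is an assumed value, not a figure fixed by the slides.
    """
    return count_related_phrases(doc_phrases, related) >= threshold


related = {"a": {"b", "c"}, "b": {"a"}}
doc = {"a", "b", "c"}
print(count_related_phrases(doc, related), looks_like_spam(doc, related))
```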
Advantages of Phrase Based Indexing

• Detects duplicate pages
• Detects spam
• Saves time
Other Patent Applications
• Phrase identification in an information retrieval system

• Phrase-based searching in an information retrieval system

• Phrase-based generation of document descriptions

• Detecting spam documents in a phrase based information
  retrieval system

• Efficient Phrase Based Document Indexing for Document
  Clustering
According to data collected from users of the European web
analytics provider OneStat, most people use two- or three-word
queries in search engines:

Two-word phrases -- 28.38 percent
Three-word phrases -- 27.15 percent
Four-word phrases -- 16.42 percent
One-word phrase -- 13.48 percent
Five-word phrases -- 8.03 percent
Six-word phrases -- 3.67 percent
Seven-word phrases -- 1.63 percent
Eight-word phrases -- 0.73 percent
Nine-word phrases -- 0.34 percent
Ten-word phrases -- 0.16 percent
Thank you
