Unit 1

Overview
• Intro to IR
• History of IR
• Components of IR
• Issues in IR
• Open source search engine frameworks
• Impact of web on IR
• Role of AI in IR
• Components of Search Engine

What is Information Retrieval?
Information retrieval, as the name implies,
concerns the retrieval of relevant
information from stored collections. It is basically
concerned with facilitating the user's access to
large amounts of (predominantly textual)
information.

What is information retrieval?
• Gathering information from one or more sources based on
an information need, usually expressed as a query
– Major assumption: the information need can be
specified
– Broad definition of information
– Most methods are automated, which allows them to scale
• Sources of information
– Files on a laptop (desktop search)
– Archived information (libraries, maps, etc.)
– E-mail search
– Web (search engines)
Information retrieval is more than just web search.

Information retrieval vs. data mining vs. information extraction
• Information retrieval (IR) is the activity or process of
obtaining information resources relevant to an information
need from a collection of information resources.
• Data mining is the process that attempts to discover
patterns in large data sets.
• Information extraction (IE) is the task of automatically
extracting structured information from unstructured and/or
semi-structured machine-readable documents.

Unstructured (text) vs. structured (database)
data in the mid-1990s

DATABASE vs IR
                          DATABASE                          IR
What we are retrieving    Structured data                   Mostly unstructured data
Queries we are posing     Formal queries                    Natural-language expressions (free-text queries)
Results we get            Exact; always in correct format   Sometimes relevant, often not
Interaction with system   One-shot query                    Interaction-based

Goal of IR
• The goal of IR is to search large document
collections and retrieve the small subset relevant to the
user's information need.
• Popular IR systems are
– Internet search engines (Google, Bing,
Yahoo)
– Digital library catalogues

The classic search model
• User task: get rid of mice in a politically correct way
• Info need: info about removing mice without killing them
(misconception of the task?)
• Query: how trap mice alive
(misformulation of the info need?)
• Search engine: runs the query against the collection
• Results: may lead to query refinement and a new search

• IR is the study of finding needed information. It
helps users find information that matches
their information needs.
• IR locates relevant documents on the
basis of user-input keywords or free-text
queries.

General model of IR
• Query (user text)
• Data store
• Matching rule: compares the query with the data store
• Retrieval result
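The general model above can be sketched in a few lines. This is only an illustration, assuming a tiny in-memory data store and a deliberately simple matching rule (count of shared terms); the names `DATA_STORE`, `tokenize`, `matching_rule`, and `retrieve` are hypothetical and do not come from the slides.

```python
# Hypothetical data store: document id -> text (example data, not from the slides).
DATA_STORE = {
    "d1": "how to trap mice alive",
    "d2": "removing mice without killing them",
    "d3": "digital library catalogue search",
}


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def matching_rule(query_terms, doc_terms):
    """Assumed rule: score = number of distinct query terms found in the document."""
    return len(set(query_terms) & set(doc_terms))


def retrieve(query, store):
    """Return (doc_id, score) pairs with a non-zero score, best match first."""
    query_terms = tokenize(query)
    scored = [
        (doc_id, matching_rule(query_terms, tokenize(text)))
        for doc_id, text in store.items()
    ]
    hits = [(doc_id, score) for doc_id, score in scored if score > 0]
    # Sort by descending score, then by id so that ties are ordered predictably.
    return sorted(hits, key=lambda pair: (-pair[1], pair[0]))


print(retrieve("trap mice", DATA_STORE))
```

Real systems replace the matching rule with a retrieval model and the dictionary with an index, but the four parts (query, matching rule, data store, result) stay the same.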

GOAL OF IR
The goal of an IR system is to
retrieve all the items that are relevant to the user's
query while retrieving as few non-relevant
items as possible.

PROBLEMS IN IR?
• Document and query indexing
– How best to represent the contents?
• Query evaluation (retrieval process)
– To what extent does a document correspond to a query?
• How good is the IR system?
– Are the retrieved documents relevant?
– Have all the relevant documents been retrieved?
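The two questions about system quality are commonly measured as precision (what fraction of the retrieved documents are relevant) and recall (what fraction of the relevant documents were retrieved). The slides do not name these measures, so the sketch below is an assumption about how the questions would be quantified; the function name `precision_recall` and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute (precision, recall) for one query.

    retrieved: ids returned by the system
    relevant:  ids judged relevant by the user
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three relevant overall, two in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")
```

Retrieving everything gives perfect recall but poor precision, which is why the goal is stated as retrieving all relevant items while keeping non-relevant ones to a minimum.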

Why is IR Difficult?
• Vocabulary mismatch
– Synonyms: car vs. vehicle
– Anna University vs. Annamalai
University
• Content representation may be inadequate and
incomplete.
• The user is the ultimate judge of relevance, so the IR system
must be effective at retrieving the information the user needs.
Challenges in IR
• High heterogeneity
– Document structure, size, and quality of data
• What does the user expect to be retrieved?
• Retrieval strategies
• Scale and distribution of data

• Relevance
– Relevance is the fundamental concept in
information retrieval.
– Many factors influence whether a particular
retrieved document is considered relevant.
– Vocabulary mismatch problem
• It is important to distinguish between topical relevance
and user relevance.

• Retrieval models
– Retrieval models have been proposed to address
these problems in IR.
– A good retrieval model will find documents that
are likely to be considered relevant by the
user who submitted the query.

• Evaluation
– How well the retrieved documents match the
person's expectations; the quality of a
ranking depends on the ranking algorithm.
– Page ranking algorithms have been introduced.

OPEN SOURCE SEARCH
ENGINE FRAMEWORK

Definition
• Open source is an approach to the design,
development, and distribution of software
that offers practical access to the
software's source code free of charge.

Need for open source
• Demand from consumers as well as enterprises
is increasing with the growing use of information
technology. Information technology
solutions are required to satisfy their
different needs.
• A single solution provider cannot provide all
the needed solutions.
• Open source, freeware, and free software.

• In the 1970s and 1980s, software organizations used
technical measures to prevent computer
users from modifying and freely using
software. In 1980, copyright law for software
was introduced.
• Richard Stallman is the founder of the Free
Software Foundation (FSF).
• The primary goal is for application
software and operating systems to be shared among
different users with full freedom.

• OS: Linux, Symbian, NetBSD
• Servers and applications: Apache, Tomcat, Drupal,
WordPress, Eclipse, Joomla
• Programming languages: Java, PHP,
Python, JavaScript
• Digital content: Wikipedia, Project
Gutenberg

Open source software                   Closed/proprietary software
Source code is freely available        Source code is kept secret
Modifications are allowed              Modifications are not allowed
Sublicensing is allowed                Sublicensing is not allowed
No guarantee of further development    Guarantee of further development
Examples: Wikipedia, Android OS        Examples: iOS, Microsoft Windows

Advantages
• Right to use the software in any way
• Usually no license cost
• Higher flexibility
• Source code is open and can be modified
freely.

Application
• Social networking
• Animation
• Instant messaging
• Website development
• ERP
• Multimedia

• Freeware: software that is available
free of cost and can be distributed
without restriction.
• Free software:
– The user is free to run the program.
– The user is free to distribute the program to anybody.
– The user is free to modify and improve the program.
Reasons for choosing open source
• Development and maintenance of open
source software is a community-based activity.
• Open source allows us to study, modify, and
distribute the software.
• Open source allows customers to enhance the software.

Widely used open source software licenses
• Apache License
• BSD License
• GNU General Public License
• Mozilla Public License
• Eclipse Public License

IMPACT OF WEB ON IR
• The WWW was developed by Tim Berners-Lee in
1990 to organize research documents
available on the Internet. The idea combined
making documents available (as with FTP) with
hypertext to link the documents.
• A client uses a browser application to send a URI
via HTTP to a server, requesting a web page.

• Web pages are constructed using HTML.
• The server responds with the requested web page.
• IR has three kinds of components: file
organization, storage, and retrieval.
• One way to find relevant documents on the
web is to launch a web robot, also called a
crawler, spider, or worm.
• These software programs receive a user
query, then explore the web to locate
documents, evaluate their relevance, and
return a rank-ordered list of documents.
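The exploration step of a web robot can be sketched as a breadth-first traversal of links. To keep the example self-contained, it crawls a hypothetical in-memory "web" (the `WEB` dictionary and its `*.example` addresses are invented for illustration) instead of fetching real pages over HTTP; relevance evaluation and ranking are left out.

```python
from collections import deque

# Hypothetical in-memory web: address -> (page text, outgoing links).
WEB = {
    "a.example": ("information retrieval intro", ["b.example", "c.example"]),
    "b.example": ("web search engines", ["a.example", "d.example"]),
    "c.example": ("open source software", []),
    "d.example": ("retrieval models for web search", ["a.example"]),
}


def crawl(seed, web, max_pages=10):
    """Visit pages breadth-first from the seed and return them in visit order."""
    seen = {seed}          # addresses already queued, so no page is visited twice
    queue = deque([seed])  # frontier of pages still to visit
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        if url not in web:
            continue  # dead link: nothing to fetch
        order.append(url)
        _text, links = web[url]
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


print(crawl("a.example", WEB))
```

A production crawler would also respect robots.txt, limit its request rate, and revisit pages that change often.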

• IR queries
– Keyword query
– Boolean query
– Phrase query
– Full document query
– Natural language query
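Keyword, Boolean, and phrase queries can be illustrated with a small inverted index. This is a minimal sketch under stated assumptions: the document set `DOCS` and all function names are hypothetical, tokenization is plain whitespace splitting, and the phrase query scans the document text directly rather than using a positional index.

```python
from collections import defaultdict

# Hypothetical document collection: id -> text.
DOCS = {
    1: "open source search engine",
    2: "web search engine crawler",
    3: "open source software license",
}


def build_index(docs):
    """Build an inverted index: term -> set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def boolean_and(index, terms):
    """Documents that contain every term."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()


def boolean_or(index, terms):
    """Documents that contain at least one term (also a simple keyword query)."""
    return set().union(*(index.get(term, set()) for term in terms))


def phrase_query(docs, phrase):
    """Documents in which the words of the phrase appear consecutively."""
    words = phrase.lower().split()
    if not words:
        return set()
    n = len(words)
    result = set()
    for doc_id, text in docs.items():
        tokens = text.lower().split()
        if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
            result.add(doc_id)
    return result


index = build_index(DOCS)
print(sorted(boolean_and(index, ["open", "source"])))
print(sorted(boolean_or(index, ["crawler", "license"])))
print(sorted(phrase_query(DOCS, "search engine")))
```

Full-document and natural-language queries need term weighting and ranking on top of this, which is where retrieval models come in.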

Web challenges for IR
• The WWW is expanding faster than IR
models can adapt, and web pages are updated
frequently or dynamically.
• Many web pages are not indexed by search
engines; this phenomenon is called the invisible
web.
• Two problems
– Problem with data
– Problem with user
• Problem with data
– Distributed data (different servers)
– Large volume (billions of separate documents)
– Quality of data
– High heterogeneity
– Unstructured data (30% duplicate data)
• Problem with user
– How to specify the query
– How to interpret the answers provided by the system
Role of AI in IR
• AI is a collection of hard problems which can
be solved by humans and other things.
• Natural Language Processing (NLP) is a
major part of AI which serves as a field of
application in IR.
• NLP techniques are used to formulate queries,
extract information, retrieve documents
from a collection, and translate them from
one form to another.

• Two types of NLP are used in IR.
– NLP allows a system to respond to a large range of
inputs and to produce more refined results.
– NLP allows a text-processing system to scan
source texts to retrieve particular information.
• Three considerations in applying NLP to IR
– Selection of NLP technologies
– Choice of models
– Approach to extracting information (acquisition)
