3. Introduction
- Information Retrieval (IR) is the discipline that deals with retrieval of
  unstructured data, especially textual documents, in response to a
  query.
[Figure: architecture of an IR system. Components: User Interface, User need, Text Operations, Indexing, Inverted file, Index, Documents, Similarity Computation (Searching), Retrieved docs, Ranking, Ranked docs]
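The similarity computation and ranking stages named in the diagram can be illustrated with a small sketch. It assumes a bag-of-words representation and cosine similarity, neither of which the slide specifies; the function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: term -> number of occurrences."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, documents):
    """Return (doc_id, score) pairs with a non-zero score, best match first."""
    q = term_vector(query)
    scored = [(doc_id, cosine_similarity(q, term_vector(text)))
              for doc_id, text in documents.items()]
    retrieved = [(doc_id, score) for doc_id, score in scored if score > 0]
    return sorted(retrieved, key=lambda pair: pair[1], reverse=True)


documents = {  # hypothetical collection, for illustration only
    "d1": "garden flowers",
    "d2": "house garden",
    "d3": "city traffic",
}
ranked = rank("beautiful garden flowers", documents)
```

Here `d3` shares no term with the query, so it is not retrieved, and `d1` is ranked above `d2` because it matches two query terms instead of one.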
4. Text operations and Indexing
- Text operations: reduce the complexity of the document
  representation
  Q = List of the European countries -> List, Europe, country
- Indexing: builds an index (here an inverted file) over the text so that
  searching is fast. A simple alternative is to search the whole text
  sequentially
Vocabulary    Occurrences
beautiful     70
flowers       45, 58
garden        18, 29
house         6
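Both steps on this slide can be sketched in a few lines. The stop-word list, the normal-form table, and the sample sentence are hypothetical; the sentence was chosen only so that the 1-based character offsets match the occurrences listed above, and it is not taken from the slides.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list and normalisation table, for illustration only.
STOP_WORDS = {"a", "are", "has", "many", "of", "that", "the"}
NORMAL_FORMS = {"european": "europe", "countries": "country"}
WORD = re.compile(r"[a-z]+")


def text_operations(text):
    """Lower-case, tokenise, drop stop words, and map words to a normal form."""
    terms = []
    for match in WORD.finditer(text.lower()):
        word = match.group()
        if word not in STOP_WORDS:
            terms.append(NORMAL_FORMS.get(word, word))
    return terms


def build_inverted_file(text):
    """Map each vocabulary term to the 1-based offsets where it starts."""
    occurrences = defaultdict(list)
    for match in WORD.finditer(text.lower()):
        word = match.group()
        if word not in STOP_WORDS:
            occurrences[NORMAL_FORMS.get(word, word)].append(match.start() + 1)
    return dict(sorted(occurrences.items()))


query_terms = text_operations("List of the European countries")

# Hypothetical sample text (see the note above).
sample = ("That house has a garden the garden has many flowers. "
          "The flowers are beautiful")
inverted_file = build_inverted_file(sample)
```

A query term is then answered by a dictionary lookup in `inverted_file` instead of a sequential scan of `sample`.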
6. Popular search engines
- Google
- Yahoo
- Bing
- ...
- Google search engine
  - Google search orders results by priority
  - The priority ranking uses the "PageRank" algorithm
  - Google searches can use Boolean operators such as
    exclusion (-aa) and alternatives (aa OR bb)
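The effect of the exclusion and OR operators can be illustrated on a toy inverted index. This is only a sketch of Boolean retrieval in general, not a description of how Google evaluates queries; the parser, the `search` function, and the four documents are hypothetical, and terms separated by a space are assumed to be combined with AND.

```python
# Hypothetical document collection: doc id -> text.
documents = {1: "aa bb", 2: "aa cc", 3: "bb cc", 4: "dd ee"}

# Inverted index: term -> set of ids of the documents containing it.
index = {}
for doc_id, text in documents.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)


def search(query, index, all_doc_ids):
    """Evaluate a query with implicit AND, 'OR' alternatives and '-term' exclusion."""
    groups = []      # each group is a list of alternative terms
    excluded = []    # terms that must not appear
    after_or = False
    for token in query.split():
        if token == "OR":
            after_or = True
            continue
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())
        elif after_or and groups:
            groups[-1].append(token.lower())
        else:
            groups.append([token.lower()])
        after_or = False
    result = set(all_doc_ids)
    for group in groups:
        # A document matches a group if it contains any of the alternatives.
        result &= set().union(*(index.get(term, set()) for term in group))
    for term in excluded:
        result -= index.get(term, set())
    return result


either = search("aa OR bb", index, documents)
either_without_cc = search("aa OR bb -cc", index, documents)
```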
7. PageRank algorithm
- PageRank is an algorithm used by the Google search
  engine to rank websites in its search engine
  results.
  In simplified form, the rank of page B is the sum of the ranks of
  the pages that link to it:
  PR(B) = PR(E) + PR(F) + PR(D) + PR(C)
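The sum above can be turned into a small iterative computation. The sketch below uses the commonly published formulation, in which each page divides its rank among its out-links and a damping factor (0.85 here) is applied; both refinements, as well as the link graph, are assumptions that go beyond the simplified formula on the slide.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively compute PageRank for a graph given as page -> linked pages."""
    pages = sorted(set(links) | {p for targets in links.values() for p in targets})
    n = len(pages)
    ranks = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(ranks[page] for page in pages if not links.get(page))
        new_ranks = {page: (1.0 - damping) / n + damping * dangling / n
                     for page in pages}
        for page, targets in links.items():
            for target in targets:
                new_ranks[target] += damping * ranks[page] / len(targets)
        ranks = new_ranks
    return ranks


# Hypothetical link graph: C, D, E and F all link to B, and B links to C.
links = {"B": ["C"], "C": ["B"], "D": ["B"], "E": ["B"], "F": ["B"]}
ranks = pagerank(links)
```

In this graph B collects rank from four pages and ends up with the highest value, while D, E and F, which nobody links to, keep only the minimum share.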
8. Googlebot: Google's Web Crawler
- Googlebot is Google's web crawling robot, which finds
  and retrieves pages on the web and hands them off to
  the Google indexer.
- Googlebot finds pages in two ways:
  - Through an add URL form, www.google.com/addurl.html
  - By finding links while crawling the web.
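The two discovery routes map naturally onto a breadth-first crawl: submitted URLs act as seeds, and every retrieved page contributes new links to visit. The sketch below runs against a hypothetical in-memory link table instead of the real web, and it is not Googlebot's actual implementation; `crawl`, `TOY_WEB`, and the example URLs are invented for illustration.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
TOY_WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def crawl(seed_urls, fetch_links, max_pages=100):
    """Breadth-first crawl; returns URLs in the order they were retrieved."""
    frontier = deque(seed_urls)   # seeds play the role of submitted URLs
    seen = set(seed_urls)
    retrieved = []
    while frontier and len(retrieved) < max_pages:
        url = frontier.popleft()
        retrieved.append(url)     # a real crawler would hand the page to the indexer here
        for link in fetch_links(url):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return retrieved


order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
```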
12. Metasearch engines
- A metasearch engine is a search tool that sends user
  requests to several other search engines and/or
  databases and either aggregates the results into a single list or
  displays them according to their source.
- Metasearch engines enable users to enter search criteria
  once and access several search engines simultaneously.
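A minimal sketch of this idea follows. The two engines are hypothetical stand-ins that return fixed result lists, and the merging rule (summing 1/rank over the engines that returned a result) is just one simple choice among many; the slides do not prescribe one.

```python
from collections import defaultdict


def engine_a(query):
    """Hypothetical search engine returning a ranked list of URLs."""
    return ["u1", "u2", "u3"]


def engine_b(query):
    """Second hypothetical engine with partly overlapping results."""
    return ["u2", "u4"]


def metasearch(query, engines):
    """Send one query to every engine; return (merged list, results by source)."""
    by_source = {name: engine(query) for name, engine in engines.items()}
    scores = defaultdict(float)
    for results in by_source.values():
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank  # higher-ranked results contribute more
    merged = sorted(scores, key=lambda url: (-scores[url], url))
    return merged, by_source


merged, by_source = metasearch("information retrieval",
                               {"A": engine_a, "B": engine_b})
```

`merged` is the single aggregated list, while `by_source` keeps the results grouped by the engine that produced them, which covers both display options mentioned above.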
15. Some current research topics in IRS
- Visual indexing
  - Indexing of video, images, and audio
  - Visual content extraction
- Machine learning in information retrieval
- Web information retrieval (including blogs)
- Mobile computing related information retrieval issues
- Performance measures
- Query languages and optimization
16. What is MapReduce?
- MapReduce is a programming model for processing
  large data sets. It consists of two jobs.
- The first is the map job, which takes a set of data
  and converts it into another set of data, where
  individual elements are broken down into tuples
  (key/value pairs).
- The second is the reduce job, which takes the output from a map as input
  and combines those data tuples into a smaller set of
  tuples.
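The two jobs can be illustrated with a single-process word count. This is a sketch of the programming model only: the function names are hypothetical, and a real MapReduce framework would distribute the map and reduce calls across many machines.

```python
from collections import defaultdict


def map_words(line):
    """Map job: break one line of text into (word, 1) tuples."""
    return [(word, 1) for word in line.lower().split()]


def reduce_counts(word, counts):
    """Reduce job: combine all values for one key into a single tuple."""
    return (word, sum(counts))


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Apply map_fn to every input, group the tuples by key, then reduce each group."""
    grouped = defaultdict(list)
    for item in inputs:
        for key, value in map_fn(item):
            grouped[key].append(value)
    return [reduce_fn(key, values) for key, values in grouped.items()]


lines = ["the garden has flowers", "the flowers are beautiful"]
counts = dict(run_mapreduce(lines, map_words, reduce_counts))
```

The map step produces eight tuples for the two lines, and the reduce step combines them into six, one per distinct word.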
18. Programming Model
- Map(k1, v1) → list(k2, v2)
  Reduce(k2, list(v2)) → list(v3)
- Ex: 5 files
  File 1:
    Toronto, 20
    Whitby, 25
    New York, 22
    Rome, 32
    Toronto, 4
    Rome, 33
    New York, 18
19. Programming Model (continued)
- We want to find the maximum temperature for each
  city across all of the data files
- Break this into 5 map tasks
- Each mapper works on one file and returns the maximum
  temperature for each city in that file
- All five of these output streams are fed into the
  reduce tasks, which combine the input results and
  output a single value for each city, producing a final
  result.
20. Programming Model (continued)
- Map output, one mapper per line:
  (Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
  (Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
  (Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
  (Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
- Reduce output:
  (Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)
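The worked example can be replayed in code. The sketch assumes that the four groups of map output listed above come from the mappers for the four files that are not shown, and that the mapper for File 1 is computed from the File 1 listing; the function names are hypothetical.

```python
from collections import defaultdict


def map_max_temperature(records):
    """Map task for one file: emit (city, highest temperature in that file)."""
    best = {}
    for city, temperature in records:
        if city not in best or temperature > best[city]:
            best[city] = temperature
    return list(best.items())


def reduce_max_temperature(map_outputs):
    """Reduce task: combine all mapper outputs into one maximum per city."""
    grouped = defaultdict(list)
    for output in map_outputs:
        for city, temperature in output:
            grouped[city].append(temperature)
    return {city: max(temperatures) for city, temperatures in grouped.items()}


# File 1 as listed on the slide.
file_1 = [("Toronto", 20), ("Whitby", 25), ("New York", 22), ("Rome", 32),
          ("Toronto", 4), ("Rome", 33), ("New York", 18)]

# Map output listed on the slide (assumed to belong to the other four files).
other_map_outputs = [
    [("Toronto", 18), ("Whitby", 27), ("New York", 32), ("Rome", 37)],
    [("Toronto", 32), ("Whitby", 20), ("New York", 33), ("Rome", 38)],
    [("Toronto", 22), ("Whitby", 19), ("New York", 20), ("Rome", 31)],
    [("Toronto", 31), ("Whitby", 22), ("New York", 19), ("Rome", 30)],
]

file_1_output = map_max_temperature(file_1)
result = reduce_max_temperature([file_1_output] + other_map_outputs)
```

Under these assumptions `result` agrees with the reduce output shown on the slide.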
21. MapReduce uses
- MapReduce is useful in a wide range of applications,
  including distributed pattern-based searching, distributed
  sorting, web link-graph reversal, term-vector per host,
  web access log stats, inverted index construction,
  document clustering, and machine learning.
- Moreover, the MapReduce model has been adapted to
  several computing environments like multi-core systems,
  desktop grids, dynamic cloud environments, and mobile
  environments.
- At Google, MapReduce was used to completely
  regenerate Google's index of the World Wide Web. It
  replaced the old ad hoc programs that updated the index
  and ran the various analyses.
22. Current conferences in information retrieval
- 3rd Spanish Conference on Information Retrieval
  - 2014, June 20
  - Spain
- The European Conference on Information Retrieval
  - 2014, April 17
  - Netherlands
- 7th International Workshop on Information Filtering
  and Retrieval
  - 2013, Dec 6
  - Italy
Digital libraries: video recordings, PPT slides, presentations, audio recordings, ... The electronic content may be stored locally or accessed remotely via computer networks.
Enterprise search is how your organization helps people seek the information they need, in any format, from anywhere inside their company: in databases, document management systems, on paper, wherever. Just because powerful search tools are available does not mean that you should not organize your content.
Desktop search: the whole PC + internet browsing + mail.
Result:
(Toronto, 18) (Whitby, 27) (New York, 32) (Rome, 37)
(Toronto, 32) (Whitby, 20) (New York, 33) (Rome, 38)
(Toronto, 22) (Whitby, 19) (New York, 20) (Rome, 31)
(Toronto, 31) (Whitby, 22) (New York, 19) (Rome, 30)
(Toronto, 32) (Whitby, 27) (New York, 33) (Rome, 38)