Web search engines
Rooted in Information Retrieval (IR) systems

•Prepare a keyword index for corpus
•Respond to keyword queries with a ranked list of
documents.

ARCHIE

•Earliest application of rudimentary IR systems to
the Internet
•Title search across sites serving files over FTP
Boolean queries: Examples
 Simple queries involving relationships
between terms and documents

• Documents containing the word Java
• Documents containing the word Java but not
the word coffee

 Proximity queries

• Documents containing the phrase Java beans or the term API
• Documents where Java and island occur in the same sentence

Mining the Web Chakrabarti and Ramakrishnan

Document preprocessing
 Tokenization

• Filtering away tags
• Tokens regarded as nonempty sequences of characters excluding spaces and punctuation
• Token represented by a suitable integer, tid, typically 32 bits
• Optional: stemming/conflation of words
• Result: document (did) transformed into a sequence of integers (tid, pos)
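The tokenization steps can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the regular expressions, the lowercasing step, and the names `tokenize` and `vocab` are assumptions made for the example, and stemming is left out.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")         # crude tag filter, adequate for a sketch
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")  # nonempty runs without spaces or punctuation


def tokenize(text, vocab):
    """Return a list of (tid, pos) pairs for one document.

    `vocab` maps each distinct token string to an integer tid and is
    extended in place whenever a new token is seen (hypothetical helper).
    """
    stripped = TAG_RE.sub(" ", text)
    pairs = []
    for pos, match in enumerate(TOKEN_RE.finditer(stripped)):
        token = match.group().lower()  # lowercasing is an assumption
        tid = vocab.setdefault(token, len(vocab))
        pairs.append((tid, pos))
    return pairs


vocab = {}
pairs = tokenize("<p>Java beans and the Java API</p>", vocab)
```

Both occurrences of "Java" receive the same tid, while their positions differ.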

Storing tokens
 Straightforward implementation using a relational database
• Example figure
• Space scales to almost 10 times
 Accesses to table show common pattern
• Reduce the storage by mapping tids to a lexicographically sorted buffer of (did, pos) tuples
• Indexing = transposing the document-term matrix
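The transposition from documents to postings can be sketched as below. The in-memory dictionaries stand in for the on-disk structures, and the name `build_inverted_index` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: dict mapping did -> list of (tid, pos) pairs.

    Returns a dict mapping tid -> lexicographically sorted list of
    (did, pos) postings.
    """
    index = defaultdict(list)
    for did, pairs in docs.items():
        for tid, pos in pairs:
            index[tid].append((did, pos))
    for postings in index.values():
        postings.sort()
    return dict(index)


docs = {
    1: [(0, 0), (1, 1)],          # e.g. "java beans"
    2: [(0, 0), (2, 1), (0, 2)],  # e.g. "java island java"
}
index = build_inverted_index(docs)
```

Looking up tid 0 now yields every (did, pos) where that token occurs, without scanning the documents.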

Two variants of the inverted index data structure, usually stored on disk. The simpler version in the middle does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as “document/position”) may be implemented using a B-tree or a hash-table.

Storage
 For dynamic corpora

• Berkeley DB2 storage manager
• Can frequently add, modify and delete
documents

 For static collections

• Index compression techniques (to be
discussed)

Stopwords
 Function words and connectives
 Appear in a large number of documents and are of little use in pinpointing documents
 Indexing stopwords
• Stopwords not indexed
   For reducing index space and improving performance
• Replace stopwords with a placeholder (to remember the offset)
 Issues
• Queries containing only stopwords ruled out
• Polysemous words that are stopwords in one sense but not in others
   E.g., can as a verb vs. can as a noun

Stemming
 Conflating words to help match a query term with a
morphological variant in the corpus.
 Remove inflections that convey parts of speech, tense
and number
 E.g.: university and universal both stem to universe.
 Techniques
• morphological analysis (e.g., Porter's algorithm)
• dictionary lookup (e.g., WordNet).
 Stemming may increase recall but at the price of
precision
• Abbreviations, polysemy and names coined in the technical and commercial sectors
• E.g.: Stemming “ides” to “IDE”, “SOCKS” to “sock”, “gated” to “gate”, may be bad!

Batch indexing and updates
 Incremental indexing

• Time-consuming due to random disk IO
• High level of disk block fragmentation
 Simple sort-merges.
• To replace the indexed update of variable-length postings

 For a dynamic collection

• A single document-level change may need to update hundreds to thousands of records.
• Solution: create an additional “stop-press” index.

Maintaining indices over dynamic collections.

Stop-press index
 Collection of documents in flux

• Model document modification as deletion followed by insertion
• Documents in flux represented by a signed record (d,t,s)
• “s” specifies if “d” has been deleted or inserted.

 Getting the final answer to a query
• Main index returns a document set D0.
• Stop-press index returns two document sets
   D+ : documents not yet indexed in D0 matching the query
   D- : documents matching the query removed from the collection since D0 was constructed.

 Stop-press index getting too large
• Rebuild the main index
   Signed (d,t,s) records are sorted in (t,d,s) order and merge-purged into the master (t,d) records
• Stop-press index can be emptied out.
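A rough sketch of how the main index and the stop-press index might be combined for a single-term query. The encoding s = +1 for insertion and s = -1 for deletion is an assumption, as is the choice to apply deletions before insertions so that a modified document (deletion followed by insertion) that still matches is retained. The name `answer_query` is hypothetical.

```python
def answer_query(main_hits, stop_press_records, term):
    """Combine main-index hits D0 with stop-press sets D+ and D-.

    stop_press_records: iterable of signed records (d, t, s), with
    s > 0 for an inserted document and s < 0 for a deleted one
    (assumed encoding).
    """
    d_plus = {d for d, t, s in stop_press_records if t == term and s > 0}
    d_minus = {d for d, t, s in stop_press_records if t == term and s < 0}
    # Deletions are applied first so that re-inserted documents survive.
    return (set(main_hits) - d_minus) | d_plus


records = [
    (7, "java", +1),   # new document 7 contains "java"
    (3, "java", -1),   # document 3 was deleted
    (5, "java", -1),   # document 5 was modified: deleted ...
    (5, "java", +1),   # ... and re-inserted, still containing "java"
    (9, "coffee", +1),
]
result = answer_query({1, 3, 5}, records, "java")
```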

Index compression techniques
 Compressing the index so that much of it
can be held in memory

• Required for high-performance IR installations (as with Web search engines)

 Redundancy in index storage

• Storage of document IDs.
 Delta encoding
• Sort Doc IDs in increasing order
• Store the first ID in full
• Subsequently store only difference (gap) from
previous ID
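Delta encoding of a posting list can be illustrated in a few lines. The function names and the sample IDs are made up for the example.

```python
from itertools import accumulate


def to_gaps(doc_ids):
    """Sort the IDs, keep the first in full, then store only the gaps."""
    ids = sorted(doc_ids)
    if not ids:
        return []
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]


def from_gaps(gaps):
    """Recover the original sorted IDs by taking running sums."""
    return list(accumulate(gaps))


gaps = to_gaps([1000, 1003, 1010, 1011])
restored = from_gaps(gaps)
```

After the first entry the stored numbers are small, which is what makes the gap codes on the following slides worthwhile.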
Encoding gaps
 Small gap must cost far fewer bits than a
document ID.
 Binary encoding

• Optimal when all symbols are equally likely
 Unary code
• optimal if probability of large gaps decays
exponentially

Encoding gaps
 Gamma code

• Represent gap $x$ as
   Unary code for $1 + \lfloor \log x \rfloor$, followed by
   $x - 2^{\lfloor \log x \rfloor}$ represented in binary ($\lfloor \log x \rfloor$ bits)

 Golomb codes

• Further enhancement
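A small sketch of the gamma code for a gap x. The unary convention used here (n - 1 ones followed by a terminating zero) is an assumption, since conventions differ between texts; logarithms are base 2.

```python
def unary(n):
    """Unary code for n >= 1: (n - 1) ones and a terminating zero (assumed convention)."""
    return "1" * (n - 1) + "0"


def gamma(x):
    """Gamma code of a gap x >= 1, returned as a string of bits."""
    if x < 1:
        raise ValueError("gamma code is defined for x >= 1")
    k = x.bit_length() - 1  # floor(log2 x)
    prefix = unary(1 + k)
    # x - 2^k written in exactly k bits (nothing at all when k == 0)
    suffix = format(x - (1 << k), "b").zfill(k) if k else ""
    return prefix + suffix


codes = {x: gamma(x) for x in (1, 2, 5, 9)}
```

Each code uses 2⌊log x⌋ + 1 bits, so small gaps cost far fewer bits than a full document ID.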

Lossy compression mechanisms
 Trading off space for time
 Collect documents into buckets

• Construct inverted index from terms to bucket IDs
• Document IDs shrink to half their size.

 Cost: time overheads
• For each query, all documents in that bucket need to be scanned

 Solution: index documents in each bucket separately

• E.g.: Glimpse (http://webglimpse.org/)
General dilemmas
 Messy updates vs. High compression rate
 Storage allocation vs. Random I/Os
 Random I/O vs. large scale
implementation

Relevance ranking
 Keyword queries
• In natural language
• Not precise, unlike SQL
   Boolean decision for response unacceptable
• Solution
   Rate each document for how likely it is to satisfy the user's information need
   Sort in decreasing order of the score
   Present results in a ranked list.

 No algorithmic way of ensuring that the ranking
strategy always favors the information need
• Query: only a part of the user's information need
Responding to queries
 Set-valued response

• Response set may be very large
   (E.g., by recent estimates, over 12 million Web pages contain the word java.)

 Demanding selective query from user
 Guessing user's information need and
ranking responses
 Evaluating rankings

Evaluation procedure
 Given benchmark

• Corpus of n documents D
• A set of queries Q
• For each query $q \in Q$, an exhaustive set of relevant documents $D_q \subseteq D$, manually identified

 Query submitted to system

• Ranked list of documents $(d_1, d_2, \ldots, d_n)$ retrieved
• Compute a 0/1 relevance list $(r_1, r_2, \ldots, r_n)$
   $r_i = 1$ iff $d_i \in D_q$
   $r_i = 0$ otherwise
Recall and precision
 Recall at rank $k \ge 1$

• Fraction of all relevant documents included in $(d_1, d_2, \ldots, d_k)$
• $\text{recall}(k) = \frac{1}{|D_q|} \sum_{1 \le i \le k} r_i$

 Precision at rank $k$
• Fraction of the top k responses that are actually relevant
• $\text{precision}(k) = \frac{1}{k} \sum_{1 \le i \le k} r_i$
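As a quick illustration, recall and precision at rank k can be computed directly from the 0/1 relevance list. The relevance list and the value of |D_q| below are invented for the example.

```python
def recall_at_k(r, k, num_relevant):
    """Fraction of all |D_q| relevant documents found in the top k."""
    return sum(r[:k]) / num_relevant


def precision_at_k(r, k):
    """Fraction of the top k responses that are relevant."""
    return sum(r[:k]) / k


r = [1, 0, 1, 0, 0]   # hypothetical relevance list for one query
num_relevant = 4      # assumed |D_q|: two relevant documents were not retrieved
recall_3 = recall_at_k(r, 3, num_relevant)
precision_3 = precision_at_k(r, 3)
```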

Other measures
 Average precision
• Sum of precision at each relevant hit position in the response list, divided by the total number of relevant documents
• $\text{avg. precision} = \frac{1}{|D_q|} \sum_{1 \le k \le |D|} r_k \cdot \text{precision}(k)$
• avg. precision = 1 iff the engine retrieves all relevant documents and ranks them ahead of any irrelevant document
 Interpolated precision
• To combine precision values from multiple queries
• Gives precision-vs.-recall curve for the benchmark.
   For each query, take the maximum precision obtained for the query for any recall greater than or equal to ρ
   Average them together for all queries

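Average precision and interpolated precision for a single query can be sketched as follows. The function names and the sample relevance list are assumptions; averaging the interpolated values over all queries in the benchmark is left to the caller.

```python
def average_precision(r, num_relevant):
    """Sum of precision(k) over relevant positions k, divided by |D_q|."""
    hits = 0
    total = 0.0
    for k, rk in enumerate(r, start=1):
        if rk:
            hits += 1
            total += hits / k  # precision(k) at a relevant hit
    return total / num_relevant


def interpolated_precision(r, num_relevant, rho):
    """Maximum precision over all ranks whose recall is >= rho (0.0 if none)."""
    best = 0.0
    hits = 0
    for k, rk in enumerate(r, start=1):
        hits += rk
        if hits / num_relevant >= rho:
            best = max(best, hits / k)
    return best


r = [1, 0, 1, 0, 0]  # hypothetical relevance list, with |D_q| = 2
ap = average_precision(r, 2)
ip_half = interpolated_precision(r, 2, 0.5)
ip_full = interpolated_precision(r, 2, 1.0)
```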
Precision-Recall tradeoff
 Interpolated precision cannot increase with recall
• Interpolated precision at recall level 0 may be less than 1

 At level k = 0
• Precision (by convention) = 1, Recall = 0
 Inspecting more documents
• Can increase recall
• Precision may decrease
   We will start encountering more and more irrelevant documents

 Search engine with a good ranking function will generally show a negative relation between recall and precision.
• The higher the curve, the better the engine
Precision and interpolated precision plotted against recall for the given relevance vector. Missing $r_k$ are zeroes.

The vector space model
 Documents represented as vectors in a multi-dimensional Euclidean space

• Each axis = a term (token)
 Coordinate of document d in direction of term t determined by:
• Term frequency TF(d,t)
   Number of times term t occurs in document d, scaled in a variety of ways to normalize document length
• Inverse document frequency IDF(t)
   To scale down the coordinates of terms that occur in many documents
Term frequency
 $\text{TF}(d,t) = \frac{n(d,t)}{\sum_{\tau} n(d,\tau)}$ or $\text{TF}(d,t) = \frac{n(d,t)}{\max_{\tau} n(d,\tau)}$
 Cornell SMART system uses a smoothed version
• $\text{TF}(d,t) = 0$ if $n(d,t) = 0$
• $\text{TF}(d,t) = 1 + \log(1 + n(d,t))$ otherwise

Inverse document frequency
 Given

• $D$ is the document collection and $D_t$ is the set of documents containing $t$

 Formulae

• Mostly dampened functions of $|D| / |D_t|$
• SMART: $\text{IDF}(t) = \log\left(\frac{1 + |D|}{|D_t|}\right)$

Vector space model
 Coordinate of document d in axis t

• $d_t = \text{TF}(d,t)\,\text{IDF}(t)$
• Transformed to $\vec{d}$ in the TFIDF-space
 Query q
• Interpreted as a document
• Transformed to $\vec{q}$ in the same TFIDF-space as $\vec{d}$
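A minimal sketch of mapping documents into TFIDF-space, using the SMART variants of TF and IDF from the preceding slides. Natural logarithms, the sparse dict representation, and the toy corpus are assumptions made for the example.

```python
import math
from collections import Counter


def smart_tf(n):
    """Smoothed SMART term frequency for a raw count n."""
    return 0.0 if n == 0 else 1.0 + math.log(1.0 + n)


def smart_idf(num_docs, doc_freq):
    """SMART inverse document frequency: log((1 + |D|) / |D_t|)."""
    return math.log((1.0 + num_docs) / doc_freq)


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per document."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({
            t: smart_tf(n) * smart_idf(len(docs), doc_freq[t])
            for t, n in counts.items()
        })
    return vectors


docs = [["java", "beans", "java"], ["java", "island"], ["coffee", "beans"]]
vectors = tfidf_vectors(docs)
```

In the second document "island" gets a larger coordinate than "java", because "java" also occurs in another document of the toy corpus.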

Measures of proximity
 Distance measure

• Magnitude of the vector difference $|\vec{d} - \vec{q}|$
• Document vectors must be normalized to unit ($L_1$ or $L_2$) length
   Else shorter documents dominate (since queries are short)

 Cosine similarity

• Cosine of the angle between $\vec{d}$ and $\vec{q}$
   Shorter documents are penalized
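Cosine similarity between a document vector and a query vector can be sketched on sparse dict vectors as below. The vectors are invented for the example, and returning 0.0 for an all-zero vector is a convention chosen here.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch
    return dot / (norm_u * norm_v)


d = {"java": 2.0, "beans": 1.0}
q = {"java": 1.0}
similarity = cosine(d, q)
```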

Relevance feedback
 Users learning how to modify queries
• Response list must have at least some relevant documents
• Relevance feedback
   `Correcting' the ranks to the user's taste
   Automates the query refinement process

 Rocchio's method

• Folding in user feedback
• To query vector $\vec{q}$
   Add a weighted sum of vectors for relevant documents $D_+$
   Subtract a weighted sum of the irrelevant documents $D_-$
• $\vec{q}\,' = \alpha \vec{q} + \beta \sum_{D_+} \vec{d} - \gamma \sum_{D_-} \vec{d}$
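Rocchio's update can be sketched on sparse dict vectors as follows. The default weights are illustrative placeholders rather than values taken from the slides, and the sample vectors are hypothetical.

```python
from collections import defaultdict


def rocchio(q, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return q' = alpha*q + beta*sum(D+) - gamma*sum(D-).

    q and every document are sparse {term: weight} dicts; the default
    alpha, beta, gamma are illustrative assumptions.
    """
    new_q = defaultdict(float)
    for t, w in q.items():
        new_q[t] += alpha * w
    for d in relevant:
        for t, w in d.items():
            new_q[t] += beta * w
    for d in irrelevant:
        for t, w in d.items():
            new_q[t] -= gamma * w
    return dict(new_q)


q = {"java": 1.0}
d_plus = [{"java": 1.0, "beans": 2.0}]
d_minus = [{"coffee": 1.0}]
q_new = rocchio(q, d_plus, d_minus, alpha=1.0, beta=0.5, gamma=0.25)
```

Setting `gamma=0` drops the irrelevant documents from the update entirely.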

Relevance feedback (contd.)
 Pseudo-relevance feedback

• D+ and D- generated automatically
   E.g.: Cornell SMART system
   Top 10 documents reported by the first round of query execution are included in D+
• γ typically set to 0; D- not used
 Not a commonly available feature
• Web users want instant gratification
• System complexity
   Executing the second round query is slower and expensive for major search engines

Ranking by odds ratio
 R: Boolean random variable which represents the relevance of document d w.r.t. query q.
 Ranking documents by their odds ratio for relevance
• $\frac{\Pr(R \mid q,d)}{\Pr(\bar{R} \mid q,d)} = \frac{\Pr(R,q,d)/\Pr(q,d)}{\Pr(\bar{R},q,d)/\Pr(q,d)} = \frac{\Pr(R \mid q)}{\Pr(\bar{R} \mid q)} \cdot \frac{\Pr(d \mid R,q)}{\Pr(d \mid \bar{R},q)}$
 Approximating probability of d by product of the probabilities of individual terms in d
• $\Pr(d \mid R,q) \approx \prod_t \Pr(x_t \mid R,q)$ and $\Pr(d \mid \bar{R},q) \approx \prod_t \Pr(x_t \mid \bar{R},q)$
• Approximately…
  $\frac{\Pr(R \mid q,d)}{\Pr(\bar{R} \mid q,d)} \propto \prod_{t \in q \cap d} \frac{a_{t,q}(1-b_{t,q})}{b_{t,q}(1-a_{t,q})}$, where $a_{t,q} = \Pr(x_t = 1 \mid R,q)$ and $b_{t,q} = \Pr(x_t = 1 \mid \bar{R},q)$

Bayesian Inferencing

Bayesian inference network for relevance ranking. A document is relevant to the extent that setting its corresponding belief node to true lets us assign a high degree of belief in the node corresponding to the query.

Manual specification of mappings between terms to approximate concepts.

Bayesian Inferencing (contd.)
 Four layers

1. Document layer
2. Representation layer
3. Query concept layer
4. Query
 Each node is associated with a random Boolean variable, reflecting belief
 Directed arcs signify that the belief of a node is a function of the belief of its immediate parents (and so on)
Bayesian Inferencing systems
 Layers 2 & 3 are the same for basic vector-space IR systems
 Verity's Search97

• Allows administrators and users to define hierarchies of concepts in files

 Estimation of relevance of a document d w.r.t. the query q

• Set the belief of the corresponding node to 1
• Set all other document beliefs to 0
• Compute the belief of the query
• Rank documents in decreasing order of belief that they induce in the query

Other issues
 Spamming
• Adding popular query terms to a page unrelated to those terms
• E.g.: Adding “Hawaii vacation rental” to a page about “Internet gambling”
• Little setback due to hyperlink-based ranking
 Titles, headings, meta tags and anchor-text
• TFIDF framework treats all terms the same
• Meta search engines:
   Assign weightage to text occurring in tags, meta-tags
• Using anchor-text on pages u which link to v
   Anchor-text on u offers valuable editorial judgment about v as well.

Other issues (contd.)
 Including phrases to rank complex queries

• Operators to specify word inclusions and exclusions
• With operators and phrases, queries/documents can no longer be treated as ordinary points in vector space

 Dictionary of phrases

• Could be cataloged manually
• Could be derived from the corpus itself using statistical techniques
• Two separate indices:
   One for single terms and another for phrases

Corpus derived phrase dictionary
 Two terms $t_1$ and $t_2$
 Null hypothesis = occurrences of $t_1$ and $t_2$ are independent
 To the extent the pair violates the null hypothesis, it is likely to be a phrase

• Measuring violation with likelihood ratio of the hypothesis
• Pick phrases that violate the null hypothesis with large confidence

 Contingency table built from statistics

• $k_{00} = k(\bar{t}_1, \bar{t}_2)$, $k_{01} = k(\bar{t}_1, t_2)$
• $k_{10} = k(t_1, \bar{t}_2)$, $k_{11} = k(t_1, t_2)$
Corpus derived phrase dictionary
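 Hypotheses

• Null hypothesis (independent occurrence, parameters $p_1$, $p_2$)
  $H(p_1, p_2; k_{00}, k_{01}, k_{10}, k_{11}) \propto ((1-p_1)(1-p_2))^{k_{00}} ((1-p_1)p_2)^{k_{01}} (p_1(1-p_2))^{k_{10}} (p_1 p_2)^{k_{11}}$

• Alternative hypothesis (general case, parameters $p_{00}$, $p_{01}$, $p_{10}$, $p_{11}$)
  $H(p_{00}, p_{01}, p_{10}, p_{11}; k_{00}, k_{01}, k_{10}, k_{11}) \propto p_{00}^{k_{00}} p_{01}^{k_{01}} p_{10}^{k_{10}} p_{11}^{k_{11}}$

• Likelihood ratio
  $\lambda = \frac{\max_{p \in \Pi_0} H(p; k)}{\max_{p \in \Pi} H(p; k)}$, where $\Pi_0$ is the parameter space of the null hypothesis and $\Pi$ the full parameter space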
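The likelihood ratio can be evaluated in closed form by plugging in maximum-likelihood estimates, as in the sketch below. It assumes the independence model has parameters (p1, p2) estimated from the marginals and the unconstrained model has one parameter per cell, estimated as k_ij / N. The function names and counts are hypothetical, and at least one count must be nonzero.

```python
import math


def _xlogy(k, p):
    """k * log(p), with the convention 0 * log(0) = 0."""
    return 0.0 if k == 0 else k * math.log(p)


def log_likelihood_ratio(k00, k01, k10, k11):
    """Return log(lambda) <= 0 for a 2x2 contingency table of counts."""
    n = k00 + k01 + k10 + k11
    p1 = (k10 + k11) / n  # ML estimate of Pr(t1 present)
    p2 = (k01 + k11) / n  # ML estimate of Pr(t2 present)
    log_null = (
        _xlogy(k00, (1 - p1) * (1 - p2))
        + _xlogy(k01, (1 - p1) * p2)
        + _xlogy(k10, p1 * (1 - p2))
        + _xlogy(k11, p1 * p2)
    )
    log_alt = sum(_xlogy(k, k / n) for k in (k00, k01, k10, k11))
    return log_null - log_alt


independent = log_likelihood_ratio(25, 25, 25, 25)
associated = log_likelihood_ratio(900, 10, 10, 80)
```

Values of log(lambda) far below zero indicate that the pair violates the independence assumption and is a candidate phrase.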

Approximate string matching
 Non-uniformity of word spellings

• dialects of English
• transliteration from other languages
 Two ways to reduce this problem.
1. Aggressive conflation mechanism to collapse variant spellings into the same token
2. Decompose terms into a sequence of q-grams or sequences of q characters

Approximate string matching
1. Aggressive conflation mechanism to collapse variant spellings into the same token
• E.g.: Soundex: takes phonetics and pronunciation details into account
• Used with great success in indexing and searching last names in census and telephone directory data.

2. Decompose terms into a sequence of q-grams or sequences of q characters
• Check for similarity in the q-grams (2 ≤ q ≤ 4)
• Looking up the inverted index: a two-stage affair:
  • Smaller index of q-grams consulted to expand each query term into a set of slightly distorted query terms
  • These terms are submitted to the regular index
• Used by Google for spelling correction
• Idea also adopted for eliminating near-duplicate pages
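The first stage of the two-stage lookup might look like the sketch below: a small q-gram index is used to expand a query term into vocabulary terms with similar q-gram sets. Scoring candidates by Jaccard overlap and the `min_overlap` threshold are assumptions made for the example, as are the function names and vocabulary.

```python
from collections import defaultdict


def qgrams(term, q=3):
    """Set of q-grams of a term (the whole term if it has at most q characters)."""
    if len(term) <= q:
        return {term}
    return {term[i:i + q] for i in range(len(term) - q + 1)}


def build_qgram_index(vocabulary, q=3):
    """Map each q-gram to the set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for g in qgrams(term, q):
            index[g].add(term)
    return index


def expand(query_term, index, q=3, min_overlap=0.3):
    """Vocabulary terms whose q-gram sets overlap the query term's enough."""
    grams = qgrams(query_term, q)
    candidates = set()
    for g in grams:
        candidates |= index.get(g, set())
    return sorted(
        t for t in candidates
        if len(grams & qgrams(t, q)) / len(grams | qgrams(t, q)) >= min_overlap
    )


index = build_qgram_index(["colour", "color", "column", "java"])
expanded = expand("colour", index)
```

The expanded terms would then be submitted to the regular inverted index.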

Meta-search systems
• Take the search engine to the document
  • Forward queries to many geographically distributed repositories
  • Each has its own search service
  • Consolidate their responses.
• Advantages
  • Perform non-trivial query rewriting: suit a single user query to many search engines with different query syntax
  • Surprisingly small overlap between crawls
• Consolidating responses
  • Function goes beyond just eliminating duplicates
  • Search services do not provide standard ranks which can be combined meaningfully
Similarity search
• Cluster hypothesis

• Documents similar to relevant documents are
also likely to be relevant

• Handling “find similar” queries

• Replication or duplication of pages
• Mirroring of sites

Document similarity
• Jaccard coefficient of similarity between documents $d_1$ and $d_2$
• T(d) = set of tokens in document d
• $r'(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
• Symmetric, reflexive, not a metric
• Forgives any number of occurrences and any permutations of the terms.
• $1 - r'(d_1, d_2)$ is a metric
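The Jaccard coefficient is straightforward to compute on token sets. Treating two empty documents as identical is a convention chosen for this sketch.

```python
def jaccard(tokens1, tokens2):
    """|T(d1) & T(d2)| / |T(d1) | T(d2)| on the token sets of two documents."""
    t1, t2 = set(tokens1), set(tokens2)
    if not t1 and not t2:
        return 1.0  # convention for two empty documents
    return len(t1 & t2) / len(t1 | t2)


similarity = jaccard("a b c".split(), "b c d".split())
```

Repeated tokens and token order have no effect, since only the sets are compared.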

Estimating Jaccard coefficient with
random permutations

1. Generate a set of m random permutations π
2. for each π do
3.   compute π(T(d_1)) and π(T(d_2))
4.   check if min π(T(d_1)) = min π(T(d_2))
5. end for
6. if equality was observed in k cases, estimate $r'(d_1, d_2) = k/m$
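The estimation procedure can be sketched as below. For simplicity each permutation is drawn only over the tokens of the two documents rather than over the whole vocabulary, which is an assumption of this example; both token sets must be nonempty, and the seed is fixed only to make the run repeatable.

```python
import random


def estimate_jaccard(tokens1, tokens2, m=200, seed=0):
    """Estimate the Jaccard coefficient as k/m over m random permutations."""
    t1, t2 = set(tokens1), set(tokens2)
    if not t1 or not t2:
        raise ValueError("both token sets must be nonempty")
    rng = random.Random(seed)
    universe = sorted(t1 | t2)
    k = 0
    for _ in range(m):
        order = universe[:]
        rng.shuffle(order)
        rank = {token: i for i, token in enumerate(order)}  # the permutation
        if min(rank[t] for t in t1) == min(rank[t] for t in t2):
            k += 1
    return k / m


same = estimate_jaccard({"a", "b", "c"}, {"a", "b", "c"})
disjoint = estimate_jaccard({"a", "b"}, {"c", "d"})
partial = estimate_jaccard({"a", "b", "c"}, {"b", "c", "d"})  # true value is 0.5
```

The estimate becomes more reliable as m grows, at the cost of more permutations per pair.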

Fast similarity search with random
permutations
1. for each random permutation π do
2.   create a file f_π
3.   for each document d do
4.     write out ⟨s = min π(T(d)), d⟩ to f_π
5.   end for
6.   sort f_π using key s; this results in contiguous blocks with fixed s containing all associated d's
7.   create a file g_π
8.   for each pair (d_1, d_2) within a run of f_π having a given s do
9.     write out a document-pair record (d_1, d_2) to g_π
10.  end for
11.  sort g_π on key (d_1, d_2)
12. end for
13. merge g_π for all π in (d_1, d_2) order, counting the number of (d_1, d_2) entries

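An in-memory sketch of the same idea, with dictionaries and a counter standing in for the sorted files f_π and g_π. It assumes every document has at least one token; the function name, the fixed seed, and the toy documents are made up for the example.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations


def similar_pairs(docs, m=100, seed=0):
    """docs: dict did -> set of tokens.

    Returns a Counter mapping (d1, d2), with d1 < d2, to the number of
    permutations under which both documents share the same minimum.
    """
    rng = random.Random(seed)
    universe = sorted(set().union(*docs.values()))
    counts = Counter()
    for _ in range(m):
        order = universe[:]
        rng.shuffle(order)
        rank = {token: i for i, token in enumerate(order)}
        runs = defaultdict(list)  # plays the role of the sorted file f_pi
        for did, tokens in docs.items():
            runs[min(rank[t] for t in tokens)].append(did)
        for dids in runs.values():
            for pair in combinations(sorted(dids), 2):  # records of g_pi
                counts[pair] += 1
    return counts


docs = {"a": {"x", "y", "z"}, "b": {"x", "y", "z"}, "c": {"p", "q"}}
counts = similar_pairs(docs, m=100)
```

Dividing a pair's count by m gives the estimated Jaccard coefficient for that pair, and pairs that never share a minimum are never generated.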
Eliminating near-duplicates via shingling
• “Find-similar” algorithm reports all duplicate/near-duplicate pages
• Eliminating duplicates
  • Maintain a checksum with every page in the corpus
• Eliminating near-duplicates
  • Represent each document as a set T(d) of q-grams (shingles)
  • Find Jaccard similarity $r(d_1, d_2)$ between $d_1$ and $d_2$
  • Eliminate the pair from step 9 if it has similarity above a threshold
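A small sketch of both checks. Word-level shingles, SHA-1 as the checksum, and the threshold of 0.8 are assumptions chosen for illustration; the slides do not fix any of them.

```python
import hashlib


def checksum(text):
    """Checksum used to detect exact duplicates (SHA-1 is an assumed choice)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def shingles(text, q=4):
    """Set of word q-grams (shingles) of a document."""
    words = text.lower().split()
    if len(words) <= q:
        return {tuple(words)}
    return {tuple(words[i:i + q]) for i in range(len(words) - q + 1)}


def is_near_duplicate(text1, text2, q=4, threshold=0.8):
    """True if the Jaccard similarity of the shingle sets reaches the threshold."""
    s1, s2 = shingles(text1, q), shingles(text2, q)
    return len(s1 & s2) / len(s1 | s2) >= threshold


a = "the quick brown fox jumps over the lazy dog"
b = a + " again"
near = is_near_duplicate(a, b, q=2)
```

Here the two texts have different checksums, so only the shingle comparison reveals that they are nearly the same.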

Detecting locally similar sub-graphs of the Web
• Similarity search and duplicate elimination on the graph structure of the web
• To improve quality of hyperlink-assisted ranking

Detecting mirrored sites
Approach 1 [Bottom-up Approach]
1. Start process with textual duplicate detection
• cleaned URLs are listed and sorted to find duplicates/near-duplicates
• each set of equivalent URLs is assigned a unique token ID
• each page is stripped of all text, and represented as a sequence of outlink IDs
2. Continue using link sequence representation
• Until no further collapse of multiple URLs is possible

Approach 2 [Bottom-up Approach]
1. identify single nodes which are near duplicates (using text-shingling)
2. extend single-node mirrors to two-node mirrors
3. continue on to larger and larger graphs which are likely mirrors of one another

Detecting mirrored sites (contd.)
• Approach 3 [Step before fetching all pages]
  • Uses regularity in URL strings to identify host-pairs which are mirrors
• Preprocessing
  • Hosts are represented as sets of positional bigrams
  • Convert host and path to all lowercase characters
  • Let any punctuation or digit sequence be a token separator
  • Tokenize the URL into a sequence of tokens (e.g., www6.infoseek.com gives www, infoseek, com)
  • Eliminate stop terms such as htm, html, txt, main, index, home, bin, cgi
  • Form positional bigrams from the token sequence
• Two hosts are said to be mirrors if
  • A large fraction of paths are valid on both web sites
  • These common paths link to pages that are near-duplicates.
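The preprocessing steps can be sketched as follows. The input is assumed to be host plus path without a scheme, every non-letter run is treated as a separator, and a positional bigram is represented as (position, token, next token); that representation is an assumption, since the slides do not spell out the format.

```python
import re

STOP_TERMS = {"htm", "html", "txt", "main", "index", "home", "bin", "cgi"}


def url_tokens(url):
    """Lowercase, split on punctuation or digit runs, and drop stop terms."""
    tokens = re.split(r"[^a-z]+", url.lower())
    return [t for t in tokens if t and t not in STOP_TERMS]


def positional_bigrams(url):
    """Set of (position, token, next_token) triples (assumed representation)."""
    tokens = url_tokens(url)
    return {(i, a, b) for i, (a, b) in enumerate(zip(tokens, tokens[1:]))}


tokens = url_tokens("www6.infoseek.com")
bigrams = positional_bigrams("WWW6.Infoseek.com/cgi-bin/index.html")
```

Two hosts whose URLs share many positional bigrams become candidate mirrors, to be confirmed by the path and near-duplicate tests.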

Mining the Web Chakrabarti and Ramakrishnan

48

Weitere ähnliche Inhalte

Was ist angesagt?

RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rYanchang Zhao
 
An Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF GraphsAn Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF GraphsNikolaos Konstantinou
 
Hybrid geo textual index structure
Hybrid geo textual index structureHybrid geo textual index structure
Hybrid geo textual index structurecseij
 
Text Mining with R
Text Mining with RText Mining with R
Text Mining with RSanjay Mishra
 
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patilwidespreadpromotion
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenRevolution Analytics
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonPaco Nathan
 
hands on: Text Mining With R
hands on: Text Mining With Rhands on: Text Mining With R
hands on: Text Mining With RJahnab Kumar Deka
 
Latent Semantic Analysis of Wikipedia with Spark
Latent Semantic Analysis of Wikipedia with SparkLatent Semantic Analysis of Wikipedia with Spark
Latent Semantic Analysis of Wikipedia with SparkSandy Ryza
 
final_copy_camera_ready_paper (7)
final_copy_camera_ready_paper (7)final_copy_camera_ready_paper (7)
final_copy_camera_ready_paper (7)Ankit Rathi
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentIJERD Editor
 
Incremental Export of Relational Database Contents into RDF Graphs
Incremental Export of Relational Database Contents into RDF GraphsIncremental Export of Relational Database Contents into RDF Graphs
Incremental Export of Relational Database Contents into RDF GraphsNikolaos Konstantinou
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingPlanetData Network of Excellence
 
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformatanalyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformatleorick lin
 

Was ist angesagt? (20)

RDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-rRDataMining slides-text-mining-with-r
RDataMining slides-text-mining-with-r
 
Classification and Clustering of arXiv Documents, Sections, and Abstracts, Co...
Classification and Clustering of arXiv Documents, Sections, and Abstracts, Co...Classification and Clustering of arXiv Documents, Sections, and Abstracts, Co...
Classification and Clustering of arXiv Documents, Sections, and Abstracts, Co...
 
master_thesis_greciano_v2
master_thesis_greciano_v2master_thesis_greciano_v2
master_thesis_greciano_v2
 
An Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF GraphsAn Approach for the Incremental Export of Relational Databases into RDF Graphs
An Approach for the Incremental Export of Relational Databases into RDF Graphs
 
Hybrid geo textual index structure
Hybrid geo textual index structureHybrid geo textual index structure
Hybrid geo textual index structure
 
inteSearch: An Intelligent Linked Data Information Access Framework
inteSearch: An Intelligent Linked Data Information Access FrameworkinteSearch: An Intelligent Linked Data Information Access Framework
inteSearch: An Intelligent Linked Data Information Access Framework
 
Text Mining with R
Text Mining with RText Mining with R
Text Mining with R
 
Unit 3
Unit 3Unit 3
Unit 3
 
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil9. Searching & Sorting - Data Structures using C++ by Varsha Patil
9. Searching & Sorting - Data Structures using C++ by Varsha Patil
 
Scalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee EdlefsenScalable Data Analysis in R -- Lee Edlefsen
Scalable Data Analysis in R -- Lee Edlefsen
 
SF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in PythonSF Python Meetup: TextRank in Python
SF Python Meetup: TextRank in Python
 
hands on: Text Mining With R
hands on: Text Mining With Rhands on: Text Mining With R
hands on: Text Mining With R
 
Latent Semantic Analysis of Wikipedia with Spark
Latent Semantic Analysis of Wikipedia with SparkLatent Semantic Analysis of Wikipedia with Spark
Latent Semantic Analysis of Wikipedia with Spark
 
final_copy_camera_ready_paper (7)
final_copy_camera_ready_paper (7)final_copy_camera_ready_paper (7)
final_copy_camera_ready_paper (7)
 
A Survey of Entity Ranking over RDF Graphs
A Survey of Entity Ranking over RDF GraphsA Survey of Entity Ranking over RDF Graphs
A Survey of Entity Ranking over RDF Graphs
 
International Journal of Engineering Research and Development
International Journal of Engineering Research and DevelopmentInternational Journal of Engineering Research and Development
International Journal of Engineering Research and Development
 
Spatial LDA
Spatial LDASpatial LDA
Spatial LDA
 
Incremental Export of Relational Database Contents into RDF Graphs
Incremental Export of Relational Database Contents into RDF GraphsIncremental Export of Relational Database Contents into RDF Graphs
Incremental Export of Relational Database Contents into RDF Graphs
 
On the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream ProcessingOn the need for a W3C community group on RDF Stream Processing
On the need for a W3C community group on RDF Stream Processing
 
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformatanalyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
analyzing hdfs files using apace spark and mapreduce FixedLengthInputformat
 

Andere mochten auch

Low-cost Management of Inverted Files for Online Full-Text Search
Low-cost Management of Inverted Files for Online Full-Text Search Low-cost Management of Inverted Files for Online Full-Text Search
Low-cost Management of Inverted Files for Online Full-Text Search gmargari
 
Inverted files for text search engines
Inverted files for text search enginesInverted files for text search engines
Inverted files for text search enginesunyil96
 
Concepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search EngineConcepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search EngineGan Keng Hoon
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with LuceneKai Chan
 
What is the best full text search engine for Python?
What is the best full text search engine for Python?What is the best full text search engine for Python?
What is the best full text search engine for Python?Andrii Soldatenko
 
Search Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and SolrSearch Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and SolrKai Chan
 
Information retrieval concept, practice and challenge
Information retrieval   concept, practice and challengeInformation retrieval   concept, practice and challenge
Information retrieval concept, practice and challengeGan Keng Hoon
 

Andere mochten auch (9)

Low-cost Management of Inverted Files for Online Full-Text Search
Low-cost Management of Inverted Files for Online Full-Text Search Low-cost Management of Inverted Files for Online Full-Text Search
Low-cost Management of Inverted Files for Online Full-Text Search
 
Inverted files for text search engines
Inverted files for text search enginesInverted files for text search engines
Inverted files for text search engines
 
Concepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search EngineConcepts and Challenges of Text Retrieval for Search Engine
Concepts and Challenges of Text Retrieval for Search Engine
 
Introduction to search engine-building with Lucene
Introduction to search engine-building with LuceneIntroduction to search engine-building with Lucene
Introduction to search engine-building with Lucene
 
What is the best full text search engine for Python?
What is the best full text search engine for Python?What is the best full text search engine for Python?
What is the best full text search engine for Python?
 
Inverted index
Inverted indexInverted index
Inverted index
 
Search Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and SolrSearch Engine-Building with Lucene and Solr
Search Engine-Building with Lucene and Solr
 
Web Search Engine
Web Search EngineWeb Search Engine
Web Search Engine
 
Information retrieval concept, practice and challenge
Information retrieval   concept, practice and challengeInformation retrieval   concept, practice and challenge
Information retrieval concept, practice and challenge
 

search engine

  • 1. Web search engines
      • Rooted in Information Retrieval (IR) systems
          • Prepare a keyword index for the corpus
          • Respond to keyword queries with a ranked list of documents
      • ARCHIE
          • Earliest application of rudimentary IR systems to the Internet
          • Title search across sites serving files over FTP
  • 2. Boolean queries: Examples
      • Simple queries involving relationships between terms and documents
          • Documents containing the word Java
          • Documents containing the word Java but not the word coffee
      • Proximity queries
          • Documents containing the phrase Java beans or the term API
          • Documents where Java and island occur in the same sentence
      Mining the Web Chakrabarti and Ramakrishnan
  • 3. Document preprocessing
      • Tokenization
          • Filtering away tags
          • Tokens regarded as nonempty sequences of characters excluding spaces and punctuation
          • Token represented by a suitable integer, tid, typically 32 bits
          • Optional: stemming/conflation of words
          • Result: document (did) transformed into a sequence of integers (tid, pos)
  • 4. Storing tokens
      • Straightforward implementation using a relational database
          • Example figure
          • Space scales to almost 10 times
      • Accesses to the table show a common pattern
          • Reduce the storage by mapping tids to a lexicographically sorted buffer of (did, pos) tuples
          • Indexing = transposing the document-term matrix
  • 5. Figure: Two variants of the inverted index data structure, usually stored on disk. The simpler version in the middle does not store term offset information; the version to the right stores term offsets. The mapping from terms to documents and positions (written as “document/position”) may be implemented using a B-tree or a hash table.
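The tokenization, inverted index, and Boolean/phrase queries of slides 2 to 5 can be sketched in a few lines of Python. This is only an illustrative in-memory version: the function names (`tokenize`, `build_index`, `docs_with`, `phrase`) and the sample documents are assumptions, terms are kept as strings rather than 32-bit tids, and a plain dictionary stands in for the B-tree or hash table mentioned in the caption.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Filter away tags, lowercase, and split on anything that is not a letter or digit."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Build the positional variant: term -> {did: [pos, ...]}."""
    index = defaultdict(dict)
    for did, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(did, []).append(pos)
    return index

def docs_with(index, term):
    """Set of document IDs whose text contains the term."""
    return set(index.get(term, {}))

def phrase(index, first, second):
    """Documents where `second` occurs immediately after `first`."""
    hits = set()
    for did in docs_with(index, first) & docs_with(index, second):
        second_positions = set(index[second][did])
        if any(p + 1 in second_positions for p in index[first][did]):
            hits.add(did)
    return hits

# Hypothetical three-document corpus.
docs = {
    1: "<p>Java beans and the Java API</p>",
    2: "Java coffee beans",
    3: "The island of Java",
}
index = build_index(docs)
java_not_coffee = docs_with(index, "java") - docs_with(index, "coffee")
java_beans = phrase(index, "java", "beans")
```

Dropping the position lists and keeping only the document IDs gives the simpler, non-positional variant, which can answer the Boolean queries but not the phrase query.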
  • 6. Storage
      • For dynamic corpora
          • Berkeley DB2 storage manager
          • Can frequently add, modify and delete documents
      • For static collections
          • Index compression techniques (to be discussed)
  • 7. Stopwords
      • Function words and connectives
      • Appear in a large number of documents and are of little use in pinpointing documents
      • Indexing stopwords
          • Stopwords not indexed, to reduce index space and improve performance
          • Replace stopwords with a placeholder (to remember the offset)
      • Issues
          • Queries containing only stopwords are ruled out
          • Polysemous words that are stopwords in one sense but not in others, e.g., can as a verb vs. can as a noun
  • 8. Stemming
      • Conflating words to help match a query term with a morphological variant in the corpus
      • Remove inflections that convey parts of speech, tense and number
      • E.g.: university and universal both stem to universe
      • Techniques
          • Morphological analysis (e.g., Porter's algorithm)
          • Dictionary lookup (e.g., WordNet)
      • Stemming may increase recall but at the price of precision
          • Abbreviations, polysemy and names coined in the technical and commercial sectors
          • E.g.: stemming “ides” to “IDE”, “SOCKS” to “sock”, “gated” to “gate” may be bad!
  • 9. Batch indexing and updates
      • Incremental indexing
          • Time-consuming due to random disk I/O
          • High level of disk block fragmentation
      • Simple sort-merges
          • To replace the indexed update of variable-length postings
      • For a dynamic collection
          • A single document-level change may need to update hundreds to thousands of records
          • Solution: create an additional “stop-press” index
  • 10. Figure: Maintaining indices over dynamic collections.
  • 11. Stop-press index
      • Collection of documents in flux
          • Model document modification as deletion followed by insertion
          • Documents in flux represented by a signed record (d, t, s)
          • “s” specifies whether “d” has been deleted or inserted
      • Getting the final answer to a query
          • Main index returns a document set $D_0$
          • Stop-press index returns two document sets
              • $D_+$: documents not yet indexed in $D_0$ matching the query
              • $D_-$: documents matching the query removed from the collection since $D_0$ was constructed
      • Stop-press index getting too large
          • Rebuild the main index
          • Signed (d, t, s) records are sorted in (t, d, s) order and merge-purged into the master (t, d) records
          • Stop-press index can be emptied out
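A rough sketch of how a query might consult both indices, and how the stop-press index could be folded back into the main one. Several things here are assumptions rather than the deck's method: the names `query` and `rebuild`, the `"+"`/`"-"` encoding of the sign `s`, and the simplification of replaying records in arrival order (so that the latest record for a document wins) instead of the sort and merge-purge described on the slide.

```python
def query(main_index, stop_press, term):
    """Answer a one-term query from the main index plus the stop-press records."""
    hits = set(main_index.get(term, ()))      # D0 from the main index
    for d, t, s in stop_press:                # signed records, in arrival order
        if t != term:
            continue
        if s == "+":
            hits.add(d)                       # inserted since the main index was built
        else:
            hits.discard(d)                   # deleted since the main index was built
    return hits

def rebuild(main_index, stop_press):
    """Fold the stop-press records into a fresh main index (term -> sorted dids)."""
    merged = {t: set(dids) for t, dids in main_index.items()}
    for d, t, s in stop_press:
        postings = merged.setdefault(t, set())
        if s == "+":
            postings.add(d)
        else:
            postings.discard(d)
    return {t: sorted(dids) for t, dids in merged.items() if dids}

# Hypothetical state: document 2 was deleted, document 4 was added,
# and document 3 was modified (deletion followed by insertion).
main_index = {"java": [1, 2, 3]}
stop_press = [(2, "java", "-"), (4, "java", "+"), (3, "java", "-"), (3, "java", "+")]

answer = query(main_index, stop_press, "java")
new_main = rebuild(main_index, stop_press)
```

After `rebuild`, the stop-press list can be cleared, which corresponds to the last bullet of the slide.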
  • 12. Index compression techniques
      • Compressing the index so that much of it can be held in memory
          • Required for high-performance IR installations (as with Web search engines)
      • Redundancy in index storage
          • Storage of document IDs
      • Delta encoding
          • Sort document IDs in increasing order
          • Store the first ID in full
          • Subsequently store only the difference (gap) from the previous ID
  • 13. Encoding gaps
      • A small gap must cost far fewer bits than a document ID
      • Binary encoding
          • Optimal when all symbols are equally likely
      • Unary code
          • Optimal if the probability of large gaps decays exponentially
  • 14. Encoding gaps
      • Gamma code
          • Represent gap $x$ as the unary code for $1 + \lfloor \log x \rfloor$, followed by $x - 2^{\lfloor \log x \rfloor}$ represented in binary ($\lfloor \log x \rfloor$ bits)
      • Golomb codes
          • Further enhancement
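Delta encoding and the gamma code from slides 12 to 14 can be illustrated as follows. The sketch works on strings of "0"/"1" characters for readability rather than packed bits, and it assumes one common unary convention (the number $n$ is written as $n - 1$ ones followed by a zero); the function names are hypothetical.

```python
def gaps(doc_ids):
    """Delta-encode: the first ID in full, then the difference from the previous ID."""
    ids = sorted(doc_ids)
    return [ids[0]] + [b - a for a, b in zip(ids, ids[1:])]

def gamma_encode(x):
    """Gamma code for an integer x >= 1, returned as a bit string."""
    if x < 1:
        raise ValueError("gamma code is defined for x >= 1")
    n = x.bit_length() - 1                  # floor(log2 x)
    unary = "1" * n + "0"                   # unary code for 1 + floor(log2 x)
    remainder = format(x - (1 << n), "b").zfill(n) if n else ""
    return unary + remainder

def gamma_decode(bits):
    """Decode a concatenation of gamma codes back into a list of integers."""
    values, i = [], 0
    while i < len(bits):
        n = 0
        while bits[i] == "1":               # count the leading ones of the unary part
            n += 1
            i += 1
        i += 1                              # skip the terminating zero
        remainder = int(bits[i:i + n], 2) if n else 0
        i += n
        values.append((1 << n) + remainder)
    return values

postings = [3, 8, 9, 17]                    # hypothetical sorted document IDs
gap_list = gaps(postings)                   # [3, 5, 1, 8]
encoded = "".join(gamma_encode(g) for g in gap_list)
decoded = gamma_decode(encoded)
```

A gap of 1 costs a single bit under this code, which matches the requirement on slide 13 that small gaps be far cheaper than a full document ID.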
  • 15. Lossy compression mechanisms
      • Trading off space for time
      • Collect documents into buckets
          • Construct inverted index from terms to bucket IDs
          • Document IDs shrink to half their size
      • Cost: time overheads
          • For each query, all documents in that bucket need to be scanned
      • Solution: index documents in each bucket separately
          • E.g.: Glimpse (http://webglimpse.org/)
  • 16. General dilemmas
      • Messy updates vs. high compression rate
      • Storage allocation vs. random I/Os
      • Random I/O vs. large-scale implementation
  • 17. Relevance ranking
      • Keyword queries
          • In natural language
          • Not precise, unlike SQL
      • Boolean decision for response unacceptable
          • Solution
              • Rate each document for how likely it is to satisfy the user's information need
              • Sort in decreasing order of the score
              • Present results in a ranked list
      • No algorithmic way of ensuring that the ranking strategy always favors the information need
          • Query: only a part of the user's information need
  • 18. Responding to queries
      • Set-valued response
          • Response set may be very large (e.g., by recent estimates, over 12 million Web pages contain the word java)
      • Demanding selective query from user
      • Guessing user's information need and ranking responses
      • Evaluating rankings
  • 19. Evaluating procedure
      • Given benchmark
          • Corpus of $n$ documents $D$
          • A set of queries $Q$
          • For each query $q \in Q$, an exhaustive set of relevant documents $D_q \subseteq D$ manually identified
      • Query submitted to the system
          • Ranked list of documents $(d_1, d_2, \ldots, d_n)$ retrieved
          • Compute a 0/1 relevance list $(r_1, r_2, \ldots, r_n)$, where $r_i = 1$ iff $d_i \in D_q$, and $r_i = 0$ otherwise
  • 20. Recall and precision
      • Recall at rank $k \ge 1$
          • Fraction of all relevant documents included in $(d_1, d_2, \ldots, d_k)$
          • $\mathrm{recall}(k) = \frac{1}{|D_q|} \sum_{1 \le i \le k} r_i$
      • Precision at rank $k$
          • Fraction of the top $k$ responses that are actually relevant
          • $\mathrm{precision}(k) = \frac{1}{k} \sum_{1 \le i \le k} r_i$
  • 21. Other measures
      • Average precision
          • Sum of precision at each relevant hit position in the response list, divided by the total number of relevant documents
          • $\mathrm{avg.precision} = \frac{1}{|D_q|} \sum_{1 \le k \le |D|} r_k \cdot \mathrm{precision}(k)$
          • avg.precision = 1 iff the engine retrieves all relevant documents and ranks them ahead of any irrelevant document
      • Interpolated precision
          • To combine precision values from multiple queries
          • Gives a precision-vs.-recall curve for the benchmark
              • For each query, take the maximum precision obtained for the query for any recall greater than or equal to $\rho$
              • Average them together for all queries
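The measures on slides 19 to 21 translate directly into code once the 0/1 relevance list is available. The sketch below is for a single query; the function names and the example relevance list are assumptions, and `num_relevant` plays the role of $|D_q|$.

```python
def recall_at(r, k, num_relevant):
    """Fraction of all relevant documents found in the top k responses."""
    return sum(r[:k]) / num_relevant

def precision_at(r, k):
    """Fraction of the top k responses that are relevant."""
    return sum(r[:k]) / k

def average_precision(r, num_relevant):
    """Sum of precision at each relevant hit position, divided by |D_q|."""
    total = sum(precision_at(r, k) for k in range(1, len(r) + 1) if r[k - 1])
    return total / num_relevant

def interpolated_precision(r, num_relevant, rho):
    """Maximum precision over all ranks whose recall is at least rho."""
    candidates = [
        precision_at(r, k)
        for k in range(1, len(r) + 1)
        if recall_at(r, k, num_relevant) >= rho
    ]
    return max(candidates, default=0.0)

# Hypothetical relevance list: the 1st and 3rd responses are relevant, |D_q| = 2.
r = [1, 0, 1, 0, 0]
ap = average_precision(r, num_relevant=2)             # (1/1 + 2/3) / 2
ip_full = interpolated_precision(r, 2, rho=1.0)       # best precision once recall reaches 1
```

Averaging `interpolated_precision` over all benchmark queries at a grid of `rho` values would give the precision-vs.-recall curve mentioned on the slide.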
  • 22. Precision-Recall tradeoff
      • Interpolated precision cannot increase with recall
          • Interpolated precision at recall level 0 may be less than 1
      • At level $k = 0$
          • Precision (by convention) = 1, Recall = 0
      • Inspecting more documents
          • Can increase recall
          • Precision may decrease: we will start encountering more and more irrelevant documents
      • A search engine with a good ranking function will generally show a negative relation between recall and precision
          • The higher the curve, the better the engine
  • 23. Figure: Precision and interpolated precision plotted against recall for the given relevance vector. Missing $r_k$ are zeroes.
  • 24. The vector space model
      • Documents represented as vectors in a multi-dimensional Euclidean space
          • Each axis = a term (token)
      • Coordinate of document $d$ in the direction of term $t$ determined by:
          • Term frequency TF($d$, $t$): number of times term $t$ occurs in document $d$, scaled in a variety of ways to normalize document length
          • Inverse document frequency IDF($t$): to scale down the coordinates of terms that occur in many documents
  • 25. Term frequency
      • $\mathrm{TF}(d, t) = \frac{n(d, t)}{\sum_{\tau} n(d, \tau)}$
      • $\mathrm{TF}(d, t) = \frac{n(d, t)}{\max_{\tau} n(d, \tau)}$
      • Cornell SMART system uses a smoothed version
          • $\mathrm{TF}(d, t) = 0$ if $n(d, t) = 0$
          • $\mathrm{TF}(d, t) = 1 + \log(1 + n(d, t))$ otherwise
  • 26. Inverse document frequency
      • Given
          • $D$ is the document collection and $D_t$ is the set of documents containing $t$
      • Formulae
          • Mostly dampened functions of $|D| / |D_t|$
          • SMART: $\mathrm{IDF}(t) = \log\left(\frac{1 + |D|}{|D_t|}\right)$
  • 27. Vector space model
      • Coordinate of document $d$ in axis $t$
          • $d_t = \mathrm{TF}(d, t)\,\mathrm{IDF}(t)$
          • Transformed to $\vec{d}$ in the TFIDF-space
      • Query $q$
          • Interpreted as a document
          • Transformed to $\vec{q}$ in the same TFIDF-space as $d$
  • 28. Measures of proximity
      • Distance measure
          • Magnitude of the vector difference $|\vec{d} - \vec{q}|$
          • Document vectors must be normalized to unit ($L_1$ or $L_2$) length
              • Else shorter documents dominate (since queries are short)
      • Cosine similarity
          • Cosine of the angle between $\vec{d}$ and $\vec{q}$
          • Shorter documents are penalized
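Slides 24 to 28 can be tied together in a small sketch that builds TFIDF vectors with the SMART-style formulas as given on slides 25 and 26 and ranks documents by cosine similarity. The helper names, the sparse dictionary representation, and the tiny corpus are assumptions made for illustration; query terms that never occur in the corpus are simply ignored.

```python
import math
from collections import Counter

def smart_tf(n):
    """Smoothed term frequency: 0 if the term is absent, else 1 + log(1 + n)."""
    return 0.0 if n == 0 else 1.0 + math.log(1.0 + n)

def smart_idf(num_docs, doc_freq):
    """IDF(t) = log((1 + |D|) / |D_t|)."""
    return math.log((1.0 + num_docs) / doc_freq)

def vectorize(tokens, df, num_docs):
    """Map a token list to a sparse TFIDF vector {term: weight}."""
    counts = Counter(tokens)
    return {
        t: smart_tf(n) * smart_idf(num_docs, df[t])
        for t, n in counts.items()
        if df.get(t)                          # skip terms unseen in the corpus
    }

def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical pre-tokenized corpus.
docs = [
    ["java", "coffee", "beans"],
    ["java", "island", "indonesia"],
    ["java", "api", "beans", "beans"],
]
df = Counter(t for d in docs for t in set(d))      # document frequency |D_t|
vectors = [vectorize(d, df, len(docs)) for d in docs]

query_vec = vectorize(["coffee", "beans"], df, len(docs))
scores = [cosine(query_vec, v) for v in vectors]
best = max(range(len(docs)), key=lambda i: scores[i])
```

Because cosine similarity divides by both vector lengths, no separate normalization step is needed before ranking.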
  • 29. Relevance feedback
      • Users learning how to modify queries
          • Response list must have at least some relevant documents
      • Relevance feedback
          • “Correcting” the ranks to the user's taste
          • Automates the query refinement process
      • Rocchio's method
          • Folding in user feedback
          • To the query vector $\vec{q}$:
              • Add a weighted sum of vectors for relevant documents $D_+$
              • Subtract a weighted sum of the irrelevant documents $D_-$
          • $\vec{q}\,' = \alpha \vec{q} + \beta \sum_{\vec{d} \in D_+} \vec{d} - \gamma \sum_{\vec{d} \in D_-} \vec{d}$
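Rocchio's update is a weighted vector sum, so it is short to write down. The sketch below uses sparse dictionaries for vectors; the function name and the particular weights in the example are assumptions chosen only to make the arithmetic easy to follow.

```python
def rocchio(query, relevant, irrelevant, alpha, beta, gamma):
    """Return q' = alpha*q + beta*sum(D+) - gamma*sum(D-) as a sparse vector."""
    new_query = {t: alpha * w for t, w in query.items()}
    for doc in relevant:                      # vectors of documents in D+
        for t, w in doc.items():
            new_query[t] = new_query.get(t, 0.0) + beta * w
    for doc in irrelevant:                    # vectors of documents in D-
        for t, w in doc.items():
            new_query[t] = new_query.get(t, 0.0) - gamma * w
    return new_query

# Hypothetical vectors and weights.
q = {"java": 1.0}
d_plus = [{"java": 1.0, "api": 2.0}]
d_minus = [{"java": 1.0, "coffee": 2.0}]

q_new = rocchio(q, d_plus, d_minus, alpha=1.0, beta=0.5, gamma=0.25)
```

The refined query gains weight on "api" and a negative weight on "coffee". Passing `gamma=0` and an empty `irrelevant` list corresponds to using only positive feedback.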
  • 30. Relevance feedback (contd.)
      • Pseudo-relevance feedback
          • $D_+$ and $D_-$ generated automatically
              • E.g.: Cornell SMART system
              • Top 10 documents reported by the first round of query execution are included in $D_+$
          • $\gamma$ typically set to 0; $D_-$ not used
      • Not a commonly available feature
          • Web users want instant gratification
          • System complexity
              • Executing the second-round query is slower and expensive for major search engines
  • 31. Ranking by odds ratio
      • $R$: Boolean random variable which represents the relevance of document $d$ w.r.t. query $q$
      • Ranking documents by their odds ratio for relevance
          • $\frac{\Pr(R \mid q, d)}{\Pr(\bar{R} \mid q, d)} = \frac{\Pr(R, q, d) / \Pr(q, d)}{\Pr(\bar{R}, q, d) / \Pr(q, d)} = \frac{\Pr(R \mid q)\,\Pr(d \mid R, q)}{\Pr(\bar{R} \mid q)\,\Pr(d \mid \bar{R}, q)}$
      • Approximating the probability of $d$ by the product of the probabilities of individual terms in $d$
          • $\Pr(d \mid R, q) \approx \prod_t \Pr(x_t \mid R, q)$
          • $\Pr(d \mid \bar{R}, q) \approx \prod_t \Pr(x_t \mid \bar{R}, q)$
      • Approximately, with $a_{t,q} = \Pr(x_t = 1 \mid R, q)$ and $b_{t,q} = \Pr(x_t = 1 \mid \bar{R}, q)$:
          • $\frac{\Pr(R \mid q, d)}{\Pr(\bar{R} \mid q, d)} \propto \prod_{t \in q \cap d} \frac{a_{t,q}(1 - b_{t,q})}{b_{t,q}(1 - a_{t,q})}$
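Since only the ordering of documents matters, the product over $t \in q \cap d$ can be replaced by a sum of logarithms. The following sketch assumes the per-term probabilities $a_{t,q}$ and $b_{t,q}$ are already available as dictionaries; how they are estimated is not covered on the slide, and the values below are made up purely for illustration.

```python
import math

def log_odds_score(query_terms, doc_terms, a, b):
    """Sum over t in (q ∩ d) of log[a_t (1 - b_t) / (b_t (1 - a_t))]."""
    shared = set(query_terms) & set(doc_terms)
    return sum(
        math.log(a[t] * (1.0 - b[t]) / (b[t] * (1.0 - a[t])))
        for t in shared
    )

# Hypothetical estimates: a[t] = Pr(term present | relevant),
# b[t] = Pr(term present | not relevant).
a = {"java": 0.8, "api": 0.6}
b = {"java": 0.2, "api": 0.3}

query_terms = ["java", "api"]
score_both = log_odds_score(query_terms, ["java", "api", "beans"], a, b)
score_one = log_odds_score(query_terms, ["java", "island"], a, b)
```

A document sharing more high-odds terms with the query receives a larger score, so sorting by this value reproduces the ranking by odds ratio under the independence approximation.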
  • 32. Bayesian Inferencing
      • Figure: Bayesian inference network for relevance ranking. A document is relevant to the extent that setting its corresponding belief node to true lets us assign a high degree of belief in the node corresponding to the query.
      • Manual specification of mappings between terms to approximate concepts
  • 33. Bayesian Inferencing (contd.)
      • Four layers
          1. Document layer
          2. Representation layer
          3. Query concept layer
          4. Query
      • Each node is associated with a random Boolean variable, reflecting belief
      • Directed arcs signify that the belief of a node is a function of the belief of its immediate parents (and so on)
  • 34. Bayesian Inferencing systems
      • Layers 2 and 3 are the same for basic vector-space IR systems
      • Verity's Search97
          • Allows administrators and users to define hierarchies of concepts in files
      • Estimation of relevance of a document $d$ w.r.t. the query $q$
          • Set the belief of the corresponding node to 1
          • Set all other document beliefs to 0
          • Compute the belief of the query
          • Rank documents in decreasing order of the belief that they induce in the query
  • 35. Other issues
      • Spamming
          • Adding popular query terms to a page unrelated to those terms
          • E.g.: adding “Hawaii vacation rental” to a page about “Internet gambling”
          • Little setback due to hyperlink-based ranking
      • Titles, headings, meta tags and anchor-text
          • TFIDF framework treats all terms the same
          • Meta search engines:
              • Assign weightage to text occurring in tags, meta-tags
          • Using anchor-text on pages $u$ which link to $v$
              • Anchor-text on $u$ offers valuable editorial judgment about $v$ as well
  • 36. Other issues (contd.)
      • Including phrases to rank complex queries
          • Operators to specify word inclusions and exclusions
          • With operators and phrases, queries/documents can no longer be treated as ordinary points in vector space
      • Dictionary of phrases
          • Could be cataloged manually
          • Could be derived from the corpus itself using statistical techniques
          • Two separate indices: one for single terms and another for phrases
  • 37. Corpus derived phrase dictionary
      • Two terms $t_1$ and $t_2$
      • Null hypothesis = occurrences of $t_1$ and $t_2$ are independent
      • To the extent the pair violates the null hypothesis, it is likely to be a phrase
          • Measuring violation with the likelihood ratio of the hypothesis
          • Pick phrases that violate the null hypothesis with large confidence
      • Contingency table built from statistics
          • $k_{00} = k(\bar{t}_1, \bar{t}_2)$, $k_{01} = k(\bar{t}_1, t_2)$
          • $k_{10} = k(t_1, \bar{t}_2)$, $k_{11} = k(t_1, t_2)$
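  • 38. Corpus derived phrase dictionary
      • Hypotheses
          • Unconstrained model (the two terms may be dependent)
              • $H(p_{00}, p_{01}, p_{10}, p_{11}; k_{00}, k_{01}, k_{10}, k_{11}) \propto p_{00}^{k_{00}}\, p_{01}^{k_{01}}\, p_{10}^{k_{10}}\, p_{11}^{k_{11}}$
          • Null hypothesis (independence, as defined on the previous slide)
              • $H(p_1, p_2; k_{00}, k_{01}, k_{10}, k_{11}) \propto ((1 - p_1)(1 - p_2))^{k_{00}}\, ((1 - p_1) p_2)^{k_{01}}\, (p_1 (1 - p_2))^{k_{10}}\, (p_1 p_2)^{k_{11}}$
          • Likelihood ratio
              • $\lambda = \frac{\max_{p \in \Pi_0} H(p; k)}{\max_{p \in \Pi} H(p; k)}$, where $\Pi_0$ is the parameter space of the null hypothesis and $\Pi$ is the entire parameter space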
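The likelihood ratio can be evaluated in closed form because both maximizations have simple solutions: under independence the maximizing values are the marginal frequencies of $t_1$ and $t_2$, and in the unconstrained model they are the cell frequencies $k_{ij}/N$. The sketch below returns $-2 \log \lambda$, a commonly used test statistic, which is an addition not stated on the slide; the function names and the example counts are assumptions.

```python
import math

def _k_log_p(k, p):
    """k * log(p), with the convention that a zero count contributes nothing."""
    return 0.0 if k == 0 else k * math.log(p)

def phrase_statistic(k00, k01, k10, k11):
    """Return -2 log(lambda) for a 2x2 co-occurrence table (0 means independent)."""
    n = k00 + k01 + k10 + k11
    p1 = (k10 + k11) / n                      # estimated Pr(t1 occurs)
    p2 = (k01 + k11) / n                      # estimated Pr(t2 occurs)
    log_null = (
        _k_log_p(k00, (1 - p1) * (1 - p2))
        + _k_log_p(k01, (1 - p1) * p2)
        + _k_log_p(k10, p1 * (1 - p2))
        + _k_log_p(k11, p1 * p2)
    )
    log_full = sum(_k_log_p(k, k / n) for k in (k00, k01, k10, k11))
    return -2.0 * (log_null - log_full)

# Hypothetical counts: the first table is exactly independent,
# the second has t1 and t2 co-occurring far more often than chance.
independent = phrase_statistic(25, 25, 25, 25)
associated = phrase_statistic(90, 5, 5, 50)
```

Term pairs with a large statistic violate the independence hypothesis most strongly and are the candidates for the phrase dictionary.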
  • 39. Approximate string matching
      • Non-uniformity of word spellings
          • dialects of English
          • transliteration from other languages
      • Two ways to reduce this problem
          1. Aggressive conflation mechanism to collapse variant spellings into the same token
          2. Decompose terms into a sequence of q-grams, or sequences of q characters
  • 40. Approximate string matching
      1. Aggressive conflation mechanism to collapse variant spellings into the same token
          • E.g.: Soundex: takes phonetics and pronunciation details into account
          • Used with great success in indexing and searching last names in census and telephone directory data
      2. Decompose terms into a sequence of q-grams, or sequences of q characters
          • Check for similarity in the q-grams (2 ≤ q ≤ 4)
          • Looking up the inverted index is a two-stage affair:
              • Smaller index of q-grams consulted to expand each query term into a set of slightly distorted query terms
              • These terms are submitted to the regular index
          • Used by Google for spelling correction
          • Idea also adopted for eliminating near-duplicate pages
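The two-stage lookup can be sketched as follows: a small q-gram index maps each q-gram to the vocabulary terms containing it, and a query term is expanded into the terms whose q-gram sets overlap it sufficiently. This is a minimal sketch under assumptions: the names `qgrams`, `build_qgram_index` and `expand`, the use of Jaccard overlap between q-gram sets, and the 0.5 threshold are all illustrative choices, not details given in the slides.

```python
from collections import defaultdict

def qgrams(term, q=2):
    """Set of all length-q substrings of term (the whole term if it is shorter than q)."""
    if len(term) < q:
        return {term}
    return {term[i:i + q] for i in range(len(term) - q + 1)}

def build_qgram_index(vocabulary, q=2):
    """Stage-one index: q-gram -> set of vocabulary terms containing it."""
    index = defaultdict(set)
    for term in vocabulary:
        for gram in qgrams(term, q):
            index[gram].add(term)
    return index

def expand(query_term, index, q=2, min_overlap=0.5):
    """Expand a query term into similarly spelled vocabulary terms."""
    query_grams = qgrams(query_term, q)
    # Only terms sharing at least one q-gram with the query are candidates.
    candidates = set()
    for gram in query_grams:
        candidates |= index.get(gram, set())
    expanded = set()
    for term in candidates:
        term_grams = qgrams(term, q)
        overlap = len(query_grams & term_grams) / len(query_grams | term_grams)
        if overlap >= min_overlap:
            expanded.add(term)
    return expanded

index = build_qgram_index(["color", "colour", "collar", "dolor"])
variants = expand("color", index)  # these would then be sent to the regular index
```

Raising `min_overlap` makes the expansion stricter, which trades recall of variant spellings for fewer spurious terms submitted to the regular index.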
  • 41. Meta-search systems
      • Take the search engine to the document
          • Forward queries to many geographically distributed repositories
          • Each has its own search service
      • Suit a single user query to many search engines with different query syntax
      • Consolidate their responses
      • Advantages
          • Perform non-trivial query rewriting
          • Surprisingly small overlap between crawls
      • Consolidating responses
          • Function goes beyond just eliminating duplicates
          • Search services do not provide standard ranks which can be combined meaningfully
  • 42. Similarity search
      • Cluster hypothesis
          • Documents similar to relevant documents are also likely to be relevant
      • Handling “find similar” queries
      • Replication or duplication of pages
      • Mirroring of sites
  • 43. Document similarity
      • Jaccard coefficient of similarity between documents $d_1$ and $d_2$
      • $T(d)$ = set of tokens in document $d$
          $$r'(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$$
      • Symmetric, reflexive, not a metric
      • Forgives any number of occurrences and any permutations of the terms
      • $1 - r'(d_1, d_2)$ is a metric
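A direct computation of the coefficient shows why repetitions and reorderings are forgiven: once a document is reduced to its token set, neither counts nor order survive. In this sketch, whitespace tokenization, lowercasing, and returning 1.0 for two empty documents are assumptions made for illustration.

```python
def token_set(text):
    """T(d): the set of tokens in a document (assumed: lowercase, split on whitespace)."""
    return set(text.lower().split())

def jaccard(d1, d2):
    """r'(d1, d2) = |T(d1) & T(d2)| / |T(d1) | T(d2)|."""
    t1, t2 = token_set(d1), token_set(d2)
    union = t1 | t2
    if not union:
        return 1.0  # assumed convention: two empty documents are identical
    return len(t1 & t2) / len(union)

similarity = jaccard("java beans api", "java island api")
distance = 1 - similarity  # the metric form noted on the slide
```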
  • 44. Estimating Jaccard coefficient with random permutations
      1. Generate a set of $m$ random permutations $\Pi$
      2. for each $\Pi$ do
      3.     compute $\Pi(T(d_1))$ and $\Pi(T(d_2))$
      4.     check if $\min \Pi(T(d_1)) = \min \Pi(T(d_2))$
      5. end for
      6. if equality was observed in $k$ cases, estimate $r'(d_1, d_2) = k/m$
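The six steps translate almost line for line into code. The sketch below assumes explicit permutations obtained by shuffling the combined token universe of the two documents, which is only practical for small examples; the function name `estimate_jaccard`, the fixed seed, and the handling of empty token sets are illustrative assumptions.

```python
import random

def estimate_jaccard(t1, t2, m=200, seed=0):
    """Estimate r'(d1, d2) from token sets t1, t2 using m random permutations."""
    if not t1 or not t2:
        return 1.0 if t1 == t2 else 0.0  # assumed convention for empty sets
    rng = random.Random(seed)
    universe = sorted(t1 | t2)
    k = 0
    for _ in range(m):
        # A random permutation assigns each token a distinct rank.
        order = universe[:]
        rng.shuffle(order)
        rank = {token: i for i, token in enumerate(order)}
        # Equal minimum ranks mean the same token is smallest in both sets.
        if min(rank[t] for t in t1) == min(rank[t] for t in t2):
            k += 1
    return k / m
```

The minima coincide exactly when the lowest-ranked token of the union lies in the intersection, which happens with probability equal to the Jaccard coefficient, so `k / m` converges to it as `m` grows.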
  • 45. Fast similarity search with random permutations
       1. for each random permutation $\Pi$ do
       2.     create a file $f_\Pi$
       3.     for each document $d$ do
       4.         write out $\langle s = \min \Pi(T(d)), d \rangle$ to $f_\Pi$
       5.     end for
       6.     sort $f_\Pi$ using key $s$; this results in contiguous blocks with fixed $s$ containing all associated $d$s
       7.     create a file $g_\Pi$
       8.     for each pair $(d_1, d_2)$ within a run of $f_\Pi$ having a given $s$ do
       9.         write out a document-pair record $(d_1, d_2)$ to $g_\Pi$
      10.     end for
      11.     sort $g_\Pi$ on key $(d_1, d_2)$
      12. end for
      13. merge $g_\Pi$ for all $\Pi$ in $(d_1, d_2)$ order, counting the number of $(d_1, d_2)$ entries
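For a corpus that fits in memory, the same procedure can be sketched with dictionaries standing in for the sorted files: grouping documents by their minimum plays the role of sorting $f_\Pi$, and a counter plays the role of the final merge over the $g_\Pi$ files. The function name `similar_pairs`, the explicit shuffled permutations, and the requirement that every document has at least one token are assumptions of this sketch.

```python
import random
from collections import Counter, defaultdict
from itertools import combinations

def similar_pairs(docs, m=100, seed=0):
    """docs: dict of document id -> non-empty token set.

    Returns {(d1, d2): estimated Jaccard} for every pair that shared a
    minimum under at least one of the m random permutations.
    """
    rng = random.Random(seed)
    vocabulary = sorted(set().union(*docs.values()))
    pair_counts = Counter()
    for _ in range(m):
        order = vocabulary[:]
        rng.shuffle(order)
        rank = {token: i for i, token in enumerate(order)}
        # Steps 2-6: group documents by s = min of the permuted token set.
        runs = defaultdict(list)
        for did, tokens in docs.items():
            runs[min(rank[t] for t in tokens)].append(did)
        # Steps 7-11: emit every document pair inside a run.
        for dids in runs.values():
            for pair in combinations(sorted(dids), 2):
                pair_counts[pair] += 1
    # Step 13: the count for a pair, divided by m, estimates its similarity.
    return {pair: count / m for pair, count in pair_counts.items()}

result = similar_pairs({"a": {"x", "y"}, "b": {"x", "y"}, "c": {"z"}})
```

Pairs that never share a minimum are never materialized, which is the point of the design: the work is proportional to the number of co-occurring pairs rather than to all pairs of documents.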
  • 46. Eliminating near-duplicates via shingling
      • “Find-similar” algorithm reports all duplicate/near-duplicate pages
      • Eliminating duplicates
          • Maintain a checksum with every page in the corpus
      • Eliminating near-duplicates
          • Represent each document as a set $T(d)$ of q-grams (shingles)
          • Find Jaccard similarity $r(d_1, d_2)$ between $d_1$ and $d_2$
          • Eliminate the pair from step 9 if it has similarity above a threshold
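Both checks can be sketched in a few lines. The example below assumes word-level shingles, SHA-1 as the page checksum, and a threshold of 0.9; none of these specifics come from the slides, and the function names are hypothetical.

```python
import hashlib

def checksum(page):
    """Exact-duplicate check: identical pages have identical digests."""
    return hashlib.sha1(page.encode("utf-8")).hexdigest()

def shingles(text, q=4):
    """T(d) as the set of q-word shingles (the whole text if it has fewer than q words)."""
    words = text.lower().split()
    return {tuple(words[i:i + q]) for i in range(max(len(words) - q + 1, 1))}

def resemblance(d1, d2, q=4):
    """Jaccard similarity r(d1, d2) over shingle sets."""
    s1, s2 = shingles(d1, q), shingles(d2, q)
    return len(s1 & s2) / len(s1 | s2)

def is_near_duplicate(d1, d2, q=4, threshold=0.9):
    """True when the shingle resemblance exceeds the (assumed) threshold."""
    return resemblance(d1, d2, q) > threshold

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumps over the lazy cat"
```

Unlike single-token sets, shingles retain local word order, so two pages with the same vocabulary in a different arrangement no longer look identical.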
  • 47. Detecting locally similar sub-graphs of the Web
      • Similarity search and duplicate elimination on the graph structure of the web
      • To improve quality of hyperlink-assisted ranking
      • Detecting mirrored sites
      • Approach 1 [Bottom-up Approach]
          1. Start process with textual duplicate detection
              • cleaned URLs are listed and sorted to find duplicates/near-duplicates
              • each set of equivalent URLs is assigned a unique token ID
              • each page is stripped of all text, and represented as a sequence of outlink IDs
          2. Continue using link sequence representation
              • until no further collapse of multiple URLs is possible
      • Approach 2 [Bottom-up Approach]
          1. identify single nodes which are near-duplicates (using text shingling)
          2. extend single-node mirrors to two-node mirrors
          3. continue on to larger and larger graphs which are likely mirrors of one another
  • 48. Detecting mirrored sites (contd.)
      • Approach 3 [Step before fetching all pages]
          • Uses regularity in URL strings to identify host-pairs which are mirrors
      • Preprocessing
          • Hosts are represented as sets of positional bigrams
          • Convert host and path to all lowercase characters
          • Let any punctuation or digit sequence be a token separator
          • Tokenize the URL into a sequence of tokens (e.g., www6.infoseek.com gives www, infoseek, com)
          • Eliminate stop terms such as htm, html, txt, main, index, home, bin, cgi
          • Form positional bigrams from the token sequence
      • Two hosts are said to be mirrors if
          • A large fraction of paths are valid on both web sites
          • These common paths link to pages that are near-duplicates
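The preprocessing steps can be sketched as a short pipeline. The slide does not spell out the exact form of a positional bigram, so this sketch assumes it is a pair of adjacent tokens together with the index of the first one; the function names and the ASCII-only tokenization are likewise assumptions.

```python
import re

# Stop terms listed on the slide.
STOP_TERMS = {"htm", "html", "txt", "main", "index", "home", "bin", "cgi"}

def url_tokens(url_part):
    """Lowercase, split on any run of non-letters, and drop stop terms."""
    raw = re.split(r"[^a-z]+", url_part.lower())
    return [token for token in raw if token and token not in STOP_TERMS]

def positional_bigrams(url_part):
    """Assumed form: (token_i, token_{i+1}, i) for each adjacent token pair."""
    tokens = url_tokens(url_part)
    return {(a, b, i) for i, (a, b) in enumerate(zip(tokens, tokens[1:]))}

host_tokens = url_tokens("www6.infoseek.com")
host_bigrams = positional_bigrams("www6.infoseek.com")
```

Two hosts whose bigram sets overlap heavily become candidate mirrors, to be confirmed by the path and near-duplicate checks listed on the slide.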