Boolean retrieval

Introduction to Information Retrieval (IIR)
1. Boolean retrieval

Online text: http://nlp.stanford.edu/IR-book/pdf/01bool.pdf
Slides based on: http://nlp.stanford.edu/IR-book/ppt/01intro.pptx


Outline

1.0. Introduction
- What is IR?

1.1. Grepping
- Why is grepping so bad?

1.2. Inverted index
- to reduce the amount of memory

1.3. Processing queries

1.4. Optimization
- to reduce the amount of memory further


Introduction

Information Retrieval (IR):
finding material of an unstructured nature that
satisfies an information need from within large
collections.
(the term was coined by Calvin Mooers in 1948/50)

※ "Search" is sometimes avoided because it is ambiguous. IIR treats "search" and "retrieval" as synonyms.


Introduction

Information Retrieval (IR), read concretely:
finding material (documents) of an unstructured nature (text) that
satisfies an information need from within large
collections (stored on computers).


information need (p.5)

information need:
the topic about which the user desires to know
more
query:
(the text that) the user conveys to the computer in
an attempt to communicate the information need
※ Note the difference between the information need and the query.

A document is relevant
if it is one that the user perceives as containing
information of value with respect to their
personal information need.

[Figure: the user has an information need and types a query into the computer; the computer returns a result, which the user perceives and judges as relevant or not.]


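ad hoc retrieval
- ad hoc retrieval:
IR in which the target document set (collection) changes little, while the queries change over the short term.
- information filtering:
the queries change little, while the collection changes over the short term.

※ "アドホック検索とは？ - たつをのChangeLog" ("What is ad hoc retrieval? - Tatsuo's ChangeLog")
http://chalow.net/2008-01-16-1.html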


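1.1. Why is grepping so bad?

grep:
scans every document from beginning to end
can find the specified word
is reasonably practical on recent high-performance PCs
> grep -r "query" *

-> How can a large collection be searched quickly?
-> Can more flexible search conditions be specified?
-> Can the results be ranked?

--> Index in advance!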


Term-document incidence matrix

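incidence matrix:
a matrix built by assigning 1 to each term that occurs in a document and 0 to each term that does not
(rows: terms, columns: documents)
incidence vector:
each row of the matrix: the 0/1 vector of one term across all documents
(a column likewise gives the vector of one document)

Query processing: ex. [(A AND B) OR NOT C]
- AND: bitwise AND of the incidence vectors
- OR: bitwise OR of the incidence vectors
- NOT: complement the incidence vector before combining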
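
As a minimal sketch of the three operations above, the Python snippet below stores each term's incidence vector as a list of 0/1 values and evaluates [(A AND B) OR NOT C]. The terms, document names, and values are hypothetical toy data, not the matrix used in these slides.

```python
# Hypothetical toy incidence matrix: one 0/1 vector per term,
# one position per document (assumed data, for illustration only).
DOCS = ["d1", "d2", "d3", "d4"]
INCIDENCE = {
    "A": [1, 1, 0, 0],
    "B": [1, 0, 1, 0],
    "C": [0, 1, 1, 0],
}


def v_and(x, y):
    """Element-wise AND of two incidence vectors."""
    return [a & b for a, b in zip(x, y)]


def v_or(x, y):
    """Element-wise OR of two incidence vectors."""
    return [a | b for a, b in zip(x, y)]


def v_not(x):
    """Complement of an incidence vector."""
    return [1 - a for a in x]


def matching_docs(vector):
    """Names of the documents whose position holds a 1."""
    return [doc for doc, bit in zip(DOCS, vector) if bit]


# Query: (A AND B) OR NOT C
result = v_or(v_and(INCIDENCE["A"], INCIDENCE["B"]), v_not(INCIDENCE["C"]))
print(matching_docs(result))  # ['d1', 'd4']
```

Each operation touches every position of the vector, so the cost grows with the number of documents even when almost all entries are 0.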


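term/doc L+ SD kyrn SH Uta CLN
classmate (同級生) 1 1 1 1 1
honor student (優等生) 1 1
young lady (お嬢様) 1
junior (後輩) 1 1
cool (クール) 1 1 1 1
senior (先輩) 1
avoidance (忌避) 1
yandere (ヤンデレ) 1
princess (お姫様) 1 1
maid (メイド) 1
flute (フルート) 1
younger stepsister (義妹) 1
part-time job (バイト) 1 1
older female cousin (従姉) 1
alien (宇宙人) 1
person from the future (未来人) 1
psychic powers (超能力) 1
shy (人見知り) 1
sickly (病弱) 1 1
theater (演劇) 1
class representative (委員長) 1
library (図書室) 1 1
starfish (ヒトデ) 1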

Query: [library (図書室) AND cool (クール) AND NOT sickly (病弱)]
Same matrix as above, with the sickly (病弱) row complemented for NOT:
term/doc L+ SD kyrn SH Uta CLN
sickly (病弱) (NOT) 1 1 1 1 0 0



The problem is ...
- the matrix is too large!
- even though almost all of its entries are 0 (it is sparse)!


1.2. Inverted index

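A "perfectly ordinary" index; called 転置インデックス (transposed index) in Japanese.
※ "Index" by itself already implies a reverse lookup, so "inverted" is strictly redundant.
※ Assign an ID to each document in advance.

postings list:
a sequence of postings
posting:
the ID of a document that contains the term (docID)

[Figure: an inverted index. Labels: dictionary (vocabulary, lexicon); term (word, token); postings; postings list; posting]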

How to make an inverted index?

1. Collect the documents to be indexed
Democracy is the worst form of government except all those other forms that
have been tried from time to time.
(by Sir Winston Churchill, 1947)

2. Split the text into tokens
Democracy / is / the / worst / form / of / government / except / all / those /
other / forms / that / have / been / tried / from / time / to / time

3. Normalize the tokens through linguistic processing
democracy / is / the / worst / form / of / government / except / all / those / other /
forms / that / have / been / tried / from / time / to / time

※ For Japanese, step 2 requires morphological analysis or similar.
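
Steps 2 and 3 can be sketched roughly as follows in Python. The regular expression and the choice of case folding as the only normalization are assumptions made for this illustration; real systems may also apply stemming, stop-word handling, and so on.

```python
import re


def tokenize(text):
    """Step 2: split text into tokens (runs of letters, digits, apostrophes)."""
    return re.findall(r"[A-Za-z0-9']+", text)


def normalize(tokens):
    """Step 3: normalize tokens; this sketch only folds case."""
    return [token.lower() for token in tokens]


text = ("Democracy is the worst form of government except all those other "
        "forms that have been tried from time to time.")
tokens = tokenize(text)
terms = normalize(tokens)
print(tokens[:3])  # ['Democracy', 'is', 'the']
print(terms[:3])   # ['democracy', 'is', 'the']
```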


How to make an inverted index?

4. Merge everything into a single index and sort the terms
5. Merge duplicate entries into one and create the postings
※ Implementation: ○ list, ○ variable-length array, ○ associative array, × fixed-length array
※ dictionary in memory, postings on disk
※ The length of a postings list is the number of documents in which the term occurs (its frequency)

index of doc1      index of doc2      4. merged, sorted  5. duplicates merged
democracy ... 1    follow ... 2       all ... 1          all ... 1
is ... 1           also ... 2         also ... 2         also ... 2
the ... 1          the ... 2          and ... 2          and ... 2
worst ... 1        guiding ... 2      democracy ... 1    democracy ... 1
form ... 1         lights ... 2       devotion ... 2     devotion ... 2
of ... 1           of ... 2           except ... 1       except ... 1
government ... 1   love ... 2         follow ... 2       follow ... 2
except ... 1       and ... 2          form ... 1         form ... 1
all ... 1          devotion ... 2     guiding ... 2      guiding ... 2
those ... 1        in ... 2           ...                government ... 1
other ... 1        women ... 2        of ... 1           ...
forms ... 1        these ... 2        of ... 2           of ... 1,2
...                ...                ...                ...
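
The sort-then-merge procedure of steps 4 and 5 might look like this in Python. This is a sketch under stated assumptions: the two document texts are shortened, hypothetical stand-ins for doc1 and doc2, and the index is an in-memory dict mapping each term to a sorted list of docIDs.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """docs maps docID -> text; returns term -> sorted list of docIDs."""
    pairs = []
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9']+", text.lower()):
            pairs.append((term, doc_id))
    # Step 4: one combined list, sorted by term and then by docID.
    pairs.sort()
    # Step 5: collapse duplicate (term, docID) pairs into postings lists.
    index = defaultdict(list)
    for term, doc_id in pairs:
        postings = index[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return dict(index)


# Hypothetical shortened documents (assumed for this example).
docs = {
    1: "Democracy is the worst form of government",
    2: "the guiding lights of love and devotion",
}
index = build_inverted_index(docs)
print(index["of"])         # [1, 2]
print(index["democracy"])  # [1]
```

Because the pairs are sorted before merging, every postings list comes out ordered by docID, which the intersection algorithm relies on.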

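1.3. Query: [library (図書室) AND cool (クール)]

Processing steps:
1. Look up [library] in the dictionary
2. Retrieve the postings list A of [library]
3. Look up [cool] in the dictionary
4. Retrieve the postings list B of [cool]
5. Compute A ∩ B and return it as the result
... How do we compute the intersection?

library ∩ cool ... 1, 6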


Intersection of two postings lists (p.11 fig.1.6)
postings_list intersect(postings_list p1, postings_list p2)
{
    postings_list answer;
    // This is an intersection, so stop as soon as either list is exhausted.
    while (p1.current() != null && p2.current() != null)
    {
        if (p1.current().docID() == p2.current().docID())
        {
            answer.add(p1.current().docID()); // Both lists contain this docID.
            p1.next(); p2.next();             // Advance both to the next posting.
        }
        else if (p1.current().docID() < p2.current().docID())
            p1.next(); // p1's docID is smaller, so advance p1.
        else
            p2.next(); // p2's docID is smaller, so advance p2.
    }
    return answer;
}
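
For readers who want to run the merge, here is a rough Python equivalent of the pseudocode above. It assumes postings lists are plain Python lists of docIDs sorted in increasing order; the sample lists are hypothetical.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by increasing docID."""
    answer = []
    i, j = 0, 0
    # Stop as soon as either list is exhausted.
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1's docID is smaller, advance p1
        else:
            j += 1  # p2's docID is smaller, advance p2
    return answer


# Hypothetical postings lists.
print(intersect([1, 2, 4, 6], [1, 6]))  # [1, 6]
```

Each step advances at least one pointer, so the number of comparisons is at most the sum of the two list lengths.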

effectiveness

The quality of search results.

precision:
What fraction of the returned results are relevant
to the information need?
#(relevant results) / #(results)
recall:
What fraction of the relevant documents in the
collection were returned by the system?
#(relevant results) / #(relevant documents)
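
The two ratios can be computed as in the short Python sketch below. The docIDs and relevance judgments are hypothetical, and returning 0.0 when a denominator is empty is a convention assumed for this example only.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for collections of docIDs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant results
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical case: the system returned docs 1-4, the user finds 1, 3, 5 relevant.
p, r = precision_recall([1, 2, 3, 4], [1, 3, 5])
print(p)  # 0.5
print(r)
```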


Boolean retrieval

Besides AND, OR and NOT can be implemented in the same way.

AND queries tend to raise precision but lower recall.
OR queries tend to lower precision but raise recall.
trade-off: precision ↔ recall
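
OR and AND NOT can be handled with the same kind of linear merge as AND. The Python sketch below is one possible way to write them, assuming postings lists are sorted lists of docIDs; the function names `union` and `and_not` are chosen for this example.

```python
def union(p1, p2):
    """OR: merge two sorted postings lists, keeping each docID once."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            answer.append(p1[i])
            i += 1
        else:
            answer.append(p2[j])
            j += 1
    # One list is exhausted; the rest of the other belongs to the union.
    answer.extend(p1[i:])
    answer.extend(p2[j:])
    return answer


def and_not(p1, p2):
    """p1 AND NOT p2: docIDs of p1 that do not occur in p2."""
    answer = []
    j = 0
    for doc_id in p1:
        while j < len(p2) and p2[j] < doc_id:
            j += 1
        if j == len(p2) or p2[j] != doc_id:
            answer.append(doc_id)
    return answer


# Hypothetical postings lists.
print(union([1, 3], [2, 3, 5]))  # [1, 2, 3, 5]
print(and_not([1, 6], [5, 6]))   # [1]
```

Evaluating NOT only in the form "X AND NOT Y" avoids materializing the complement of a postings list, which could be nearly as large as the whole collection.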


Q. Why is the precision of Boolean retrieval not 1?
A. The result is only the answer to the query; whether it matches the
information need (whether it is relevant) is judged by
the user (the user perceives).


... What about somewhat more advanced Boolean retrieval?
-> proximity operator


extended Boolean model

proximity operator:
specifies how close together the terms must occur in a document
ex.
Notation used by Westlaw (http://www.westlaw.com/)
/s: same sentence
/p: same paragraph
/k: within k words
!: trailing wildcard
[president /s said] -> “president Obama said”, ...
[depend!] -> “dependability”, “dependency”, ...
[twin-tail] -> “twintail”, “twin-tail”, “twin tail”
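
To make the "within k words" and trailing-wildcard ideas concrete, here is a small Python sketch over a tokenized document. It is not Westlaw's implementation: the helper names, the sample sentence, and the interpretation of /k as "token positions differ by at most k" are assumptions for illustration.

```python
def within_k(tokens, term_a, term_b, k):
    """True if term_a and term_b occur at most k token positions apart."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(a - b) <= k for a in pos_a for b in pos_b)


def trailing_wildcard(tokens, prefix):
    """Distinct tokens that start with the given prefix, sorted."""
    return sorted({t for t in tokens if t.startswith(prefix)})


# Hypothetical document text.
tokens = "the president of the company said that dependency matters".split()
print(within_k(tokens, "president", "said", 4))  # True
print(within_k(tokens, "president", "said", 2))  # False
print(trailing_wildcard(tokens, "depend"))       # ['dependency']
```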

Some studies report that free-text (keyword) queries return better results than Boolean queries.
Turtle, Natural language vs. Boolean query evaluation: a comparison
of retrieval performance, 1994


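1.4. Optimizing query

(terms: classmate = 同級生, cool = クール, class representative = 委員長)
Q1: [(classmate AND cool) AND class representative]
Q2: [(classmate AND class representative) AND cool]
Q3: [(class representative AND cool) AND classmate]
... Which query is better?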


Optimizing query

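Consider the space needed to hold the intermediate postings list
(classmate = 同級生, cool = クール, class representative = 委員長):
Q1: classmate AND cool ... 1, 4, 6
Q2: classmate AND class representative ... 6
Q3: class representative AND cool ... 6
-> Therefore Q2 or Q3 is better.

More generally, it is best to process the terms
in order of increasing frequency.

If the frequency is recorded in the inverted index
in advance, it can be looked up without reading
the postings lists.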


Intersection of n postings lists (p.12 fig.1.7)
postings_list intersect_opt(term t1, ..., term tn)
{
    // Sort the terms by increasing frequency (postings list length).
    terms terms = SortByIncreasingOrder(t1, ..., tn);
    // Start from the postings list of the least frequent term and whittle it down.
    postings_list answer = terms.first().postings_list;
    terms = terms.rest();
    // This is an intersection, so also stop once no postings survive.
    while (terms != null && answer != null)
    {
        // Intersect the surviving postings with the next term's postings list.
        answer = intersect(answer, terms.first().postings_list);
        terms = terms.rest(); // Move on to the next term.
    }
    return answer;
}
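
A runnable Python approximation of this procedure is sketched below. It assumes each postings list is a sorted Python list and uses the list length as the term's frequency; the sample lists are hypothetical.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by increasing docID."""
    answer = []
    i, j = 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_many(postings_lists):
    """Intersect several postings lists, shortest (least frequent) first."""
    if not postings_lists:
        return []
    ordered = sorted(postings_lists, key=len)
    answer = list(ordered[0])
    for postings in ordered[1:]:
        if not answer:
            break  # nothing survives, so further work is pointless
        answer = intersect(answer, postings)
    return answer


# Hypothetical postings lists of three terms.
print(intersect_many([[1, 2, 3, 4, 6], [1, 3, 4, 6], [6]]))  # [6]
```

Starting from the shortest list keeps every intermediate result no longer than that list.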


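Optimizing query

- Q4: [(classmate OR princess) AND (class representative OR senior) AND
(maid OR flute)]
(terms: 同級生, お姫様, 委員長, 先輩, メイド, フルート)

For each part of the query, estimate the
size of the intermediate postings list and
process the parts in increasing order of size.

Because each sub-query is an OR, its size can
be estimated (as an upper bound) by the sum of
the frequencies of its terms.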
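
The estimate for Q4 can be sketched in Python as follows. The frequencies are the row counts read off the term-document matrix earlier in these slides, written with English term names; the function name and data layout are assumptions for this example.

```python
def order_or_groups(groups, doc_freq):
    """Order OR-groups by the sum of their terms' frequencies, smallest first."""
    def estimate(group):
        # The union of postings lists is at most as long as their total length.
        return sum(doc_freq[term] for term in group)
    return sorted(groups, key=estimate)


# Frequencies taken from the row counts of the term-document matrix.
doc_freq = {
    "classmate": 5, "princess": 2,
    "class representative": 1, "senior": 1,
    "maid": 1, "flute": 1,
}
groups = [
    ("classmate", "princess"),
    ("class representative", "senior"),
    ("maid", "flute"),
]
for group in order_or_groups(groups, doc_freq):
    print(group, sum(doc_freq[term] for term in group))
```

Under these frequencies, the (classmate OR princess) part has the largest estimate (7) and would be processed last.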
