1. Introduction to Information Retrieval (IIR)
1. Boolean retrieval
Online text: http://nlp.stanford.edu/IR-book/pdf/01bool.pdf
Slide based on: http://nlp.stanford.edu/IR-book/ppt/01intro.pptx
2. Outline
1. 0. Introduction
- What is IR?
1. 1. Grepping
- Why is grepping so bad?
1. 2. Inverted index
- to reduce the amount of memory
1. 3. Processing query
1. 4. Optimization
- to reduce the amount of memory further
3. Introduction
Information Retrieval (IR):
finding material of an unstructured nature that
satisfies an information need from within large
collections.
(The term was coined by Calvin Mooers in 1948/50.)
Note: "search" is sometimes avoided because it is ambiguous; IIR treats the two terms as synonyms.
4. Introduction
Information Retrieval (IR):
finding material (documents) of an unstructured nature (text) that
satisfies an information need from within large
collections (stored on computers).
5. Information need (p. 5)
information need:
the topic about which the user desires to know
more
query:
(the text that) the user conveys to the computer in
an attempt to communicate the information need
Note: an information need and a query are not the same thing.
A document is relevant
if it is one that the user perceives as containing
information of value with respect to their
personal information need.
[Figure: the user types a query that expresses an information need into the computer, perceives the result, and judges whether it is relevant.]
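6. Ad hoc retrieval
ad hoc retrieval:
IR in which the document collection changes little
while the queries change over the short term.
information filtering:
the queries change little while the collection changes over the short term.
Note: "アドホック検索とは? - たつをのChangeLog" (a blog post explaining ad hoc retrieval)
http://chalow.net/2008-01-16-1.html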
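8. 1.1. Why is grepping so bad?
grep:
scans every document from beginning to end
can find the specified word
reasonably practical on recent high-performance PCs
> grep -r "query" *
-> How can a large collection be searched quickly?
-> Can more flexible search conditions be specified?
-> Can the results be ranked?
--> Index in advance!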
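For comparison with the indexed approaches, the linear scan that grep performs can be sketched as follows. This is only an illustration under stated assumptions: the collection `docs` and the function `linear_scan` are hypothetical, and the sketch matches whole lowercase tokens, whereas grep matches substrings.

```python
def linear_scan(docs, word):
    """Return the IDs of documents containing `word`, scanning every document in full."""
    hits = []
    for doc_id, text in docs.items():
        # Every query re-reads every document, so the cost grows with the collection size.
        if word in text.lower().split():
            hits.append(doc_id)
    return hits


# Hypothetical toy collection (docID -> text).
docs = {
    1: "Democracy is the worst form of government",
    2: "A government of the people",
    3: "No index means a full scan",
}
print(linear_scan(docs, "of"))  # [1, 2]
```

Nothing is reused between queries here, which is the cost that indexing in advance avoids.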
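9. Term-document incidence matrix
incidence matrix:
a matrix with a 1 where a term occurs in a document and a 0 where
it does not (rows: terms, columns: documents)
incidence vector:
one column of the matrix (the terms in one document); a row likewise
gives the documents that contain one term
Query processing, e.g. [(A AND B) OR NOT C]:
- AND: bitwise AND of the terms' incidence vectors
- OR: bitwise OR of the terms' incidence vectors
- NOT: complement the term's incidence vector, then combine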
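The query [(A AND B) OR NOT C] can be evaluated with bitwise operations, as in the sketch below. It assumes a hypothetical four-document collection whose terms are the single letters a, b, and c, and it stores each term's incidence vector as a Python integer with one bit per document.

```python
# Hypothetical toy collection: 4 documents with docIDs 0..3.
docs = [
    "a b c",
    "a c",
    "b",
    "a b",
]
n_docs = len(docs)
ALL = (1 << n_docs) - 1  # bit mask with one bit per document


def incidence_vector(term):
    """Vector for `term`: bit d is 1 if document d contains the term."""
    vec = 0
    for d, text in enumerate(docs):
        if term in text.split():
            vec |= 1 << d
    return vec


A, B, C = (incidence_vector(t) for t in "abc")

# (A AND B) OR NOT C; the complement is masked so only real document bits remain.
result = (A & B) | (~C & ALL)
matching = [d for d in range(n_docs) if result >> d & 1]
print(matching)  # [0, 2, 3]
```

Masking with `ALL` matters because Python integers are unbounded, so a bare `~C` would set bits for documents that do not exist.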
12. Term-document incidence matrix
The problem is ...
- the matrix is too large!
- even though it is sparse (almost every entry is 0)!
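A rough calculation shows the scale of the problem. The figures below are illustrative assumptions, not numbers from these slides: one million documents of about 1,000 tokens each and 500,000 distinct terms, the same order of magnitude as the example in the IIR textbook.

```python
# Assumed, illustrative collection sizes (not taken from the slides).
n_docs = 1_000_000
tokens_per_doc = 1_000
n_terms = 500_000

cells = n_terms * n_docs              # one 0/1 entry per (term, document) pair
max_ones = n_docs * tokens_per_doc    # each token can set at most one entry to 1
zero_fraction = 1 - max_ones / cells  # lower bound on the share of zeros
matrix_gb = cells / 8 / 1e9           # size if stored at one bit per cell

print(f"{cells:,} cells, at most {max_ones:,} ones")
print(f"at least {zero_fraction:.1%} zeros, about {matrix_gb} GB as a bit matrix")
```

Under these assumptions at least 99.8% of the matrix is 0, which motivates storing only the 1s.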
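14. 1.2. Inverted index
The "perfectly ordinary" index (転置インデックス in Japanese).
Note: an index is already a reverse lookup, so "inverted" is strictly redundant.
Note: assign an ID to each document in advance.
postings list:
a sequence of postings
posting:
the ID (docID) of a document that contains the term
[Figure: an inverted index, with the labels dictionary (vocabulary, lexicon), term (word, token), postings, list, and posting.]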
15. How to make an inverted index?
1. Collect the documents to be indexed
Democracy is the worst form of government except all those other forms that
have been tried from time to time.
(by Sir Winston Churchill, 1947)
2. Split the text into tokens
Democracy / is / the / worst / form / of / government / except / all / those /
other / forms / that / have / been / tried / from / time / to / time
3. Normalize the tokens with linguistic processing
democracy / is / the / worst / form / of / government / except / all / those / other /
forms / that / have / been / tried / from / time / to / time
Note: for Japanese, step 2 requires morphological analysis or a similar technique.
16. How to make an inverted index?
4. Merge everything into a single index and sort the terms
5. Merge duplicate entries into one and build the postings
Note: implementation: linked lists, variable-length arrays, and associative arrays are suitable; fixed-length arrays are not
Note: dictionary in memory, postings on disk
Note: the length of a postings list is the number of documents in which the term occurs (its document frequency)
index of doc1        index of doc2        step 4: merged, sorted   step 5: duplicates merged
democracy ... 1      follow ... 2         all ... 1                all ... 1
is ... 1             also ... 2           also ... 2               also ... 2
the ... 1            the ... 2            and ... 2                and ... 2
worst ... 1          guiding ... 2        democracy ... 1          democracy ... 1
form ... 1           lights ... 2         devotion ... 2           devotion ... 2
of ... 1             of ... 2             except ... 1             except ... 1
government ... 1     love ... 2           follow ... 2             follow ... 2
except ... 1         and ... 2            form ... 1               form ... 1
all ... 1            devotion ... 2       guiding ... 2            guiding ... 2
those ... 1          in ... 2             ...                      government ... 1
other ... 1          women ... 2          of ... 1                 ...
forms ... 1          these ... 2          of ... 2                 of ... 1,2
...                  ...                  ...                      ...
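The five construction steps can be sketched in a few lines. This is a minimal illustration, not a production indexer: tokenization is plain whitespace splitting, normalization is lowercasing plus punctuation stripping, and the text of document 2 is a made-up placeholder because the slides show only fragments of it. The name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Build {term: sorted list of docIDs} from {docID: text}."""
    # Steps 2-3: tokenize on whitespace, then normalize each token.
    pairs = []
    for doc_id, text in docs.items():
        for token in text.split():
            term = token.lower().strip(".,;:!?\"'")
            if term:
                pairs.append((term, doc_id))

    # Step 4: sort the (term, docID) pairs by term, then by docID.
    pairs.sort()

    # Step 5: merge duplicates so each docID appears once per term.
    index = defaultdict(list)
    for term, doc_id in pairs:
        postings = index[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)
    return dict(index)


docs = {
    1: "Democracy is the worst form of government except all those other forms "
       "that have been tried from time to time.",
    2: "Hypothetical second document: a government of the people.",  # placeholder text
}
index = build_inverted_index(docs)
print(index["government"])  # [1, 2]
print(index["time"])        # [1]
```

An associative array of variable-length lists is used here because the number of postings per term is not known in advance.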
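18. 1.3. Query: [library AND cool]
Procedure:
1. Look up [library] in the dictionary
2. Retrieve the postings list A for [library]
3. Look up [cool] in the dictionary
4. Retrieve the postings list B for [cool]
5. Compute A ∩ B and return it as the result
... How do we compute the intersection?
library ∩ cool ... 1, 6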
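One common way to compute the intersection is to walk both postings lists in parallel, which works because both are sorted by docID. The sketch below shows that merge; the two postings lists are hypothetical and were chosen only so that the result matches the 1, 6 shown above.

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2)) steps."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])  # docID occurs in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # advance whichever list has the smaller docID
        else:
            j += 1
    return answer


# Hypothetical postings lists for the two query terms.
postings_a = [1, 3, 6, 9]
postings_b = [1, 2, 6, 7, 8]
print(intersect(postings_a, postings_b))  # [1, 6]
```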
20. Effectiveness
The quality of search results.
precision:
What fraction of the returned results are relevant
to the information need?
#(relevant results returned) / #(results returned)
recall:
What fraction of the relevant documents in the
collection were returned by the system?
#(relevant results returned) / #(relevant documents in the collection)
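The two ratios can be computed directly from the set of returned docIDs and the set of relevant docIDs. The sketch below uses hypothetical docIDs, and the helper name `precision_recall` is an assumption for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a result set and a set of relevant docIDs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)  # relevant documents that were actually returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents returned, 3 relevant in the collection, 2 in common.
p, r = precision_recall(returned=[1, 2, 6, 7], relevant=[1, 6, 9])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```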
21. Boolean retrieval
OR and NOT can be implemented in the same way as AND.
AND queries tend to raise precision but lower recall.
OR queries tend to lower precision but raise recall.
trade-off: precision ↔ recall
Q. Why is the precision of Boolean retrieval not 1?
A. The result is only the answer to the query; whether it matches the
information need (whether it is relevant) is judged by
the user (the user perceives).
... What about slightly more advanced Boolean retrieval?
-> proximity operator
22. Extended Boolean model
proximity operator:
specifies how close together the terms must occur in a document
e.g.
notation used by Westlaw (http://www.westlaw.com/)
/s: same sentence
/p: same paragraph
/k: within k words
!: trailing wildcard
[president /s said] -> “president Obama said”, ...
[depend!] -> “dependability”, “dependency”, ...
[twin-tail] -> “twintail”, “twin-tail”, “twin tail”
Some studies report that free-text keyword queries return better results than Boolean queries.
Turtle, Natural language vs. Boolean query evaluation: a comparison
of retrieval performance, 1994
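A "within k words" check such as /k needs the positions of each term inside a document, not just the docID. The sketch below is one possible way to test it for a single document, assuming whitespace tokenization; the helpers `positions` and `within_k` are hypothetical and do not reflect how Westlaw implements the operator.

```python
def positions(tokens, term):
    """Sorted list of the positions at which `term` occurs in `tokens`."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def within_k(pos_a, pos_b, k):
    """True if some occurrence of A lies within k words of some occurrence of B."""
    i = j = 0
    while i < len(pos_a) and j < len(pos_b):
        if abs(pos_a[i] - pos_b[j]) <= k:
            return True
        # Advance the pointer at the smaller position to look for a closer pair.
        if pos_a[i] < pos_b[j]:
            i += 1
        else:
            j += 1
    return False


tokens = "the president of the country said hello".split()
pres, said = positions(tokens, "president"), positions(tokens, "said")
print(within_k(pres, said, 3))  # False: the two words are 4 positions apart
print(within_k(pres, said, 4))  # True
```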
24. 1.4. Optimizing queries
Q1: [(classmate AND cool) AND chairperson]
Q2: [(classmate AND chairperson) AND cool]
Q3: [(chairperson AND cool) AND classmate]
... Which order is better?
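A common heuristic, described in the IIR textbook, is to process the terms in order of increasing postings list length so that intermediate results stay small. The sketch below applies it to three terms with hypothetical postings lists; `intersect_terms` is an illustrative name, not an established API.

```python
def intersect(p1, p2):
    """Merge-intersect two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_terms(index, terms):
    """AND all `terms` together, starting from the shortest postings list."""
    ordered = sorted(terms, key=lambda t: len(index[t]))
    result = index[ordered[0]]
    for term in ordered[1:]:
        if not result:  # nothing left to match, stop early
            break
        result = intersect(result, index[term])
    return result, ordered


# Hypothetical postings lists for the three query terms.
index = {
    "classmate": [1, 2, 3, 4, 5, 6, 7, 8],
    "cool": [1, 4, 6],
    "chairperson": [2, 4, 6, 8, 9],
}
result, order = intersect_terms(index, ["classmate", "cool", "chairperson"])
print(order)   # ['cool', 'chairperson', 'classmate']
print(result)  # [4, 6]
```

With these made-up lists the rarest term is processed first, so the longest list is only ever intersected with a two-element intermediate result.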
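27. Optimizing queries
Q4: [(classmate OR princess) AND (chairperson OR senior) AND
(maid OR flute)]
For each part of the query, estimate the size of
the intermediate postings list and process the
parts in increasing order of estimated size.
Each part is an OR, so its size can be estimated
as the sum of the frequencies of its terms.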
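The estimate can be sketched as follows. The document frequencies in `df` are invented for illustration, and the sum is only an upper bound on the size of each OR result, since a document containing both terms is counted twice.

```python
# Hypothetical document frequencies for the six query terms.
df = {
    "classmate": 800, "princess": 50,
    "chairperson": 300, "senior": 400,
    "maid": 120, "flute": 30,
}

# The query as a conjunction (AND) of OR groups.
query = [("classmate", "princess"), ("chairperson", "senior"), ("maid", "flute")]


def estimate(or_group):
    """Upper bound on the size of an OR group: the sum of its terms' frequencies."""
    return sum(df[t] for t in or_group)


# Process the groups with the smallest estimated result first.
order = sorted(query, key=estimate)
for group in order:
    print(group, estimate(group))
```

With these numbers the (maid OR flute) group would be evaluated first, then (chairperson OR senior), then (classmate OR princess).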