Evaluation via Negativa of Chinese Word Segmentation for Information Retrieval @ PACLIC 2011

EVALUATION
via Negativa
Mike Tian-Jian Jiang, Chen-Wei Shih, Chan-Hung Kuo,
Richard Tzong-Han Tsai, and Wen-Lian Hsu
National Tsing Hua University
Academia Sinica
Taiwan
中
文
詞分
INFORMATION
RETRIEVAL

Fundamental Unit?
a meta-communication

What is a Word?
to linguistics

“... the smallest free form that may be
uttered in isolation with semantic or
pragmatic content (with literal or
practical meaning) ...”
http://en.wikipedia.org/wiki/Word

“... the task of defining what
constitutes a ‘word’ involves
determining where one word ends
and another word begins...”
http://en.wikipedia.org/wiki/Word#Word_boundaries

Word Boundary?
• Phonology
• Morphology
• Orthography
• Compound? Multi-word expression?
• Multi-word vs. multiword vs. multi word
• CJKV?
• Multi-character expression?

What is a Word?
to computational linguistics

Standard de jure?
• Academia Sinica Balanced Corpus
• Chinese Treebank of University of
Pennsylvania
• City University of Hong Kong
• Microsoft Research Asia
• Peking University

... then match
standards
the higher the accuracy, the better the communication?

What is a Word?
to computational linguistics applications

e.g. Information
Retrieval

Standard de facto?
• Word n-gram
• Character n-gram
• Hybrid
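
The three indexing units above can be made concrete with a small sketch. The function names are illustrative, and the hybrid shown (word unigrams plus character bigrams) is only one assumed way of combining the two, since the slide does not say which hybrid is meant.

```python
def char_ngrams(text, n=2):
    """Overlapping character n-grams; no segmenter is needed."""
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(words, n=1):
    """N-grams over segmenter output, joined back into strings."""
    return ["".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def hybrid_terms(words, n=2):
    """Assumed hybrid: word unigrams followed by character n-grams."""
    return list(words) + char_ngrams("".join(words), n)


# One possible segmentation of 西土耳其 (Western Turkey), for illustration.
segmented = ["西", "土耳其"]
index_terms = hybrid_terms(segmented)
```

With character bigrams the index never depends on the segmenter, which is why the choice of unit matters for the WS-to-IR question.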

Monotonic or not?
better WS results yield better IR outcomes?

Is it finite?
How to evaluate WS-to-application influence?

Via Negativa
“It describes God by saying what he is not, rather than what he is, because as
finite beings we can not recognize God's attributes in any real and full sense
and because God is beyond what our language can positively describe.”
http://www.blackwellreference.com/public/tocnode?id=g9781405106795_chunk_g978140510679515_ss1-58
http://www.blackmetal.com/scans0710/teratism-via-negativa.jpg

Binary Classification?
clinical trial?

Something about
Evaluation

IR Evaluation
• Data
• TREC, NTCIR, etc.
• Metrics
• P@k, MRR, MAP, etc.
• Doubts
• Pooling bias
• Score standardization
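
For reference, the metrics named above can be written in a few lines. This is a minimal sketch with binary relevance judgments; the document IDs are invented for illustration, and the evaluation tools used with TREC or NTCIR handle ties, pooling, and unjudged documents that this sketch ignores.

```python
def precision_at_k(ranked, relevant, k):
    """P@k: fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values taken at each relevant result."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_over_queries(metric, runs):
    """MAP or MRR: average a per-query metric over (ranked, relevant) pairs."""
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Hypothetical single-query run.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
```

Calling `mean_over_queries(average_precision, runs)` gives MAP, and the same helper with `reciprocal_rank` gives MRR.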

CWS Evaluation
• Recall and precision counted by
• Boundary
• Token
• Constituent
• Similarity?
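
The difference between counting by boundary and counting by token shows up even on a three-character string. The sketch below is illustrative only: the helper names are hypothetical, and treating 上海 / 滩 as the gold segmentation is an assumption made for the example, not a claim about any standard.

```python
def token_spans(words):
    """Represent each token as a (start, end) character span."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def internal_boundaries(words):
    """Character offsets at which one token ends and the next begins."""
    cuts, pos = set(), 0
    for word in words[:-1]:
        pos += len(word)
        cuts.add(pos)
    return cuts


def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 over two sets of units."""
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


gold = ["上海", "滩"]      # assumed gold segmentation
pred = ["上", "海", "滩"]  # over-segmented system output

by_token = precision_recall_f1(token_spans(gold), token_spans(pred))
by_boundary = precision_recall_f1(internal_boundaries(gold),
                                  internal_boundaries(pred))
```

Here the token-level F1 is 0.4 while the boundary-level F1 is about 0.67, so the same output looks noticeably better or worse depending on the unit counted.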

WS-to-IR
• Peng et al. (2002)
• WS: 44-70%, IR: ↗
• WS: 70-77%, IR: ⤴
• WS: 85-95%, IR: ⤵
• He et al. (2002)
• WS: ↗(91-94%), IR: ⤴

Why Inconclusive?
• WS accuracy ranges?
• WS/IR evaluation metrics?
• Query length?
• Term types?

Term Type
• Kwok (2002)
• Insensitive: stop-words; frequent non-content-bearing
• Monotonic: content-bearing
• Non-monotonic:
• 西土耳其 (Western Turkey)
• Semantic, syntax, or surface?
• 农 (agriculture) / 作物 (crops)
• 旱 (drought) / 灾 (disaster) vs. 春旱 (Spring drought) vs. 旱区 (drought-stricken area)
• Recall or precision?
• 火 (fire) / 山 (mountain) vs. 火山 (volcano)

Surface Pattern
• Ambiguity
• Combinatorial
• 西土耳其、农作物、旱灾、春旱、旱区、火山, etc.
• Overlapping
• 施政 (practice policy) / 伟 (great) vs. 施 (Shih) / 政伟 (Zheng-Wei)
• Which is more harmful?
http://www.definicionabc.com/general/gestalt-psicologia.php
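
The two ambiguity types can be told apart mechanically once a lexicon is fixed. The sketch below uses the usual textbook definitions (combinatorial: a word that also splits into two words; overlapping: a string in which a word-initial piece and a word-final piece share characters). The tiny lexicon is hypothetical and built only from the examples on these slides.

```python
def is_combinatorial(s, lexicon):
    """True if s is a word and also splits into two lexicon words."""
    return s in lexicon and any(
        s[:i] in lexicon and s[i:] in lexicon for i in range(1, len(s))
    )


def is_overlapping(s, lexicon):
    """True if a prefix word s[:j] and a suffix word s[i:] overlap (i < j)."""
    return any(
        s[:j] in lexicon and s[i:] in lexicon
        for i in range(1, len(s))
        for j in range(i + 1, len(s))
    )


# Hypothetical lexicon for illustration only.
lexicon = {"火", "山", "火山", "施", "伟", "施政", "政伟"}

combinatorial_case = is_combinatorial("火山", lexicon)
overlapping_case = is_overlapping("施政伟", lexicon)
```

Both calls return `True` for their respective examples, while `is_overlapping("火山", lexicon)` is `False` because a two-character string leaves no room for a shared middle character.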

Is it finite?
How to evaluate WS-to-IR influence?

IR Is Rallying
• Indexing models
• Retrieval models
• Data collections
• Evaluation metrics

Tractable Simulation?
http://imgs.xkcd.com/store/glen_shirts/g_try_science_shirt_2.jpg

Balanced
NTCIR (long) and Sogou (short) query collections

Pragmatic WS
accuracy-controlled systems on different standards
CRF trained on 1, 1/2, 1/4, ..., 1/16384 of the Bakeoff 2005 data
http://scifun.files.wordpress.com/2010/07/1278929569066.jpg
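
The halving schedule runs from the full data down to 1/16384 = 1/2^14, which gives 15 training sizes per standard. The sketch below shows one way such subsets could be drawn. The seeded shuffle and the nested-prefix design are assumptions for illustration; the slides do not say how the fractions were sampled, and `nested_subsets` is a hypothetical helper.

```python
import random


def nested_subsets(sentences, steps=14, seed=0):
    """Return {k: subset}, where subset k holds about 1/2**k of the data.

    Each smaller subset is a prefix of the next larger one, so the
    training sets are nested (an assumed design, not the authors').
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    subsets = {}
    for k in range(steps + 1):
        size = max(1, len(shuffled) >> k)  # floor(len / 2**k), at least 1
        subsets[k] = shuffled[:size]
    return subsets


# Placeholder IDs stand in for training sentences.
subsets = nested_subsets(range(32768))
sizes = [len(subsets[k]) for k in sorted(subsets)]
```

Training one CRF segmenter per subset then yields a family of systems whose accuracy is controlled by the amount of data rather than by the model.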

Popularity
similarity (MAP) to a black box’s preference (top-100)

Correlation≠Causation
TNR and NPV may imply something
http://imgs.xkcd.com/store/imgs/correlation_shirt_300.png
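
In the binary-classification (clinical trial) view, TNR and NPV come straight from the confusion matrix. The sketch below assumes a hypothetical mapping: the "test" is whether a query contains a flagged segmentation pattern, and the "condition" is whether retrieval for that query actually degraded. The boolean vectors are made-up toy values, not results from the study.

```python
def confusion_counts(test_positive, condition_positive):
    """Count TP, FP, TN, FN from two parallel boolean sequences."""
    tp = fp = tn = fn = 0
    for test, cond in zip(test_positive, condition_positive):
        if test and cond:
            tp += 1
        elif test and not cond:
            fp += 1
        elif not test and not cond:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def tnr_npv(tp, fp, tn, fn):
    """True negative rate (specificity) and negative predictive value."""
    tnr = tn / (tn + fp) if tn + fp else 0.0
    npv = tn / (tn + fn) if tn + fn else 0.0
    return tnr, npv


# Toy data: pattern flagged? / retrieval degraded? (illustrative only)
flagged = [True, True, False, False, False, True]
degraded = [True, False, False, False, True, True]

counts = confusion_counts(flagged, degraded)
tnr, npv = tnr_npv(*counts)
```

TNR answers "of the queries that were fine, how many did the pattern leave unflagged?", while NPV answers "of the unflagged queries, how many were fine?", which is the negative-side reading the title alludes to.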

Discussion
• 上海滩 (the Bund of Shanghai)
• MSR: 上海滩，上海 / 滩，上 / 海 / 滩
• PKU: 上海滩，上海 / 滩，上 / 海滩
• May be caused by...
• Standard differences?
• Lexicon disappearances?
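
Which segmentations of 上海滩 are reachable at all depends on the lexicon, which is one way to read "lexicon disappearances". The sketch below enumerates every segmentation licensed by a lexicon; both lexicons are hypothetical and chosen only to show how removing one entry removes a candidate.

```python
def segmentations(text, lexicon):
    """Enumerate all ways to split text into words found in lexicon."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for tail in segmentations(text[end:], lexicon):
                results.append([head] + tail)
    return results


# Hypothetical lexicons for illustration only.
full_lexicon = {"上", "海", "滩", "上海", "海滩", "上海滩"}
without_haitan = full_lexicon - {"海滩"}

all_candidates = segmentations("上海滩", full_lexicon)
fewer_candidates = segmentations("上海滩", without_haitan)
```

With the full lexicon there are four candidates, including 上 / 海滩; once 海滩 is absent that candidate disappears and only three remain.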

Concerns
• Accuracy-controlled WS systems other than CRF?
• The same training data, different standards?
• Conventional/comparative IR experiments?
• Lucene? Lemur/Indri?
• TREC and NTCIR?
• Silver standards?
• Relaxation of negative patterns?
• Graphical or n-best list output of WS?
• Oracle precision, recall, TNR, NPV, etc.?
• Applications other than IR?
• Out-of-vocabulary?
