1. Linguistic techniques for Text
Mining
NaCTeM team
www.nactem.ac.uk
Sophia Ananiadou
Chikashi Nobata
Yutaka Sasaki
Yoshimasa Tsuruoka
2. Natural Language Processing
[Figure: the NLP pipeline turns raw (unstructured) text into deeply annotated (structured) text via part-of-speech tagging, named entity recognition, and syntactic parsing, drawing on a lexicon and an ontology. Example: "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells." is shown with POS tags (NN IN NN VBZ VBN IN NN IN JJ NN NNS .), phrase structure (NP, VP, PP under S), entity annotations (TNF: protein_molecule; BHA: organic_compound; U937 cells: cell_line), and the extracted relation "negative regulation".]
3. Basic Steps of Natural Language
Processing
• Sentence splitting
• Tokenization
• Part-of-speech tagging
• Shallow parsing
• Named entity recognition
• Syntactic parsing
• (Semantic Role Labeling)
4. Sentence splitting
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1
T-cell pro-inflammatory cytokine production is important in host defense
against bacterial infection in the lungs. Excessive immunosuppression of Th1
T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
However, Th1 T-cell pro-inflammatory cytokine production is important in host
defense against bacterial infection in the lungs.
Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines
leaves patients susceptible to infection.
5. A heuristic rule for sentence splitting
sentence boundary
= period + space(s) + capital letter
Regular expression in Perl
s/\. +([A-Z])/.\n$1/g;
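The same heuristic can be sketched in Python; this minimal version mirrors the Perl substitution (period + spaces + capital letter) and inherits the same failure modes shown on the next slide.

```python
import re

def split_sentences(text):
    """Split on the heuristic boundary: period + space(s) + capital letter.
    A minimal sketch of the rule on this slide; like the Perl one-liner,
    it mis-splits after abbreviations such as "e.g."."""
    # Insert a newline after a period followed by spaces and a capital
    # letter, keeping the capital letter (captured as group 1).
    marked = re.sub(r'\. +([A-Z])', r'.\n\1', text)
    return marked.split('\n')
```
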
6. Errors
IL-33 is known to induce the production of Th2-associated
cytokines (e.g. IL-5 and IL-13).
IL-33 is known to induce the production of Th2-associated
cytokines (e.g.
IL-5 and IL-13).
• Two solutions:
– Add more rules to handle exceptions
– Machine learning
7. Tools for sentence splitting
• JASMINE
– Rule-based
– http://uvdb3.hgc.jp/ALICE/program_download.html
• Scott Piao's splitter
– Rule-based
– http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector
• OpenNLP
– Maximum-entropy learning
– https://sourceforge.net/projects/opennlp/
– Needs training data
8. Tokenization
The protein is activated by IL2.
The protein is activated by IL2 .
• Convert a sentence into a sequence of tokens
• Why do we tokenize?
• Because we do not want to treat a sentence as a
sequence of characters!
9. Tokenization
The protein is activated by IL2.
The protein is activated by IL2 .
• Tokenizing general English sentences is
relatively straightforward.
• Use spaces as the boundaries
• Use some heuristics to handle exceptions
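The space-based strategy with a few heuristics can be sketched as follows; this is an illustrative toy tokenizer, not a full Treebank tokenizer.

```python
import re

def tokenize(sentence):
    """Whitespace tokenization plus heuristics that split off
    sentence-final punctuation and possessive endings.
    A minimal sketch of the rules described on these slides."""
    # Separate punctuation from the preceding word when it ends a token.
    sentence = re.sub(r'([.,;:!?])(\s|$)', r' \1\2', sentence)
    # Separate possessive 's (e.g. "Mary's" -> "Mary 's").
    sentence = re.sub(r"(\w)('s)\b", r'\1 \2', sentence)
    return sentence.split()
```
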
10. Tokenisation issues
• separate possessive endings or abbreviated forms from
preceding words:
– Mary's → Mary 's
– Mary's → Mary is
– Mary's → Mary has
• separate punctuation marks and quotes from words:
– Mary. → Mary .
– "new" → " new "
11. Tokenization
• Tokenizer.sed: a simple script in sed
• http://www.cis.upenn.edu/~treebank/tokenization.html
• Undesirable tokenization
– original: “1,25(OH)2D3”
– tokenized: “1 , 25 ( OH ) 2D3”
• Tokenization for biomedical text
– Not straight-forward
– Needs dictionary? Machine learning?
12. Tokenisation problems in Bio-text
• Commas
– 2,6-diaminohexanoic acid
– tricyclo(3.3.1.13,7)decanone
• Four kinds of hyphens
– “Syntactic:”
– Calcium-dependent
– Hsp-60
– Knocked-out gene: lush-- flies
– Negation: -fever
– Electric charge: Cl-
K. Cohen NAACL-2007
13. Tokenisation
• Tokenization: Divides the text into smallest
units (usually words), removing punctuation.
Challenge: What should be done with
punctuation that has linguistic meaning?
• Negative charge (Cl-)
• Absence of symptom (-fever)
• Knocked-out gene (Ski-/-)
• Gene name (IL-2 –mediated)
• Plus, "syntactic" uses (insulin-dependent)
K. Cohen NAACL-2007
14. Part-of-speech tagging
The peri-kappa B site mediates human immunodeficiency
DT NN NN NN VBZ JJ NN
virus type 2 enhancer activation in monocytes …
NN NN CD NN NN IN NNS
• Assign a part-of-speech tag to each token in a
sentence.
15. Part-of-speech tags
• The Penn Treebank tagset
– http://www.cis.upenn.edu/~treebank/
– 45 tags
NN Noun, singular or mass JJ Adjective
NNS Noun, plural JJR Adjective, comparative
NNP Proper noun, singular JJS Adjective, superlative
NNPS Proper noun, plural : :
: : DT Determiner
VB Verb, base form CD Cardinal number
VBD Verb, past tense CC Coordinating conjunction
VBG Verb, gerund or present participle IN Preposition or subordinating
VBN Verb, past participle conjunction
VBZ Verb, 3rd person singular present FW Foreign word
: : : :
16. Part-of-speech tagging is not easy
• Parts-of-speech are often ambiguous
I have to go to school.
verb
I had a go at skiing.
noun
• We need to look at the context
• But how?
17. Writing rules for part-of-speech tagging
I have to go to school. I had a go at skiing.
verb noun
• If the previous word is "to", then it's a verb.
• If the previous word is "a", then it's a noun.
• If the next word is …
:
Writing rules manually is impossible
18. Learning from examples
The involvement of ion channels in B and T lymphocyte activation is
DT NN IN NN NNS IN NN CC NN NN NN VBZ
supported by many reports of changes in ion fluxes and membrane
VBN IN JJ NNS IN NNS IN NN NNS CC NN
…………………………………………………………………………………….
…………………………………………………………………………………….
[Figure: the annotated sentences above are used as training data for a machine learning algorithm; on unseen text, the learned tagger predicts tags, e.g. "We demonstrate that …" → PRP VBP IN ….]
19. Part-of-speech tagging with Hidden Markov Models
P(t1…tn | w1…wn) = P(w1…wn | t1…tn) P(t1…tn) / P(w1…wn)
(t1…tn: tags; w1…wn: words)
P(w1…wn | t1…tn) P(t1…tn) ≈ Π(i=1…n) P(wi | ti) P(ti | ti-1)
P(wi | ti): output probability; P(ti | ti-1): transition probability
20. First-order Hidden Markov Models
• Training
– Estimate P(word_j | tag_x) and P(tag_y | tag_z)
– Counting (+ smoothing)
• Using the tagger
– Find argmax(t1…tn) Π(i=1…n) P(wi | ti) P(ti | ti-1)
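The argmax over tag sequences is typically computed with Viterbi dynamic programming; a minimal sketch, where the probability-table names (`emit`, `trans`, `start`) are illustrative:

```python
def viterbi(words, tags, emit, trans, start):
    """Viterbi decoding for a first-order HMM tagger: returns the tag
    sequence maximizing prod_i P(w_i|t_i) P(t_i|t_{i-1}).
    emit[t][w], trans[prev][t], start[t] hold probabilities estimated
    by counting; 1e-8 stands in for a smoothed unseen-word probability."""
    # best[t] = (score of the best path ending in tag t, that path)
    best = {t: (start[t] * emit[t].get(words[0], 1e-8), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # Pick the best previous tag to transition from.
            p, path = max(
                ((best[prev][0] * trans[prev][t], best[prev][1]) for prev in tags),
                key=lambda x: x[0])
            new[t] = (p * emit[t].get(w, 1e-8), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]
```
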
21. Machine learning using diverse features
• We want to use diverse types of
information when predicting the tag.
He opened it
Verb
The word is “opened”
The suffix is “ed”
many clues The previous word is “He”
:
22. Machine learning with log-linear models
p(y|x) = (1/Z(x)) exp( Σ_i λ_i f_i(x, y) )
Z(x) = Σ_y exp( Σ_i λ_i f_i(x, y) )
f_i: feature function; λ_i: feature weight
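As a concrete illustration of the formula above, a tiny implementation with binary features; the helper names (`features`, `weights`) are illustrative, not from the slides:

```python
import math

def loglinear_prob(y, x, classes, features, weights):
    """p(y|x) = exp(sum_i w_i f_i(x, y)) / Z(x), as on this slide.
    `features(x, y)` returns the names of the binary feature functions
    that fire; unlisted features have weight 0."""
    def score(c):
        return math.exp(sum(weights.get(f, 0.0) for f in features(x, c)))
    z = sum(score(c) for c in classes)  # Z(x): normalize over all classes
    return score(y) / z
```
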
23. Machine learning with log-linear models
• Maximum likelihood estimation
– Find the parameters that maximize the conditional log-likelihood of the training data
LL(λ) = Σ_{(x,y)} log p(y|x), summed over training pairs drawn from the empirical distribution p̃(x, y)
• Gradient
∂LL(λ)/∂λ_i = E_p̃[f_i] − E_p[f_i]
24. Computing likelihood and model expectation
• Example
– Two possible tags: "Noun" and "Verb"
– Two types of features: "word" and "suffix"
He opened it
Noun Verb Noun
[Figure: for the word "opened", the scores exp(λ(tag=verb, word=opened) + λ(tag=verb, suffix=ed)) and exp(λ(tag=noun, word=opened) + λ(tag=noun, suffix=ed)) are normalized to give p(tag = verb) and p(tag = noun).]
25. Conditional Random Fields (CRFs)
• A single log-linear model on the whole sentence
P(y1…yn | x) = (1/Z(x)) exp( Σ(t=1…n) Σ(i=1…F) λ_i f_i(y_{t-1}, y_t, x, t) )
• The number of classes (possible tag sequences) is HUGE, so it is impossible to do the estimation in a naive way.
26. Conditional Random Fields (CRFs)
• Solution
– Let‟s restrict the types of features
– You can then use a dynamic programming
algorithm that drastically reduces the amount of
computation
• Features you can use (in first-order CRFs)
– Features defined on the tag
– Features defined on the adjacent pair of tags
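The dynamic programming mentioned above can be illustrated with the forward algorithm for the partition function Z of a first-order chain; a sketch, with an illustrative `score` callback standing in for the weighted feature sums on states and edges:

```python
import math

def crf_log_partition(n, tags, score):
    """log Z for a linear-chain CRF via the forward algorithm.
    `score(prev_tag, tag, t)` returns the sum of weighted features on
    position t and the edge (prev_tag, tag) in log space; prev_tag is
    None at t = 0. Cost is O(n * |tags|^2) instead of |tags|^n."""
    # alpha[t] = log-sum-exp of scores of all prefixes ending in tag t
    alpha = {t: score(None, t, 0) for t in tags}
    for pos in range(1, n):
        alpha = {
            t: math.log(sum(math.exp(alpha[p] + score(p, t, pos)) for p in tags))
            for t in tags}
    return math.log(sum(math.exp(a) for a in alpha.values()))
```

With all scores zero, every one of the |tags|^n sequences gets weight 1, so log Z = n · log |tags|, which the dynamic program reproduces without enumerating sequences.
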
27. Features
• Feature weights are associated with states and edges
[Figure: tag lattice for "He has opened it" with Noun/Verb states at each position; a state feature such as (W0=He & Tag=Noun) and an edge feature such as (Tagleft=Noun & Tagright=Noun) each carry a weight.]
32. Maximum entropy learning and
Conditional Random Fields
• Maximum entropy learning
– Log-linear modeling + MLE
– Parameter estimation
• Likelihood of each sample
• Model expectation of each feature
• Conditional Random Fields
– Log-linear modeling on the whole sentence
– Features are defined on states and edges
– Dynamic programming
33. POS tagging algorithms
• Performance on the Wall Street Journal corpus
                               Training cost / Speed / Accuracy
Dependency Net (2003)          Low / Low / 97.2
Conditional Random Fields      High / High / 97.1
Support vector machines (2003) 97.1
Bidirectional MEMM (2005)      Low / 97.1
Brill's tagger (1995)          Low / 96.6
HMM (2000)                     Very low / High / 96.7
35. Tagging errors made by
a WSJ-trained POS tagger
… and membrane potential after mitogen binding.
CC NN NN IN NN JJ
… two factors, which bind to the same kappa B enhancers…
CD NNS WDT NN TO DT JJ NN NN NNS
… by analysing the Ag amino acid sequence.
IN VBG DT VBG JJ NN NN
… to contain more T-cell determinants than …
TO VB RBR JJ NNS IN
Stimulation of interferon beta gene transcription in vitro by
NN IN JJ JJ NN NN IN NN IN
36. Taggers for general text do not work well on biomedical text
Performance of the Brill tagger evaluated on 1,000 randomly selected MEDLINE sentences: 86.8% (Smith et al., 2004)
Accuracies of a WSJ-trained POS tagger evaluated on the GENIA corpus (Tsuruoka et al., 2005):
                     Accuracy
Exact                84.4%
NNP = NN, NNPS = NNS 90.0%
LS = NN              91.3%
JJ = NN              94.9%
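The relaxed accuracies in the table above come from mapping tags to a coarser form before comparison; a minimal sketch (the function and mapping names are illustrative):

```python
def accuracy(gold, pred, merge=None):
    """Token-level tagging accuracy; `merge` optionally maps tags to a
    canonical form before comparison (e.g. {"NNP": "NN", "NNPS": "NNS"}),
    mirroring the relaxed scores in the table above."""
    merge = merge or {}
    norm = lambda t: merge.get(t, t)
    hits = sum(norm(g) == norm(p) for g, p in zip(gold, pred))
    return hits / len(gold)
```
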
37. MedPost
(Smith et al., 2004)
• Hidden Markov Models (HMMs)
• Training data
– 5700 sentences randomly selected from various
thematic subsets.
• Accuracy
– 97.43% (native tagset), 96.9% (Penn tagset)
– Evaluated on 1,000 sentences
• Available from
– ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz
39. Performance on new data
Relative performance evaluated on recent abstracts selected from
three journals:
- Nucleic Acid Research (NAR)
- Nature Medicine (NMED)
- Journal of Clinical Investigation (JCI)
training NAR NMED JCI Total (Acc.)
WSJ 109 47 102 258 (70.9%)
GENIA 121 74 132 327 (89.8%)
PennBioIE 129 65 122 316 (86.6%)
WSJ + GENIA 125 74 135 334 (91.8%)
WSJ + PennBioIE 133 71 133 337 (92.6%)
GENIA + PennBioIE 128 75 135 338 (92.9%)
WSJ + GENIA + PennBioIE 133 74 139 346 (95.1%)
40. Chunking (shallow parsing)
He reckons the current account deficit will narrow to
NP VP NP VP PP
only # 1.8 billion in September .
NP PP NP
• A chunker (shallow parser) segments a
sentence into non-recursive phrases.
43. Machine learning-based chunking
• Convert a treebank into sentences that are
annotated with chunk information.
– CoNLL-2000 data set
• http://www.cnts.ua.ac.be/conll2000/chunking/
• The conversion script is available
• Apply a sequence tagging algorithm such as
HMM, MEMM, CRF, or Semi-CRF.
• YamCha: an SVM-based chunker
– http://www.chasen.org/~taku/software/yamcha/
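Chunking data such as CoNLL-2000 encodes non-recursive phrases with BIO tags (B-NP, I-NP, O, …); a sequence tagger's output is then decoded back into spans. A minimal sketch (the function name is illustrative):

```python
def bio_to_chunks(tags):
    """Convert a BIO tag sequence (the CoNLL-2000 chunk encoding) into
    (start, end, type) spans with an exclusive end index."""
    chunks, start, ctype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last chunk
        if tag.startswith("B-") or tag == "O" or \
           (tag.startswith("I-") and tag[2:] != ctype):
            if start is not None:
                chunks.append((start, i, ctype))
            start, ctype = (i, tag[2:]) if tag != "O" else (None, None)
        # an I- tag continuing the current chunk needs no action
    return chunks
```
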
44. GENIA tagger
• Algorithm: Bidirectional MEMM
• POS tagging
– Trained on WSJ, GENIA and Penn BioIE
– Accuracy: 97-98%
• Shallow parsing
– Trained on WSJ and GENIA
– Accuracy: 90-94%
• Can output base forms
• Available from
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
45. Named-Entity Recognition
We have shown that interleukin-1 (IL-1) and IL-2 control
protein protein protein
IL-2 receptor alpha (IL-2R alpha) gene transcription in
DNA
CD4-CD8-murine T lymphocyte precursors.
cell_line
• Recognize named-entities in a sentence.
– Gene/protein names
– Protein, DNA, RNA, cell_line, cell_type
46. Performance of biomedical NE recognition
• Shared task data for Coling 2004 BioNLP workshop
- entity types: protein, DNA, RNA, cell_type, and cell_line
Recall Precision F-score
SVM+HMM (Zhou, 2004) 76.0 69.4 72.6
Semi-Markov CRFs (in prep.) 72.7 70.4 71.5
Two-Phase (Kim, 2005) 72.8 69.7 71.2
Sliding Window (in prep.) 71.5 70.2 70.8
CRF (Settles, 2005) 72.0 69.1 70.5
MEMM (Finkel, 2004) 71.6 68.6 70.1
: : : :
47. Features
Classification models and main features used in NLPBA (Kim, 2004)
[Table: for each system, the classification model and the features used are marked — Zho: S+H; Fin: M (external resources: B, W); Set: C (external resource: (W)); Son: S+C (external resource: V); Zha: H (external resource: M).]
Classification Model (CM):
S: SVM; H: HMM; M: MEMM; C: CRF
Features:
lx: lexical features; af: affix information (character n-grams); or: orthographic information;
sh: word shapes; gn: gene sequence; gz: gazetteers; po: part-of-speech tags; np: noun phrase tags;
sy: syntactic tags; tr: word triggers; ab: abbreviations; ca: cascaded entities;
do: global document information; pa: parentheses handling; pre: previously predicted entity tags
External resources (ext.): B: British National Corpus; W: WWW; V: virtually generated corpus; M: MEDLINE
48. CFG parsing
[Figure: phrase-structure (CFG) parse tree for "Estimated volume was a light 2.4 million ounces ." with POS tags VBN NN VBD DT JJ CD CD NNS . and phrases NP, QP, VP under S.]
49. Phrase structure + head information
[Figure: the same parse tree with the lexical head of each phrase marked.]
50. Dependency relations
[Figure: dependency arcs between the words of "Estimated volume was a light 2.4 million ounces ."]
51. CFG parsing algorithms
• Performance on the Penn Treebank
LR LP F-score
Generative model (Collins, 1999) 88.1 88.3 88.2
Maxent-inspired (Charniak, 2000) 89.6 89.5 89.5
Simple Synchrony Networks (Henderson, 2004) 89.8 90.4 90.1
Data Oriented Parsing (Bod, 2003) 90.8 90.7 90.7
Re-ranking (Johnson, 2005) 91.0
57. HPSG parsing
• HPSG
– A few schemas (e.g. subject-head, head-modifier)
– Many lexical entries
– Deep syntactic analysis
• Grammar
– Corpus-based grammar construction (Miyao et al., 2004)
• Parser
– Beam search (Tsuruoka et al.)
[Figure: HPSG derivation for "Mary walked slowly", built from lexical entries with HEAD/SUBJ/COMPS/MOD feature structures combined by the subject-head and head-modifier schemas.]
58. Experimental results
• Training set: Penn Treebank Section 02-21
(39,832 sentences)
• Test set: Penn Treebank Section 23 (< 40 words,
2,164 sentences)
• Accuracy of predicate argument relations (i.e.,
red arrows) is measured
Precision Recall F-score
87.9% 86.9% 87.4%
59. Parsing MEDLINE with HPSG
• Enju
– A wide-coverage HPSG parser
– http://www-tsujii.is.s.u-tokyo.ac.jp/enju/
60. Extraction of protein-protein interactions:
predicate-argument relations + SVM (1)
• (Yakushiji, 2005)
• Example: "CD4 protein (ENTITY1) interacts with non-polymorphic regions of MHCII (ENTITY2)."
• Extraction patterns based on predicate-argument relations: the predicate "interact" takes ENTITY1 as arg1 and the with-phrase containing ENTITY2 as arg2.
• SVM learning with predicate-argument patterns
61. Text Mining for Biology
• MEDIE: An interactive intelligent IR
system retrieving events
– Performs a semantic search
• InfoPubMed: an interactive IE system and
an efficient PubMed search tool, helping
users to find information about biomedical
entities such as genes, proteins, and the
interactions between them.
62. MEDIE system overview
[Figure: off-line, a deep parser and an entity recognizer convert the input textbase into a semantically annotated textbase; on-line, a region-algebra search engine answers queries over it and returns search results.]
65. Service: extracting interactions
• Info-PubMed: an interactive IE system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them.
• System components
– MEDIE
– Extraction of protein-protein interactions
– Multi-window interface on a browser
• UTokyo: NaCTeM self-funded partner
66. Info-PubMed
• helps biologists to search for their interests
– genes, proteins, their interactions, and
evidence sentences
– extracted from MEDLINE
(about 15 million abstracts of
biomedical papers)
• uses many NLP techniques explained
– in order to achieve high precision of retrieval
67. Flow chart
[Figure: input/output flow — a gene or protein keyword (e.g. the token "TNF") maps to gene or protein entities (Gene: "TNF"); a given gene maps to the interactions around it; a given interaction (e.g. between "TNF" and "IL6") maps to evidence sentences describing it.]
69. Techniques (2/2)
• Extract sentences describing protein-protein interaction
– deep parser based on HPSG syntax
• can detect semantic relations between phrases
– domain-dependent pattern recognition
• can learn and expand source patterns
• by using the result of the deep parser, it can extract semantically valid patterns
• not affected by syntactic variations