1. Linguistic techniques for Text
Mining
NaCTeM team
www.nactem.ac.uk
Sophia Ananiadou
Chikashi Nobata
Yutaka Sasaki
Yoshimasa Tsuruoka
2. Natural Language Processing
[Figure: the NLP pipeline turns raw (unstructured) text into deeply annotated (structured) text via part-of-speech tagging, named entity recognition, and syntactic parsing, drawing on a lexicon and an ontology. Example: "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells." is shown with POS tags (NN IN NN VBZ VBN IN NN IN JJ NN NNS .), phrase structure (NP, VP, PP under S), entity annotations (TNF: protein_molecule; BHA: organic_compound; U937 cells: cell_line), and the extracted relation "negative regulation".]
3. Basic Steps of Natural Language
Processing
• Sentence splitting
• Tokenization
• Part-of-speech tagging
• Shallow parsing
• Named entity recognition
• Syntactic parsing
• (Semantic Role Labeling)
4. Sentence splitting
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1
T-cell pro-inflammatory cytokine production is important in host defense
against bacterial infection in the lungs. Excessive immunosuppression of Th1
T-cell pro-inflammatory cytokines leaves patients susceptible to infection.
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.
However, Th1 T-cell pro-inflammatory cytokine production is important in host
defense against bacterial infection in the lungs.
Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines
leaves patients susceptible to infection.
5. A heuristic rule for sentence splitting
sentence boundary
= period + space(s) + capital letter
Regular expression in Perl
s/\. +([A-Z])/.\n$1/g;
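The same heuristic can be sketched in Python; this minimal version mirrors the Perl substitution (period + spaces + capital letter) and inherits the same failure modes shown on the next slide.

```python
import re

def split_sentences(text):
    """Split on the heuristic boundary: period + space(s) + capital letter.
    A minimal sketch of the rule on this slide; like the Perl one-liner,
    it mis-splits after abbreviations such as "e.g."."""
    # Insert a newline after a period followed by spaces and a capital
    # letter, keeping the capital letter (captured as group 1).
    marked = re.sub(r'\. +([A-Z])', r'.\n\1', text)
    return marked.split('\n')
```
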
6. Errors
IL-33 is known to induce the production of Th2-associated
cytokines (e.g. IL-5 and IL-13).
IL-33 is known to induce the production of Th2-associated
cytokines (e.g.
IL-5 and IL-13).
• Two solutions:
– Add more rules to handle exceptions
– Machine learning
7. Tools for sentence splitting
• JASMINE
– Rule-based
– http://uvdb3.hgc.jp/ALICE/program_download.html
• Scott Piao's splitter
– Rule-based
– http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector
• OpenNLP
– Maximum-entropy learning
– https://sourceforge.net/projects/opennlp/
– Needs training data
8. Tokenization
The protein is activated by IL2.
The protein is activated by IL2 .
• Convert a sentence into a sequence of tokens
• Why do we tokenize?
• Because we do not want to treat a sentence as a
sequence of characters!
9. Tokenization
The protein is activated by IL2.
The protein is activated by IL2 .
• Tokenizing general English sentences is
relatively straightforward.
• Use spaces as the boundaries
• Use some heuristics to handle exceptions
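The space-based strategy with a few heuristics can be sketched as follows; this is an illustrative toy tokenizer, not a full Treebank tokenizer.

```python
import re

def tokenize(sentence):
    """Whitespace tokenization plus heuristics that split off
    sentence-final punctuation and possessive endings.
    A minimal sketch of the rules described on these slides."""
    # Separate punctuation from the preceding word when it ends a token.
    sentence = re.sub(r'([.,;:!?])(\s|$)', r' \1\2', sentence)
    # Separate possessive 's (e.g. "Mary's" -> "Mary 's").
    sentence = re.sub(r"(\w)('s)\b", r'\1 \2', sentence)
    return sentence.split()
```
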
10. Tokenisation issues
• separate possessive endings or abbreviated forms from
preceding words:
– Mary's → Mary 's
– Mary's → Mary is
– Mary's → Mary has
• separate punctuation marks and quotes from words:
– Mary. → Mary .
– "new" → " new "
11. Tokenization
• Tokenizer.sed: a simple script in sed
• http://www.cis.upenn.edu/~treebank/tokenization.html
• Undesirable tokenization
– original: “1,25(OH)2D3”
– tokenized: “1 , 25 ( OH ) 2D3”
• Tokenization for biomedical text
– Not straight-forward
– Needs dictionary? Machine learning?
12. Tokenisation problems in Bio-text
• Commas
– 2,6-diaminohexanoic acid
– tricyclo(3.3.1.13,7)decanone
• Four kinds of hyphens
– “Syntactic:”
– Calcium-dependent
– Hsp-60
– Knocked-out gene: lush-- flies
– Negation: -fever
– Electric charge: Cl-
K. Cohen NAACL-2007
13. Tokenisation
• Tokenization: Divides the text into smallest
units (usually words), removing punctuation.
Challenge: What should be done with
punctuation that has linguistic meaning?
• Negative charge (Cl-)
• Absence of symptom (-fever)
• Knocked-out gene (Ski-/-)
• Gene name (IL-2 –mediated)
• Plus, "syntactic" uses (insulin-dependent)
K. Cohen NAACL-2007
14. Part-of-speech tagging
The peri-kappa B site mediates human immunodeficiency
DT NN NN NN VBZ JJ NN
virus type 2 enhancer activation in monocytes …
NN NN CD NN NN IN NNS
• Assign a part-of-speech tag to each token in a
sentence.
15. Part-of-speech tags
• The Penn Treebank tagset
– http://www.cis.upenn.edu/~treebank/
– 45 tags
NN Noun, singular or mass JJ Adjective
NNS Noun, plural JJR Adjective, comparative
NNP Proper noun, singular JJS Adjective, superlative
NNPS Proper noun, plural : :
: : DT Determiner
VB Verb, base form CD Cardinal number
VBD Verb, past tense CC Coordinating conjunction
VBG Verb, gerund or present participle IN Preposition or subordinating
VBN Verb, past participle conjunction
VBZ Verb, 3rd person singular present FW Foreign word
: : : :
16. Part-of-speech tagging is not easy
• Parts-of-speech are often ambiguous
I have to go to school.
verb
I had a go at skiing.
noun
• We need to look at the context
• But how?
17. Writing rules for part-of-speech tagging
I have to go to school. I had a go at skiing.
verb noun
• If the previous word is "to", then it's a verb.
• If the previous word is "a", then it's a noun.
• If the next word is …
:
Writing rules manually is impossible
18. Learning from examples
The involvement of ion channels in B and T lymphocyte activation is
DT NN IN NN NNS IN NN CC NN NN NN VBZ
supported by many reports of changes in ion fluxes and membrane
VBN IN JJ NNS IN NNS IN NN NNS CC NN
…………………………………………………………………………………….
…………………………………………………………………………………….
[Figure: the annotated sentences above are used as training data for a machine learning algorithm; on unseen text, the learned tagger predicts tags, e.g. "We demonstrate that …" → PRP VBP IN ….]
19. Part-of-speech tagging with Hidden Markov Models
P(t1…tn | w1…wn) = P(w1…wn | t1…tn) P(t1…tn) / P(w1…wn)
(t1…tn: tags; w1…wn: words)
P(w1…wn | t1…tn) P(t1…tn) ≈ Π(i=1…n) P(wi | ti) P(ti | ti-1)
P(wi | ti): output probability; P(ti | ti-1): transition probability
20. First-order Hidden Markov Models
• Training
– Estimate P(word_j | tag_x) and P(tag_y | tag_z)
– Counting (+ smoothing)
• Using the tagger
– Find argmax(t1…tn) Π(i=1…n) P(wi | ti) P(ti | ti-1)
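The argmax over tag sequences is typically computed with Viterbi dynamic programming; a minimal sketch, where the probability-table names (`emit`, `trans`, `start`) are illustrative:

```python
def viterbi(words, tags, emit, trans, start):
    """Viterbi decoding for a first-order HMM tagger: returns the tag
    sequence maximizing prod_i P(w_i|t_i) P(t_i|t_{i-1}).
    emit[t][w], trans[prev][t], start[t] hold probabilities estimated
    by counting; 1e-8 stands in for a smoothed unseen-word probability."""
    # best[t] = (score of the best path ending in tag t, that path)
    best = {t: (start[t] * emit[t].get(words[0], 1e-8), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            # Pick the best previous tag to transition from.
            p, path = max(
                ((best[prev][0] * trans[prev][t], best[prev][1]) for prev in tags),
                key=lambda x: x[0])
            new[t] = (p * emit[t].get(w, 1e-8), path + [t])
        best = new
    return max(best.values(), key=lambda x: x[0])[1]
```
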
21. Machine learning using diverse features
• We want to use diverse types of
information when predicting the tag.
He opened it
Verb
The word is “opened”
The suffix is “ed”
many clues The previous word is “He”
:
22. Machine learning with log-linear models
p(y|x) = (1/Z(x)) exp( Σ_i λ_i f_i(x, y) )
Z(x) = Σ_y exp( Σ_i λ_i f_i(x, y) )
f_i: feature function; λ_i: feature weight
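As a concrete illustration of the formula above, a tiny implementation with binary features; the helper names (`features`, `weights`) are illustrative, not from the slides:

```python
import math

def loglinear_prob(y, x, classes, features, weights):
    """p(y|x) = exp(sum_i w_i f_i(x, y)) / Z(x), as on this slide.
    `features(x, y)` returns the names of the binary feature functions
    that fire; unlisted features have weight 0."""
    def score(c):
        return math.exp(sum(weights.get(f, 0.0) for f in features(x, c)))
    z = sum(score(c) for c in classes)  # Z(x): normalize over all classes
    return score(y) / z
```
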
23. Machine learning with log-linear models
• Maximum likelihood estimation
– Find the parameters that maximize the conditional log-likelihood of the training data
LL(λ) = Σ_{(x,y)} log p(y|x), summed over training pairs drawn from the empirical distribution p̃(x, y)
• Gradient
∂LL(λ)/∂λ_i = E_p̃[f_i] − E_p[f_i]
24. Computing likelihood and model expectation
• Example
– Two possible tags: "Noun" and "Verb"
– Two types of features: "word" and "suffix"
He opened it
Noun Verb Noun
[Figure: for the word "opened", the scores exp(λ(tag=verb, word=opened) + λ(tag=verb, suffix=ed)) and exp(λ(tag=noun, word=opened) + λ(tag=noun, suffix=ed)) are normalized to give p(tag = verb) and p(tag = noun).]
25. Conditional Random Fields (CRFs)
• A single log-linear model on the whole sentence
P(y1…yn | x) = (1/Z(x)) exp( Σ(t=1…n) Σ(i=1…F) λ_i f_i(y_{t-1}, y_t, x, t) )
• The number of classes (possible tag sequences) is HUGE, so it is impossible to do the estimation in a naive way.
26. Conditional Random Fields (CRFs)
• Solution
– Let‟s restrict the types of features
– You can then use a dynamic programming
algorithm that drastically reduces the amount of
computation
• Features you can use (in first-order CRFs)
– Features defined on the tag
– Features defined on the adjacent pair of tags
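The dynamic programming mentioned above can be illustrated with the forward algorithm for the partition function Z of a first-order chain; a sketch, with an illustrative `score` callback standing in for the weighted feature sums on states and edges:

```python
import math

def crf_log_partition(n, tags, score):
    """log Z for a linear-chain CRF via the forward algorithm.
    `score(prev_tag, tag, t)` returns the sum of weighted features on
    position t and the edge (prev_tag, tag) in log space; prev_tag is
    None at t = 0. Cost is O(n * |tags|^2) instead of |tags|^n."""
    # alpha[t] = log-sum-exp of scores of all prefixes ending in tag t
    alpha = {t: score(None, t, 0) for t in tags}
    for pos in range(1, n):
        alpha = {
            t: math.log(sum(math.exp(alpha[p] + score(p, t, pos)) for p in tags))
            for t in tags}
    return math.log(sum(math.exp(a) for a in alpha.values()))
```

With all scores zero, every one of the |tags|^n sequences gets weight 1, so log Z = n · log |tags|, which the dynamic program reproduces without enumerating sequences.
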
27. Features
• Feature weights are associated with states and edges
[Figure: tag lattice for "He has opened it" with Noun/Verb states at each position; a state feature such as (W0=He & Tag=Noun) and an edge feature such as (Tagleft=Noun & Tagright=Noun) each carry a weight.]
32. Maximum entropy learning and
Conditional Random Fields
• Maximum entropy learning
– Log-linear modeling + MLE
– Parameter estimation
• Likelihood of each sample
• Model expectation of each feature
• Conditional Random Fields
– Log-linear modeling on the whole sentence
– Features are defined on states and edges
– Dynamic programming
33. POS tagging algorithms
• Performance on the Wall Street Journal corpus
                               Training cost / Speed / Accuracy
Dependency Net (2003)          Low / Low / 97.2
Conditional Random Fields      High / High / 97.1
Support vector machines (2003) 97.1
Bidirectional MEMM (2005)      Low / 97.1
Brill's tagger (1995)          Low / 96.6
HMM (2000)                     Very low / High / 96.7
35. Tagging errors made by
a WSJ-trained POS tagger
… and membrane potential after mitogen binding.
CC NN NN IN NN JJ
… two factors, which bind to the same kappa B enhancers…
CD NNS WDT NN TO DT JJ NN NN NNS
… by analysing the Ag amino acid sequence.
IN VBG DT VBG JJ NN NN
… to contain more T-cell determinants than …
TO VB RBR JJ NNS IN
Stimulation of interferon beta gene transcription in vitro by
NN IN JJ JJ NN NN IN NN IN
36. Taggers for general text do not work well on biomedical text
Performance of the Brill tagger evaluated on 1,000 randomly selected MEDLINE sentences: 86.8% (Smith et al., 2004)
Accuracies of a WSJ-trained POS tagger evaluated on the GENIA corpus (Tsuruoka et al., 2005):
                     Accuracy
Exact                84.4%
NNP = NN, NNPS = NNS 90.0%
LS = NN              91.3%
JJ = NN              94.9%
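The relaxed accuracies in the table above come from mapping tags to a coarser form before comparison; a minimal sketch (the function and mapping names are illustrative):

```python
def accuracy(gold, pred, merge=None):
    """Token-level tagging accuracy; `merge` optionally maps tags to a
    canonical form before comparison (e.g. {"NNP": "NN", "NNPS": "NNS"}),
    mirroring the relaxed scores in the table above."""
    merge = merge or {}
    norm = lambda t: merge.get(t, t)
    hits = sum(norm(g) == norm(p) for g, p in zip(gold, pred))
    return hits / len(gold)
```
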
37. MedPost
(Smith et al., 2004)
• Hidden Markov Models (HMMs)
• Training data
– 5700 sentences randomly selected from various
thematic subsets.
• Accuracy
– 97.43% (native tagset), 96.9% (Penn tagset)
– Evaluated on 1,000 sentences
• Available from
– ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz
39. Performance on new data
Relative performance evaluated on recent abstracts selected from
three journals:
- Nucleic Acid Research (NAR)
- Nature Medicine (NMED)
- Journal of Clinical Investigation (JCI)
training NAR NMED JCI Total (Acc.)
WSJ 109 47 102 258 (70.9%)
GENIA 121 74 132 327 (89.8%)
PennBioIE 129 65 122 316 (86.6%)
WSJ + GENIA 125 74 135 334 (91.8%)
WSJ + PennBioIE 133 71 133 337 (92.6%)
GENIA + PennBioIE 128 75 135 338 (92.9%)
WSJ + GENIA + PennBioIE 133 74 139 346 (95.1%)
40. Chunking (shallow parsing)
He reckons the current account deficit will narrow to
NP VP NP VP PP
only # 1.8 billion in September .
NP PP NP
• A chunker (shallow parser) segments a
sentence into non-recursive phrases.
43. Machine learning-based chunking
• Convert a treebank into sentences that are
annotated with chunk information.
– CoNLL-2000 data set
• http://www.cnts.ua.ac.be/conll2000/chunking/
• The conversion script is available
• Apply a sequence tagging algorithm such as
HMM, MEMM, CRF, or Semi-CRF.
• YamCha: an SVM-based chunker
– http://www.chasen.org/~taku/software/yamcha/
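Chunking data such as CoNLL-2000 encodes non-recursive phrases with BIO tags (B-NP, I-NP, O, …); a sequence tagger's output is then decoded back into spans. A minimal sketch (the function name is illustrative):

```python
def bio_to_chunks(tags):
    """Convert a BIO tag sequence (the CoNLL-2000 chunk encoding) into
    (start, end, type) spans with an exclusive end index."""
    chunks, start, ctype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last chunk
        if tag.startswith("B-") or tag == "O" or \
           (tag.startswith("I-") and tag[2:] != ctype):
            if start is not None:
                chunks.append((start, i, ctype))
            start, ctype = (i, tag[2:]) if tag != "O" else (None, None)
        # an I- tag continuing the current chunk needs no action
    return chunks
```
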
44. GENIA tagger
• Algorithm: Bidirectional MEMM
• POS tagging
– Trained on WSJ, GENIA and Penn BioIE
– Accuracy: 97-98%
• Shallow parsing
– Trained on WSJ and GENIA
– Accuracy: 90-94%
• Can output base forms
• Available from
http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
45. Named-Entity Recognition
We have shown that interleukin-1 (IL-1) and IL-2 control
protein protein protein
IL-2 receptor alpha (IL-2R alpha) gene transcription in
DNA
CD4-CD8-murine T lymphocyte precursors.
cell_line
• Recognize named-entities in a sentence.
– Gene/protein names
– Protein, DNA, RNA, cell_line, cell_type
46. Performance of biomedical NE recognition
• Shared task data for Coling 2004 BioNLP workshop
- entity types: protein, DNA, RNA, cell_type, and cell_line
Recall Precision F-score
SVM+HMM (Zhou, 2004) 76.0 69.4 72.6
Semi-Markov CRFs (in prep.) 72.7 70.4 71.5
Two-Phase (Kim, 2005) 72.8 69.7 71.2
Sliding Window (in prep.) 71.5 70.2 70.8
CRF (Settles, 2005) 72.0 69.1 70.5
MEMM (Finkel, 2004) 71.6 68.6 70.1
: : : :
47. Features
Classification models and main features used in NLPBA (Kim, 2004)
[Table: for each system, the classification model and the features used are marked — Zho: S+H; Fin: M (external resources: B, W); Set: C (external resource: (W)); Son: S+C (external resource: V); Zha: H (external resource: M).]
Classification Model (CM):
S: SVM; H: HMM; M: MEMM; C: CRF
Features:
lx: lexical features; af: affix information (character n-grams); or: orthographic information;
sh: word shapes; gn: gene sequence; gz: gazetteers; po: part-of-speech tags; np: noun phrase tags;
sy: syntactic tags; tr: word triggers; ab: abbreviations; ca: cascaded entities;
do: global document information; pa: parentheses handling; pre: previously predicted entity tags
External resources (ext.): B: British National Corpus; W: WWW; V: virtually generated corpus; M: MEDLINE
48. CFG parsing
[Figure: phrase-structure (CFG) parse tree for "Estimated volume was a light 2.4 million ounces ." with POS tags VBN NN VBD DT JJ CD CD NNS . and phrases NP, QP, VP under S.]
49. Phrase structure + head information
[Figure: the same parse tree with the lexical head of each phrase marked.]
50. Dependency relations
[Figure: dependency arcs between the words of "Estimated volume was a light 2.4 million ounces ."]
51. CFG parsing algorithms
• Performance on the Penn Treebank
LR LP F-score
Generative model (Collins, 1999) 88.1 88.3 88.2
Maxent-inspired (Charniak, 2000) 89.6 89.5 89.5
Simple Synchrony Networks (Henderson, 2004) 89.8 90.4 90.1
Data Oriented Parsing (Bod, 2003) 90.8 90.7 90.7
Re-ranking (Johnson, 2005) 91.0
57. HPSG parsing
• HPSG
– A few schemas (e.g. subject-head, head-modifier)
– Many lexical entries
– Deep syntactic analysis
• Grammar
– Corpus-based grammar construction (Miyao et al., 2004)
• Parser
– Beam search (Tsuruoka et al.)
[Figure: HPSG derivation for "Mary walked slowly", built from lexical entries with HEAD/SUBJ/COMPS/MOD feature structures combined by the subject-head and head-modifier schemas.]
58. Experimental results
• Training set: Penn Treebank Section 02-21
(39,832 sentences)
• Test set: Penn Treebank Section 23 (< 40 words,
2,164 sentences)
• Accuracy of predicate argument relations (i.e.,
red arrows) is measured
Precision Recall F-score
87.9% 86.9% 87.4%
59. Parsing MEDLINE with HPSG
• Enju
– A wide-coverage HPSG parser
– http://www-tsujii.is.s.u-tokyo.ac.jp/enju/
60. Extraction of protein-protein interactions:
predicate-argument relations + SVM (1)
• (Yakushiji, 2005)
• Example: "CD4 protein (ENTITY1) interacts with non-polymorphic regions of MHCII (ENTITY2)."
• Extraction patterns based on predicate-argument relations: the predicate "interact" takes ENTITY1 as arg1 and the with-phrase containing ENTITY2 as arg2.
• SVM learning with predicate-argument patterns
61. Text Mining for Biology
• MEDIE: An interactive intelligent IR
system retrieving events
– Performs a semantic search
• InfoPubMed: an interactive IE system and
an efficient PubMed search tool, helping
users to find information about biomedical
entities such as genes, proteins, and the
interactions between them.
62. MEDIE system overview
[Figure: off-line, a deep parser and an entity recognizer convert the input textbase into a semantically annotated textbase; on-line, a region-algebra search engine answers queries over it and returns search results.]
65. Service: extracting interactions
• Info-PubMed: an interactive IE system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them.
• System components
– MEDIE
– Extraction of protein-protein interactions
– Multi-window interface on a browser
• UTokyo: NaCTeM self-funded partner
66. Info-PubMed
• helps biologists to search for their interests
– genes, proteins, their interactions, and
evidence sentences
– extracted from MEDLINE
(about 15 million abstracts of
biomedical papers)
• uses many NLP techniques explained
– in order to achieve high precision of retrieval
67. Flow chart
[Figure: input/output flow — a gene or protein keyword (e.g. the token "TNF") maps to gene or protein entities (Gene: "TNF"); a given gene maps to the interactions around it; a given interaction (e.g. between "TNF" and "IL6") maps to evidence sentences describing it.]
69. Techniques (2/2)
• Extract sentences describing protein-protein interaction
– deep parser based on HPSG syntax
• can detect semantic relations between phrases
– domain-dependent pattern recognition
• can learn and expand source patterns
• by using the result of the deep parser, it can extract semantically valid patterns
• not affected by syntactic variations