Linguistic Techniques for Text Mining
NaCTeM team
www.nactem.ac.uk
Sophia Ananiadou
Chikashi Nobata
Yutaka Sasaki
Yoshimasa Tsuruoka
[Figure: the NLP pipeline. Raw (unstructured) text passes through part-of-speech tagging, named entity recognition, and deep syntactic parsing, supported by a lexicon and an ontology, to yield annotated (structured) text. Worked example: "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells ." is tagged NN IN NN VBZ VBN IN NN IN JJ NN NNS ., parsed into S/VP/NP/PP constituents, annotated with the entities TNF (protein_molecule), BHA (organic_compound), and U937 cells (cell_line), and yields the extracted event "negative regulation".]
Basic Steps of Natural Language Processing
•   Sentence splitting
•   Tokenization
•   Part-of-speech tagging
•   Shallow parsing
•   Named entity recognition
•   Syntactic parsing
•   (Semantic Role Labeling)
Sentence splitting
Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines. However, Th1
T-cell pro-inflammatory cytokine production is important in host defense
against bacterial infection in the lungs. Excessive immunosuppression of Th1
T-cell pro-inflammatory cytokines leaves patients susceptible to infection.




Current immunosuppression protocols to prevent lung transplant rejection
reduce pro-inflammatory and T-helper type 1 (Th1) cytokines.

However, Th1 T-cell pro-inflammatory cytokine production is important in host
defense against bacterial infection in the lungs.

Excessive immunosuppression of Th1 T-cell pro-inflammatory cytokines
leaves patients susceptible to infection.
A heuristic rule for sentence splitting

 sentence boundary
      = period + space(s) + capital letter

Regular expression in Perl

       s/\. +([A-Z])/.\n$1/g;
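The same heuristic in Python, as a minimal sketch (the function name and behavior of returning a list are illustrative choices):

    import re

    def split_sentences(text):
        # Insert a newline after "period + space(s) + capital letter",
        # keeping the capital letter as the start of the next sentence.
        return re.sub(r'\. +([A-Z])', '.\n\\1', text).split('\n')

As the next slide shows, this rule misfires on abbreviations such as "e.g.".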
Errors
   IL-33 is known to induce the production of Th2-associated
   cytokines (e.g. IL-5 and IL-13).



   IL-33 is known to induce the production of Th2-associated
   cytokines (e.g.

   IL-5 and IL-13).


• Two solutions:
  – Add more rules to handle exceptions
  – Machine learning
Tools for sentence splitting
• JASMINE
   – Rule-based
   – http://uvdb3.hgc.jp/ALICE/program_download.html
• Scott Piao's splitter
   – Rule-based
   – http://text0.mib.man.ac.uk:8080/scottpiao/sent_detector
• OpenNLP
   – Maximum-entropy learning
   – https://sourceforge.net/projects/opennlp/
   – Needs training data

Tokenization
              The protein is activated by IL2.



        The    protein   is   activated   by     IL2   .


• Convert a sentence into a sequence of tokens

• Why do we tokenize?
• Because we do not want to treat a sentence as a
  sequence of characters!
Tokenization
            The protein is activated by IL2.



      The    protein   is   activated   by     IL2   .


• Tokenizing general English sentences is
  relatively straightforward.
• Use spaces as the boundaries
• Use some heuristics to handle exceptions
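For illustration, a whitespace tokenizer with one such heuristic (a sketch only; real tokenizers handle many more cases):

    import re

    def tokenize(sentence):
        tokens = []
        for word in sentence.split():
            # Heuristic: split off sentence-final punctuation, e.g. "IL2." -> "IL2", "."
            # Internal punctuation, as in "1,25(OH)2D3", is deliberately left alone.
            m = re.match(r'^(.+?)([.,;:!?]+)$', word)
            if m:
                tokens.extend([m.group(1), m.group(2)])
            else:
                tokens.append(word)
        return tokens

    tokenize("The protein is activated by IL2.")
    # ['The', 'protein', 'is', 'activated', 'by', 'IL2', '.']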
Tokenisation issues
• separate possessive endings or abbreviated forms from preceding words:
   – Mary's → Mary 's
   – Mary's → Mary is
   – Mary's → Mary has
• separate punctuation marks and quotes from words:
   – Mary. → Mary .
   – "new" → " new "
Tokenization
• Tokenizer.sed: a simple script in sed
   – http://www.cis.upenn.edu/~treebank/tokenization.html
• Undesirable tokenization
  – original: "1,25(OH)2D3"
  – tokenized: "1 , 25 ( OH ) 2D3"
• Tokenization for biomedical text
  – Not straightforward
  – Needs a dictionary? Machine learning?
Tokenisation problems in Bio-text
• Commas
  – 2,6-diaminohexanoic acid
  – tricyclo(3.3.1.13,7)decanone
• Four kinds of hyphens
  – "Syntactic": Calcium-dependent, Hsp-60
  – Knocked-out gene: lush-- flies
  – Negation: -fever
  – Electric charge: Cl-

  K. Cohen NAACL-2007
Tokenisation

• Tokenization: Divides the text into smallest
  units (usually words), removing punctuation.
  Challenge: What should be done with
  punctuation that has linguistic meaning?
• Negative charge (Cl-)
• Absence of symptom (-fever)
• Knocked-out gene (Ski-/-)
• Gene name (IL-2 –mediated)
• Plus, "syntactic" uses (insulin-dependent)

K. Cohen NAACL-2007
Part-of-speech tagging

 The peri-kappa B site mediates human immunodeficiency
 DT NN         NN NN VBZ          JJ      NN
 virus type 2 enhancer activation in monocytes …
  NN NN CD        NN       NN      IN   NNS


• Assign a part-of-speech tag to each token in a
  sentence.




Part-of-speech tags
• The Penn Treebank tagset
  – http://www.cis.upenn.edu/~treebank/
  – 45 tags
 NN     Noun, singular or mass
 NNS    Noun, plural
 NNP    Proper noun, singular
 NNPS   Proper noun, plural
 VB     Verb, base form
 VBD    Verb, past tense
 VBG    Verb, gerund or present participle
 VBN    Verb, past participle
 VBZ    Verb, 3rd person singular present
 JJ     Adjective
 JJR    Adjective, comparative
 JJS    Adjective, superlative
 DT     Determiner
 CD     Cardinal number
 CC     Coordinating conjunction
 IN     Preposition or subordinating conjunction
 FW     Foreign word
  :         :
Part-of-speech tagging is not easy
• Parts-of-speech are often ambiguous
           I have to go to school.
                    verb

           I had a go at skiing.
                  noun

• We need to look at the context
• But how?

Writing rules for part-of-speech tagging

    I have to go to school.     I had a go at skiing.
             verb                      noun

• If the previous word is "to", then it's a verb.
• If the previous word is "a", then it's a noun.
• If the next word is …
            :

       Writing rules manually is impossible
Learning from examples
      The involvement of ion channels in B and T lymphocyte activation is
       DT    NN      IN NN NNS IN NN CC NN              NN      NN    VBZ
      supported by many reports of changes in ion fluxes and membrane
        VBN      IN JJ     NNS IN NNS IN NN NNS CC NN
      …………………………………………………………………………………….
      …………………………………………………………………………………….



The annotated sentences serve as training data:

  Unseen text: "We demonstrate that …"
  → Machine Learning Algorithm →
  Tagged output: We/PRP demonstrate/VBP that/IN …
Part-of-speech tagging with Hidden Markov Models

$$P(t_1 \ldots t_n \mid w_1 \ldots w_n) = \frac{P(w_1 \ldots w_n \mid t_1 \ldots t_n)\,P(t_1 \ldots t_n)}{P(w_1 \ldots w_n)} \propto P(w_1 \ldots w_n \mid t_1 \ldots t_n)\,P(t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$$

where the $t_i$ are tags and the $w_i$ words; $P(w_i \mid t_i)$ is the output probability and $P(t_i \mid t_{i-1})$ the transition probability.
First-order Hidden Markov Models

• Training
  – Estimate $P(\mathrm{word}_j \mid \mathrm{tag}_x)$ and $P(\mathrm{tag}_y \mid \mathrm{tag}_z)$
  – Counting (+ smoothing)

• Using the tagger
$$\arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})$$
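This argmax can be computed with the Viterbi algorithm; a sketch, assuming the tables trans and emit have already been estimated by counting (the names, the '<s>' start symbol, and the 1e-8 floor standing in for real smoothing are all illustrative):

    def viterbi(words, tags, trans, emit):
        # best[t]: probability of the best tag sequence ending in tag t
        best = {t: trans.get(('<s>', t), 1e-8) * emit.get((words[0], t), 1e-8)
                for t in tags}
        back = []
        for w in words[1:]:
            prev, best, ptr = best, {}, {}
            for t in tags:
                # max over previous tags u of  best-so-far(u) * P(t|u), then * P(w|t)
                p, u = max((prev[u] * trans.get((u, t), 1e-8), u) for u in tags)
                best[t] = p * emit.get((w, t), 1e-8)
                ptr[t] = u
            back.append(ptr)
        # Recover the best path by following back-pointers from the best final tag
        t = max(best, key=best.get)
        path = [t]
        for ptr in reversed(back):
            t = ptr[t]
            path.append(t)
        return list(reversed(path))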
Machine learning using diverse features


• We want to use diverse types of
  information when predicting the tag.

               He       opened        it

                        Verb

               The word is “opened”
               The suffix is “ed”
  many clues   The previous word is “He”
                :
Machine learning with log-linear models

$$p(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big), \qquad Z(x) = \sum_{y} \exp\Big(\sum_i \lambda_i f_i(x, y)\Big)$$

where the $f_i(x, y)$ are the feature functions and the $\lambda_i$ their feature weights.
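A minimal sketch of this model with hand-made features (the feature names, weights, and the fixed two-tag label set are invented for illustration):

    import math

    def p(y, x, weights, features):
        # score(y') = exp of the summed weights of the active features f_i(x, y')
        def score(y2):
            return math.exp(sum(weights.get(f, 0.0) for f in features(x, y2)))
        Z = sum(score(y2) for y2 in ('Noun', 'Verb'))   # Z(x): sum over all tags
        return score(y) / Z

    def features(x, y):
        return ['word=%s,tag=%s' % (x, y), 'suffix=%s,tag=%s' % (x[-2:], y)]

    weights = {'word=opened,tag=Verb': 0.8, 'suffix=ed,tag=Verb': 1.5}
    p('Verb', 'opened', weights, features)   # about 0.91 with these weights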
Machine learning with log-linear models

• Maximum likelihood estimation
  – Find the parameters that maximize the
    conditional log-likelihood of the training data
$$LL(\lambda) = \sum_{(x, y)} \log p(y \mid x)$$

where the sum runs over the training examples, drawn from the empirical distribution $\tilde{p}(x)\,\tilde{p}(y \mid x)$.

• Gradient
$$\frac{\partial LL(\lambda)}{\partial \lambda_i} = E_{\tilde{p}}[f_i] - E_{p}[f_i]$$
Computing likelihood and model expectation
• Example
   – Two possible tags: "Noun" and "Verb"
   – Two types of features: "word" and "suffix"

       He        opened        it
      Noun        Verb        Noun

$$p(\text{verb} \mid \text{opened}) = \frac{e^{\lambda(\text{tag=verb, word=opened}) + \lambda(\text{tag=verb, suffix=ed})}}{e^{\lambda(\text{tag=noun, word=opened}) + \lambda(\text{tag=noun, suffix=ed})} + e^{\lambda(\text{tag=verb, word=opened}) + \lambda(\text{tag=verb, suffix=ed})}}$$
Conditional Random Fields (CRFs)

• A single log-linear model on the whole sentence

$$P(y_1 \ldots y_n \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i=1}^{F} \sum_{t=1}^{n} \lambda_i f_i(t, y_{t-1}, y_t, x)\Big)$$

• The number of classes (possible tag sequences) is HUGE, so it is
  impossible to do the estimation in a naive way.
Conditional Random Fields (CRFs)

• Solution
  – Let's restrict the types of features
  – You can then use a dynamic programming
    algorithm that drastically reduces the amount of
    computation


• Features you can use (in first-order CRFs)
  – Features defined on the tag
  – Features defined on the adjacent pair of tags

Features
• Feature weights are associated with states and edges

       He        has       opened       it
      Noun      Noun        Noun       Noun
      Verb      Verb        Verb       Verb

Example state feature: W0=He & Tag=Noun. Example edge feature: Tag_left=Noun & Tag_right=Noun.
A naive way of calculating Z(x)


Noun   Noun   Noun   Noun   = 7.2    Verb   Noun   Noun   Noun   = 4.1
Noun   Noun   Noun   Verb   = 1.3    Verb   Noun   Noun   Verb   = 0.8
Noun   Noun   Verb   Noun   = 4.5    Verb   Noun   Verb   Noun   = 9.7
Noun   Noun   Verb   Verb   = 0.9    Verb   Noun   Verb   Verb   = 5.5
Noun   Verb   Noun   Noun   = 2.3    Verb   Verb   Noun   Noun   = 5.7
Noun   Verb   Noun   Verb   = 11.2   Verb   Verb   Noun   Verb   = 4.3
Noun   Verb   Verb   Noun   = 3.4    Verb   Verb   Verb   Noun   = 2.2
Noun   Verb   Verb   Verb   = 2.5    Verb   Verb   Verb   Verb   = 1.9

                                                          Sum = 67.5
Dynamic programming
• Results of intermediate computation can be reused.

       He        has       opened       it
      Noun      Noun        Noun       Noun
      Verb      Verb        Verb       Verb

A forward pass accumulates scores over this lattice from left to right and a backward pass from right to left; combining the two yields the marginal distribution of the tag at each position.
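A sketch of the forward pass computing Z(x) over such a lattice (state_score and edge_score stand in for the exponentiated feature weights; all names are illustrative):

    def forward_Z(n_words, tags, state_score, edge_score):
        # alpha[t]: total score of all partial tag sequences ending in tag t
        alpha = {t: state_score(0, t) for t in tags}
        for i in range(1, n_words):
            alpha = {t: state_score(i, t) *
                        sum(alpha[u] * edge_score(u, t) for u in tags)
                     for t in tags}
        return sum(alpha.values())

For the 4-word, 2-tag example above this sums the same 16 sequence scores as the naive enumeration (67.5), but with work linear in the sentence length.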
Maximum entropy learning and Conditional Random Fields
• Maximum entropy learning
  – Log-linear modeling + MLE
  – Parameter estimation
     • Likelihood of each sample
     • Model expectation of each feature
• Conditional Random Fields
  – Log-linear modeling on the whole sentence
  – Features are defined on states and edges
  – Dynamic programming
POS tagging algorithms
  • Performance on the Wall Street Journal corpus

                                   Training cost   Speed      Accuracy
 Dependency Net (2003)             Low             Low        97.2
 Conditional Random Fields         High            High       97.1
 Support vector machines (2003)                               97.1
 Bidirectional MEMM (2005)         Low                        97.1
 Brill's tagger (1995)             Low                        96.6
 HMM (2000)                        Very low        High       96.7
POS taggers
• Brill's tagger
   – http://www.cs.jhu.edu/~brill/
• TnT tagger
   – http://www.coli.uni-saarland.de/~thorsten/tnt/
• Stanford tagger
   – http://nlp.stanford.edu/software/tagger.shtml
• SVMTool
   – http://www.lsi.upc.es/~nlp/SVMTool/
• GENIA tagger
   – http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/

Tagging errors made by a WSJ-trained POS tagger
… and membrane potential after mitogen binding.
  CC       NN        NN      IN    NN      JJ
… two factors, which bind to the same kappa B enhancers…
  CD NNS WDT NN TO DT JJ NN NN NNS
… by analysing the Ag amino acid sequence.
  IN VBG DT VBG JJ NN                  NN
… to contain more T-cell determinants than …
 TO VB RBR JJ                NNS        IN
  Stimulation of interferon beta gene transcription in vitro by
      NN      IN    JJ      JJ NN          NN       IN NN IN

Taggers for general text do not work well on biomedical text

Performance of the Brill tagger evaluated on 1,000 randomly selected
MEDLINE sentences: 86.8% (Smith et al., 2004)

                                Accuracy
        Exact                   84.4%
        NNP = NN, NNPS = NNS    90.0%
        LS = NN                 91.3%
        JJ = NN                 94.9%

Accuracies of a WSJ-trained POS tagger evaluated on the GENIA corpus
(Tsuruoka et al., 2005); the lower rows count the listed tag pairs as equivalent.
MedPost
                (Smith et al., 2004)
• Hidden Markov Models (HMMs)
• Training data
  – 5700 sentences randomly selected from various
    thematic subsets.
• Accuracy
  – 97.43% (native tagset), 96.9% (Penn tagset)
  – Evaluated on 1,000 sentences
• Available from
  – ftp://ftp.ncbi.nlm.nih.gov/pub/lsmith/MedPost/medpost.tar.gz


Training POS taggers with bio-corpora
           (Tsuruoka and Tsujii, 2005)

training \ evaluation     WSJ    GENIA   PennBioIE
WSJ                       97.2   91.6      90.5
GENIA                     85.3   98.6      92.2
PennBioIE                 87.4   93.4      97.9
WSJ + GENIA               97.2   98.5      93.6
WSJ + PennBioIE           97.2   94.0      98.0
GENIA + PennBioIE         88.3   98.4      97.8
WSJ + GENIA + PennBioIE   97.2   98.4      97.9

Performance on new data
     Relative performance evaluated on recent abstracts selected from
     three journals:
  - Nucleic Acids Research (NAR)
       - Nature Medicine (NMED)
       - Journal of Clinical Investigation (JCI)
training                          NAR     NMED      JCI       Total (Acc.)
WSJ                               109       47       102       258 (70.9%)
GENIA                             121       74       132       327 (89.8%)
PennBioIE                         129       65       122       316 (86.6%)
WSJ + GENIA                       125       74       135       334 (91.8%)
WSJ + PennBioIE                   133       71       133       337 (92.6%)
GENIA + PennBioIE                 128       75       135       338 (92.9%)
WSJ + GENIA + PennBioIE           133       74       139       346 (95.1%)
Chunking (shallow parsing)

 [NP He] [VP reckons] [NP the current account deficit] [VP will narrow]
 [PP to] [NP only # 1.8 billion] [PP in] [NP September] .


• A chunker (shallow parser) segments a
  sentence into non-recursive phrases.




Extracting noun phrases from MEDLINE
               (Bennett, 1999)
• Rule-based noun phrase extraction
   – Tokenization
   – Part-Of-Speech tagging
   – Pattern matching


Noun phrase extraction accuracies evaluated on 40 abstracts
             FastNPE      NPtool       Chopper        AZ Phraser
 Recall      50%          95%          97%            92%
 Precision   80%          96%          90%            86%

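The pattern-matching step can be a simple scan over the POS tags; a sketch of one base-NP pattern (determiner/adjectives/nouns, ending in a noun) with illustrative helper names:

    def extract_noun_phrases(tagged):
        # tagged: list of (token, POS) pairs from a POS tagger
        phrases, current = [], []
        for token, pos in tagged:
            if pos in ('DT', 'JJ') or pos.startswith('NN'):
                current.append((token, pos))
            else:
                flush(phrases, current)
                current = []
        flush(phrases, current)
        return phrases

    def flush(phrases, current):
        # Keep a candidate only if it actually ends in a noun
        if current and current[-1][1].startswith('NN'):
            phrases.append(' '.join(tok for tok, _ in current))

    extract_noun_phrases([('The', 'DT'), ('peri-kappa', 'NN'), ('B', 'NN'),
                          ('site', 'NN'), ('mediates', 'VBZ')])
    # ['The peri-kappa B site']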
Chunking with Machine learning


• Chunking performance on Penn Treebank

                                              Recall   Precision   F-score
 Winnow (with basic features) (Zhang, 2002)   93.60    93.54       93.57
 Perceptron (Carreras, 2003)                  93.29    94.19       93.74
 SVM + voting (Kudoh, 2003)                   93.92    93.89       93.91
 SVM (Kudo, 2000)                             93.51    93.45       93.48
 Bidirectional MEMM (Tsuruoka, 2005)          93.70    93.70       93.70
Machine learning-based chunking
• Convert a treebank into sentences that are
  annotated with chunk information.
   – CoNLL-2000 data set
     • http://www.cnts.ua.ac.be/conll2000/chunking/
     • The conversion script is available
• Apply a sequence tagging algorithm such as
  HMM, MEMM, CRF, or Semi-CRF.
• YamCha: an SVM-based chunker
  – http://www.chasen.org/~taku/software/yamcha/


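The chunk annotation is encoded as per-token B/I/O tags so that any sequence tagger applies; a sketch of the encoding (the span format is invented for illustration):

    def to_bio(tokens, chunks):
        # chunks: list of (start, end, label) spans over the token list
        tags = ['O'] * len(tokens)
        for start, end, label in chunks:
            tags[start] = 'B-' + label
            for i in range(start + 1, end):
                tags[i] = 'I-' + label
        return list(zip(tokens, tags))

    to_bio(['He', 'reckons', 'the', 'current', 'account', 'deficit'],
           [(0, 1, 'NP'), (1, 2, 'VP'), (2, 6, 'NP')])
    # [('He', 'B-NP'), ('reckons', 'B-VP'), ('the', 'B-NP'),
    #  ('current', 'I-NP'), ('account', 'I-NP'), ('deficit', 'I-NP')]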
GENIA tagger
• Algorithm: Bidirectional MEMM
• POS tagging
  – Trained on WSJ, GENIA and Penn BioIE
  – Accuracy: 97-98%
• Shallow parsing
  – Trained on WSJ and GENIA
  – Accuracy: 90-94%
• Can output base forms
• Available from
  http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
Named-Entity Recognition
  We have shown that interleukin-1 (IL-1) and IL-2 control
                           protein protein protein
  IL-2 receptor alpha (IL-2R alpha) gene transcription in
               DNA
  CD4-CD8-murine T lymphocyte precursors.
                 cell_line

• Recognize named-entities in a sentence.
  – Gene/protein names
  – Protein, DNA, RNA, cell_line, cell_type



Performance of biomedical NE recognition

 • Shared task data for the COLING 2004 BioNLP workshop
    - entity types: protein, DNA, RNA, cell_type, and cell_line
                                   Recall    Precision   F-score
SVM+HMM (Zhou, 2004)                76.0       69.4        72.6
Semi-Markov CRFs (in prep.)         72.7       70.4        71.5
Two-Phase (Kim, 2005)               72.8       69.7        71.2
Sliding Window (in prep.)           71.5       70.2        70.8
CRF (Settles, 2005)                 72.0       69.1        70.5
MEMM (Finkel, 2004)                 71.6       68.6        70.1
                 :                    :          :          :
Features
Classification models and main features used in NLPBA (Kim, 2004)

       CM   lx  af  or  sh  gn  gz  po  np  sy  tr  ab  ca  do  pa  pre  ext.
  Zho  SH       x   x       x   x   x           x   x   x           x
  Fin  M    x   x       x       x   x       x       x       x   x   x    B, W
  Set  C    x   x   x   x      (x)             (x)                  x    (W)
  Son  SC       x   x               x   x                           x    V
  Zha  H    x                                                       x    M

Classification Model (CM): S: SVM; H: HMM; M: MEMM; C: CRF
Features: lx: lexical features; af: affix information (character n-grams); or: orthographic information; sh: word shapes; gn: gene sequences; gz: gazetteers; po: part-of-speech tags; np: noun phrase tags; sy: syntactic tags; tr: word triggers; ab: abbreviations; ca: cascaded entities; do: global document information; pa: parentheses handling; pre: previously predicted entity tags
External resources (ext.): B: British National Corpus; W: WWW; V: virtually generated corpus; M: MEDLINE
CFG parsing

[Figure: context-free phrase-structure parse of "Estimated volume was a light 2.4 million ounces ." — an S whose NP covers "Estimated volume" (VBN NN) and whose VP (VBD …) contains an NP with a QP over "2.4 million" (CD CD); tag sequence: VBN NN VBD DT JJ CD CD NNS .]
Phrase structure + head information

[Figure: the same parse tree for "Estimated volume was a light 2.4 million ounces .", with the lexical head of each phrase marked.]
Dependency relations

[Figure: word-to-word dependency arcs over "Estimated volume was a light 2.4 million ounces ." (VBN NN VBD DT JJ CD CD NNS .).]
CFG parsing algorithms
• Performance on the Penn Treebank

                                              LR     LP     F-score
Generative model (Collins, 1999)              88.1   88.3    88.2
Maxent-inspired (Charniak, 2000)              89.6   89.5    89.5
Simple Synchrony Networks (Henderson, 2004)   89.8   90.4    90.1
Data Oriented Parsing (Bod, 2003)             90.8   90.7    90.7
Re-ranking (Johnson, 2005)                                   91.0




CFG parsers
• Collins parser
  – http://people.csail.mit.edu/mcollins/code.html


• Bikel's parser
  – http://www.cis.upenn.edu/~dbikel/software.html#stat-parser
• Charniak parser
  – http://www.cs.brown.edu/people/ec/
• Re-ranking parser
  – http://www.cog.brown.edu:16080/~mj/Software.htm
• SSN parser
  – http://homepages.inf.ed.ac.uk/jhender6/parser/ssn_parser.html

Parsing biomedical documents
• CFG parsing accuracies on the GENIA treebank
  (Clegg, 2005)
                              LR      LP      F-score
      Bikel 0.9.8             77.43   81.33   79.33
      Charniak                76.05   77.12   76.58
      Collins model 2         74.49   81.30   77.75
• In order to improve performance,
  – Unsupervised parse combination (Clegg, 2005)
  – Use lexical information (Lease, 2005)
     • 14.2% reduction in error.

Parse tree

[Figure: parse tree for "A normal serum CRP measurement does not exclude deep vein thrombosis", with numbered constituents (NP1, DT2, AJ5, …, VP15, …, NP31) over the words.]
Semantic structure

[Figure: the same tree annotated with predicate-argument relations (ARG1, ARG2, MOD); for example, "exclude" takes "A normal serum CRP measurement" as ARG1 and "deep vein thrombosis" as ARG2.]
Abstraction of surface expressions

[Figure only.]
HPSG parsing

• HPSG
  – A few schemata
  – Many lexical entries
  – Deep syntactic analysis
• Grammar
  – Corpus-based grammar construction (Miyao et al 2004)
• Parser
  – Beam search (Tsuruoka et al.)

[Figure: HPSG derivation of "Mary walked slowly". Lexical entries: Mary (HEAD: noun, SUBJ: <>, COMPS: <>), walked (HEAD: verb, SUBJ: <noun>, COMPS: <>), slowly (HEAD: adv, MOD: verb). The head-modifier schema combines "walked" and "slowly"; the subject-head schema then combines the result with "Mary", yielding HEAD: verb, SUBJ: <>, COMPS: <>.]
Experimental results
• Training set: Penn Treebank Section 02-21
  (39,832 sentences)
• Test set: Penn Treebank Section 23 (< 40 words,
  2,164 sentences)
• Accuracy of predicate argument relations (i.e.,
  red arrows) is measured

 Precision     Recall   F-score
  87.9%        86.9%    87.4%

Parsing MEDLINE with HPSG

• Enju
  – A wide-coverage HPSG parser
  – http://www-tsujii.is.s.u-tokyo.ac.jp/enju/




Extraction of Protein-protein Interactions: Predicate-argument relations + SVM

• (Yakushiji, 2005)

  CD4 protein interacts with non-polymorphic regions of MHCII .
  ENTITY1                                             ENTITY2

Extraction patterns are based on predicate-argument relations: the parse normalizes the sentence to the predicate "interact", with arg1 covering ENTITY1 (CD4 protein) and arg2 the phrase "with non-polymorphic region of MHCII" containing ENTITY2. SVM learning is then performed with these predicate-argument patterns.
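A very rough sketch of the matching idea: a pattern fires when both entities appear among the arguments of one predicate (the tuple format and example are invented for illustration):

    def matches_interaction(pred_args, entity1, entity2):
        # pred_args: list of (predicate, {role: filler}) tuples from a deep parser
        for pred, args in pred_args:
            text = ' '.join(args.values())
            if entity1 in text and entity2 in text:
                return pred   # both entities are arguments of the same predicate
        return None

    matches_interaction(
        [('interact', {'arg1': 'CD4 protein',
                       'arg2': 'with non-polymorphic region of MHCII'})],
        'CD4 protein', 'MHCII')
    # 'interact'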
Text Mining for Biology

• MEDIE: An interactive intelligent IR
  system retrieving events
  – Performs a semantic search
• Info-PubMed: an interactive IE system and
  an efficient PubMed search tool, helping
  users to find information about biomedical
  entities such as genes, proteins, and the
  interactions between them.
Medie system overview

Off-line, a deep parser and an entity recognizer convert the input textbase into a semantically annotated textbase. On-line, a search engine based on Region Algebra evaluates queries against the annotated textbase and returns the search results.
Service: extracting interactions

• Info-PubMed: interactive IE system and an
  efficient PubMed search tool, helping users
  to find information about biomedical entities
  such as genes, proteins, and the
  interactions between them.
• System components
  – MEDIE
  – Extraction of protein-protein interactions
  – Multi-window interface on a browser
• UTokyo: NaCTeM self-funded partner
Info-PubMed
• helps biologists to search for their interests
  – genes, proteins, their interactions, and
    evidence sentences
  – extracted from MEDLINE
    (about 15 million abstracts of
     biomedical papers)
• uses many NLP techniques explained
  – in order to achieve high precision of retrieval


Flow Chart

  Input                                      Output
  Gene or protein keywords (token: "TNF")    gene or protein entities
  Gene or protein entity (gene: "TNF")       interactions around the given gene
  Interaction ("TNF" and "IL6")              evidence sentences describing the given interaction
Techniques (1/2)
• Biomedical entity recognition
  in abstract sentences
  – prepare a gene dictionary
  – string match




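A sketch of the dictionary string match, scanning for the longest entry at each position (the tiny dictionary and identifier are illustrative; a real dictionary maps many synonyms to gene identifiers):

    def dictionary_match(sentence, gene_dict):
        # gene_dict: maps surface strings to gene identifiers
        hits, words, i = [], sentence.split(), 0
        while i < len(words):
            # Try the longest candidate first (greedy longest match)
            for j in range(len(words), i, -1):
                candidate = ' '.join(words[i:j])
                if candidate in gene_dict:
                    hits.append((candidate, gene_dict[candidate]))
                    i = j
                    break
            else:
                i += 1
        return hits

    dictionary_match('Secretion of TNF was abolished by BHA',
                     {'TNF': 'GENE:TNF'})
    # [('TNF', 'GENE:TNF')]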
Techniques (2/2)
• Extract sentences describing
  protein-protein interaction
  – deep parser based on HPSG syntax
    • can detect semantic relations between
      phrases
  – domain dependent pattern recognition
    • can learn and expand source patterns
    • by using the result of the deep parser, it
      can extract semantically true patterns
    • not affected by syntactic variations
