Six Month Progress Report

Farzaneh Sarafraz
14 August 2008
In this report
● What I have learnt
● What are the gaps in my understanding
● Outputs so far
● Reflection on supervision mode
● Plan outline until December 2008
1. What I have learnt – general
● General
    – Settled down in a new environment
    – Learnt some of the regulations and how things work in
        ● The country
        ● The city
        ● The university
        ● The faculty
        ● The school
What I have learnt – less general
● Less general
    – Thesis and paper writing theory and practice
        ● Specifically through the CS7100 seminar
    – LaTeX
    – Coding infrastructure
        ● Warmed up!
    – Database handling
    – Administration / web applications
● Specific
    – Biological text mining theory
Biological text mining
● Biological text mining theory
    – Main problems
    – Main challenges
    – Main approaches
    – Communities
    – Events, papers, journals, competitions, etc.
        ● 40+ papers in my CiteULike account
● Biological text mining hands on
    – Tools, techniques, and resources
    – i2b2
    – HIV
Biological text mining theory
● Main problems
    – Information retrieval
    – Information extraction
        ● Relation extraction
    – Shallow parsing / chunking
    – POS tagging
    – Word sense disambiguation
    – Term variation
Biological text mining theory (cont.)
● Main problems (cont.)
    – Named entity recognition
        ● Dictionary based
        ● Rule based
        ● Machine learning (HMM: Zhou et al.)
        ● Hybrid
    – Evaluation
        ● Precision, recall, F-score
        ● Sensitivity and specificity
        ● Not always possible due to the lack of
            – Test corpora
            – Common domains, techniques, goals
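The evaluation measures named above can be sketched as follows. This is a minimal illustration, not code from the project: the `Confusion` class and the function names are hypothetical, and the counts in the usage note are made up.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts from comparing system output with a gold standard (hypothetical helper)."""
    tp: int      # true positives
    fp: int      # false positives
    fn: int      # false negatives
    tn: int = 0  # true negatives (often unavailable in text mining)


def precision(c: Confusion) -> float:
    # Fraction of reported items that are correct.
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0


def recall(c: Confusion) -> float:
    # Fraction of gold-standard items that were found (same as sensitivity).
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0


def f_score(c: Confusion, beta: float = 1.0) -> float:
    # Weighted harmonic mean of precision and recall; beta=1 gives F1.
    p, r = precision(c), recall(c)
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)


def specificity(c: Confusion) -> float:
    # Fraction of true negatives among all actual negatives.
    return c.tn / (c.tn + c.fp) if (c.tn + c.fp) else 0.0
```

For example, `Confusion(tp=8, fp=2, fn=2, tn=88)` gives precision and recall of 0.8 each. Specificity needs true negatives, which are hard to count without an annotated test corpus.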
Biological text mining theory (cont.)
● Main challenges
    – Deal with sublanguage of biology
    – Build scalable and robust systems
    – Present the results in meaningful and informative ways to the biologist
    – Deal with interdisciplinary aspects
        ● Biology – chemistry – medicine
            – Different views / information needs
        ● Specific field (biomedicine) – linguistics – computation and data mining
Main Challenges (cont.)
● Specific field (biomedicine) – linguistics – computation and data mining
    – The text is not necessarily written to be comprehensible by automatic techniques
    – The language is dramatically different from that of e.g. newswire
    – Terminology, new and coined terms, usage ambiguity
    – Non-algorithmic, irrational patterns in NL
Resources
● I am aware of / I am using existing resources
    – Literature repositories/search engines
        ● PubMed, MEDLINE, BioMed
        ● Google
    – Parsers
        ● Stanford Parser
        ● GeniaTagger
    – Terminological resources
        ● Gene Ontology
        ● EMBL-EBI
        ● MeSH thesaurus
        ● UMLS
        ● Gene Synonym Finder, SBO, ...
Resources (cont.)
● Existing resources (cont.)
    – Lexical resources
    – Web services
        ● Entrez
        ● Taverna
        ● SBO
Resources (cont.)
● I am partially developing tools for
    – Named entity recognition
    – Relation extraction
● I am fully tackling
    – PPI mining
    – Word sense disambiguation
    – Nominalization
    – I may have to tackle in future
        ● Contradiction, negation, contrasts
        ● Temporal text mining
2. What I still need to learn – Specific
● There may be gaps I am unaware of
● Less of wheel reinvention
    – Use other software
        ● Lingpipe, NLTK, Weka, RASP, ABNER, PIE, BIOINFER, MALLET, Julielab, SPECIALIST, EMBL-EBI, GNN (Arizona Uni)
    – Use other methods/approaches
        ● Machine learning
        ● Dynamic programming
    – CL / Bio text mining theory algorithms
        ● Viterbi, HMM, NN, SVM, GA, CRF,
        ● ...
2. What I still need to learn – Specific
● Make a resources list on our web page?
    – Similar to the Stanford (outdated) repository
What I still need to learn – Less general
● News of the field
● Areas/opportunities for research
    – Michael Phelps analogy
● Developing skills for a CV
    – Ways to prove I have the skills I already have
● Presenting results
    – Reasons, occasions, methods
    – Writing
● Other workshops by the faculty
What I still need to learn – General
● Writing, writing, writing
    – Binge writing vs. snacking
    – Write as you go
        ● Closer to the final output
        ● Paper-based dissertation? Something to consider.
    – Review, get feedback, rewrite
    – A pedantic editor
What I still need to learn – General (cont.)
● Stronger coding infrastructure
    – More reusable libraries
    – Config files
    – One-click approach
● Optimisation
    – Code
    – Database
        ● Query optimisation
        ● Database optimisation
    – Server
        ● Load balancing
            – Multi-threading
            – Multi-processor
3. Outputs so far
● Written
    – Background work survey
        ● Mid April 2008
        ● 5 pages (approx. 1000 words)
        ● Feedback from supervisor
        ● Was never written up
    – Writing sample for CS7100 seminar
        ● June 2008
        ● Same document as above, revised and rewritten
        ● 12 pages, 2215 words
        ● Feedback from Jim Miles and peer students
HIV
● Understanding of the problem and the goals
● Presenting the given/wanted as tables/code/query
● Building code infrastructure
    – Database tables
    – Utility libraries
    – Version control system
    – 1500+ lines of documented, reusable code
HIV summary
● Goal: to reproduce a human-produced table
● Each row has the following main columns
    – HIV GPN (protein name, acc, and gene ID)
    – Human GPN (protein name, acc, and gene ID)
    – A relation (interaction) between the two
    – A description of the interaction
    – The PMIDs that the interaction has been reported in
● The raw input: the full abstracts
HIV results
● HIV and human GPN names
    – Most were mapped to their entities
    – 1237 out of 50416 currently unmapped (approx. 2.5%)
● Interaction verbs
    – Interesting verbs and stems identified
    – The stems were found in the text
        ● Working on stems, so including nominals, etc.
● Terms extracted from the interaction descriptions in the original data
Example
● SELECT DISTINCT mention FROM index_description_term i WHERE termID=28;
    ● 18 variations

        CD4+ T        T4 (CD)       CD4+T
        CD4-, T       T4(CD)        T (CD4)
        T CD4         CD4 (T)       CD4+ (T)
        CD4(+) T      CD4(+)T       CD4(T)
        CD4 T         CD4+-T        CD(4+) T
        T4+ (CD)      CD4(+)-T      CD4- T
Example
● SELECT DISTINCT mention FROM index_description_term i WHERE termID=28 OR termID=17;
    ● 28 variations

        CD4+ T        T4(CD)        CD4+ (T)        CD4(+) T cell
        CD4-, T       CD4 (T)       CD4(T)          CD4 T-cell
        T CD4         CD4(+)T       CD(4+) T        CD4(+) T-cell
        CD4(+) T      CD4+-T        CD4- T          CD4(+)T cell
        CD4 T         CD4(+)-T      CD4+ T cell     CD4+-T-cell
        T4+ (CD)      CD4+T         CD4-, T cell    CD4(+)-T-cell
        T4 (CD)       T (CD4)       CD4+ T-cell     CD4 T cell
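The two example queries can be tried out on a toy table. The sketch below uses Python's built-in `sqlite3` module and assumes a two-column layout (`termID`, `mention`) for `index_description_term`. The real schema and database engine are not given in the slides, and the rows inserted here are a small made-up sample. `distinct_mentions` is a hypothetical helper.

```python
import sqlite3

# In-memory toy database; the column layout is an assumption.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE index_description_term (termID INTEGER, mention TEXT)")
sample_rows = [
    (28, "CD4+ T"),
    (28, "CD4+ T"),        # repeated mention, removed by DISTINCT
    (28, "T4 (CD)"),
    (28, "CD4(+)T"),
    (17, "CD4+ T cell"),
    (17, "CD4 T-cell"),
]
conn.executemany("INSERT INTO index_description_term VALUES (?, ?)", sample_rows)


def distinct_mentions(conn, term_ids):
    """Return the distinct surface forms recorded for the given term IDs."""
    placeholders = ", ".join("?" for _ in term_ids)
    sql = (
        "SELECT DISTINCT mention FROM index_description_term "
        f"WHERE termID IN ({placeholders}) ORDER BY mention"
    )
    return [mention for (mention,) in conn.execute(sql, list(term_ids))]


print(distinct_mentions(conn, [28]))
print(distinct_mentions(conn, [28, 17]))
```

`WHERE termID IN (28, 17)` is equivalent to the `termID=28 OR termID=17` form on the slide. The placeholders keep the IDs out of the SQL string.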
HIV results
● POS tagging with GeniaTagger
● Parsing with Stanford parser
    – Haven't used this data yet
● Working with sentences as units
● Normalising terms
● Tables of synonyms
● Tables of verb stems and terms
● Indexes with mention/offset pairs
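An index of mention/offset pairs can be sketched as below. This only illustrates the idea and is not the project code: `build_mention_index` and the shape of the synonym table (term ID mapped to a list of surface forms) are assumptions, and matching is plain exact-string search.

```python
import re
from typing import Dict, List, Tuple


def build_mention_index(
    text: str, synonyms: Dict[int, List[str]]
) -> Dict[int, List[Tuple[str, int]]]:
    """Map each term ID to the (mention, character offset) pairs found in text."""
    index: Dict[int, List[Tuple[str, int]]] = {}
    for term_id, variants in synonyms.items():
        hits: List[Tuple[str, int]] = []
        for variant in variants:
            # re.escape treats characters such as '+' and '(' literally.
            for match in re.finditer(re.escape(variant), text):
                hits.append((match.group(0), match.start()))
        # Order hits by position in the text.
        index[term_id] = sorted(hits, key=lambda hit: hit[1])
    return index


text = "gp120 binds CD4+ T cells; CD4(+)T counts fall."
synonyms = {28: ["CD4+ T", "CD4(+)T"]}
index = build_mention_index(text, synonyms)
print(index)
```

Storing the offset next to the surface form means a normalised term ID can be traced back to the exact place in the sentence where it occurred.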
HIV results
● Looking for sentences that share all these properties with any of the goal table rows
    – A human-HIV pair of GPN
    – A verb phrase containing a word with the same stem as the interaction verb
    – Any description term(s)
● Very high recall (few false negatives)
● Not-so-high precision (numerous false positives)
● Optimisation for more complicated queries
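The sentence filter described above can be sketched roughly as follows. Everything here is a simplifying assumption rather than the project's implementation: `GoalRow` and `sentence_matches` are hypothetical names, names and terms are matched by case-insensitive substring search, and "same stem" is approximated by a prefix test on each word instead of a real stemmer or a parsed verb phrase.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class GoalRow:
    """One row of the goal table, reduced to what the filter needs (hypothetical)."""
    hiv_names: FrozenSet[str]          # synonyms of the HIV GPN
    human_names: FrozenSet[str]        # synonyms of the human GPN
    verb_stem: str                     # stem of the interaction verb, lower case
    description_terms: FrozenSet[str]  # terms from the interaction description


def sentence_matches(sentence: str, row: GoalRow) -> bool:
    lowered = sentence.lower()
    words = re.findall(r"[a-z]+", lowered)
    has_hiv = any(name.lower() in lowered for name in row.hiv_names)
    has_human = any(name.lower() in lowered for name in row.human_names)
    # Crude stand-in for stemming: a word counts if it starts with the stem.
    has_verb = any(word.startswith(row.verb_stem) for word in words)
    has_term = any(term.lower() in lowered for term in row.description_terms)
    return has_hiv and has_human and has_verb and has_term


row = GoalRow(
    hiv_names=frozenset({"gp120"}),
    human_names=frozenset({"CD4"}),
    verb_stem="bind",
    description_terms=frozenset({"T cell"}),
)
print(sentence_matches("The binding of gp120 to CD4 on T cells was observed.", row))
print(sentence_matches("gp120 was detected in serum.", row))
```

Loose matching of this kind accepts nominal forms such as "binding", which fits the high-recall, lower-precision behaviour reported on this slide.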
HIV next steps
● Compare with other PPI mining and GPN recognition tools
● Find optimum parameters
● Presentable results
● Integrate with the interaction ontology
● Evaluate, compare, present, get feedback
● Apply to new papers
● Apply to new organisms
● Evaluate, compare, present, get feedback...
Supervision
● Good points
    – Moving away from theory to tackling real problems very quickly
    – Micromanagement while I am free to manage my own time and other preferences
    – Planning ahead, causing commitment
    – Providing common sense, insight, and savvy
Supervision – good points (cont.)
    – Providing good starting points while not ruling out my own ideas
    – Good meeting frequency
        ● Group meetings?
    – General support
    – Addressing my needs
        ● Financial
        ● Research interests and preferences
Supervision
● Could be improved
    – Minutes were not always thorough
    – Same for task lists
    – We could have an agenda for the meetings
        ● I write a list of the things that I want to discuss each session
        ● Like the one I had for this report; it could have been there when I presented my 3-week plan
    – Same for TEAM meetings and HIV meetings
● I hope we keep tackling real problems in future
Plan
● End of August
    – Presenting HIV output to the group
    – Writing HIV results
● Sep
    – Moving to new accommodation (11-20 Sep.)
    – Moving on HIV
        ● Applying the ontology
        ● Mining new corpora
        ● Generalising?
Plan
● Oct
    – Writing up HIV
    – Possible publication
    – Ideas for PhD research
● Nov
    – Finalise MPhil vs. PhD
    – Finalise PhD research area
    – Work on end of year report
● Dec
    – Write up EOY report
    – EOY Viva
References
● Ananiadou, Sophia, and John McNaught. 2006. Text Mining for Biology and Biomedicine. Norwood: Artech House, Inc.
● Spasić, Irena. Some Web Services relevant for biomedical applications. (Presentation slides.)
● Zhou, GuoDong, Jie Zhang, Jian Su, Dan Shen, and ChewLim Tan. 2004. Recognizing names in biomedical texts: a machine learning approach. Bioinformatics. Vol. 20, no. 7, pp. 1178-1190.
