2. In this report
What I have learnt
●
What are the gaps in my understanding
●
Outputs so far
●
Reflection on supervision mode
●
Plan outline until December 2008
●
3. 1. What I have learnt – general
General
●
Settled down in a new environment
–
Learnt some of the regulations and how things work in
–
The country
●
The city
●
The university
●
The faculty
●
The school
●
4. What I have learnt – less general
Less general
●
Thesis and paper writing theory and practice
–
Specifically through the CS7100 seminar
●
LaTeX
–
Coding infrastructure
–
Warmed up!
●
Database handling
–
Administration / web applications
–
Specific
●
Biological text mining theory
–
5. Biological text mining
Biological text mining theory
●
Main problems
–
Main challenges
–
Main approaches
–
Communities
–
Events, papers, journals, competitions, etc.
–
40+ papers in my CiteULike account
●
Biological text mining hands on
●
Tools, techniques, and resources
–
i2b2
–
HIV
–
6. Biological text mining theory
Main problems
●
Information retrieval
–
Information extraction
–
Relation extraction
●
Shallow parsing / chunking
–
POS tagging
–
Word sense disambiguation
–
Term variation
–
7. Biological text mining theory (cont.)
Main problems (cont.)
●
Named entity recognition
–
Dictionary based
●
Rule based
●
Machine learning (HMM: Zhou et al.)
●
Hybrid
●
Evaluation
–
Precision, recall, F-score
●
Sensitivity and specificity
●
Not always possible due to the lack of
●
Test Corpora
–
Common domains, techniques, goals
–
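As a reminder of how the evaluation measures listed above relate to each other, here is a minimal sketch in Python. The function name and the counts are invented for illustration; in text mining the number of true negatives is often not well defined, which is one reason specificity is reported less often than precision and recall.

```python
def evaluation_scores(tp, fp, fn, tn):
    """Compute standard measures from confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives
    and true negatives. All counts used below are made-up examples.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0      # identical to sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # Balanced F-score: harmonic mean of precision and recall.
    if precision + recall:
        f_score = 2 * precision * recall / (precision + recall)
    else:
        f_score = 0.0
    return {
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": specificity,
        "f_score": f_score,
    }


scores = evaluation_scores(tp=80, fp=20, fn=10, tn=890)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```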
8. Biological text mining theory (cont.)
Main challenges
●
Deal with sublanguage of biology
–
Build scalable and robust systems
–
Present the results in meaningful and informative ways to the biologist
–
Deal with interdisciplinary aspects
–
Biology – chemistry – medicine
●
Different views / information needs
–
Specific field (biomedicine) – linguistics – computation and data mining
●
9. Main Challenges (cont.)
Specific field (biomedicine) – linguistics – computation and data mining
●
The text is not necessarily written to be comprehensible by automatic techniques
–
The language is dramatically different from that of e.g. newswire
–
Terminology, new and coined terms, usage ambiguity
–
Non-algorithmic, irrational patterns in natural language
–
10. Resources
I am aware of / I am using existing resources
●
Literature repositories/search engines
–
PubMed, MEDLINE, BioMed
●
Google
●
Parsers
–
Stanford Parser
●
GeniaTagger
●
Terminological resources
–
Gene Ontology
●
EMBL-EBI
●
MeSH thesaurus
●
UMLS
●
Gene Synonym Finder, SBO, ...
●
12. Resources (cont.)
I am partially developing tools for
●
Named entity recognition
–
Relation extraction
–
I am fully tackling
●
PPI mining
●
Word sense disambiguation
●
Nominalization
●
I may have to tackle in future
–
Contradiction, negation, contrasts
●
Temporal text mining
●
13. 2. What I still need to learn
Specific
There may be gaps I am unaware of
●
Less reinventing of the wheel
●
Use other software
–
LingPipe, NLTK, Weka, RASP, ABNER, PIE, BioInfer, MALLET, JULIE Lab, SPECIALIST, EMBL-EBI, GNN (Arizona Uni)
●
Use other methods/approaches
–
Machine Learning
●
Dynamic programming
●
CL / Bio text mining theory algorithms
–
Viterbi, HMM, NN, SVM, GA, CRF,
●
...
●
14. 2. What I still need to learn
Specific
Make a resources list on our web page?
●
Similar to the (outdated) Stanford repository
–
15. What I still need to learn – less general
News of the field
●
Areas/opportunities for research
●
Michael Phelps analogy
–
Developing skills for a CV
●
Ways to prove I have the skills I already have
–
Presenting results
●
Reasons, occasions, methods
–
Writing
–
Other workshops by the faculty
●
16. What I still need to learn
General
Writing, writing, writing
●
Binge writing vs. snacking
–
Write as you go
–
Closer to the final output
●
Paper-based dissertation? Something to consider.
●
Review, get feedback, rewrite
–
A pedantic editor
–
17. What I still need to learn – general (cont.)
Stronger coding infrastructure
●
More reusable libraries
–
Config files
–
One-click approach
–
Optimisation
●
Code
–
Database
–
Query optimization
●
Database optimization
●
Server
–
Load balancing
●
Multi-threading
–
Multi-processor
–
18. 3. Outputs so far
Written
●
Background work survey
–
Mid April 2008
●
5 pages (approx. 1000 words)
●
Feedback from supervisor
●
Was never written up
●
Writing sample for CS7100 seminar
–
June 2008
●
Same document as above, revised and rewritten
●
12 pages, 2215 words
●
Feedback from Jim Miles and peer students
●
19. HIV
Understanding of the problem and the goals
●
Presenting the given/wanted as tables/code/query
●
Building code infrastructure
●
Database tables
–
Utility libraries
–
Version control system
–
1500+ lines of documented, reusable code
–
20. HIV summary
Goal: to reproduce a human-produced table
●
Each row has the following main columns
●
HIV GPN (protein name, acc, and gene ID)
–
Human GPN (protein name, acc, and gene ID)
–
A relation (interaction) between the two
–
A description of the interaction
–
The PMIDs that the interaction has been reported in
–
The raw input: the full abstracts
●
21. HIV results
HIV and human GPN names
●
Most were mapped to their entities
–
1237 out of 50416 currently unmapped (2%)
–
Interaction verbs
●
Interesting verbs and stems identified
–
The stems were found in the text
–
Working on stems, so including nominals, etc.
●
Terms extracted from the interaction descriptions in the original data
●
22. Example
SELECT DISTINCT mention FROM index_description_term i WHERE termID = 28;
●
18 variations
●
CD4+ T   | T4 (CD) | CD4+T
CD4, T   | T4(CD)  | T (CD4)
T CD4    | CD4 (T) | CD4+ (T)
CD4(+) T | CD4(+)T | CD4(T)
CD4 T    | CD4+T   | CD(4+) T
T4+ (CD) | CD4(+)T | CD4 T
23. Example
SELECT DISTINCT mention FROM index_description_term i WHERE termID = 28 OR termID = 17;
●
28 variations
●
CD4+ T   | T4(CD)  | CD4+ (T)    | CD4(+) T cell
CD4, T   | CD4 (T) | CD4(T)      | CD4 Tcell
T CD4    | CD4(+)T | CD(4+) T    | CD4(+) Tcell
CD4(+) T | CD4+T   | CD4 T       | CD4(+)T cell
CD4 T    | CD4(+)T | CD4+ T cell | CD4+Tcell
T4+ (CD) | CD4+T   | CD4, T cell | CD4(+)Tcell
T4 (CD)  | T (CD4) | CD4+ Tcell  | CD4 T cell
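The two example queries can be tried out without the project database. The sketch below uses Python's built-in sqlite3 module with an in-memory table. Only the table name and the mention and termID columns come from the slides; the column types, the sample rows, and the helper name variations are assumptions made for illustration.

```python
import sqlite3

# In-memory stand-in for the real table; the schema is assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE index_description_term (mention TEXT, termID INTEGER)"
)
sample_rows = [
    ("CD4+ T", 28),
    ("CD4+ T", 28),        # repeated mention, removed by DISTINCT
    ("T4 (CD)", 28),
    ("CD4(+) T", 28),
    ("CD4+ T cell", 17),
]
conn.executemany(
    "INSERT INTO index_description_term VALUES (?, ?)", sample_rows
)


def variations(conn, term_ids):
    """Return the distinct surface forms recorded for the given term IDs."""
    placeholders = ", ".join("?" for _ in term_ids)
    query = (
        "SELECT DISTINCT mention FROM index_description_term "
        f"WHERE termID IN ({placeholders}) ORDER BY mention"
    )
    return [row[0] for row in conn.execute(query, list(term_ids))]


print(variations(conn, [28]))       # variants of one term
print(variations(conn, [28, 17]))   # variants of two related terms
```

Using IN with bound parameters is equivalent to chaining OR conditions and keeps the query safe if the list of term IDs grows.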
24. HIV results
POS tagging with GeniaTagger
●
Parsing with Stanford parser
●
Haven't used this data yet
–
Working with sentences as units
●
Normalising terms
●
Tables of synonyms
●
Tables of verb stems and terms
●
Indexes with mention/offset pairs
●
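The synonym tables and mention/offset indexes listed above could be combined roughly as follows. This is a sketch under stated assumptions, not the project code: the synonym table, the sentences, and the function name build_index are invented for illustration, and plain string matching stands in for whatever matching the real system uses.

```python
import re
from collections import defaultdict

# Hypothetical synonym table: surface variant -> normalised term ID.
SYNONYMS = {"CD4+ T": 28, "CD4(+) T": 28, "CD4+ T cell": 17}


def build_index(sentences, synonyms):
    """Map each term ID to (sentence number, character offset, mention)."""
    # Longer variants go first so "CD4+ T cell" is preferred over "CD4+ T".
    variants = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(v) for v in variants))
    index = defaultdict(list)
    for number, sentence in enumerate(sentences):
        for match in pattern.finditer(sentence):
            term_id = synonyms[match.group()]
            index[term_id].append((number, match.start(), match.group()))
    return dict(index)


# Invented sentences, one per unit of analysis.
sentences = [
    "The virus infects CD4+ T cell populations.",
    "Depletion of CD4(+) T lymphocytes was observed.",
]
index = build_index(sentences, SYNONYMS)
for term_id, mentions in sorted(index.items()):
    print(term_id, mentions)
```

Storing the offset next to the mention means a later step can go back to the exact position in the sentence without searching again.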
25. HIV results
Looking for sentences that share all these properties with any of the goal table rows
●
A human-HIV pair of GPNs
–
A verb phrase containing a word with the same stem as the interaction verb
–
Any description term(s)
–
Very high recall (few false negatives)
●
Not-so-high precision (numerous false positives)
●
Optimisation for more complicated queries
●
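The matching criteria above can be sketched as a simple sentence filter. Everything in this example is an assumption for illustration: the row layout, the field names, the sentences, and the use of prefix matching on any word in place of a real stemmer and a proper verb-phrase check.

```python
import re


def matches_row(sentence, row):
    """True if the sentence shares all three properties with a goal row."""
    text = sentence.lower()
    words = re.findall(r"[a-z0-9]+", text)
    # 1. Both members of the human-HIV GPN pair are mentioned.
    has_pair = row["hiv_gpn"].lower() in text and row["human_gpn"].lower() in text
    # 2. Some word starts with the interaction verb stem; this also
    #    catches nominal forms such as "interaction".
    has_stem = any(word.startswith(row["verb_stem"]) for word in words)
    # 3. At least one description term is present.
    has_term = any(term.lower() in text for term in row["description_terms"])
    return has_pair and has_stem and has_term


# Hypothetical goal-table row and invented sentences.
row = {
    "hiv_gpn": "gp120",
    "human_gpn": "CD4",
    "verb_stem": "interact",
    "description_terms": ["T cell"],
}
sentences = [
    "gp120 interacts with CD4 on the T cell surface.",
    "The interaction of gp120 with CD4 was studied in T cell lines.",
    "gp120 was not shown to interact with CD4 in any T cell assay.",
    "CD4 expression was measured.",
]
results = [matches_row(sentence, row) for sentence in sentences]
print(results)
```

The third sentence is accepted even though it is negated, which is the kind of false positive that a loose filter like this one produces while keeping recall high.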
26. HIV next steps
Compare with other PPI mining and GPN recognition tools
●
Find optimum parameters
●
Presentable results
●
Integrate with the interaction ontology
●
Evaluate, compare, present, get feedback
●
Apply to new papers
●
Apply to new organisms
●
Evaluate, compare, present, get feedback...
●
27. Supervision
Good points
●
Moving away from theory to tackling real problems very quickly
–
Micromanagement while I am free to manage my own time and other preferences
–
Planning ahead, which creates commitment
–
Providing common sense, insight, and savvy
–
28. Supervision – good points (cont.)
Providing good starting points while not ruling out my own ideas
–
Good meeting frequency
–
Group meetings?
●
General support
–
Addressing my needs
–
Financial
●
Research interests and preferences
●
29. Supervision
Could be improved
●
Minutes were not always thorough
–
Same for tasklists
–
We could have an agenda for the meetings
–
I write a list of the things that I want to discuss each session
●
Like the one I had for this report; it could have been there when I presented my 3-week plan
●
Same for TEAM meetings and HIV meetings
–
I hope we keep tackling real problems in future
●
30. Plan
End of August
●
Presenting HIV output to the group
–
Writing HIV results
–
Sep
●
Moving to new accommodation (11-20 Sep.)
–
Moving on HIV
–
Applying the ontology
●
Mining new corpora
●
Generalising?
●
31. Plan
Oct
●
Writing up HIV
–
Possible publication
–
Ideas for PhD research
–
Nov
●
Finalise MPhil vs. PhD
–
Finalise PhD research area
–
Work on end of year report
–
Dec
●
Write up EOY report
–
EOY Viva
–
32. References
Ananiadou, Sophia, and John McNaught. 2006. Text Mining for Biology and Biomedicine. Norwood: Artech House, Inc.
●
Spasić, Irena. Some Web Services relevant for biomedical applications. (Presentation slides.)
●
Zhou, GuoDong, Jie Zhang, Jian Su, Dan Shen, and Chew-Lim Tan. 2004. Recognizing names in biomedical texts: a machine learning approach. Bioinformatics 20 (7): 1178-1190.
●