Six Month

Six Month Progress Report

Farzaneh Sarafraz
14 August 2008

In this report
● What I have learnt
● What are the gaps in my understanding
● Outputs so far
● Reflection on supervision mode
● Plan outline until December 2008

1. What I have learnt – general
● General
  – Settled down in a new environment
  – Learnt some of the regulations and how things work in
    ● The country
    ● The city
    ● The university
    ● The faculty
    ● The school

What I have learnt – less general
● Less general
  – Thesis and paper writing theory and practice
    ● Specifically through the CS7100 seminar
  – LaTeX
  – Coding infrastructure
    ● Warmed up!
  – Database handling
  – Administration / web applications
● Specific
  – Biological text mining theory

● Biological text mining
  – Main problems
  – Main challenges
  – Main approaches
  – Communities
  – Events, papers, journals, competitions, etc.
    ● 40+ papers in my CiteULike account
● Biological text mining hands-on
  – Tools, techniques, and resources
  – i2b2
  – HIV

● Main problems
  – Information retrieval
  – Information extraction
    ● Relation extraction
  – Shallow parsing / chunking
  – POS tagging
  – Word sense disambiguation
  – Term variation

(cont.)
● Main problems (cont.)
  – Named entity recognition
    ● Dictionary based
    ● Rule based
    ● Machine learning (HMM: Zhou et al.)
    ● Hybrid
  – Evaluation
    ● Precision, recall, F-score
    ● Sensitivity and specificity
    ● Not always possible due to the lack of
      – Test corpora
      – Common domains, techniques, goals
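
As a reminder of how the evaluation measures listed on this slide relate to each other, here is a minimal Python sketch. The function name `evaluation_metrics` and the counts in the usage line are made up for illustration; they are not results from this work. In text mining the true-negative count is often not well defined, so specificity may not be computable in practice.

```python
def evaluation_metrics(tp, fp, fn, tn):
    """Compute standard measures from confusion-matrix counts.

    tp, fp, fn, tn: true positives, false positives,
    false negatives and true negatives (illustrative inputs).
    """
    def ratio(numerator, denominator):
        # Return None instead of dividing by zero when a measure is undefined.
        return numerator / denominator if denominator else None

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)        # recall is the same quantity as sensitivity
    specificity = ratio(tn, tn + fp)

    if precision is None or recall is None or precision + recall == 0:
        f_score = None
    else:
        # Balanced F-score: harmonic mean of precision and recall.
        f_score = 2 * precision * recall / (precision + recall)

    return {
        "precision": precision,
        "recall": recall,
        "f_score": f_score,
        "specificity": specificity,
    }


# Usage with invented counts.
print(evaluation_metrics(tp=8, fp=2, fn=2, tn=88))
```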

(cont.)
● Main challenges
  – Deal with the sublanguage of biology
  – Build scalable and robust systems
  – Present the results in meaningful and informative ways to the biologist
  – Deal with interdisciplinary aspects
    ● Biology – chemistry – medicine
      – Different views / information needs
    ● Specific field (biomedicine) – linguistics – computation and data mining

Main Challenges (cont.)
● Specific field (biomedicine) – linguistics – computation and data mining
  – The text is not necessarily written to be comprehensible by automatic techniques
  – The language is dramatically different from that of e.g. newswire
  – Terminology, new and coined terms, usage ambiguity
  – Non-algorithmic, irrational patterns in NL

Resources
● I am aware of / I am using existing resources
  – Literature repositories/search engines
    ● PubMed, MEDLINE, BioMed
    ● Google
  – Parsers
    ● Stanford Parser
    ● GeniaTagger
  – Terminological resources
    ● Gene Ontology
    ● EMBL-EBI
    ● MeSH thesaurus
    ● UMLS
    ● Gene Synonym Finder, SBO, ...

Resources (cont.)
● Existing resources (cont.)
  – Lexical resources
  – Web services
    ● Entrez
    ● Taverna
    ● SBO

Resources (cont.)
● I am partially developing tools for
  – Named entity recognition
  – Relation extraction
● I am fully tackling
  – PPI mining
  – Word sense disambiguation
  – Nominalization
● I may have to tackle in future
  – Contradiction, negation, contrasts
  – Temporal text mining

2. What I still need to learn – Specific
● There may be gaps I am unaware of
● Less wheel reinvention
  – Use other software
    ● LingPipe, NLTK, Weka, RASP, ABNER, PIE, BioInfer, MALLET, JULIE Lab, SPECIALIST, EMBL-EBI, GNN (Arizona Uni)
  – Use other methods/approaches
    ● Machine learning
    ● Dynamic programming
  – CL / Bio text mining theory algorithms
    ● Viterbi, HMM, NN, SVM, GA, CRF, ...

2. What I still need to learn – Specific
● Make a resources list on our web page?
  – Similar to the Stanford (outdated) repository

What I still need to learn – Less general
● News of the field
● Areas/opportunities for research
  – Michael Phelps analogy
● Developing skills for a CV
  – Ways to prove I have the skills I already have
● Presenting results
  – Reasons, occasions, methods
  – Writing
● Other workshops by the faculty

What I still need to learn – General
● Writing, writing, writing
  – Binge writing vs. snacking
  – Write as you go
    ● Closer to the final output
    ● Paper-based dissertation? Something to consider.
  – Review, get feedback, rewrite
  – A pedantic editor

What I still need to learn – General (cont.)
● Stronger coding infrastructure
  – More reusable libraries
  – Config files
  – One-click approach
● Optimisation
  – Code
  – Database
    ● Query optimization
    ● Database optimization
  – Server
    ● Load balancing
  – Multi-threading
  – Multi-processor

3. Outputs so far
● Written
  – Background work survey
    ● Mid-April 2008
    ● 5 pages (approx. 1000 words)
    ● Feedback from supervisor
    ● Was never written up
  – Writing sample for CS7100 seminar
    ● June 2008
    ● Same document as above, revised and rewritten
    ● 12 pages, 2215 words
    ● Feedback from Jim Miles and peer students

HIV
● Understanding of the problem and the goals
● Presenting the given/wanted as tables/code/query
● Building code infrastructure
  – Database tables
  – Utility libraries
  – Version control system
  – 1500+ lines of documented, reusable code

HIV summary
● Goal: to reproduce a human-produced table
● Each row has the following main columns
  – HIV GPN (protein name, acc, and gene ID)
  – Human GPN (protein name, acc, and gene ID)
  – A relation (interaction) between the two
  – A description of the interaction
  – The PMIDs that the interaction has been reported in
● The raw input: the full abstracts

HIV results
● HIV and human GPN names
  – Most were mapped to their entities
  – 1237 out of 50416 currently unmapped (about 2.5%)
● Interaction verbs
  – Interesting verbs and stems identified
  – The stems were found in the text
    ● Working on stems, so including nominals, etc.
● Terms extracted from the interaction descriptions in the original data

Example
● SELECT DISTINCT mention FROM index_description_term i WHERE termID = 28;
● 18 variations

CD4+ T      T4 (CD)     CD4+T
CD4, T      T4(CD)      T (CD4)
T CD4       CD4 (T)     CD4+ (T)
CD4(+) T    CD4(+)T     CD4(T)
CD4 T       CD4+T       CD(4+) T
T4+ (CD)    CD4(+)T     CD4 T

Example
● SELECT DISTINCT mention FROM index_description_term i WHERE termID = 28 OR termID = 17;
● 28 variations

CD4+ T      T4(CD)      CD4+ (T)       CD4(+) T cell
CD4, T      CD4 (T)     CD4(T)         CD4 Tcell
T CD4       CD4(+)T     CD(4+) T       CD4(+) Tcell
CD4(+) T    CD4+T       CD4 T          CD4(+)T cell
CD4 T       CD4(+)T     CD4+ T cell    CD4+Tcell
T4+ (CD)    CD4+T       CD4, T cell    CD4(+)Tcell
T4 (CD)     T (CD4)     CD4+ Tcell     CD4 T cell
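
The variants listed above differ mostly in punctuation, spacing and word order, so one possible way to group them is to reduce each mention to a crude key. The Python sketch below is only an illustration of that idea and is not the approach used in this work: the names `variant_key` and `group_variants` are hypothetical, only the fused form "Tcell" is handled specially, and sorting the tokens is an aggressive choice that could merge unrelated terms in a larger vocabulary.

```python
import re


def variant_key(mention):
    """Map a surface form to a crude key shared by its spelling variants."""
    text = mention.lower()
    # Split the fused form "Tcell" / "T-cell" (assumption: the only fused
    # pattern handled in this sketch).
    text = re.sub(r"\bt-?cell\b", "t cell", text)
    # Keep only runs of letters or digits; brackets, commas, '+' and
    # whitespace are dropped.
    tokens = re.findall(r"[a-z]+|[0-9]+", text)
    # Sort so that word order ("T CD4" vs "CD4 T") does not matter.
    return " ".join(sorted(tokens))


def group_variants(mentions):
    """Group mentions that share the same key."""
    groups = {}
    for mention in mentions:
        groups.setdefault(variant_key(mention), []).append(mention)
    return groups


# Usage on a few of the mentions from the tables above.
for key, members in group_variants(
    ["CD4+ T", "T (CD4)", "CD(4+) T", "CD4+ Tcell", "CD4, T cell"]
).items():
    print(key, members)
```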

HIV results
● POS tagging with GeniaTagger
● Parsing with Stanford parser
  – Haven't used this data yet

● Working with sentences as units
● Normalising terms
● Tables of synonyms
● Tables of verb stems and terms
● Indexes with mention/offset pairs
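
To make the idea of an index with mention/offset pairs concrete, here is a small Python sketch. It is an assumption about what such an index could look like, not the actual database schema: `build_mention_index` is a hypothetical name, the `synonyms` dictionary stands in for the synonym tables, and the example sentence is invented.

```python
import re
from collections import defaultdict


def build_mention_index(sentences, synonyms):
    """Return {term_id: [(sentence_no, start_offset, mention), ...]}.

    sentences: list of sentence strings (sentences are the unit of work).
    synonyms:  {term_id: [surface forms]}, a stand-in for the synonym tables.
    """
    index = defaultdict(list)
    for term_id, forms in synonyms.items():
        # Try longer forms first so a long variant wins over its prefix.
        ordered = sorted(forms, key=len, reverse=True)
        alternatives = "|".join(re.escape(form) for form in ordered)
        # Require a non-word character (or the string edge) on both sides.
        pattern = re.compile(rf"(?<!\w)(?:{alternatives})(?!\w)")
        for sentence_no, sentence in enumerate(sentences):
            for match in pattern.finditer(sentence):
                index[term_id].append((sentence_no, match.start(), match.group()))
    return dict(index)


# Usage with an invented sentence and two of the variants of term 28.
example = build_mention_index(
    ["Depletion of CD4(+) T cells was reported."],
    {28: ["CD4+ T", "CD4(+) T"]},
)
print(example)
```

Storing the character offset alongside the mention lets later steps check whether two mentions fall in the same sentence without searching the text again.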

HIV results
● Looking for sentences that share all these properties with any of the goal table rows
  – A human-HIV pair of GPNs
  – A verb phrase containing a word with the same stem as the interaction verb
  – Any description term(s)
● Very high recall (few false negatives)
● Not-so-high precision (numerous false positives)
● Optimisation for more complicated queries
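
The sentence filter described on this slide can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the actual implementation: the row keys (`hiv_names`, `human_names`, `verb_stem`, `description_terms`) and function names are hypothetical, names are matched by plain substring search, and the verb test is a prefix match on the stem that ignores verb-phrase boundaries (no parser output is used).

```python
import re


def matches_row(sentence, row):
    """True if the sentence has all three properties for one goal-table row.

    row is a dict with hypothetical keys: 'hiv_names', 'human_names',
    'verb_stem' and 'description_terms'.
    """
    lowered = sentence.lower()
    words = re.findall(r"[a-z0-9]+", lowered)

    def mentions_any(names):
        # Naive substring search; a real system would use the mention index.
        return any(name.lower() in lowered for name in names)

    has_pair = mentions_any(row["hiv_names"]) and mentions_any(row["human_names"])
    # A prefix match on the stem also catches nominal forms such as "binding".
    has_verb = any(word.startswith(row["verb_stem"]) for word in words)
    has_term = mentions_any(row["description_terms"])
    return has_pair and has_verb and has_term


def candidate_sentences(sentences, rows):
    """Pair each sentence number with the row numbers it matches."""
    hits = []
    for sentence_no, sentence in enumerate(sentences):
        for row_no, row in enumerate(rows):
            if matches_row(sentence, row):
                hits.append((sentence_no, row_no))
    return hits


# Usage with one illustrative row and two invented sentences.
rows = [{
    "hiv_names": ["gp120"],
    "human_names": ["CD4"],
    "verb_stem": "bind",
    "description_terms": ["T cell"],
}]
sentences = [
    "gp120 binding to CD4 on T cells was reported.",
    "CD4 is expressed on T cells.",
]
print(candidate_sentences(sentences, rows))
```

Loose matching of this kind is consistent with the pattern noted above: few true sentences are missed, but many unrelated sentences also pass the filter.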

HIV next steps
● Compare with other PPI mining and GPN recognition tools
● Find optimum parameters
● Presentable results
● Integrate with the interaction ontology
● Evaluate, compare, present, get feedback
● Apply to new papers
● Apply to new organisms
● Evaluate, compare, present, get feedback...

Supervision
● Good points
  – Moving away from theory to tackling real problems very quickly
  – Micromanagement while I am free to manage my own time and other preferences
  – Planning ahead, causing commitment
  – Providing common sense, insight, and savvy

Supervision – good points (cont.)
– Providing good starting points while not ruling out my own ideas
– Good meeting frequency
  ● Group meetings?
– General support
– Addressing my needs
  ● Financial
  ● Research interests and preferences

Supervision
● Could be improved
  – Minutes were not always thorough
  – Same for task lists
  – We could have an agenda for the meetings
    ● I write a list of the things that I want to discuss each session
    ● Like the one I had for this report – could have been there when I presented my 3-week plan
  – Same for TEAM meetings and HIV meetings
● I hope we keep tackling real problems in future

Plan
● End of August
  – Presenting HIV output to the group
  – Writing HIV results
● Sep
  – Moving to new accommodation (11-20 Sep.)
  – Moving on HIV
    ● Applying the ontology
    ● Mining new corpora
    ● Generalising?

Plan
● Oct
  – Writing up HIV
  – Possible publication
  – Ideas for PhD research
● Nov
  – Finalise MPhil vs. PhD
  – Finalise PhD research area
  – Work on end of year report
● Dec
  – Write up EOY report
  – EOY Viva

References
● Ananiadou, Sophia, and John McNaught. 2006. Text Mining for Biology and Biomedicine. Norwood: Artech House, Inc.
● Spasić, Irena. Some Web Services relevant for biomedical applications. (Presentation slides.)
● Zhou, GuoDong, Jie Zhang, Jian Su, Dan Shen, and Chew-Lim Tan. 2004. Recognizing names in biomedical texts: a machine learning approach. Bioinformatics. Vol. 20, no. 7, pp. 1178-1190.
