Research Proposal
Bioinformatics approach to evaluation of Transcription factor genes and diseases
(Cancer)
Problem Statement:
The purpose of the proposed research is the
development of a computational approach
to quantitatively evaluate associations
between transcription factor-encoding
genes and human diseases, based on
available literature evidence. The approach
will analyze a set of candidate genes and
determine which genes are linked to human
diseases, which properties are involved in
these gene-disease linkages, and which
clusters of similar genes are involved in
particular diseases. During the course of the
research, I shall explore methods for
recapitulating existing associations and
predicting novel associations based on
diverse forms of data pertaining to genes
and diseases. These methods will evaluate
the resulting associations in a quantitative
manner, and the resulting analyses will be
validated to determine the efficacy of the
methods.
Background:
Identification of functional causes and
contributing mechanisms of disease is a
principal aim of biomedical research. In
many cases, the term “disease” broadly
applies to a heterogeneous set of observable
properties, which may arise from multiple
molecular processes. Disease is often
characterized by symptoms and a pattern of
progression over time. Cancer is a
particularly broad category of disease,
encompassing a wide range of complex,
abnormal phenotypes. Compared with
many other diseases, many types of
cancer, such as brain cancer, remain
poorly understood: they are difficult to
characterize and have complex genetic
components involving multiple genes.
Transcription factors are key regulators of
gene expression, acting through processes
such as the recruitment of transcription
initiation factors and conformational
change of DNA, working alone or as part of
protein complexes.
GeneSeeker [1] can find genes within a
chromosomal location that are expressed in
particular tissues, by looking at human and
mouse expression data. Another method, which
associates disease genes with anatomical
locations [2], uses text mining of
PubMed abstracts to link eVOC
anatomical ontology terms to gene names.
Machine learning approaches can be used
when a representative set of disease genes
is available to use as training data. In DGP
[3], a decision tree classification approach
is used to find features common to disease
genes based on a training set composed of
sample disease and control proteins.
Features were protein length, BLASTP
ratios (conservation score) between a
protein and its highest scoring homologue
within taxonomic groups (representing
phylogenetic conservation and extent) and
the conservation score with the closest
paralogue. The study indicates that, on
average, hereditary disease genes (genes
taken from OMIM) in comparison to
randomly selected genes are longer, more
conserved, phylogenetically extended and
without close paralogues.
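
The decision-tree idea can be illustrated with a minimal sketch that finds the single best threshold split (a one-level tree) on one numeric feature by minimizing Gini impurity. This is not the DGP implementation: the function names are hypothetical, and the protein lengths are invented toy values used only to show the mechanics.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 for an empty or pure list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(values, labels):
    """Return (threshold, weighted_impurity) for the best rule 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # start from the unsplit impurity
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[1]:
            best = ((lo + hi) / 2.0, impurity)
    return best


# Invented toy data: protein lengths in amino acids; 1 = disease, 0 = control.
lengths = [210, 250, 300, 620, 700, 850]
labels = [0, 0, 0, 1, 1, 1]
threshold, impurity = best_split(lengths, labels)
print(threshold, impurity)
```

A full decision tree applies this split search recursively over all features; the sketch shows only one level.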
PROSPECTR [4] uses a wider variety of
features, including the length of the gene,
the length of its coding sequence, the length
of its cDNA, length of the protein, GC
content and percentage protein identity
with its nearest homologue in various
species (mouse, worm, fly). The
investigators used an alternating decision
tree, taking genes from OMIM and
comparing against genes not found in
OMIM. They also generated two
independent test sets – one using genes
from the Human Gene Mutation Database
with randomly selected control genes, and
another set of 54 genes not in OMIM, again
with a set of randomly selected control
genes.
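
Sequence-derived features of the kind listed above are simple to compute. The sketch below is an illustration only, not PROSPECTR code: the function names, the dictionary keys, and the short example sequences are assumptions, and the protein length is estimated under the assumption that the coding sequence is complete and ends with a stop codon.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def sequence_features(gene_seq, cds_seq):
    """Collect a few simple sequence-derived features into a dict."""
    return {
        "gene_length": len(gene_seq),
        "cds_length": len(cds_seq),
        # Assumes a complete CDS whose last codon is a stop codon.
        "protein_length": max(len(cds_seq) // 3 - 1, 0),
        "gc_content": gc_content(gene_seq),
    }


# Made-up sequences, far shorter than any real gene.
features = sequence_features("ATGGCCTGAA", "ATGGCCTGA")
print(features)
```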
POCUS [5] takes another machine learning
approach, using a selected training set of
genes linked to the target disease. POCUS
identifies features common to all the
training genes (InterPro domains, GO
annotations, similar expression profiles)
and assesses the probability that such
features would be shared by chance. This
method depends on a carefully selected
training set of genes, and focuses on the
likelihood of these genes all sharing
common, disease-related properties, in
contrast to methods that focus on
overrepresentation of properties among the
training genes.
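
One standard way to quantify "shared by chance" is a hypergeometric tail probability: given a genome of a certain size in which some genes carry an annotation, how likely is it that a randomly drawn gene set contains at least the observed number of annotated genes? The sketch below uses that textbook calculation; it is not the scoring scheme published for POCUS, and the function name and the numbers in the usage lines are invented for illustration.

```python
from math import comb


def prob_shared_by_chance(total_genes, annotated_genes, set_size, shared):
    """P(at least `shared` of `set_size` randomly drawn genes carry the annotation)."""
    upper = min(set_size, annotated_genes)
    favourable = sum(
        comb(annotated_genes, k) * comb(total_genes - annotated_genes, set_size - k)
        for k in range(shared, upper + 1)
    )
    return favourable / comb(total_genes, set_size)


# Invented toy numbers: 10 genes in total, 3 annotated, a set of 2 genes.
p_both = prob_shared_by_chance(10, 3, 2, 2)
p_any = prob_shared_by_chance(10, 3, 2, 0)
print(p_both, p_any)
```

A small probability suggests that the shared annotation is unlikely to be coincidental for a set of that size.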
Proposed Method:
Most of the existing methods for the
computational prediction of linkages
between genes and disease take as input a
preliminary list of candidate genes (e.g.
genes in a genomic region linked in a
genetic study to a disease), and return as
output either a reduced or a ranked list. The
underlying approaches differ substantively
between methods. Examples of
characteristics used in the methods include
numerical features derived from the raw
sequence of genes and/or encoded proteins,
existing annotations of proteins and genes,
and abstracts or articles directly referring to
the gene. The current methods focus on
using properties from a representative set of
genes to identify similar genes from the
candidate set.
We propose a method of extracting gene-
disease associations that will emphasize
verifiable supporting evidence for the
predicted associations, and a quantitative
evaluation of the strength of the association.
We shall investigate both the associations
between genes and disease and the
properties of those gene-disease associations.
We shall consider three base entities –
Genes, Diseases, Evidence – and the
relationships between these entities.
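
The three base entities could be represented roughly as follows. This is a sketch under assumptions, not a description of the planned implementation: the class name, field names, and placeholder identifiers (GENE_A, disease_X, abstract_1) are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One piece of evidence linking a gene to a disease."""
    gene: str
    disease: str
    source: str  # e.g. an abstract or database record identifier (placeholder)


def group_evidence(records):
    """Map each (gene, disease) pair to the list of sources supporting it."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[(rec.gene, rec.disease)].append(rec.source)
    return dict(grouped)


# Placeholder records; none of these names refer to real genes or diseases.
records = [
    Evidence("GENE_A", "disease_X", "abstract_1"),
    Evidence("GENE_A", "disease_X", "abstract_2"),
    Evidence("GENE_B", "disease_X", "abstract_3"),
]
grouped = group_evidence(records)
print(grouped)
```

Keeping the source of every link makes each predicted association traceable to its supporting evidence.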
Goal of Research:
Our goal will be to predict Gene-Disease
relationships based on the existence of
relationships between other entity pairings.
After an initial study of mammalian gene-
disease relationships, we will broaden the
approach to incorporate entity relationships
involving orthologous genes in model
organisms or related diseases. These paths
of supporting evidence will be
quantitatively evaluated, making it possible
both to extract strongly supported gene-
disease linkages and to rank these linkages.
Although the thesis itself will investigate
properties of transcription factor genes in
Cancer diseases, the methods and analysis
will be designed for general application.
For the initial analysis of the main gene-
disease associations.
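
The quantitative evaluation and ranking of evidence paths described in this section could be sketched as below. The proposal does not specify a scoring scheme, so everything here is an assumption: each path is scored as the product of its edge confidences, independent paths are combined with a noisy-OR rule, and all names and weights are invented placeholders.

```python
from math import prod


def path_score(edge_weights):
    """Score of one evidence path: product of its edge confidences in [0, 1]."""
    return prod(edge_weights)


def combine_paths(paths):
    """Noisy-OR over paths assumed independent: 1 - prod(1 - score_i)."""
    return 1.0 - prod(1.0 - path_score(p) for p in paths)


def rank_links(links):
    """Return (score, (gene, disease)) tuples sorted from strongest to weakest."""
    scored = [(combine_paths(paths), pair) for pair, paths in links.items()]
    return sorted(scored, reverse=True)


# Placeholder links: each pair maps to a list of paths, each path to edge weights.
links = {
    ("GENE_A", "disease_X"): [[0.5], [0.5]],   # two direct pieces of evidence
    ("GENE_B", "disease_X"): [[0.5, 0.5]],     # one indirect two-step path
}
ranking = rank_links(links)
print(ranking)
```

Under these assumptions, two direct pieces of evidence outrank a single indirect path of the same edge confidence; a threshold on the combined score would then select the strongly supported linkages.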
References:
1. Van Driel MA, Cuelenaere K,
Kemmeren PP, Leunissen JA,
Brunner HG, et al. (2005)
GeneSeeker: extraction and
integration of human disease-
related information from web-based
genetic databases. Nucleic Acids
Research 33: 758.
2. Tiffin N, Kelso J, Powell A, Pan H,
Bajic V, et al. (2005) Integration of
text- and datamining using
ontologies successfully selects
disease gene candidates. Nucleic
Acids Research 33: 1544-1552.
3. López-Bigas N, Ouzounis C (2004)
Genome-wide identification of
genes likely to be involved in
human genetic disease. Nucleic
Acids Research 32: 3108.
4. Adie E, Adams R, Evans K,
Porteous D, Pickard B (2005)
Speeding disease gene discovery by
sequence-based candidate
prioritization. BMC Bioinformatics
6: 55.
5. Turner F, Clutterbuck D, Semple C
(2003) POCUS: mining genomic sequence
annotation to predict disease genes.
Genome Biology 4: 75.
