Research proposal sjtu

Research Proposal
Bioinformatics Approach to the Evaluation of Transcription Factor Genes and Diseases
(Cancer)
Problem Statement:
The purpose of the proposed research is the
development of a computational approach
to quantitatively evaluate associations
between transcription factor encoding
genes and human diseases, based on
available literature evidence. The approach
will analyze a set of candidate genes and
determine which genes are linked to human
diseases, which properties are involved in
these gene-disease linkages, and which
clusters of similar genes are involved in
particular diseases. During the course of the
research, I shall explore methods for
recapitulating existing associations and
predicting novel associations based on
diverse forms of data pertaining to genes
and diseases. These methods will evaluate
the resulting associations in a quantitative
manner, and the resulting analyses will be
validated to determine the efficacy of the
methods.
Background:
Identification of functional causes and
contributing mechanisms of disease is a
principal aim of biomedical research. In
many cases, the term “disease” broadly
applies to a heterogeneous set of observable
properties, which may arise from multiple
molecular processes. Disease is often
characterized by symptoms and a pattern of
progression over time. Cancer is a
particularly broad area of disease,
encompassing a wide range of complex,
abnormal phenotypes. Compared to many
other diseases, types of cancer such as
brain cancer tend to be poorly understood:
many are difficult to characterize and have
complex genetic components involving
multiple genes.
Transcription factors are key regulators of
gene expression. They act through processes
such as the recruitment of transcription
initiation factors and conformational
change of DNA, working alone or as part of
protein complexes.
GeneSeeker [1] can find genes within a
chromosomal location that are localized in
particular tissues, by looking at human and
mouse expression data. Another method,
which associates disease genes with
anatomical locations [2], performed text
mining of PubMed abstracts to associate
eVOC anatomical ontology terms with gene
names. Machine learning approaches can be
used when a representative set of disease
genes is available as training data. In DGP
[3], a decision tree classification approach
is used to find features common to disease
genes based on a training set composed of
sample disease and control proteins.
Features were protein length, BLASTP
ratios (conservation score) between a
protein and its highest scoring homologue
within taxonomic groups (representing
phylogenetic conservation and extent) and
the conservation score with the closest
paralogue. The study indicates that, on
average, hereditary disease genes (genes
taken from OMIM) are longer, more
conserved, and more phylogenetically
extended than randomly selected genes, and
lack close paralogues.
PROSPECTR [4] uses a wider variety of
features, including the length of the gene,
the length of its coding sequence, the length
of its cDNA, length of the protein, GC
content and percentage protein identity
with its nearest homologue in various
species (mouse, worm, fly). The
investigators used an alternating decision
tree, taking genes from OMIM and
comparing against genes not found in
OMIM. They also generated two
independent test sets – one using genes
from the Human Gene Mutation Database
with randomly selected control genes, and
another set of 54 genes not in OMIM, again
with a set of randomly selected control
genes.
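The sequence-derived features listed above
can be illustrated with a minimal sketch.
The function names and the returned keys are
hypothetical and are not taken from
PROSPECTR itself. The homologue identity
features are omitted because they require an
external alignment step.

```python
def gc_content(seq: str) -> float:
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def sequence_features(gene_seq: str, cds_seq: str, protein_seq: str) -> dict:
    """Collect simple length and composition features for one gene.

    The inputs are plain sequence strings; how they are retrieved
    (database, file format) is outside the scope of this sketch.
    """
    return {
        "gene_length": len(gene_seq),
        "cds_length": len(cds_seq),
        "protein_length": len(protein_seq),
        "gc_content": gc_content(gene_seq),
    }
```

A feature dictionary of this kind would be
computed for each disease gene and each
control gene before being passed to a
classifier such as a decision tree.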
POCUS [5] takes another machine learning
approach, using a selected training set of
genes linked to the target disease. POCUS
identifies common features between all the
training genes – InterPro domains, GO
annotations, similar expression profiles –
and assesses the probability that such
common features would be shared by chance.
This method depends on a carefully selected
training set of genes, and focuses on the
likelihood of these genes all sharing
common, disease-related properties, in
contrast to methods that focus on
overrepresentation of properties among the
training genes.
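One way to quantify how likely it is that a
feature is shared by chance is a
hypergeometric tail probability. The sketch
below assumes that null model for
illustration only; it is not the scoring
scheme published for POCUS, and the function
and parameter names are hypothetical.

```python
from math import comb


def prob_shared_by_chance(genome_size: int, annotated: int,
                          sample_size: int, shared: int) -> float:
    """Probability that at least `shared` of `sample_size` genes drawn
    at random (without replacement) carry a given annotation, when
    `annotated` of the `genome_size` genes carry it."""
    upper = min(sample_size, annotated)
    total = comb(genome_size, sample_size)
    # Sum the hypergeometric probabilities for k = shared .. upper.
    tail = sum(
        comb(annotated, k) * comb(genome_size - annotated, sample_size - k)
        for k in range(shared, upper + 1)
    )
    return tail / total


# Illustrative numbers only: 2 genes drawn from a pool of 10,
# of which 5 carry the annotation, and both drawn genes carry it.
p = prob_shared_by_chance(genome_size=10, annotated=5, sample_size=2, shared=2)
```

A small probability suggests that the shared
feature is unlikely to be a coincidence of
the training set.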
Proposed Method:
Most of the existing methods for the
computational prediction of linkages
between genes and disease take as input a
preliminary list of candidate genes (e.g.
genes in a genomic region linked in a
genetic study to a disease), and return as
output either a reduced or a ranked list. The
underlying approaches differ substantively
between methods. Examples of
characteristics used in the methods include
numerical features derived from the raw
sequence of genes and/or encoded proteins,
existing annotations of proteins and genes,
and abstracts or articles directly referring to
the gene. The current methods focus on
using properties from a representative set of
genes to identify similar genes from the
candidate set.
We propose a method of extracting gene-
disease associations that will emphasize
verifiable supporting evidence for the
predicted associations, and a quantitative
evaluation of the strength of the association.
We shall investigate both associations
between genes and disease, as well as
properties of the gene-disease association.
We shall consider three base entities –
Genes, Diseases, Evidence – and the
relationships between these entities.
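The three entities and a quantitative
strength of association can be sketched as
follows. This is a minimal illustration, not
the proposed implementation: the confidence
weights, the noisy-OR combination rule, and
all names and placeholder values are
assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    gene: str          # gene identifier (placeholder values below)
    disease: str       # disease identifier
    source: str        # pointer to the verifiable supporting record
    confidence: float  # assumed weight in [0, 1]


def association_strength(evidence, gene: str, disease: str) -> float:
    """Combine independent evidence items with a noisy-OR rule:
    strength = 1 - product(1 - confidence)."""
    remaining = 1.0
    for item in evidence:
        if item.gene == gene and item.disease == disease:
            remaining *= 1.0 - item.confidence
    return 1.0 - remaining


# Placeholder records; no real gene-disease data is implied.
records = [
    Evidence("GENE_A", "disease_X", "record-1", 0.5),
    Evidence("GENE_A", "disease_X", "record-2", 0.5),
    Evidence("GENE_B", "disease_X", "record-3", 0.2),
]
strength = association_strength(records, "GENE_A", "disease_X")  # 0.75
```

Keeping the source on every evidence item
means that each score can be traced back to
the records that support it.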
Goal of Research:
Our goal will be to predict Gene-Disease
relationships based on the existence of
relationships between other entity pairings.
After an initial study of mammalian
gene-disease relationships, we will broaden
the approach to incorporate entity
relationships involving orthologous genes in
model organisms or related diseases. These
paths of supporting evidence will be
quantitatively evaluated, making it possible
both to extract strongly supported
gene-disease linkages and to rank these
linkages.
Although the thesis itself will investigate
properties of transcription factor genes in
cancer, the methods and analysis
will be designed for general application.
For the initial analysis of the main gene-
disease associations.
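The idea of scoring paths of supporting
evidence, including indirect paths through
orthologous genes, can be sketched as a
small graph search. This is an illustration
under stated assumptions: edge weights in
[0, 1], a path score equal to the product of
its edge weights, the best path taken as the
linkage score, and hypothetical node names.

```python
from collections import defaultdict


def best_path_scores(edges, source, max_hops=3):
    """Best product-of-weights score from `source` to every node
    reachable by a simple path of at most `max_hops` edges."""
    graph = defaultdict(list)
    for a, b, weight in edges:  # relationships are treated as undirected
        graph[a].append((b, weight))
        graph[b].append((a, weight))
    best = {}
    stack = [(source, 1.0, (source,))]
    while stack:
        node, score, path = stack.pop()
        if len(path) - 1 >= max_hops:
            continue
        for nxt, weight in graph[node]:
            if nxt in path:  # keep paths simple (no repeated nodes)
                continue
            new_score = score * weight
            if new_score > best.get(nxt, 0.0):
                best[nxt] = new_score
            stack.append((nxt, new_score, path + (nxt,)))
    return best


def rank_links(edges, genes, diseases, max_hops=3):
    """Return (gene, disease, score) tuples, strongest linkage first."""
    ranked = []
    for gene in genes:
        scores = best_path_scores(edges, gene, max_hops)
        ranked.extend((gene, d, scores[d]) for d in diseases if d in scores)
    return sorted(ranked, key=lambda link: link[2], reverse=True)


# Placeholder relationships; no real data is implied.
edges = [
    ("human:GENE_A", "mouse:Gene_a", 0.9),   # orthology
    ("mouse:Gene_a", "disease_X", 0.5),      # model-organism evidence
    ("human:GENE_B", "disease_X", 0.3),      # direct evidence
]
links = rank_links(edges, ["human:GENE_A", "human:GENE_B"], ["disease_X"])
```

Here GENE_A has no direct evidence but is
ranked through its orthologue, which is the
kind of indirect support the approach aims
to capture.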
References:
1. Van Driel MA, Cuelenaere K,
Kemmeren PP, Leunissen JA,
Brunner HG, et al. (2005)
GeneSeeker: extraction and
integration of human disease-
related information from web-based
genetic databases. Nucleic Acids
Research 33: 758.
2. Tiffin N, Kelso J, Powell A, Pan H,
Bajic V, et al. (2005) Integration of
text- and datamining using
ontologies successfully selects
disease gene candidates. Nucleic
Acids Research 33: 1544-1552.
3. López-Bigas N, Ouzounis C (2004)
Genome-wide identification of
genes likely to be involved in
human genetic disease. Nucleic
Acids Research 32: 3108.
4. Adie E, Adams R, Evans K,
Porteous D, Pickard B (2005)
Speeding disease gene discovery by
sequence-based candidate
prioritization. BMC Bioinformatics
6: 55.
5. Turner F, Clutterbuck D, Semple C
(2003) POCUS: mining genomic sequence
annotation to predict disease genes.
Genome Biology 4: 75.
