Pathways and genomes databases in bioinformatics

Presented by
Sarwat Bashir
(Bioinformatics, 8th semester)
Shaheed Benazir Bhutto Women University, Peshawar

SECONDARY DATABASES IN BIOINFORMATICS
Secondary databases store data derived from the analysis or
treatment of primary data, such as secondary structures,
hydrophobicity plots, and domains.
http://www.imb-jena.de/~rake/Bioinformatics_WEB/databases_classification.html

THE BIOINFORMATICS SECONDARY
DATABASES
 Secondary databases are further divided into four
categories according to the information they contain:
 Sequence-related Information
 Genome-related Information
 Structure-related Information
 Pathway Information

Metabolic Pathway and Protein
Function Databases
 A pathway database (DB) describes biochemical
pathways, reactions, and enzymes, and supports the
modeling and simulation of biopathways.

GENOME DATABASES
 These databases collect organism genome sequences,
annotate (add descriptions to) and analyze them, and
provide public access.
 Some also draw on the experimental literature to
improve computed annotations.
 These databases may hold the genomes of many species
or the genome of a single model organism.

PAGED: a pathway and gene-set enrichment
database to enable molecular phenotype
discoveries
 Abstract:
 Background: Over the past decade, pathway and gene-set enrichment
analysis has evolved into the study of high-throughput functional genomics.
 Researchers have begun to combine pathway and gene-set enrichment
analysis with network module-based approaches to identify crucial
relationships between different molecular mechanisms.
 Methods: To meet the new challenge of molecular phenotype discovery,
the authors developed an integrated online database, the Pathway And Gene
Enrichment Database (PAGED), to enable comprehensive searches for
disease-specific pathways, gene signatures, microRNA targets, and network
modules by integrating gene-set-based prior knowledge as molecular patterns
from multiple levels: the genome, transcriptome, post-transcriptome, and
proteome.
Huang et al. PAGED: a pathway and gene-set enrichment database to enable molecular
phenotype discoveries. BMC Bioinformatics 2012, 13(Suppl 15):S2
http://www.biomedcentral.com/1471-2105/13/S15/S2

Cont.…
 Results: The online database the authors developed, PAGED
(http://bio.informatics.iupui.edu/PAGED), is described as by
far the most comprehensive public compilation of gene sets.
 In its current release, PAGED contains a total of 25,242
gene sets, 61,413 genes, 20 organisms, and 1,275,560
records from five major categories.
 Beyond its size, the advantage of PAGED lies in the
explorations of relationships between gene sets as
gene-set association networks (GSANs).

Introduction to PAGED
 Biological pathways have provided natural sources of molecular
mechanisms to develop diagnosis, treatment, and prevention strategies
for complex diseases.
 Gene-set enrichment methods analyze the activity of thousands of
genes together, rather than one gene at a time.
 The analysis reveals associations between genotypes and
phenotypes, which are referred to as molecular profiles or molecular
phenotypes.
 Other biological pathway databases are heterogeneous and lack
annotations.
 Unlike candidate pathway analysis, genome-wide pathway analysis
does not require prior biological knowledge.
 PAGED can reveal interactions across the different databases.
 Gene signature data from the transcriptome level offers a
complementary source of information to complete pathway knowledge.

The division of pathway analysis
 Pathway analysis is divided into three generations of
approaches:
 First generation: over-representation analysis (ORA) approach.
 Second generation: functional class scoring (FCS) approach.
 Third generation: pathway topology (PT) approach.
 Multi-level, multi-scale, knowledge-guided enrichment analysis
can enable molecular phenotype discovery for specific human
diseases.
 The acquisition of prior knowledge and systems modeling poses
a challenge for developing tools that go beyond third-generation
pathway analysis for disease-specific molecular profiling.
 To meet the new challenges of molecular phenotype discovery,
the Pathway And Gene Enrichment Database (PAGED) was
developed.
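The first-generation ORA approach listed above is commonly carried out with a hypergeometric test. The following is a minimal sketch of that general technique, not the PAGED implementation; the function name and the example counts are hypothetical.

```python
from math import comb

def ora_p_value(population, set_size, query_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    population: number of genes in the background
    set_size:   number of background genes that belong to the gene set
    query_size: number of genes in the user's list (e.g. differentially expressed)
    overlap:    number of query genes that fall in the gene set
    """
    upper = min(set_size, query_size)
    total = comb(population, query_size)
    tail = sum(
        comb(set_size, i) * comb(population - set_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20,000 background genes, a 100-gene pathway,
# a 200-gene query list, 8 of which are in the pathway.
p = ora_p_value(20000, 100, 200, 8)
print(f"ORA p-value: {p:.3g}")
```

A small p-value suggests the gene set is over-represented in the query list; in practice the p-values are also corrected for testing many gene sets.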

The benefits of integrated
database (PAGED)
 This new database can provide the following benefits to biological researchers.
 First, it contains disease-gene association data, curated and
integrated from the Online Mendelian Inheritance in Man (OMIM) database and
the Genetic Association Database (GAD); it therefore has the potential to assist
human disease studies.
 Second, it contains all currently compiled gene signatures in the Molecular
Signature Database (MSigDB) and the Gene Signatures Database (GeneSigDB).
 Third, it further integrates microRNA targets from the miRecords database,
signaling pathways, protein interaction networks, and transcription
factor/gene regulatory networks, partially based on data integrated from the
Human Pathway Database (HPD) and the Human Annotated and Predicted
Protein Interaction (HAPPI) database.
 PAGED integrates the following database versions: OMIM (Feb. 2012),
GAD (Aug. 2011), GeneSigDB (v. 4.0, Sept. 2011), MSigDB (v. 3.0, Sept. 2010),
HPD (2009), HAPPI (v. 1.4), and miRecords (Nov. 2010), which were the latest
versions available at the time.

The advantages of this Research
 The advantage of this work is that it captures the
relationships between pathways, gene signatures,
microRNA targets, and/or network modules.
 These gene-set-based relationships can be visualized
as a gene-set association network (GSAN), which
provides a “roadmap” for molecular phenotype
discovery for specific human diseases.
 The paper demonstrates how to query PAGED to discover
crucial pathways, gene signatures, and gene network
modules specific to a disease.

Methods
 Data sources: Figure 1 gives an overview of the data integration process.
 Gene-set data were collected, extracted, and integrated from five major
categories.
 The pathway data came from HPD, which has integrated 999
human biological pathways from five curated sources: KEGG, PID,
BioCarta, Reactome, and Protein Lounge.
 The genome-level disease-gene relationships came from OMIM and
GAD.
 The transcriptome-level gene signatures came from MSigDB and
GeneSigDB.
 The post-transcriptome-level microRNA data came from miRecords.
 The proteome-level data came from an integrated protein interaction
database, HAPPI, which has integrated the HPRD, BIND, MINT, STRING, and
OPHID databases.

Gene-set data integration:
 All groups of genes are treated as gene sets, including disease-associated
genes, pathway genes, gene signatures, microRNA-targeted genes, and
PPI sub-network modules.
 The raw files curated from those data sources come in various formats,
including plain text, XML, and tables.
 The authors wrote Perl/Java parsers to convert them into a common
tab-delimited text format to ensure syntactic-level data compatibility.
 To integrate across different databases, they mapped the gene/protein
IDs in all databases to official gene symbols. The gene-set data are
stored in the backend Oracle 11g relational database.
 All records of gene-set members are represented by the official gene
symbols.
 All PAGED gene sets were assigned unique PAGED-specific identifiers.
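The conversion and ID-mapping steps above can be illustrated with a short sketch. It is written in Python rather than the Perl/Java the authors used, and the alias table, gene-set identifiers, and source label are all hypothetical placeholders.

```python
import csv
import io

# Hypothetical alias table: source-specific IDs -> official gene symbols.
ID_TO_SYMBOL = {
    "ENTREZ:7157": "TP53",
    "P04637": "TP53",
    "ENTREZ:324": "APC",
}

def normalize(records, source, id_to_symbol=ID_TO_SYMBOL):
    """Yield (gene_set_id, source, symbol) rows; unmapped IDs are skipped."""
    for gene_set_id, raw_id in records:
        symbol = id_to_symbol.get(raw_id)
        if symbol is not None:
            yield (gene_set_id, source, symbol)

def to_tsv(rows):
    """Serialize rows into the common tab-delimited text format."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

# Records as they might come out of a source-specific parser (made-up data).
parsed = [
    ("GS0001", "ENTREZ:7157"),
    ("GS0001", "ENTREZ:324"),
    ("GS0002", "UNKNOWN:1"),
]
tsv_text = to_tsv(normalize(parsed, "KEGG"))
print(tsv_text, end="")
```

Skipping unmapped IDs is one possible policy; a real pipeline might log them for manual curation instead.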

Online software design
 The PAGED platform follows a multi-tiered design
architecture.
 The backend was implemented as PL/SQL packages
on an Oracle 11g database server. The PAGED
application middleware was implemented on the
Oracle Application Express (APEX) server, which
bridged between the Apache webserver and the Oracle
database server.

Gene-set similarity measurement
 The similarity score Si,j of two different gene sets is defined by the following
formula:
Si,j = α · SL + (1 − α) · SR, where SL = |Pi∩Pj| / |Pi∪Pj|
 Here, Pi and Pj denote two different gene sets, while |Pi| and |Pj| are the
numbers of genes in these two gene sets.
 Their intersection Pi∩Pj denotes the set of genes they share, while the size of
their union, |Pi∪Pj|, is calculated as |Pi| + |Pj| - |Pi∩Pj|.
 Here, α is a weight coefficient in the range [0, 1] that sets the relative
contributions of the overlap term (left item, SL) and the cover term (right
item, SR).
 SL is the well-known Jaccard coefficient, which is often used to evaluate the
similarity between two sets.
 When a larger gene set covers a smaller one, their similarity score is
expected to be high enough to identify them as related.
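The weighted score described above can be sketched as follows. The overlap term is the Jaccard coefficient as stated; the cover term is assumed here to be the intersection size divided by the size of the smaller set, which matches the description of a larger set covering a smaller one but is not confirmed by the slides. The default α of 0.5 is also an arbitrary choice for illustration.

```python
def similarity(p_i, p_j, alpha=0.5):
    """Weighted gene-set similarity: alpha * overlap + (1 - alpha) * cover."""
    p_i, p_j = set(p_i), set(p_j)
    if not p_i or not p_j:
        return 0.0
    inter = len(p_i & p_j)
    union = len(p_i) + len(p_j) - inter
    s_l = inter / union                    # overlap term (Jaccard coefficient)
    s_r = inter / min(len(p_i), len(p_j))  # cover term (assumed form)
    return alpha * s_l + (1 - alpha) * s_r

small = {"TP53", "APC"}
large = {"TP53", "APC", "KRAS", "SMAD4"}
print(similarity(small, large))             # Jaccard alone would give 0.5
print(similarity(small, large, alpha=1.0))  # alpha = 1 reduces to Jaccard
```

With the cover term included, a small set fully contained in a large one scores higher than the Jaccard coefficient alone would suggest.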

Microarray data
 For gene expression data analysis, the authors show how to
discover crucial pathways, gene signatures, and gene
network modules specific to disease functional
genomics.
 They downloaded a microarray dataset from the Gene
Expression Omnibus (GEO,
http://www.ncbi.nlm.nih.gov/geo/).
 This microarray dataset compared the transcriptomes
of collected adenomas with those of normal tissue
from the same individuals.

Differential gene-set expressions
 ABS_FC denotes the absolute value of the fold
change for each gene. Differential gene-set
expression is then defined from it as follows.
 NORM_ABS_FC: The p*-norm of ABS_FC of all the
available differential gene expressions in a gene set.
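As an illustration of NORM_ABS_FC, the sketch below computes a p-norm over the absolute fold changes of the genes in a set. The value of p is not given in the slides, so the default of 2 is an assumption, and the fold-change values are made up.

```python
def norm_abs_fc(fold_changes, p=2.0):
    """p-norm of the absolute fold changes available for one gene set."""
    abs_fc = [abs(fc) for fc in fold_changes]  # ABS_FC per gene
    if not abs_fc:
        return 0.0
    return sum(v ** p for v in abs_fc) ** (1.0 / p)

# Made-up signed fold changes for the genes of one gene set.
gene_set_fc = [3.0, -4.0]
print(norm_abs_fc(gene_set_fc))         # p = 2 (Euclidean norm)
print(norm_abs_fc(gene_set_fc, p=1.0))  # p = 1 (sum of absolute values)
```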

Gene-set association network
(GSAN) construction
 To visualize the relationships between gene sets, the authors
define a gene-set association network (GSAN) as a network of
associations between different gene sets, in which the
network elements are represented as follows:
• Node: Gene set
• Edge: Association between two gene sets
• Node size: Gene-set scale (number of genes in each gene set)
• Node color: Differential gene-set expression
(NORM_ABS_FC)
• Node line color: Gene-set data source
• Edge width: Similarity score
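The element mapping above can be expressed as a small data structure. This is a sketch only: the dictionary layout, the `min_score` cutoff, and the example gene sets, sources, and scores are hypothetical, and the similarity and NORM_ABS_FC values are taken as precomputed inputs.

```python
def build_gsan(gene_sets, sources, norm_fc, scores, min_score=0.0):
    """Assemble GSAN nodes and edges from precomputed per-set and pairwise values."""
    nodes = {
        name: {
            "size": len(genes),          # node size: gene-set scale
            "color": norm_fc[name],      # node color: NORM_ABS_FC
            "line_color": sources[name]  # node line color: data source
        }
        for name, genes in gene_sets.items()
    }
    edges = [
        {"a": a, "b": b, "width": score}  # edge width: similarity score
        for (a, b), score in scores.items()
        if score > min_score
    ]
    return {"nodes": nodes, "edges": edges}

# Made-up inputs for illustration.
gene_sets = {"GS1": {"TP53", "APC"}, "GS2": {"TP53", "APC", "KRAS"}, "GS3": {"EGFR"}}
sources = {"GS1": "OMIM", "GS2": "HPD", "GS3": "MSigDB"}
norm_fc = {"GS1": 5.0, "GS2": 2.5, "GS3": 1.2}
scores = {("GS1", "GS2"): 0.83, ("GS1", "GS3"): 0.0, ("GS2", "GS3"): 0.0}

gsan = build_gsan(gene_sets, sources, norm_fc, scores)
print(len(gsan["nodes"]), "nodes,", len(gsan["edges"]), "edges")
```

The resulting structure could be handed to any network visualization tool that accepts node and edge attributes.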

Results
 Database content statistics:
 Table 1 lists the detailed statistics for each data source
and the overlap between each pair.

Gene-set scale distributions
 Gene-set scale distributions for PAGED molecule data.
 A gene-set scale refers to the number of molecules
(i.e., genes) involved in a given gene set.
 The distributions are plotted under log scale for both
the x-axis and y-axis.
 The linear trend line in red represents a linear
regression of the PAGED distribution.
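A distribution of this kind, together with a straight-line fit on log-log axes, could be computed roughly as follows. This is a sketch with made-up counts; it uses ordinary least squares on the log10-transformed values, which is one plausible reading of the "linear regression" in the figure.

```python
import math
from collections import Counter

def scale_distribution(gene_sets):
    """Map each gene-set scale (number of genes) to how many sets have it."""
    return Counter(len(genes) for genes in gene_sets.values())

def loglog_fit(distribution):
    """Least-squares line through (log10 scale, log10 count); needs >= 2 scales."""
    xs = [math.log10(scale) for scale in distribution]
    ys = [math.log10(count) for count in distribution.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up distribution: scale -> number of gene sets of that scale.
example = {1: 100, 10: 10, 100: 1}
slope, intercept = loglog_fit(example)
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```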

Cont ….
 An overview of the core functionality of the online PAGED
website.
 (A) The PAGED home page providing search by either
disease name or gene list;
 (B) a webpage containing the list of gene sets retrieved as a
result of a disease query;
 (C) a webpage containing the list of gene sets retrieved as a
result of a gene list query;
 (D) an advanced search page in which the user can either
search disease name or upload a gene-list to search;
 (E) a browse page listing the gene sets, their data source
and number of genes.

Discussion
 In the near future, the gene-set similarity algorithms will be
improved by using a global PPI network to calculate the distance
between gene sets.
 This would provide a more robust measurement.
 For web interface development, the plan is to add a disease browsing
function based on disease ontology and a network visualization
function to show gene-set associations dynamically.
 The final goal is to perform multi-scale network modeling for
molecular phenotype discoveries by integrating differential
expressions with pathway and network topologies.
 The current release of PAGED provides a solid foundation for
developing third-generation pathway analysis tools.

Conclusions
 The authors developed PAGED, an online database that provides
the most comprehensive public compilation of gene sets.
 In the current release, PAGED contains a total of 25,242 gene
sets, 61,413 genes, 20 organisms, and 1,275,560 records from five
major categories:
 The pathway data from HPD, genome-level disease data from
OMIM and GAD, transcriptome-level gene signatures from MSigDB and
GeneSigDB, post-transcriptome-level microRNA data from
miRecords, and proteome-level data from HAPPI.
 The paper also reports the number of overlapping genes between
data sources, the gene-set scale distribution, and a case study in
colorectal cancer.
 The current PAGED software can help users address a wide range
of gene-set-related questions in human disease biology studies.

MGD: the Mouse Genome
Database
 ABSTRACT
 The Mouse Genome Database (MGD) (http://www.informatics.jax.org) is one
component of a community database resource for the laboratory mouse, a key
model organism for interpreting the human genome and for understanding
human biology.
 MGD strives to provide an extensively integrated information resource with
experimental details annotated from both literature and on-line genomic data
sources.
 MGD presents the consensus representation of genotype (sequence) to
phenotype information including highly detailed information about genes and
gene products.
 Primary foci of integration are through representations of relationships
between genes, sequences and phenotypes.
 MGD collaborates with other bioinformatics groups to curate a definitive set of
information about the laboratory mouse.
 Recent developments include a general implementation of database structures
for controlled vocabularies and the integration of a phenotype classification
system.
Judith A. Blake et al. MGD: the Mouse Genome Database. Nucleic Acids Research, 2003, Vol. 31,
No. 1, 193–195. DOI: 10.1093/nar/gkg047

INTRODUCTION
 The Mouse Genome Database (MGD) provides integrated
information on mouse genes, genetic markers, and genomic
features, as well as information on molecular segments
(probes, primers, cDNA clones, BACs and YACs), mutant
phenotypes, comparative mapping data, graphical displays
of linkage, cytogenetic and physical maps, experimental
mapping data, and strain distribution patterns for
recombinant inbred strains (RIs) and cross haplotypes.
 MGD is updated daily. It provides several new data
manipulation and display tools.
 MGD is one component of the Mouse Genome Informatics
(MGI) database resource (http://www.informatics.jax.org)
located at The Jackson Laboratory (http://www.jax.org).

IMPROVEMENTS DURING 2002
 Implementation of phenotype classifications
 A broad, high-level set of phenotype terms has been developed and employed
to classify phenotype data in MGD.
 This defined vocabulary of 105 terms can be used to search, group, compare
and analyze phenotypes.
 These phenotype classification terms appear on the Alleles and Phenotypes
Query Form (Fig. 1), and on the Genes and Marker Query Form.
 The complete list of terms and their accession IDs is also available by FTP.
 On each form, there is a link to the phenotype classification terms, complete
with definitions and examples.
 Users of the MGI database can select one or more terms from the list to search
for records associated with a particular phenotype, in combination with many
other parameters on the forms.
 In addition, text based searches for more specific phenotypic terms remain
available.

Improvements to the MGI:GO
browser
 The MGI GO Browser
(http://www.informatics.jax.org/searches/GO_form.shtml)
allows database users to access genes in MGI using
functional annotation terms from the GO.
 This browser was developed in conjunction with the
Gene Expression Database (GXD).
 The GO Browser can be accessed from gene detail or
query pages as well as directly from the MGI menus.

Availability of MGI:GO files in
various formats
 MGI gene-to-GO annotations are updated daily.
 Various files for the MGI genes/markers with their GO associations
are publicly available.
 These files are updated each time MGI submits a new gene
association file to the GO web site (http://www.geneontology.org)
and can be accessed on the MGI FTP server
(ftp://www.informatics.jax.org/pub/informatics/reports/gene_association.mgi).
 A file of all the GO terms used by MGI in the annotation of genes
and gene products is also available. MGI also provides the GO
database with a file of MGI Gene : SWISS-PROT associations.
 This information is incorporated into the GO database and thus
enables users to recover mouse sequence data as a result of a
semantic search against the GO database
(http://www.godatabase.org/cgi-bin/go.cgi).
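For computational users, a gene association file of this kind can be read with a few lines of code. The sketch below assumes the tab-delimited GO annotation file (GAF) layout, in which lines beginning with "!" are comments and the first five columns are database, object ID, object symbol, qualifier, and GO ID; the exact layout of the 2003 MGI file may differ, and the sample lines are illustrative rather than copied from a real download.

```python
from collections import defaultdict

def parse_gaf(lines):
    """Yield one dict per annotation line of a GAF-style file."""
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and comment/header lines
        cols = line.rstrip("\n").split("\t")
        yield {
            "db": cols[0],
            "object_id": cols[1],
            "symbol": cols[2],
            "go_id": cols[4],
        }

def genes_by_go_term(annotations):
    """Group gene symbols under each GO term they are annotated to."""
    grouped = defaultdict(set)
    for ann in annotations:
        grouped[ann["go_id"]].add(ann["symbol"])
    return dict(grouped)

# Illustrative sample in GAF-like layout (not real file content).
sample = [
    "!gaf-version: 2.2\n",
    "MGI\tMGI:0000001\tGeneA\t\tGO:0006915\tMGI:MGI:0000000\tIDA\n",
    "MGI\tMGI:0000002\tGeneB\t\tGO:0006915\tMGI:MGI:0000000\tIMP\n",
    "MGI\tMGI:0000001\tGeneA\t\tGO:0005634\tMGI:MGI:0000000\tIDA\n",
]
go_index = genes_by_go_term(parse_gaf(sample))
print(sorted(go_index["GO:0006915"]))
```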

IMPLEMENTATION
 MGD is implemented in the Sybase relational database
system, version 12.5.
 A large set of CGI scripts and Java Servlets mediate the
user’s interaction with the database.
 For computational users, direct SQL access can be
requested through User Support.
 User-requested database reports and a number of
widely used data files (generated daily) are available
on the FTP site (ftp://ftp.informatics.jax.org).

CITING MGD
 The following citation format is suggested when
referring to datasets specific to the MGD component
of MGI :
 Mouse Genome Database (MGD), Mouse Genome
Informatics, The Jackson Laboratory, Bar Harbor,
Maine (URL: http://www.informatics.jax.org).
