Integrative Networks Centric Bioinformatics

Integrative & Network Analysis of
Transcriptomics and Proteomics Data
Prof. N. Krasnogor
Biology, Neuroscience and Computing (BNC) Research Group
School of Computing Science
Centre for Synthetic Biology and Bioexploitation
Newcastle University

E-mail:
Twitter:
Skype:

Natalio.Krasnogor@Newcastle.ac.uk
@Nkrasnogor
Natalio.Krasnogor

Outline
•

Introduction: goals and data sets

•

ArrayMining.net: tool set for microarray
analysis.
Part I

•

TopoGSA: topological analysis of genes /
proteins networks.

•

PathExpand: inference of missing pathways
links.

•

Enrichnet: network-based enrichment.
Part II

•

Visualisation & Exploration: networks
surfing.
Part III

Gibson G (2003) Microarray Analysis.
PLoS Biol 1(1): e15.
doi:10.1371/journal.pbio.0000015

Introduction

•

Typical problem in biosciences: How to make effective use of
multiple, large-scale data sources?

•

Typical problem in computer science: How to exploit the
strengths of different algorithms?

 GOAL: Develop new (& existing) methods combining diverse
data sources and algorithms

Reference data set
Armstrong et al. Leukemia data set

72 samples and 12,626 genes

•

3 leukemia sub-types: ALL (24), AML (28),
MLL (20)

•

Heat map: 30 most differentially
expressed genes vs. samples

Normalisation: Variance Stabilizing
Normalisation (Huber et al., 2002)

•

samples

Platform: Affymetrix UV95A oligonucleotide
array

•

genes

•

Thresholding/Filtering steps: see
Armstrong et al. (2001, Nat. Genet.)

•

Public access to data set:
http://www.broadinstitute.org/cgi-bin/cancer/da

Main data set

QMC breast cancer microarray data set

Pre-normalized data (log-scale, min: 4.9,
max: 13.3)

•


•

3 tumour grades: 1 (33), 2 (52), 3 (43)

•

grade1

Platform: Illumina Sentrix Human-6
BeadChips

•

genes

•

Probe level data analysis: Bioconductor
beadarray package

•

http://www.ebi.ac.uk/microarray-as/ae
accession number: E-TABM-576

grade 3

(grade 1 and grade 3)

Breast cancer data - difficulties

Breast cancer outcome is hard to predict:

Van‘t Veer et al.

Alon et al.

Golub et al.

Large degree of class-overlap in Breast cancer microarray data, whereas
Leukemia decision boundaries are easy to find (Blazadonakis, 2009).

Data Fusion
Other biological data sources used:
Breast cancer microarray data:
• additional public
data sets: Huang
et al., Veer et al.
• pre-processing:
GC-RMA
Cellular pathway data:
• obtained from GO,
BioCarta, Reactome,
KEGG and InterPro
• total: approx. 3000
pathways (size > 10)

Protein interaction data:
• unweighted binary
interactions (MIPS,
DIP, BIND, HPRD,
IntAct - only human)
• 9392 nodes,
38857 edges
Cancer gene sets:
• mutated genes in
different human
cancer types
(Breast, Liver,...)
• 30 gene sets of
size > 10 genes

Methods overview
Methods overview: ArrayMining & TopoGSA

Part I
Tools for Automated Microarray
Data Analysis

9

Web-tool: ArrayMining.net
What is ArrayMining.net?
ArrayMinining.net is an online microarray
analysis tool set integrating multiple data
sources and algorithms.

www.arraymining.org

Goal: A “swiss knife“ for
microarray analysis
tasks

6 analysis modules:
1. Gene selection
2. Sample clustering
classical
3. Sample classification
4. Gene Set Analysis
new
5. Gene Network Analysis
6. Cross-Study Normalization

Methods Overview

Methods overview: ArrayMining & TopoGSA

ArrayMining.net: Gene selection
Gene selection module
• Applies supervised feature selection algorithms (CFS, eBayes, SAM, etc.)
• Compares multiple algorithms or combines them into an ensemble
• Example: ENSEMBLE feature selection for Armstrong et al. (2001) dataset:
Affymetrix ID

Gene symbol

Gene descriptions – source:

F-statistic

32847_at



MYLK

myosin, light polypeptide kinase

159.59

1389_at



MME

membrane metallo-endopeptidase (neutral endopeptidase, enkephalinase)

137.53

35164_at



WFS1

wolfram syndrome 1 (wolframin)

36239_at



POU2AF1

pou domain, class 2, associating factor 1

116.75

1325_at



SMAD1

smad, mothers against dpp homolog 1 (drosophila)

110.37

963_at



LIG4

ligase iv, dna, atp-dependent

89.77

34168_at



DNTT

deoxynucleotidyltransferase, terminal

89.31

40570_at



FOXO1

forkhead box o1a (rhabdomyosarcoma)

86.89

33412_at



LGALS1

lectin, galactoside-binding, soluble, 1 (galectin 1)

81.31

 previously identified by Armstrong et al.

128

 newly identified

ArrayMining.net: Gene selection
Gene selection module (2): Armstrong et al. dataset




• Automatic generation of box
plots with gene and sample
class annotations
• The first row shows the box
plots for the two best-ranked
newly identified genes in the
Armstrong et al. dataset ()
• The second row shows two
top-ranked previously identified genes ()
• The user can easily compare
and combine the results
from different selection
methods





ArrayMining.net: Examples

genes

Further examples: Gene selection and Clustering module

samples

Automatic generation of heatmaps and PCA Cluster plots (Armstrong et al. dataset)

Further examples: 3D-ICA and Co-Expression analysis
Sample space:

Gene space:

ALL
AML
MLL

3D Independent Component Analysis plot (left) and the largest connected components
from a gene co-expression network (right) for the Armstrong et al. dataset

ArrayMining.net: In-house data
Apply the tools on new data: QMC Breast cancer data
Expression levels across 3 tumour grades:

STK6

KIF2C

Heat map: 50 most significant genes

MYBL2

AURKb

Box plot: 4 most significant genes

ArrayMining.net: QMC dataset
QMC Breast cancer data set – selected genes
• all top-ranked genes are known or likely to be involved in breast cancer
• the selection is robust with regard to cross-validation cycles and algorithms

PC (gene vs. outcome):

Fold
Change

ESTROGEN RECEPTOR 1

-0.75

0.16

1.6e-20 (1.)

RAS-LIKE, ESTROGEN-REGULATED,
GROWTH INHIBITOR

-0.66

0.46

5.3e-14 (2.)

WD REPEAT DOMAIN 19

-0.66

0.73

1.2e-13 (3.)

CARBONIC ANHYDRASE XII

-0.65

0.28

2.7e-13 (4.)

ARP3 ACTIN-RELATED PROTEIN 3
HOMOLOG (YEAST)

0.64

1.37

9.6e-13 (5.)

TETRATRICOPEPTIDE REPEAT DOMAIN
8

-0.63

0.82

2.2e-12 (6.)

BREAST CANCER MEMBRANE PROTEIN
11

-0.62

0.24

7.1e-12 (7.)

Gene name

Q-value
(Rank)

ArrayMining.net: Example
ArrayMining - Class Discovery Analysis module:
•

Motiviation:
Exploiting the synergies between partition-based and hierarchical clustering
algorithms

•

Approach:
Consensus clustering based on the agreement of clustering results for pairs
of objects (details on next slide).
- equivalent to median partition problem (NP-complete)
- Simulated Annealing (SA) has been shown to provide good solutions

•

Our solution:
- Compare SA (Aarts et al. cooling scheme) with thermodynamic SA (TSA)
and fast SA (FSA)  FSA provides fastest convergence
- Initialization: Input clustering with highest agreement to other inputs

ArrayMining.net: Consensus Clustering
ArrayMining‘s consensus clustering approach:
Clustering Agreement:= No. of times pairs of samples are assigned to the
same cluster across all input clusterings
Idea: Reward objects in the same cluster, if they have a high agreement.

Agreement matrix:
all clusterings for
samples i and j

Fitness function:

α:= (max(A)+min(A))/2

Sample 1

Aij := #agreements across

Sample 2

Clustering Methods and Validity Indices
•

Sample clustering methods: 8 different methods considered:
- partition-based: k-Means, PAM, CLARA, SOM, SOTA
- hierarchical: AL-HCL, DIANA, AGNES

•

Scoring & number of clusters selection: 5 validity indices / splitting rules used:
- Silhouette width, Calinski-Harabasz, Dunn, C-index, knn-Connectivity.
Example: Silhouette width
a(i) = avg. distance of obj(i) to all others
in the same cluster
b(i) = avg. distance of obj(i) to all
others in closest distinct cluster

- good validity indices should have: no multivariate normality assumptions (Gower,
1981), small or no bias (Milligan & Cooper, 1985)
•

Standardization: classical (mean 0, stddev. 1) or median absolute deviation

Many Clustering Methods/Validity Indexes

22

Consensus Clustering: example
Example application: QMC breast cancer dataset
•

Separate sub-classes in 84 luminal samples with consensus clustering

•

Input algorithms: k-Means, SOM, SOTA, PAM, HCL, DIANA, HYBRID-HCL

low confidence
(silhouette widths)

best separation
for two clusters

External Validation
Clustering results – external validation (tumour grades)
Measure similarity of
clusterings with the rand
index R:

a, b, c and d are the #pairs
of objects assigned to:
- the same cluster in
both clusterings (a)

Reference
clustering: 3 tumour
grades (low,
medium, high)
Random model
Single clustering
Consensus

- different clusters in
both clusterings (b)
- the same cluster in
clustering 1/2 and
different clusters in
clustering 2/1 (c/d)
- Corrected for chance:
 adjusted rand index

10000 random
clusterings

ArrayMining.net: Genes Set Analysis
Extension: Gene set analysis
•
•
•

Expression levels for a single gene are often unreliable
Similar genes might contain complementary information
We want to integrate functional annotation data
samples



pathways

Gene Set Analysis (GSA):
1) Identify sets of functionally
similar genes (GO, KEGG, etc.)
2) Summarize gene sets to “Meta”genes (PCA, MDS, etc.)
3) Apply statistical analysis
(example: Van Andel institute cancer gene sets)

Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide
expression profiles. Subramanian et al. PNAS October 25, 2005 vol. 102 no. 43 15545–15550

Gene Set Analysis module – example analysis
we apply the Gene Set Analysis module to the Armstrong et al. dataset
• with known cancer gene sets the class separation is better than for single genes
•

Heat map for the Armstrong et al. dataset based on pathway meta-genes

Consensus Clustering: example (2)
Combine consensus clustering with gene set analysis
•

Map genes onto Gene Ontology (GO), reduce dimensionality (MDS)

•

Apply same consensus clustering as before on GO-based „meta-genes“

~3 times higher
confidence

better separation

External Validation

Single clustering
Consensus clustering
Consensus
(PAM+SOTA)
10000 random
clusterings

Interim Summary
ArrayMining Integrative Clustering - Summary

•

Consensus clustering (CC) results tend to be similar to or slightly better than the best single clusterings
in terms of adj. rand index and validity indices (but longer runtime)

•

The input clusterings should include diverse methods and exclude similar methods

•

Using gene sets (GS) representing cellular pathways instead of single genes results in better cluster
separation, adj. rand indices and validity indices (annotation data required)

 GS & CC provide improved results, but:
longer runtimes + annotation data required

Part II
Tools for Networks Analysis

31

Networks Biology
Describe network strategies to identify and prioritize gene sets or
Describe network strategies to identify and prioritize gene sets or
functional associations (arising in high throughput experiments)
functional associations (arising in high throughput experiments)
between aagenes/proteins set of interest (target set) and annotated
between genes/proteins set of interest (target set) and annotated
genes/proteins sets (reference set)
genes/proteins sets (reference set)
I will describe some of the functionalities and methods in:

Freely accessible at:
http://www.ncl.ac.uk/csbb/research/resources/

InIn collaboration with:
collaboration with:
Dr. Pawel Widera (Newcastle
Dr. Pawel Widera (Newcastle
University)
University)
Dr. Enrico Glaab (University ofof
Dr. Enrico Glaab (University
Luxembourg ) )
Luxembourg
Dr. Anais Baudot (Unversite d’AixDr. Anais Baudot (Unversite d’AixMarseille)
Marseille)
Prof. Reinhard Schneider (University
Prof. Reinhard Schneider (University
ofof Luxembourg )
Luxembourg )
Prof. Alfonso Valencia (CNIO ) )
Prof. Alfonso Valencia (CNIO

TopoGSA: Network topological
analysis of gene sets
What is TopoGSA?
TopoGSA is a web-application mapping
gene sets onto a comprehensive human
protein interaction network and analysing
their network topological properties.

Two types of analysis:
1. Compare genes within a gene set:
e.g. up- vs. down-regulated genes
2. Compare a gene set against a
database of known gene sets
(e.g. KEGG, BioCarta, GO)

TopoGSA - Methods
TopoGSA computes topological properties for an uploaded gene set
TopoGSA computes topological properties for an uploaded gene set
and matched-size random gene sets
and matched-size random gene sets
•

the degree of each node in the gene set

•

the local clustering coefficient Ci for each node vi in the gene set:

where ki is the degree of vi and ejk is the edge between vj and vk
•

the shortest path length between pairs of nodes vi and vj in the gene set

•

the node betweenness B(v) for each node v in the gene set:

here σst(v) is the number of shortest paths from s to t passing through v
•

the eigenvector centrality for each node in the gene set

Human PPI network  MIPS, DIP, BIND, HPRD and InAct
Human PPI network  MIPS, DIP, BIND, HPRD and InAct
9393 proteins and 38857 interactions
9393 proteins and 38857 interactions
Also for yeast, fly, worm and arabidopsis
Also for yeast, fly, worm and arabidopsis

LEGEND:

Mean node
betweenness

• Cellular processes
• Environmental information processing
• Genetic information processing
• Human diseases
• Metabolism
• Cancer genes

General results:

tering
Mean clus
coefficient

Me
a
pa n sho
th
len rtest
gth

• Metabolic pathways
have high shortest path
lenghts and low betweenness
• Disease pathways and
cancer gene sets tend to
have high betweenness
and small shortest path
lenghts

Main data set
QMC breast cancer microarray data set

Pre-normalized data (log-scale, min: 4.9,
max: 13.3)

•


•

3 tumour grades: 1 (33), 2 (52), 3 (43)

•

grade1

Platform: Illumina Sentrix Human-6
BeadChips

•

genes

•

Probe level data analysis: Bioconductor
beadarray package

•

http://www.ebi.ac.uk/microarray-as/ae
accession number: E-TABM-576

grade 3

(grade 1 and grade 3)

ArrayMining.net: In-house data

Apply the tools on new data: QMC Breast cancer data
www.arraymining.net

www.topogsa.net
Expression levels across 3 tumour grades:

STK6

KIF2C

Heat map: 50 most significant genes

MYBL2

AURKb

Box plot: 4 most significant genes

ArrayMining  TopoGSA
Topological analysis of Selected Genes
• Results of within-gene-set comparison:
Estrogen receptor 1 gene and apoptosis regulator Bcl2, both up-regulated
in luminal samples, have outstanding network topological properties (higher
betweenness, higher degree, higher centrality) in comparison to other genes.
• Results of comparison against reference databases:
- Metabolic KEGG pathways are most similar to the uploaded gene set in
terms of network topological properties.
- Most similar BioCarta pathways: Cytokine, Differentiation and inflammatory
pathways.

Real-world application of tools sets
ArrayMining identifies RERG as a tumour marker
• RERG (Ras-related and oestrogen-regulated growth-inhibitor) was identified
as a new candidate marker of ER-positive luminal-like breast cancer subtype
• Validation using immunohistochemistry on Tissue Microarrays containing
1,140 invasive breast cancers confirmed RERG‘s utility as a marker gene

TMAs of invasive breast cancer
show strong RERG expression

RERG Protein Expression VS BCSS & DMFI
H.O. Habashy, D.G. Powe, E. Glaab, N. Krasnogor, J.M. Garibaldi, E.A. Rakha, G. Ball, A.R.
Green, and I.O. Ellis. Rerg (ras-related and oestrogen-regulated growth-inhibitor) expression
in breast cancer: A marker of er-positive luminal-like subtype. Breast Cancer Research and
Treatment, (on line first):1-12, 2010.
1.0

0.8

Negative RERG expression

Cumulative Survival

Without adjuvant treatment

Cumulative Survival

Positive RERG expression

0.6

1.0

0.8

0.6

p=0.002

0.4
0

50

100

150

200

250

0

50

BCSS in months

1.0

0.8

200

250

0.6


p= 0.007

0.4
0

50

100

150

200

DMFI in months

Kaplan Meier plot of RERG
protein expression with respect
to BCSS in ER+ only

250

Cumulative Survival

Without Tamoxifen treatment

Cumulative Survival

150

1.0


Kaplan Meier plot of RERG
protein expression with respect
to BCSS in ER+ U ER- cohort

100

BCSS in months

0.8


0.6

p=0.027
0

50

100

150

BCSS in months

200

250

PathExpand: Expanding
pathways and cellular processes

• Utilize same PPI network as in TopoGSA; derived from
– MIPS, DIP, MINT, HPRD and IntAct
– only experimental evidence of binary PPI
– final protein interaction network contained 9392 proteins
(nodes) and 38857 interactions (edges)
• Process Mapping:
– KEGG, Biocarta and Reactome were mapped
– Only 60% of pathways members existed in PPI Network

PPI-based pathway-enlargement
Extension Procedure

Idea: Enlarge pathways by adding genes that are “strongly connected“ to the pathway-nodes or
increase the pathway-“compactness“

••
••

Our procedure extended 159 pathways from
Our procedure extended 159 pathways from
BioCarta, 90 from KEGG and 52 from
BioCarta, 90 from KEGG and 52 from
Reactome.
Reactome.
The pathway sizes increased on average from
The pathway sizes increased on average from
113% to 126% of the original size.
113% to 126% of the original size.

Validation shows:
1)The proteins added are well connected and central in the protein interaction
network
2)The added proteins display gene ontology annotations matching better to the
original cellular pathway/process annotations than random proteins
3)Are enriched in processes known to be related to cellular signalling
4)Our method is able to recover known cellular pathway/process proteins in a
cross-validation experiment

Added cancer gene

Example case: BioCarta BTG family proteins and cell cycle regulation
Black: Original pathway nodes – Green: Nodes added based on connectivity

Pathway enlargment – Example 1
Example: Alzheimer disease pathway
•More than 20 proteins
annotated in our
PPIN
•5 added proteins by the
extension process
•3 known disease
associated
•2 candidates: METTL2B,
TMED10

Pathway enlargment – Example 2
Example: Interleukin signaling pathways
•Complex signaling
system, sharing
intracellular
Cascades
•New regulators
•New crosstalk
proteins

Network-based Functional
Association Ranking
Describe network strategies to identify and prioritize
Describe network strategies to identify and prioritize
functional associations (arising in high throughput
functional associations (arising in high throughput
experiments) between a genes/proteins set of interest
experiments) between a genes/proteins set of interest
(target set) and annotated genes/proteins sets (reference
(target set) and annotated genes/proteins sets (reference
set)
set)

2) Feed

1) Produce

Reference Set
Reference Set
11

Target
Target
Set
Set

3) Filter/Compare/
Overlap/Etc

4) Transfer

Reference Set
Reference Set
nn

5) New Hypothesis/Experiments/Insights

Multiple Approaches for “Enrichment Analysis”:
•Over-representation analysis (ORA)
•Gene set enrichment analysis (GSEA)
•Integrative and modular enrichment analysis (MEA)

Huang, D. et al., Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of
large gene lists. Nucleic Acids Res., 37 (1), 2009

Key limitations of previous
“Enrichment Analysis”:
•ORA techniques often have low discriminative power with scores varying
widely with small changes in the overlap size.
•Functional information captured in the graph structure of a molecular
interaction network connecting the gene/protein sets of interest is
disregarded.
•Genes and proteins in the network neighborhood, in particular those with
missing annotations, are not taken into account.
•The recognition of tissue-specific gene/protein set associations is often
statistically infeasible.

Perturbations in molecular networks disrupt biological pathways and result in human
diseases.

Wang X et al. Briefings in Functional Genomics 2011;10:280-293.
© The Author 2011. Published by Oxford University Press. All rights reserved

Key idea behind EnrichNet
Gene1
Gene1
..
..
..
GeneN
GeneN

Target Set
Target Set

Gene1
Gene1
Target Set . .
Target Set
..
..
GeneN
GeneN

Gene1
Gene1
..
..
..
GeneN
GeneN

∩

∪

Protein1
Protein1
..
..
..
ProteinM
ProteinM

Gene1
Gene1
..
..
Reference Set 1 1 Reference Set 1 1
Reference Set
Reference Set
..
GeneN
GeneN

∪
Protein1
Protein1
..
..
Reference Set nn Reference Set n n
Reference Set
Reference Set
..
ProteinM
ProteinM

What does it do?
Input:
– 10 or more human gene or protein identifiers
– Selection of reference database, e.g., KEGG, BioCarta, WikiPathways, Reactome,
PID, Interpro, GO.
Processing:
–

Maps target and reference sets to genome scale molecular interaction networks
•

Two default ones available

•

User can provide her/his own

– RWR to calculate distance between target and reference sets mapped into the large
network
– Comparison of these scores against a background model

Output:
– Ranking table of reference dataset
•

cellular pathways, processes and complexes

– Network and Tissue (60) specific scores
– For each pathway  interactive visualization of its embedding network
– Zoom, search, highlight, retrieve annotation/topological data

How does it work?
Connected human interactome graph
Connected human interactome graph
derived from STRING 9.0 database
derived from STRING 9.0 database
Edges weighted by the STRING
Edges weighted by the STRING
combined confidence score
combined confidence score
normalized to range [0,1])
normalized to range [0,1])

Distances (~F(Pt)) from target nodes
Distances (~F(Pt)) from target nodes
to reference ones is calculated by aa
to reference ones is calculated by
Random Walk with Restart (RWR)
Random Walk with Restart (RWR)
procedure.
procedure.
These distances are then linked to aa
These distances are then linked to
background model…
background model…

How does it work?
Distances are related to aa
Distances are related to
background model via:
background model via:

• Distance dependent weighting factor also accounts for the supposition that an over-representation of
small distance scores is more likely to reflect strong associations than an over-representation of large
distance scores.
• Classical statistical tests for comparing differences in the centre or shape of two distributions, e.g. the
Mann–Whitney U-test or the Kolmogorov–Smirnov test, are not applicable as they lack a distancedependent weighting.
• Randomly matched-size gene sets do not provide an adequate background model, since their members
can only have similar connectivity properties as pathway-representing gene sets if they are allowed to
significantly overlap with real pathways in the network.
• Tissue specific scores are easily computable by only considering distance scores to nodes
labeled with the tissue in question

Results: EnrichNet scores compared to over-representation
analysis scores
•

Assessed EnrichNet on two gene
sets representing different tumour
types:

•

genes mutated in bladder
and gastric cancer

•

genes associated with
Parkinson’s disease

•

Compared the results with a
conventional ORA on all pathway
databases.

absolute Pearson correlations between
0.50 and 0.95 for the different
Gene set pairs with equal ORA scores can be
datasets compared
differentiated using their Xd-distances

Datasets that share none of their genes/proteins with a pathway of interest (hence cannot be scored with ORA) and those with large
overlap- sizes (hence very similar ORA scores) mostly receive different Xd-scores

Thus enables a more sensitive and comprehensive ranking of gene set pairs

Results: Xd-score ranking for top 20 functional associations between genes
mutated in gastric cancer and pathways in the BioCarta database

Datasets that share none of their genes/proteins with a pathway of interest (hence cannot be scored with ORA) and those with large overlap
sizes (hence very similar ORA scores) mostly receive different Xd-scores

Thus enables a more sensitive and comprehensive ranking of gene set pairs

Results: Comparative validation on benchmark gene
expression data

EnrichNet and ORA scores
(Fisher’s exact test) across
five (reference) microarray
gene expression datasets X
two (target) gene set
collections

Results: Identification of novel functional associations
The network-based Xd-score ranking identifies several new associations
missed by the classical approach (ORA)

We use dataset examples pairs such that:

•target gene sets are all mutated in different diseases without additionally
available expression level data  they cannot be analyzed with “expressionaware” gene set enrichment analysis techniques

•zero or insignificant overlap size  receive low ORA scores (Q-values> 0.05)
but Xd-scores above the significance threshold obtained from the linear
regression fit

•these results point to functional associations that reflect dense networks of

•

Largest connected component for the network structure obtained when
comparing the gastric cancer mutated gene set against the pathway
Role of Erk5 (Extracellular signal-related kinase 5) in Neuronal Survival
(h_erk5Pathway) from the BioCarta database, describing a signalling
cascade which induces transcriptional events promoting neuronal survival.

•

These datasets have an intersection of only three genes (HRAS, NRAS &
KRAS) and would therefore not have been considered as significantly
associated ORA.

•

The network-based Xd-score (0.26, which is a significant threshold VS
only 0.08 Q-value in ORA with Fisher test), highlights functional
associations.

•

These

reflect

abundance

of

molecular

interactions

between

the

corresponding proteins for these gene sets and their shared network
neighborhoods.

•

This dense network of interactions corroborates previous findings linking
extracellular signal-related kinases (ERKs) to gastric cancer via an
induction of the putative tumor suppressor gene DDMBT1 (deleted in

•

Largest connected component for the network structure obtained
when comparing two datasets, bladder cancer mutated genes
and the genes for the Gene Ontology (GO) term ‘tyrosine
phosphorylation of Stat3

•

These share only a single gene (NF2)  no association can be
inferred from ORA

•

The high Xd-score for this gene set pair (0.80) points to a
functional

association

via

multiple

connecting

molecular

interactions, which is confirmed by the visualization.

•

In agreement with the previous observation that the downregulation of STAT3 phosphorylation by means of silencing the
Rho GTPase CDC42 is linked to the suppression of tumour
growth in bladder cancer.

•

Rho GTPases like CDC42 are known to frequently participate in
carcinogenic processes

•

Their involvement in bladder cancer is also reflected by a high
Xd-score of 0.71 for the GO biological process ‘regulation of Rho
GTPase activity’ (GO:0032319), which also shares only one

c)

•

Largest connected component for the network structure
obtained

when

comparing

two

datasets,

genes

implicated in Parkinson’s disease (PD) and the
‘regulation of interleukin-6 biosynthetic process’ from
the Gene Ontology database

•

A significant XD-score (0.77, significance threshold: 0.73)
even when sharing only one gene (IL1B)

•

The corresponding sub-network reveals a dense cluster
of interactions that interlink the PD gene set with the
interleukin-6 pathway.

•

This corroborates previously identified links between PD
and inflammation and reports of elevated levels of
interleukin-6 in the cerebrospinal fluid of PD patients

Results: Evaluation of the tissue specificity of gene set
associations
The network-based Xd-score is capable of computing tissue-specific
association scores

•

Tissue-specific analyses can in principle be realized with enrichment analysis techniques other than
EnrichNet

•

In practice, however, is often infeasible for conventional ORA methods  the subset of genes with
available tissue-specific annotations is too small to obtain reliable over-representation statistics.

•

EnrichNet alleviates this limitation of ORA approaches by taking tissue specificity annotations into account
from all non-overlapping gene/protein pairs that are connected through paths of interactions in a
molecular network.

•

Example: we took a set of genes with known implications in PD and measured the tissue-specific
associations with the high-scoring KEGG ‘Neurodegenerative Diseases’ (hsa01510) pathway  high Xdscores were over-represented in the group of brain tissues, whereas the centre of the Xd- score

Part III
Tools for Networks Visualisation &
Exploration

64

FruitNet
Main question: what is
the regulation
mechanism?
-measure gene expression
over time
-correlate expression
profiles
-connect genes if ρ >
threshold
The Tomato Genome Consortium, "The tomato
genome sequence provides insights into fleshy fruit
evolution", Nature 485, 635-641 (31 May 2012),
doi:10.1038/nature11119

65

•

FruitNet [in prep]: work in progress, interactions in
tomato fruit development

•

SeedNet [1]:
interactions in dormant and
germinating Arabidopsis seeds

•

SCoPNet [2]:
predicted network of
interactions in germinating and dormant Arabidopsis
seeds generated with BioHELL

•

EndoNet [3]:
interactions in the micropylar
endosperm of germinating Arabidopsis seed

•

RadNet [3]:
interactions in the lower axis of
germinating Arabidopsis seeds

[1] G.W. Bassel, H. Lan, E. Glaab, D.J. Gibbs, T. Gerjets, N. Krasnogor, A.J. Bonner, M.J. Holdsworth, N.J. Provart,
"Genome-wide network model capturing seed germination reveals coordinated regulation of plant cellular phase transitions",
PNAS, 108(23):9709-9714, June 2011 doi:10.1073/pnas.1100958108
[2] G.W. Bassel, E. Glaab, J. Marquez, M.J. Holdsworth, J. Bacardit, "Functional Network Construction in Arabidopsis Using
Rule-Based Machine Learning on Large-Scale Data Sets", The Plant Cell, 23(9):3101-3116, September 2011
doi:10.1105/tpc.111.088153
[3] B.J.W. Dekkers, S. Pearce, R.P. van Bolderen-Veldkamp, A. Marshall, P. Widera, J. Gilbert, H. Drost, G.W. Bassel, K.
Muller, J.R. King, A.T.A. Wood, I. Grosse, M. Quint, N. Krasnogor, G. Leubner-Metzger, M.J. Holdsworth, L. Bentsink,
"Transcriptional Dynamics of Two Seed Compartments with Opposing Roles in Arabidopsis Seed Germination”, PLANT
PHYSIOLOGY, 163(1):205-215, September 2013 doi:10.1104/pp.113.223511
68

Conclusions
•

Combining algorithms in a sequential and/or parallel fashion can
provide performance improvements and new biological insights

•

Microarray and gene set analysis tasks can be interlinked flexibly
in an (almost) completely automated process
[www.ArrayMining.Net]

•

New analysis types like network-based topology analysis and coexpression analysis complement existing tools [www.
{TopoGSA,PathExpan,EnrichNet}.net]

•

In the case of BC it allowed us to identify candidate genes to
characterise ER+ luminal-like BC.
– RERG gene is a key marker of the luminal BC class
– It can be used to separate distinct prognostic subgroups

Conclusions(I): Feature comparison with similar tools
ArrayMining &
TopoGSA

GEPAS
(Tarraga et al.)

Expression Profiler
(Kapushesky et al.)

Pre-processing:
Image analysis, single- and
dimensionality reduction,
gene name normalization,
cross-study normalization,
covariance-based filtering

Pre-processing:
Image analysis, missing
value imputation, multiple
single study normalization
methods, dimensionality
reduction, ID converter

Pre-processing:
Image analysis, single
study normalization,
missing value imputation,
dimensionality reduction,
advanced data selection

Analysis:
Classification, Clustering,
Gene selection, GSEA,
PCA, ICA, Co-expression
analysis, PPI-topology
analysis, Ensembles/Cons.

Analysis:
Classification, Clustering,
Gene selection, GSEA,
PCA, CGH arrays, Tissue
mining,Text mining, TFbinding site prediction

Analysis:
Clustering, Gene selection,
PCA, Co-expression
analysis (different from
ArrayMining), COA,
Similarity search

Usability/features:
PDF-reports, sortable
ranking tables, data annotation, 2D/3D plots, e-mail
notification, video tutorials

Usability/features:
special tree visualization
(Caat, SotaTree, Newick
Trees), 2D plots, data
annotation (Babelomics),

Usability/features:
Excel export, XML queries,
2D plots, data annotation
(GO, chromosome
location)

Conclusions
•

Example applications of (simple) network-based scoring methodology

•

Illustrate the utility of the approach for identifying novel functional associations between
gene/protein sets

•

Reflect known direct and indirect molecular interactions between their members rather than only
the size of their overlap.

•

Intelligent Visualisation of Networks Communities Possible

B.J.W. Dekkers, S. Pearce, R.P. van Bolderen-Veldkamp, A. Marshall, P. Widera, J. Gilbert, H.G. Drost, G.W. Bassel, K. Muller, J.R. King,
A.T. Wood, I. Grosse, M. Quint, N. Krasnogor, G. Leubner-Metzger, M.J. Holdsworth, and L. Bentsink. Transcriptional dynamics of two seed
compartments with opposing roles in arabidopsis seed germination. Plant Physiology, (to appear, first published online):113.223511, 2013
E. Glaab, J. Bacardit, J.M. Garibaldi, and N. Krasnogor. Using rule-based machine learning for candidate disease gene prioritization and sample
classification of cancer gene expression data. PLoS One, 7(7):e39932, 2012

E. Glaab, A. Baudot, N. Krasnogor, R.Schneider, and A. Valencia. Enrichnet: network-based gene set enrichment analysis. Bioinformatics,
2012. This paper was also accepted as a full oral presentation at the 2012 European Conference on Computational Biology.
G.W. Bassel, H. Lanc, E. Glaa, D.J. Gibbs, T. Gerjets, N. Krasnogor, A.J. Bonner, M.J. Holdsworth, and N.J. Provart. Genome-wide
network model capturing seed germination reveals coordinated rregulation of plant cellular phase transitions. Proceedings of the National
Academy of Sciences of the United States of America (PNAS), 2011.
E. Glaab, A. Baudot, N. Krasnogor, and A. Valencia. Topogsa: network topological gene set analysis. Bioinformatics, 26(9):1271, March
2010.
E. Glaab, A. Baudot, N. Krasnogor, and A. Valencia. Extending pathways and processes using molecular interaction networks to analyse
cancer genome data. BMC Bioinformatics, 11(597), 2010.
H.O. Habashy, D.G. Powe, E. Glaab, N. Krasnogor, J.M. Garibaldi, E.A. Rakha, G. Ball, A.R. Green, and I.O. Ellis. Rerg (ras-related
and oestrogen-regulated growth-inhibitor) expression in breast cancer: A marker of er-positive luminal-like subtype. Breast Cancer
Research and Treatment, (on line first):1-12, 2010.
E. Glaab, J. Garibaldi, and N. Krasnogor. Arraymining: a modular web-application for microarray analysis combining ensemble and
consensus methods with cross-study normalization. BMC Bioinformatics, 10(1)(1):358, 2009.

Availability

Freely accessible at: http://www.ncl.ac.uk/csbb/research/resources/
This work was supported by the UK’s Biotechnology and Biological Sciences Research Council [BB/F01855X/1], the Engineering and
Physical Sciences Research Council (EP/J004111/1).

Acknowledgements
Dr. Pawel Widera (Newcastle University, UK)
Dr. Jaume Bacardit (Newcastle University, UK)
Dr. Enrico Glaab (University of Luxembourg, Luxemburg )
Dr. Anais Baudot (Unversite d’Aix-Marseille, France)
Prof. Reinhard Schneider (University of Luxembourg, Luxemburg )
Prof. Alfonso Valencia (CNIO, Spain )
Prof. Mike Holdsworth (University of Nottingham, UK)
Prof. Graham Seymour (University of Nottingham, UK)
Prof. Doron Lancet (Weizmann Institute of Science, Israel)
Dr. Marylin Safran (Weizmann Institute of Science, Israel)

Integrative Networks Centric Bioinformatics

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Integrative Networks Centric Bioinformatics

Ähnlich wie Integrative Networks Centric Bioinformatics (20)

Mehr von Natalio Krasnogor

Mehr von Natalio Krasnogor (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Integrative Networks Centric Bioinformatics

Hinweis der Redaktion