Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Integrative Networks Centric Bioinformatics
1. Integrative & Network Analysis of
Transcriptomics and Proteomics Data
Prof. N. Krasnogor
Biology, Neuroscience and Computing (BNC) Research Group
School of Computing Science
Centre for Synthetic Biology and Bioexploitation
Newcastle University
E-mail:
Twitter:
Skype:
Natalio.Krasnogor@Newcastle.ac.uk
@Nkrasnogor
Natalio.Krasnogor
2. Outline
•
Introduction: goals and data sets
•
ArrayMining.net: tool set for microarray
analysis.
Part I
•
TopoGSA: topological analysis of genes /
proteins networks.
•
PathExpand: inference of missing pathways
links.
•
Enrichnet: network-based enrichment.
Part II
•
Visualisation & Exploration: networks
surfing.
Part III
Gibson G (2003) Microarray Analysis.
PLoS Biol 1(1): e15.
doi:10.1371/journal.pbio.0000015
3. Introduction
•
Typical problem in biosciences: How to make effective use of
multiple, large-scale data sources?
•
Typical problem in computer science: How to exploit the
strengths of different algorithms?
GOAL: Develop new (& existing) methods combining diverse
data sources and algorithms
4. Reference data set
Armstrong et al. Leukemia data set
72 samples and 12,626 genes
•
3 leukemia sub-types: ALL (24), AML (28),
MLL (20)
•
Heat map: 30 most differentially
expressed genes vs. samples
Normalisation: Variance Stabilizing
Normalisation (Huber et al., 2002)
•
samples
Platform: Affymetrix UV95A oligonucleotide
array
•
genes
•
Thresholding/Filtering steps: see
Armstrong et al. (2001, Nat. Genet.)
•
Public access to data set:
http://www.broadinstitute.org/cgi-bin/cancer/da
5. Main data set
QMC breast cancer microarray data set
Pre-normalized data (log-scale, min: 4.9,
max: 13.3)
•
128 samples and 47,293 genes
•
3 tumour grades: 1 (33), 2 (52), 3 (43)
•
grade1
Platform: Illumina Sentrix Human-6
BeadChips
•
genes
•
Probe level data analysis: Bioconductor
beadarray package
•
Public access to data set:
http://www.ebi.ac.uk/microarray-as/ae
accession number: E-TABM-576
grade 3
Heat map: 30 most differentially
expressed genes vs. samples
(grade 1 and grade 3)
6. Breast cancer data - difficulties
Breast cancer outcome is hard to predict:
Van‘t Veer et al.
Alon et al.
Golub et al.
Large degree of class-overlap in Breast cancer microarray data, whereas
Leukemia decision boundaries are easy to find (Blazadonakis, 2009).
7. Data Fusion
Other biological data sources used:
Breast cancer microarray data:
• additional public
data sets: Huang
et al., Veer et al.
• pre-processing:
GC-RMA
Cellular pathway data:
• obtained from GO,
BioCarta, Reactome,
KEGG and InterPro
• total: approx. 3000
pathways (size > 10)
Protein interaction data:
• unweighted binary
interactions (MIPS,
DIP, BIND, HPRD,
IntAct - only human)
• 9392 nodes,
38857 edges
Cancer gene sets:
• mutated genes in
different human
cancer types
(Breast, Liver,...)
• 30 gene sets of
size > 10 genes
10. Web-tool: ArrayMining.net
What is ArrayMining.net?
ArrayMinining.net is an online microarray
analysis tool set integrating multiple data
sources and algorithms.
www.arraymining.org
Goal: A “swiss knife“ for
microarray analysis
tasks
6 analysis modules:
1. Gene selection
2. Sample clustering
classical
3. Sample classification
4. Gene Set Analysis
new
5. Gene Network Analysis
6. Cross-Study Normalization
12. ArrayMining.net: Gene selection
Gene selection module
• Applies supervised feature selection algorithms (CFS, eBayes, SAM, etc.)
• Compares multiple algorithms or combines them into an ensemble
• Example: ENSEMBLE feature selection for Armstrong et al. (2001) dataset:
Affymetrix ID
Gene symbol
Gene descriptions – source:
F-statistic
32847_at
MYLK
myosin, light polypeptide kinase
159.59
1389_at
MME
membrane metallo-endopeptidase (neutral endopeptidase, enkephalinase)
137.53
35164_at
WFS1
wolfram syndrome 1 (wolframin)
36239_at
POU2AF1
pou domain, class 2, associating factor 1
116.75
1325_at
SMAD1
smad, mothers against dpp homolog 1 (drosophila)
110.37
963_at
LIG4
ligase iv, dna, atp-dependent
89.77
34168_at
DNTT
deoxynucleotidyltransferase, terminal
89.31
40570_at
FOXO1
forkhead box o1a (rhabdomyosarcoma)
86.89
33412_at
LGALS1
lectin, galactoside-binding, soluble, 1 (galectin 1)
81.31
previously identified by Armstrong et al.
128
newly identified
13. ArrayMining.net: Gene selection
Gene selection module (2): Armstrong et al. dataset
• Automatic generation of box
plots with gene and sample
class annotations
• The first row shows the box
plots for the two best-ranked
newly identified genes in the
Armstrong et al. dataset ()
• The second row shows two
top-ranked previously identified genes ()
• The user can easily compare
and combine the results
from different selection
methods
15. ArrayMining.net: Examples
Further examples: 3D-ICA and Co-Expression analysis
Sample space:
Gene space:
ALL
AML
MLL
3D Independent Component Analysis plot (left) and the largest connected components
from a gene co-expression network (right) for the Armstrong et al. dataset
16. ArrayMining.net: In-house data
Apply the tools on new data: QMC Breast cancer data
Expression levels across 3 tumour grades:
STK6
KIF2C
Heat map: 50 most significant genes
MYBL2
AURKb
Box plot: 4 most significant genes
17. ArrayMining.net: QMC dataset
QMC Breast cancer data set – selected genes
• all top-ranked genes are known or likely to be involved in breast cancer
• the selection is robust with regard to cross-validation cycles and algorithms
PC (gene vs. outcome):
Fold
Change
ESTROGEN RECEPTOR 1
-0.75
0.16
1.6e-20 (1.)
RAS-LIKE, ESTROGEN-REGULATED,
GROWTH INHIBITOR
-0.66
0.46
5.3e-14 (2.)
WD REPEAT DOMAIN 19
-0.66
0.73
1.2e-13 (3.)
CARBONIC ANHYDRASE XII
-0.65
0.28
2.7e-13 (4.)
ARP3 ACTIN-RELATED PROTEIN 3
HOMOLOG (YEAST)
0.64
1.37
9.6e-13 (5.)
TETRATRICOPEPTIDE REPEAT DOMAIN
8
-0.63
0.82
2.2e-12 (6.)
BREAST CANCER MEMBRANE PROTEIN
11
-0.62
0.24
7.1e-12 (7.)
Gene name
Q-value
(Rank)
19. ArrayMining.net: Example
ArrayMining - Class Discovery Analysis module:
•
Motiviation:
Exploiting the synergies between partition-based and hierarchical clustering
algorithms
•
Approach:
Consensus clustering based on the agreement of clustering results for pairs
of objects (details on next slide).
- equivalent to median partition problem (NP-complete)
- Simulated Annealing (SA) has been shown to provide good solutions
•
Our solution:
- Compare SA (Aarts et al. cooling scheme) with thermodynamic SA (TSA)
and fast SA (FSA) FSA provides fastest convergence
- Initialization: Input clustering with highest agreement to other inputs
20. ArrayMining.net: Consensus Clustering
ArrayMining‘s consensus clustering approach:
Clustering Agreement:= No. of times pairs of samples are assigned to the
same cluster across all input clusterings
Idea: Reward objects in the same cluster, if they have a high agreement.
Agreement matrix:
all clusterings for
samples i and j
Fitness function:
α:= (max(A)+min(A))/2
Sample 1
Aij := #agreements across
Sample 2
21. Clustering Methods and Validity Indices
•
Sample clustering methods: 8 different methods considered:
- partition-based: k-Means, PAM, CLARA, SOM, SOTA
- hierarchical: AL-HCL, DIANA, AGNES
•
Scoring & number of clusters selection: 5 validity indices / splitting rules used:
- Silhouette width, Calinski-Harabasz, Dunn, C-index, knn-Connectivity.
Example: Silhouette width
a(i) = avg. distance of obj(i) to all others
in the same cluster
b(i) = avg. distance of obj(i) to all
others in closest distinct cluster
- good validity indices should have: no multivariate normality assumptions (Gower,
1981), small or no bias (Milligan & Cooper, 1985)
•
Standardization: classical (mean 0, stddev. 1) or median absolute deviation
23. Consensus Clustering: example
Example application: QMC breast cancer dataset
•
Separate sub-classes in 84 luminal samples with consensus clustering
•
Input algorithms: k-Means, SOM, SOTA, PAM, HCL, DIANA, HYBRID-HCL
low confidence
(silhouette widths)
best separation
for two clusters
24. External Validation
Clustering results – external validation (tumour grades)
Measure similarity of
clusterings with the rand
index R:
a, b, c and d are the #pairs
of objects assigned to:
- the same cluster in
both clusterings (a)
Reference
clustering: 3 tumour
grades (low,
medium, high)
Random model
Single clustering
Consensus
- different clusters in
both clusterings (b)
- the same cluster in
clustering 1/2 and
different clusters in
clustering 2/1 (c/d)
- Corrected for chance:
adjusted rand index
10000 random
clusterings
26. ArrayMining.net: Genes Set Analysis
Extension: Gene set analysis
•
•
•
Expression levels for a single gene are often unreliable
Similar genes might contain complementary information
We want to integrate functional annotation data
samples
pathways
Gene Set Analysis (GSA):
1) Identify sets of functionally
similar genes (GO, KEGG, etc.)
2) Summarize gene sets to “Meta”genes (PCA, MDS, etc.)
3) Apply statistical analysis
(example: Van Andel institute cancer gene sets)
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide
expression profiles. Subramanian et al. PNAS October 25, 2005 vol. 102 no. 43 15545–15550
27. ArrayMining.net: Examples
Gene Set Analysis module – example analysis
we apply the Gene Set Analysis module to the Armstrong et al. dataset
• with known cancer gene sets the class separation is better than for single genes
•
Heat map for the Armstrong et al. dataset based on pathway meta-genes
28. Consensus Clustering: example (2)
Combine consensus clustering with gene set analysis
•
Map genes onto Gene Ontology (GO), reduce dimensionality (MDS)
•
Apply same consensus clustering as before on GO-based „meta-genes“
~3 times higher
confidence
better separation
30. Interim Summary
ArrayMining Integrative Clustering - Summary
•
Consensus clustering (CC) results tend to be similar to or slightly better than the best single clusterings
in terms of adj. rand index and validity indices (but longer runtime)
•
The input clusterings should include diverse methods and exclude similar methods
•
Using gene sets (GS) representing cellular pathways instead of single genes results in better cluster
separation, adj. rand indices and validity indices (annotation data required)
GS & CC provide improved results, but:
longer runtimes + annotation data required
32. Networks Biology
Describe network strategies to identify and prioritize gene sets or
Describe network strategies to identify and prioritize gene sets or
functional associations (arising in high throughput experiments)
functional associations (arising in high throughput experiments)
between aagenes/proteins set of interest (target set) and annotated
between genes/proteins set of interest (target set) and annotated
genes/proteins sets (reference set)
genes/proteins sets (reference set)
I will describe some of the functionalities and methods in:
Freely accessible at:
http://www.ncl.ac.uk/csbb/research/resources/
InIn collaboration with:
collaboration with:
Dr. Pawel Widera (Newcastle
Dr. Pawel Widera (Newcastle
University)
University)
Dr. Enrico Glaab (University ofof
Dr. Enrico Glaab (University
Luxembourg ) )
Luxembourg
Dr. Anais Baudot (Unversite d’AixDr. Anais Baudot (Unversite d’AixMarseille)
Marseille)
Prof. Reinhard Schneider (University
Prof. Reinhard Schneider (University
ofof Luxembourg )
Luxembourg )
Prof. Alfonso Valencia (CNIO ) )
Prof. Alfonso Valencia (CNIO
34. TopoGSA: Network topological
analysis of gene sets
What is TopoGSA?
TopoGSA is a web-application mapping
gene sets onto a comprehensive human
protein interaction network and analysing
their network topological properties.
Two types of analysis:
1. Compare genes within a gene set:
e.g. up- vs. down-regulated genes
2. Compare a gene set against a
database of known gene sets
(e.g. KEGG, BioCarta, GO)
35. TopoGSA - Methods
TopoGSA computes topological properties for an uploaded gene set
TopoGSA computes topological properties for an uploaded gene set
and matched-size random gene sets
and matched-size random gene sets
•
the degree of each node in the gene set
•
the local clustering coefficient Ci for each node vi in the gene set:
where ki is the degree of vi and ejk is the edge between vj and vk
•
the shortest path length between pairs of nodes vi and vj in the gene set
•
the node betweenness B(v) for each node v in the gene set:
here σst(v) is the number of shortest paths from s to t passing through v
•
the eigenvector centrality for each node in the gene set
36. Human PPI network MIPS, DIP, BIND, HPRD and InAct
Human PPI network MIPS, DIP, BIND, HPRD and InAct
9393 proteins and 38857 interactions
9393 proteins and 38857 interactions
Also for yeast, fly, worm and arabidopsis
Also for yeast, fly, worm and arabidopsis
LEGEND:
Mean node
betweenness
• Cellular processes
• Environmental information processing
• Genetic information processing
• Human diseases
• Metabolism
• Cancer genes
General results:
tering
Mean clus
coefficient
Me
a
pa n sho
th
len rtest
gth
• Metabolic pathways
have high shortest path
lenghts and low betweenness
• Disease pathways and
cancer gene sets tend to
have high betweenness
and small shortest path
lenghts
37. Main data set
QMC breast cancer microarray data set
Pre-normalized data (log-scale, min: 4.9,
max: 13.3)
•
128 samples and 47,293 genes
•
3 tumour grades: 1 (33), 2 (52), 3 (43)
•
grade1
Platform: Illumina Sentrix Human-6
BeadChips
•
genes
•
Probe level data analysis: Bioconductor
beadarray package
•
Public access to data set:
http://www.ebi.ac.uk/microarray-as/ae
accession number: E-TABM-576
grade 3
Heat map: 30 most differentially
expressed genes vs. samples
(grade 1 and grade 3)
38. ArrayMining.net: In-house data
Apply the tools on new data: QMC Breast cancer data
www.arraymining.net
www.topogsa.net
Expression levels across 3 tumour grades:
STK6
KIF2C
Heat map: 50 most significant genes
MYBL2
AURKb
Box plot: 4 most significant genes
39. ArrayMining.net: QMC dataset
QMC Breast cancer data set – selected genes
• all top-ranked genes are known or likely to be involved in breast cancer
• the selection is robust with regard to cross-validation cycles and algorithms
PC (gene vs. outcome):
Fold
Change
ESTROGEN RECEPTOR 1
-0.75
0.16
1.6e-20 (1.)
RAS-LIKE, ESTROGEN-REGULATED,
GROWTH INHIBITOR
-0.66
0.46
5.3e-14 (2.)
WD REPEAT DOMAIN 19
-0.66
0.73
1.2e-13 (3.)
CARBONIC ANHYDRASE XII
-0.65
0.28
2.7e-13 (4.)
ARP3 ACTIN-RELATED PROTEIN 3
HOMOLOG (YEAST)
0.64
1.37
9.6e-13 (5.)
TETRATRICOPEPTIDE REPEAT DOMAIN
8
-0.63
0.82
2.2e-12 (6.)
BREAST CANCER MEMBRANE PROTEIN
11
-0.62
0.24
7.1e-12 (7.)
Gene name
Q-value
(Rank)
40. ArrayMining TopoGSA
Topological analysis of Selected Genes
• Results of within-gene-set comparison:
Estrogen receptor 1 gene and apoptosis regulator Bcl2, both up-regulated
in luminal samples, have outstanding network topological properties (higher
betweenness, higher degree, higher centrality) in comparison to other genes.
• Results of comparison against reference databases:
- Metabolic KEGG pathways are most similar to the uploaded gene set in
terms of network topological properties.
- Most similar BioCarta pathways: Cytokine, Differentiation and inflammatory
pathways.
41. Real-world application of tools sets
ArrayMining identifies RERG as a tumour marker
• RERG (Ras-related and oestrogen-regulated growth-inhibitor) was identified
as a new candidate marker of ER-positive luminal-like breast cancer subtype
• Validation using immunohistochemistry on Tissue Microarrays containing
1,140 invasive breast cancers confirmed RERG‘s utility as a marker gene
TMAs of invasive breast cancer
show strong RERG expression
42. RERG Protein Expression VS BCSS & DMFI
H.O. Habashy, D.G. Powe, E. Glaab, N. Krasnogor, J.M. Garibaldi, E.A. Rakha, G. Ball, A.R.
Green, and I.O. Ellis. Rerg (ras-related and oestrogen-regulated growth-inhibitor) expression
in breast cancer: A marker of er-positive luminal-like subtype. Breast Cancer Research and
Treatment, (on line first):1-12, 2010.
1.0
0.8
Negative RERG expression
Cumulative Survival
Without adjuvant treatment
Cumulative Survival
Positive RERG expression
0.6
1.0
0.8
0.6
p=0.002
0.4
0
50
100
150
200
250
0
50
BCSS in months
1.0
0.8
200
250
0.6
Negative RERG expression
p= 0.007
0.4
0
50
100
150
200
DMFI in months
Kaplan Meier plot of RERG
protein expression with respect
to BCSS in ER+ only
250
Positive RERG expression
Cumulative Survival
Without Tamoxifen treatment
Cumulative Survival
150
1.0
Positive RERG expression
Kaplan Meier plot of RERG
protein expression with respect
to BCSS in ER+ U ER- cohort
100
BCSS in months
0.8
Negative RERG expression
0.6
p=0.027
0
50
100
150
BCSS in months
200
250
43. PathExpand: Expanding
pathways and cellular processes
• Utilize same PPI network as in TopoGSA; derived from
– MIPS, DIP, MINT, HPRD and IntAct
– only experimental evidence of binary PPI
– final protein interaction network contained 9392 proteins
(nodes) and 38857 interactions (edges)
• Process Mapping:
– KEGG, Biocarta and Reactome were mapped
– Only 60% of pathways members existed in PPI Network
45. ••
••
Our procedure extended 159 pathways from
Our procedure extended 159 pathways from
BioCarta, 90 from KEGG and 52 from
BioCarta, 90 from KEGG and 52 from
Reactome.
Reactome.
The pathway sizes increased on average from
The pathway sizes increased on average from
113% to 126% of the original size.
113% to 126% of the original size.
Validation shows:
1)The proteins added are well connected and central in the protein interaction
network
2)The added proteins display gene ontology annotations matching better to the
original cellular pathway/process annotations than random proteins
3)Are enriched in processes known to be related to cellular signalling
4)Our method is able to recover known cellular pathway/process proteins in a
cross-validation experiment
Added cancer gene
Example case: BioCarta BTG family proteins and cell cycle regulation
Black: Original pathway nodes – Green: Nodes added based on connectivity
46. Pathway enlargment – Example 1
Example: Alzheimer disease pathway
•More than 20 proteins
annotated in our
PPIN
•5 added proteins by the
extension process
•3 known disease
associated
•2 candidates: METTL2B,
TMED10
48. Network-based Functional
Association Ranking
Describe network strategies to identify and prioritize
Describe network strategies to identify and prioritize
functional associations (arising in high throughput
functional associations (arising in high throughput
experiments) between a genes/proteins set of interest
experiments) between a genes/proteins set of interest
(target set) and annotated genes/proteins sets (reference
(target set) and annotated genes/proteins sets (reference
set)
set)
49. 2) Feed
1) Produce
Reference Set
Reference Set
11
Target
Target
Set
Set
3) Filter/Compare/
Overlap/Etc
4) Transfer
Reference Set
Reference Set
nn
5) New Hypothesis/Experiments/Insights
Multiple Approaches for “Enrichment Analysis”:
•Over-representation analysis (ORA)
•Gene set enrichment analysis (GSEA)
•Integrative and modular enrichment analysis (MEA)
Huang, D. et al., Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of
large gene lists. Nucleic Acids Res., 37 (1), 2009
50. Key limitations of previous
“Enrichment Analysis”:
•ORA techniques often have low discriminative power with scores varying
widely with small changes in the overlap size.
•Functional information captured in the graph structure of a molecular
interaction network connecting the gene/protein sets of interest is
disregarded.
•Genes and proteins in the network neighborhood, in particular those with
missing annotations, are not taken into account.
•The recognition of tissue-specific gene/protein set associations is often
statistically infeasible.
52. Key idea behind EnrichNet
Gene1
Gene1
..
..
..
GeneN
GeneN
Target Set
Target Set
Gene1
Gene1
Target Set . .
Target Set
..
..
GeneN
GeneN
Gene1
Gene1
..
..
..
GeneN
GeneN
∩
∪
Protein1
Protein1
..
..
..
ProteinM
ProteinM
Gene1
Gene1
..
..
Reference Set 1 1 Reference Set 1 1
Reference Set
Reference Set
..
GeneN
GeneN
∪
Protein1
Protein1
..
..
Reference Set nn Reference Set n n
Reference Set
Reference Set
..
ProteinM
ProteinM
53. What does it do?
Input:
– 10 or more human gene or protein identifiers
– Selection of reference database, e.g., KEGG, BioCarta, WikiPathways, Reactome,
PID, Interpro, GO.
Processing:
–
Maps target and reference sets to genome scale molecular interaction networks
•
Two default ones available
•
User can provide her/his own
– RWR to calculate distance between target and reference sets mapped into the large
network
– Comparison of these scores against a background model
Output:
– Ranking table of reference dataset
•
cellular pathways, processes and complexes
– Network and Tissue (60) specific scores
– For each pathway interactive visualization of its embedding network
– Zoom, search, highlight, retrieve annotation/topological data
54. How does it work?
Connected human interactome graph
Connected human interactome graph
derived from STRING 9.0 database
derived from STRING 9.0 database
Edges weighted by the STRING
Edges weighted by the STRING
combined confidence score
combined confidence score
normalized to range [0,1])
normalized to range [0,1])
Distances (~F(Pt)) from target nodes
Distances (~F(Pt)) from target nodes
to reference ones is calculated by aa
to reference ones is calculated by
Random Walk with Restart (RWR)
Random Walk with Restart (RWR)
procedure.
procedure.
These distances are then linked to aa
These distances are then linked to
background model…
background model…
55. How does it work?
Distances are related to aa
Distances are related to
background model via:
background model via:
• Distance dependent weighting factor also accounts for the supposition that an over-representation of
small distance scores is more likely to reflect strong associations than an over-representation of large
distance scores.
• Classical statistical tests for comparing differences in the centre or shape of two distributions, e.g. the
Mann–Whitney U-test or the Kolmogorov–Smirnov test, are not applicable as they lack a distancedependent weighting.
• Randomly matched-size gene sets do not provide an adequate background model, since their members
can only have similar connectivity properties as pathway-representing gene sets if they are allowed to
significantly overlap with real pathways in the network.
• Tissue specific scores are easily computable by only considering distance scores to nodes
labeled with the tissue in question
56. Results: EnrichNet scores compared to over-representation
analysis scores
•
Assessed EnrichNet on two gene
sets representing different tumour
types:
•
genes mutated in bladder
and gastric cancer
•
genes associated with
Parkinson’s disease
•
Compared the results with a
conventional ORA on all pathway
databases.
absolute Pearson correlations between
0.50 and 0.95 for the different
Gene set pairs with equal ORA scores can be
datasets compared
differentiated using their Xd-distances
Datasets that share none of their genes/proteins with a pathway of interest (hence cannot be scored with ORA) and those with large
overlap- sizes (hence very similar ORA scores) mostly receive different Xd-scores
Thus enables a more sensitive and comprehensive ranking of gene set pairs
57. Results: Xd-score ranking for top 20 functional associations between genes
mutated in gastric cancer and pathways in the BioCarta database
Datasets that share none of their genes/proteins with a pathway of interest (hence cannot be scored with ORA) and those with large overlap
sizes (hence very similar ORA scores) mostly receive different Xd-scores
Thus enables a more sensitive and comprehensive ranking of gene set pairs
58. Results: Comparative validation on benchmark gene
expression data
EnrichNet and ORA scores
(Fisher’s exact test) across
five (reference) microarray
gene expression datasets X
two (target) gene set
collections
59. Results: Identification of novel functional associations
The network-based Xd-score ranking identifies several new associations
missed by the classical approach (ORA)
We use dataset examples pairs such that:
•target gene sets are all mutated in different diseases without additionally
available expression level data they cannot be analyzed with “expressionaware” gene set enrichment analysis techniques
•zero or insignificant overlap size receive low ORA scores (Q-values> 0.05)
but Xd-scores above the significance threshold obtained from the linear
regression fit
•these results point to functional associations that reflect dense networks of
60. Results: Identification of novel functional associations
•
Largest connected component for the network structure obtained when
comparing the gastric cancer mutated gene set against the pathway
Role of Erk5 (Extracellular signal-related kinase 5) in Neuronal Survival
(h_erk5Pathway) from the BioCarta database, describing a signalling
cascade which induces transcriptional events promoting neuronal survival.
•
These datasets have an intersection of only three genes (HRAS, NRAS &
KRAS) and would therefore not have been considered as significantly
associated ORA.
•
The network-based Xd-score (0.26, which is a significant threshold VS
only 0.08 Q-value in ORA with Fisher test), highlights functional
associations.
•
These
reflect
abundance
of
molecular
interactions
between
the
corresponding proteins for these gene sets and their shared network
neighborhoods.
•
This dense network of interactions corroborates previous findings linking
extracellular signal-related kinases (ERKs) to gastric cancer via an
induction of the putative tumor suppressor gene DDMBT1 (deleted in
61. Results: Identification of novel functional associations
•
Largest connected component for the network structure obtained
when comparing two datasets, bladder cancer mutated genes
and the genes for the Gene Ontology (GO) term ‘tyrosine
phosphorylation of Stat3
•
These share only a single gene (NF2) no association can be
inferred from ORA
•
The high Xd-score for this gene set pair (0.80) points to a
functional
association
via
multiple
connecting
molecular
interactions, which is confirmed by the visualization.
•
In agreement with the previous observation that the downregulation of STAT3 phosphorylation by means of silencing the
Rho GTPase CDC42 is linked to the suppression of tumour
growth in bladder cancer.
•
Rho GTPases like CDC42 are known to frequently participate in
carcinogenic processes
•
Their involvement in bladder cancer is also reflected by a high
Xd-score of 0.71 for the GO biological process ‘regulation of Rho
GTPase activity’ (GO:0032319), which also shares only one
62. Results: Identification of novel functional associations
c)
•
Largest connected component for the network structure
obtained
when
comparing
two
datasets,
genes
implicated in Parkinson’s disease (PD) and the
‘regulation of interleukin-6 biosynthetic process’ from
the Gene Ontology database
•
A significant XD-score (0.77, significance threshold: 0.73)
even when sharing only one gene (IL1B)
•
The corresponding sub-network reveals a dense cluster
of interactions that interlink the PD gene set with the
interleukin-6 pathway.
•
This corroborates previously identified links between PD
and inflammation and reports of elevated levels of
interleukin-6 in the cerebrospinal fluid of PD patients
63. Results: Evaluation of the tissue specificity of gene set
associations
The network-based Xd-score is capable of computing tissue-specific
association scores
•
Tissue-specific analyses can in principle be realized with enrichment analysis techniques other than
EnrichNet
•
In practice, however, is often infeasible for conventional ORA methods the subset of genes with
available tissue-specific annotations is too small to obtain reliable over-representation statistics.
•
EnrichNet alleviates this limitation of ORA approaches by taking tissue specificity annotations into account
from all non-overlapping gene/protein pairs that are connected through paths of interactions in a
molecular network.
•
Example: we took a set of genes with known implications in PD and measured the tissue-specific
associations with the high-scoring KEGG ‘Neurodegenerative Diseases’ (hsa01510) pathway high Xdscores were over-represented in the group of brain tissues, whereas the centre of the Xd- score
65. FruitNet
Main question: what is
the regulation
mechanism?
-measure gene expression
over time
-correlate expression
profiles
-connect genes if ρ >
threshold
The Tomato Genome Consortium, "The tomato
genome sequence provides insights into fleshy fruit
evolution", Nature 485, 635-641 (31 May 2012),
doi:10.1038/nature11119
65
68. •
FruitNet [in prep]: work in progress, interactions in
tomato fruit development
•
SeedNet [1]:
interactions in dormant and
germinating Arabidopsis seeds
•
SCoPNet [2]:
predicted network of
interactions in germinating and dormant Arabidopsis
seeds generated with BioHELL
•
EndoNet [3]:
interactions in the micropylar
endosperm of germinating Arabidopsis seed
•
RadNet [3]:
interactions in the lower axis of
germinating Arabidopsis seeds
[1] G.W. Bassel, H. Lan, E. Glaab, D.J. Gibbs, T. Gerjets, N. Krasnogor, A.J. Bonner, M.J. Holdsworth, N.J. Provart,
"Genome-wide network model capturing seed germination reveals coordinated regulation of plant cellular phase transitions",
PNAS, 108(23):9709-9714, June 2011 doi:10.1073/pnas.1100958108
[2] G.W. Bassel, E. Glaab, J. Marquez, M.J. Holdsworth, J. Bacardit, "Functional Network Construction in Arabidopsis Using
Rule-Based Machine Learning on Large-Scale Data Sets", The Plant Cell, 23(9):3101-3116, September 2011
doi:10.1105/tpc.111.088153
[3] B.J.W. Dekkers, S. Pearce, R.P. van Bolderen-Veldkamp, A. Marshall, P. Widera, J. Gilbert, H. Drost, G.W. Bassel, K.
Muller, J.R. King, A.T.A. Wood, I. Grosse, M. Quint, N. Krasnogor, G. Leubner-Metzger, M.J. Holdsworth, L. Bentsink,
"Transcriptional Dynamics of Two Seed Compartments with Opposing Roles in Arabidopsis Seed Germination”, PLANT
PHYSIOLOGY, 163(1):205-215, September 2013 doi:10.1104/pp.113.223511
68
69. Conclusions
•
Combining algorithms in a sequential and/or parallel fashion can
provide performance improvements and new biological insights
•
Microarray and gene set analysis tasks can be interlinked flexibly
in an (almost) completely automated process
[www.ArrayMining.Net]
•
New analysis types like network-based topology analysis and coexpression analysis complement existing tools [www.
{TopoGSA,PathExpan,EnrichNet}.net]
•
In the case of BC it allowed us to identify candidate genes to
characterise ER+ luminal-like BC.
– RERG gene is a key marker of the luminal BC class
– It can be used to separate distinct prognostic subgroups
70. Conclusions(I): Feature comparison with similar tools
ArrayMining &
TopoGSA
GEPAS
(Tarraga et al.)
Expression Profiler
(Kapushesky et al.)
Pre-processing:
Image analysis, single- and
dimensionality reduction,
gene name normalization,
cross-study normalization,
covariance-based filtering
Pre-processing:
Image analysis, missing
value imputation, multiple
single study normalization
methods, dimensionality
reduction, ID converter
Pre-processing:
Image analysis, single
study normalization,
missing value imputation,
dimensionality reduction,
advanced data selection
Analysis:
Classification, Clustering,
Gene selection, GSEA,
PCA, ICA, Co-expression
analysis, PPI-topology
analysis, Ensembles/Cons.
Analysis:
Classification, Clustering,
Gene selection, GSEA,
PCA, CGH arrays, Tissue
mining,Text mining, TFbinding site prediction
Analysis:
Clustering, Gene selection,
PCA, Co-expression
analysis (different from
ArrayMining), COA,
Similarity search
Usability/features:
PDF-reports, sortable
ranking tables, data annotation, 2D/3D plots, e-mail
notification, video tutorials
Usability/features:
special tree visualization
(Caat, SotaTree, Newick
Trees), 2D plots, data
annotation (Babelomics),
Usability/features:
Excel export, XML queries,
2D plots, data annotation
(GO, chromosome
location)
71. Conclusions
•
Example applications of (simple) network-based scoring methodology
•
Illustrate the utility of the approach for identifying novel functional associations between
gene/protein sets
•
Reflect known direct and indirect molecular interactions between their members rather than only
the size of their overlap.
•
Intelligent Visualisation of Networks Communities Possible
B.J.W. Dekkers, S. Pearce, R.P. van Bolderen-Veldkamp, A. Marshall, P. Widera, J. Gilbert, H.G. Drost, G.W. Bassel, K. Muller, J.R. King,
A.T. Wood, I. Grosse, M. Quint, N. Krasnogor, G. Leubner-Metzger, M.J. Holdsworth, and L. Bentsink. Transcriptional dynamics of two seed
compartments with opposing roles in arabidopsis seed germination. Plant Physiology, (to appear, first published online):113.223511, 2013
E. Glaab, J. Bacardit, J.M. Garibaldi, and N. Krasnogor. Using rule-based machine learning for candidate disease gene prioritization and sample
classification of cancer gene expression data. PLoS One, 7(7):e39932, 2012
E. Glaab, A. Baudot, N. Krasnogor, R.Schneider, and A. Valencia. Enrichnet: network-based gene set enrichment analysis. Bioinformatics,
2012. This paper was also accepted as a full oral presentation at the 2012 European Conference on Computational Biology.
G.W. Bassel, H. Lanc, E. Glaa, D.J. Gibbs, T. Gerjets, N. Krasnogor, A.J. Bonner, M.J. Holdsworth, and N.J. Provart. Genome-wide
network model capturing seed germination reveals coordinated rregulation of plant cellular phase transitions. Proceedings of the National
Academy of Sciences of the United States of America (PNAS), 2011.
E. Glaab, A. Baudot, N. Krasnogor, and A. Valencia. Topogsa: network topological gene set analysis. Bioinformatics, 26(9):1271, March
2010.
E. Glaab, A. Baudot, N. Krasnogor, and A. Valencia. Extending pathways and processes using molecular interaction networks to analyse
cancer genome data. BMC Bioinformatics, 11(597), 2010.
H.O. Habashy, D.G. Powe, E. Glaab, N. Krasnogor, J.M. Garibaldi, E.A. Rakha, G. Ball, A.R. Green, and I.O. Ellis. Rerg (ras-related
and oestrogen-regulated growth-inhibitor) expression in breast cancer: A marker of er-positive luminal-like subtype. Breast Cancer
Research and Treatment, (on line first):1-12, 2010.
E. Glaab, J. Garibaldi, and N. Krasnogor. Arraymining: a modular web-application for microarray analysis combining ensemble and
consensus methods with cross-study normalization. BMC Bioinformatics, 10(1)(1):358, 2009.
72. Availability
Freely accessible at: http://www.ncl.ac.uk/csbb/research/resources/
This work was supported by the UK’s Biotechnology and Biological Sciences Research Council [BB/F01855X/1], the Engineering and
Physical Sciences Research Council (EP/J004111/1).
73. Acknowledgements
Dr. Pawel Widera (Newcastle University, UK)
Dr. Jaume Bacardit (Newcastle University, UK)
Dr. Enrico Glaab (University of Luxembourg, Luxemburg )
Dr. Anais Baudot (Unversite d’Aix-Marseille, France)
Prof. Reinhard Schneider (University of Luxembourg, Luxemburg )
Prof. Alfonso Valencia (CNIO, Spain )
Prof. Mike Holdsworth (University of Nottingham, UK)
Prof. Graham Seymour (University of Nottingham, UK)
Prof. Doron Lancet (Weizmann Institute of Science, Israel)
Dr. Marylin Safran (Weizmann Institute of Science, Israel)
Hinweis der Redaktion
.
Red labels represent features that are unique to only one of the tools (however, also the other features are implemented differently). More importantly, ArrayMining combines algorithms using ensembles and consensus clustering.
Abbreviations:
GSEA = gene set enrichment analysis
ICA = Independent component analysis
CGH = Comparative genomic hybridization
COA = Correspondence analysis
TF = Transcription factor
CAAT = a hierarchical cluster analyser (abbreviation not explained)
Similarity search = tool for identifying co-expressed genes, given a set of genes as a query