Visualization of Ciona Intestinalis Co-expression Network
by
Hang Zhong
A dissertation submitted in partial fulfillment of the requirements for the degree of Master of Science
Department of Biology
New York University
May, 2012
ACKNOWLEDGEMENTS
I would like to thank my advisor, Richard Bonneau, for giving me the opportunity to participate in this project and for his ongoing guidance and support. I am also indebted to Professor Lionel Christiaen for inspiring the project. This thesis could not have come to fruition without the help of Florian Razy, who offered insightful and thought-provoking input. I am also everlastingly grateful to Duncan Penfold-Brown for teaching me programming.
I would also like to thank Kieran Mace, Aviv Madar, Kevin Drew, Maximilian Haeussler and Claudia Racioppi, who so patiently offered their time and support. Many thanks to Todd Heiniger and Joel Rodriguez for revising the thesis.
Finally, I would like to thank my family for the invaluable support they have given me in the course of my life and studies.
ABSTRACT
Abnormalities of heart development cause the most frequent congenital diseases in humans. The conservation of the Gene Regulatory Network (GRN) involved in heart development, together with cellular simplicity, low genetic redundancy and a relevant evolutionary position, leads researchers to study the ascidian Ciona intestinalis. To extract useful information from the microarray data so that researchers can infer the heart network in Ciona, this thesis not only applies standard approaches to find differentially expressed genes, but also explores network-based approaches to find functional groups. By visualizing the co-expression network in Gaggle, the lists of ASM and heart candidate genes are fine-tuned. In addition, the modules containing candidate and known marker genes may deserve further study.
TABLE OF CONTENTS
ABSTRACT
1. INTRODUCTION
1.1 GENE REGULATORY NETWORK OF CARDIOGENIC PRECURSORS IN CIONA
1.2 MICROARRAY DATA ANALYSIS
1.3 NETWORK VISUALIZATION THROUGH GAGGLE
2. METHODS
2.1 MICROARRAY EXPERIMENTAL DESIGN
2.2 GENE EXPRESSION DATA
2.2.1 QUALITY CONTROL
2.2.2 PREPROCESSING
2.3 STATISTICAL TEST
2.4 CLUSTER ANALYSIS
2.5 FUNCTIONAL ENRICHMENT ANALYSIS
2.6 GENERATION OF NETWORKS
2.6.1 STRING PROTEIN NETWORK
2.6.2 UNWEIGHTED CO-EXPRESSION NETWORK
2.6.3 WEIGHTED CO-EXPRESSION NETWORK
2.7 NETWORK VISUALIZATION
2.7.1 FILE FORMAT
2.7.2 ANALYZING NETWORK BY PLUGIN IN CYTOSCAPE
3. RESULTS
3.1 DIFFERENTIAL EXPRESSION
3.1.1 EXPECTATION OF THE MICROARRAY DATA
3.1.2 ASM AND HEART CANDIDATE GENES
3.2 NETWORK VISUALIZATION IN GAGGLE
3.2.1 NETWORKS
3.2.2 FINDINGS FROM THE NETWORK VISUALIZATION IN GAGGLE
3.2.2.1 GAGGLE AS INFORMATION INTEGRATION CENTER
3.2.2.2 MODULE FROM ALLEGROMCODE
3.2.2.3 MODULE FROM WEIGHTED NETWORK
3.2.2.4 FINE-TUNED LIST
4. DISCUSSION
4.1 ASM CANDIDATE GENES
4.2 ANNOTATION IN CIONA INTESTINALIS
4.3 FUNCTIONAL RIBOSOME GROUP AND COE
4.4 TIME-SERIES
4.5 LIMITATIONS OF THE CO-EXPRESSION NETWORK
FIGURES AND TABLES
FIGURE 1 PIPELINE.
FIGURE 2 NORMALIZED UNSCALED STANDARD ERROR (NUSE).
FIGURE 3 HEAT-MAP OF ASM AND HEART CANDIDATE GENES.
FIGURE 4 OUTPUT OF THE SHORT TIME-SERIES EXPRESSION MINER.
FIGURE 5 SELECTING SOFT POWER.
FIGURE 6 CIONA INTESTINALIS WEIGHTED CO-EXPRESSION NETWORK.
FIGURE 7 MODULE SIGNIFICANCE.
FIGURE 8 INTRAMODULAR CONNECTIVITY AND MODULE SIGNIFICANCE.
FIGURE 9 STRING PROTEIN NETWORK.
FIGURE 10 LABELING IN WEIGHTED NETWORK.
FIGURE 11 THE 1ST MODULE INFERRED BY ALLEGROMCODE FOR UNWEIGHTED CO-EXPRESSION NETWORK.
FIGURE 12 THE 1ST MODULE OF UNWEIGHTED CO-EXPRESSION NETWORK ENRICHMENT.
FIGURE 13 THE 1ST MODULE INFERRED BY ALLEGROMCODE FOR WEIGHTED CO-EXPRESSION NETWORK.
FIGURE 14 THE 1ST MODULE OF WEIGHTED NETWORK ENRICHMENT.
FIGURE 15 RIBOSOME GROUP IN THE STRING.
FIGURE 16 RIBOSOME GROUP IN STRING NETWORK ENRICHMENT.
FIGURE 17 RIBOSOME GROUP AND COE.
FIGURE 18 GREY COLOR GENES.
FIGURE 19 TAN MODULE
FIGURE 20 BROWN MODULE
FIGURE 21 TURQUOISE MODULE ENRICHMENT.
FIGURE 22 GENES IN TURQUOISE PLUS STEM CONDITION.
FIGURE 23 GENES OF TURQUOISE PLUS STEM CONDITION ENRICHMENT.
FIGURE 24 SUB-GROUP OF CANDIDATE GENES IN UNWEIGHTED NETWORK.
FIGURE 25 SUB-GROUP OF CANDIDATE GENES IN UNWEIGHTED NETWORK ENRICHMENT.
FIGURE 26 ASM CANDIDATE GENES IN WEIGHTED NETWORK ENRICHMENT.
FIGURE 27 ASM AND HEART CANDIDATE GENES
REFERENCES
1. INTRODUCTION
1.1 Gene regulatory network of cardiogenic precursors in Ciona
Abnormalities of heart development cause the most frequent congenital diseases in humans. The conservation of the Gene Regulatory Network (GRN) involved in heart development, together with cellular simplicity, low genetic redundancy and a relevant evolutionary position, leads researchers to study the ascidian Ciona intestinalis (Davidson 2007). In Ciona, a single pair of blastomeres called B7.5 gives rise to the anterior tail muscle (ATM) and to the trunk ventral cells (TVC) (Figure 27). Following migration from the tail, the TVCs undergo asymmetric cell divisions at the ventral midline of the trunk. The medial TVCs give rise to the heart, while the lateral TVCs migrate toward the atrial placode, where they will form the atrial siphon muscles (ASM). Thus, the TVCs are similar to the multipotent cardio-pharyngeal progenitors found in vertebrates, while the ASM are likely equivalent to the jaw muscle in vertebrates.
A few years ago, the first cardiogenic Gene Regulatory Network (GRN) in Ciona was proposed (Christiaen, Davidson et al. 2008), decoupling genes necessary for heart specification from genes necessary for cell migration. A later study showed that ASM precursors express the transcription factor COE (Stolfi, Gainous et al. 2010), which is necessary and sufficient to specify ASM fate. Misexpression of COE in the whole TVC lineage blocks heart development and imposes an ASM fate on all cells. Conversely, misexpression of a constitutive repressor form of COE provokes the opposite phenotype, blocking ASM formation and causing all cells to form heart tissue. Using genome-wide microarray analysis to study this crucial COE gene and to find its downstream effectors is expected to give insights into the gene regulatory network of the heart.
1.2 Microarray data analysis
Most existing studies have focused on differential expression to identify genes that distinguish different sets of samples. It is quite common to apply different testing methods, such as the t-test, the F-test, or nonparametric alternatives such as the Wilcoxon test, to rank thousands of genes and select the most significant ones (Gentleman 2005). Other specific statistical methods are also commonly used in microarray data analysis, such as Significance Analysis of Microarrays (SAM) (Tusher, Tibshirani et al. 2001) and LIMMA (Wettenhall, Smyth 2004), which uses a Bayesian mixture model.
Another way of using microarray data is to understand an individual gene or protein’s network properties by studying co-expression, where genes that have similar expression patterns across a set of samples are hypothesized to have a functional relationship. This co-expression network-based approach is consistent with an important concept that has emerged over the past decade: genes and their protein products carry out cellular processes in the context of functional modules and are related (Barabasi, Bonabeau 2003, Barabasi, Oltvai 2004).
1.3 Network visualization through Gaggle
It has been well recognized that visualization plays a key role in helping to understand biological systems, particularly in the era of high-throughput studies with a wealth of ‘omics’-scale data (Gehlenborg, O'Donoghue et al. 2010). This thesis applies the simple, open-source Java software system Gaggle (Shannon, Reiss et al. 2006) for co-expression network visualization. Gaggle is a cross-platform system integrated with diverse databases (KEGG, BioCyc, and STRING) and software (Cytoscape, DataMatrixViewer, the R statistical environment, and the TIGR MultiExperiment Viewer). With four simple data types (names, matrices, networks, and associative arrays), researchers can explore many different data sources and a variety of software tools by sending this information to the Gaggle Boss, which transfers it to other tools.
2. METHODS
The pipeline of this thesis is shown in Figure 1.
2.1 Microarray experimental design
The microarray data used in this study were kindly provided by Dr. Lionel Christiaen. They consist of 30,969 probe sets from Affymetrix GeneChips. The perturbation group includes a LacZ control and the over-expression and loss of function of the transcription factor Collier/EBF/Olf (COE) in sorted TVC cells at 21 hours post fertilization (hpf), after the asymmetric divisions of the TVCs but before completion of the ASM migration. The time-series group comprises 11 time points in TVC cells, taken every 2 hours from 8 to 28 hours.
2.2 Gene expression data
2.2.1 Quality control
This thesis applies arrayQualityMetrics (Kauffmann, Gentleman et al. 2009), a Bioconductor package, for quality control. It produces an HTML report with several diagnostic plots. In general, an array is discarded if the report identifies it as an outlier both before and after normalization. The microarray data are first imported into the statistical programming language R, and quality control is then carried out with arrayQualityMetrics. The sample LacZ.3 is removed because it was reported as an outlier both before and after normalization (Figure 2).
2.2.2 Preprocessing
The CEL files of the microarrays are normalized by the RMA method (Gentleman 2005). The expression matrix contains 30,969 probes and 48 arrays. After non-specific filtering by variance (IQR = 0.5), the matrix contains 15,484 probes and 48 arrays. Using the collapseRows function in WGCNA, the probe with maximum variance is selected to represent each gene. After merging the probes, the merged matrix contains 10,079 probes and 48 arrays.
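The filtering and probe-merging steps can be illustrated with a small sketch. This is not the R code used in the thesis. It assumes that the 0.5 cutoff is a quantile on the per-probe IQR (consistent with roughly half of the 30,969 probes remaining), and the function names `iqr`, `filter_by_iqr` and `collapse_max_variance`, as well as the toy data, are hypothetical.

```python
import statistics


def iqr(values):
    """Interquartile range of one probe's expression values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def filter_by_iqr(expr, quantile=0.5):
    """Keep the probes with the largest IQR.

    expr maps probe id -> list of expression values (one per array).
    Assumption: the cutoff is a quantile of all IQRs, so quantile=0.5
    keeps the more variable half of the probes.
    """
    n_keep = int(len(expr) * (1 - quantile))
    ranked = sorted(expr, key=lambda probe: iqr(expr[probe]), reverse=True)
    return {probe: expr[probe] for probe in ranked[:n_keep]}


def collapse_max_variance(expr, probe_to_gene):
    """For each gene, keep the values of its probe with maximum variance."""
    best = {}  # gene -> (variance, probe id)
    for probe, values in expr.items():
        gene = probe_to_gene[probe]
        var = statistics.variance(values)
        if gene not in best or var > best[gene][0]:
            best[gene] = (var, probe)
    return {gene: expr[probe] for gene, (_, probe) in best.items()}


# Toy example: four probes measured on four arrays.
expr = {
    "p1": [1, 1, 1, 1],
    "p2": [1, 2, 3, 4],
    "p3": [0, 5, 0, 5],
    "p4": [2, 2, 2, 3],
}
filtered = filter_by_iqr(expr, quantile=0.5)
collapsed = collapse_max_variance(filtered, {"p2": "geneA", "p3": "geneA"})
```

In the toy data, the two flattest probes are filtered out, and the remaining two probes of the hypothetical gene `geneA` are collapsed to the more variable one.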
2.3 Statistical test
The merged matrix is ranked by the moderated F test, and genes with a significant p-value (< 0.05, using the Limma package) (Smyth 2004) after adjustment by the Benjamini-Hochberg method are selected. After ranking, the top-rank matrix contains 4,307 probes and 48 arrays. The top-rank matrix is imported into one of the Gaggle Geese, MultiExperiment Viewer (MeV), and subjected to the Significance Analysis of Microarrays (SAM) test (COE versus COEW group, p-value < 0.05, 1000 permutations, FDR = 0.9).
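The Benjamini-Hochberg adjustment applied to the p-values can be written in a few lines. The sketch below is a plain re-implementation for illustration, not the Limma code itself; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
# Genes would then be selected with an adjusted p-value below 0.05.
selected = [i for i, p in enumerate(adj) if p < 0.05]
```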
2.4 Cluster analysis
Hierarchical clustering is performed for ASM and Heart candidate genes in MeV, using the Pearson correlation metric and average linkage clustering. The time-series group data, totaling 36 arrays, are averaged for each time point and imported into the Short Time-series Expression Miner (STEM), using the STEM Clustering Method.
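Averaging the replicate arrays per time point before the STEM analysis might look like the sketch below. The mapping of arrays to time points is not given in the text, so it is passed in as an argument; the names `TIME_POINTS` and `average_by_time_point` and the example values are hypothetical.

```python
from collections import defaultdict

# 8, 10, ..., 28 hours: the 11 time points of the time-series group.
TIME_POINTS = list(range(8, 29, 2))


def average_by_time_point(values, array_times):
    """Average one gene's replicate arrays for each time point.

    values: expression values of one gene, one per array.
    array_times: time point (in hours) of each array, in the same order.
    Returns a dict mapping time point -> mean expression.
    """
    groups = defaultdict(list)
    for value, time in zip(values, array_times):
        groups[time].append(value)
    return {time: sum(vals) / len(vals) for time, vals in sorted(groups.items())}


# Toy example: two replicates at 8 hours, one array at 10 hours.
profile = average_by_time_point([1.0, 3.0, 5.0], [8, 8, 10])
```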
2.5 Functional enrichment analysis
Blast2GO (B2G) (Conesa, Götz et al. 2005) is a comprehensive bioinformatics tool for annotation, visualization and analysis in functional genomics research. It offers a suitable platform for functional research in non-model species, such as Ciona intestinalis. DNA sequences in FASTA format were loaded into Blast2GO. 15,629 genes remained in Blast2GO; BLAST searching and GO mapping yielded GO terms for 3,964 genes. Each test group from the different lists is tested against the reference group (3,964 genes) using Fisher’s Exact Test (p-value < 0.05, FDR correction).
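The enrichment test for a single GO term can be sketched as a one-sided Fisher's exact test on a 2x2 table (test or reference group versus annotated or not annotated with the term). This is an illustration, not Blast2GO's implementation; it assumes the test and reference groups are disjoint and tests for over-representation only. The function name `fisher_enrichment_p` and the counts are hypothetical.

```python
from math import comb


def fisher_enrichment_p(test_with, test_size, ref_with, ref_size):
    """One-sided Fisher's exact p-value for over-representation of a GO term.

    test_with / ref_with: genes annotated with the term in each group.
    test_size / ref_size: total genes in each group (assumed disjoint).
    """
    total = test_size + ref_size
    annotated = test_with + ref_with
    upper = min(test_size, annotated)
    # Hypergeometric tail: probability of seeing test_with or more
    # annotated genes in the test group with all margins fixed.
    tail = sum(
        comb(annotated, k) * comb(total - annotated, test_size - k)
        for k in range(test_with, upper + 1)
    )
    return tail / comb(total, test_size)


# Toy example: 3 of 4 test genes carry the term versus 1 of 4 reference genes.
p = fisher_enrichment_p(test_with=3, test_size=4, ref_with=1, ref_size=4)
```

The resulting p-values for all GO terms would then be corrected for multiple testing before applying the 0.05 threshold.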
2.6 Generation of networks
2.6.1 STRING protein network
Using the Ensembl gene names in the filt.gene matrix as input, the interactions among the genes of interest in the Search Tool for the Retrieval of Interacting Genes (STRING) database (Szklarczyk, Franceschini et al. 2011) are extracted from the STRING website in Text Summary format and parsed into the Cytoscape simple interaction format (SIF) (Shannon, Markiel et al. 2003) with the Python programming language.
2.6.2 Unweighted co-expression network
The Pearson Correlation Coefficient for all pair-wise comparisons of genes is calculated from the filt.gene matrix in R. Highly correlated gene pairs are selected with a cutoff of 0.9 and parsed into simple interaction format (SIF) (Shannon, Markiel et al. 2003) with Python.
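The construction of the unweighted network can be sketched end to end in Python. The thesis computes the correlations in R and only does the SIF conversion in Python, so this is an illustration rather than the original script. The relationship label `coexp`, the function names and the toy profiles are hypothetical; SIF itself only requires a source node, a relationship type and a target node per line.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant profile: correlation undefined, treat as 0
    return cov / (sx * sy)


def coexpression_sif(expr, cutoff=0.9, relation="coexp"):
    """Return SIF lines for all gene pairs with correlation above cutoff."""
    lines = []
    for g1, g2 in combinations(sorted(expr), 2):
        if pearson(expr[g1], expr[g2]) > cutoff:
            lines.append(f"{g1}\t{relation}\t{g2}")
    return lines


# Toy example with three genes on four arrays.
expr = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8.5],
    "C": [4, 3, 2, 1],
}
sif = coexpression_sif(expr)
```

Only positively correlated pairs pass the cutoff here, matching the "higher than 0.9" criterion; anti-correlated pairs such as A and C are not linked.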
2.6.3 Weighted co-expression network
2.6.3.1 Network construction
The procedure can be found on the WGCNA website (Horvath 2011).
2.6.3.2 Module detection
Pearson correlation coefficients are calculated for all pair-wise comparisons of genes across all samples. The resulting Pearson correlation matrix is transformed into the weighted adjacency matrix with the soft-thresholding power beta = 6. Average linkage hierarchical clustering is used to group genes on the basis of the topological overlap dissimilarity measure of their network connection strengths (Zhang, Horvath 2005). Using a dynamic tree-cutting algorithm (Langfelder, Zhang et al. 2008), 13 modules are found with a minimum cluster size of 70 (Figure 6). Genes that are not assigned to any module are assigned the color grey.
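The two matrix transformations in this step can be sketched as follows. The sketch assumes an unsigned network, where the adjacency is the absolute correlation raised to the power beta, and uses the topological overlap formula of Zhang and Horvath (2005); the thesis does not state which network type was used, so that choice is an assumption. It is a plain Python illustration, not the WGCNA R code, and the function names and the toy correlation matrix are hypothetical.

```python
def soft_threshold_adjacency(cor, beta=6):
    """Weighted adjacency a_ij = |cor_ij| ** beta, with a zero diagonal."""
    n = len(cor)
    return [
        [0.0 if i == j else abs(cor[i][j]) ** beta for j in range(n)]
        for i in range(n)
    ]


def topological_overlap(adj):
    """Topological overlap matrix computed from a weighted adjacency matrix."""
    n = len(adj)
    k = [sum(row) for row in adj]  # connectivity of each gene
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Connection strength shared through all other genes u.
            shared = sum(
                adj[i][u] * adj[u][j] for u in range(n) if u != i and u != j
            )
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1 - adj[i][j])
    return tom


# Toy correlation matrix for three genes.
cor = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.5],
    [0.8, 0.5, 1.0],
]
adj = soft_threshold_adjacency(cor, beta=6)
tom = topological_overlap(adj)
# Dissimilarity used as input for average linkage hierarchical clustering.
dissimilarity = [[1.0 - t for t in row] for row in tom]
```

Raising the correlations to the sixth power shrinks weak correlations much more than strong ones (0.5 becomes about 0.016, while 0.9 stays above 0.53), which is what makes the network "soft-thresholded".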
2.6.3.3 Module significance
The p-value of the moderated t test is taken from the topTable output of the affylmGUI package in R (Smyth 2004).
2.7 Network visualization
2.7.1 File format
The output files from WGCNA are parsed into simple interaction format (SIF) (Shannon, Markiel et al. 2003) with Python.
2.7.2 Analyzing network by plugin in Cytoscape
The AllegroMCODE and Network Analysis plugins in Cytoscape are used to analyze the network. Clusters are found automatically by AllegroMCODE.
3. RESULTS
3.1 Differential expression
3.1.1 Expectation of the Microarray data
Genes that are up-regulated upon overexpression of COE or down-regulated upon loss of function of COE are considered ASM candidate genes downstream of COE, while genes that are down-regulated upon overexpression of COE or up-regulated upon loss of function of COE are considered Heart candidate genes repressed by COE (Stolfi, Gainous et al. 2010). Using the COE and COEW groups as two classes in the Significance Analysis of Microarrays (SAM), the contrast is expected to yield ASM and Heart candidate genes.
3.1.2 ASM and Heart candidate genes
3.1.2.1 Lists from SAM
SAM yields 336 significant genes, which are separated into 206 ASM candidate genes (negative in SAM, expression in the COE group lower than in the COEW group) and 130 Heart candidate genes (positive in SAM, expression in the COE group higher than in the COEW group). These two groups can be distinguished by the first three columns in the heat-map (Figure 3, Figure 27).
Based on the hierarchical clustering and observation, the ASM candidate genes can be roughly divided into three large groups:
A1. The first group (up-down-up-ASM, 61 genes) shows a “U”-shaped curve through the time-series experiments, with the earliest up-regulation right at the first experimental time point of 8 hours. This group contains Snail (‘SNAIL’ in the thesis), SET and MYND Domain 1 (SMYD1) and Myoblast determination protein (Myod, ‘MYOD’ in the thesis).
A2. The second group (early-ASM, 45 genes), including COE and the Myocyte Regulatory Light Chain (MRLC5, ‘MYL5’ in the thesis) gene, shows early up-regulation around 14 hours.
A3. The third group (late-ASM, 100 genes) has relatively late up-regulation after 18 hours and includes myosin heavy chain genes (MHC3), tropomyosin 1 (TPM1, ‘CTM1’ in the thesis) and muscle-like actin 2 (MA2).
The Heart candidate genes can be divided into two large groups:
H1. The first group (early-Heart, 99 genes) shows early up-regulation (before 20 hours) and contains the heart markers BMP2/4, NK4, NOTRLC/HAND-LIKE, and ETS/POINTED2.
H2. The second group (late-Heart, 31 genes) displays relatively late up-regulation (after 20 hours) and includes mesenchyme specific gene 3 (MECH3).
As expected, the two lists of genes contain some important markers and show noticeable temporal expression. However, these ASM and Heart candidate genes did not show GO-term enrichment in Blast2GO, which might indicate the need to fine-tune the lists, although the small number of GO terms available in Blast2GO is another concern. To improve the ASM and Heart candidate gene lists further, it would be necessary to know the effects of the non-specific filtering, of selecting the probe for a gene by maximum variance, and of the SAM ranking.
3.1.2.2 Clusters from STEM
A total of 7 significant model profiles are shown in the STEM output. 23 of the 206 ASM candidate genes are in the significant profiles. Most of them are in profile 20, which is similar to the late-ASM group, including the MHC3, MA2 and MYL5 genes. For the Heart candidate genes, 13 of 130 are in the significant profiles.
3.2 Network Visualization in Gaggle
3.2.1 Networks
3.2.1.1 STRING protein network
The STRING (Szklarczyk, Franceschini et al. 2011) protein network is created to make good use of existing data resources. It provides both experimental interaction information and interactions predicted by computational techniques, shown as different edge colors (Figure 9).
3.2.1.2 Co-expression network
Network-based approaches, also termed graph-based approaches, aim to extract recurrent expression patterns or conserved modules from the rapidly accumulating microarray datasets. The microarray dataset is modeled as a relation graph in which each node represents one gene and two genes are connected by an edge based on an expression correlation parameter (Zhang, Horvath 2005) that measures the similarity between expression profiles (the Pearson Correlation Coefficient is used in this thesis). The graph, or network, can be represented by an adjacency matrix that encodes whether a pair of nodes is connected. For unweighted networks, entries are 1 or 0. For weighted networks, the adjacency matrix reports the connection strength of each gene pair, between 0 and 1 (Zhang, Horvath 2005). The concept of connectivity in graph theory, also termed degree, can be computed as the row sum of the adjacency matrix; it measures the number of direct neighbors of the node in unweighted networks and the sum of connection strengths in weighted networks. Two co-expression networks are generated in this thesis.
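Connectivity as a row sum, and the local clustering coefficient reported for the networks in this section, can be illustrated on a small adjacency matrix. The sketch uses the standard definition of the local clustering coefficient for unweighted graphs and sets it to 0 for nodes with fewer than two neighbors; whether the Cytoscape plugin handles such nodes the same way is an assumption. The function names and the toy network are hypothetical.

```python
def connectivity(adj):
    """Connectivity (degree) of every node: row sums of the adjacency matrix."""
    return [sum(row) for row in adj]


def clustering_coefficient(adj, i):
    """Local clustering coefficient of node i in an unweighted network.

    Fraction of pairs of neighbors of i that are themselves connected.
    """
    neighbors = [j for j, a in enumerate(adj[i]) if a and j != i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: undefined cases are counted as 0
    links = sum(
        1
        for x in range(k)
        for y in range(x + 1, k)
        if adj[neighbors[x]][neighbors[y]]
    )
    return 2.0 * links / (k * (k - 1))


# Toy unweighted network: a triangle 0-1-2 plus node 3 attached to node 0.
adj = [
    [0, 1, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
]
degrees = connectivity(adj)
coefficients = [clustering_coefficient(adj, i) for i in range(len(adj))]
network_average = sum(coefficients) / len(coefficients)

# The same row sum gives the summed connection strength in a weighted network.
weighted_degrees = connectivity([[0.0, 0.5], [0.5, 0.0]])
```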
The unweighted co-expression network is formed by the gene pairs with a Pearson Correlation Coefficient higher than 0.9. A total of 766 nodes are in this unweighted network, with a clustering coefficient of 0.311 (output from the Network Analysis plugin in Cytoscape; the clustering coefficient measures the cohesiveness of the neighborhood of a node). The connections with the top 5,000 weights are exported to build the weighted co-expression network (the cutoff for the weight is 0.23), which has a total of 814 nodes and a clustering coefficient of 0.728. The unweighted network has more isolated clusters with only 2 nodes linked by 1 edge. The weighted network has greater density with some hubs (high connectivity), and its nodes are colored by the modules detected in WGCNA.
Though these two networks differ in their adjacency matrices, both are based on the Pearson Correlation Coefficient and present genes of high similarity close together in the graph. In other words, genes with the same expression profile across all of the experiments are close to each other in the network. These network-based approaches allow the exploration of the position of a biological entity in the context of its local neighborhood in the graph and of the network as a whole, and are less troubled by the inherent noise that confounds conventional pairwise approaches (Freeman, Goldovsky et al. 2007).
3.2.2 Findings from the network visualization in Gaggle
3.2.2.1 Gaggle as information integration center
In this post-genomic era, biologists often face the challenge of freely exploring experimental and computational data from many different sources and diverse software tools: storing different data for genes, retrieving data for a list of genes, and mapping one list of genes onto another. Once the network has been loaded in Cytoscape, Gaggle, as an information integration center, can help to solve these problems for microarray data.
Storing different data for genes can be achieved by labeling. As shown in Figures 9 and 10, the two networks present data from 6 different sources: node color for module, node label for ASM or Heart candidate genes, node shape for significance in the moderated F test, node size for connectivity, edge color for interaction type, and distance between nodes for closeness. The network in Cytoscape therefore functions as a visual database.
Retrieving data for a list of genes, such as an expression matrix, is also feasible through the basic “broadcast” function in Gaggle. For example, a list of genes of interest in Cytoscape can be sent to the Gaggle Boss and then broadcast to the Data Matrix Viewer (DMV), which can output the expression matrix.
Mapping one list of genes onto another can be done conveniently through the many functions that Gaggle offers. In the MultiExperiment Viewer (MeV), a sub-list of genes can be launched in a new viewer. In Cytoscape, the function “Create new network from selected nodes” can be used for this task. Between different tools, the “broadcast” function serves as a bridge to transfer the list and map it in the existing tools.
3.2.2.2 Module from AllegroMCODE
The main goal of the co-expression network visualization is to find the highly correlated genes (modules) related to the ASM or Heart network, specifically aiming to infer targets of the transcription factor COE. In the unweighted network, which has no predefined modules, modules can be detected automatically by AllegroMCODE, a Cytoscape plugin that finds highly interconnected groups of nodes in a huge complex network. The 1st module detected by AllegroMCODE for the unweighted network is shown in Figure 11. This module is significantly enriched in biological processes (Figure 12) such as biosynthetic process and cellular biosynthetic process. For the weighted network, the 1st module (Figure 13) detected by AllegroMCODE consists largely of turquoise module genes (with only 1 grey gene). This module is significantly enriched in intracellular process (Figure 14).
Comparing the 1st modules of the unweighted and weighted networks, both contain ribosome-related genes (gene names starting with “RP”). Because the two networks are both generated from the same microarray data, an external reference is necessary to determine whether this ribosome group is found by chance. Comparing the 1st module in the weighted network with all turquoise module genes in the STRING network gives a common list of 23 genes, 16 of which are ribosome related.
3.2.2.3 Module from weighted network
Weighted correlation network analysis (WGCNA) has advantages in identifying candidate targets because of its unique mathematical features (Langfelder, Horvath 2008). While highly correlated genes are grouped into different modules, genes that are far from the modules are depicted in grey. Figure 18 shows that these grey genes in the weighted network often have fewer edges and are targeted at miRNA, which makes them reasonably different from the other functional modules.
In Figure 7 and Figure 8, the tan and brown modules have strong module significance (the significance is defined as -log10 of the p-value in the moderated t test). Visualizing the 50 genes with the highest intramodular connectivity in each of these two modules shows that the modules are enriched in ASM and Heart candidate genes. Interestingly, the NK4 gene is in the tan module with other genes (Figure 19). The Islet (ISL) gene, which is not in the candidate list yet is reported to be an ASM gene, is in the brown module with some known markers, such as MA2, MHC3, NOTRLC/HAND-LIKE, and ETS/POINTED2 (Figure 20). These results can serve as a starting point for forming hypotheses about the Heart network in Ciona.
Because the turquoise module is the largest module in the weighted network and is enriched in cellular process and other terms (Figure 21), it is natural to consider narrowing the list of turquoise module genes with other conditions. The list of genes resulting from combining the turquoise module with the STEM condition shows clear temporal expression and enrichment in muscle- and heart-related GO terms (Figure 22, Figure 23), while containing only four genes found in the candidate list.
3.2.2.4 Fine-tuned list
The network in Gaggle can serve as a visualization center as well as a fine-tuning filter for a list of genes, because the network is built upon highly correlated pairs of genes with reduced noise. This by no means implies that genes absent from the network should be discarded, but it is good to have the expected GO-term enrichment to confirm the list. Because GO-term enrichment depends on the proportion of genes with the same GO terms, the number of noisy genes in the whole list has a great impact on the enrichment. Importing the candidate list into the co-expression network would reduce the noise and yield a better enrichment result.
Through the “broadcast” function in MeV, Cytoscape can receive the 336 significant genes, label them in yellow in the unweighted network, and then create a sub-network for the candidate genes. A subgroup of the candidate genes (Figure 24) is significantly enriched in muscle- and heart-related GO terms (Figure 25), which previously could not be detected by Blast2GO. The ASM candidate genes in the network are also enriched in muscle and heart GO terms (Figure 26), while the Heart candidate genes in the network still show no enrichment in Blast2GO.
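Outside of Cytoscape, the sub-network step amounts to keeping only the edges whose two endpoints are both in the candidate list. The sketch below works on SIF lines and is only an approximation of what the thesis does interactively in Cytoscape: candidate genes without an edge to another candidate drop out of the result. The function name `induced_subnetwork`, the `coexp` label and the gene names are hypothetical.

```python
def induced_subnetwork(sif_lines, candidates):
    """Keep the SIF edges whose source and target are both candidate genes."""
    keep = set(candidates)
    sub = []
    for line in sif_lines:
        source, _relation, target = line.split("\t")
        if source in keep and target in keep:
            sub.append(line)
    return sub


def nodes_of(sif_lines):
    """Set of genes that appear in at least one edge."""
    nodes = set()
    for line in sif_lines:
        source, _relation, target = line.split("\t")
        nodes.update((source, target))
    return nodes


# Toy network of four genes in a chain and a list of three candidates.
network = ["A\tcoexp\tB", "B\tcoexp\tC", "C\tcoexp\tD"]
sub = induced_subnetwork(network, ["A", "B", "D"])
fine_tuned = nodes_of(sub)
```

In the toy example, gene D is a candidate but is only connected to the non-candidate gene C, so it does not appear in the fine-tuned list; such genes are filtered from the enrichment input, not rejected as candidates.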
4. DISCUSSION
4.1 ASM candidate genes
COE is necessary and sufficient to specify ASM fate (Stolfi, Gainous et al. 2010). It is therefore understandable that COE is expressed earlier than the late-ASM genes (A3 group), such as MHC3, TPM1 and MA2. The up-down-up-ASM group (A1), however, has the earliest up-regulation and includes MYOD. In Xenopus, the cross-regulatory interactions of COE orthologs with genes of the Myogenic Regulatory Factor (MRF) family, such as MYOD and MYF5, are crucial for muscle commitment and differentiation (Green, Vetter 2011). However, how COE may repress the cardiac fate and promote cell migration in Xenopus has never been studied. A possible hypothesis is that in Ciona, the early functions controlled by COE in ASM precursors are independent of MRF activation, since the MRF in the A1 group is up-regulated earlier than COE in the A2 group. The A1 group genes are also more likely to be TVC genes, which could explain why there are heart-related GO terms in the enrichment of the ASM genes in the weighted network (Figure 26).
4.2 Annotation in Ciona intestinalis
The draft genome sequence of the ascidian Ciona intestinalis (Dehal, Satou et al. 2002) has been a valuable research resource. However, there are numerous inconsistencies in the gene models because of the intrinsic limitations of gene prediction programs and the fragmented nature of the assembly (Satou, Mineta et al. 2008). Therefore, the probe annotation in this study focuses on combining available resources from various databases, such as Aniseed (Tassy, Dauga et al.), Ensembl Genome Browser (Kersey, Lawson et al. 2010), CIPRO (Endo, Ueno et al.), STRING (Szklarczyk, Franceschini et al. 2011) and UCSC Genome Browser (Karolchik, Hinrichs et al. 2011), as well as internal files from Dr. Lionel Christiaen’s lab. There are 16,250 non-redundant genes among the 30,969 probes, and this gene set is the criterion used to map a probe to a gene. Differences between the gene annotation in this thesis and other sources are unavoidable.
4.3 Functional ribosome group and COE
The highly linked ribosome genes in the STRING network (Figure 19), which are enriched in ribosome processes (Figure 20), naturally raise a question: what is the relationship between this functional ribosome group and COE? After broadcasting this list of ribosome genes and COE to MeV, the heat-map and expression plot show that the ribosome group and COE behave similarly across the time-series experiments. This group of ribosome genes also has quite a stable expression profile. More housekeeping genes are likely to be found in the same module as the ribosome group, but they are not the focus of this thesis.
4.4 Time-series
Although clustering algorithms such as hierarchical clustering (Eisen, Spellman et al. 1998), K-means and Self-organizing Maps (SOM) (Tamayo, Slonim et al. 1999) can be used to analyze Microarray data and yield many biological insights, they are not designed for time-series data: they assume that the data at each time point are collected independently of each other and ignore the sequential nature of time-series data (Ernst, Nau et al. 2005). This thesis applies the Short Time-series Expression Miner (STEM), which is designed for the analysis of short time-series Microarray gene expression data (Ernst, Bar-Joseph 2006), to the time-series experiments in the hope of finding clues about the true biological pattern.
The STEM algorithm (Ernst, Nau et al. 2005) starts by selecting a set of potential expression profiles that covers the entire space of expression profiles that can be generated by the genes in the experiment; each profile represents a unique temporal expression pattern. Next, each gene is assigned to one of the profiles. A permutation test then identifies the significant model profiles, and a greedy algorithm groups them into larger clusters (Ernst, Nau et al. 2005), which are colored in the top list of the user interface.
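The profile-assignment step described above can be sketched as follows. This is a simplified illustration and not the STEM implementation: it enumerates every candidate profile rather than selecting a distinct subset, it omits the permutation test and the greedy grouping, and the function names, the `max_step` parameter and the example gene are hypothetical.

```python
from itertools import product
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def candidate_profiles(n_points, max_step=1):
    """Enumerate profiles that start at 0 and change by an integer between
    -max_step and +max_step from one time point to the next."""
    steps = range(-max_step, max_step + 1)
    profiles = []
    for deltas in product(steps, repeat=n_points - 1):
        profile = [0]
        for delta in deltas:
            profile.append(profile[-1] + delta)
        if len(set(profile)) > 1:  # drop the flat profile (zero variance)
            profiles.append(tuple(profile))
    return profiles


def assign_profile(expression, profiles):
    """Return the candidate profile most correlated with the gene."""
    return max(profiles, key=lambda p: pearson(expression, p))


profiles = candidate_profiles(n_points=4)  # 3**3 - 1 = 26 non-flat profiles
gene = [0.0, 0.9, 2.1, 2.0]                # hypothetical log-ratios over 4 time points
best = assign_profile(gene, profiles)
print(best)
```

The number of candidate profiles grows exponentially with the number of time points, which is one reason a method of this kind is aimed at short time series.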
It is worth mentioning that STEM is designed for short time series (defined as 3 to 8 time points on its website), whereas this Microarray dataset has 11 time points.
4.5 Limitations of the co-expression network
The co-expression network approach has several limitations. First, the network similarity is based on the Pearson correlation coefficient, which is sensitive to outliers, so the quality of the input matrix is important to the final result. It would be helpful to try a data transformation or to use Spearman’s rank correlation coefficient instead.
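The sensitivity to outliers can be seen in a small example. The sketch below uses two hypothetical genes whose expression is essentially unrelated across five samples and adds one aberrant sixth sample; the values and function names are illustrative and do not come from the dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1; tied values receive their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expression values; the last sample is an outlier in both genes.
gene_a = [1.0, 0.8, 1.2, 0.9, 1.1, 9.0]
gene_b = [2.1, 2.3, 1.9, 2.0, 2.2, 10.0]
print(pearson(gene_a, gene_b), spearman(gene_a, gene_b))
```

A single outlying sample drives the Pearson coefficient close to 1 and would create an edge in the network, while the rank-based coefficient stays near 0.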
A second limitation is that a co-expression network based on the Pearson correlation coefficient is more suitable for finding globally co-expressed genes (Qian, Dolled-Filhart et al. 2001); it cannot accurately detect the time-delayed or transient responses of downstream effectors in time-series experiments. It would be better to use local clustering (Qian, Dolled-Filhart et al. 2001) to find time-delayed or locally co-expressed genes, or other tools specialized for long time-series experiments, such as the Graphical Query Language (GQL) (Costa, Schönhuth et al. 2005).
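Why a global Pearson correlation misses a delayed response can be shown with a lagged correlation. This is a deliberately simpler technique than the local clustering of Qian et al. and is not a substitute for it; the two profiles are hypothetical and merely share the 11 time points of this dataset.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def lagged_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag] for a non-negative lag."""
    n = len(x)
    return pearson(x[: n - lag], y[lag:])


# Hypothetical regulator and a target that responds two time points later.
regulator = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
target = [0.0, 0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]

best_lag = max(range(4), key=lambda lag: lagged_correlation(regulator, target, lag))
print(best_lag, lagged_correlation(regulator, target, 0))
```

At lag 0 the two profiles are weakly negatively correlated, so the pair would not be linked in the network, although the target follows the regulator exactly after a delay.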
A third limitation is that it is difficult to pick thresholds for a biological network. The hard threshold of the unweighted network arbitrarily cuts off some biologically meaningful edges. Weakly weighted modules are also cut off in the weighted network, although such weak linkages may be biologically meaningful.
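The difference between the two kinds of threshold can be sketched as follows. The cutoff `tau = 0.8` is a hypothetical value chosen for illustration, the power `beta = 6` is the soft-threshold power reported in Figure 5, and an unsigned network (absolute correlation) is assumed.

```python
def hard_adjacency(r, tau=0.8):
    """Unweighted network: keep an edge only if |r| reaches the cutoff tau."""
    return 1 if abs(r) >= tau else 0


def soft_adjacency(r, beta=6):
    """Weighted (unsigned) network: raise |r| to the power beta."""
    return abs(r) ** beta


# Two nearly identical correlations and one weak correlation.
for r in (0.79, 0.81, 0.40):
    print(r, hard_adjacency(r), round(soft_adjacency(r), 3))
```

The hard threshold separates 0.79 from 0.81 completely, whereas the soft threshold gives them similar weights. The weak correlation of 0.40 is not removed by the soft threshold, but its weight becomes so small that it contributes almost nothing to the modules.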
Figure 2 Normalized unscaled standard error (NUSE). One of the tests in arrayQualityMetrics, NUSE, detected sample LacZ3 as an outlier.
Figure 3 Heat-map of ASM and Heart candidate genes. ASM candidate genes are red in the first and third columns. A1: up-down-up-ASM. A2: early-ASM. A3: late-ASM. Heart candidate genes are red in the second column. H1: early-Heart. H2: late-Heart.
Figure 4 Output of the Short Time-series Expression Miner. Significant clusters are colored in the top row.
[Plot panels: "Scale independence" (x-axis: Soft Threshold (power), 1 to 20; y-axis: Scale Free Topology Model Fit, signed R^2) and "Mean connectivity" (x-axis: Soft Threshold (power), 1 to 20; y-axis: Mean Connectivity).]
Figure 5 Selecting the soft-threshold power. The soft-threshold power beta of 6 is chosen for calculating the adjacency matrix since it reaches a high scale-free topology model fit (R^2) while retaining high mean connectivity.
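The connectivity side of this trade-off can be sketched with a toy example. The correlation matrix below is hypothetical, an unsigned adjacency |r|^beta is assumed, and the scale-free topology fit itself is not computed here.

```python
def mean_connectivity(corr, beta):
    """Mean over genes of k_i, the sum over j != i of |r_ij| ** beta."""
    n = len(corr)
    connectivity = [
        sum(abs(corr[i][j]) ** beta for j in range(n) if j != i)
        for i in range(n)
    ]
    return sum(connectivity) / n


# Hypothetical correlation matrix for four genes.
corr = [
    [1.0, 0.9, 0.8, 0.2],
    [0.9, 1.0, 0.7, 0.1],
    [0.8, 0.7, 1.0, 0.3],
    [0.2, 0.1, 0.3, 1.0],
]
by_power = {beta: mean_connectivity(corr, beta) for beta in (1, 6, 12)}
print(by_power)
```

Mean connectivity falls as the power increases, so the power is chosen as a compromise between a good scale-free fit and a network that still retains enough connections.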
Figure 6 Ciona intestinalis weighted co-expression network. The dendrogram results from average linkage hierarchical clustering. The color band below the dendrogram denotes the modules, which are defined as branches of the dendrogram. Of the 10,079 genes, 6,162 were clustered into 13 modules, and the remaining genes are colored in grey.
[Plot panels: "Dynamic-cutree Module Significance (COE-COEW modt), p = 3.1e-86" (x-axis: Dynamic Module; y-axis: coesig) and gene counts per module. Modules shown: black, blue, brown, green, greenyellow, grey, magenta, pink, purple, red, tan, turquoise, yellow.]
Figure 7 Module significance. Module significance is determined as the average absolute gene significance (defined as the negative logarithm of a p-value) over all genes in a given module.
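The definition in this caption can be written out as a short calculation. The gene names, p-values and module assignments below are hypothetical, and base 10 is assumed for the logarithm because the caption does not state the base.

```python
from collections import defaultdict
from math import log10


def module_significance(p_values, module_of):
    """Average absolute gene significance, abs(-log10(p)), per module."""
    scores = defaultdict(list)
    for gene, p in p_values.items():
        scores[module_of[gene]].append(abs(-log10(p)))
    return {module: sum(v) / len(v) for module, v in scores.items()}


# Hypothetical p-values and module assignments.
p_values = {"geneA": 1e-4, "geneB": 1e-2, "geneC": 0.1, "geneD": 1.0}
module_of = {"geneA": "blue", "geneB": "blue", "geneC": "grey", "geneD": "grey"}
significance = module_significance(p_values, module_of)
print(significance)
```

A module whose genes have small p-values receives a high score, so the bars in the figure rank modules by how strongly their genes respond on average.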