Visualization hang zhong

Visualization of Ciona intestinalis Co-expression Network

by

Hang Zhong

A dissertation submitted in partial fulfillment
of the requirements for the degree of

Master of Science

Department of Biology

New York University

May, 2012

ACKNOWLEDGEMENTS

I would like to thank my advisor, Richard Bonneau, for giving me the opportunity to participate in this project and for his ongoing guidance and support. I am also indebted to Professor Lionel Christiaen for inspiring the project. This thesis could not have come to fruition without the help of Florian Razy, who offered insightful and thought-provoking input.

I am also everlastingly grateful to Duncan Penfold-Brown for teaching me programming. I would also like to thank Kieran Mace, Aviv Madar, Kevin Drew, Maximilian Haeussler and Claudia Racioppi, who so patiently offered their time and support. Many thanks to Todd Heiniger and Joel Rodriguez for revising the thesis.

Finally, I would like to thank my family for the invaluable support they have given me in the course of my life and studies.

ABSTRACT

Abnormalities of heart development cause the most frequent congenital diseases in humans. The conservation of the Gene Regulatory Network (GRN) involved in heart development, cellular simplicity, low genetic redundancy and relevant evolutionary position lead researchers to study the ascidian Ciona intestinalis. To extract useful information from the microarray data so that researchers can infer the heart network in Ciona, this thesis not only applies standard approaches to find differentially expressed genes, but also explores network-based approaches to find functional groups. By visualizing the co-expression network in Gaggle, the list of atrial siphon muscle (ASM) and heart candidate genes is fine-tuned. In addition, the modules containing candidate and known marker genes may deserve further study.

TABLE OF CONTENTS

ABSTRACT
1. INTRODUCTION
  1.1 GENE REGULATORY NETWORK OF CARDIOGENIC PRECURSORS IN CIONA
  1.2 MICROARRAY DATA ANALYSIS
  1.3 NETWORK VISUALIZATION THROUGH GAGGLE
2. METHODS
  2.1 MICROARRAY EXPERIMENTAL DESIGN
  2.2 GENE EXPRESSION DATA
    2.2.1 QUALITY CONTROL
    2.2.2 PREPROCESSING
  2.3 STATISTICAL TEST
  2.4 CLUSTER ANALYSIS
  2.5 FUNCTIONAL ENRICHMENT ANALYSIS
  2.6 GENERATION OF NETWORKS
    2.6.1 STRING PROTEIN NETWORK
    2.6.2 UNWEIGHTED CO-EXPRESSION NETWORK
    2.6.3 WEIGHTED CO-EXPRESSION NETWORK
  2.7 NETWORK VISUALIZATION
    2.7.1 FILE FORMAT
    2.7.2 ANALYZING NETWORK BY PLUGIN IN CYTOSCAPE
3. RESULTS
  3.1 DIFFERENTIAL EXPRESSION
    3.1.1 EXPECTATION OF THE MICROARRAY DATA
    3.1.2 ASM AND HEART CANDIDATE GENES
  3.2 NETWORK VISUALIZATION IN GAGGLE
    3.2.1 NETWORKS
    3.2.2 FINDINGS FROM THE NETWORK VISUALIZATION IN GAGGLE
      3.2.2.1 GAGGLE AS INFORMATION INTEGRATION CENTER
      3.2.2.2 MODULE FROM ALLEGROMCODE
      3.2.2.3 MODULE FROM WEIGHTED NETWORK
      3.2.2.4 FINE-TUNED LIST
4. DISCUSSION
  4.1 ASM CANDIDATE GENES
  4.2 ANNOTATION IN CIONA INTESTINALIS
  4.3 FUNCTIONAL RIBOSOME GROUP AND COE
  4.4 TIME-SERIES
  4.5 LIMITATIONS OF THE CO-EXPRESSION NETWORK
FIGURES AND TABLES
  FIGURE 1 PIPELINE
  FIGURE 2 NORMALIZED UNSCALED STANDARD ERROR (NUSE)
  FIGURE 3 HEAT-MAP OF ASM AND HEART CANDIDATE GENES
  FIGURE 4 OUTPUT OF THE SHORT TIME-SERIES EXPRESSION MINER
  FIGURE 5 SELECTING SOFT POWER
  FIGURE 6 CIONA INTESTINALIS WEIGHTED CO-EXPRESSION NETWORK
  FIGURE 7 MODULE SIGNIFICANCE
  FIGURE 8 INTRAMODULAR CONNECTIVITY AND MODULE SIGNIFICANCE
  FIGURE 9 STRING PROTEIN NETWORK
  FIGURE 10 LABELING IN WEIGHTED NETWORK
  FIGURE 11 THE 1ST MODULE INFERRED BY ALLEGROMCODE FOR UNWEIGHTED CO-EXPRESSION NETWORK
  FIGURE 12 THE 1ST MODULE OF UNWEIGHTED CO-EXPRESSION NETWORK ENRICHMENT
  FIGURE 13 THE 1ST MODULE INFERRED BY ALLEGROMCODE FOR WEIGHTED CO-EXPRESSION NETWORK
  FIGURE 14 THE 1ST MODULE OF WEIGHTED NETWORK ENRICHMENT
  FIGURE 15 RIBOSOME GROUP IN THE STRING
  FIGURE 16 RIBOSOME GROUP IN STRING NETWORK ENRICHMENT
  FIGURE 17 RIBOSOME GROUP AND COE
  FIGURE 18 GREY COLOR GENES
  FIGURE 19 TAN MODULE
  FIGURE 20 BROWN MODULE
  FIGURE 21 TURQUOISE MODULE ENRICHMENT
  FIGURE 22 GENES IN TURQUOISE PLUS STEM CONDITION
  FIGURE 23 GENES OF TURQUOISE PLUS STEM CONDITION ENRICHMENT
  FIGURE 24 SUB-GROUP OF CANDIDATE GENES IN UNWEIGHTED NETWORK
  FIGURE 25 SUB-GROUP OF CANDIDATE GENES IN UNWEIGHTED NETWORK ENRICHMENT
  FIGURE 26 ASM CANDIDATE GENES IN WEIGHTED NETWORK ENRICHMENT
  FIGURE 27 ASM AND HEART CANDIDATE GENES
REFERENCES

1. INTRODUCTION

1.1 Gene regulatory network of cardiogenic precursors in Ciona

Abnormalities of heart development cause the most frequent congenital diseases in humans. The conservation of the Gene Regulatory Network (GRN) involved in heart development, cellular simplicity, low genetic redundancy and relevant evolutionary position lead researchers to study the ascidian Ciona intestinalis (Davidson 2007). In Ciona, a single pair of blastomeres called B7.5 gives rise to the anterior tail muscle (ATM) and to the trunk ventral cells (TVC) (Figure 27). Following migration from the tail, the TVCs undergo asymmetric cell divisions at the ventral midline of the trunk. The medial TVCs give rise to the heart, while the lateral TVCs migrate toward the atrial placode, where they form the atrial siphon muscles (ASM). Thus, the TVCs are similar to the multipotent cardio-pharyngeal progenitors found in vertebrates, while the ASM are likely equivalent to the jaw muscles in vertebrates.

A few years ago, the first cardiogenic Gene Regulatory Network (GRN) in Ciona was proposed (Christiaen, Davidson et al. 2008), decoupling genes necessary for heart specification from genes necessary for cell migration. A later study showed that ASM precursors express the transcription factor COE (Stolfi, Gainous et al. 2010), which is necessary and sufficient to specify ASM fate. Misexpression of COE in the whole TVC lineage blocks heart development and imposes an ASM fate on all cells. Conversely, misexpression of a constitutive repressor form of COE provokes the opposite phenotype, blocking ASM formation and causing all cells to form heart tissue. Using genome-wide microarray analysis to study this crucial COE gene and to find its downstream effectors is expected to yield insights into the gene regulatory network of the heart.

1.2 Microarray data analysis

Most existing studies have focused on differential expression to identify genes that distinguish different sets of samples. It is common to apply a testing method, such as the t-test, the F-test, or a nonparametric alternative such as the Wilcoxon test, to rank thousands of genes and select the most significant ones (Gentleman 2005). Other specific statistical methods are also commonly used in microarray data analysis, such as Significance Analysis of Microarrays (SAM) (Tusher, Tibshirani et al. 2001) and LIMMA (Wettenhall, Smyth 2004), which uses a Bayesian mixture model.

Another way of using microarray data is to understand the network properties of an individual gene or protein by studying co-expression, where genes that have similar expression patterns across a set of samples are hypothesized to have a functional relationship. This co-expression network-based approach is consistent with an important concept that has emerged over the past decade: genes and their protein products carry out cellular processes in the context of functional modules and are related (Barabasi, Bonabeau 2003, Barabasi, Oltvai 2004).

1.3 Network visualization through Gaggle

It is well recognized that visualization plays a key role in helping to understand biological systems, particularly in the era of high-throughput studies with a wealth of 'omics'-scale data (Gehlenborg, O'Donoghue et al. 2010). This thesis applies the simple, open-source Java software system Gaggle (Shannon, Reiss et al. 2006) for co-expression network visualization. Gaggle is a cross-platform system integrated with diverse databases (KEGG, BioCyc, and STRING) and software (Cytoscape, DataMatrixViewer, the R statistical environment, and the TIGR MultiExperiment Viewer). With four simple data types (names, matrices, networks, and associative arrays), researchers can explore many different data sources and a variety of software tools by entering this information into the Gaggle Boss, which transfers it to the other tools.

2. METHODS

The pipeline of this thesis is shown in Figure 1.

2.1 Microarray experimental design

The microarray data used in this study were kindly provided by Dr. Lionel Christiaen. They consist of 30,969 probe sets from Affymetrix GeneChips. The perturbation group includes the LacZ control and the over-expression and loss of function of the transcription factor Collier/EBF/Olf (COE) in sorted TVC cells at 21 hours post fertilization (hpf), after the asymmetric divisions of the TVCs but before completion of the ASM migration. The time-series group comprises 11 time points, taken every 2 hours from 8 to 28 hours, in TVC cells.

2.2 Gene expression data

2.2.1 Quality control

This thesis applies arrayQualityMetrics (Kauffmann, Gentleman et al. 2009), a Bioconductor package for quality control. It provides an HTML report with several diagnostic plots. In general, an array is discarded if the report identifies it as an outlier both before and after normalization.

The microarray data are first imported into the statistical programming language R, and quality control is then carried out with arrayQualityMetrics. The sample LacZ.3 is removed because it was reported as an outlier both before and after normalization (Figure 2).

2.2.2 Preprocessing

The CEL files of the microarray are normalized by the RMA method (Gentleman 2005). The expression matrix contains 30,969 probes and 48 arrays. After non-specific filtering by variance (IQR = 0.5), the matrix contains 15,484 probes and 48 arrays.

Using the collapseRows function in WGCNA, the probes with maximum variance are selected to represent genes. After merging the probes, the merged matrix contains 10,079 probes and 48 arrays.
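The two preprocessing steps above can be illustrated with a small Python sketch. It is not the R code used in this thesis (which relied on Bioconductor and the collapseRows function in WGCNA). The function names `filter_by_iqr` and `collapse_by_max_variance` are hypothetical, and reading "IQR = 0.5" as "keep the half of the probes with the largest interquartile range" is an assumption suggested by the probe counts (30,969 before and 15,484 after filtering).

```python
import statistics


def iqr(values):
    """Interquartile range of one probe's expression values."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


def filter_by_iqr(matrix, keep_fraction=0.5):
    """Keep the fraction of probes with the largest IQR.

    matrix: dict mapping probe id -> list of expression values (one per array).
    keep_fraction: assumed meaning of the 0.5 cutoff (hypothetical parameter).
    """
    ranked = sorted(matrix, key=lambda probe: iqr(matrix[probe]), reverse=True)
    n_keep = int(len(ranked) * keep_fraction)
    return {probe: matrix[probe] for probe in ranked[:n_keep]}


def collapse_by_max_variance(matrix, probe_to_gene):
    """For each gene, keep the single probe with the largest variance."""
    best = {}  # gene -> (probe id, variance)
    for probe, values in matrix.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probe without a gene annotation is skipped
        var = statistics.variance(values)
        if gene not in best or var > best[gene][1]:
            best[gene] = (probe, var)
    return {gene: matrix[probe] for gene, (probe, _) in best.items()}
```

The sketch keeps the matrix as a plain dictionary so that it runs without third-party packages; on the real 30,969 x 48 matrix an array library would be the more practical choice.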

2.3 Statistical test

The merged matrix is ranked by a moderated F-test, and genes are selected with a significant p-value (< 0.05, using the limma package) (Smyth 2004) after adjustment by the Benjamini-Hochberg method. After ranking, the top-ranked matrix contains 4,307 probes and 48 arrays.
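For readers unfamiliar with the Benjamini-Hochberg adjustment mentioned above, the following is a minimal Python sketch of the standard step-up procedure. In this thesis the adjustment was performed by the limma package in R; the function name `benjamini_hochberg` is hypothetical and the code is only an illustration of the technique.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

A gene would then be selected when its adjusted p-value is below 0.05, for example `[g for g, q in zip(genes, benjamini_hochberg(pvals)) if q < 0.05]`, where `genes` and `pvals` are placeholder names.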

The top-ranked matrix is imported into one of the Gaggle Geese, the MultiExperiment Viewer (MeV), and subjected to the Significance Analysis of Microarrays (SAM) test (COE versus COEW group, p-value < 0.05, 1000 permutations, FDR = 0.9).

2.4 Cluster analysis

Hierarchical clustering is performed for the ASM and Heart candidate genes in MeV, using the Pearson correlation metric and average linkage clustering.

The time-series group data, totaling 36 arrays, are averaged for each time point and imported into the Short Time-series Expression Miner (STEM), using the STEM Clustering Method.

2.5 Functional enrichment analysis

Blast2GO (B2G) (Conesa, Götz et al. 2005) is a comprehensive bioinformatics tool for annotation, visualization and analysis in functional genomics research. It offers a suitable platform for functional research in non-model species, such as Ciona intestinalis.

DNA sequences in FASTA format were loaded into Blast2GO. 15,629 genes remained in Blast2GO; the BLAST and GO-mapping steps then yielded GO terms for 3,964 genes. Each test group, taken from the different gene lists, is tested against the reference group (3,964 genes) using Fisher's Exact Test (p-value < 0.05, FDR correction).
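The enrichment test can be sketched in Python as a one-sided Fisher's exact test for over-representation of a single GO term. This is only an illustration: the thesis used the test built into Blast2GO, whose exact settings (for example one-sided versus two-sided) are not stated here. The function name and arguments are hypothetical, and the sketch assumes that the test group is a subset of the reference group.

```python
from math import comb


def fisher_enrichment_p(test_with_term, test_total, ref_with_term, ref_total):
    """One-sided p-value that a GO term is over-represented in the test group.

    test_with_term: genes in the test group annotated with the term
    test_total:     size of the test group
    ref_with_term:  genes in the reference group annotated with the term
    ref_total:      size of the reference group (e.g. 3,964 genes)
    """
    denominator = comb(ref_total, test_total)
    upper = min(ref_with_term, test_total)
    p = 0.0
    # Sum the hypergeometric probabilities of seeing at least the
    # observed number of annotated genes in the test group.
    for k in range(test_with_term, upper + 1):
        p += (comb(ref_with_term, k)
              * comb(ref_total - ref_with_term, test_total - k)
              / denominator)
    return p
```

When many GO terms are tested, the resulting p-values would still need a multiple-testing correction, as indicated by the FDR correction above.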

2.6 Generation of networks

2.6.1 STRING protein network

Using the Ensembl gene names in the filt.gene matrix as input, the genes of interest in the Search Tool for the Retrieval of Interacting Genes (STRING) database (Szklarczyk, Franceschini et al. 2011) are extracted from the STRING website in Text Summary format and parsed into the Cytoscape simple interaction format (SIF) (Shannon, Markiel et al. 2003) using the Python programming language.

2.6.2 Unweighted co-expression network

The Pearson correlation coefficient for all pairwise comparisons of genes is calculated from the filt.gene matrix in R. Highly correlated gene pairs are selected with a cutoff of 0.9 and parsed into simple interaction format (SIF) (Shannon, Markiel et al. 2003) using Python.
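The construction of the unweighted network can be sketched in a few lines of Python. The original scripts are not reproduced in this thesis, so the following is an assumption-laden illustration: the function names and the edge label `coexp` are hypothetical, the correlation is computed in Python rather than R, and the signed coefficient (not its absolute value) is compared with the cutoff.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return sxy / sqrt(sxx * syy)


def coexpression_sif(expr, cutoff=0.9, edge_type="coexp"):
    """Return SIF lines ("source<TAB>type<TAB>target") for correlated pairs.

    expr: dict mapping gene name -> list of expression values.
    """
    lines = []
    for g1, g2 in combinations(sorted(expr), 2):
        if pearson(expr[g1], expr[g2]) > cutoff:
            lines.append(f"{g1}\t{edge_type}\t{g2}")
    return lines
```

Writing the returned lines to a text file with `"\n".join(lines)` gives a file that Cytoscape can import as a SIF network.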

2.6.3 Weighted co-expression network

2.6.3.1 Network construction

The procedure can be found on the WGCNA website (Horvath 2011).

2.6.3.2 Module detection

Pearson correlation coefficients are calculated for all pairwise comparisons of genes across all samples. The resulting Pearson correlation matrix is transformed into a weighted adjacency matrix with the soft-thresholding power β = 6 (see Figure 5 for the selection of the power). Average linkage hierarchical clustering is used to group genes on the basis of the topological overlap dissimilarity measure of their network connection strengths (Zhang, Horvath 2005). Using a dynamic tree-cutting algorithm (Langfelder, Zhang et al. 2008), 13 modules are found with a minimum cluster size of 70 (Figure 6). Genes that are not assigned to any module are assigned the color grey.
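The two transformations named above, soft thresholding and topological overlap, can be written out explicitly. The sketch below follows the definitions in Zhang and Horvath (2005) and is not the WGCNA code that was actually run. It assumes an unsigned network (adjacency = |correlation|^β), which this thesis does not state, and the function names are hypothetical.

```python
def soft_threshold_adjacency(cor, beta=6):
    """Weighted adjacency a_ij = |cor_ij| ** beta, with a zero diagonal."""
    n = len(cor)
    return [[0.0 if i == j else abs(cor[i][j]) ** beta for j in range(n)]
            for i in range(n)]


def topological_overlap(adj):
    """Topological overlap matrix for a weighted adjacency matrix.

    TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij),
    where k_i is the connectivity (row sum) of gene i.
    """
    n = len(adj)
    k = [sum(row) for row in adj]  # the diagonal is zero, so this is k_i
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(adj[i][u] * adj[u][j]
                         for u in range(n) if u != i and u != j)
            value = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
            tom[i][j] = tom[j][i] = value
    return tom
```

The dissimilarity passed to the hierarchical clustering is then 1 - TOM_ij, so that genes sharing many strong neighbors end up in the same branch of the tree.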

2.6.3.3 Module significance

The p-value of the moderated t-test is taken from the output of topTable in the affylmGUI package in R (Smyth 2004).

2.7 Network visualization

2.7.1 File format

The output files from WGCNA are parsed into simple interaction format (SIF) (Shannon, Markiel et al. 2003) using Python.

2.7.2 Analyzing network by plugin in Cytoscape

The AllegroMCODE and Network Analysis plugins in Cytoscape are used to analyze the network. AllegroMCODE finds the clusters automatically.

3. RESULTS

3.1 Differential expression

3.1.1 Expectation of the Microarray data

Genes that are up-regulated when COE is overexpressed or down-regulated in COE loss of function are considered ASM candidate genes downstream of COE, while genes that are down-regulated when COE is overexpressed or up-regulated in COE loss of function are considered Heart candidate genes repressed by COE (Stolfi, Gainous et al. 2010).

Using the COE and COEW groups as two classes in the Significance Analysis of Microarrays (SAM), the contrast would yield ASM and Heart candidate genes.

3.1.2 ASM and Heart candidate genes

3.1.2.1 Lists from SAM

SAM yields 336 significant genes, which are separated into 206 ASM candidate genes (negative in SAM, expression in the COE group lower than in the COEW group) and 130 Heart candidate genes (positive in SAM, expression in the COE group higher than in the COEW group). These two groups can be distinguished by the first three columns in the heat-map (Figure 3, Figure 27).

Based on the hierarchical clustering and observation, the ASM candidate genes can be roughly divided into three large groups:

A1. The first group (up-down-up-ASM, 61 genes) shows a “U”-shaped curve through the time-series experiments, with the earliest up-regulation right at the experimental time point of 8 hours. This group contains Snail (‘SNAIL’ in the thesis), SET and MYND Domain 1 (SMYD1) and Myoblast determination protein (Myod, ‘MYOD’ in the thesis).

A2. The second group (early-ASM, 45 genes), including COE and the Myocyte Regulatory Light Chain (MRLC5, ‘MYL5’ in the thesis) gene, shows early up-regulation around 14 hours.

A3. The third group (late-ASM, 100 genes) has relatively late up-regulation after 18 hours, with myosin heavy chain genes (MHC3), tropomyosin 1 (TPM1, ‘CTM1’ in the thesis) and muscle-like actin 2 (MA2) in the group.

The Heart candidate genes can be divided into two large groups:

H1. The first group (early-Heart, 99 genes) shows early up-regulation (before 20 hours) and contains the heart markers BMP2/4, NK4, NOTRLC/HAND-LIKE, and ETS/POINTED2.

H2. The second group (late-Heart, 31 genes) displays relatively late up-regulation (after 20 hours), with mesenchyme specific gene 3 (MECH3) in the group.

As expected, the two lists of genes contain some important markers and show noticeable temporal expression. However, these ASM and Heart candidate genes did not show GO-term enrichment in Blast2GO, which might indicate the need to fine-tune the lists, although the small number of GO terms available in Blast2GO is another concern. Further improvement of the ASM and Heart candidate gene lists would require knowing the effects of the non-specific filtering, of selecting the probe for a gene by maximum variance, and of the SAM ranking.

3.1.2.2 Clusters from STEM

A total of 7 significant model profiles appear in the STEM output. 23 of the 206 ASM candidate genes are in the significant profiles. Most of them are in profile 20, which is similar to the late-ASM group, including the MHC3, MA2 and MYL5 genes. For the Heart candidate genes, 13 of 130 are in the significant profiles.

3.2 Network Visualization in Gaggle

3.2.1 Networks

3.2.1.1 STRING protein network

The STRING (Szklarczyk, Franceschini et al. 2011) protein network is created to make good use of existing data resources. It provides both experimental interaction information and interactions predicted by computational techniques, presented as different edge colors (Figure 9).

3.2.1.2 Co-expression network

Network-based approaches, also termed graph-based approaches, aim to extract recurrent expression patterns or conserved modules from the rapidly accumulating microarray datasets. The microarray dataset is modeled as a relation graph in which each node represents one gene, and two genes are connected by an edge based on an expression correlation parameter (Zhang, Horvath 2005) that measures the similarity between expression profiles (the Pearson correlation coefficient is used in this thesis). The graph, or network, can be represented by an adjacency matrix that encodes whether a pair of nodes is connected. For unweighted networks, entries are 1 or 0. For weighted networks, the adjacency matrix reports the connection strength for each gene pair, between 0 and 1 (Zhang, Horvath 2005). The concept of connectivity in graph theory, also termed degree, can be expressed as the row sum of the adjacency matrix: it counts the direct neighbors of a node in an unweighted network and sums the connection strengths of a node in a weighted network.
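As a small worked example of the definition above, the following Python sketch computes the connectivity of every node as the row sum of the adjacency matrix. The function name `connectivity` is hypothetical; the diagonal is skipped so that a node is not counted as its own neighbor.

```python
def connectivity(adjacency):
    """Row sums of an adjacency matrix, ignoring the diagonal."""
    return [sum(value for j, value in enumerate(row) if j != i)
            for i, row in enumerate(adjacency)]


# Unweighted network: gene 0 is linked to genes 1 and 2.
unweighted = [[0, 1, 1],
              [1, 0, 0],
              [1, 0, 0]]

# Weighted network: the same links, with connection strengths.
weighted = [[0.0, 0.5, 0.25],
            [0.5, 0.0, 0.0],
            [0.25, 0.0, 0.0]]
```

For the unweighted matrix the result is the number of direct neighbors of each gene, while for the weighted matrix it is the sum of the connection strengths, which is what node size encodes in the network figures.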

Two co-expression networks are generated in this thesis.

The unweighted co-expression network is formed by the gene pairs with a Pearson correlation coefficient higher than 0.9. A total of 766 nodes are in this unweighted network, with a clustering coefficient of 0.311 (output of the Network Analysis plugin in Cytoscape, measuring the cohesiveness of the neighborhood of a node).

The gene pairs with the top 5,000 weights are exported to build the weighted co-expression network (the cutoff for the weight is 0.23), giving a total of 814 nodes with a clustering coefficient of 0.728.

The unweighted network has more isolated clusters with only 2 nodes linked by 1 edge. The weighted network has greater density with some hubs (high connectivity), and its nodes are also colored according to the different modules detected by WGCNA.

Although these two networks differ in their adjacency matrices, both are based on the Pearson correlation coefficient and present highly similar genes close together in the graph. In other words, genes with the same expression profile across all of the experiments would be close to each other in the network. These network-based approaches allow the position of a biological entity to be explored in the context of its local neighborhood in the graph and of the network as a whole, and are less troubled by the inherent noise that confounds conventional pairwise approaches (Freeman, Goldovsky et al. 2007).

3.2.2 Findings from the network visualization in Gaggle

3.2.2.1 Gaggle as information integration center

In this post-genomic era, biologists often face the challenge of freely exploring experimental and computational data from many different sources and diverse software tools, for example storing different data for genes, retrieving data for a list of genes, and mapping one list of genes onto another. Once the network has been loaded into Cytoscape, Gaggle, as an information integration center, can help to solve these problems with respect to microarray data.

Storing different data for genes can be achieved by labeling. As shown in Figures 9 and 10, the two networks present data from 6 different sources: node color for module, node label for ASM or Heart candidate genes, node shape for significance in the moderated F-test, node size for connectivity, edge color for different interactions, and distance between nodes for closeness. The network in Cytoscape therefore functions as a visual database.

Retrieving data for a list of genes, such as an expression matrix, is also feasible through the basic “broadcast” function in Gaggle. For example, a list of genes of interest in Cytoscape can be sent to the Gaggle Boss and then broadcast to the Data Matrix Viewer (DMV), which can output the expression matrix.

Mapping one list of genes onto another can be done conveniently in Gaggle through the many functions that it offers. In the MultiExperiment Viewer (MeV), a sub-list of genes can be launched in a new viewer. In Cytoscape, the function “Create new network from selected nodes” can be used for this task. Between different tools, the “broadcast” function serves as a bridge to transfer the list and map it in the existing tools.

3.2.2.2 Module from AllegroMCODE

The main goal of the co-expression network visualization is to find the highly correlated genes (modules) related to the ASM or Heart network, specifically aiming to infer targets of the transcription factor COE.

In the unweighted network, which has no predefined modules, modules can be detected automatically by AllegroMCODE, a Cytoscape plugin that finds highly interconnected groups of nodes in a large, complex network. The 1st module detected by AllegroMCODE for the unweighted network is shown in Figure 11. This module is significantly enriched in biological processes (Figure 12), such as biosynthetic process and cellular biosynthetic process.

For the weighted network, the 1st module (Figure 13) detected by AllegroMCODE contains largely turquoise module genes (only 1 grey gene). This module is significantly enriched in intracellular process (Figure 14).

Comparing the 1st modules of the unweighted and weighted networks, both contain ribosome-related genes (gene names starting with “RP”).

Because these two networks are both generated from the same microarray data, an external reference is necessary to determine whether this ribosome group is found by chance. Comparing the 1st module in the weighted network with all turquoise module genes in the STRING network gives a common list of 23 genes, 16 of which are ribosome-related.

3.2.2.3 Module from weighted network

Weighted correlation network analysis (WGCNA) has advantages in identifying candidate targets owing to its unique mathematical features (Langfelder, Horvath 2008). While highly correlated genes are grouped into different modules, genes that are far from the modules are depicted in grey. Figure 18 shows that these grey genes in the weighted network often have fewer edges and are targeted at miRNA, and are thus reasonably different from the other functional modules.

In Figure 7 and Figure 8, the tan and brown modules have strong module significance (the significance is defined as -log10 of the p-value in the moderated t-test). Visualizing the top 50 genes by intramodular connectivity in each of these two modules shows that the modules are enriched in ASM and Heart candidate genes. Interestingly, the NK4 gene is in the tan module with other genes (Figure 19). The Islet (ISL) gene, which is not in the candidate list yet is reported to be an ASM gene, is in the brown module with some known markers, such as MA2, MHC3, NOTRLC/HAND-LIKE, and ETS/POINTED2 (Figure 20).
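The significance measure quoted above can be illustrated with a short Python sketch. The per-gene significance, -log10 of the moderated t-test p-value, is taken from the text; averaging it over the genes of a module to obtain one value per module is an assumption that follows the usual WGCNA definition of module significance, and the function name `module_significance` is hypothetical.

```python
from math import log10


def module_significance(p_values, module_of):
    """Average -log10(p) of the genes in each module.

    p_values:  dict mapping gene -> moderated t-test p-value
    module_of: dict mapping gene -> module color; genes without an
               assignment are counted as "grey"
    """
    sums, counts = {}, {}
    for gene, p in p_values.items():
        module = module_of.get(gene, "grey")
        sums[module] = sums.get(module, 0.0) - log10(p)
        counts[module] = counts.get(module, 0) + 1
    return {module: sums[module] / counts[module] for module in sums}
```

A module whose genes mostly have small p-values receives a large value, which is how the tan and brown modules stand out in Figure 7.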

These results could serve as a starting point for forming hypotheses about the Heart network in Ciona.

Because the turquoise module is the largest module in the weighted network and is enriched in cellular process and other terms (Figure 21), it is natural to consider restricting the list of turquoise module genes with other conditions. The list of genes obtained by combining the turquoise module with the STEM condition shows clear temporal expression and enrichment in muscle- and heart-related GO terms (Figure 22, Figure 23), while containing only four genes found in the candidate list.

3.2.2.4 Fine-tuned list

The network in Gaggle can serve as a visualization center as well as a fine-tuning filter for a list of genes, because the network is built upon highly correlated pairs of genes with reduced noise. This by no means implies that genes outside the network should be discarded, but it is good to have the expected GO-term enrichment to confirm the list. Because GO-term enrichment is related to the proportion of genes with the same GO terms, the number of noisy genes in the whole list has a great impact on the enrichment. Importing the candidate list into the co-expression network would reduce the noise and yield a better enrichment result.

Using the “broadcast” function in MeV, Cytoscape can receive the 336 significant genes, label them in yellow in the unweighted network, and then create a sub-network for the candidate genes. A subgroup of the candidate genes (Figure 24) is significantly enriched in muscle- and heart-related GO terms (Figure 25), an enrichment that Blast2GO could not previously report. The ASM candidate genes in the network are also enriched in muscle and heart GO terms (Figure 26), while the Heart candidate genes in the network still show no enrichment in Blast2GO.

4. DISCUSSION

4.1 ASM candidate genes

COE is necessary and sufficient to specify ASM fate (Stolfi, Gainous et al. 2010). It is understandable that COE is expressed earlier than the late-ASM genes (A3 group), such as MHC3, TPM1 and MA2. The up-down-up-ASM genes (A1 group), however, have the earliest up-regulation, with MYOD in the group. In Xenopus, the cross-regulatory interactions of COE orthologs with genes of the Myogenic Regulatory Factor (MRF) family, such as MYOD and MYF5, are crucial for muscle commitment and differentiation (Green, Vetter 2011). However, how COE may repress the cardiac fate and promote cell migration in Xenopus has never been studied. A possible hypothesis is that in Ciona, the early functions controlled by COE in ASM precursors are independent of MRF activation, since the MRF in the A1 group is up-regulated earlier than COE in the A2 group.

The A1 group genes are also more likely to be TVC genes, which could explain why heart-related GO terms appear in the enrichment of the ASM genes in the weighted network (Figure 26).

4.2 Annotation in Ciona intestinalis

The draft genome sequence of the ascidian Ciona intestinalis (Dehal, Satou et al. 2002) has been a valuable research resource. However, the gene models contain numerous inconsistencies because of the intrinsic limitations of gene prediction programs and the fragmented nature of the assembly (Satou, Mineta et al. 2008). The probe annotation in this study therefore focuses on combining available resources from various databases, such as Aniseed (Tassy, Dauga et al.), Ensembl Genome Browser (Kersey, Lawson et al. 2010), CIPRO (Endo, Ueno et al.), STRING (Szklarczyk, Franceschini et al. 2011) and UCSC Genome Browser (Karolchik, Hinrichs et al. 2011), as well as internal files from Dr. Lionel Christiaen’s lab. The 30,969 probes correspond to 16,250 non-redundant genes, and this set is the basis for mapping each probe to a gene. Differences between the gene annotation in this thesis and other sources are unavoidable.

4.3 Functional ribosome group and COE

The highly linked ribosome genes in the STRING network (Figure 19), which are enriched in ribosome processes (Figure 20), naturally raise a question: what is the relationship between this functional ribosome group and COE? When the list of ribosome genes and COE is broadcast to MeV, the heat-map and expression plot show that the ribosome group and COE behave similarly in the time-series experiments. This group of ribosome genes also has quite a stable expression profile. More housekeeping genes are likely to be found in the same module as the ribosome group, but they are not the focus of this thesis.

4.4 Time-series

Although clustering algorithms such as hierarchical clustering (Eisen, Spellman et al. 1998), K-means, and Self-organizing Maps (SOM) (Tamayo, Slonim et al. 1999) can be used to analyze microarray data and yield many biological insights, they are not designed for time-series data: they assume that the data at each time point are collected independently of the others, and they ignore the sequential nature of time-series data (Ernst, Nau et al. 2005). This thesis applies the Short Time-series Expression Miner (STEM), which is designed for the analysis of short time-series microarray gene expression data (Ernst, Bar-Joseph 2006), to the time-series experiments in the hope of finding clues about the true biological pattern. The STEM algorithm (Ernst, Nau et al. 2005) starts by selecting a set of potential expression profiles that covers the entire space of expression profiles the genes in the experiment can generate; each profile represents a unique temporal expression pattern. Next, each gene is assigned to one of the profiles. A permutation test then identifies the significant model profiles, and a greedy algorithm groups them into larger clusters (Ernst, Nau et al. 2005), which are colored at the top of the user interface.
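The assignment step described above can be sketched as follows. This is a simplified illustration, not the STEM implementation: it assumes each gene goes to the model profile with the highest Pearson correlation, the profile set and gene values are hypothetical, and the profile selection, permutation test and greedy grouping are not reproduced.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def assign_profiles(genes, profiles):
    """Map each gene to the model profile with the highest correlation."""
    return {
        gene: max(profiles, key=lambda name: pearson(values, profiles[name]))
        for gene, values in genes.items()
    }


# Hypothetical model profiles over four time points, all starting at 0.
profiles = {
    "up": [0, 1, 2, 3],
    "down": [0, -1, -2, -3],
    "up_down": [0, 1, 1, 0],
    "down_up": [0, -1, -1, 0],
}

# Hypothetical expression changes relative to the first time point.
genes = {
    "gene_a": [0.0, 0.8, 2.1, 2.9],
    "gene_b": [0.0, 1.2, 0.9, -0.1],
    "gene_c": [0.0, -1.1, -1.9, -3.2],
}

assignment = assign_profiles(genes, profiles)
print(assignment)
```

Counting how many genes land in each profile is the quantity that a permutation test would then compare against chance.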

It is worth mentioning that STEM is designed for short time series (defined as 3 to 8 time points on its website), whereas this microarray dataset has 11 time points.

4.5 Limitations of the co-expression network

Co-expression network approaches have several limitations, including the following. First, the network similarity is based on the Pearson Correlation Coefficient, which is sensitive to outliers, so the quality of the input matrix is important to the final result. It would be helpful to try a data transformation or to use Spearman’s rank correlation coefficient.
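The sensitivity to outliers can be seen in a small example. The sketch below uses made-up values in which five points are perfectly anti-correlated and a sixth is an extreme outlier; the helper names are illustrative and not taken from the thesis pipeline.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical expression values; the last sample is an outlier in both genes.
gene_x = [1, 2, 3, 4, 5, 100]
gene_y = [5, 4, 3, 2, 1, 100]

r_pearson = pearson(gene_x, gene_y)
r_spearman = spearman(gene_x, gene_y)
print(r_pearson, r_spearman)
```

Here the single outlier drives the Pearson coefficient close to 1, while the rank-based coefficient stays slightly negative, closer to what the other five samples suggest.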

A second limitation is that a co-expression network based on the Pearson Correlation Coefficient is better suited to finding globally co-expressed genes (Qian, Dolled-Filhart et al. 2001); it cannot accurately detect the time-delayed or transient responses of downstream effectors in time-series experiments. It would be better to use local clustering (Qian, Dolled-Filhart et al. 2001) to find time-delayed or locally co-expressed genes, or other tools specialized for long time-series experiments, such as the Graphical Query Language (GQL) (Costa, Schönhuth et al. 2005).
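To illustrate why a plain Pearson coefficient misses a delayed response, the sketch below scans a few time shifts between a regulator and a target. It is a much simpler stand-in, not the local clustering method of Qian et al., and the function names and expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_lag(regulator, target, max_lag):
    """Return (lag, r) maximising the correlation of regulator[t] with target[t + lag]."""
    best = (0, pearson(regulator, target))
    for lag in range(1, max_lag + 1):
        # Drop the last `lag` regulator points and the first `lag` target points.
        r = pearson(regulator[:-lag], target[lag:])
        if r > best[1]:
            best = (lag, r)
    return best


# Hypothetical profiles: the target repeats the regulator two time points later.
regulator = [0, 1, 3, 1, 0, 0, 0, 0]
target = [0, 0, 0, 1, 3, 1, 0, 0]

r_unshifted = pearson(regulator, target)
lag, r_shifted = best_lag(regulator, target, max_lag=3)
print(r_unshifted, lag, r_shifted)
```

With only 11 time points, every extra shift removes a data point from the comparison, so a scan like this should stay limited to small lags.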

A third limitation is that it is difficult to pick thresholds for a biological network. The hard threshold for the unweighted network arbitrarily cuts off some biologically meaningful edges. Weakly weighted modules are likewise cut off in the weighted network, although such weak linkages may be biologically meaningful.
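The difference between the two kinds of threshold can be made concrete with a short sketch. It assumes the unsigned power adjacency |r|^beta for the weighted network with beta = 6 (the power reported in Figure 5) and a hypothetical hard cut-off of 0.80; whether the thesis network is signed or unsigned is not stated in this section.

```python
def hard_adjacency(r, tau):
    """Unweighted network: keep an edge only if |r| reaches the cut-off tau."""
    return 1 if abs(r) >= tau else 0


def soft_adjacency(r, beta):
    """Weighted network: raise |r| to the power beta."""
    return abs(r) ** beta


# Hypothetical correlations: one weak pair and two pairs on either side of the cut-off.
for r in (0.30, 0.79, 0.81):
    print(r, hard_adjacency(r, tau=0.80), round(soft_adjacency(r, beta=6), 4))
```

The hard cut-off treats 0.79 and 0.81 as absent and present, while the power adjacency gives them similar weights; the weak pair at 0.30 is pushed close to zero in both cases, which is the loss of weak linkage described above.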

Figures and tables

Figure 1 Pipeline.

Figure 2 Normalized unscaled standard error (NUSE).

NUSE, one of the tests in arrayQualityMetrics, flagged sample LacZ3 as an outlier.

Figure 3 Heat-map of ASM and Heart candidate genes.

ASM candidate genes are red in the first and third columns. A1: up-down-up-ASM. A2: early-ASM. A3: late-ASM. Heart candidate genes are red in the second column. H1: early-Heart. H2: late-Heart.

Figure 4 Output of the Short Time-series Expression Miner.

Significant clusters are colored in the top row.

[Plot panels: "Scale independence" (x-axis: Soft Threshold (power); y-axis: Scale Free Topology Model Fit, signed R^2) and "Mean connectivity" (x-axis: Soft Threshold (power); y-axis: Mean Connectivity), for powers 1 to 20.]

Figure 5 Selecting soft power.

A soft-threshold power (beta) of 6 was chosen for calculating the adjacency matrix because it reaches a high scale-free topology model fit (R^2) while retaining high mean connectivity.

Figure 6 Ciona intestinalis weighted co-expression network.

The dendrogram results from average linkage hierarchical clustering. The color band below the dendrogram denotes the modules, which are defined as branches in the dendrogram. Of the 10,079 genes, 6,162 were clustered into 13 modules, and the remaining genes are colored in grey.

[Plot panels: "Dynamic-cutree Module Significance (COE-COEW modt), p = 3.1e-86" (x-axis: Dynamic Module; y-axis: coesig) and a bar chart of gene counts per module. Modules: black, blue, brown, green, greenyellow, grey, magenta, pink, purple, red, tan, turquoise, yellow.]

Figure 7 Module significance.

Module significance is the average absolute gene significance (defined as the negative logarithm of a p-value) over all genes in a given module.
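The definition in this caption can be written out as a short sketch. It assumes a base-10 logarithm, which the caption does not specify, and the gene names, p-values and module labels are hypothetical.

```python
from collections import defaultdict
from math import log10


def module_significance(p_values, module_of):
    """Average absolute gene significance (-log10 p) per module."""
    scores = defaultdict(list)
    for gene, p in p_values.items():
        scores[module_of[gene]].append(abs(-log10(p)))
    return {module: sum(vals) / len(vals) for module, vals in scores.items()}


# Hypothetical p-values and module assignments.
p_values = {"g1": 0.01, "g2": 0.0001, "g3": 0.1}
module_of = {"g1": "turquoise", "g2": "turquoise", "g3": "grey"}

significance = module_significance(p_values, module_of)
print(significance)
```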
