Visualization of Ciona Intestinalis Co-expression Network

by

Hang Zhong

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Master of Science
Department of Biology
New York University
May, 2012
  
	
  
	
  
	
  
	
  
	
  
	
  
  
	
  
ACKNOWLEDGEMENTS

I would like to thank my advisor, Richard Bonneau, for providing me the opportunity to participate in this project and for his ongoing guidance and support. I am also indebted to Professor Lionel Christiaen for inspiring the project. This thesis could not have come to fruition without the help of Florian Razy, who offered insightful and thought-provoking input.

I am also everlastingly grateful to Duncan Penfold-Brown for teaching me programming. I would also like to thank Kieran Mace, Aviv Madar, Kevin Drew, Maximilian Haeussler and Claudia Racioppi, who so patiently offered their time and support. Many thanks to Todd Heiniger and Joel Rodriguez for revising the thesis.

Finally, I would like to thank my family for the invaluable support they have given me in the course of my life and studies.
  	
  
	
  
	
  
	
  
	
  
	
  
	
  
  
	
  
ABSTRACT

Abnormalities of heart development cause the most frequent congenital diseases in humans. The conservation of the Gene Regulatory Network (GRN) involved in heart development, together with cellular simplicity, low genetic redundancy and a relevant evolutionary position, has led researchers to study the ascidian Ciona intestinalis. To extract useful information from the microarray data so that researchers can infer the heart network in Ciona, this thesis not only applies standard approaches to find differentially expressed genes, but also explores network-based approaches to find functional groups. By visualizing the co-expression network in Gaggle, the lists of ASM and heart candidate genes are fine-tuned. In addition, the modules containing candidate and known marker genes may deserve further study.
  
  
	
  
	
  
TABLE OF CONTENTS

ABSTRACT
1. INTRODUCTION
  1.1 GENE REGULATORY NETWORK OF CARDIOGENIC PRECURSORS IN CIONA
  1.2 MICROARRAY DATA ANALYSIS
  1.3 NETWORK VISUALIZATION THROUGH GAGGLE
2. METHODS
  2.1 MICROARRAY EXPERIMENTAL DESIGN
  2.2 GENE EXPRESSION DATA
    2.2.1 QUALITY CONTROL
    2.2.2 PREPROCESSING
  2.3 STATISTICAL TEST
  2.4 CLUSTER ANALYSIS
  2.5 FUNCTIONAL ENRICHMENT ANALYSIS
  2.6 GENERATION OF NETWORKS
    2.6.1 STRING PROTEIN NETWORK
    2.6.2 UNWEIGHTED CO-EXPRESSION NETWORK
    2.6.3 WEIGHTED CO-EXPRESSION NETWORK
  2.7 NETWORK VISUALIZATION
    2.7.1 FILE FORMAT
    2.7.2 ANALYZING NETWORK BY PLUGIN IN CYTOSCAPE
3. RESULTS
  3.1 DIFFERENTIAL EXPRESSION
    3.1.1 EXPECTATION OF THE MICROARRAY DATA
    3.1.2 ASM AND HEART CANDIDATE GENES
  3.2 NETWORK VISUALIZATION IN GAGGLE
    3.2.1 NETWORKS
    3.2.2 FINDINGS FROM THE NETWORK VISUALIZATION IN GAGGLE
      3.2.2.1 GAGGLE AS INFORMATION INTEGRATION CENTER
      3.2.2.2 MODULE FROM ALLEGROMCODE
      3.2.2.3 MODULE FROM WEIGHTED NETWORK
      3.2.2.4 FINE-TUNED LIST
4. DISCUSSION
  4.1 ASM CANDIDATE GENES
  4.2 ANNOTATION IN CIONA INTESTINALIS
  4.3 FUNCTIONAL RIBOSOME GROUP AND COE
  4.4 TIME-SERIES
  4.5 LIMITATIONS OF THE CO-EXPRESSION NETWORK
FIGURES AND TABLES
  FIGURE 1 PIPELINE.
  FIGURE 2 NORMALIZED UNSCALED STANDARD ERROR (NUSE).
  FIGURE 3 HEAT-MAP OF ASM AND HEART CANDIDATE GENES.
  FIGURE 4 OUTPUT OF THE SHORT TIME-SERIES EXPRESSION MINER.
  FIGURE 5 SELECTING SOFT POWER.
  FIGURE 6 CIONA INTESTINALIS WEIGHTED CO-EXPRESSION NETWORK.
  FIGURE 7 MODULE SIGNIFICANCE.
  FIGURE 8 INTRAMODULAR CONNECTIVITY AND MODULE SIGNIFICANCE.
  FIGURE 9 STRING PROTEIN NETWORK.
  FIGURE 10 LABELING IN WEIGHTED NETWORK.
  FIGURE 11 THE 1ST MODULE INFERRED BY ALLEGROMCODE FOR UNWEIGHTED CO-EXPRESSION NETWORK.
  FIGURE 12 THE 1ST MODULE OF UNWEIGHTED CO-EXPRESSION NETWORK ENRICHMENT.
  FIGURE 13 THE 1ST MODULE INFERRED BY ALLEGROMCODE FOR WEIGHTED CO-EXPRESSION NETWORK.
  FIGURE 14 THE 1ST MODULE OF WEIGHTED NETWORK ENRICHMENT.
  FIGURE 15 RIBOSOME GROUP IN THE STRING.
  FIGURE 16 RIBOSOME GROUP IN STRING NETWORK ENRICHMENT.
  FIGURE 17 RIBOSOME GROUP AND COE.
  FIGURE 18 GREY COLOR GENES.
  FIGURE 19 TAN MODULE
  FIGURE 20 BROWN MODULE
  FIGURE 21 TURQUOISE MODULE ENRICHMENT.
  FIGURE 22 GENES IN TURQUOISE PLUS STEM CONDITION.
  FIGURE 23 GENES OF TURQUOISE PLUS STEM CONDITION ENRICHMENT.
  FIGURE 24 SUB-GROUP OF CANDIDATE GENES IN UNWEIGHTED NETWORK.
  FIGURE 25 SUB-GROUP OF CANDIDATE GENES IN UNWEIGHTED NETWORK ENRICHMENT.
  FIGURE 26 ASM CANDIDATE GENES IN WEIGHTED NETWORK ENRICHMENT.
  FIGURE 27 ASM AND HEART CANDIDATE GENES
REFERENCES
	
  
	
  
	
  
	
  
  
	
  
	
  
1. INTRODUCTION	
  
1.1 Gene regulatory network of cardiogenic precursors in Ciona

Abnormalities of heart development cause the most frequent congenital diseases in humans. The conservation of the Gene Regulatory Network (GRN) involved in heart development, together with cellular simplicity, low genetic redundancy and a relevant evolutionary position, has led researchers to study the ascidian Ciona intestinalis (Davidson 2007). In Ciona, a single pair of blastomeres called B7.5 gives rise to the anterior tail muscle (ATM) and to the trunk ventral cells (TVCs) (Figure 27). Following migration from the tail, the TVCs undergo asymmetric cell divisions at the ventral midline of the trunk. The medial TVCs give rise to the heart, while the lateral TVCs migrate toward the atrial placode, where they form the atrial siphon muscles (ASM). Thus, the TVCs are similar to the multipotent cardio-pharyngeal progenitors found in vertebrates, while the ASM are likely equivalent to the jaw muscles of vertebrates.
  	
  
	
  	
A few years ago, the first cardiogenic Gene Regulatory Network (GRN) in Ciona was proposed (Christiaen, Davidson et al. 2008), decoupling genes necessary for heart specification from genes necessary for cell migration. A later study showed that ASM precursors express the transcription factor COE (Stolfi, Gainous et al. 2010), which is necessary and sufficient to specify ASM fate. Misexpression of COE in the whole TVC lineage blocks heart development and imposes an ASM fate on all cells. Conversely, misexpression of a constitutive repressor form of COE provokes the opposite phenotype, blocking ASM formation and causing all cells to form heart tissue. Genome-wide microarray analysis of this crucial gene and of its downstream effectors is expected to give insight into the gene regulatory network of the heart.
  	
  
1.2 Microarray data analysis

Most existing studies have focused on differential expression to identify genes that distinguish different sets of samples. It is common to apply a testing method, such as the t-test, the F-test, or a nonparametric test such as the Wilcoxon test, to rank thousands of genes and then select the most significant ones (Gentleman 2005). Other statistical methods designed specifically for microarray data are also commonly used, such as Significance Analysis of Microarrays (SAM) (Tusher, Tibshirani et al. 2001) and LIMMA (Wettenhall, Smyth 2004), which is based on a Bayesian mixture model.
  
	
Another way of using microarray data is to understand the network properties of an individual gene or protein by studying co-expression, where genes that have similar expression patterns across a set of samples are hypothesized to have a functional relationship. This co-expression network-based approach is consistent with an important concept that has emerged over the past decade: genes and their protein products carry out cellular processes in the context of functional modules and are related to one another (Barabasi, Bonabeau 2003, Barabasi, Oltvai 2004).
  
1.3 Network visualization through Gaggle

It is well recognized that visualization plays a key role in helping to understand biological systems, particularly in the era of high-throughput studies with a wealth of ‘omics’-scale data (Gehlenborg, O'Donoghue et al. 2010). This thesis applies the simple, open-source Java software system Gaggle (Shannon, Reiss et al. 2006) for co-expression network visualization. Gaggle is a cross-platform system integrated with diverse databases (KEGG, BioCyc, and STRING) and software (Cytoscape, DataMatrixViewer, the R statistical environment, and the TIGR MultiExperiment Viewer). With four simple data types (names, matrices, networks, and associative arrays), researchers can explore many different data sources and a variety of software tools by sending this information to the Gaggle Boss, which transfers it to the other tools.
  	
  
	
  
	
  
	
  
  
	
  
	
  
2. METHODS	
  
	
   	
The pipeline of this thesis is shown in Figure 1.
  	
  
2.1 Microarray experimental design

The microarray data used in this study were kindly provided by Dr. Lionel Christiaen. They consist of 30,969 probe sets from Affymetrix GeneChips. The perturbation group includes the LacZ control and the over-expression and loss of function of the transcription factor Collier/Olf/EBF (COE) in sorted TVCs at 21 hours post fertilization (hpf), after the asymmetric divisions of the TVCs but before completion of the ASM migration. The time-series group comprises 11 time points, taken every 2 hours from 8 to 28 hours, in TVCs.
  	
  
2.2 Gene expression data

2.2.1 Quality control

This thesis uses arrayQualityMetrics (Kauffmann, Gentleman et al. 2009), a Bioconductor package, for quality control. It produces an HTML report with several diagnostic plots. In general, an array is discarded if the report identifies it as an outlier both before and after normalization.

The microarray data are first imported into the statistical programming language R and then subjected to quality control with arrayQualityMetrics. The sample LacZ.3 is removed because it was reported as an outlier both before and after normalization (Figure 2).
  
2.2.2 Preprocessing

The CEL files of the microarrays are normalized by the RMA method (Gentleman 2005). The resulting expression matrix contains 30,969 probes and 48 arrays. After non-specific filtering by variance (IQR = 0.5), the matrix contains 15,484 probes and 48 arrays.

Using the collapseRows function in WGCNA, the probe with the maximum variance is selected to represent each gene. After merging the probes, the merged matrix contains 10,079 probes and 48 arrays.
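
The two preprocessing steps above can be sketched in Python. This is only an illustration and not the R code used in the thesis. It assumes that "IQR = 0.5" means keeping the half of the probes with the largest interquartile range (keeping half of 30,969 probes gives 15,484, which matches the count reported above). The function names, probe identifiers and probe-to-gene mapping are hypothetical.

```python
from statistics import pvariance, quantiles


def iqr(values):
    """Interquartile range of one probe across all arrays."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def filter_by_iqr(expr, keep_fraction=0.5):
    """Keep the fraction of probes with the largest IQR.

    Assumption: this is one reading of the 'IQR = 0.5' setting.
    """
    ranked = sorted(expr, key=lambda probe: iqr(expr[probe]), reverse=True)
    kept = set(ranked[: int(len(ranked) * keep_fraction)])
    return {probe: values for probe, values in expr.items() if probe in kept}


def collapse_by_max_variance(expr, probe_to_gene):
    """For each gene, keep the expression row of its most variable probe."""
    best = {}  # gene -> (probe, variance)
    for probe, values in expr.items():
        gene = probe_to_gene.get(probe)
        if gene is None:
            continue  # probes without a gene annotation are dropped
        var = pvariance(values)
        if gene not in best or var > best[gene][1]:
            best[gene] = (probe, var)
    return {gene: expr[probe] for gene, (probe, _) in best.items()}


# Hypothetical toy matrix: four probes measured on four arrays.
expr = {
    "p1": [1, 1, 1, 1],
    "p2": [1, 2, 3, 4],
    "p3": [0, 5, 10, 15],
    "p4": [2, 2, 3, 3],
}
probe_to_gene = {"p2": "geneA", "p3": "geneA"}

filtered = filter_by_iqr(expr)
genes = collapse_by_max_variance(filtered, probe_to_gene)
```

In this toy example the two flattest probes are filtered out, and of the two remaining probes for the same gene the more variable one is kept.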
  	
  
2.3 Statistical test

The merged matrix is ranked by the moderated F-test, and genes with a significant p-value (< 0.05 after adjustment by the Benjamini-Hochberg method) are selected using the limma package (Smyth 2004). After ranking, the top-ranked matrix contains 4,307 probes and 48 arrays.
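
As an illustration of the adjustment step, the Benjamini-Hochberg procedure can be written in a few lines of Python. This sketch does not reproduce the moderated F-test of limma; it only shows how raw p-values are converted to adjusted values before the 0.05 threshold is applied. The function name and the example p-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
significant = [i for i, q in enumerate(adjusted) if q < 0.05]
```

Here three genes have a raw p-value below 0.05, but only the first one remains below the threshold after adjustment.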
  	
  
	
The top-ranked matrix is imported into one of the Gaggle geese, the MultiExperiment Viewer (MeV), and subjected to the Significance Analysis of Microarrays (SAM) test (COE versus COEW group, p-value < 0.05, 1000 permutations, FDR = 0.9).
  	
  
2.4 Cluster analysis

Hierarchical clustering is performed for the ASM and Heart candidate genes in MeV, using the Pearson correlation metric and average linkage clustering.

The time-series data, totaling 36 arrays, are averaged for each time point and imported into the Short Time-series Expression Miner (STEM), using the STEM clustering method.
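
The averaging of replicate arrays per time point can be sketched as follows. This is a minimal Python illustration with hypothetical values for a single gene; the function name and the data are assumptions, and the number of replicates is allowed to differ between time points (36 arrays cannot be split evenly over 11 time points).

```python
from collections import defaultdict


def average_by_time_point(values, time_points):
    """Average the replicate measurements that share a time point."""
    groups = defaultdict(list)
    for value, hour in zip(values, time_points):
        groups[hour].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(groups.items())}


# Hypothetical expression values of one gene at 8, 10 and 12 hours.
time_points = [8, 8, 8, 10, 10, 12, 12, 12, 12]
values = [1.0, 2.0, 3.0, 4.0, 6.0, 2.0, 2.0, 4.0, 4.0]

profile = average_by_time_point(values, time_points)
```

The resulting dictionary holds one averaged value per time point, which is the kind of profile a short time-series clustering tool takes as input.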
  
2.5 Functional enrichment analysis

Blast2GO (B2G) (Conesa, Götz et al. 2005) is a comprehensive bioinformatics tool for annotation, visualization and analysis in functional genomics research. It offers a suitable platform for functional research in non-model species, such as Ciona intestinalis.

DNA sequences in FASTA format were loaded into Blast2GO. 15,629 genes remained in Blast2GO; BLAST searching and GO mapping then yielded GO terms for 3,964 genes. Each test group, taken from the different gene lists, is tested against the reference group (3,964 genes) using Fisher's Exact Test (p-value < 0.05, FDR correction).
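
The enrichment test for a single GO term can be illustrated with a small Python sketch. It is not the Blast2GO implementation: it assumes a one-sided test for over-representation and treats the test group as a subset of the reference group, neither of which is stated in the text. The function name and the example counts are hypothetical.

```python
from math import comb


def fisher_enrichment_pvalue(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    N: genes in the reference group
    K: reference genes annotated with the GO term
    n: genes in the test group (assumed to be a subset of the reference)
    k: test-group genes annotated with the GO term
    Returns P(X >= k) under the hypergeometric distribution.
    """
    total = comb(N, n)
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical counts: 10 reference genes, 4 with the term,
# and a test group of 3 genes of which 2 carry the term.
p_value = fisher_enrichment_pvalue(k=2, n=3, K=4, N=10)
```

In a full analysis one such p-value would be computed for every GO term and the resulting set would then be corrected for multiple testing.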
  	
  
2.6 Generation of networks

2.6.1 STRING protein network

Using the Ensembl gene names in the filt.gene matrix as input, the interactions among the genes of interest are extracted from the Search Tool for the Retrieval of Interacting Genes (STRING) database (Szklarczyk, Franceschini et al. 2011) through the STRING website in Text Summary format and parsed into the Cytoscape simple interaction format (SIF) (Shannon, Markiel et al. 2003) using the Python programming language.
  	
  
2.6.2 Unweighted co-expression network

The Pearson correlation coefficient for all pair-wise comparisons of genes is calculated from the filt.gene matrix in R. Highly correlated gene pairs are selected with a cutoff of 0.9 and parsed into simple interaction format (SIF) (Shannon, Markiel et al. 2003) using Python.
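
The construction of the unweighted network can be sketched in pure Python. This is not the original script, and in the thesis the correlations were computed in R. The sketch assumes that only positive correlations at or above the cutoff become edges (the text does not say whether absolute values were used), and the function names, gene names and the relation label "coexp" are hypothetical. Each SIF line has the form node, relation, node, separated by tabs.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a constant profile has no defined correlation
    return cov / (sd_x * sd_y)


def coexpression_sif(expr, cutoff=0.9, relation="coexp"):
    """Return SIF lines for all gene pairs with correlation >= cutoff."""
    lines = []
    for gene_a, gene_b in combinations(sorted(expr), 2):
        if pearson(expr[gene_a], expr[gene_b]) >= cutoff:
            lines.append(f"{gene_a}\t{relation}\t{gene_b}")
    return lines


# Hypothetical expression profiles of three genes on four arrays.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
sif_lines = coexpression_sif(expr)
```

Writing the returned lines to a text file, one per line, gives a network file of the kind that Cytoscape reads as SIF.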
  	
  
2.6.3 Weighted co-expression network

2.6.3.1 Network construction

The procedure can be found on the WGCNA website (Horvath 2011).

2.6.3.2 Module detection

Pearson correlation coefficients are calculated for all pair-wise comparisons of genes across all samples. The resulting Pearson correlation matrix is transformed into a weighted adjacency matrix using the soft-thresholding power beta = 6 (Figure 5). Average linkage hierarchical clustering is then used to group genes on the basis of the topological overlap dissimilarity measure of their network connection strengths (Zhang, Horvath 2005). Using a dynamic tree-cutting algorithm (Langfelder, Zhang et al. 2008), 13 modules are found with a minimum cluster size of 70 (Figure 6). Genes that are not assigned to any module are assigned the color grey.
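
The transformation from correlations to a weighted network can be illustrated with a short Python sketch. It follows the definitions of Zhang and Horvath (2005) for an unsigned network, where the adjacency is the absolute correlation raised to the power beta and the topological overlap combines the direct connection of two genes with the neighbors they share. Whether the thesis used an unsigned or a signed network is not stated, so the absolute value is an assumption. The function names and the correlation matrix are hypothetical, and this is not the WGCNA code itself.

```python
def soft_threshold_adjacency(cor, beta=6):
    """Unsigned weighted adjacency a_ij = |cor_ij| ** beta, diagonal set to 0."""
    n = len(cor)
    return [
        [0.0 if i == j else abs(cor[i][j]) ** beta for j in range(n)]
        for i in range(n)
    ]


def topological_overlap(adj):
    """Topological overlap matrix of a weighted adjacency matrix.

    TOM_ij = (l_ij + a_ij) / (min(k_i, k_j) + 1 - a_ij), where l_ij sums
    a_iu * a_uj over shared neighbors u and k_i is the connectivity of gene i.
    """
    n = len(adj)
    connectivity = [sum(row) for row in adj]  # diagonal is 0
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shared = sum(
                adj[i][u] * adj[u][j] for u in range(n) if u != i and u != j
            )
            denom = min(connectivity[i], connectivity[j]) + 1 - adj[i][j]
            tom[i][j] = (shared + adj[i][j]) / denom
    return tom


# Hypothetical correlation matrix for three genes.
cor = [
    [1.0, 0.9, 0.8],
    [0.9, 1.0, 0.5],
    [0.8, 0.5, 1.0],
]
adj = soft_threshold_adjacency(cor, beta=6)
tom = topological_overlap(adj)
dissimilarity = [[1 - value for value in row] for row in tom]
```

The dissimilarity matrix is the quantity that average linkage hierarchical clustering would take as input. Raising correlations to the power 6 keeps strong correlations relatively large while pushing weak ones toward zero.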
  	
  
  
	
  
2.6.3.3 Module significance

The p-value of the moderated t-test is taken from the topTable output of the affylmGUI package in R (Smyth 2004).
  	
  	
  
2.7 Network visualization

2.7.1 File format

The output files from WGCNA are parsed into simple interaction format (SIF) (Shannon, Markiel et al. 2003) using Python.

2.7.2 Analyzing network by plugin in Cytoscape

The AllegroMCODE and Network Analysis plugins in Cytoscape are used to analyze the network. AllegroMCODE finds clusters automatically.
  
  
	
  
	
  
3. RESULTS	
  
3.1 Differential expression

3.1.1 Expectation of the Microarray data

Genes that are up-regulated when COE is overexpressed or down-regulated upon COE loss of function are considered ASM candidate genes downstream of COE, while genes that are down-regulated when COE is overexpressed or up-regulated upon COE loss of function are considered Heart candidate genes repressed by COE (Stolfi, Gainous et al. 2010).

Using the COE and COEW groups as two classes in the Significance Analysis of Microarrays (SAM), the contrast is expected to yield ASM and Heart candidate genes.
  	
  
3.1.2 ASM and Heart candidate genes

3.1.2.1 Lists from SAM

SAM yields 336 significant genes, which are separated into 206 ASM candidate genes (negative in SAM, expression in the COE group lower than in the COEW group) and 130 Heart candidate genes (positive in SAM, expression in the COE group higher than in the COEW group). These two groups can be distinguished by the first three columns of the heat-map (Figure 3, Figure 27).
  	
  
  
	
  
	
Based on the hierarchical clustering and on observation, the ASM candidate genes can be roughly divided into three large groups:

A1. The first group (up-down-up-ASM, 61 genes) shows a “U”-shaped curve through the time-series experiments, with the earliest up-regulation right at the experimental time point of 8 hours. This group contains Snail (‘SNAIL’ in the thesis), SET and MYND Domain 1 (SMYD1) and Myoblast determination protein (Myod, ‘MYOD’ in the thesis).

A2. The second group (early-ASM, 45 genes), including COE and the Myocyte Regulatory Light Chain (MRLC5, ‘MYL5’ in the thesis) gene, shows early up-regulation around 14 hours.

A3. The third group (late-ASM, 100 genes) shows relatively late up-regulation after 18 hours and includes the myosin heavy chain gene MHC3, tropomyosin 1 (TPM1, ‘CTM1’ in the thesis) and muscle-like actin 2 (MA2).

The Heart candidate genes can be divided into two large groups:

H1. The first group (early-Heart, 99 genes) shows early up-regulation (before 20 hours) and contains the heart markers BMP2/4, NK4, NOTRLC/HAND-LIKE, and ETS/POINTED2.

H2. The second group (late-Heart, 31 genes) displays relatively late up-regulation (after 20 hours) and includes mesenchyme specific gene 3 (MECH3).
  	
  
	
As expected, the two gene lists contain some important markers and show noticeable temporal expression patterns. However, the ASM and Heart candidate genes did not show GO-term enrichment in Blast2GO, which might indicate a need to fine-tune the lists, although the small number of GO terms available in Blast2GO is another concern. To improve the ASM and Heart candidate gene lists further, it would be necessary to know the effects of the non-specific filtering, of selecting the probe for each gene by maximum variance, and of the SAM ranking.
  	
  
3.1.2.2 Clusters from STEM

A total of 7 significant model profiles appear in the STEM output. 23 of the 206 ASM candidate genes are in the significant profiles. Most of them, including the MHC3, MA2 and MYL5 genes, are in profile 20, which resembles the late-ASM group. For the Heart candidate genes, 13 out of 130 are in the significant profiles.
  	
  
3.2 Network Visualization in Gaggle

3.2.1 Networks

3.2.1.1 STRING protein network

The STRING (Szklarczyk, Franceschini et al. 2011) protein network is created to make good use of existing data resources. It provides both experimental interaction information and interactions predicted by computational techniques, shown as different edge colors (Figure 9).
  	
  
3.2.1.2 Co-expression network
  
	
Network-based approaches, also termed graph-based approaches, aim to extract recurrent expression patterns or conserved modules from the rapidly accumulating Microarray datasets. A Microarray dataset is modeled as a relation graph in which each node represents one gene and two genes are connected by an edge according to an expression correlation measure (Zhang, Horvath 2005) that quantifies the similarity between their expression profiles (the Pearson Correlation Coefficient is used in this thesis). The graph, or network, can be represented by an adjacency matrix that encodes whether a pair of nodes is connected. For unweighted networks, the entries are 1 or 0. For weighted networks, the adjacency matrix reports the connection strength of each gene pair as a value between 0 and 1 (Zhang, Horvath 2005). Connectivity, also termed degree in graph theory, is the row sum of the adjacency matrix: it counts the direct neighbors of a node in an unweighted network and sums the connection strengths of a node in a weighted network.
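The definitions above can be illustrated with a minimal Python sketch. It is not the code used in this thesis: the toy profiles are hypothetical, and the weighted form |r|^power is an assumption based on the soft-thresholding described by Zhang and Horvath (2005). The threshold of 0.9 and the power of 6 are the values reported elsewhere in this thesis.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def adjacency(profiles, weighted, threshold=0.9, power=6):
    """Build a symmetric adjacency matrix from expression profiles.

    Unweighted: an entry is 1 if the correlation exceeds `threshold`, else 0.
    Weighted: an entry is |correlation| ** power (assumed soft-threshold form).
    The diagonal stays 0 so that a gene is not counted as its own neighbor.
    """
    n = len(profiles)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            r = pearson(profiles[i], profiles[j])
            value = abs(r) ** power if weighted else float(r > threshold)
            adj[i][j] = adj[j][i] = value
    return adj

def connectivity(adj):
    """Row sums of the adjacency matrix (the degree in the unweighted case)."""
    return [sum(row) for row in adj]

# Toy expression profiles (hypothetical values, four genes by five samples).
profiles = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.0, 4.1, 6.0, 8.2, 10.0],
    [5.0, 4.0, 3.0, 2.0, 1.0],
    [1.0, 3.0, 2.0, 5.0, 4.0],
]
unweighted = adjacency(profiles, weighted=False)
weighted = adjacency(profiles, weighted=True)
print(connectivity(unweighted))
print([round(k, 3) for k in connectivity(weighted)])
```

In this sketch the hard threshold keeps only the strongly positively correlated pair, whereas the weighted matrix retains a graded strength for every pair, including the anti-correlated one.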
  	
   	
   	
  
Two co-expression networks are generated in this thesis.
  	
  
  
	
  
	
The unweighted co-expression network is formed by the gene pairs whose Pearson Correlation Coefficient is higher than 0.9. This network contains a total of 766 nodes and has a clustering coefficient of 0.311 (as output by the Network Analysis plugin in Cytoscape; the clustering coefficient measures the cohesiveness of the neighborhood of a node).
  	
  
	
The gene pairs with the 5000 strongest weights are exported to build the weighted co-expression network (the cutoff for the weight is 0.23), which has a total of 814 nodes and a clustering coefficient of 0.728.
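As an illustration of what a clustering coefficient measures, the following Python sketch computes the average local clustering coefficient of a small undirected network. It is only a sketch under stated assumptions: the gene names and edges are hypothetical, and the convention of assigning 0 to nodes with fewer than two neighbors is assumed here rather than taken from the Cytoscape plugin output.

```python
def local_clustering(neighbors, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    nbrs = sorted(neighbors[node])
    k = len(nbrs)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than 2 neighbors
    links = sum(
        1
        for i in range(k)
        for j in range(i + 1, k)
        if nbrs[j] in neighbors[nbrs[i]]
    )
    return 2.0 * links / (k * (k - 1))

def average_clustering(neighbors):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(neighbors, n) for n in neighbors) / len(neighbors)

# Toy undirected network (hypothetical gene names): a triangle plus one pendant node.
edges = [("g1", "g2"), ("g2", "g3"), ("g1", "g3"), ("g3", "g4")]
neighbors = {}
for a, b in edges:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

print(round(average_clustering(neighbors), 3))
```

A network made of many isolated two-node clusters would score low on this measure, because none of those nodes has two neighbors that could be linked to each other.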
  	
  
	
The unweighted network has more isolated clusters that consist of only 2 nodes linked by 1 edge. The weighted network is denser and has some hubs (nodes of high connectivity); its nodes are also colored according to the modules detected by WGCNA.
  	
  
	
  	
  	
  	
  	
  	
Though these two networks differ in their adjacency matrices, both are based on the Pearson Correlation Coefficient and place highly similar genes close to each other in the graph. In other words, genes with the same expression profile across all of the experiments are close to each other in the network. These network-based approaches allow the position of a biological entity to be explored in the context of its local neighborhood in the graph and of the network as a whole, and they are less troubled by the inherent noise that confounds conventional pairwise approaches (Freeman, Goldovsky et al. 2007).
  	
  
  
	
  
3.2.2 Findings from the network visualization in Gaggle
3.2.2.1 Gaggle as information integration center
  
	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
In this post-genomic era, biologists often face the challenge of freely exploring experimental and computational data that come from many different sources and diverse software tools. Typical tasks include storing different data for genes, retrieving data for a list of genes, and mapping one list of genes to another. Once the network has been loaded in Cytoscape, Gaggle, as an information integration center, can help to solve these problems with respect to Microarray data.
  
	
Storing different data for genes can be achieved by labeling. As shown in Figure 9 and Figure 10, the two networks present data from 6 different sources: node color for module, node label for ASM or Heart candidate genes, node shape for significance in the moderated F test, node size for connectivity, edge color for the type of interaction, and distance between nodes for closeness. The network in Cytoscape therefore functions as a visual database.
  	
  
	
Retrieving data for a list of genes, such as an expression matrix, is also feasible through the basic “broadcast” function in Gaggle. For example, a list of genes of interest in Cytoscape can be sent to the Gaggle Boss and then broadcast to the Data Matrix Viewer (DMV), which can output the expression matrix.
  	
  
  
	
  
	
Mapping one list of genes to another can be done conveniently in Gaggle through the many functions that it offers. In the MultiExperiment Viewer (MeV), a sub-list of genes can be launched in a new viewer. In Cytoscape, the function “Create new network from selected nodes” can be used for this task. Between different tools, the “broadcast” function serves as a bridge that transfers the list and maps it in the existing tools.
  
3.2.2.2 Module from AllegroMCODE
  
	
The main goal of the co-expression network visualization is to find groups of highly correlated genes (modules) related to the ASM or Heart network, with the specific aim of inferring targets of the transcription factor COE.
  	
  
In the unweighted network, which has no predefined modules, modules can be detected automatically by AllegroMCODE, a Cytoscape plugin that finds highly interconnected groups of nodes in a huge complex network. The 1st module detected by AllegroMCODE in the unweighted network is shown in Figure 11. This module is significantly enriched in biological process terms (Figure 12), such as biosynthetic process and cellular biosynthetic process.
  	
  
	
For the weighted network, the 1st module detected by AllegroMCODE (Figure 13) consists largely of turquoise module genes (with only 1 grey gene). This module is significantly enriched in intracellular process (Figure 14).
  	
  
	
  	
The 1st modules of the unweighted and weighted networks both contain ribosome related genes (gene names starting with “RP”). Because the two networks are generated from the same Microarray data, an external reference is necessary to determine whether this ribosome group was found by chance. Comparing the 1st module of the weighted network with all turquoise module genes in the STRING network gives a common list of 23 genes, 16 of which are ribosome related.
  
3.2.2.3 Module from weighted network
  
	
Weighted correlation network analysis (WGCNA) has advantages in identifying candidate targets because of its unique mathematical features (Langfelder, Horvath 2008). While the highly correlated genes are grouped into different modules, the genes that are far from any module are depicted in grey. Figure 18 shows that these grey genes in the weighted network often have fewer edges and are targeted at miRNA, which makes them reasonably different from the other functional modules.
  	
  
	
In Figure 7 and Figure 8, the tan and brown modules have strong module significance (significance is defined as -log10 of the p-value in the moderated t test). Visualizing the top 50 genes by intramodular connectivity in each of these two modules shows that both modules are enriched in ASM and Heart candidate genes. Interestingly, the NK4 gene is in the tan module with other genes (Figure 19). The Islet (ISL) gene, which is not in the candidate list but has been reported to be an ASM gene, is in the brown module together with some known markers, such as MA2, MHC3, NOTRLC/HAND-LIKE, and ETS/POINTED2 (Figure 20). These results are a helpful starting point for forming hypotheses about the Heart network in Ciona.
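Selecting the top genes by intramodular connectivity can be sketched as follows. This is a hedged illustration rather than the WGCNA implementation: intramodular connectivity is taken here to be the sum of a gene's connection strengths to the other genes of its own module, and the gene names, weights, and module assignments are hypothetical.

```python
def intramodular_connectivity(adj, modules):
    """Sum each gene's connection strengths to the genes of its own module."""
    k_in = {}
    for gene, row in adj.items():
        k_in[gene] = sum(
            weight
            for other, weight in row.items()
            if other != gene and modules[other] == modules[gene]
        )
    return k_in

def top_genes(adj, modules, module, k):
    """Return the k members of `module` with the highest intramodular connectivity."""
    k_in = intramodular_connectivity(adj, modules)
    members = [gene for gene in adj if modules[gene] == module]
    return sorted(members, key=lambda gene: k_in[gene], reverse=True)[:k]

# Toy weighted adjacency (hypothetical genes and weights), stored as a dict of dicts.
adj = {
    "a": {"b": 0.9, "c": 0.8, "d": 0.1},
    "b": {"a": 0.9, "c": 0.3, "d": 0.2},
    "c": {"a": 0.8, "b": 0.3, "d": 0.7},
    "d": {"a": 0.1, "b": 0.2, "c": 0.7},
}
modules = {"a": "tan", "b": "tan", "c": "tan", "d": "brown"}

print(top_genes(adj, modules, "tan", 2))
```

With a real network, `k` would be set to 50 to match the selection described above.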
  	
  
	
Because the turquoise module is the largest module in the weighted network and is enriched in cellular process and other terms (Figure 21), it is natural to consider narrowing its gene list with additional conditions. The list of genes that results from combining the turquoise module with the STEM condition shows clear temporal expression and enrichment in muscle and heart related GO terms (Figure 22, Figure 23), although it contains only four genes found in the list.
  	
  
3.2.2.4 Fine-tuned list
  
	
The network in Gaggle can serve as a visualization center as well as a fine-tuning filter for a list of genes, because the network is built from highly correlated gene pairs and therefore has reduced noise. This by no means implies that genes absent from the network should be discarded, but it is good to have the expected GO-term enrichment to confirm the list. Because GO-term enrichment depends on the proportion of genes in the list that share the same GO terms, the number of noisy genes in the whole list has a great impact on the enrichment. Importing the candidate list into the co-expression network would reduce the noise and yield a better enrichment result.
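The effect of noisy genes on enrichment can be illustrated with a one-sided hypergeometric test. This is a minimal sketch under stated assumptions: the test is a common choice for term enrichment and is not claimed to be the exact procedure Blast2GO applies, and all counts below are hypothetical.

```python
from math import comb

def enrichment_p(list_size, list_hits, background_size, background_hits):
    """One-sided hypergeometric p-value: P(at least `list_hits` annotated genes).

    `background_hits` of the `background_size` genes carry the term, and a list
    of `list_size` genes is drawn without replacement from the background.
    """
    upper = min(list_size, background_hits)
    total = comb(background_size, list_size)
    tail = sum(
        comb(background_hits, k)
        * comb(background_size - background_hits, list_size - k)
        for k in range(list_hits, upper + 1)
    )
    return tail / total

# Hypothetical counts: 200 of 10000 background genes carry a muscle-related term.
background_size, background_hits = 10000, 200

# The same 8 annotated genes, first in a clean list and then diluted by noisy genes.
p_clean = enrichment_p(40, 8, background_size, background_hits)
p_noisy = enrichment_p(120, 8, background_size, background_hits)

print(p_clean, p_noisy)
```

The number of annotated genes is identical in both lists, yet the diluted list gives a weaker (larger) p-value, which is the proportion effect described above.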
  	
  
	
Using the “broadcast” function in MeV, Cytoscape can receive the 336 significant genes and label them in yellow in the unweighted network, and a sub-network can then be created for the candidate genes. A subgroup of the candidate genes (Figure 24) is significantly enriched in muscle and heart related GO terms (Figure 25), an enrichment that Blast2GO previously could not report. The ASM candidate genes in the network are also enriched in muscle and heart GO terms (Figure 26), whereas Blast2GO still reports no enrichment for the Heart candidate genes in the network.
  	
  
	
  
  
	
  
	
  
4. DISCUSSION	
  
4.1 ASM candidate genes
  
	
COE is necessary and sufficient to specify ASM fate (Stolfi, Gainous et al. 2010). It is therefore understandable that COE is expressed earlier than the late-ASM genes (A3 group), such as MHC3, TPM1 and MA2. The up-down-up-ASM genes (A1 group), which include MYOD, show the earliest up-regulation. In Xenopus, the cross-regulatory interactions of COE orthologs with genes of the Myogenic Regulatory Factor (MRF) family, such as MYOD and MYF5, are crucial for muscle commitment and differentiation (Green, Vetter 2011). However, how COE may repress the cardiac fate and promote cell migration in Xenopus has never been studied. One possible hypothesis is that, in Ciona, the early functions controlled by COE in ASM precursors are independent of MRF activation, since the MRF in the A1 group is up-regulated earlier than COE in the A2 group.
  	
  
The A1 group genes are also more likely to be TVC genes, which could explain why heart related GO terms appear in the enrichment of the ASM genes in the weighted network (Figure 26).
  	
  
4.2 Annotation in Ciona intestinalis
  	
  
	
The draft genome sequence of the ascidian Ciona intestinalis (Dehal, Satou et al. 2002) has been a valuable research resource. However, there are numerous inconsistencies in the gene models because of the intrinsic limitations of gene prediction programs and the fragmented nature of the assembly (Satou, Mineta et al. 2008). The probe annotation in this study therefore focuses on combining available resources from various databases, such as Aniseed (Tassy, Dauga et al.), Ensembl Genome Browser (Kersey, Lawson et al. 2010), CIPRO (Endo, Ueno et al.), STRING (Szklarczyk, Franceschini et al. 2011) and UCSC Genome Browser (Karolchik, Hinrichs et al. 2011), as well as internal files from Dr. Lionel Christiaen’s lab. The 30,969 probes correspond to 16,250 non-redundant genes, and this annotation serves as the criterion for mapping a probe to a gene. Differences between the gene annotation in this thesis and other sources are unavoidable.
  	
  	
  
4.3 Functional ribosome group and COE
  	
  
The highly linked ribosome genes in the STRING network (Figure 19), which are enriched in ribosome process (Figure 20), naturally raise a question: what is the relationship between this functional ribosome group and COE? When this list of ribosome genes and COE is broadcast to MeV, the heat-map and the expression plot show that the ribosome group and COE behave similarly across the time-series experiments. This group of ribosome genes also has quite a stable expression profile. More housekeeping genes are likely to be found in the same module as the ribosome group, but they are not the focus of this thesis.
  
  
	
  
4.4 Time-series
  
Clustering algorithms such as Hierarchical clustering (Eisen, Spellman et al. 1998), K-means, and Self-organizing Maps (SOM) (Tamayo, Slonim et al. 1999) can be used to analyze Microarray data and yield many biological insights. However, they are not designed for time-series data: they assume that the data at each time point are collected independently of each other and ignore the sequential nature of time-series data (Ernst, Nau et al. 2005). This thesis applies the Short Time-series Expression Miner (STEM), which is designed for the analysis of short time-series Microarray gene expression data (Ernst, Bar-Joseph 2006), to the time-series experiments in the hope of finding clues about the true biological pattern. The STEM algorithm (Ernst, Nau et al. 2005) starts by selecting a set of potential expression profiles that covers the entire space of expression profiles that can be generated by the genes in the experiment; each profile represents a unique temporal expression pattern. Next, each gene is assigned to one of the profiles. After a permutation test identifies the significant model profiles, a greedy algorithm groups them into larger clusters (Ernst, Nau et al. 2005), which are colored in the top list of the user interface.
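The assignment step described above can be sketched in Python. This is a simplified illustration and not the STEM implementation: the model profiles, gene names, and expression values are hypothetical, the closest profile is taken to be the one with the highest Pearson correlation, and the permutation test and the greedy grouping of profiles are left out.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def assign_to_profiles(genes, model_profiles):
    """Assign each gene to the model profile it correlates with best."""
    clusters = {name: [] for name in model_profiles}
    for gene, series in genes.items():
        best = max(
            model_profiles,
            key=lambda name: pearson(series, model_profiles[name]),
        )
        clusters[best].append(gene)
    return clusters

# Hypothetical model profiles over four time points, each starting at 0.
model_profiles = {
    "up": [0, 1, 2, 3],
    "down": [0, -1, -2, -3],
    "up_down": [0, 1, 1, 0],
}

# Hypothetical expression changes relative to the first time point.
genes = {
    "geneA": [0.0, 0.8, 2.1, 2.9],
    "geneB": [0.0, -1.2, -1.9, -3.3],
    "geneC": [0.0, 1.1, 0.9, 0.1],
}

print(assign_to_profiles(genes, model_profiles))
```

Because the profiles are fixed in advance and independent of the data, the number of genes that land in each profile can afterwards be compared with the number expected by chance.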
  	
  
It is worth mentioning that STEM is designed for short time series (defined as 3 to 8 time points on its website), whereas this Microarray dataset has 11 time points.
  	
  
  
	
  
4.5 Limitations of the co-expression network
  	
  
	
  	
  	
Co-expression network approaches have several limitations, including the following. First, the network similarity is based on the Pearson Correlation Coefficient, which is sensitive to outliers. The quality of the input matrix is therefore important to the final result. It would be helpful to try a data transformation or to use Spearman’s rank correlation coefficient.
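The sensitivity to outliers can be demonstrated with a short Python sketch that compares the two coefficients on the same pair of profiles. The data are hypothetical, and Spearman's coefficient is computed here as the Pearson correlation of the ranks, with tied values sharing their average rank.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical profiles: unrelated in five samples, plus one extreme sample.
gene_x = [1.0, 2.0, 3.0, 4.0, 5.0, 50.0]
gene_y = [5.0, 3.0, 4.0, 1.0, 2.0, 60.0]

print(round(pearson(gene_x, gene_y), 3), round(spearman(gene_x, gene_y), 3))
```

A single extreme sample is enough to push the Pearson coefficient above the 0.9 threshold used for the unweighted network, while the rank-based coefficient stays near zero.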
  	
  
	
A second limitation is that a co-expression network based on the Pearson Correlation Coefficient is more suitable for finding globally co-expressed genes (Qian, Dolled-Filhart et al. 2001); it cannot accurately detect the time-delayed or transient responses of downstream effectors in time-series experiments. It would be better to use local clustering (Qian, Dolled-Filhart et al. 2001) to find time-delayed or locally co-expressed genes, or other tools specialized in long time-series experiments, such as the Graphical Query Language (GQL) (Costa, Schönhuth et al. 2005).
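Why a global correlation misses a delayed response can be shown with a simple lagged correlation. The sketch below is not the local clustering method of Qian et al. nor GQL; it only shifts one series against the other and recomputes the Pearson coefficient. The two 11-point series are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def best_lag(x, y, max_lag):
    """Return (lag, r) maximizing the correlation between x[t] and y[t + lag]."""
    best = (0, pearson(x, y))
    for lag in range(1, max_lag + 1):
        r = pearson(x[:-lag], y[lag:])
        if r > best[1]:
            best = (lag, r)
    return best

# Hypothetical regulator and a target that responds two time points later.
regulator = [0.0, 0.0, 1.0, 3.0, 5.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
target = [0.0, 0.0, 0.0, 0.0, 1.0, 3.0, 5.0, 3.0, 1.0, 0.0, 0.0]

print(round(pearson(regulator, target), 3))
print(best_lag(regulator, target, max_lag=3))
```

Without a shift the two series look almost uncorrelated and would not be linked in the network, although the target follows the regulator exactly after a delay.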
  	
  
	
A third limitation is that it is difficult to pick thresholds for a biological network. The hard threshold of the unweighted network arbitrarily cuts off some biologically meaningful edges. Modules with weak weights are likewise cut off in the weighted network, although such weak linkages may be biologically meaningful.
  	
  
  
	
  
Figures and tables
  
	
  
Figure 1 Pipeline.
  	
  
  
	
  
	
  
Figure 2 Normalized unscaled standard error (NUSE).
One of the tests in arrayQualityMetrics, NUSE, detected sample LacZ3 as an outlier.
  	
  
	
  
Figure 3 Heat-map of ASM and Heart candidate genes.
ASM candidate genes are red in the first and third columns. A1: up-down-up-ASM. A2: early-ASM. A3: late-ASM. Heart candidate genes are red in the second column. H1: early-Heart. H2: late-Heart.
  	
  
  
	
  
	
  
Figure 4 Output of the Short Time-series Expression Miner.
Significant clusters are colored in the top row.
  	
  
[Figure 5 plots. Left panel: Scale independence (x-axis: Soft Threshold (power); y-axis: Scale Free Topology Model Fit, signed R^2). Right panel: Mean connectivity (x-axis: Soft Threshold (power); y-axis: Mean Connectivity). Powers 1 to 20 are plotted.]
	
  
Figure 5 Selecting soft power.
The soft threshold power beta of 6 is chosen for calculating the adjacency matrix because it reaches a high topology model fit (R^2) together with high mean connectivity.
  	
  
	
  
  
	
  
	
  
Figure	
  6	
   Ciona	
  intestinalis	
  weighted	
  co-­‐expression	
  network.	
  	
  
The	
  dendrogram	
  results	
  from	
  average	
  linkage	
  hierarchical	
  clustering.	
  
The	
   color-­‐band	
   below	
   the	
   dendrogram	
   denotes	
   the	
   modules,	
   which	
  
are	
   defined	
   as	
   branches	
   in	
   the	
   dendrogram.	
   Of	
   the	
   10,	
   079	
   genes,	
  
6162	
   were	
   clustered	
   into	
   13	
   modules,	
   and	
   the	
   remaining	
   genes	
   are	
  
colored	
  in	
  grey.	
  
	
  
  
	
  
[Figure 7 plots. Top panel: Dynamic-cutree Module Significance (COE-COEW modt), p = 3.1e-86 (x-axis: Dynamic Module; y-axis: coesig). Bottom panel: Counts per module. Modules: black, blue, brown, green, greenyellow, grey, magenta, pink, purple, red, tan, turquoise, yellow.]
	
  
Figure 7 Module significance.
Module significance is determined as the average absolute gene significance (defined as minus the log of a p-value) over all genes in a given module.
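The definition in this caption can be written out as a short Python sketch. The gene names, p-values, and module assignments are hypothetical, and the logarithm is taken to base 10, following the -log10 definition of significance given in the text.

```python
import math
from collections import defaultdict

def module_significance(p_values, modules):
    """Average absolute gene significance (-log10 of the p-value) per module."""
    by_module = defaultdict(list)
    for gene, p in p_values.items():
        gene_significance = abs(-math.log10(p))
        by_module[modules[gene]].append(gene_significance)
    return {
        module: sum(values) / len(values)
        for module, values in by_module.items()
    }

# Hypothetical moderated t test p-values and module colors.
p_values = {"gene1": 1e-4, "gene2": 1e-2, "gene3": 0.1, "gene4": 1.0}
modules = {"gene1": "tan", "gene2": "tan", "gene3": "grey", "gene4": "grey"}

print(module_significance(p_values, modules))
```

A module whose genes mostly have small p-values obtains a high module significance, which is how the tan and brown modules stand out in the text.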
  
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingTeacherCyreneCayanan
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...Sapna Thakur
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Sapana Sha
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfchloefrazer622
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13Steve Thomason
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfchloefrazer622
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...PsychoTech Services
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...fonyou31
 

Kürzlich hochgeladen (20)

9548086042 for call girls in Indira Nagar with room service
9548086042  for call girls in Indira Nagar  with room service9548086042  for call girls in Indira Nagar  with room service
9548086042 for call girls in Indira Nagar with room service
 
1029-Danh muc Sach Giao Khoa khoi 6.pdf
1029-Danh muc Sach Giao Khoa khoi  6.pdf1029-Danh muc Sach Giao Khoa khoi  6.pdf
1029-Danh muc Sach Giao Khoa khoi 6.pdf
 
Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3Q4-W6-Restating Informational Text Grade 3
Q4-W6-Restating Informational Text Grade 3
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1Código Creativo y Arte de Software | Unidad 1
Código Creativo y Arte de Software | Unidad 1
 
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
BAG TECHNIQUE Bag technique-a tool making use of public health bag through wh...
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptx
 
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111Call Girls in Dwarka Mor Delhi Contact Us 9654467111
Call Girls in Dwarka Mor Delhi Contact Us 9654467111
 
Disha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdfDisha NEET Physics Guide for classes 11 and 12.pdf
Disha NEET Physics Guide for classes 11 and 12.pdf
 
The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13The Most Excellent Way | 1 Corinthians 13
The Most Excellent Way | 1 Corinthians 13
 
Arihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdfArihant handbook biology for class 11 .pdf
Arihant handbook biology for class 11 .pdf
 
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
IGNOU MSCCFT and PGDCFT Exam Question Pattern: MCFT003 Counselling and Family...
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
Ecosystem Interactions Class Discussion Presentation in Blue Green Lined Styl...
 

Visualization hang zhong

  • 1.
Visualization of Ciona Intestinalis Co-expression Network
by
Hang Zhong
A dissertation submitted in partial fulfillment of the requirements for the degree of Master of Science
Department of Biology
New York University
May, 2012
  • 2.
ACKNOWLEDGEMENTS
I would like to thank my advisor, Richard Bonneau, for giving me the opportunity to participate in this project and for his ongoing guidance and support. I am also indebted to Professor Lionel Christiaen for inspiring the project. This thesis could not have come to fruition without the help of Florian Razy, who offered insightful and thought-provoking input.
I am also everlastingly grateful to Duncan Penfold-Brown for teaching me programming. I would also like to thank Kieran Mace, Aviv Madar, Kevin Drew, Maximilian Haeussler and Claudia Racioppi, who so patiently offered their time and support. Many thanks to Todd Heiniger and Joel Rodriguez for revising the thesis.
Finally, I would like to thank my family for the invaluable support they have given me in the course of my life and studies.
  • 3.
ABSTRACT
Abnormalities of heart development cause the most frequent congenital diseases in humans. The conservation of the Gene Regulatory Network (GRN) involved in heart development, cellular simplicity, low genetic redundancy and a relevant evolutionary position have led researchers to study the ascidian Ciona intestinalis. To extract useful information from the microarray data so that researchers can infer the heart network in Ciona, this thesis not only applies standard approaches to find differentially expressed genes, but also explores network-based approaches to find functional groups. By visualizing the co-expression network in Gaggle, the lists of ASM and heart candidate genes are fine-tuned. In addition, the modules containing candidate and known marker genes may deserve further study.
  • 4.
TABLE OF CONTENTS
ABSTRACT ... 3
1. INTRODUCTION ... 7
1.1 GENE REGULATORY NETWORK OF CARDIOGENIC PRECURSORS IN CIONA ... 7
1.2 MICROARRAY DATA ANALYSIS ... 8
1.3 NETWORK VISUALIZATION THROUGH GAGGLE ... 9
2. METHODS ... 10
2.1 MICROARRAY EXPERIMENTAL DESIGN ... 10
2.2 GENE EXPRESSION DATA ... 10
2.2.1 QUALITY CONTROL ... 10
2.2.2 PREPROCESSING ... 11
2.3 STATISTICAL TEST ... 11
2.4 CLUSTER ANALYSIS ... 11
2.5 FUNCTIONAL ENRICHMENT ANALYSIS ... 12
2.6 GENERATION OF NETWORKS ... 12
2.6.1 STRING PROTEIN NETWORK ... 12
2.6.2 UNWEIGHTED CO-EXPRESSION NETWORK ... 13
2.6.3 WEIGHTED CO-EXPRESSION NETWORK ... 13
2.7 NETWORK VISUALIZATION ... 14
2.7.1 FILE FORMAT ... 14
2.7.2 ANALYZING NETWORK BY PLUGIN IN CYTOSCAPE ... 14
3. RESULTS ... 15
3.1 DIFFERENTIAL EXPRESSION ... 15
3.1.1 EXPECTATION OF THE MICROARRAY DATA ... 15
3.1.2 ASM AND HEART CANDIDATE GENES ... 15
3.2 NETWORK VISUALIZATION IN GAGGLE ... 17
  • 5.
3.2.1 NETWORKS ... 17
3.2.2 FINDINGS FROM THE NETWORK VISUALIZATION IN GAGGLE ... 20
3.2.2.1 GAGGLE AS INFORMATION INTEGRATION CENTER ... 20
3.2.2.2 MODULE FROM ALLEGROMCODE ... 21
3.2.2.3 MODULE FROM WEIGHTED NETWORK ... 22
3.2.2.4 FINE-TUNED LIST ... 23
4. DISCUSSION ... 25
4.1 ASM CANDIDATE GENES ... 25
4.2 ANNOTATION IN CIONA INTESTINALIS ... 25
4.3 FUNCTIONAL RIBOSOME GROUP AND COE ... 26
4.4 TIME-SERIES ... 27
4.5 LIMITATIONS OF THE CO-EXPRESSION NETWORK ... 28
FIGURES AND TABLES ... 29
FIGURE 1 PIPELINE. ... 29
FIGURE 2 NORMALIZED UNSCALED STANDARD ERROR (NUSE). ... 30
FIGURE 3 HEAT-MAP OF ASM AND HEART CANDIDATE GENES. ... 30
FIGURE 4 OUTPUT OF THE SHORT TIME-SERIES EXPRESSION MINER. ... 31
FIGURE 5 SELECTING SOFT POWER. ... 31
FIGURE 6 CIONA INTESTINALIS WEIGHTED CO-EXPRESSION NETWORK. ... 32
FIGURE 7 MODULE SIGNIFICANCE. ... 33
FIGURE 8 INTRAMODULAR CONNECTIVITY AND MODULE SIGNIFICANCE. ... 34
FIGURE 9 STRING PROTEIN NETWORK. ... 35
FIGURE 10 LABELING IN WEIGHTED NETWORK. ... 35
FIGURE 11 THE 1ST MODULE INFERRED BY ALLEGROMCODE FOR UNWEIGHTED CO-EXPRESSION NETWORK. ... 36
FIGURE 12 THE 1ST MODULE OF UNWEIGHTED CO-EXPRESSION NETWORK ENRICHMENT. ... 37
FIGURE 13 THE 1ST MODULE INFERRED BY ALLEGROMCODE FOR WEIGHTED CO-EXPRESSION NETWORK. ... 37
  • 6.
FIGURE 14 THE 1ST MODULE OF WEIGHTED NETWORK ENRICHMENT. ... 37
FIGURE 15 RIBOSOME GROUP IN THE STRING. ... 38
FIGURE 16 RIBOSOME GROUP IN STRING NETWORK ENRICHMENT. ... 38
FIGURE 17 RIBOSOME GROUP AND COE. ... 39
FIGURE 18 GREY COLOR GENES. ... 39
FIGURE 19 TAN MODULE ... 40
FIGURE 20 BROWN MODULE ... 40
FIGURE 21 TURQUOISE MODULE ENRICHMENT. ... 41
FIGURE 22 GENES IN TURQUOISE PLUS STEM CONDITION. ... 41
FIGURE 23 GENES OF TURQUOISE PLUS STEM CONDITION ENRICHMENT. ... 42
FIGURE 24 SUB-GROUP OF CANDIDATE GENES IN UNWEIGHTED NETWORK. ... 42
FIGURE 25 SUB-GROUP OF CANDIDATE GENES IN UNWEIGHTED NETWORK ENRICHMENT. ... 43
FIGURE 26 ASM CANDIDATE GENES IN WEIGHTED NETWORK ENRICHMENT. ... 43
FIGURE 27 ASM AND HEART CANDIDATE GENES ... 44
REFERENCES ... 45
  • 7.
1. INTRODUCTION
1.1 Gene regulatory network of cardiogenic precursors in Ciona
Abnormalities of heart development cause the most frequent congenital diseases in humans. The conservation of the Gene Regulatory Network (GRN) involved in heart development, cellular simplicity, low genetic redundancy and a relevant evolutionary position have led researchers to study the ascidian Ciona intestinalis (Davidson 2007). In Ciona, a single pair of blastomeres called B7.5 gives rise to the anterior tail muscle (ATM) and to the trunk ventral cells (TVC) (Figure 27). Following migration from the tail, the TVCs undergo asymmetric cell divisions at the ventral midline of the trunk. The medial TVCs give rise to the heart, while the lateral TVCs migrate toward the atrial placode, where they will form the atrial siphon muscles (ASM). Thus, the TVCs are similar to the multipotent cardio-pharyngeal progenitors found in vertebrates, while the ASM are likely equivalent to the jaw muscle in vertebrates.
A few years ago, the first cardiogenic Gene Regulatory Network (GRN) in Ciona was proposed (Christiaen, Davidson et al. 2008), decoupling genes necessary for heart specification from genes necessary for cell migration. A later study showed that ASM precursors express the transcription factor COE (Stolfi, Gainous et al.
  • 8.
2010), which is necessary and sufficient to specify ASM fate. Misexpression of COE in the whole TVC lineage blocks heart development and imposes an ASM fate on all cells. Conversely, misexpression of a constitutive repressor form of COE provokes the opposite phenotype, blocking ASM formation and causing all cells to form heart tissue. Genome-wide microarray analysis of this crucial COE gene and of its downstream effectors is expected to give insight into the gene regulatory network of the heart.
1.2 Microarray data analysis
Most existing studies have focused on differential expression to identify genes that distinguish different sets of samples. It is quite common to apply tests such as the t-test, the F-test, or nonparametric tests such as the Wilcoxon test to rank thousands of genes, after which the most significant genes are selected (Gentleman 2005). Other specific statistical methods are also commonly used in microarray data analysis, such as Significance Analysis of Microarrays (SAM) (Tusher, Tibshirani et al. 2001) and LIMMA (Wettenhall, Smyth 2004) using a Bayesian mixture model.
Another way of using microarray data is to understand the network properties of an individual gene or protein by studying co-expression, where genes that have similar expression patterns across a set of samples are hypothesized to have a functional relationship.
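The co-expression idea above can be made concrete with a minimal sketch. It is written in Python with hypothetical gene names and expression values that are not taken from the thesis data; it only illustrates how the similarity of two expression profiles could be scored with the Pearson correlation coefficient.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant profile has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values across five samples (illustration only).
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.1, 3.9, 6.2, 8.0, 9.8]  # rises together with gene_a
gene_c = [5.0, 4.0, 3.0, 2.0, 1.0]  # mirrors gene_a

r_ab = pearson(gene_a, gene_b)  # close to +1: candidate co-expressed pair
r_ac = pearson(gene_a, gene_c)  # close to -1: anti-correlated pair
```

A pair such as gene_a and gene_b would be a candidate for a functional relationship under the co-expression hypothesis; whether anti-correlated pairs are also of interest depends on the analysis.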
  • 9.
This co-expression network-based approach is consistent with an important concept that has emerged over the past decade: genes and their protein products carry out cellular processes in the context of functional modules and are related (Barabasi, Bonabeau 2003, Barabasi, Oltvai 2004).
1.3 Network visualization through Gaggle
It is well recognized that visualization plays a key role in helping to understand biological systems, particularly in the era of high-throughput studies with a wealth of 'omics'-scale data (Gehlenborg, O'Donoghue et al. 2010). This thesis applies the simple, open-source Java software system Gaggle (Shannon, Reiss et al. 2006) for co-expression network visualization. Gaggle is a cross-platform system integrated with diverse databases (KEGG, BioCyc, and String) and software (Cytoscape, DataMatrixViewer, R statistical environment, and TIGR Microarray Expression Viewer). With four simple data types (names, matrices, networks, and associative arrays), researchers can explore many different data sources and a variety of software tools by entering this information into the Gaggle Boss, from which it is transferred to the other tools.
  • 10.
2. METHODS
The pipeline of this thesis is shown in Figure 1.
2.1 Microarray experimental design
The microarray data used in this study were kindly provided by Dr. Lionel Christiaen. They consist of 30,969 probe sets from Affymetrix GeneChips. The perturbation group includes the LacZ control and the over-expression and loss of function of the transcription factor Collier/EBF/Olf (COE) in sorted TVC cells at 21 hours post fertilization (hpf), after the asymmetric divisions of the TVCs but before completion of the ASM migration. The time-series group comprises 11 time points in TVC cells, sampled every 2 hours from 8 to 28 hours.
2.2 Gene expression data
2.2.1 Quality control
This thesis applies arrayQualityMetrics (Kauffmann, Gentleman et al. 2009), a Bioconductor package for quality control. It provides an HTML report with several diagnostic plots. In general, an array is discarded if the report identifies it as an outlier both before and after normalization.
The microarray data are first imported into the statistical programming language R and then quality-checked with arrayQualityMetrics. The sample LacZ.3 is removed since it was
  • 11.
reported as an outlier both before and after normalization (Figure 2).
2.2.2 Preprocessing
The CEL files of the microarray are normalized by the RMA method (Gentleman 2005). The expression matrix contains 30,969 probes and 48 arrays. After non-specific filtering by variance (IQR = 0.5), the matrix contains 15,484 probes and 48 arrays.
Using the collapseRows function in WGCNA, the probes with maximum variance are selected to represent genes. After merging the probes, the merged matrix contains 10,079 probes and 48 arrays.
2.3 Statistical test
The merged matrix is ranked by a moderated F test, and genes with a significant p-value (< 0.05, using the Limma package) (Smyth 2004) after adjustment by the Benjamini-Hochberg method are selected. After ranking, the top-rank matrix contains 4,307 probes and 48 arrays.
The top-rank matrix is imported into MultiExperiment Viewer (MeV), one of the Gaggle geese, and subjected to the Significance Analysis of Microarrays (SAM) test (COE versus COEW group, p-value < 0.05, 1000 permutations, FDR = 0.9).
2.4 Cluster analysis
  • 12.
Hierarchical clustering is performed for the ASM and Heart candidate genes in MeV, using the Pearson correlation metric and average linkage clustering.
The time-series group data, totaling 36 arrays, are averaged for each time point and imported into the Short Time-series Expression Miner (STEM), using the STEM Clustering Method.
2.5 Functional enrichment analysis
Blast2GO (B2G) (Conesa, Götz et al. 2005) is a comprehensive bioinformatics tool for annotation, visualization and analysis in functional genomics research. It offers a suitable platform for functional research in non-model species, such as Ciona intestinalis.
DNA sequences in FASTA format were loaded into Blast2GO. 15,629 genes remained in Blast2GO; BLAST searching and GO mapping then yielded GO terms for 3,964 genes. The test groups from the different lists are tested against the reference group (3,964 genes) using Fisher's Exact Test (p-value < 0.05, FDR correction).
2.6 Generation of networks
2.6.1 String protein network
Using the Ensembl gene names in the filt.gene matrix as input, the genes of interest in the Search Tool for the Retrieval of Interacting Genes (STRING) database (Szklarczyk, Franceschini et al. 2011) are extracted from the STRING website in Text Summary format and
  • 13.
parsed to Cytoscape simple interaction format (SIF) (Shannon, Markiel et al. 2003) with the Python programming language.
2.6.2 Unweighted co-expression network
The Pearson correlation coefficient for all pair-wise comparisons of genes is calculated from the filt.gene matrix in R. Highly correlated gene pairs are selected with a cutoff of 0.9 and parsed to simple interaction format (SIF) (Shannon, Markiel et al. 2003) with Python.
2.6.3 Weighted co-expression network
2.6.3.1 Network construction
The procedure can be found on the WGCNA website (Horvath 2011).
2.6.3.2 Module detection
Pearson correlation coefficients are calculated for all pair-wise comparisons of genes across all samples. The resulting Pearson correlation matrix is transformed into the weighted adjacency matrix with the soft power beta = 6. Average linkage hierarchical clustering is used to group genes on the basis of the topological overlap dissimilarity measure of their network connection strengths (Zhang, Horvath 2005). Using a dynamic tree-cutting algorithm (Langfelder, Zhang et al. 2008), 13 modules are found with a minimum cluster size of 70 (Figure 6). Genes that are not assigned to any module are given the color grey.
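The two network definitions in Sections 2.6.2 and 2.6.3.2 can be contrasted in a short Python sketch. It rests on assumptions that the text does not state: the 0.9 cutoff is applied to the absolute correlation, and the weighted adjacency is the unsigned form |r|^beta. Gene names and expression values are hypothetical, and the topological overlap and tree-cutting steps are not reproduced.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def coexpression_networks(expr, hard_cutoff=0.9, beta=6):
    """Build both network types from a dict mapping gene -> expression profile.

    Returns (edges, adjacency):
      edges     - gene pairs with |r| >= hard_cutoff (unweighted network)
      adjacency - dict of pair -> |r| ** beta (weighted network; assumes the
                  unsigned adjacency, which the thesis text does not specify)
    """
    genes = sorted(expr)
    edges = []
    adjacency = {}
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            r = pearson(expr[g1], expr[g2])
            if abs(r) >= hard_cutoff:
                edges.append((g1, g2))
            adjacency[(g1, g2)] = abs(r) ** beta
    return edges, adjacency


# Hypothetical profiles across five samples (illustration only).
expr = {
    "g1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "g2": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly correlated with g1
    "g3": [1.0, 3.0, 2.0, 5.0, 4.0],   # r = 0.8 with g1 and g2
}
edges, adjacency = coexpression_networks(expr)
```

The hard cutoff discards the g1-g3 pair entirely, whereas the soft power keeps it with a small weight (0.8 to the sixth power, roughly 0.26), which is the practical difference between the unweighted and weighted networks.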
  • 14.
2.6.3.3 Module significance
The p-value of the moderated t-test is taken from the topTable output of the affylmGUI package in R (Smyth 2004).
2.7 Network visualization
2.7.1 File format
The output files from WGCNA are parsed to simple interaction format (SIF) (Shannon, Markiel et al. 2003) with Python.
2.7.2 Analyzing network by plugin in Cytoscape
The AllegroMCODE and Network Analysis plugins in Cytoscape are used to analyze the network. Clusters are found automatically by AllegroMCODE.
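The SIF conversion mentioned in Section 2.7.1 can be sketched as follows. This is not the thesis's parser: the function name, the interaction label "coexp" and the gene names are hypothetical. It only shows the basic SIF layout, in which each line holds a source node, an interaction type and a target node separated by tabs.

```python
import io


def write_sif(edges, handle, interaction="coexp"):
    """Write (source, target) pairs as tab-delimited SIF lines.

    The interaction label is a hypothetical placeholder, not a name
    taken from the thesis.
    """
    for source, target in edges:
        handle.write(f"{source}\t{interaction}\t{target}\n")


# Hypothetical edge list; a real run would write to a .sif file instead.
buffer = io.StringIO()
write_sif([("geneA", "geneB"), ("geneA", "geneC")], buffer)
sif_text = buffer.getvalue()
```

A file written this way can be imported into Cytoscape as a network, after which plugins such as those named above can be applied.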
  • 15. 3. RESULTS
3.1 Differential expression
3.1.1 Expectation of the Microarray data

Genes that are up-regulated when COE is overexpressed or down-regulated when COE function is lost are considered ASM candidate genes downstream of COE, while genes that are down-regulated when COE is overexpressed or up-regulated when COE function is lost are considered Heart candidate genes repressed by COE (Stolfi, Gainous et al. 2010).

Using the COE and COEW groups as the two classes in the Significance Analysis of Microarrays (SAM), the contrast yields the ASM and Heart candidate genes.

3.1.2 ASM and Heart candidate genes
3.1.2.1 Lists from SAM

336 significant genes are derived from SAM and separated into 206 ASM candidate genes (negative in SAM, expression of the COE group lower than that of the COEW group) and 130 Heart candidate genes (positive in SAM, expression of the COE group higher than that of the COEW group). These two groups can be distinguished by the first three columns in the heat-map (Figure 3, Figure 27).
  • 16. Based on the hierarchical clustering and observation, the ASM candidate genes can be roughly divided into three large groups:

A1. The first group (up-down-up-ASM, 61 genes) shows a “U”-shaped curve through the time-series experiments, with the earliest up-regulation right at the experimental time point of 8 hours. This group contains Snail (‘SNAIL’ in the thesis), SET and MYND Domain 1 (SMYD1) and Myoblast determination protein (Myod, ‘MYOD’ in the thesis).

A2. The second group (early-ASM, 45 genes), including COE and the Myocyte Regulatory Light Chain (MRLC5, ‘MYL5’ in the thesis) gene, shows early up-regulation around 14 hours.

A3. The third group (late-ASM, 100 genes) has relatively late up-regulation after 18 hours, with the myosin heavy chain gene (MHC3), tropomyosin 1 (TPM1, ‘CTM1’ in the thesis) and muscle like actin 2 (MA2) in the group.

The Heart candidate genes can be divided into two large groups:

H1. The first group (early-Heart, 99 genes) shows early up-regulation (before 20 hours) and contains the heart markers BMP2/4, NK4, NOTRLC/HAND-LIKE, and ETS/POINTED2.
  • 17. H2. The second group (late-Heart, 31 genes) displays relatively late up-regulation (after 20 hours), with mesenchyme specific gene 3 (MECH3) in the group.

As expected, the two gene lists contain some important markers and show noticeable temporal expression patterns. However, these ASM and Heart candidate genes did not show GO-term enrichment in Blast2GO, which might indicate the need to fine-tune the lists, although the small number of GO terms available in Blast2GO is another concern. Further improvement of the ASM and Heart candidate gene lists would require knowing the effect of the non-specific filtering, of selecting the probe for a gene by maximum variance, and of the SAM ranking.

3.1.2.2 Clusters from STEM

A total of 7 significant model profiles are shown in the STEM output. 23 of the 206 ASM candidate genes are in the significant profiles. Most of them are in profile 20, which is similar to the late-ASM group and includes the MHC3, MA2 and MYL5 genes. For the Heart candidate genes, 13 out of 130 are in the significant profiles.

3.2 Network Visualization in Gaggle
3.2.1 Networks
3.2.1.1 STRING protein network

The STRING (Szklarczyk, Franceschini et al. 2011) protein network is created to make good use of the existing data resources. It
  • 18. provides both experimental interaction information and interactions predicted by computational techniques, presented as different edge colors (Figure 9).

3.2.1.2 Co-expression network

The network-based approaches, also termed graph-based approaches, aim to extract recurrent expression patterns or conserved modules from the rapidly accumulating Microarray datasets. The Microarray dataset is modeled as a relation graph in which each node represents one gene and two genes are connected by an edge based on an expression correlation parameter (Zhang, Horvath 2005) that measures the similarity between expression profiles (the Pearson Correlation Coefficient is used in this thesis). The graph, namely the network, can be represented by an adjacency matrix that encodes whether a pair of nodes is connected. For unweighted networks, entries are 1 or 0. For weighted networks, the adjacency matrix reports the connection strength for each gene pair, between 0 and 1 (Zhang, Horvath 2005). The concept of connectivity in graph theory, also termed degree, can be defined as the row sum of the adjacency matrix; it measures the number of direct neighbors of a node in unweighted networks and the sum of connection strengths in weighted networks.

Two co-expression networks are generated in this thesis.
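The adjacency-matrix and connectivity definitions above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the correlation values are made up, the function names are hypothetical, the weighted form |r|^beta is the unsigned soft-threshold adjacency of Zhang and Horvath (2005) (the thesis does not state whether a signed or unsigned variant was used), and the diagonal is set to 0 here so that a gene is not counted as its own neighbor.

```python
def unweighted_adjacency(cor, cutoff=0.9):
    """Hard threshold: 1 if the correlation exceeds the cutoff, else 0."""
    n = len(cor)
    return [[1 if i != j and cor[i][j] > cutoff else 0 for j in range(n)]
            for i in range(n)]


def weighted_adjacency(cor, beta=6):
    """Soft threshold: connection strength |r|**beta, always within [0, 1]."""
    n = len(cor)
    return [[abs(cor[i][j]) ** beta if i != j else 0.0 for j in range(n)]
            for i in range(n)]


def connectivity(adj):
    """Connectivity (degree) of each node: the row sum of the adjacency matrix."""
    return [sum(row) for row in adj]


# Toy Pearson correlation matrix for three genes (hypothetical values).
cor = [
    [1.00, 0.95, 0.50],
    [0.95, 1.00, 0.92],
    [0.50, 0.92, 1.00],
]
hard = unweighted_adjacency(cor)
soft = weighted_adjacency(cor)
print(connectivity(hard))  # number of direct neighbors per gene
print(connectivity(soft))  # summed connection strengths per gene
```

In the unweighted case the pair with r = 0.5 simply disappears, while in the weighted case it is kept with a small strength (0.5 to the power 6), which is the practical difference between the two networks.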
  • 19. The unweighted co-expression network is formed by the gene pairs with a Pearson Correlation Coefficient higher than 0.9. A total of 766 nodes are in this unweighted network, with a clustering coefficient of 0.311 (output of the Network Analysis plugin in Cytoscape; it measures the cohesiveness of the neighborhood of a node).

The gene pairs with the 5000 strongest weights are exported to build the weighted co-expression network (the cutoff for the weight is 0.23), which has a total of 814 nodes and a clustering coefficient of 0.728.

The unweighted network has more isolated clusters consisting of only 2 nodes linked by 1 edge. The weighted network has greater density with some hubs (nodes of high connectivity), and its nodes are colored according to the modules detected by WGCNA.

Though these two networks differ in their adjacency matrices, both are based on the Pearson Correlation Coefficient and place highly similar genes close together in the graph. In other words, genes with the same expression profile across all of the experiments are close to each other in the network. These network-based approaches allow the position of a biological entity to be explored in the context of its local neighborhood in the graph and of the network as a whole, and they are less troubled by the inherent noise that confounds conventional pairwise approaches (Freeman, Goldovsky et al. 2007).
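The clustering coefficient quoted above can be illustrated with a short sketch. It assumes the common definition in which a node's coefficient is the fraction of pairs of its neighbors that are themselves connected, nodes with fewer than two neighbors contribute 0, and the network value is the average over all nodes; whether the Cytoscape plugin uses exactly this convention is an assumption here, and the toy graph and function names are hypothetical.

```python
from itertools import combinations


def local_clustering(graph, node):
    """Fraction of neighbor pairs of `node` that are directly connected."""
    neighbors = graph[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    links = sum(1 for u, v in combinations(neighbors, 2) if v in graph[u])
    return 2.0 * links / (k * (k - 1))


def average_clustering(graph):
    """Network clustering coefficient: mean of the local coefficients."""
    return sum(local_clustering(graph, n) for n in graph) / len(graph)


# Toy undirected network: a triangle with a tail, plus an isolated 2-node cluster.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("e", "f")]
graph = {}
for u, v in edges:
    graph.setdefault(u, set()).add(v)
    graph.setdefault(v, set()).add(u)

print(round(average_clustering(graph), 3))
```

Under this definition, isolated two-node clusters such as e and f pull the average down, which is consistent with the lower coefficient reported for the unweighted network.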
  • 20. 3.2.2 Findings from the network visualization in Gaggle
3.2.2.1 Gaggle as information integration center

In this post-genomic era, biologists often face the challenge of freely exploring experimental and computational data from many different sources and diverse software tools: storing different data for genes, retrieving data for a list of genes, and mapping one list of genes onto another. Once the network has been loaded in Cytoscape, Gaggle, as an information integration center, can help to solve these problems with respect to Microarray data.

Storing different data for genes can be achieved by labeling. As shown in Figures 9 and 10, the two networks present data from 6 different sources: node color for module, node label for ASM or Heart candidate genes, node shape for significance in the moderated F test, node size for connectivity, edge color for the type of interaction, and distance between nodes for closeness. The network in Cytoscape therefore functions as a visual database.

Retrieving data for a list of genes, such as the expression matrix, is also feasible through the basic function “broadcast” in Gaggle. For example, a list of genes of interest in Cytoscape can be sent to the Gaggle Boss and then broadcast to the Data Matrix Viewer (DMV), which can output the expression matrix.
  • 21. Mapping one list of genes onto another can be done conveniently in Gaggle through the many functions that it offers. In the MultiExperiment Viewer (MeV), a sub-list of genes can be launched in a new viewer. In Cytoscape, the function “Create new network from selected nodes” can be used for this task. Between different tools, the function “broadcast” serves as a bridge to transfer the list and map it in the existing tools.

3.2.2.2 Module from AllegroMCODE

The main goal of the co-expression network visualization is to find the highly correlated genes (modules) related to the ASM or Heart network, specifically aiming to infer targets of the transcription factor COE.

In the unweighted network, which has no predefined modules, the modules can be detected automatically by AllegroMCODE, a Cytoscape plugin that finds highly interconnected groups of nodes in a large, complex network. The 1st module detected by AllegroMCODE for the unweighted network is shown in Figure 11. This module is significantly enriched in biological processes (Figure 12), such as biosynthetic process and cellular biosynthetic process.

For the weighted network, the 1st module (Figure 13) detected by AllegroMCODE contains largely turquoise module genes (only 1
  • 22. grey gene). This module is significantly enriched in intracellular process (Figure 14).

The 1st modules of the unweighted and weighted networks both contain ribosome-related genes (gene names starting with “RP”). Because these two networks are both generated from the same Microarray data, an external reference is necessary to determine whether this ribosome group is found by chance. Comparing the 1st module of the weighted network with all turquoise module genes in the STRING network gives a common list of 23 genes, 16 of which are ribosome-related.

3.2.2.3 Module from weighted network

Weighted correlation network analysis (WGCNA) has advantages in identifying candidate targets because of its unique mathematical features (Langfelder, Horvath 2008). While the highly correlated genes are grouped into different modules, the genes that are far from the modules are depicted in grey. Figure 18 shows that these grey genes in the weighted network often have fewer edges and are targeted at miRNA, which makes them reasonably different from the other functional modules.

In Figure 7 and Figure 8, the tan and brown modules have strong module significance (the significance is defined as -log10 (p-value in the moderated t test)). By visualizing these two modules from
  • 23. their top 50 intramodular connectivity genes respectively, these modules are found to be enriched in the ASM and Heart candidate genes. Interestingly, the NK4 gene is in the tan module with other genes (Figure 19). The Islet (ISL) gene, which is not in the candidate list yet is reported to be an ASM gene, is in the brown module with some known markers, such as MA2, MHC3, NOTRLC/HAND-LIKE, and ETS/POINTED2 (Figure 20). These results can serve as a starting point for forming hypotheses about the Heart network in Ciona.

Because the turquoise module is the largest module in the weighted network and is enriched in cellular process and other terms (Figure 21), it is natural to narrow its gene list with additional conditions. The list of genes obtained by combining the turquoise module with the STEM condition shows a clear temporal expression pattern and enrichment in muscle and heart related GO-terms (Figure 22, Figure 23), while containing only four genes found in the list.

3.2.2.4 Fine-tuned list

The network in Gaggle can serve as a visualization center as well as a fine-tuning filter for a list of genes, because the network is built upon highly correlated pairs of genes with reduced noise. Genes that are not in the network should by no means be discarded, but it is good to have the expected GO-term enrichment to confirm the list. Because the GO-term enrichment is related to the
  • 24. proportion of genes with the same GO-terms, the number of noisy genes in the whole list has a great impact on the enrichment. Importing the candidate list into the co-expression network reduces the noise and yields a better enrichment result.

Using the “broadcast” function in MeV, Cytoscape can receive the 336 significant genes, label them in yellow in the unweighted network, and then create a sub-network for the candidate genes. A subgroup of the candidate genes (Figure 24) is significantly enriched in muscle and heart related GO-terms (Figure 25), which previously could not be reported by Blast2GO. The ASM candidate genes in the network are also enriched in muscle and heart GO-terms (Figure 26), while Blast2GO still reports no enrichment for the Heart candidate genes in the network.
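The dependence of GO-term enrichment on the proportion of annotated genes can be illustrated with a one-sided hypergeometric test, a common choice for enrichment analysis. This is a hedged sketch only: the statistical test applied in the thesis is not stated in this section, the function name is hypothetical, and all counts below are invented for illustration.

```python
from math import comb


def hypergeom_enrichment_p(background, annotated, list_size, hits):
    """P(X >= hits) when `list_size` genes are drawn without replacement
    from `background` genes, of which `annotated` carry the GO-term."""
    upper = min(annotated, list_size)
    favorable = sum(
        comb(annotated, i) * comb(background - annotated, list_size - i)
        for i in range(hits, upper + 1)
    )
    return favorable / comb(background, list_size)


# Hypothetical numbers: 10,000 background genes, 100 of them annotated
# with a muscle-related term, and 5 annotated genes in the candidate list.
p_clean = hypergeom_enrichment_p(10000, 100, 50, 5)   # 50-gene filtered list
p_noisy = hypergeom_enrichment_p(10000, 100, 300, 5)  # same hits among 300 genes
print(p_clean, p_noisy)
```

With the same five annotated genes, the shorter list gives a much smaller p-value than the longer, noisier one, which is the effect the network filter relies on.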
  • 25. 4. DISCUSSION
4.1 ASM candidate genes

COE is necessary and sufficient to specify ASM fate (Stolfi, Gainous et al. 2010). It is understandable that COE is expressed earlier than the late-ASM genes (A3 group), such as MHC3, TPM1 and MA2. The up-down-up-ASM group (A1), however, which contains MYOD, has the earliest up-regulation. In Xenopus, the cross-regulatory interactions of COE orthologs with genes of the Myogenic Regulatory Factor (MRF) family, such as MYOD and MYF5, are crucial for muscle commitment and differentiation (Green, Vetter 2011). However, how COE may repress the cardiac fate and promote cell migration in Xenopus has never been studied. A possible hypothesis is that in Ciona, the early functions controlled by COE in ASM precursors are independent of MRF activation, since the MRF in the A1 group is up-regulated earlier than COE in the A2 group.

The A1 group genes are also more likely to be TVC genes, which can explain why there are heart related GO-terms in the enrichment of the ASM genes in the weighted network (Figure 26).

4.2 Annotation in Ciona intestinalis

The draft genome sequence of the ascidian Ciona intestinalis (Dehal, Satou et al. 2002) has been a valuable research resource.
  • 26. However, there are numerous inconsistencies in the gene models because of the intrinsic limitations of gene prediction programs and the fragmented nature of the assembly (Satou, Mineta et al. 2008). Therefore the probe annotation in this study focuses on combining available resources from various databases, such as Aniseed (Tassy, Dauga et al.), Ensembl Genome Browser (Kersey, Lawson et al. 2010), CIPRO (Endo, Ueno et al.), STRING (Szklarczyk, Franceschini et al. 2011), UCSC Genome Browser (Karolchik, Hinrichs et al. 2011), and also internal files from Dr. Lionel Christiaen’s lab. There are 16,250 non-redundant genes among the 30,969 probes, which serve as the criterion for mapping a probe to a gene. It is unavoidable that there are differences between the gene annotation in this thesis and other sources.

4.3 Functional ribosome group and COE

The highly linked ribosome genes in the STRING network (Figure 19), enriched in ribosome process (Figure 20), naturally lead to a question: what is the relationship between this functional ribosome group and COE? When this list of ribosome genes and COE is broadcast to MeV, the heat-map and expression plot show that the ribosome group and COE behave similarly in the time-series experiments. This group of ribosome genes has quite a stable expression profile. More housekeeping genes are likely to be found in the same module as the ribosome group, but this is not the focus of this thesis.
  • 27. 4.4 Time-series

Though clustering algorithms such as hierarchical clustering (Eisen, Spellman et al. 1998), K-means, and Self-organizing Maps (SOM) (Tamayo, Slonim et al. 1999) can be used to analyze Microarray data and yield many biological insights, they are not designed for time-series data, since they assume that the data at each time point are collected independently of each other and ignore the sequential nature of time-series data (Ernst, Nau et al. 2005). This thesis applies the Short Time-series Expression Miner (STEM), which is designed for the analysis of short time-series Microarray gene expression data (Ernst, Bar-Joseph 2006), to the time-series experiments in the hope of finding clues about the true biological pattern. The STEM algorithm (Ernst, Nau et al. 2005) starts by selecting a set of potential expression profiles that covers the entire space of expression profiles that can be generated by the genes in the experiment, each representing a unique temporal expression pattern. Next, each gene is assigned to one of the profiles; after a permutation test, significant model profiles are grouped into larger clusters by a greedy algorithm (Ernst, Nau et al. 2005), and these are colored in the top list in the user interface.

It is worth mentioning that STEM is designed for short time-series (defined as 3 to 8 time points on its website), while this Microarray dataset has 11 time points.
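The gene-to-profile assignment step described above can be sketched as follows. This is a simplified illustration, not the STEM implementation: it assumes each gene is assigned to the model profile with which it has the highest Pearson correlation, it uses three hand-written profiles instead of STEM's systematically selected set, and it omits the permutation test and the greedy grouping of significant profiles. All names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # flat series: correlation is undefined, treat as 0
    return sxy / sqrt(sxx * syy)


def assign_to_profiles(genes, profiles):
    """Map each gene to the name of its best-correlated model profile."""
    assignment = {}
    for name, series in genes.items():
        best = max(profiles, key=lambda p: pearson(series, profiles[p]))
        assignment[name] = best
    return assignment


# Hypothetical model profiles over four time points, all starting at 0.
profiles = {
    "up": [0, 1, 2, 3],
    "down": [0, -1, -2, -3],
    "peak": [0, 2, 2, 0],
}
# Hypothetical expression changes relative to the first time point.
genes = {
    "g1": [0.0, 0.8, 2.1, 2.9],
    "g2": [0.0, -0.5, -1.8, -3.2],
    "g3": [0.0, 1.9, 2.2, 0.1],
}
assignment = assign_to_profiles(genes, profiles)
print(assignment)
```

Because the profiles are ordered in time, this assignment respects the sequence of the measurements, which is the property that general-purpose clustering methods ignore.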
  • 28. 4.5 Limitations of the co-expression network

The co-expression network approaches have several limitations, including the following. First, the network similarity is based on the Pearson Correlation Coefficient, which is sensitive to outliers. The quality of the input matrix is therefore important to the final result. It would be helpful to try a data transformation or to use Spearman’s rank correlation coefficient.

A second limitation is that a co-expression network based on the Pearson Correlation Coefficient is more suitable for finding globally co-expressed genes (Qian, Dolled-Filhart et al. 2001), and it cannot accurately detect the time-delayed or transient response of downstream effectors in time-series experiments. It would be better to use local clustering (Qian, Dolled-Filhart et al. 2001) to find time-delayed or locally co-expressed genes, or other tools specialized in long time-series experiments such as the Graphical Query Language (GQL) (Costa, Schönhuth et al. 2005).

A third limitation is that it is difficult to pick thresholds for a biological network. The hard threshold for the unweighted network arbitrarily cuts off some biologically meaningful edges. Modules with weak weights are also cut off in the weighted network, although this kind of weak linkage may be biologically meaningful.
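The first limitation, the sensitivity of the Pearson coefficient to outliers, can be shown with a small sketch comparing it with Spearman's rank correlation, computed here as the Pearson coefficient of the ranks. The data are invented: five samples follow a decreasing trend and a single outlier sample is added. The function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Gene y decreases as gene x increases, except in one outlier sample (50).
x = [1, 2, 3, 4, 5, 6]
y = [6, 5, 4, 3, 2, 50]
r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(round(r_pearson, 3), round(r_spearman, 3))
```

In this example the single outlier is enough to turn the Pearson coefficient clearly positive, while the rank-based coefficient stays slightly negative, closer to the trend of the other five samples.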
  • 29. Figures and tables

Figure 1 Pipeline.
  • 30. Figure 2 Normalized unscaled standard error (NUSE).
One of the tests in arrayQualityMetrics, NUSE, detected sample LacZ3 as an outlier.

Figure 3 Heat-map of ASM and Heart candidate genes.
ASM candidate genes are red in the first and third columns. A1: up-down-up-ASM. A2: early-ASM. A3: late-ASM. Heart candidate genes are red in the second column. H1: early-Heart. H2: late-Heart.
  • 31. Figure 4 Output of the Short Time-series Expression Miner.
Significant clusters are colored in the top row.

[Plot, two panels over soft threshold powers 1 to 20. "Scale independence": scale-free topology model fit (signed R^2) versus soft threshold (power). "Mean connectivity": mean connectivity versus soft threshold (power).]

Figure 5 Selecting soft power.
The soft threshold power beta of 6 is chosen for calculating the adjacency matrix since it reached a high topology model fit (R^2) and high mean connectivity.
  • 32. Figure 6 Ciona intestinalis weighted co-expression network.
The dendrogram results from average linkage hierarchical clustering. The color-band below the dendrogram denotes the modules, which are defined as branches in the dendrogram. Of the 10,079 genes, 6162 were clustered into 13 modules, and the remaining genes are colored in grey.
  • 33. [Plot, two bar charts by module (black, blue, brown, green, greenyellow, grey, magenta, pink, purple, red, tan, turquoise, yellow). "Dynamic-cutree Module Significance (COE-COEW modt)", p = 3.1e-86, y-axis: coesig; and gene counts per dynamic module.]

Figure 7 Module significance.
Module significance is determined as the average absolute gene significance (defined by minus log of a p-value) measure for all genes in a given module.
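The module significance defined in the caption of Figure 7 can be sketched as the mean of -log10(p) over the genes of each module. This is an illustration only: the p-values and module assignments below are made up, and the function name is hypothetical.

```python
from collections import defaultdict
from math import log10


def module_significance(p_values, modules):
    """Average gene significance, -log10(p), over the genes of each module."""
    per_module = defaultdict(list)
    for gene, p in p_values.items():
        per_module[modules[gene]].append(-log10(p))
    return {m: sum(v) / len(v) for m, v in per_module.items()}


# Hypothetical moderated t-test p-values and module colors.
p_values = {"g1": 0.001, "g2": 0.01, "g3": 0.1, "g4": 1.0}
modules = {"g1": "tan", "g2": "tan", "g3": "grey", "g4": "grey"}

significance = module_significance(p_values, modules)
print(significance)
```

Because -log10(p) is never negative for a valid p-value, a module scores highly only when many of its genes have small p-values in the COE versus COEW contrast.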