Computational approaches to the regulatory genomics of neurogenesis
1. Computational approaches to the regulatory genomics of neurogenesis
Dr. Ian Simpson
Centre for Integrative Physiology
University of Edinburgh
Edinburgh Neuroscience Day, March 2010
2. Introduction: animal model of neurogenesis
Anatomy of the Drosophila PNS - Sense organs
3. Introduction: animal model of neurogenesis
Development of the Drosophila PNS
4. Main: gene regulatory networks
GRN for endomesoderm specification in the sea urchin
from Peter and Davidson (2009)
5. Main: scale and complexity
How to study gene regulatory networks?
High throughput gene expression experiments
analysing c.15,000 genes on c.100 chips (scale)
profile, temporal, spatial, cell-type (complex)
Predicting transcription factor binding sites (TFBSs)
genomic search space (scale)
100s-1000s of PWMs (TFBS profiles) (scale)
multiple TFBSs arranged combinatorially (complex)
multiple evidence types to integrate: phylogenetic, protein interaction, genome localisation (complex)
identifying cis-regulatory modules (complex)
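As a rough illustration of what scanning a sequence with a single PWM involves, here is a minimal sketch. The motif counts, the pseudocount, the uniform background of 0.25 and the function names are all assumptions made for the example; they are not taken from the talk. On a genome, the same loop runs over every window for each of the 100s-1000s of PWMs, which is where the scale comes from.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores."""
    pwm = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2((column[b] + pseudocount) / total / background)
            for b in BASES
        })
    return pwm

def scan(sequence, pwm):
    """Score every window of the sequence; return (start, score) pairs."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(pwm[i][base] for i, base in enumerate(window))
        hits.append((start, score))
    return hits

# Hypothetical 4-bp motif built from 10 aligned sites (made-up counts).
counts = [
    {"A": 8, "C": 1, "G": 1, "T": 0},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 1, "C": 0, "G": 0, "T": 9},
]
pwm = log_odds_pwm(counts)
hits = scan("TTACGTAAGG", pwm)
best_position, best_score = max(hits, key=lambda h: h[1])
```

The sketch only handles the four unambiguous bases on one strand; a real scan would also need the reverse complement and a score threshold.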
7. Main example 1: Clustering with re-sampling statistics
Gene expression profiles of cells expressing atonal
8. Main example 1: Clustering with re-sampling statistics
An example annotated cluster
Cluster membership
Cluster  Size
C1       13
C2       36
C3       23
C4       16
C5       65
C6       6
Cluster 3: Sensory Organ Development, GO:0007423 (p=6e-6)
Gene names: argos, ato, CG6330, CG31464, CG13653, nrm, unc, sca, rho, ImpL3, CG11671, CG7755, CG16815, CG15704, CG32150, knrl, CG32037, Toll-6, phyl, nvy, cato
9. Main example 1: Clustering with re-sampling statistics
Consensus clustering, a method to assess the quality of clustering
The basic approach
iterate thousands of clustering experiments with sub-samples of the data
calculate the average connectivity of any two members - consensus matrix
derive the robustness of the clusters and their members from the consensus matrix
The problem
huge parameter space (cluster number, distance metric, sample proportion...)
huge number of different algorithms to choose from
large dataset, multiple conditions to test
The solution
Break each iteration (individual clustering experiment) into a single process
Batch the processes out to nodes on Eddie/ECDF (batch array)
Collate back into consensus matrices and calculate robustness measures
R-package for consensus clustering - clusterCons
available from CRAN and sourceforge (http://bit.ly/clusterCons)
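The basic approach above can be sketched in a few lines. This is an illustrative toy, not the clusterCons implementation: it assumes a tiny built-in k-means as the clustering algorithm, and the names `kmeans`, `consensus_matrix`, `iterations` and `fraction` are invented for the example. Each pass of the outer loop is one independent clustering experiment, which is the unit that would be sent out as a separate process.

```python
import random
from itertools import combinations

def kmeans(points, k, rng, n_iter=20):
    """Tiny k-means on tuples of floats; returns one cluster label per point."""
    centres = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each point to its nearest centre (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            for p in points
        ]
        # Move each centre to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def consensus_matrix(data, k, iterations=200, fraction=0.8, seed=0):
    """Fraction of co-sampled runs in which each pair shares a cluster."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]  # times i and j clustered together
    sampled = [[0] * n for _ in range(n)]   # times i and j were both sampled
    size = max(k, int(round(fraction * n)))
    for _ in range(iterations):
        idx = rng.sample(range(n), size)
        labels = kmeans([data[i] for i in idx], k, rng)
        for (i, li), (j, lj) in combinations(zip(idx, labels), 2):
            sampled[i][j] += 1
            sampled[j][i] += 1
            if li == lj:
                together[i][j] += 1
                together[j][i] += 1
    return [
        [1.0 if i == j
         else (together[i][j] / sampled[i][j] if sampled[i][j] else 0.0)
         for j in range(n)]
        for i in range(n)
    ]

# Made-up data: two well-separated groups of three points each.
data = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.3)]
matrix = consensus_matrix(data, k=2, iterations=50)
```

For data this cleanly separated every within-group entry is 1 and every between-group entry is 0; on real expression data the entries fall in between, and that spread is what the robustness measures summarise.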
13. Main example 1: Clustering with re-sampling statistics
Heatmap of the consensus matrix
14. Main example 1: Clustering with re-sampling statistics
Gene prioritisation by consensus clustering
Re-sampling using hclust, it=1000, rf=80%
Cluster robustness
cluster  rob
1        0.4731433
2        0.7704514
3        0.7295124
4        0.7196309
5        0.7033960
6        0.6786388
Membership robustness (cluster 3)
affy_id       mem
1639896_at    0.68
1641578_at    0.56
1640363_a_at  0.54
1623314_at    0.53
1636998_at    0.49
1637035_at    0.36
1631443_at    0.35
1639062_at    0.31
1623977_at    0.31
1627520_at    0.3
1637824_at    0.28
1632882_at    0.27
1624262_at    0.26
1640868_at    0.26
1631872_at    0.26
1637057_at    0.24
1625275_at    0.24
1624790_at    0.22
1635227_at    0.08
1623462_at    0.07
1635462_at    0.03
1628430_at    0.03
1626059_at    0.02
There are 8 out of 23 genes with <25% conservation in the cluster
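One common way to derive such figures from a consensus matrix is sketched below. It assumes that cluster robustness is the mean consensus over all pairs of members and that membership robustness is a member's mean consensus with the other members of its cluster; whether clusterCons computes exactly these quantities is an assumption here. The 4 x 4 matrix is made up, and the 0.25 cut-off mirrors the 25% figure quoted above.

```python
from itertools import combinations

def cluster_robustness(consensus, members):
    """Mean consensus over all pairs of members (needs at least two members)."""
    pairs = list(combinations(members, 2))
    if not pairs:
        raise ValueError("a cluster needs at least two members")
    return sum(consensus[i][j] for i, j in pairs) / len(pairs)

def membership_robustness(consensus, members):
    """Mean consensus of each member with the other members of its cluster."""
    if len(members) < 2:
        raise ValueError("a cluster needs at least two members")
    return {
        i: sum(consensus[i][j] for j in members if j != i) / (len(members) - 1)
        for i in members
    }

def weak_members(robustness, threshold=0.25):
    """Members whose robustness falls below the chosen threshold."""
    return sorted(i for i, value in robustness.items() if value < threshold)

# Made-up consensus matrix: items 0-2 co-cluster often, item 3 rarely joins them.
consensus = [
    [1.0, 0.9, 0.8, 0.1],
    [0.9, 1.0, 0.7, 0.2],
    [0.8, 0.7, 1.0, 0.1],
    [0.1, 0.2, 0.1, 1.0],
]
members = [0, 1, 2, 3]
rob = cluster_robustness(consensus, members)
mem = membership_robustness(consensus, members)
weak = weak_members(mem)
```

Ranking members by this value is what allows genes to be prioritised: in the toy matrix only item 3 falls below the threshold.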
15. Main example 2: TFBS and CRM detection on the genomic scale
An example of intersecting a state list with a developmental module
[Figure; state legend: normal, high, low, off]
16. Main example 2: TFBS and CRM detection on the genomic scale
cis-regulatory module detection by HMM
after Wu and Xie, JCB 2008
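To give a feel for HMM-based module detection, here is a generic two-state Viterbi sketch. It is not the model of Wu and Xie: the states ("bg" for background, "crm" for module), the binary observations (whether a window contains a predicted TFBS hit) and every probability below are assumptions chosen for illustration.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most probable state path for a discrete HMM (log-space Viterbi)."""
    first = observations[0]
    score = [{s: math.log(start_p[s]) + math.log(emit_p[s][first]) for s in states}]
    back = [{}]
    for obs in observations[1:]:
        prev = score[-1]
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for being in s at this position.
            best = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            row[s] = prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][obs])
            ptr[s] = best
        score.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    state = max(states, key=lambda s: score[-1][s])
    path = [state]
    for ptr in reversed(back[1:]):
        state = ptr[state]
        path.append(state)
    return path[::-1]

# Hypothetical parameters: modules are sticky and enriched for TFBS hits.
states = ("bg", "crm")
start_p = {"bg": 0.9, "crm": 0.1}
trans_p = {"bg": {"bg": 0.9, "crm": 0.1}, "crm": {"bg": 0.1, "crm": 0.9}}
emit_p = {"bg": {"hit": 0.1, "none": 0.9}, "crm": {"hit": 0.7, "none": 0.3}}

# One observation per genomic window: is there a predicted TFBS hit or not?
windows = ["none"] * 4 + ["hit", "hit", "none", "hit", "hit"] + ["none"] * 4
path = viterbi(windows, states, start_p, trans_p, emit_p)
```

With these made-up numbers the dense run of hits is labelled as one module, and the single window without a hit inside it is absorbed rather than splitting the module in two.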
17. Main example 2: TFBS and CRM detection on the genomic scale
TFBS binding probability calculation with a Bayesian integration framework
Multiple prior data sources are combined in a probabilistic model to predict the probability of TF binding
PWMs, ChIP-chip, ChIP-seq, DamID, conservation, nucleosome positioning, regulatory potential...
after Lähdesmäki et al., PLoS ONE, 2008
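The idea of combining evidence sources can be illustrated with the simplest possible case, a naive-Bayes update in which each source contributes a likelihood ratio and the sources are assumed to be conditionally independent. This is a deliberate simplification and not the model of the cited paper; the prior and the likelihood ratios below are made-up numbers.

```python
import math

def posterior_binding(prior, likelihood_ratios):
    """P(bound | evidence), assuming conditionally independent evidence sources.

    Each likelihood ratio is P(evidence | bound) / P(evidence | not bound).
    """
    log_odds = math.log(prior / (1.0 - prior))
    for ratio in likelihood_ratios.values():
        log_odds += math.log(ratio)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical evidence for one candidate site.
evidence = {
    "pwm_match": 20.0,     # strong motif match
    "chip_signal": 10.0,   # site lies under a ChIP peak
    "conservation": 3.0,   # site is conserved across species
}
prior = 0.01               # assumed prior probability that a site is bound
posterior = posterior_binding(prior, evidence)
```

Working in log-odds keeps the arithmetic stable when many sources are multiplied together. With no evidence at all the function returns the prior unchanged.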
18. Summary
Benefits of ECDF use for biological data analysis
Easy to use (honestly)
Can execute jobs in familiar languages: C, C++, Perl/BioPerl, R, MATLAB...
Most common bioinformatic problems are similar analyses performed many times -> batch arrays
Often minimal re-coding needed
Frees up workstations and local nodes, allows wider exploration of parameter space
Allows genome-scale screening with multiple data sources
Current limitations of ECDF use for biological data analysis
Few computational biology algorithms are written for parallel processing
Loading large datasets can be problematic (memory limits)
Not generally accessible to the ’general user’ (although biological applications using GRID technologies are appearing)
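The batch-array pattern listed among the benefits can be sketched as follows: one script, run once per task, picks its own parameter combination from the task index. The sketch assumes a Grid Engine style scheduler that exposes a 1-based index in the `SGE_TASK_ID` environment variable; the parameter values and the algorithm names other than hclust are invented for the example.

```python
import itertools
import os

def parameter_grid(cluster_numbers, algorithms, sample_fractions):
    """Every combination of the parameters to explore, in a fixed order."""
    return list(itertools.product(cluster_numbers, algorithms, sample_fractions))

def task_parameters(grid, task_id):
    """Map a 1-based array task index to one parameter combination."""
    if not 1 <= task_id <= len(grid):
        raise IndexError("task id outside the parameter grid")
    return grid[task_id - 1]

# Hypothetical exploration: 6 cluster numbers x 3 algorithms x 3 fractions.
grid = parameter_grid(range(2, 8), ["hclust", "kmeans", "pam"], [0.7, 0.8, 0.9])

# Assumption: the scheduler sets SGE_TASK_ID; fall back to task 1 otherwise.
raw_id = os.environ.get("SGE_TASK_ID", "1")
task_id = int(raw_id) if raw_id.isdigit() else 1
k, algorithm, fraction = task_parameters(grid, task_id)
```

Because each task is independent, the existing analysis code needs little more than this lookup wrapped around it, which is why so little re-coding tends to be required.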
20. Acknowledgements
University of Edinburgh
Centre for Integrative Physiology
Andrew Jarman
Douglas Armstrong
Ian Simpson
Petra zur Lage
Lynn Powell
Sebastian Cachero
Lina Ma
Fay Newton
Guiseppe Gallone
Daniel Moore
Sadie Kemp