Presented at the European Respiratory Society, Berlin, October 2017. A high-level talk for a mixed audience of clinicians and scientists on analyzing transcriptomic (gene expression) data.
1. Clusters, pathways, context
Interpreting transcriptomic data
Paul Agapow, Translational Bioinformatics, Data Science Institute
Syst. Medicine in Resp. Disease
Berlin October 2017
2. • Which genes are transcribed more / less?
• What’s the difference between:
– Cell lines?
– Healthy & unhealthy tissue?
– Tissues?
– Patients with & without a SNP?
Expression data can tell us ...
3. • Dynamic
• Responsive
• Quantifiable
• More informative
Why study expression data?
But:
• (Processing)
• Comparative analysis
• Multiple technologies
• Cut-offs
• Batch effects
• Power
• Looking at the right place / time?
• Interpretation
4. • Microarrays:
– DNA anchored to a solid surface
– Assess RNA that binds to it
– “Old” (90s)
– Noisy
– Finds what’s on the chip
Platforms
• RNA-seq:
– Deep-sequencing of RNA
– More accurate & reliable
– More expensive
– High throughput
– Finds everything
5. 1. Bioconductor: a set of R software libraries for analysis of high-throughput data
– Interoperable
– Documented
2. limma: a Bioconductor library for transcriptomic analysis
Tools: Bioconductor & limma
6. Interpretation: Clustering
Put similar things together:
• Gene expression patterns (co-regulation, modules)
• Patients (stratification)
But:
• What’s a cluster / similarity?
• Allow for noise
• Comparison
• Is it ontologically real?
7. Many methods but:
• K-Means / K-Medians clustering
– Simple
– Stochastic, define K
– Best with spherical data
• Hierarchical clustering
– Levels of granularity
– Produces dendrogram
– Computationally complex
How to cluster
But:
• Little comparative work
• No support / confidence
• Supervised vs unsupervised
• Poor reproducibility
– Bootstrap / Jackknife
• Comparing clusters
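The "stochastic, define K" caveat can be illustrated with a minimal K-means sketch in plain Python. Everything here is an assumption for illustration: the six two-gene "samples" are invented, and `kmeans`, `sq_dist` and `within_ss` are hypothetical helper names, not functions from any library mentioned in the talk. The same data are clustered from two different starting points, and one start settles on a poorer split.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm from the given starting centroids; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids

def within_ss(points, labels, centroids):
    """Total within-cluster sum of squares (lower means tighter clusters)."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))

# Invented toy data: six samples measured on two genes, two obvious groups.
samples = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.7), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]

# Start with one centroid in each group: recovers the obvious split.
good = kmeans(samples, [samples[0], samples[3]])
# Start with both centroids in the same group: converges to a poorer split.
stuck = kmeans(samples, [samples[3], samples[4]])

# In practice the start is drawn at random, so results depend on the seed.
random_start = random.Random(0).sample(samples, 2)
labels, centroids = kmeans(samples, random_start)
```

Comparing `within_ss` across several random starts and keeping the lowest is one common way to reduce this sensitivity; it does not replace the bootstrap or jackknife support mentioned above.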
9. How do you compare clusters obtained from two or more different experiments?
• Especially if clusters are labelled differently
• If separation is poor
• If clusters nest
Comparing clusters
• Adjusted mutual information (sklearn)
– No nesting
• Conditional entropy
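A rough sketch of why conditional entropy copes with relabelled and nested clusterings, written in plain Python. The three label vectors are invented, and `conditional_entropy` is a hypothetical helper name. For the adjusted mutual information score itself, scikit-learn provides `sklearn.metrics.adjusted_mutual_info_score`, which is not used here.

```python
from collections import Counter
from math import log2

def conditional_entropy(a, b):
    """H(A | B) in bits: uncertainty left about labelling `a` once `b` is known."""
    n = len(a)
    joint = Counter(zip(a, b))  # counts of each (a-label, b-label) pair
    b_counts = Counter(b)       # counts of each b-label
    return sum((n_ab / n) * log2(b_counts[y] / n_ab) for (x, y), n_ab in joint.items())

# Invented labellings of the same six samples.
run1 = ["A", "A", "B", "B", "C", "C"]
run2 = [2, 2, 0, 0, 1, 1]                 # same partition, different label names
coarse = ["x", "x", "x", "x", "y", "y"]   # merges A and B, so run1 nests inside it

same_both_ways = (conditional_entropy(run1, run2), conditional_entropy(run2, run1))
coarse_given_fine = conditional_entropy(coarse, run1)  # fine labels determine coarse ones
fine_given_coarse = conditional_entropy(run1, coarse)  # coarse labels leave A vs B open
```

The measure is asymmetric: a value of zero in one direction only suggests that one clustering is a refinement of the other, while zero in both directions suggests the same partition under different names.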
10. • Match genes against lists
• Associate a gene with a compartment or pathway
• Examine enrichment / downregulation
Interpretation: enrichment
But:
• What’s a pathway?
• Are they right?
• Statistical basis
• Many choices
• Post-transcriptional regulation?
11. • Popular tools:
– DAVID (not updated?)
– GSEA
– Ingenuity / Metacore
– Bioconductor
• Individual cases:
– Hypergeometric test
• Gives you support
Enrichment
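For the individual-case hypergeometric test, a minimal sketch using only the Python standard library. The gene counts are made up for illustration and `hypergeom_enrichment_p` is a hypothetical helper name. It asks how likely an overlap at least this large would be if the hit list had been drawn at random from the background.

```python
from math import comb

def hypergeom_enrichment_p(n_background, n_pathway, n_hits, n_overlap):
    """One-sided p-value P(X >= n_overlap) under the hypergeometric distribution.

    n_background: genes measured; n_pathway: of those, genes in the pathway;
    n_hits: genes in the hit list; n_overlap: hits that are in the pathway.
    """
    upper = min(n_pathway, n_hits)
    favourable = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_hits - i)
        for i in range(n_overlap, upper + 1)
    )
    return favourable / comb(n_background, n_hits)

# Made-up numbers: 20 genes measured, 5 in the pathway, 4 hits, 3 of them in the pathway.
p_value = hypergeom_enrichment_p(20, 5, 4, 3)  # 155 / 4845, about 0.032
```

When many gene sets are tested this way, the resulting p-values would normally need a multiple-testing correction before being read as support.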
12. • Many knowledge bases are a potpourri of undifferentiated “facts”
– Incomplete
– Where / what / how?
• Use curated knowledge bases
• Traverse graphs
Interpretation: contextualization
13. • Use graph databases for knowledge representation
• Traverse graphs for “neighbours”
– Shortest paths connecting COL6A5, a protein implicated in airway remodelling, to asthma
• Stats / support?
• Hypothesis generation
Graph databases for
knowledge representation
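The shortest-path idea can be sketched with a breadth-first search over a small in-memory graph. This is an assumption-laden toy: only the endpoints (COL6A5 and asthma) come from the slide, the `intermediate_*` nodes and their edges are placeholders rather than real biology, and a real graph database would have its own query language instead of the hypothetical `shortest_path` helper.

```python
from collections import deque

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path as a list of nodes, or None."""
    previous = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, ()):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None

# Placeholder graph: the intermediate nodes and edges are hypothetical.
toy_graph = {
    "COL6A5": ["intermediate_1", "intermediate_2"],
    "intermediate_1": ["COL6A5", "intermediate_3"],
    "intermediate_2": ["COL6A5", "asthma"],
    "intermediate_3": ["intermediate_1", "asthma"],
    "asthma": ["intermediate_2", "intermediate_3"],
}

path = shortest_path(toy_graph, "COL6A5", "asthma")
```

A path found this way is a candidate hypothesis only; as the slide notes, it carries no statistical support on its own.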
14. • Science is hard
• Assumptions are important
• Obtaining support / confidence / validation is difficult
• ... but important
Conclusions?