1. Gene Expression One Cell at a Time
Experimental design and analysis of single-cell RNA-Seq data
David Cook
Vanderhyden Lab, uOttawa
DavidPCook
dpcook
dcook082@uottawa.ca
2. Conclusions from bulk analysis can be representative of nothing
Figure labels: Sample, Bulk analysis, Interpretation
3. Conclusions from bulk analysis can be representative of nothing
Impossible to tell whether differences are due to sample composition or to the cancer cells themselves
TCGA, Nature, 2011
4. Bulk summaries hide underlying structure
X Mean: 54.26
Y Mean: 47.83
X SD: 16.76
Y SD: 26.93
Corr: -0.06
Matejka and Fitzmaurice (Autodesk Research, Toronto)
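A toy illustration of the same point for expression data. The values below are made up for this sketch and are not taken from any dataset: two samples give the same bulk average even though their cell-level structure is completely different.

```python
from statistics import mean, pstdev

# Hypothetical expression values for one gene in 10 cells per sample.
homogeneous = [5.0] * 10          # every cell expresses the gene moderately
mixed = [0.0] * 5 + [10.0] * 5    # half the cells are off, half are high

# What a bulk measurement would report: one average per sample.
bulk_homogeneous = mean(homogeneous)
bulk_mixed = mean(mixed)

# What the bulk measurement cannot see: the spread across cells.
spread_homogeneous = pstdev(homogeneous)
spread_mixed = pstdev(mixed)

print("bulk:", bulk_homogeneous, bulk_mixed)
print("cell-level SD:", spread_homogeneous, spread_mixed)
```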
8. Fluidigm C1
Pros
Allows visual inspection of captured cells
Customizability
Cons
Only two inlets for cell samples
Throughput can’t keep up with the field
Relatively long prep time
Figure panels: Live Cell, Dead Cell, Multiple Live Cells (stains: Calcein AM, Ethidium homodimer-1)
9. Droplet-based methods
Pros
Very high throughput
Up to 8 unique samples per run
System cost relatively low
Cons
Limited customizability
Zheng et al., Nature Comm, 2017
11. Common Chemistry: RT and 3’ Enrichment
Only the 3’ end of the transcript is PCR amplified
12. Why 3’ enrichment?
Full-length 1kb cDNA (5’ to 3’): ten 100bp reads needed for 1x coverage
200bp 3’ fragment: two 100bp reads needed for 1x coverage
Consequence: Lose nearly all information about isoform usage (sorry, Matt)
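The coverage arithmetic above can be sketched in a few lines. The function name is made up for this example, and it assumes reads tile the target perfectly with no overlap.

```python
import math

def reads_for_coverage(target_length_bp, read_length_bp, coverage=1):
    """Minimum number of reads needed to cover a target at the given depth,
    assuming reads tile the target end to end with no overlap."""
    return math.ceil(coverage * target_length_bp / read_length_bp)

full_length = reads_for_coverage(1000, 100)   # full 1kb cDNA
three_prime = reads_for_coverage(200, 100)    # 200bp 3' fragment

print(full_length, three_prime)
```

Under this simplification, sequencing only the 3’ fragment takes one fifth of the reads per transcript, which is what makes shallow per-cell sequencing workable.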
13. Single-Cell Platforms
Platform        Cost per cell        Cells per run   Flexibility/Customizable
10x Genomics    +                    ~1000-46000     +
BioRad ddSeq    ++                   ~300-10000      +
Fluidigm C1     ++++                 96 or 800       +++
Plate methods   Protocol dependent   10 - >10k       +++++
14. Cost
10x Genomics
Reagent Kit (20 samples): $20,000
One sample = ~600-6000 cells
Microfluidics Chips (Six 8-sample chips): $1,440
Fluidigm C1 (HT assays)
Reagent Kit (5 runs): $5,000
One run = ~800 cells
Integrated Fluidic Circuit (1 run): $2,000
Sequencing
NextSeq500 High Output
1 run ($3700) enough for ~2-3k cells
HiSeq4000
1 lane (~$2700) enough for ~2-3k cells
(Often need to purchase entire flow cell)
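A rough per-cell cost can be worked out from the prices listed above. This sketch covers library prep only (reagents plus chips) and leaves sequencing out. The function name is made up for the example.

```python
def cost_per_cell(cost_per_sample, cells):
    """Library-prep cost per captured cell (sequencing not included)."""
    return cost_per_sample / cells

# 10x Genomics: $20,000 kit for 20 samples, plus $1,440 for six 8-sample chips.
tenx_per_sample = 20_000 / 20 + 1_440 / (6 * 8)
tenx_low_yield = cost_per_cell(tenx_per_sample, 600)     # ~600 cells captured
tenx_high_yield = cost_per_cell(tenx_per_sample, 6000)   # ~6000 cells captured

# Fluidigm C1 (HT): $5,000 kit for 5 runs, plus one $2,000 IFC per run.
c1_per_run = 5_000 / 5 + 2_000
c1 = cost_per_cell(c1_per_run, 800)                      # ~800 cells per run

print(f"10x: ${tenx_high_yield:.2f} to ${tenx_low_yield:.2f} per cell")
print(f"C1:  ${c1:.2f} per cell")
```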
16. How many cells?
Depends on what you’re looking at
More cells = better detection of rare populations
Macosko et al., Cell, 2015
Pollen et al., Nature Biotech, 2014
More heterogeneity? More cells
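One way to put numbers on "more cells = better detection of rare populations" is a simple binomial model. This is a sketch under the assumption that cells are captured independently and at random. The 1% frequency and the 10-cell threshold are illustrative values, not recommendations from the cited papers.

```python
from math import comb

def prob_at_least(n_cells, frequency, min_cells):
    """Probability of capturing at least `min_cells` cells from a population
    with the given frequency when `n_cells` cells are sampled at random."""
    p_fewer = sum(
        comb(n_cells, i) * frequency**i * (1 - frequency) ** (n_cells - i)
        for i in range(min_cells)
    )
    return 1 - p_fewer

# Chance of seeing at least 10 cells of a population that makes up 1% of the sample.
for n in (100, 1000, 10000):
    print(n, round(prob_at_least(n, 0.01, 10), 3))
```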
17. Sequencing: How deep do you need to go?
Depends on what you want
Svensson et al., Nature Methods, 2017
Rough Guideline
Aim for 100,000 reads per cell
50,000 per cell is probably fine
16k reads/cell (>60k PBMCs)
Zheng et al., Nature Comm, 2017
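The depth guidelines above trade off directly against cell number. A minimal sketch, assuming a read budget of 300 million reads (a hypothetical figure chosen for illustration, not a specification of any sequencer):

```python
def cells_supported(total_reads, reads_per_cell):
    """Number of cells a read budget can cover at a given depth per cell."""
    return total_reads // reads_per_cell

read_budget = 300_000_000  # hypothetical read budget for one run

# Depths taken from the guideline above and from Zheng et al.
depths = {
    "aim (100,000 reads/cell)": 100_000,
    "probably fine (50,000 reads/cell)": 50_000,
    "Zheng et al. (16k reads/cell)": 16_000,
}
plan = {label: cells_supported(read_budget, depth) for label, depth in depths.items()}

for label, n_cells in plan.items():
    print(f"{label}: {n_cells} cells")
```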
18. Sample numbers and batch effects
Hicks et al., BioRxiv, 2016
Mix biological variables in individual runs!
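A quick sanity check for a planned design: flag any run that contains only one biological condition, since in that case run and condition cannot be separated afterwards. The function name and the run and condition labels are hypothetical.

```python
from collections import defaultdict

def confounded_runs(assignments):
    """Return the runs that contain only one biological condition.

    `assignments` is a list of (run, condition) pairs, one per sample."""
    conditions_by_run = defaultdict(set)
    for run, condition in assignments:
        conditions_by_run[run].add(condition)
    return sorted(run for run, conds in conditions_by_run.items() if len(conds) < 2)

# Each condition processed in its own run: batch and biology are inseparable.
bad_design = [("run1", "control"), ("run1", "control"),
              ("run2", "treated"), ("run2", "treated")]

# Conditions mixed within every run.
good_design = [("run1", "control"), ("run1", "treated"),
               ("run2", "control"), ("run2", "treated")]

print(confounded_runs(bad_design))
print(confounded_runs(good_design))
```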
21. Project Background
Figure panels: Control, Estrogen
Plot: Areas of columnar OSE (y-axis: % ovarian surface that has columnar cells; groups: Control, Estradiol; *)
Plot: Areas of hyperplastic OSE (y-axis: % ovarian surface that is hyperplastic; groups: Control, Estradiol; *)
Figure panels: Placebo, E2
Hormone replacement therapy increases the risk of ovarian cancer
Exogenous estrogen enhances cancer progression in mouse models
Prolonged estrogen exposure causes ovarian epithelial dysplasia in normal mice
23. Alignment, transcript quantification, and import into R
Kallisto – Pseudoalignment to the transcriptome
Bray et al., Nature Biotech, 2016
tximport package to dump gene-level expression matrix into R
Soneson et al., F1000, 2016
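Conceptually, going from transcript-level quantification to a gene-level matrix means collapsing transcripts onto their parent genes. The sketch below shows only that idea by summing estimated counts per gene. It is not tximport's implementation, and the transcript and gene IDs are hypothetical.

```python
from collections import defaultdict

def summarize_to_gene(transcript_counts, tx2gene):
    """Sum transcript-level estimated counts into gene-level counts.

    `tx2gene` maps each transcript ID to its gene ID."""
    gene_counts = defaultdict(float)
    for transcript, count in transcript_counts.items():
        gene_counts[tx2gene[transcript]] += count
    return dict(gene_counts)

# Hypothetical transcript-to-gene map and estimated counts for one cell.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
est_counts = {"tx1": 12.5, "tx2": 7.5, "tx3": 3.0}

gene_level = summarize_to_gene(est_counts, tx2gene)
print(gene_level)
```

Repeating this for every cell and stacking the results gives a genes-by-cells matrix.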
25. Filtering scRNA-Seq Data
Figure panels: Dead Cell, Multiple Live Cells (Ethidium homodimer-1 stain; Fluidigm specific)
Before filtering: 800 cells, 30735 genes
After filtering: 636 cells, 14300 genes
Filter genes that are not detected in at least 10 cells
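The gene filter above can be sketched as follows, treating a gene as "detected" in a cell when its count is greater than zero (an assumption of this sketch). The function name and the toy counts are made up for illustration.

```python
def filter_genes(counts, min_cells=10):
    """Keep genes detected (count > 0) in at least `min_cells` cells.

    `counts` maps gene name to a list of per-cell counts."""
    return {
        gene: row
        for gene, row in counts.items()
        if sum(1 for c in row if c > 0) >= min_cells
    }

# Toy data: 2 genes across 12 cells.
counts = {
    "geneA": [3, 1, 0, 2, 5, 1, 4, 2, 1, 6, 2, 1],   # detected in 11 of 12 cells
    "geneB": [0, 0, 7, 0, 0, 0, 1, 0, 0, 0, 2, 0],   # detected in 3 of 12 cells
}

filtered = filter_genes(counts, min_cells=10)
print(sorted(filtered))
```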
27. Finding and controlling for technical variables
Data exploration is critical
SCESet:
Exprs. matrices (Genes x Cells): Raw Counts, Log-transformed, Z-scores, Normalized
Cell metadata: phenoData
Gene metadata: featureData
28. Finding and controlling for technical variables
1. Library Size
Scaling each library by a size factor
• Counts per million (CPM)
• DESeq
• TMM
• Pooling-based size factors (Lun et al., Genome Biology, 2016)
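CPM is the simplest of the scaling approaches listed above: each count is divided by the cell's library size and multiplied by one million. A minimal sketch with made-up counts, showing that two cells with the same composition but different sequencing depth end up with identical values:

```python
def counts_per_million(cell_counts):
    """Scale one cell's counts so that they sum to one million."""
    library_size = sum(cell_counts)
    return [c * 1_000_000 / library_size for c in cell_counts]

# Same relative expression, 10x difference in library size.
shallow_cell = [10, 30, 60]
deep_cell = [100, 300, 600]

cpm_shallow = counts_per_million(shallow_cell)
cpm_deep = counts_per_million(deep_cell)
print(cpm_shallow)
print(cpm_deep)
```

The other methods in the list estimate the size factor differently, but they apply it in the same way: one scaling factor per cell.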
29. Finding and controlling for technical variables
2. Cell Cycle (or other confounding biological processes we aren’t interested in)
Stegle et al., Nature Rev. Genetics, 2015
Cell cycle classification using “scran” package
Cell cycle not driving large amounts of variation at this point
30. Finding and controlling for technical variables
3. Other technical variables
Finding variables that drive variation
Coloured by IFC Column
31. Finding and controlling for technical variables
3. Other technical variables
removeBatchEffect() – limma package
$Y_i = \beta_0 + \beta_1\,\mathrm{TotalFeatures}_i + \beta_2\,\mathrm{IFC.Row}_i + \beta_3\,\mathrm{Condition}_i + \varepsilon_i$
Removes the effect of the technical covariates on a per-gene basis
Note: IFC.Column tackled same way, but split by condition beforehand
Figure: Post-normalization, Odd IFC Column
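The idea behind the per-gene correction can be sketched in plain Python: fit the linear model for one gene, then subtract only the fitted technical terms so the condition effect stays in the data. This is an illustration of the approach, not limma's code. The helper names, the toy values, and the choice to centre the covariates (so the gene's overall level is preserved) are assumptions of this sketch.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(design, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(design, y)) for i in range(k)]
    return solve(xtx, xty)

def remove_technical(y, technical, condition):
    """Subtract the fitted effect of (centred) technical covariates from one gene."""
    n_tech = len(technical[0])
    means = [sum(row[j] for row in technical) / len(technical) for j in range(n_tech)]
    centred = [[row[j] - means[j] for j in range(n_tech)] for row in technical]
    # Design columns: intercept, technical covariates, condition.
    design = [[1.0] + tech + [cond] for tech, cond in zip(centred, condition)]
    beta = ols(design, y)
    return [yi - sum(b * t for b, t in zip(beta[1:1 + n_tech], tech))
            for yi, tech in zip(y, centred)]

# Toy data for one gene in 6 cells: columns are [TotalFeatures, IFC.Row].
technical = [[1.0, 1.0], [2.0, 2.0], [3.0, 1.0], [4.0, 2.0], [5.0, 1.0], [6.0, 3.0]]
condition = [0.0, 1.0, 1.0, 0.0, 0.0, 1.0]
# Expression built with known effects: 0.5 per feature unit, 1.0 per row, 3.0 for condition.
y = [2.0 + 0.5 * t[0] + 1.0 * t[1] + 3.0 * c for t, c in zip(technical, condition)]

corrected = remove_technical(y, technical, condition)
print([round(v, 3) for v in corrected])
```

After correction, cells within a condition share the same value in this noise-free toy example, and only the condition difference remains.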
48. Trajectory Analysis
• Larger data sets
• Combining the technology with perturbations
• Collecting multiple –omics datasets from individual cells
Dixit et al., Cell, 2016
BioRxiv, 2017
50. Staying on the ball with scRNA-Seq
Nature Methods, Jan 23rd, 2017
Science, March 3rd, 2017
Nature Methods, March 27th, 2017
Nature Methods, March 6th, 2017
Nature Methods, April 17th, 2017
Nature Biotechnology, May 1st, 2017
51. Resources
Sean Davis’s “Awesome Single Cell” list
https://github.com/seandavi/awesome-single-cell
10x Genomics Public Datasets
https://support.10xgenomics.com/single-cell/datasets
1.3 Million brain cells from E18 mice
68k PBMCs
Fun Tutorials
Seurat: http://satijalab.org/seurat/get_started.html
Monocle (find on Bioconductor)