1. Deep Seq Data Analysis
Part II
Christophe.antoniewski@upmc.fr
http://drosophile.org
Mouse Genetics
January 29, 2015, 13:30–15:00
http://fr.slideshare.net/christopheantoniewski/
3. The methods section available online
RNA isolation and library construction
Both human and mouse blastomeres were prepared using identical protocols. Single
blastomeres were isolated by removing the zona pellucida using acidic tyrode
solution (Sigma, catalogue no. T1788), then separated by gentle mouth pipetting in a
calcium-free medium. Single cells were washed twice with 1× PBS containing 0.1%
BSA before placing in lysis buffer. RNA was isolated from single cells or single morula
embryos and amplified as described previously14. Library construction was
performed following Illumina manufacturer suggestions. Libraries were sequenced
on the Illumina Hiseq2000 platform and sequencing reads that contained polyA, low
quality, and adapters were pre-filtered before mapping. Filtered reads were mapped
to the hg19 genome and mm9 genome using default parameters from BWA aligner29,
and reads that failed to map to the genome were re-mapped to their respective
mRNA sequences to capture reads that span exons.
Transcriptional profiling
In both human and mouse cases, data normalization was performed by transforming
uniquely mapped transcript reads to RPKM30. Genes with low expression in all stages
(average RPKM < 0.5) were filtered out, followed by quantile normalization. For
differential expression, we compared every time point to its previous time point
using default parameters in DESeq using normalized read counts. Genes were called
differentially expressed if they exhibited a Benjamini and Hochberg-adjusted P value
(FDR) <5% and a mean fold change of >2.
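The normalization and filtering steps quoted above can be sketched as follows. This is a minimal illustration, assuming the standard RPKM definition (reads per kilobase of transcript per million mapped reads); the function names are hypothetical and the paper does not say how gene lengths were derived.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads.

    Standard definition: count * 1e9 / (gene length in bp * total mapped reads).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def filter_low_expression(rpkm_by_gene, threshold=0.5):
    """Drop genes whose average RPKM over all stages is below `threshold`.

    `rpkm_by_gene` maps a gene ID to its list of RPKM values, one per stage.
    The 0.5 default mirrors the "average RPKM < 0.5" cut-off of the quoted text.
    """
    return {
        gene: values
        for gene, values in rpkm_by_gene.items()
        if sum(values) / len(values) >= threshold
    }
```

For example, 100 reads on a 2,000 bp gene in a library of 10 million mapped reads give an RPKM of 5.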
4. Data 1
GEO dataset accession: GSE44183
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44183
• Take the SRP identifier at the bottom of the page: SRP018525
• Search for this identifier with the "EBI SRA" (ENA SRA) Galaxy tool
• Find your experiment accession by clicking on the SRX… links
• Click on the "fastq files (galaxy)" links
→ Files are uploaded as yellow datasets that show up in the current history
GSM1080195: mouse oocyte 1; Mus musculus; RNA-Seq
1 ILLUMINA (Illumina HiSeq 2000) run: 16.4M spots, 3G bases, 1.9Gb downloads
Accession: SRX229784
GSM1080196: mouse oocyte 2; Mus musculus; RNA-Seq
1 ILLUMINA (Illumina HiSeq 2000) run: 20.2M spots, 3.6G bases, 2.4Gb downloads
Accession: SRX229785
GSM1080197: mouse pronuclei 1; Mus musculus; RNA-Seq
1 ILLUMINA (Illumina HiSeq 2000) run: 17.2M spots, 3.1G bases, 2Gb downloads
Accession: SRX229786
GSM1080198: mouse pronuclei 2; Mus musculus; RNA-Seq
1 ILLUMINA (Illumina HiSeq 2000) run: 12.8M spots, 2.3G bases, 1.5Gb downloads
Accession: SRX229787
GSM1080199: mouse pronuclei 3; Mus musculus; RNA-Seq
1 ILLUMINA (Illumina HiSeq 2000) run: 12.4M spots, 2.2G bases, 1.5Gb downloads
Accession: SRX229788
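Outside Galaxy, the same run list can be requested from the ENA. The sketch below is not part of the slides: it assumes the ENA Portal API `filereport` endpoint and its `read_run` fields, and the helper names are hypothetical. It only builds the query URL and parses a tab-separated answer; no download is performed.

```python
from urllib.parse import urlencode

# Assumption: ENA Portal API file report endpoint (not mentioned in the slides).
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def filereport_url(accession,
                   fields=("run_accession", "experiment_accession", "fastq_ftp")):
    """Build the file report URL for a study accession such as SRP018525."""
    query = urlencode({
        "accession": accession,
        "result": "read_run",
        "fields": ",".join(fields),
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def parse_filereport(tsv_text):
    """Turn the tab-separated report into one dict per run."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split("\t")
    return [dict(zip(header, line.split("\t"))) for line in lines[1:]]
```

Calling `filereport_url("SRP018525")` gives a URL that can be opened in a browser or fetched with `urllib.request.urlopen`; the `experiment_accession` column should then match the SRX accessions listed above.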
• Register on mississippi.fr
• Take one identifier:
  • oocyte1@pasteur.fr
  • oocyte2@pasteur.fr
  • pronuclei1@pasteur.fr
  • pronuclei2@pasteur.fr
  • pronuclei3@pasteur.fr
• And the same password for all: gsgalaxy
• Click on "Analyze Data"
• You are by default in an unnamed history
• Name it "Datasets"
5. Data 2
• Click on "Shared Data → Data Libraries"
• Click on "Public Datasets"
• Click on "Mouse Pasteur"
• Check the boxes corresponding to RefSeq_Genes_mm9.gtf and to your datasets
• Click on the "Go" button
• Click on "Analyze Data"
• Look at the imported datasets (3 green boxes)
• Look at their content (eye icon)
• Look at their metadata (info icon)
The datasets are already available on the server
6. Read Mapping
1. Type "fastqc" in the search field of the left-hand column
2. Click on "FastQC:Read QC reports using FastQC"
3. Select your first fastq dataset
4. Run the tool
5. Select the yellow box (running tool)
6. Click on the "redo" box
7. Select your second fastq dataset
8. Run the tool → it will take 4-5 min at most
9. Search for "bwa" in the tool search field
10. Select "Map with BWA for Illumina"
11. Let's have a look at the tool form
Filtered reads were mapped to the hg19 genome and mm9 genome using
default parameters from BWA aligner29, and reads that failed to map to the
genome were re-mapped to their respective mRNA sequences to capture
reads that span exons.
1. The procedure is not reproducible because metadata and
parameters are lacking.
2. The procedure is out of date
• The article was published in 2013
• Tophat was published in 2009 and 2012, and Tophat2 in April 2013
8. Read Mapping using Tophat2
For a nice introduction to RNA-seq analysis, see
https://wiki.galaxyproject.org/Events/GCC2014/TrainingDay?action=AttachFile&do=view&target=RNA-SeqAltSlides.pdf
9. Read Mapping using Tophat2 in Galaxy
1. Create a new history and name it "tophat2 alignment"
2. Copy your 2 fastq files from the previous history, as well as the RefSeq.gtf reference file
3. Rename the files and add an annotation
4. Find and fill in the tophat2 tool form
5. Run the tool
6. Select your first fastq dataset
7. Run the tool
8. While it is running, look at the metadata
9. Rename the datasets using the pencil box
10. Import two other datasets
11. Re-run Tophat2 on these datasets
12. Look at the job in the admin panel (reproducible analyses)
13. Look at the tool in the Galaxy tool repository
14. Stop all running tools
15. Import the history "GS SRP018525 tophat2"
16. Visualize your reads in Trackster (1 GTF track + the mapping of 1 condition)
17. Optional: visualize junctions, etc.
18. Compare with another public genome browser (UCSC or Ensembl)
Paired-end reads were mapped to the mm9 genome using Tophat2 with the
parameters ---, and the RefSeq mm9 GTF annotation as a guide.
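One way to make such a methods sentence reproducible is to record the exact command line. The sketch below is an assumption-laden illustration, not the Galaxy wrapper: it uses only the Tophat2 options `-G` (GTF guide) and `-o` (output directory), leaves every other parameter at its default, and all file and index names passed to it are hypothetical. Non-default parameters, which the sentence above leaves unspecified, would go in `extra_params`.

```python
import shlex


def tophat2_command(index_prefix, reads_1, reads_2, gtf, out_dir, extra_params=()):
    """Assemble a Tophat2 command for one paired-end sample.

    Layout: tophat2 [options] <bowtie2 index prefix> <reads_1> <reads_2>
    """
    cmd = ["tophat2", "-G", gtf, "-o", out_dir]
    cmd.extend(extra_params)  # non-default parameters, to be reported verbatim
    cmd.extend([index_prefix, reads_1, reads_2])
    return cmd


def methods_sentence(cmd):
    """Quote the command so it can be pasted as-is into a methods section."""
    return "Reads were mapped with: " + shlex.join(cmd)
```

The returned list can be passed to `subprocess.run` on a machine where Tophat2 and a Bowtie2 index are installed; in Galaxy, the equivalent record is the job information kept with each dataset.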
10. Read Counting using featureCounts in Galaxy
1. Create a new history called "Read Counts"
2. Copy the accepted hits datasets from the "imported: GS SRP018525 tophat2" history
as well as the RefSeq GTF guide
3. You now have 6 datasets in the "Read Counts" history
4. Run featureCounts once on the oocyte 1 data
5. Re-run the tool for oocyte 2 and pronuclei 1, 2, 3
6. Change the metadata of the featureCounts summaries
7. Iteratively paste the featureCounts outputs using the "Paste two files side by side" tool
8. → We have a hit table
9. Rename it FeatureCounts HIT TABLE
10. We can visualize the data using Charts
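The pasting step can also be sketched outside Galaxy. The example below is a minimal illustration with hypothetical function names; it assumes each featureCounts output is tab-separated with the gene ID in the first column and the count in the last column, plus a header line starting with "Geneid". Unlike a side-by-side paste, it joins samples on the gene ID, so it does not depend on identical row order.

```python
def parse_counts(text):
    """Read one featureCounts output into {gene_id: count}."""
    counts = {}
    for line in text.splitlines():
        # Skip blank lines, comment lines and the header line.
        if not line.strip() or line.startswith("#") or line.startswith("Geneid"):
            continue
        columns = line.split("\t")
        counts[columns[0]] = int(columns[-1])
    return counts


def build_hit_table(counts_by_sample):
    """Join per-sample counts into rows: header first, then one row per gene.

    `counts_by_sample` maps a sample name to its {gene_id: count} dict.
    Genes missing from a sample are given a count of 0.
    """
    samples = list(counts_by_sample)
    genes = sorted(set().union(*(counts_by_sample[s] for s in samples)))
    rows = [["Geneid"] + samples]
    for gene in genes:
        rows.append([gene] + [counts_by_sample[s].get(gene, 0) for s in samples])
    return rows
```

Writing the rows with `"\t".join(map(str, row))` gives a table with one count column per sample, in the order the samples were supplied.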
11. Differential count analysis
1. Create a new history called "Differential count analysis"
2. Copy the "FeatureCounts HIT TABLE"
3. Run "Differential_Count models using BioConductor packages" on the FeatureCounts HIT TABLE
4. Review the results
5. Still, we did not reproduce Sup. Fig. 1
12. DESeq Analysis
1. Let's examine Fig. 1, together with the published methods
2. The published information is wrong, but we will try to approximate the figure by
guessing what was actually done
3. Copy the "FeatureCounts HIT TABLE" into a new history called "my DESeq approach"
4. To run the DESeq (version 1) package, we need to reformat the HIT TABLE
5. With a text editor OR within Galaxy:
   1. Cut columns
   2. Remove header
   3. Upload new header
   4. Manipulate header
   5. Concatenate files
6. Run the tool "DESeq Profiling (replicates) with sample replicates"
7. Get the R code available in the public library: Rscript_for_Sup_Fig1a
8. Run the Docker Tool Factory tool with this R code to generate the figure
9. Run the tool "DESeq2 Profiling"
10. Re-run the Docker Tool Factory tool with the same R code on the DESeq2 DE analysis
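The reformatting sub-steps of step 5 (cut columns, remove the header, supply a new header, concatenate) can be sketched in a few lines. This is only an illustration with a hypothetical function name; which columns to keep and what the new header must contain depend on the DESeq tool form and are not specified here.

```python
def reformat_hit_table(rows, keep_columns, new_header):
    """Return a table with selected columns and a replaced header.

    rows         : list of rows, the first one being the old header
    keep_columns : 0-based indices of the columns to keep ("Cut columns")
    new_header   : labels for the kept columns ("Upload new header")
    """
    if len(new_header) != len(keep_columns):
        raise ValueError("new_header needs one label per kept column")
    body = rows[1:]  # "Remove header"
    cut = [[row[i] for i in keep_columns] for row in body]
    return [list(new_header)] + cut  # "Concatenate files": new header + body
```

The function returns a new list of rows and leaves the original hit table untouched, so the same table can be reformatted again for the DESeq2 run.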
Transcriptional profiling
In both human and mouse cases, data normalization was performed by
transforming uniquely mapped transcript reads to RPKM30. Genes with low
expression in all stages (average RPKM < 0.5) were filtered out, followed by
quantile normalization. For differential expression, we compared every time
point to its previous time point using default parameters in DESeq using
normalized read counts. Genes were called differentially expressed if they
exhibited a Benjamini and Hochberg-adjusted P value (FDR) <5% and a mean
fold change of >2.
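The calling rule in the last sentence of the quoted text can be made explicit. The sketch below implements the standard Benjamini-Hochberg step-up adjustment in plain Python; the function names are hypothetical, and treating "fold change of >2" as applying in both directions (above 2 or below 1/2) is an assumption, since the published text does not say.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_differential(genes, p_values, fold_changes, fdr=0.05, min_fold=2.0):
    """Genes with adjusted P < fdr and a fold change beyond min_fold.

    Assumption: the fold-change cut-off is symmetric (up or down).
    Fold changes are plain ratios, not log2 values.
    """
    adjusted = benjamini_hochberg(p_values)
    return [
        gene
        for gene, q, fc in zip(genes, adjusted, fold_changes)
        if q < fdr and (fc > min_fold or fc < 1.0 / min_fold)
    ]
```

DESeq reports its own adjusted P values, so this adjustment is only needed when starting from raw P values.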
13. Optional: comparison between the tophat2 approach and the BWA approach
1. Sharing the "SRP018525 BWA" history
2. Sharing the "Comparison BWA / Tophat" visualization
3. Analyze the differences