2. ChIP-seq analysis
• ChIP-seq combines chromatin immunoprecipitation with
high-throughput sequencing.
• It detects the genomic regions bound by proteins such
as:
• Transcription factors
• Histones
• RNA polymerase II
• …
5. ChIP-seq analysis
Starting the analysis.
• Typically you will receive 10 to 30 million raw reads
per sample, corresponding to a compressed file of 0.5-1.5
GB.
FASTQ format
@HWUSI-EAS621:69:64EKPAAXX:3:1:11477:1265 1:N:0:   (header line, starts with @)
GAAACTTGAGGACTGCCCAGCTCGACAGACACTGGA   (sequence)
+   (second header line, starts with +)
GEGGDGG@GGDGGGGGGGBDGGDG8GG@3D6:3:67   (quality)
The quality of each base is encoded as an ASCII character and
represents the Phred quality score:
Q = -10 * log10(p)
p = probability that the base call is incorrect
Q = 20 means a base call accuracy of 99%
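As a minimal sketch of the encoding described above, the quality string from the FASTQ record can be decoded in Python. The ASCII offset of 33 is an assumption (it is the common Sanger / Illumina 1.8+ convention, and the characters in this record are consistent with it); the function names are illustrative only.

```python
def phred_scores(quality, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores.

    offset=33 is an assumption (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(char) - offset for char in quality]


def error_probability(q):
    """Invert Q = -10 * log10(p) to get the error probability p."""
    return 10 ** (-q / 10)


quality = "GEGGDGG@GGDGGGGGGGBDGGDG8GG@3D6:3:67"
scores = phred_scores(quality)
print(scores[:5])  # first five scores: [38, 36, 38, 38, 35]
print(min(scores), error_probability(min(scores)))  # worst base in the read
```

With this offset, the lowest score in the read is 18, so even its worst base is called with better than 98% accuracy.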
6. ChIP-seq analysis
Starting the analysis.
• It is strongly recommended to check the quality of the
sequences you receive before starting the analysis!
FastQC analysis
7. ChIP-seq analysis
Starting the analysis.
Map the reads with an ultra-fast mapper:
• GEM
• Bowtie
• BWA
• Stampy
The reference genome must be indexed before mapping the
reads.
12. ChIP-seq analysis
Peak calling – MACS
Given a sonication size (bandwidth) and a fold-enrichment range
(mfold), MACS slides windows of 2*bandwidth across the genome to find
regions whose tag enrichment, relative to a random genome-wide tag
distribution, falls within mfold (default: between 10 and 30).
13. ChIP-seq analysis
Peak calling – MACS
MACS selects at least 1,000 “model peaks” to calculate the distance
“d” between the paired forward-strand and reverse-strand peaks.
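The idea behind “d” can be sketched for a single toy peak: the forward-strand tags pile up on one side of the binding site and the reverse-strand tags on the other, and d is the distance between the two pile-ups. This is only an illustration with made-up positions; MACS itself builds the model from many peaks, and the function name below is hypothetical.

```python
from statistics import mode


def estimate_d(forward_5prime, reverse_5prime):
    """Distance between the most frequent 5' position on each strand."""
    return mode(reverse_5prime) - mode(forward_5prime)


# Made-up 5' end positions of tags around one binding site.
forward_5prime = [100, 102, 102, 104]
reverse_5prime = [220, 222, 222, 224]

d = estimate_d(forward_5prime, reverse_5prime)
print(d)  # 120
```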
14. ChIP-seq analysis
Peak calling – MACS
How do we determine whether a peak is higher than expected by chance?
The tag distribution along the genome can be modeled by a Poisson
distribution.
• x = observed read number
• λ = expected read number
Probability of finding a peak higher than x:
P(X > x) = 1 - sum_{k=0..x} e^(-λ) * λ^k / k!
15. ChIP-seq analysis
Peak calling – MACS
Example:
Tag count = 2
Number of reads = 30,000,000
Read length = 36
Mappable human genome = 2,700,000,000
16. ChIP-seq analysis
Peak calling – MACS
Example:
Tag count = 10
Number of reads = 30,000,000
Read length = 36
Mappable human genome = 2,700,000,000
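The two examples above can be worked through with a short Python sketch. It assumes that the expected tag count λ is the average coverage (number of reads × read length / mappable genome size) and that "higher than x" means strictly more than x tags; both are assumptions about how the slide's numbers are meant to be combined, and the function names are illustrative.

```python
import math


def expected_tags(num_reads, read_length, genome_size):
    """Average coverage per position, used here as lambda (assumption)."""
    return num_reads * read_length / genome_size


def poisson_upper_tail(x, lam, terms=200):
    """P(X > x) for X ~ Poisson(lam), summing the pmf from x + 1 upward."""
    total = 0.0
    for k in range(x + 1, x + 1 + terms):
        # pmf computed in log space to avoid overflow in k!
        total += math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))
    return total


lam = expected_tags(30_000_000, 36, 2_700_000_000)
print("lambda =", lam)  # about 0.4

for tag_count in (2, 10):
    print(tag_count, poisson_upper_tail(tag_count, lam))
```

Under these assumptions, exceeding a tag count of 2 happens by chance a little under 1% of the time, while exceeding a tag count of 10 is many orders of magnitude rarer, which is why taller peaks are far more convincing.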
17. ChIP-seq analysis
Peak calling – MACS
• Each tag is shifted by d/2 toward its 3’ end.
• Windows of length 2*d are slid across the genome to
detect the enriched regions (Poisson distribution p-value
<= 1e-5).
• Overlapping enriched regions are merged.
• The summit of each peak is taken as the putative TF
binding site.
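The shifting and merging steps above can be sketched as follows. This is a simplified illustration and not MACS's own code: the positions and the value of d are made up, and the function names are hypothetical.

```python
def shift_tag(five_prime, strand, d):
    """Move a tag's 5' position by d/2 toward its 3' end."""
    half = d // 2
    # Forward-strand tags move right, reverse-strand tags move left.
    return five_prime + half if strand == "+" else five_prime - half


def merge_overlapping(regions):
    """Merge overlapping (start, end) regions into single regions."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


d = 120  # made-up fragment size
print(shift_tag(1000, "+", d))  # 1060
print(shift_tag(1120, "-", d))  # 1060
print(merge_overlapping([(100, 300), (250, 400), (500, 600)]))
# [(100, 400), (500, 600)]
```

After shifting, tags from both strands of the same fragment land on the same position, so the summit of the merged signal points at the likely binding site.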
18. ChIP-seq analysis
Peak calling – MACS
To address local biases in the genome, such as local chromatin
structure, sequencing bias and genome copy number variation, MACS
evaluates candidate peaks by comparing them against a “local”
distribution.
Fold enrichment = enrichment over λlocal
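A rough sketch of the local-background idea: take the largest of the genome-wide background and the estimates from windows of control tags around the candidate peak, then divide the observed count by it. The window sizes, the counts, and the rescaling of each window to the scan window are assumptions made for illustration; the function names are hypothetical and this is not MACS's exact procedure.

```python
def local_lambda(lambda_bg, control_counts, scan_window):
    """Largest expected tag count among background and local estimates.

    control_counts maps a window size in bp to the number of control
    tags found in that window around the candidate peak (made-up data).
    """
    candidates = [lambda_bg]
    for window_size, count in control_counts.items():
        # Rescale the count to the size of the scan window.
        candidates.append(count * scan_window / window_size)
    return max(candidates)


def fold_enrichment(observed_tags, lam_local):
    """Observed tag count relative to the local expectation."""
    return observed_tags / lam_local


lam_local = local_lambda(0.4, {1000: 10, 10000: 40}, scan_window=200)
print(lam_local)                       # 2.0
print(fold_enrichment(20, lam_local))  # 10.0
```

Taking the maximum makes the test conservative: a region where the control is locally rich in tags needs more sample tags to be called a peak.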
19. ChIP-seq analysis
Peak calling – MACS
The False Discovery Rate (FDR) is calculated as the number of control
peaks called divided by the number of sample peaks. Control peaks are
called by swapping control and sample.
The FDR is calculated only when a control is provided!
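As a worked example of this ratio, the sketch below uses the two peak counts reported in the MACS log of the practical part (13591 sample peaks and 594 peaks after swapping treatment and control). It computes a single overall ratio for illustration; the function name is hypothetical.

```python
def empirical_fdr(control_peaks, sample_peaks):
    """Control peaks called divided by sample peaks called."""
    return control_peaks / sample_peaks


fdr = empirical_fdr(594, 13591)
print(f"{fdr:.2%}")  # 4.37%
```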
21. ChIP-seq analysis
Practical part
Connect to the Etna machine using ssh.
• Mac or Linux users can use this command:
$ ssh -X course@xxx.crg.es
course@xxx.crg.es's password:
Password: xxxxxxx
• Windows users should first download the PuTTY and PSCP
programs and then use them to access the
machine. http://goo.gl/4BWud
24. ChIP-seq analysis
Launch MACS, passing the sample (-t), the control (-c), the
genome size (-g hs, where hs = Homo sapiens) and the experiment name (-n):
$ macs14 -t ../data/Treatment_tags.bed -c ../data/Input_tags.bed -g hs -n FoxA1
25. ChIP-seq analysis
Check the output printed to the screen.
$ macs14 -t ../data/Treatment_tags.bed -c ../data/Input_tags.bed -g hs -n FoxA1
INFO @ Thu, 29 Mar 2012 14:58:35:
# ARGUMENTS LIST:
# name = FoxA1
# format = AUTO
# ChIP-seq file = ./Treatment_tags.bed
# control file = ./Input_tags.bed
# effective genome size = 2.70e+09
# band width = 300
# model fold = 10,30
# pvalue cutoff = 1.00e-05
# Small dataset will be scaled towards larger dataset.
# Range for calculating regional lambda is: 1000 bps and 10000 bps
INFO @ Thu, 29 Mar 2012 14:58:35: #1 read tag files...
INFO @ Thu, 29 Mar 2012 14:58:35: #1 read treatment tags...
INFO @ Thu, 29 Mar 2012 14:58:35: Detected format is: BED
Regional lambda has two values in this version: a small one to
account for bias around the summit and a large one for the
surrounding area.
26. ChIP-seq analysis
Check the output printed to the screen.
INFO @ Thu, 29 Mar 2012 14:59:41: #1 tag size is determined as 35 bps
INFO @ Thu, 29 Mar 2012 14:59:41: #1 tag size = 35
INFO @ Thu, 29 Mar 2012 14:59:41: #1 total tags in treatment: 3909805
..
INFO @ Thu, 29 Mar 2012 14:59:46: #2 Build Peak Model...
INFO @ Thu, 29 Mar 2012 15:00:00: #2 number of paired peaks: 11861
INFO @ Thu, 29 Mar 2012 15:00:00: #2 finished!
INFO @ Thu, 29 Mar 2012 15:00:00: #2 predicted fragment length is 119 bps
INFO @ Thu, 29 Mar 2012 15:00:00: #2.2 Generate R script for model : FoxA1_model.r
INFO @ Thu, 29 Mar 2012 15:00:00: #3 Call peaks...
INFO @ Thu, 29 Mar 2012 15:00:00: #3 shift treatment data
INFO @ Thu, 29 Mar 2012 15:00:01: #3 merge +/- strand of treatment data
INFO @ Thu, 29 Mar 2012 15:00:01: #3 call peak candidates
INFO @ Thu, 29 Mar 2012 15:00:13: #3 shift control data
INFO @ Thu, 29 Mar 2012 15:00:13: #3 merge +/- strand of control data
INFO @ Thu, 29 Mar 2012 15:00:15: #3 call negative peak candidates
INFO @ Thu, 29 Mar 2012 15:00:25: #3 use control data to filter peak candidates...
INFO @ Thu, 29 Mar 2012 15:00:31: #3 Finally, 13591 peaks are called!
INFO @ Thu, 29 Mar 2012 15:00:31: #3 find negative peaks by swapping treat and control
INFO @ Thu, 29 Mar 2012 15:00:36: #3 Finally, 594 peaks are called!
31. ChIP-seq analysis
$ macs14 -t ../data/Treatment_tags.bed -c ../data/Input_tags.bed -g hs -n FoxA1 -w
The -w option creates “wiggle” files for each
chromosome analyzed.
The -B option creates “bedgraph” files.
The -S option, together with either -w or -B, creates a single
huge file for the whole genome.
--space=NUM can be used to change the resolution of the
wiggle file.
38. ChIP-seq analysis
Analyze histone modifications
• Broader peaks
• No clear shape (more summits)
• The peak model is often impossible to create.
$ macs14 -t ../data/ES.H3K27me3.bed -g mm --nomodel --nolambda -n H3K27me3
• It is recommended to skip the model with the --nomodel
option.
• Since no control is available, the comparison is done
against the sample background. It is recommended to skip
the local background (--nolambda) when you have no control
and very broad peaks.