diffReps: automated ChIP-seq differential analysis package

Li Shen

diffReps is published in PLoS ONE. Link: http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0065598

diffReps: automated ChIP-seq differential analysis package
Li Shen
Asst. Professor
Neuroscience, Mount Sinai
06/28/2013
Slides adapted from previous presentation

ChIP-seq differential analysis
Treatment (cocaine, i.p.): Rep1, Rep2, Rep3
Control (saline, i.p.): Rep1, Rep2, Rep3
Goal: find the differences between the two groups.
[Figure: Venn diagram for peak lists, Treatment vs. Control, with false positives and false negatives marked]

Subtle changes of chromatin modifications
H3K4me3 from ENCODE
[Figure: H3K4me3 tracks for K562 and ESC at ASUN (Asunder, Spermatogenesis Regulator); both tracks scaled to [0, 1.2]]

Existing programs for differential analysis
• ChIPDiff (2008): HMM-based approach. NOT sensitive enough for brain data.
• Peak-based: DIME (2011), DBChIP (2012). Caveats.
• Read counts + DESeq (2010)/edgeR (2010): Not convenient to use.
[Figure: K562 and ESC tracks with called peaks]

diffReps: a ChIP-seq differential analysis package
• Written in Perl; an easy-to-use command-line tool that does everything in one command.
• Sliding window strategy.
Workflow: Background modeling → Normalization → Differential test → Merge and re-test → Multiple testing correction
diffReps.pl -tr A.bed B.bed -co C.bed D.bed -gn mm9 -re report.txt
Google code:

Differential analysis & tail behavior
[Figure: Gaussian fit vs. empirical density, with the p=1E-5 cutoff marked for each]
H3K4me3 from mouse brain; bin1kb counts normalized.

Statistical tests for differential analysis
• Negative binomial test: models biological replicates, over-dispersion
• T-test: NOT recommended
• χ² test: SUM((emp – exp)^2 / exp) => χ² distr (p-val).
• G-test: 2 * SUM(emp * ln(emp / exp)) => χ² distr (p-val). A modification to χ² test, recommended.
diffReps on H3K4me3: cocaine vs. saline
[Figure: Venn diagram of sites from the negative binomial test vs. the T-test; labels 6527, 282, 130]

Two additional tools
1. Find hotspots - hotspots are regions where the differential sites or peaks occur significantly more often than random chance.
[Figure: differential sites grouped into a hotspot; greedy search algorithm, local Poisson evaluation]
2. Region analysis - accepts any file whose first 3 columns are chromosome, start, end. Annotates gene and heterochromatic regions.
Easy to use: region_analysis.pl -i input.txt

Test data: ENCODE H3K4me3 between K562 and ESC
Target: H3K4me3 (ESC: Rep1, Rep2; K562: Rep1, Rep2). Identify differential chromatin modification sites.
Mock: DNA Input. Estimate empirical false positive rate.

Sensitivity & Specificity
[Figure: differential sites on Target and Mock data; negative binomial vs. G-test]
eFDR < .05%

Overlapped & specific sites
Up-regulated sites, do the same for down sites
Using default p<1E-4
[Figure: “Specific” and “Overlapped” sites, each split into Promoter and Genebody and compared with RNA-seq]

Correlating differential sites with transcription
“Specific” vs. “Overlapped”
K562, ESC RNA-seq TopHat-Cufflinks: gene exp change, alternative promoter/splicing

diffReps “specific” sites - examples

diffReps is used in many works
Big cocaine project:

diffReps: current status & community feedback
diffReps published: http://dx.plos.org/10.1371/journal.pone.0065598
“Great to see diffreps has found a nice home in plos one. It is literally the program which has saved my sanity, my phD and probably the paper i'm writing!”
- Michael Reschen, Oxford Univ., UK

Acknowledgement
[Figure: heatmap of each person's role. People: Li Shen, Ningyi Shao, Xiaochuan Liu, Eric Nestler. Roles: Development, Test & result, Documentation, Google code, Money$]
diffReps:

Editor's notes

Good morning! Thank you for inviting me. I’ve been coming to this meeting many times but have never made any contribution. So today is payback time.
The first problem I identified is something called differential analysis for ChIP-seq data. Basically, you have two groups of animals: one group is treatment and the other is control. You take samples from these animals and send them for sequencing to measure the chromatin modifications, and you want to compare the two groups to find the differences in chromatin modifications. This sounds like a straightforward question, but the solution is not. Some people may say: well, this is easy, why not do peak calling for each group separately and then compare the two peak lists using a Venn diagram? Well, this is surely going to be problematic. A treatment-specific peak may not be truly different: you may happen to set the cutoff between the two heights, which leads to a false positive. On the other hand, a common peak may actually be different, but you set the cutoff below both heights, which leads to a false negative. So we definitely need to treat this problem more carefully.
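The cutoff effect described here can be made concrete with a small sketch. The peak heights and the cutoff below are made-up numbers, and `venn_label` is a hypothetical helper that only mimics how a peak-list Venn diagram would classify a locus; it is not part of diffReps.

```python
# Hypothetical peak heights (arbitrary units); all values are made up.
CUTOFF = 10.0

def called(height, cutoff=CUTOFF):
    """A peak is 'called' when its height reaches the cutoff."""
    return height >= cutoff

def venn_label(treat, ctrl, cutoff=CUTOFF):
    """Classify a locus the way a peak-list Venn diagram would."""
    t, c = called(treat, cutoff), called(ctrl, cutoff)
    if t and c:
        return "common"
    if t:
        return "treatment-specific"
    if c:
        return "control-specific"
    return "no peak"

# Locus A: nearly identical heights straddle the cutoff -> false positive.
print(venn_label(10.5, 9.5))   # treatment-specific
# Locus B: a threefold change, both heights above the cutoff -> false negative.
print(venn_label(60.0, 20.0))  # common
```

Neither label reflects the size of the change between the groups, which is why the comparison has to be made on the read counts themselves.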
In addition to this problem, we also found, from real data, that some chromatin modifications are really subtle. This is an example from the ENCODE project. ChIP-seq was performed on the histone mark H3K4me3 in two cell lines, K562 and embryonic stem cells. Clearly, there is a peak at the TSS of the gene ASUN in both cell lines. But there also appear to be two increased sites, one on each side of the peak, which are really subtle. And the site downstream of the TSS seems to overlap with this variant exon. Could this chromatin modification site cause the change in expression of these two isoforms? To answer this kind of question, you really want to cut the whole genome into many small slices and determine the chromatin modification at each slice.
Unfortunately, when I started to work on this kind of problem, there were very few choices available. Back in 2009, it seems there was only one program, called ChIPDiff, that specifically targeted differential analysis for ChIP-seq. Based on our experience, ChIPDiff tends to generate very few targets. When I used it on our brain data, it often gave me nothing. It was not until 2011 and 2012 that two new tools, DIME and DBChIP, appeared; they base their differential analysis on peak lists given by another peak calling program. But this kind of approach has caveats. Using the example I just showed you, you may identify a peak in K562 like this, and another peak in stem cells like this. How can you compare these two peaks? It’s very likely you’ll miss these two differential sites. Finally, people have also tried to use DESeq and edgeR on ChIP-seq data. These two programs are my favorites because they treat statistics seriously. But they were originally designed for RNA-seq. To use them on ChIP-seq, you have to add a lot of pre- and post-processing steps. So they are not convenient to use.
Out of these frustrations, I decided to develop my own program, called diffReps. It is a program package written in Perl. The workflow of diffReps is illustrated here. It goes from background modeling and normalization all the way down to multiple testing correction. It is typically triggered by one command line like this, which does all these things. It uses a sliding window strategy, so you won’t miss a thing. By the way, diffReps is developed as an open source project and is hosted on Google Code.
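The sliding window idea can be sketched in a few lines. This is only an illustration of counting reads in overlapping windows, not diffReps' Perl implementation; the function name and the window and step sizes (1000 bp and 100 bp) are assumptions chosen for the example.

```python
def sliding_window_counts(positions, chrom_len, window=1000, step=100):
    """Count reads in overlapping windows [start, start + window).

    positions: 0-based read positions on one chromosome.
    Returns {window_start: read_count} for each window.
    """
    if window % step != 0:
        raise ValueError("window must be a multiple of step")
    n_bins = -(-chrom_len // step)              # ceiling division
    bins = [0] * n_bins
    for pos in positions:
        if 0 <= pos < chrom_len:
            bins[pos // step] += 1              # count reads at step resolution
    per_window = window // step
    counts = {}
    running = sum(bins[:per_window])            # read count of the first window
    for first in range(0, n_bins - per_window + 1):
        counts[first * step] = running
        nxt = first + per_window
        if nxt < n_bins:
            running += bins[nxt] - bins[first]  # slide forward by one step
    return counts

# Made-up read positions on a 3 kb toy chromosome.
counts = sliding_window_counts([50, 150, 950, 1020, 1990], chrom_len=3000)
print(counts[0], counts[1000], counts[2000])    # 3 2 0
```

Because neighbouring windows overlap, a change that straddles the boundary of one window is still fully contained in another, which is the sense in which a sliding window does not miss sites.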
Across my career, I have heard some people say things like “it doesn’t matter what kind of distribution you use, they are all about the same”. I do not agree with that. One of the most common mistakes people make on sequencing data is that they normalize the read counts and then assume these values are normally distributed. Here I used a ChIP-seq dataset from our brain samples. I then calculated the difference between the means of two groups under different conditions. The dot-dashed line shows you the empirical density, while the red line shows you the Gaussian fit. As you can see, the two distributions are totally different. The empirical data show a sharp peak with a long right-hand tail, while the Gaussian is much broader. In differential analysis, it’s all about the tail behavior of the distribution. At a p-value of 10 to the minus 5, this is where the Gaussian cutoff is, and this is where the empirical cutoff is. Look at how big the difference between them is.
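A toy simulation shows the same tail effect. The data below are simulated log-normal values, not the brain ChIP-seq data, and the tail probability is 1E-3 rather than 1E-5 so that a modest sample is enough; both choices are assumptions made only for illustration.

```python
import random
from statistics import NormalDist, fmean, stdev

def tail_cutoffs(values, p):
    """Return (gaussian_cutoff, empirical_cutoff) for upper-tail probability p."""
    # Gaussian cutoff: fit mean and standard deviation, then take the quantile.
    gauss = NormalDist(fmean(values), stdev(values))
    gaussian_cutoff = gauss.inv_cdf(1.0 - p)
    # Empirical cutoff: read the same quantile directly from the sorted data.
    ordered = sorted(values)
    index = min(len(ordered) - 1, int((1.0 - p) * len(ordered)))
    return gaussian_cutoff, ordered[index]

# Simulated skewed data: sharp peak with a long right-hand tail.
rng = random.Random(0)
values = [rng.lognormvariate(0.0, 1.0) for _ in range(100_000)]

gaussian_cutoff, empirical_cutoff = tail_cutoffs(values, p=1e-3)
print(f"Gaussian cutoff:  {gaussian_cutoff:.2f}")
print(f"Empirical cutoff: {empirical_cutoff:.2f}")
```

For data this skewed, the empirical cutoff lies well above the Gaussian one, so a Gaussian assumption would report far too many values as significant at the same nominal p-value.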
So choosing the right statistical test is extremely important for ChIP-seq differential analysis. In diffReps, we implemented four different tests: negative binomial, t-test, chi-square test and G-test. If you have biological replicates, then the negative binomial test is really what you should use. It models the over-dispersion among the biological replicates and controls false positives. The t-test really should not be used; I only added it for comparison purposes. If your data do not contain biological replicates, then the chi-square test or the G-test can be an excellent choice. The G-test can basically be considered a modification of the chi-square test and is recommended by some statisticians as a replacement for chi-square. On the top right, this group of people from Oregon State has done a very nice comparison between negative binomial and t-test. The conclusion is that the t-test is no good: it is neither sensitive nor specific on sequencing data. But they somehow published this study in a not-so-prominent journal, so probably most people did not notice this paper. If you are interested in differential analysis, I would suggest you read it. On the bottom right, I also did a comparison between negative binomial and t-test on our own ChIP-seq data. The difference is striking. Negative binomial predicts 20-fold more sites than the t-test. What’s even worse is that fewer than half of the t-test sites are overlapped by negative binomial. So this really raises a red flag for those who are using the t-test on ChIP-seq or RNA-seq data.
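For the no-replicate case, the two statistics can be written out as follows. This is a textbook goodness-of-fit sketch for one window with one treatment and one control count, not diffReps' own code; the function names, the toy counts and the equal library sizes are assumptions. With two groups there is one degree of freedom, so the p-value can be computed from `math.erfc` alone.

```python
import math

def expected_counts(observed, library_sizes):
    """Split the total count in proportion to library size (null hypothesis)."""
    total = sum(observed)
    size_sum = sum(library_sizes)
    return [total * s / size_sum for s in library_sizes]

def chi_square_stat(observed, expected):
    """Pearson statistic: SUM((obs - exp)^2 / exp)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def g_stat(observed, expected):
    """G statistic: 2 * SUM(obs * ln(obs / exp)); zero counts contribute 0."""
    return 2.0 * sum(o * math.log(o / e)
                     for o, e in zip(observed, expected) if o > 0)

def p_value_df1(stat):
    """Upper-tail probability of a chi-square distribution with 1 df."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Toy window: 30 reads in treatment, 10 in control, equal library sizes.
observed = [30, 10]
expected = expected_counts(observed, library_sizes=[1, 1])   # [20.0, 20.0]
chi2 = chi_square_stat(observed, expected)                   # 10.0
g = g_stat(observed, expected)
print(chi2, p_value_df1(chi2))
print(g, p_value_df1(g))
```

Neither statistic uses any within-group variance, which is why these tests suit data without replicates and why they cannot account for over-dispersion the way the negative binomial test does.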
Besides differential tests, diffReps also includes two additional tools. The first tool is called find hotspots. A hotspot is basically a region where the differential sites or peaks occur significantly more often than random chance. In this cartoon, these guys are very close to each other and they form a hotspot, while this guy is being squared. A greedy search algorithm is designed to identify those hotspots. It basically goes from the start to the end and eats a differential site whenever it improves the score. When a hotspot is found, it is evaluated by a local Poisson model. The second tool is called region analysis. It is a script that accepts any input file as long as the first 3 columns contain genomic coordinates. It will assign each region to genes or heterochromatic regions.
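The scoring function of the greedy search is not spelled out in these notes, so the sketch below swaps in a simpler technique: sites are grouped by a maximum gap, and each group is then evaluated with a Poisson tail probability against the average site density of the surrounding region. The function names, the gap threshold and the toy coordinates are all assumptions; this is not the diffReps hotspot algorithm.

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)           # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i             # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)

def cluster_by_gap(sites, max_gap):
    """Group site positions into runs whose neighbours are <= max_gap apart."""
    clusters = []
    for pos in sorted(sites):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

def hotspot_pvalue(cluster, n_sites, region_len):
    """Poisson p-value for seeing len(cluster) sites within the cluster's span,
    given the average site density of the surrounding region."""
    span = max(1, cluster[-1] - cluster[0])
    lam = n_sites / region_len * span
    return poisson_sf(len(cluster), lam)

# Made-up differential site positions in a 200 kb region.
sites = [1000, 1200, 1500, 1700, 50000, 120000]
clusters = cluster_by_gap(sites, max_gap=1000)
candidate = clusters[0]             # the four tightly spaced sites
print(candidate, hotspot_pvalue(candidate, len(sites), 200_000))
```

In practice only groups with at least two sites would be evaluated, since a single site has no span and cannot form a hotspot.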
So we’ve talked a lot about methodology. Now, let’s put diffReps to the test. This test dataset is from the ENCODE project. ChIP-seq was performed on H3K4me3 in two cell lines: K562 and embryonic stem cells. There are two replicates in each group, and the number of aligned reads ranges from 7 to 16 million. We also created a mock dataset using DNA input samples, and we mixed the replicates between the two cell lines. The reason for doing that is that the DNA input actually contains information about chromatin structure, so we want to remove those biases. By using this mock dataset, we can estimate the empirical false positive rate.
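One simple way to turn the mock comparison into a number is sketched below. It assumes the empirical rate is taken as the ratio of sites called on the mock data to sites called on the target data at the same p-value cutoff; the notes do not state the exact formula, and the p-values are made up.

```python
def empirical_fdr(target_pvalues, mock_pvalues, cutoff):
    """Ratio of mock calls to target calls at a given p-value cutoff.

    Returns None when nothing is called on the target data.
    """
    n_target = sum(p <= cutoff for p in target_pvalues)
    n_mock = sum(p <= cutoff for p in mock_pvalues)
    if n_target == 0:
        return None
    return n_mock / n_target

# Made-up p-values for a handful of windows.
target = [1e-6, 1e-5, 1e-5, 1e-3]
mock = [1e-5, 0.2]
print(empirical_fdr(target, mock, cutoff=1e-4))   # 1 mock call / 3 target calls
```

Repeating the calculation over a range of cutoffs shows how quickly the mock calls fall away relative to the target calls.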
These two figures show you that diffReps predicts many more differential sites than the other approaches at different p-value cutoffs. Although diffReps also produces some differential sites on the mock data, the number decreases rapidly with the p-value cutoff, and the empirical false discovery rate is below .5% for diffReps. It should also be noted that the G-test is very sensitive and produces many more sites than the negative binomial test. This is not surprising, because the G-test ignores the variation within a group, so it tends to have a higher false positive rate. But the nice thing about the G-test is that its sites nearly include those from negative binomial. So if false positives are not your major concern, the G-test can be an excellent choice.
Now, at the default p-value cutoff, diffReps produces a differential site list that basically includes those of DESeq and ChIPDiff. There are lots of diffReps-specific sites that are not overlapped by the other methods. A natural question is whether those sites are actually biological, not just noise from the data. So we separate the differential sites into specific and overlapped categories, and further classify them by location into promoter and genebody. Then we correlate those sites with RNA-seq data.
The RNA-seq data from the two cell lines were processed using the TopHat-Cufflinks pipeline. This pipeline measures not only gene expression change but also more complicated things like alternative promoter usage and alternative splicing. We correlated these different categories of events using Fisher’s exact test. When we look at the overlapped category, the sites correlate very well with gene expression changes. They also show some correlation with alternative promoters but not with alternative splicing. When we look at the diffReps-specific category, the sites also show different kinds of correlation with transcriptional change. So this is very positive: it means a lot of the diffReps-specific sites are likely to be biological. What is interesting here is that those diffReps-specific sites also correlate with alternative splicing. This seems to suggest that a lot of subtle chromatin modifications are missed by other methods but diffReps can pick them up. So diffReps is a very sensitive method that catches both major and minor changes.
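Fisher’s exact test on a 2x2 table can be sketched with the standard library. The table layout (genes with or without a differential site, crossed with genes with or without a transcriptional change) and the counts are assumptions for illustration; the notes do not say how the tables were built or whether a one-sided or two-sided test was used.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Returns P(top-left cell >= a) with all margins fixed, i.e. the
    probability of at least this much overlap under independence.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favourable = sum(comb(row1, x) * comb(n - row1, col1 - x)
                     for x in range(a, upper + 1))
    return favourable / comb(n, col1)

# Made-up counts:         changed  unchanged
# with differential site     3         1
# without                    1         3
print(fisher_exact_greater(3, 1, 1, 3))   # 17/70
```

A small p-value from such a table indicates that genes carrying differential sites are enriched for the transcriptional event being tested.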
To give you some more intuitive, real examples, we created these two figures. In the upper figure, the MICU1 gene has two alternative promoters. The second one is many kb downstream of the first one. The longer TSS has increased expression in the K562 cell line. diffReps found two increased sites at the longer TSS. This is consistent with this histone mark’s role as an activation mark at the TSS. In the lower figure, the FANCI gene has two isoforms. The second isoform contains a variant exon which has increased expression in the K562 cell line. diffReps found an increased site which overlaps with this variant exon. This seems to suggest a positive role for H3K4me3 in this exon’s inclusion.
As you can see, diffReps can be a very useful tool for ChIP-seq analysis. We have used it on literally every ChIP-seq dataset we have. It was used to study morphine-regulated H3K9me2 in mouse brain, a study that was published last year in the Journal of Neuroscience. It was also used in our big cocaine project to study the cocaine-regulated chromatin modifications of 7 different histone marks.
The paper about diffReps is now in production at PLoS ONE and should come online in no time. Recently, I received this email from one of diffReps’ users. This guy from the UK said, and I quote, “great to see…”. Well, I am really flattered. Sometimes, I do feel that it is users like this who keep me motivated to improve my programs and make them even better.
I thought I could be innovative in this section too. These are two heatmaps that show you each person’s role in the two software packages. diffReps is kind of a one-man project. I pretty much did everything, and Ningyi helped a lot with testing and results generation. For ngsplot, I developed most of the code. Ningyi also made some contributions. Leo has been helping with testing, documentation and maintaining the Google Code page. He also imported it into Galaxy. Eric Nestler is all about money.