Quan Nguyen in the ICG GigaScience Prize Track: Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics and epigenetics data. #ICG12 in Shenzhen, 26th October 2017
Ähnlich wie Quan Nguyen at #ICG12: Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics and epigenetics data (20)
Quan Nguyen at #ICG12: Mammalian genomic regulatory regions predicted by utilizing human genomics, transcriptomics and epigenetics data
1. Mammalian DNA regulatory regions predicted by
utilizing human genomics, transcriptomics and
epigenetics data
Quan H. Nguyen, Ross L. Tellam, Marina Naval-Sanchez, Laercio R. Porto-Neto, William Barendse, Antonio Reverter,
Benjamin Hayes, James Kijas, and Brian P. Dalrymple
Commonwealth Scientific and Industrial Research Organisation (CSIRO), Livestock Genomics, Brisbane, Australia
(Carlson et al., 2016, Nat Biotech)
2. Why are we searching for DNA regulatory regions?
• There are genome assemblies for many species
• We know little about which parts of the genome are
functional:
– We know mostly about protein-coding genes (~ 2% of the genome)
– Coding genes are mostly similar (in sequence and numbers) between
mammalian species
– The control of gene expression distinguishes species, individuals, and
tissues
– Regulatory DNA sequences are binding sites of transcriptional regulation
proteins (e.g. transcription factors)
1 |
• To utilise the genome information, we need to explore
beyond protein-coding sequences (i.e. regulatory
sequences)
Functional Annotation of Animal
Genomes (FAANG)
Proteins
3. 2 |
How to identify regulatory regions experimentally?
• Regulatory regions are identified by a
combination of:
1. Epigenetics data: histone
modifications (e.g. ChIP-Seq
H3K26me3, H4K20me1…), DNA
methylation (WGBS)
2. Genomics data: open-chromatin
assays (DNAse), chromatin
interactions (Hi-C)
3. Transcriptomics data: RNA-seq,
CAGE
*ROADMAP consortium, Nature, 2015
Promoter Inactive Enhancer
4. Human Atlases and Encyclopedia of Regulatory Databases
• Lots of human data available for many: 1) cell types, 2) tissues, and 3) assay types
• Much less data exist for other species
• We developed a method, HPRS (Human Projection of Regulatory Sequences), to map data in
humans to the genomes of other species
3 |
*ROADMAP consortium, Nature, 2015
5. HPRS predicts three broad categories of regulatory regions
• Promoters: more conserved, relatively easy to identify,
potentially many novel promoters of non-coding genes and
alternative transcription start sites
• Enhancers: less conserved, more tissue/cell type specific
• Other regulatory sequences: defined by
transcription factor binding sites
4 |
1. Map
2. Filter
3. Use
6. 5 |
Dataset Number
regions
Region types Tissues/cell
lines
Data Types
ENCODE 108,000 TF binding 0/12 ChIPseq
ROADMAP 5,917,129 Enhancers 48/40
ChIPseq,
DNAse I
FANTOM
Enhancers
43,011 Enhancers 135/673 CAGE
ENSEMBL 2,427,934 Enhancers 0/18
ChIPseq,
DNAse I
FANTOM
Promoters
201,802 Promoters 152/823 CAGE
Map: Selecting datasets from humans
• Datasets for mapping promoters, enhancers and
transcription factors are selected so that they
represent:
- Different tissues and cell lines
- Different biochemical assay types
7. Map: Maximizing enhancer coverage prediction
*Villar et al., 2015 Cell;160(3):554-66
• Mapping is based on inter-species conservation at two levels: 1)
primary sequence and 2) genome organization (relative locations
between regions)
• We optimized mapping parameters and mapping strategies (reciprocal
map & multimap) to recover most reference enhancers and promoters
• For example: we found lower similarity threshold resulted in higher
coverage of cattle liver reference enhancer dataset* but not specificity
6 |
8. 7 |
• Each filter step recovers a dataset with
more promoters/enhancers per Mb than
the initial baseline (whole genome without
HPRS):
Filter 1: H3K27Ac is the histone modification
mark for enhancers
Filter 2: CAGE measures bidirectional promoters
as signature for enhancers
Filter 3: Enhancer activity scored by SVM
(support vector machine)*
Filter 4: RNAseq measures active transcription
Filter 5: Number of regulatory features mapped
to the region
Filter 6: Sequence conservation (across 100
vertebrates)
Filter 7: Number of predicted transcription
factor binding sites
Filter: Seven steps for filtering mappable regulatory regions
*Lee et al., Nat Gen, 2015
9. • Started with 729,246 non-overlapping
regions in the cattle genome (the mapping
created an Universal dataset)
• Number of Villar et al* cattle liver reference
promoters (P) and enhancers (E) per 1 Mb
length was used to refine filtering
parameters
• Filtered dataset: ~7 fold enrichment in
cattle liver reference enhancers and
promoters* and ~4 fold reduction in regions
• Filtered dataset contains 70% and 79% of
enhancer and promoters in the cattle
liver reference set*
*Villar et al., 2015 Cell;160(3):554-66
8 |
Filter: Seven steps for filtering mappable regulatory regions
10. 9 |
HPRS prediction for 10 species
• We mapped 42 ROADMAP tissues
to 10 species
• Data from more biologically related
tissues produced higher coverage
(highest in liver tissue)
• Data from more evolutionarily
related species produced higher
coverage (highest in monkey –
macaca and marmoset)
• Combining multiple tissues
increased the enhancer coverage
markedly, to 65-87%
11. Enrichment of associated SNPs in regulatory regions
10 |
*Bolormaa et al., 2014, PLOS Genetics; 10 (3) e1004198
• There are 10s of millions of SNPs (Single Nucleotide
Polymorphisms (SNPs) ) in a genome:
- Commercial SNP arrays are small (~ 50,000 SNPs, i.e. 0.5% of the
total SNPs)
- SNPs affecting protein functions: minority ~ 5%
- SNPs affecting gene expression (regulatory SNPs): ~ 95%
• We tested significant SNPs for 32 traits (feed intake, growth, body
composition and reproduction), in 10,191 beef cattle*
• Substantial fold enrichment of low p-value SNPs in regulatory set v.
all other sets, including the set of SNPs 5kb upstream of protein
coding genes
12. 11 |
HPRS predicted results guide the selection of causative SNPs
13 GWAS SNP
*Karim et al., 2011, Nat. Genetics
• 13 SNPs at PLAG1 (pleomorphic
adenoma gene-1) region are
significantly associated to the cattle
stature (lower height) phenotype*
• We found 2 SNPs within promoters
and 1 SNP within an intergenic
enhancer
• The 2 SNPs at the promoter region
were validated by Karim et al.* to
change promoter activity
13. Using the HPRS dataset to understand mechanism - polled
12 |
• Two possible causal mutations of the
polled phenotype*:
A 212 base insertion/10 base deletion
mutation,
A ~80 kb duplication, at ~300 kb
away
• We found the deletion mutation is
located within a predicted enhancer and
a HAND1 transcription factor binding site
• The HAND1 deletion may lead to
downregulation of OLIG1, OLIG2 and
lincRNA2 (via distal enhancer interaction)
Hand1
Celtic deletion
Enhancer
Enhancer targets
OLIG1
OLIG1
lincRNA2
*Allais-Bonnet et al., 2013, PLOS ONE 8 e63512
14. Summary
1. The data in humans are useful to predict regulatory sequences in other
species (by HPRS mapping and filtering pipelines)
2. HPRS is a fast and economical approach, applicable when most data in a
target species are not available
3. SNPs significantly associated with phenotypes are enriched in the
predicted regulatory sequences (more enriched than traditional SNP
selection based on known coding regions)
4. HPRS results can contribute to genomics technology development, for
instance: to design a new generation causative SNP chip for large-scale
genotyping, or to predict regulatory targets as candidates for genome
editing
13 |
15. Acknowledgements
• CSIRO:
- Brian P. Dalrymple
- Juca Porto-Neto
- Ross L. Tellam
- James Kijas
- Bill Barendse
- Marina Naval-Sanchez
- Antonio Reverter
• QAFFI: Ben Hayes
• Funding: CSIRO OCE fellowship
14 |
St Lucia Campus, Brisbane, Australia
16. A machine learning tool to predict SNP effects in an enhancer
Red arrows show SNPs
• gkmSVM (gapped k mers Support Vector Machine)
scores regulatory activity by comparing enhancer
with non-enhancer regions
• deltaSVM scores were calculated for every base
across cattle ALDOB enhancer (projected from
human)
• deltaSVM scores reduced markedly at locations
overlapping transcription factor binding sites
(indicating loss of binding if mutation occurs)
15 |
*Lee et al., Nat Gen, 2015