1. Sifting the human genome for functional polymorphisms Pauline C. Ng, PhD
3. Variation around genes is most likely to contribute to phenotype. Regions: upstream, 5’UTR, coding, 3’UTR. Coding nonsynonymous SNPs: variation that causes an amino acid substitution. Change in protein function?
9. SIFT. Choosing sequences: a) database search; b) choose closely related sequences. Obtain alignment of the query protein with related proteins. For each position, calculate scaled probabilities for each amino acid substitution. Scaled probability < cutoff: affects function; > cutoff: tolerated.
11. SIFT: Calculating probabilities. p_x / p_max < 0.05 => x affects function
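A minimal sketch of the scaled-probability rule on this slide, for a single alignment position. The uniform pseudocount is a simplifying assumption: the notes say SIFT bases its pseudocounts on amino acid distributions observed in many protein alignments. The function names are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def scaled_probabilities(column, pseudocount=1.0):
    """Return p_x / p_max for every amino acid at one alignment position.

    `column` is a string holding the residue each aligned sequence has at
    this position. A uniform pseudocount (an assumption, not SIFT's scheme)
    keeps unseen amino acids from getting probability zero.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    probs = {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
    p_max = max(probs.values())
    return {aa: p / p_max for aa, p in probs.items()}

def predict(column, substitution, cutoff=0.05):
    """Apply the slide's rule: p_x / p_max < cutoff => x affects function."""
    scaled = scaled_probabilities(column)
    return "affects function" if scaled[substitution] < cutoff else "tolerated"

# A position where all 40 aligned sequences carry leucine.
conserved = "L" * 40
# A position split evenly between leucine and isoleucine.
mixed = "L" * 20 + "I" * 20

print(predict(conserved, "W"))
print(predict(mixed, "I"))
```

With only a few aligned sequences the pseudocounts dominate and nothing falls below the cutoff, which is one reason a small alignment gives weak predictions.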
12. SIFT output
Substitution  Probability  Prediction       Confidence
M24S          0.04         Affect Function  Low
S82T          0.36         Tolerated        High
V247A         0.03         Affect Function  High
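As an illustration of how the output rows relate to the 0.05 cutoff, here is a small sketch that parses the substitution notation (original residue, position, new residue) and turns a probability into a prediction. The helper names are hypothetical and are not part of SIFT itself.

```python
import re

CUTOFF = 0.05  # cutoff on the scaled probability, as stated on the slides

def parse_substitution(code):
    """Split a code such as 'M24S' into (original, position, substituted)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
    if match is None:
        raise ValueError(f"not a substitution code: {code!r}")
    return match.group(1), int(match.group(2)), match.group(3)

def classify(probability, cutoff=CUTOFF):
    """Probabilities below the cutoff are predicted to affect function."""
    return "Affect Function" if probability < cutoff else "Tolerated"

# The three rows shown on the slide.
for code, probability in [("M24S", 0.04), ("S82T", 0.36), ("V247A", 0.03)]:
    original, position, new = parse_substitution(code)
    print(code, original, position, new, classify(probability))
```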
13. Confidence is determined by the diversity of sequences in the alignment. Ideal case: diverse set of orthologous proteins. Low confidence examples: many highly identical sequences; few sequences available.
14. Case Study: LacI. Normal state: LacI keeps the lac operon repressed. Lactose present: lac operon expressed. 4000 single amino acid substitutions assayed: throughout the entire protein; both neutral and affected phenotypes (TIBS 22:334-339).
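One way to put a number on alignment diversity is the per-position information content that the notes mention ("information content without correction"). The sketch below is an assumption about how such a measure could be computed, not SIFT's actual confidence calculation; the function names are hypothetical.

```python
import math
from collections import Counter
from statistics import median

def information_content(column):
    """Uncorrected information content of one alignment column, in bits.

    log2(20) minus the Shannon entropy of the observed residues: about
    4.32 bits when every sequence has the same residue, lower when the
    column is diverse.
    """
    counts = Counter(column)
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(20) - entropy

def median_conservation(alignment):
    """Median information content over all columns of an alignment.

    `alignment` is a list of equal-length sequences without gaps.
    """
    return median(information_content(col) for col in zip(*alignment))

identical = ["ACD", "ACD", "ACD"]  # highly identical sequences
diverse = ["ACD", "GHK", "LMN"]    # every position varies

print(round(median_conservation(identical), 2))
print(round(median_conservation(diverse), 2))
```

A high median means the sequences are nearly identical, so almost any substitution looks damaging; that corresponds to the low-confidence case on the slide.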
15. Prediction on LacI substitutions. Substitutions that affect protein function: 63% predicted to affect function, 37% predicted to be tolerated (false negatives). Substitutions that give no phenotype: 72% predicted to be tolerated, 28% predicted to affect function (false positives). Total prediction accuracy: 68% (2726/4004). Pr(observe affected phenotype | predicted to be damaging) = 63%.
16. False negative error: positions not conserved among paralogues. The dimer and sugar interfaces are not conserved.
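The percentages on this slide can be cross-checked with a short calculation. This sketch takes the counts from the speaker notes (1764 substitutions that affect function, 2240 that give no phenotype) and the 63% and 72% correct-prediction rates from the chart. Because those rates are rounded, the results are approximate.

```python
def confusion_from_rates(n_affected, n_neutral, sensitivity, specificity):
    """Rebuild approximate confusion-matrix counts from rounded rates."""
    tp = sensitivity * n_affected   # affected, predicted to affect function
    fn = n_affected - tp            # affected, predicted tolerated
    tn = specificity * n_neutral    # neutral, predicted tolerated
    fp = n_neutral - tn             # neutral, predicted to affect function
    return tp, fn, tn, fp

tp, fn, tn, fp = confusion_from_rates(1764, 2240, 0.63, 0.72)

# Fraction of all 4004 substitutions predicted correctly.
accuracy = (tp + tn) / (tp + fn + tn + fp)
# Pr(observe affected phenotype | predicted to be damaging).
ppv = tp / (tp + fp)

print(f"total prediction accuracy ~ {accuracy:.2f}")
print(f"Pr(affected | predicted damaging) ~ {ppv:.2f}")
```

Both values agree with the slide to within rounding: about 68% total accuracy and roughly 63-64% for the conditional probability.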
18. SIFTing human variant databases. Substitutions involved in disease (7397 subst., 606 proteins from SWISS-PROT; predicted on 76% of proteins, 71% of subst.): 69% predicted to affect function, 31% tolerated. Putative polymorphisms (5780 nsSNPs, 3005 proteins from dbSNP; predicted on 60% of proteins, 53% of subst.): 25% predicted to affect function, 75% tolerated. nsSNPs in normal individuals (185 nsSNPs, 69 proteins from Whitehead Institute; predicted on 77% of proteins, 62% of subst.): 19% predicted to affect function, 81% tolerated.
20. Account for 5% difference in dbSNP 16 genes with a high fraction of dbSNP variants predicted to affect function 1) Substitutions found in patients 2) Substitutions mapped to nonfunctional genes/regions 3) Substitutions detected in error Supports SIFT as a prediction tool
22. Mutations in MSHR increase skin cancer risk. Mutations associated with cutaneous malignant melanoma (CMM) [1]; mutations not associated with CMM [1-3]. Substitutions listed with predictions (Tolerated / Affect function): R151C, L60V, R151C, D294H, R160W; L60V, R163Q, D84E. References: [1] Am. J. Hum. Genet. 66:176-186; [2] J. Invest. Dermatol. 116:224-229; [3] J. Invest. Dermatol. 112:512-513.
26. 16 genes with a high fraction of dbSNP variants predicted to affect function 1) Substitutions found in patients 2) Substitutions mapped to nonfunctional genes or regions 3) Substitutions detected in error
28. 16 genes with a high fraction of dbSNP variants predicted to affect function 1) Substitutions found in patients 2) Substitutions mapped to nonfunctional genes/regions 3) Substitutions detected in error Changes found in patients Confirms SIFT prediction and its sensitivity Unlikely to affect human health Irrelevant to human health
29. Comparison of Prediction Tools. [Chart: percentage of substitutions predicted to affect function or not to affect function, for Variagenics, SIFT, and EMBL on LacI, disease substitutions, SNP database polymorphisms, and normal individuals.] SIFT has similar prediction accuracy to tools that use structure.
Hemoglobin is a tetramer of 2 alpha and 2 beta subunits. Structure on left from: Daniel J. Harrington, Kazuhiko Adachi, William E. Royer, Jr., "The High Resolution Crystal Structure of Deoxyhemoglobin S," The Journal of Molecular Biology 272(3):398-407, September 1997. http://web.wi.mit.edu/proteins/pub/BOA-2000/left.htm
Start off with this slide by defining SNPs. 529 SNP/Mb in exons, 921 SNP/Mb in introns. Nucleotide diversity.
40% of proteins belong to a family; 70% have at least one other match.
(information content without correction) “In general,” “there are some exceptions”
Pseudocounts are based on prior knowledge of the most common amino acid distributions observed in a database of many protein alignments. Probabilities are calculated for every amino acid at every position. Position / amino acids allowed: 5Y / all 20.
Ideal case: a variety of amino acids have had the time to evolve at positions not important for function. Many highly identical sequences: e.g. viral proteins, Igs (can be fixed by going to a smaller database).
1764 substitutions that affect function; 2240 substitutions that give no phenotype. Mention that intermediate phenotypes are grouped with null. 15% better total prediction accuracy; 10% increase in experimental prediction accuracy.
white: tolerate >= 6 substitutions in assay; red: positions with high false positive error
Which genes are in the Whitehead SNPs: candidate genes for coronary artery disease, type II diabetes, schizophrenia.
Substitutions were first identified in patients and then deposited into dbSNP. Thus it makes sense that the substitutions should be predicted as damaging.
When purine repressor, a LacI paralogue, was used for prediction on LacI, Variagenics correctly predicted only 19% of the substitutions that have an effect as damaging.
There are two genetic approaches that make use of the variation around genes to find disease loci. Haplotypes may be stronger predictors of phenotype (mirvana, Chakravarti). A haplotype is a set of alleles grouped together, i.e. a group of SNPs that are linked together. tagSNPs are most informative. Neil Risch: reduced positive with direct approach.
Is the direct approach possible? Bastiaan Hoogendoorn used reporter gene assays in cell lines. We have used denaturing high performance liquid chromatography to screen the first 500 bp of the 5' flanking region of 170 opportunistically selected genes identified from the Eukaryotic Promoter Database (EPD) for common polymorphisms. Using a screening set of 16 chromosomes, single-nucleotide polymorphisms were found in approximately 35% of genes. It was attempted to clone each of these promoters into a T-vector constructed from the reporter gene vector pGL3. The relative ability of each promoter haplotype to promote transcription of the luciferase gene was tested in each of three human cell lines (HEK293, JEG and TE671) using a co-transfected SEAP-CMV plasmid as a control. The findings suggest that around a third of promoter variants may alter gene expression to a functionally relevant extent.
Causal variant may not have been identified. 80% of common variants identified in Europeans, 50% in Africans (Nickerson). Rare variants. Some genes have no coverage: there may be no nsSNP, or it has not yet been identified and deposited in dbSNP.
dbSNP build 120: 3.5 million double-hit SNPs and SNPs with frequencies