Genome Wide Methodologies and Future Perspectives

Slide notes loosely follow what was presented.

Published in: Science

  1. 1. Genome Wide Methodologies and Future Perspectives Brian Krueger, PhD Duke University Center for Human Genome Variation
  2. 2. History of Genetic Linkage • Mendel’s Laws – Law of segregation • Each parent randomly passes one of two alleles to offspring – Law of Independent Assortment • Separate genes for separate traits are passed independently to offspring • Traits should appear in offspring in the ratio of 9:3:3:1 – Laws hold true for genes on different chromosomes or genes located far away from one another • Linkage – Bateson and Punnett quickly found traits that didn’t assort independently – Thomas Hunt Morgan and his student Alfred Sturtevant found that recombination frequency is a good predictor of distance between genes • Genes that are inherited together must be closer to one another – linked • Generated the first linkage maps – Serves as an important basis for understanding genetic association studies
  3. 3. Linkage Studies • Model Organisms – Fruit flies, plants, etc. – Extremely important for understanding human genetics – Fruit flies can produce new generations of 400+ offspring approximately every week! • Can very quickly understand the genetics of trait heritability • Familial Linkage Studies – Require multiple generations – Take decades to develop – Complicated by family participation • Association studies – Subtly different from linkage studies – Try to apply knowledge of familial linkage to entire populations
  4. 4. Genome Wide Association Studies • GWA studies – Aim to find genetic variants that are associated with traits – Typically used to elucidate complex disease traits – Focus on SNPs, Indels, CNVs – Most often Case/Control Studies • SNP (Single Nucleotide Polymorphism) – Change in a single nucleotide position • Indel (Insertion/Deletion) – Describes the insertion or deletion of nucleotides • CNV (Copy number variations) – Large deletions or duplications of genetic material
  5. 5. GWA Study History • Human Genome Project (1990-2000) – Decade long international project to determine the complete human genome sequence – Provided the reference genome for future research on genome variation • Human HapMap (2002-2009) – Sequencing whole genomes is expensive – Needed a shortcut to understand how variation contributes to disease – Mapped millions of common known SNPs in 269 individuals – Theory that common SNPs are inherited and could be predictive of associated disease – Determine how SNPs from case/control studies associate with human disease
  6. 6. Defining Association • Variants are not always causal! – SNPs sometimes only serve as markers – Can play absolutely no role in the disease and even be located on different chromosomes from the gene actually responsible for the phenotype • Population stratification – Variants differ by population – Variants important markers of disease in one population or ethnicity may not be effective markers in another – For GWA studies to be effective predictors in multiple populations, large datasets for each ethnicity must be obtained
  7. 7. GWAS SNP Genotyping • Bead array genotyping – Uses a chip containing beads with covalently attached baits – Baits hybridized to fragmented DNA – Baits SPECIFIC for the DNA just upstream of a SNP – Base extension with fluorescently labeled bases allows interrogation of the SNP (each base has a different color!) – A single bead chip can assay millions of SNPs – Colorimetric output plotted • Blue indicates homozygous for one version of the SNP - CC • Purple is heterozygous - CA • Red homozygous for the other version of the SNP - AA [Figure: genotype cluster plots for SNP rs1372493; one panel plots Intensity (B) against Intensity (A), the other plots Norm R against Norm Theta; plot annotations: 2317, 834, 74]
  8. 8. GWAS SNP Genotyping and Validation • Realtime PCR – Use specific PCR probes to verify SNPs – Good for validating a handful of SNPs at a time • Mass Array – Use mass spec to find SNPs – Detected by looking at fragment weight differences – Good for detecting or validating a large number of SNPs rapidly • Sanger sequencing – Gold standard validation method – Can determine the SNP at its exact position – Very robust
  9. 9. GWA Study History • To this point in time, the power of most GWA studies was lacking – GWA not really genome wide – Looked at common variants across genome – Missed rare variants and not always descriptive of disease causation • Whole Genome Sequencing (WGS) – Actually assays the entire genome – Discovers all variants – Prohibitively costly before 2008 – Current cost of WGS ~$4000 • 1000 Genomes Project (2008-) – Facilitated by plummeting sequencing costs and technological advancements – Goal to fully sequence the genomes of 1000 healthy individuals to provide a true picture of genome wide variation
  10. 10. Second Generation Sequencing • Developed to increase throughput of Sanger sequencing • Can sequence many molecules in parallel – Does not require homogeneous input – Sequenced as clusters • Sequencing by synthesis – Bases are added, signals scanned, and then washed – Cycle repeated (30-2000x)
  11. 11. 2nd Gen: Sequencing by Synthesis Overview [Figure: workflow] Genomic DNA → Fragmented DNA → Ligate Adaptors → Generate Clusters (On Flowcell or Beads) → Add Bases → Detect Signals → Repeat hundreds of times on millions of clusters
  12. 12. Flavors of Sequencing • Whole Genome Sequencing – Obtain whole blood or tissue sample – Create sequencing libraries of all DNA fragments • Whole Exome Sequencing – Utilizes a selection protocol – Attach complementary RNA strands to beads – Fish out ONLY coding DNA sequences – Create sequencing libraries from enriched DNA – Reduces cost significantly • Custom Capture – Same protocol as Exome sequencing – Only target desired DNA sequences • Amplicon Sequencing – Use PCR to amplify target DNA – Sequence amplified DNA (Amplicon)
  13. 13. NGS Study Designs for Gene Discovery Multiplex families Case-control studies Trio sequencing of sporadic diseases
  14. 14. De novo Mutation Calling/Filtering [Figure: filtering workflow] Variant calling (individual variant calling, multi-sample variant calling) → Cross-checking public databases (Exome Variant Server, 6500 exome sequenced individuals) → Visual inspection → Sanger sequencing confirmation
  15. 15. Detecting Copy Number Variants ERDS (Estimation by Read Depth with SNVs) The average read depth (RD) of every 2-kb window was calculated, followed by GC correction. A paired Hidden Markov model was applied to infer the copy number of every window by utilizing both RD information and heterozygosity information. [Figure: read depth across windows, with example regions labeled homozygous deletion, heterozygous deletion, and duplication]
  16. 16. Illumina • Uses a flow cell • Cluster generated on slide via bridge amplification • Sequencing by synthesis – Performed by flowing labeled bases over flow cell – 4 pictures taken (one for each base) – Cluster color determined at each cycle allows interrogation of sequence • Advantages – Low cost per base – Very high throughput • Limitations – High cost per experiment – Short read length (30-150bp) – Acquired a company that uses new tech to reach read lengths of 2-10Kb Schadt et al 2010 HMG
  17. 17. Ion Torrent • Emulsion PCR is used to generate clusters on a bead • Sequencing by synthesis – Similar in principle to pyrosequencing, which relies on release of pyrophosphate for detection – Instead of a visual cue, system senses the release of H+ as each base is flowed over the beads • Advantages – Short run time – Does not require modified bases – Longer read length (200bp) • Limitations – Low data output – High homopolymer error rate
  18. 18. Third Generation Sequencing • Defined as single molecule sequencing • Less complex sample prep • Much longer read length – SGS Short read length a huge disadvantage for de novo sequencing applications • Two categories – Sequencing by synthesis – Direct sequencing • Passing molecule through a nanopore • Using atomic force microscopy • Bleeding edge technology – Many technical hurdles – Currently very high error rates
  19. 19. Pacific Biosciences • Utilizes single molecule sequencing by synthesis • Extremely complex system – Each well contains a single DNA molecule and an immobilized polymerase – No reagent washing – Employs confocal microscopy to only detect fluorescence at the polymerase • Advantages – Very long read length (1-15kb) – Low complexity sample prep – Very fast data generation (real time) • Disadvantages – Prone to sequencing errors (~15% error rate) – Company on the verge of bankruptcy
  20. 20. Third/Second Generation Sequencing • Currently only one viable high throughput long read sequencing platform – PacBio system has a 15% error rate – Need long reads for many applications from de novo sequencing to haplotyping • Second generation sequencers high throughput and accurate – Short reads are hard to assemble and leave gaps in repetitive sequences • Can use both as a highly accurate and extremely powerful tool for de novo sequencing applications – Use PacBio assembly as a scaffold – Correct errors by aligning HiSeq reads on top – Effective error rate of 0.1% – Expensive but extremely fast and accurate compared to other methods Koren et al 2012 Nature Biotechnology
  21. 21. Future: Nanopore Sequencing • Leading candidate is Oxford Nanopore • Concept – Detect flow of electrons through the pore – Each base causes a detectable change in the current – Results in direct sequencing – Theoretically could be used to sequence RNA and protein too • Advantages – Long read length – Plug and play – Easily scalable • Disadvantages – No hard data yet – No specific release date (Image credit: John MacNeill/TechnologyReview)
  22. 22. Future: Direct sequencing • Concept stage techniques – Significant technical hurdles to overcome – Mostly proof of concept experiments • IBM DNA Transistor – Bases read as single stranded DNA passes through the transistor – Gold bands represent metal, gray bands are the dielectric (Image credit: IBM) • Atomic force microscopy sequencing – Use AFM tip to detect each base of single stranded DNA (Image credit: Lee et al US PAT 20040124084)
  23. 23. Sequencing Applications • Old techniques which used to take days or years to perform can now be completed in hours • Next generation sequencing has opened a new door for addressing very complicated genetic questions – Has huge potential to revolutionize human healthcare – Survey complex tumor types – Research into macro and micro community genomics – Reveal evolutionary history
  24. 24. De novo Sequencing • Human genome took 10 years to complete and cost $3 billion – Done by laboriously cloning overlapping segments of the human genome into BAC (bacterial artificial chromosome) libraries and Sanger sequencing each one – Genome assembled using computers to line up overlapping sequences • Current estimate is around $4000 – Can be completed in a week – Companies like Complete Genomics say they have already sequenced thousands of human genomes • Future – Long read sequencers will make agricultural sequencing more viable – Whole genome sequencing for human diagnostics will become routine – Increasing the catalog of organismal genomes will improve our understanding of evolution and development
  25. 25. Genome Mutation Analysis • Previously done by completing complicated and time consuming familial linkage studies and targeted Sanger sequencing • Next generation sequencing can look at every gene at once – Can produce a genetic map of the complete genome – Used to detect genetic polymorphisms – See every possible mutation • Future – Whole genome sequence analysis – Targeted genome sequencing analysis using predetermined sequence selection arrays (ex: Exome Enrichment)
  26. 26. Pharmacogenetics • Very hot topic in the biotech and insurance industries • Use genetic typing to guess how a person might respond to different drug treatments • Currently relies on microarrays • NGS could provide significantly more information at more loci – Microarrays only look at a handful of polymorphisms – Current NGS approaches port the microarray technique to enrich pools for sequencing • Future – As the catalog of human genomes increases, it will be easier to calculate responses to treatment before drugs are administered Gauthier et al 2007 Cancer Cell
  27. 27. Epigenetics • Defined as heritable genetic information that is not coded in the DNA bases – DNA methylation – Histone modifications • Previous mechanisms for detecting these Chromatin or DNA modifications relied on targeted probing – ChIP-PCR – Bisulfite sequencing – Footprinting assays • Next generation sequencing changed everything – Whole genome methylation mapping (MAP-IT) – Whole genome histone modification and protein binding mapping (ChIP-Seq - acetylation, methylation, etc) • ENCODE project
  28. 28. ENCyclOpedia of Dna Elements (ENCODE) • International project – Follow up to the human genome project • Only 2% of the human genome codes for protein – Creating and maintaining DNA is biochemically expensive – What’s the other 98% of the genome doing? • ENCODE goals – Determine the functional elements of the human genome – Protein Coding – Non-Coding RNA – mRNA Expression – Regulatory protein binding sites – Histone modifications • Preliminary estimates show that 80% of human DNA is functional!
  29. 29. Transcriptome/Expression Analysis • Gene expression analysis is important for disease discovery and cancer diagnosis • Expression analysis first relied on Northern blotting followed by DNA microarrays – Both cases require a probe – Need to “know” what you are looking for – Low resolution screening • Next generation approaches screen the entire transcriptome (RNA-Seq) – Single base resolution of expression – Can see level of expression and also visualize mutations in expressed sequences • Future – Important for diagnosing/treating cancer and heritable diseases
  30. 30. Phenotypic Correlation • NGS generates huge datasets with 85-99.9% base accuracy – Must determine which signals are real, and which are noise/errors – Most promising hits are validated by other assays (Sanger, qRT, Mass Spec) – How do we determine which hits to validate? • Currently have very small datasets of limited utility, even in pharmacogenetics • Validated hits can be distractions – Tumor diversity presents multiple escape routes during targeted treatment – See NYTimes series on whole genome sequencing: http://nyti.ms/No4fgd • Future – Require large validated datasets that are ethnically and geographically diverse
  31. 31. Metagenomics • Used to survey macro and micro environments – Microbial communities (Soil/Gut) – Tumors – Plant communities – Coral reef ecosystems • Previous techniques coupled mtDNA or ribosomal Sanger sequencing with BLAST analysis – Limited by number of sequenced species – Can determine who, but not what is going on • NGS approaches now being used to determine exactly what organisms are present and how they interact – Can get expression data and link it back to community groups – Survey community diversity
  32. 32. Data • Absolutely the largest roadblock for next generation sequencing • Terabytes of data are useless if we can’t efficiently analyze the data • How long should data be kept? – Depends on application • Human Diagnostic sequencing? • Research sequencing? • Where should data be kept and processed? – Local or Cloud (Amazon, etc)? – Cost of infrastructure vs cost of cloud service – Security issues • Future – Cloud based solutions will become more attractive
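
The 9:3:3:1 ratio quoted on slide 2 can be checked by enumerating a dihybrid cross. The sketch below is illustrative only: it assumes two unlinked genes, complete dominance, and two heterozygous parents (`AaBb` x `AaBb`); the function names are made up for this example and are not from the slides.

```python
from collections import Counter
from itertools import product


def gametes(genotype):
    """Return the four gametes of a two-gene genotype such as 'AaBb'.

    Assumes independent assortment: one allele of each gene per gamete.
    """
    gene1, gene2 = genotype[:2], genotype[2:]
    return [a + b for a, b in product(gene1, gene2)]


def dihybrid_phenotype_counts(parent1="AaBb", parent2="AaBb"):
    """Count offspring phenotypes over all 16 gamete combinations.

    Uppercase alleles are treated as fully dominant (an assumption).
    """
    counts = Counter()
    for g1, g2 in product(gametes(parent1), gametes(parent2)):
        # A trait shows the dominant form if either gamete carries
        # the uppercase allele for that gene.
        trait_a = "A" if "A" in (g1[0], g2[0]) else "a"
        trait_b = "B" if "B" in (g1[1], g2[1]) else "b"
        counts[trait_a + trait_b] += 1
    return dict(counts)


if __name__ == "__main__":
    print(dihybrid_phenotype_counts())
```

Linked genes, as described under "Linkage" on the same slide, would break the independence assumption in `gametes`, which is why their offspring ratios deviate from 9:3:3:1.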
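
Slide 4 describes GWA studies as mostly case/control comparisons of variants such as SNPs. One simple way to test a single SNP is a chi-square test on a 2x2 table of allele counts. The sketch below assumes that approach; the slides do not say which statistic was used, and the function name and example counts are hypothetical.

```python
import math


def allelic_chi_square(case_minor, case_major, control_minor, control_major):
    """Chi-square test (1 degree of freedom) on a 2x2 allele-count table.

    Returns (chi2, p_value). The counts are allele counts, not people.
    """
    table = [[case_minor, case_major], [control_minor, control_major]]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            if expected == 0:
                raise ValueError("a row or column of the table is empty")
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    # Hypothetical counts: 60 minor / 140 major alleles in cases,
    # 40 minor / 160 major alleles in controls.
    print(allelic_chi_square(60, 140, 40, 160))
```

A real study repeats this for every assayed SNP, so the p-value threshold has to be corrected for multiple testing. As slide 6 notes, a significant SNP may only be a marker rather than the causal variant.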
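
The cluster plot on slide 7 uses Norm Theta and Norm R axes. A common polar-coordinate convention for two-channel bead array data is theta = (2/pi) * arctan(B/A) and R = A + B. The sketch below assumes that convention and uses made-up thresholds (`min_r`, `low`, `high`) to turn one pair of intensities into a genotype call; production software fits clusters per SNP rather than using fixed cutoffs.

```python
import math


def polar_transform(intensity_a, intensity_b):
    """Convert two channel intensities to (theta, r).

    theta is near 0 when only channel A is bright and near 1 when only
    channel B is bright; r is the combined signal strength.
    """
    a = max(intensity_a, 0.0)  # clamp negative background-corrected values
    b = max(intensity_b, 0.0)
    theta = (2 / math.pi) * math.atan2(b, a)
    return theta, a + b


def call_genotype(intensity_a, intensity_b, min_r=1000.0, low=0.25, high=0.75):
    """Naive genotype call from fixed theta thresholds (all hypothetical)."""
    theta, r = polar_transform(intensity_a, intensity_b)
    if r < min_r:
        return "no call"  # signal too weak to trust
    if theta < low:
        return "AA"  # homozygous for the channel-A allele
    if theta > high:
        return "BB"  # homozygous for the channel-B allele
    return "AB"  # heterozygous


if __name__ == "__main__":
    for a, b in [(10000, 500), (5000, 5000), (300, 9000), (100, 200)]:
        print(a, b, call_genotype(a, b))
```

Here "AA", "AB" and "BB" stand for the three clusters on the slide (CC, CA and AA for that particular SNP).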
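
Slide 15 describes ERDS as averaging read depth over 2-kb windows and then inferring copy number with a paired Hidden Markov model. The sketch below covers only the windowing step and replaces the HMM with naive rounding of the depth ratio. That is a deliberate simplification, not the ERDS algorithm: it does no GC correction and ignores heterozygosity. All names and the example depths are hypothetical.

```python
STATE_NAMES = {
    0: "homozygous deletion",
    1: "heterozygous deletion",
    2: "normal",
}


def window_mean_depth(per_base_depth, window_size=2000):
    """Mean read depth of consecutive, non-overlapping windows."""
    means = []
    for start in range(0, len(per_base_depth), window_size):
        window = per_base_depth[start:start + window_size]
        means.append(sum(window) / len(window))
    return means


def naive_copy_number(window_means, diploid_depth):
    """Round each window's depth ratio to an integer copy number.

    diploid_depth is the depth expected for two copies and must be > 0.
    Anything above two copies is labeled a duplication.
    """
    if diploid_depth <= 0:
        raise ValueError("diploid_depth must be positive")
    calls = []
    for mean in window_means:
        copies = round(2 * mean / diploid_depth)
        calls.append((copies, STATE_NAMES.get(copies, "duplication")))
    return calls


if __name__ == "__main__":
    # Toy region: normal, heterozygous deletion, homozygous deletion,
    # duplication, each exactly one 2-kb window long.
    depth = [30] * 2000 + [15] * 2000 + [0] * 2000 + [45] * 2000
    print(naive_copy_number(window_mean_depth(depth), diploid_depth=30))
```

Real read depth is noisy, which is why the approach on the slide uses a Hidden Markov model across neighboring windows instead of calling each window independently.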