The Evolution of Genome Data


                Deanna M. Church, NCBI




@deannachurch
Collins FS et al, 1998




   Throughput: 500 Mb/year
     Cost: < $0.25 per base
Variation: 100,000 SNPs mapped
[Figure: Twenty-Two Years of Growth: NCBI Data and User Services, 1989-2011. Left axis: GenBank base pairs (millions), 0 to 140,000. Right axis: users per weekday (average), 0 to 2,500,000. The plot is annotated with NCBI resources as they were introduced, from BLAST, GenBank, and Entrez at the start of the timeline to ClinVar, GTR, and the Genome Remapping Service at the end.]
Steve Sherry, NCBI

[Figure: NCBI dbSNP database growth, human variations, 1999-2011. Upper panel: non-redundant annotations, in millions of rs-ids (0 to 60), split into SNP, STR & indel, and ambiguous mapping. Lower panel: submissions by project, in millions of submissions (0 to 175), split into 1000 Genomes, HapMap, TSC, and other projects.]
dbSNP build 135. November 2011
Kidd et al, 2007 APOBEC cluster




Black: Deletion
White: Insertion
http://www.ncbi.nlm.nih.gov/dbvar
Church et al., 2011 PLoS




http://genomereference.org
GRC Beginnings


       Distributed data

    Old Assembly Model

Genome not in INSDC Database
Build sequence contigs based on contigs
defined in TPF.
 Check for orientation consistencies
 Select switch points
 Instantiate sequence for further analysis


                 Switch point




                      Consensus sequence
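
The contig-building steps above can be sketched in a few lines of Python. This is only an illustration, not the GRC pipeline: the `Component` record, its field names, and the convention that switch points are stored as a 1-based start and end on the oriented clone sequence are all assumptions made for the example.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass
class Component:
    """One clone on the tiling path (hypothetical record layout)."""
    accession: str    # accession.version of the clone sequence
    sequence: str     # sequence as deposited
    orientation: str  # "+" or "-" relative to the contig
    start: int        # 1-based first base used, on the oriented sequence
    end: int          # 1-based last base used (the switch point)


def build_contig(tiling_path):
    """Instantiate a contig sequence from an ordered tiling path."""
    pieces = []
    for comp in tiling_path:
        # Check orientation before using the component.
        if comp.orientation not in ("+", "-"):
            raise ValueError(f"{comp.accession}: bad orientation {comp.orientation!r}")
        seq = comp.sequence if comp.orientation == "+" else reverse_complement(comp.sequence)
        # The switch points must fall inside the component.
        if not 1 <= comp.start <= comp.end <= len(seq):
            raise ValueError(f"{comp.accession}: switch points outside component")
        pieces.append(seq[comp.start - 1:comp.end])
    return "".join(pieces)


# Toy data, invented for the example.
path = [
    Component("CLONE_A.1", "AAAACCCC", "+", 1, 6),
    Component("CLONE_B.1", "TTGGGG", "-", 3, 6),
]
contig = build_contig(path)
```

Here the second clone is reverse-complemented before its switch points are applied, so each component contributes exactly one stretch of sequence to the contig.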
Community Input
Distributed data
      Centralized Data

    Old Assembly Model

Genome not in INSDC Database
Large-Scale Variation Complicates Genome Assembly

         Sequences from haplotype 1
         Sequences from haplotype 2




Old Assembly model: compress into a consensus



New Assembly model: represent both haplotypes
UGT2B17 Region




NCBI36 (hg18)

NCBI36 NC_000004.10 (chr4) Tiling Path (left to right)
  AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7
  Genes shown: TMPRSS11E, TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path (left to right)
  AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7
  Gene shown: TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus) Tiling Path (left to right)
  AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7
  Gene shown: TMPRSS11E2



Xue Y et al, 2008
GRCh37 (hg19)

Regions with alternate loci: UGT2B17, MHC, MAPT
7 alternate haplotypes at the MHC

Alternate loci released as:
  FASTA
  AGP
  Alignment to chromosome

Assembly (e.g. GRCh37)
[Diagram: assembly model. Labels: PAR; Primary Assembly; Non-nuclear assembly unit (e.g. MT); Genomic Regions (MHC, UGT2B17, MAPT); alternate loci units ALT 1 through ALT 9.]
Richa Agarwala




MHC Alternate locus
  Alignment to chr6
Oh No! Not a new version of the human genome!




Assembly (e.g. GRCh37.p5)
[Diagram: assembly model with patches. Labels: PAR; Primary Assembly; Non-nuclear assembly unit (e.g. MT); Genomic Regions (MHC, UGT2B17, MAPT, ABO, SMA, PECAM1, …); alternate loci units ALT 1 through ALT 9; Patches.]
TBC1D3C         TBC1D3   TBC1D3H




                TBC1D3C




Myo19 region (17q21)
70 fix patches: the chromosome will be updated in GRCh38
  (adds >1 Mb of novel sequence to the assembly)

71 novel patches: additional sequence added
  (adds >800 kb of novel sequence to the assembly)

Patches are released quarterly.
Distributed data
      Centralized Data
    Old Assembly Model
   Updated Assembly Model
Genome not in INSDC Database
  Genome in INSDC Database
Data Archives




                     GenBank



   Data in a common format
   Data in a single location (and mirrored)
   Most quality checked prior to deposition
   Robust data tracking mechanism (accession.version)
   Data owned by submitter
Data tracking

ABC14-1065514J1
                Date       Phase   Gaps      Length

FP565796.1   21-Oct-2009    1       1

FP565796.2   14-Oct-2010    1       0

FP565796.3   07-Nov-2010    3       0
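
The accession.version scheme shown above lends itself to a small worked example. The sketch below splits an identifier into its stable accession and its integer version, then keeps the highest version seen for each accession. The function names are invented for this illustration and are not part of any NCBI toolkit.

```python
from typing import NamedTuple


class AccVer(NamedTuple):
    accession: str
    version: int


def parse_accession_version(text: str) -> AccVer:
    """Split 'FP565796.3' into ('FP565796', 3)."""
    accession, sep, version = text.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version: {text!r}")
    return AccVer(accession, int(version))


def latest_versions(identifiers):
    """Map each accession to its highest-numbered version."""
    latest = {}
    for text in identifiers:
        rec = parse_accession_version(text)
        current = latest.get(rec.accession)
        if current is None or rec.version > current.version:
            latest[rec.accession] = rec
    return latest


# The three versions of the clone record listed on the slide.
history = ["FP565796.1", "FP565796.2", "FP565796.3"]
newest = latest_versions(history)["FP565796"]
```

Because the accession never changes and the version only increases when the sequence changes, an analysis that records the full accession.version can always be traced back to the exact sequence it used.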
Mouse chrX: 34,800,000-34,890,000

NC_000086 versions: .1, .2, .3, .4, .5, .6, .7
CM001013 versions: .1, .2
Mouse chrX: 35,000,000-36,000,000
           MGSCv3       MGSCv36




                    X
What’s in a name?

GRCh37
hg19

               Zv7
               danRer5

  MGSCv37
mm8
    NCBIM37
By any other name…




chr21:8,913,216-9,246,964
By any other name…




Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
hg19
               GRCh37
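
The point of these slides is that a coordinate range means nothing until it is tied to an assembly. A minimal sketch of that idea follows: a location type that cannot be built without an assembly name, plus a small alias table. The `Location` class, the `parse_location` helper, and the choice of the GRC/Zv name as the canonical form are assumptions for this example. Only the two alias pairs shown on the slides (hg19/GRCh37 and danRer5/Zv7) are included.

```python
import re
from dataclasses import dataclass

# UCSC-style name -> assembly name used as canonical in this sketch.
ASSEMBLY_ALIASES = {"hg19": "GRCh37", "danRer5": "Zv7"}

_REGION = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


@dataclass(frozen=True)
class Location:
    assembly: str
    chrom: str
    start: int
    end: int


def parse_location(assembly: str, region: str) -> Location:
    """Parse 'chr21:8,913,216-9,246,964' in the context of an assembly."""
    match = _REGION.match(region.strip())
    if match is None:
        raise ValueError(f"cannot parse region: {region!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError("start is greater than end")
    canonical = ASSEMBLY_ALIASES.get(assembly, assembly)
    return Location(canonical, match["chrom"], start, end)


zebrafish = parse_location("danRer5", "chr21:8,913,216-9,246,964")
human = parse_location("hg19", "chr21:8,913,216-9,246,964")
```

The two results carry identical chromosome names and coordinates but compare as different locations, which is exactly the distinction the bare string `chr21:8,913,216-9,246,964` loses.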




http://www.ncbi.nlm.nih.gov/genome/assembly
Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17

  Unit                               GenBank (GCA)       RefSeq (GCF)
  Primary Assembly                   GCA_000001305.1     GCF_000001305.13
  Non-nuclear assembly unit (MT)     GCA_000006015.1     GCF_000006015.1
  ALT 1                              GCA_000001315.1     GCF_000001315.1
  ALT 2                              GCA_000001325.1     GCF_000001325.2
  ALT 3                              GCA_000001335.1     GCF_000001335.1
  ALT 4                              GCA_000001345.1     GCF_000001345.1
  ALT 5                              GCA_000001355.1     GCF_000001355.1
  ALT 6                              GCA_000001365.1     GCF_000001365.2
  ALT 7                              GCA_000001375.1     GCF_000001375.1
  ALT 8                              GCA_000001385.1     GCF_000001385.1
  ALT 9                              GCA_000001395.1     GCF_000001395.1
  Patches                            GCA_000005045.5     GCF_000005045.4
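
The assembly accessions above follow a regular shape, which a short sketch can make explicit: a GCA (GenBank) or GCF (RefSeq) prefix, a numeric identifier, and a version. The helper names are invented for this example, and the nine-digit pattern is an assumption read off the accessions on the slide rather than a statement of the official format.

```python
import re
from typing import NamedTuple

# Pattern assumed from the accessions shown on the slide.
_ASSEMBLY_ACC = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")


class AssemblyAccession(NamedTuple):
    prefix: str   # "GCA" (GenBank) or "GCF" (RefSeq)
    number: str   # numeric identifier, kept as text to preserve zeros
    version: int


def parse_assembly_accession(text: str) -> AssemblyAccession:
    match = _ASSEMBLY_ACC.match(text.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {text!r}")
    prefix, number, version = match.groups()
    return AssemblyAccession(prefix, number, int(version))


def is_paired(first: str, second: str) -> bool:
    """True if one accession is GCA, the other GCF, and the numbers match."""
    a = parse_assembly_accession(first)
    b = parse_assembly_accession(second)
    return {a.prefix, b.prefix} == {"GCA", "GCF"} and a.number == b.number


full_assembly_paired = is_paired("GCA_000001405.6", "GCF_000001405.17")
```

Note that the versions of a paired GenBank and RefSeq assembly move independently (.6 and .17 for GRCh37.p5 above), so pairing is checked on the numeric identifier only.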
GenBank vs RefSeq

                      GenBank               RefSeq
  Ownership           Submitter owned       RefSeq owned
  Redundancy          Redundant             Non-redundant
  Maintenance         Updated rarely        Curated
  INSDC               INSDC                 Not INSDC

Example: BRCA1
                      GenBank               RefSeq
  Genomic records     83                    3
  mRNA records        31                    5
  RNA records                               1
  Protein records     27                    5
RefSeq for Assemblies

Typical assembly edits
  Addition of non-nuclear (e.g. MT) assembly units
  Removal of contamination
    Drop unlocalized/unplaced scaffolds
    Mask contamination that is placed on chromosome
http://www.ncbi.nlm.nih.gov/genome
Understanding relationships between
                 assemblies using alignments




First Pass   Reciprocal best hit




Second Pass        Non-reciprocal, duplicative hits
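
The two-pass idea can be illustrated with a toy example. The sketch below separates a list of scored hits between an old and a new assembly into reciprocal best hits (first pass) and everything else (second pass). It shows the reciprocal-best-hit concept only and is not NCBI's alignment procedure. The tuple layout `(old_id, new_id, score)` and the scores are invented for the illustration.

```python
def best_by(hits, index):
    """Keep the highest-scoring hit for each value at position `index`."""
    best = {}
    for hit in hits:
        key = hit[index]
        if key not in best or hit[2] > best[key][2]:
            best[key] = hit
    return best


def split_passes(hits):
    """Return (first_pass, second_pass) lists of (old_id, new_id, score)."""
    best_for_old = best_by(hits, 0)
    best_for_new = best_by(hits, 1)
    first_pass, second_pass = [], []
    for hit in hits:
        old_id, new_id, _score = hit
        # Reciprocal: this hit is the best one from both sides.
        if best_for_old[old_id] == hit and best_for_new[new_id] == hit:
            first_pass.append(hit)
        else:
            second_pass.append(hit)
    return first_pass, second_pass


# Toy hits: old1 also matches new2, but less well than old2 does.
hits = [
    ("old1", "new1", 100),
    ("old1", "new2", 60),
    ("old2", "new2", 90),
]
first_pass, second_pass = split_passes(hits)
```

In this toy data the weaker `old1`/`new2` hit lands in the second pass, which is where duplicated sequence shows up.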
NCBI36




                                            GRCh37.p5




No second pass alignments in GRCh37.p5

http://www.ncbi.nlm.nih.gov/tools/gbench/
Genome Data is MORE than just the Genome
  ATGCGTGCAAAATGCAGTGAGT
   ATGCGTGCAAAATGCAGTGAGT
    ATGCGTGCAAAATGCAGTGAGT
      ATGCGTGCAAAATGCAGTGAGT




NM_000336.2:c.800C>T
NC_000001.10:g.(?_20700513)_(21062644_?)del
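
Variant names like the ones above tie a change to a specific sequence accession and version. As a rough sketch, the regular expression below pulls the parts out of the simplest case, a single-base substitution such as `NM_000336.2:c.800C>T`. It is an assumption-laden illustration, not an HGVS parser: it deliberately does not handle other forms, including the deletion with uncertain breakpoints shown above, and the names `parse_substitution` and `Substitution` are invented here.

```python
import re
from typing import NamedTuple, Optional

# Simple substitutions only, on coding (c.) or genomic (g.) coordinates.
_SUBSTITUTION = re.compile(
    r"^(?P<accession>[A-Z]{2}_\d+)\.(?P<version>\d+):"
    r"(?P<kind>[cg])\.(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


class Substitution(NamedTuple):
    accession: str
    version: int
    kind: str       # "c" or "g"
    position: int
    ref: str
    alt: str


def parse_substitution(name: str) -> Optional[Substitution]:
    """Return the parts of a simple substitution, or None if it is not one."""
    match = _SUBSTITUTION.match(name.strip())
    if match is None:
        return None
    return Substitution(
        match["accession"],
        int(match["version"]),
        match["kind"],
        int(match["position"]),
        match["ref"],
        match["alt"],
    )


snv = parse_substitution("NM_000336.2:c.800C>T")
deletion = parse_substitution("NC_000001.10:g.(?_20700513)_(21062644_?)del")
```

The second call returns `None`, a reminder that structural variants with uncertain breakpoints need richer handling than a single pattern.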
http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes
http://www.youtube.com/NCBINLM   @NCBI   http://www.facebook.com/ncbi.nlm

http://www.ncbi.nlm.nih.gov/education/
Thanks!
 The Genome Reference Consortium
  The Genome Center at Washington University
  The Wellcome Trust Sanger Institute
  The European Bioinformatics Institute
  The National Center for Biotechnology Information

  Church group at NCBI
    Valerie Schneider
    Nathan Bouk
    Hsiu-Chuan Chen
    Peter Meric
    Victor Ananiev
    Chao Chen
    John Lopez
    John Garner
    Tim Hefferon
    Cliff Clausen

  For Slides:
    Francoise Thibaud-Nissen
    Evan Eichler
    Steve Sherry

Advanced Computer Architecture – An Introduction
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

Church nhgri 2012

  • 1. The Evolution of Genome Data Deanna M. Church, NCBI @deannachurch
  • 2. Collins FS et al, 1998. Throughput: 500 Mb/year. Cost: < $0.25 per base. Variation: 100,000 SNPs mapped.
  • 3. Twenty Two Years of Growth: NCBI Data and User Services. [Chart, 1989 to 2011. Axes: GenBank Base Pairs (Millions) and Users (Average) per weekday. The timeline is labeled with NCBI resources, for example GenBank, BLAST, Entrez, PubMed, dbSNP, RefSeq, PubMed Central, dbGaP, dbVar, GTR, and ClinVar.]
  • 4. Steve Sherry, NCBI. NCBI dbSNP database growth. [Chart, 1999 to 2011. Upper panel: rs-ids (human variations), in millions; legend labels: Non-redundant, STR & Indel, SNP annotations, Ambiguous mapping. Lower panel: submissions by project, in millions: 1000 Genomes, Other projects, HapMap, TSC.] dbSNP build 135, November 2011.
  • 5. Kidd et al, 2007. APOBEC cluster. Black: deletion. White: insertion.
  • 7.
  • 8. Church et al., 2011 PLoS http://genomereference.org
  • 9. GRC Beginnings Distributed data Old Assembly Model Genome not in INSDC Database
  • 10. Build sequence contigs based on contigs defined in TPF: check for orientation consistencies; select switch points; instantiate sequence for further analysis. (Diagram labels: Switch point, Consensus sequence.)
  • 11.
  • 13.
  • 15. Distributed data Centralized Data Old Assembly Model Genome not in INSDC Database
  • 16. Large-Scale Variation Complicates Genome Assembly. Sequences from haplotype 1; sequences from haplotype 2. Old Assembly model: compress into a consensus. New Assembly model: represent both haplotypes.
  • 18. UGT2B17 Region. NCBI36 NC_000004.10 (chr4) Tiling Path: AC079749.5, AC147055.2, AC019173.4, AC021146.7, AC074378.4, AC134921.2, AC140484.1, AC093720.2 (genes labeled: TMPRSS11E, TMPRSS11E2). GRCh37 NC_000004.11 (chr4) Tiling Path: AC079749.5, AC147055.2, AC021146.7, AC074378.4, AC134921.1, AC093720.2 (gene labeled: TMPRSS11E). GRCh37 NT_167250.1 (UGT2B17 alternate locus): AC019173.4, AC021146.7, AC074378.4, AC226496.2, AC140484.1 (gene labeled: TMPRSS11E2). Xue Y et al, 2008.
  • 19. GRCh37 (hg19): UGT2B17, MHC, MAPT. 7 alternate haplotypes at the MHC. Alternate loci released as: FASTA, AGP, alignment to chromosome. http://genomereference.org
  • 20.
  • 21. Assembly (e.g. GRCh37). [Diagram components: Primary assembly unit; PAR; Non-nuclear assembly unit (e.g. MT); ALT 1 through ALT 9; Genomic Regions: MHC, UGT2B17, MAPT.]
  • 22. Richa Agarwala MHC Alternate locus Alignment to chr6
  • 23.
  • 24. Oh No! Not a new version of the human genome! http://genomereference.org
  • 25.
  • 26. Assembly (e.g. GRCh37.p5). [Diagram components: Primary assembly unit; PAR; Non-nuclear assembly unit (e.g. MT); ALT 1 through ALT 9; Genomic Regions: MHC, UGT2B17, MAPT, ABO, SMA, PECAM1; Patches …]
  • 27. TBC1D3C TBC1D3 TBC1D3H TBC1D3C Myo19 region (17q21)
  • 28. 70 Fix PATCHES: chromosome will update in GRCh38 (adds >1 Mb of novel sequence to the assembly). 71 Novel PATCHES: additional sequence added (adds >800K of novel sequence to the assembly). Releasing patches quarterly.
  • 29. Distributed data Centralized Data Old Assembly Model Updated Assembly Model Genome not in INSDC Database Genome in INSDC Database
  • 30. Data Archives: GenBank. Data in a common format; data in a single location (and mirrored); most quality checked prior to deposition; robust data tracking mechanism (accession.version); data owned by submitter.
  • 31. Data tracking: ABC14-1065514J1. Columns: Date, Phase, Gaps, Length. FP565796.1: 21-Oct-2009, phase 1, gaps 1. FP565796.2: 14-Oct-2010, phase 1, gaps 0. FP565796.3: 07-Nov-2010, phase 3, gaps 0.
  • 34. What’s in a name? GRCh37, hg19, Zv7, danRer5, MGSCv37, mm8, NCBIM37.
  • 35. By any other name… chr21:8,913,216-9,246,964
  • 36. By any other name… Zv7 chr21:8,913,216-9,246,964 X Mouse Build 36 chrX
  • 37. hg19 GRCh37 http://www.ncbi.nlm.nih.gov/genome/assembly
  • 38.
  • 39. Assembly (e.g. GRCh37.p5): GCA_000001405.6 / GCF_000001405.17. Primary Assembly: GCA_000001305.1 / GCF_000001305.13. Non-nuclear assembly unit (e.g. MT): GCA_000006015.1 / GCF_000006015.1. ALT 1: GCA_000001315.1 / GCF_000001315.1. ALT 2: GCA_000001325.1 / GCF_000001325.2. ALT 3: GCA_000001335.1 / GCF_000001335.1. ALT 4: GCA_000001345.1 / GCF_000001345.1. ALT 5: GCA_000001355.1 / GCF_000001355.1. ALT 6: GCA_000001365.1 / GCF_000001365.2. ALT 7: GCA_000001375.1 / GCF_000001375.1. ALT 8: GCA_000001385.1 / GCF_000001385.1. ALT 9: GCA_000001395.1 / GCF_000001395.1. Patches: GCA_000005045.5 / GCF_000005045.4.
  • 40. GenBank vs RefSeq. GenBank: submitter owned; redundancy; updated rarely; INSDC. RefSeq: RefSeq owned; non-redundant; curated; not INSDC. BRCA1 in GenBank: 83 genomic records, 31 mRNA records, 27 protein records. BRCA1 in RefSeq: 3 genomic records, 5 mRNA records, 1 RNA record, 5 protein records.
  • 41.
  • 42. RefSeq for Assemblies. Typical assembly edits: addition of non-nuclear (e.g. MT) assembly units; removal of contamination; drop unlocalized/unplaced scaffolds; mask contamination that is placed on chromosome.
  • 44. Understanding relationships between assemblies using alignments. First Pass: reciprocal best hit. Second Pass: non-reciprocal, duplicative hits.
  • 45.
  • 46. NCBI36 GRCh37.p5 No second pass alignments in GRCh37.p5 http://www.ncbi.nlm.nih.gov/tools/gbench/
  • 47. Genome Data is MORE than just the Genome
  • 48. Genome Data is MORE than just the Genome ATGCGTGCAAAATGCAGTGAGT ATGCGTGCAAAATGCAGTGAGT ATGCGTGCAAAATGCAGTGAGT ATGCGTGCAAAATGCAGTGAGT NM_000336.2:c.800C>T
  • 49. ATGCGTGCAAAATGCAGTGAGT ATGCGTGCAAAATGCAGTGAGT ATGCGTGCAAAATGCAGTGAGT ATGCGTGCAAAATGCAGTGAGT NM_000336.2:c.800C>T NC_000001.10:g.(?_20700513)_(21062644_?)del
  • 51. http://www.youtube.com/NCBINLM @NCBI http://www.facebook.com/ncbi.nlm http://www.ncbi.nlm.nih.gov/education/
  • 52. Thanks! The Genome Reference Consortium The Genome Center at Washington University The Wellcome Trust Sanger Institute The European Bioinformatics Institute The National Center for Biotechnology Information Church group at NCBI For Slides: Valerie Schneider Francoise Thibaud-Nissen Nathan Bouk Evan Eichler Hsiu-Chuan Chen Steve Sherry Peter Meric Victor Ananiev Chao Chen John Lopez John Garner Tim Hefferon NCBI Cliff Clausen
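
The accession.version tracking mentioned on slides 30 and 31 can be illustrated with a small sketch. It assumes only that an identifier is an accession followed by a dot and an integer version (as in FP565796.3 or NC_000004.11). The names `AccessionVersion`, `parse_accession_version`, and `latest_versions` are hypothetical helpers written for this illustration, not part of any NCBI tool.

```python
from typing import Dict, Iterable, NamedTuple


class AccessionVersion(NamedTuple):
    """Hypothetical container for a split accession.version identifier."""
    accession: str
    version: int


def parse_accession_version(identifier: str) -> AccessionVersion:
    """Split 'FP565796.3' into ('FP565796', 3).

    Assumption: the version is the integer after the last dot.
    """
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return AccessionVersion(accession, int(version))


def latest_versions(identifiers: Iterable[str]) -> Dict[str, int]:
    """Return the highest version seen for each accession."""
    latest: Dict[str, int] = {}
    for identifier in identifiers:
        parsed = parse_accession_version(identifier)
        if parsed.version > latest.get(parsed.accession, 0):
            latest[parsed.accession] = parsed.version
    return latest


# The three revisions of the clone record listed on slide 31.
history = ["FP565796.1", "FP565796.2", "FP565796.3"]
print(latest_versions(history))
```

Because the accession stays fixed while the version increments, any analysis that records the full identifier can be tied back to the exact sequence revision it used.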

Editor's Notes

  1. Alignments refer to pairs of sequences. Once you know how a pair of sequences go together, you can look at stringing the pairs along into a contig. The contig is essentially the consensus sequence that is produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
  2. Show alignment of a feature from first slide to show how far down the chromosome it has moved…
  3. Keeping track of people is way easier than keeping track of assemblies.
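
The switch-point idea in note 1 can be sketched in a few lines. This is a toy illustration under stated assumptions, not the GRC's assembly code. Each component is assumed to be already oriented consistently, and each switch point is expressed as a 0-based, half-open slice of the component to use. The names `ComponentSpan` and `build_contig` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ComponentSpan:
    """Hypothetical record: which part of a component feeds the contig."""
    name: str
    sequence: str
    start: int  # 0-based, inclusive: where we begin using this component
    end: int    # exclusive: the switch point to the next component


def build_contig(spans: Iterable[ComponentSpan]) -> str:
    """Concatenate the selected slice of each component, in tiling order."""
    parts = []
    for span in spans:
        if not 0 <= span.start <= span.end <= len(span.sequence):
            raise ValueError(f"switch points out of range for {span.name}")
        parts.append(span.sequence[span.start:span.end])
    return "".join(parts)


# Two toy components that overlap on 'GCAAAA'. The switch point is placed
# inside the overlap: stop using A after 9 bases, start using B at offset 3.
component_a = ComponentSpan("A", "ATGCGTGCAAAA", 0, 9)
component_b = ComponentSpan("B", "GCAAAATGCAGTGAGT", 3, 16)

contig = build_contig([component_a, component_b])
print(contig)
```

Placing the switch point inside the aligned overlap means neither component contributes the shared bases twice, which is why the pairwise alignment has to be known before the switch points are chosen.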