Database talk for
Bits & Bites meeting
        Jill Wegryzn
  Department of Plant Sciences
  University of California at Davis
Forest Genomics (Conifers)
• Phylogenetic Representation –
   – None currently exists. The conifers (gymnosperms) are the oldest of the major
     plant clades, arising some 300 million years ago. They are key to our
     understanding of the origins of genetic diversity in higher plants.
• Ecological Representation –
   – Conifers are of immense ecological importance, comprising the dominant life
     forms in most of the temperate and boreal ecosystems in the Northern
     Hemisphere.
• Fundamental Genetic Information –
   – Reference sequences are the fundamental data necessary to understand
     conifer biology and aid in guiding management of genetic resources.
• Development of Genomic Technologies –
   – The analytical and computational challenge of building a reference sequence
     for such large genomes will drive development of tools, strategies, and human
     resources throughout the genomics community.
Existing and Planned Angiosperm Tree Genome Sequences
Species                     Common Name        Genome Size (1)   Number of Genes (2)   Status (3)
In Progress With Draft Assemblies
Populus trichocarpa         Black Cottonwood   500 Mbp           ~40,000               2.0 / 2.2
Eucalyptus grandis          Rose Gum           691 Mbp           ~36,000               1.0 / 1.1
Malus domestica             Apple              881 Mbp           ~26,000               1.0 / 1.0
Prunus persica              Peach              227 Mbp           ~28,000               1.0 / 1.0
Citrus sinensis             Sweet Orange       319 Mbp           ~25,000               1.0 / 1.0
Carica papaya               Papaya             372 Mbp           -
Amborella trichopoda        Amborella          870 Mbp           -
In Progress Or Planned – No Published Assemblies
Castanea mollissima         Chinese Chestnut   800 Mbp           -
Salix purpurea              Purple Willow      327 Mbp           -
Quercus robur               Pedunculate Oak    740 Mbp           -
Populus spp. and ecotypes   Various            various           -
Azadirachta indica          Neem               384 Mbp           -

1) Genome size: Approximate total size, not completely assembled.
2) Number of Genes: Approximate number of loci containing protein coding sequence.
3) Status: Assembly / Annotation versions; http://www.phytozome.net/ ; http://asgpb.mhpcc.hawaii.edu/papaya/ ; http://www.amborella.org ;
   purple willow: http://www.poplar.ca/pdf/edomonton11smart.pdf ; neem: http://www.strandls.com/viewnews.php?param=5&param1=68
Plant Genome Size Comparisons
[Figure: bar chart of 1C DNA content (Mb). Inset (0 to 3000 Mb): Arabidopsis, Oryza, Populus, Sorghum, Glycine, Zea. Main axis (0 to 40000 Mb): Taxodium distichum, Pseudotsuga menziesii, Picea abies, Picea glauca, Pinus taeda, Pinus pinaster, Pinus lambertiana.]
What can be discovered about a gene by a database search?
• Best to have specific informational goals:
   – Evolutionary information: homologous genes, taxonomic
     distributions, allele frequencies, synteny, etc.
   – Genomic information: chromosomal location, introns, UTRs,
     regulatory regions, shared domains, etc.
   – Structural information: associated protein structures, fold types,
     structural domains
   – Expression information: expression specific to particular tissues,
     developmental stages, phenotypes, diseases, etc.
   – Functional information: enzymatic/molecular function,
     pathway/cellular role, localization, role in diseases
Using a database
• How to get information out of a database:
   – Summaries: how many entries, average or extreme
     values; rates of change, most recent entries, etc.
   – Browsing: getting a sense of the kind and quality of
     information available, e.g. checking familiar records
   – Search: looking for specific, predefined information
• “Key” to searching a database:
   – Must identify the element(s) of the database that are of
     interest somehow:
      • Gene name, symbol, location or other identifying information.
      • Sequences of genes, mRNAs, proteins, etc.
      • A cross-reference from another database or a database-generated ID.
NCBI and Entrez
• One of the most useful and comprehensive database
  collections is the NCBI, part of the National Library of
  Medicine.
   – Home to GenBank, PubMed & many other familiar DBs.
• NCBI provides interesting summaries, browsers, and
  search tools
• Entrez is their database search interface
  http://www.ncbi.nlm.nih.gov/Entrez
• Can search on gene names, chromosomal location,
  diseases, articles, keywords...
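Entrez searches can also be scripted through NCBI's E-utilities. The sketch below only builds a request URL for the `esearch` service and does not contact the network; the helper name `build_esearch_url` and the search term are assumptions made for illustration, not part of the talk.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Base URL of the NCBI E-utilities "esearch" service (text search in an Entrez database).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an esearch URL that looks for `term` in the Entrez database `db`."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical query: gene records for one conifer species and a keyword.
url = build_esearch_url("gene", "Pinus taeda[Organism] AND cellulose synthase")
print(url)
```

Fetching the URL (for example with `urllib.request.urlopen`) would return matching record IDs, which is the "key" needed for further lookups.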
Types of Databases

• Primary Databases
   – Original submissions by experimentalists
   – Content controlled by the submitter
     • Examples: GenBank (nr and nt), SNP, GEO
• Derivative Databases
  – Built from primary data
  – Content controlled by third party (NCBI)
     • Examples: RefSeq, Plant Protein, RefSNP, UniGene, NCBI Protein,
       Structure, Conserved Domain
NCBI is not all there is...
• Links to non-NCBI databases (see also “Link Out”)
    –   Reactome for pathways (also KEGG)
    –   HGNC for nomenclature
    –   HPRD protein information
    –   Regulatory / binding site DBs (e.g. CREB; some not linked)
    –   IHOP (information hyperlinked over proteins)
• Other important gene/protein resources:
    –   UniProt (most carefully annotated)
    –   PDB (main macromolecular structure repository)
    –   UCSC (best genome viewer & many useful "tracks")
    –   DIP / MINT (protein-protein interactions)
    –   More: InterPro, MetaCyc, Enzyme, etc. etc.
    –   Species Databases: TAIR, Gramene, MGI, WormBase, FlyBase,
        GDR, TreeGenes
• Alternatives
    – SRA versus DNAnexus
Flat Files
Characteristics:
• Data is stored as records in regular files
• Records usually have a simple structure and fixed
  number of fields
• For fast access may support indexing of fields in the
  records
• No mechanisms for relating data between files
• One needs special programs in order to access and
  manipulate the data
Limitations of Flat Files
• Most applications require that specific
  information can be quickly and efficiently
  retrieved
• Often critical that performance does not
  degrade as more entities are added
• Flat text files don’t always fulfill these
  requirements, especially when there are many
  entities and/or relationships
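The performance point can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical tab-separated flat file held in memory: a plain scan reads records until it finds a match, while an index on one field jumps straight to the record.

```python
import io

# Hypothetical flat file: one record per line, tab-separated fields
# (accession, organism name, genome size in bases).
FLAT_FILE = io.StringIO(
    "NC_000913\tEscherichia coli K12\t4640000\n"
    "NC_003098\tStreptococcus pneumoniae R6\t2040000\n"
)


def parse(line):
    """Turn one flat-file line into a record (a dict)."""
    accession, organism, size = line.rstrip("\n").split("\t")
    return {"accession": accession, "organism": organism, "size": int(size)}


def scan(handle, accession):
    """Linear scan: the cost grows with the number of records."""
    handle.seek(0)
    for line in handle:
        record = parse(line)
        if record["accession"] == accession:
            return record
    return None


def build_index(handle):
    """Index on the accession field: accession -> position of the record."""
    index = {}
    handle.seek(0)
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        index[parse(line)["accession"]] = offset
    return index


def lookup(handle, index, accession):
    """Indexed access: seek directly to the record instead of scanning."""
    offset = index.get(accession)
    if offset is None:
        return None
    handle.seek(offset)
    return parse(handle.readline())


index = build_index(FLAT_FILE)
print(lookup(FLAT_FILE, index, "NC_003098"))
```

The index has to be rebuilt whenever the file changes, and it says nothing about how records in different files relate to each other, which is where a DBMS takes over.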
Relational Database
Characteristics:
• Data is organized into tables: rows & columns
• Each row represents an instance of an entity
• Each column represents an attribute of an entity
• Metadata describes each table column
• Relationships between entities are represented by
  values stored in the columns of the corresponding
  tables (keys)
• Accessible through Structured Query Language (SQL)
Metadata & Data Table
Organism
Name                   Type                  Max Length       Description
Name                   Alphanumeric          100              Organism name
Size                   Integer               10               Genome length (bases)
Gc                     Float                 5                Percent GC
Accession              Alphanumeric          10               Accession number
Release                Date                  8                Release date
Center                 Alphanumeric          100              Genome center name
Sequence               Alphanumeric          Variable         Sequence

Name                          Size        Gc   Accession   Release      Center                  Sequence
Escherichia coli K12          4,640,000   50   NC_000913   09/05/1997   Univ. Wisconsin         AGCTTTTCATT…
Streptococcus pneumoniae R6   2,040,000   40   NC_003098   09/07/2001   Eli Lilly and Company   TTGAAAGAAAA…
…
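As a rough sketch, the Organism metadata and data above could be declared and loaded like this. SQLite (via Python's built-in `sqlite3`) is an assumption chosen for a self-contained example; the SQL type names are one possible mapping of the slide's types, and the truncated sequences are stored exactly as shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The column declarations play the role of the metadata table.
conn.execute("""
    CREATE TABLE Organism (
        Name      VARCHAR(100),  -- organism name
        Size      INTEGER,       -- genome length (bases)
        Gc        REAL,          -- percent GC
        Accession VARCHAR(10),   -- accession number
        Release   DATE,          -- release date (kept as written on the slide)
        Center    VARCHAR(100),  -- genome center name
        Sequence  TEXT           -- sequence (truncated on the slide)
    )
""")

rows = [
    ("Escherichia coli K12", 4640000, 50.0, "NC_000913", "09/05/1997",
     "Univ. Wisconsin", "AGCTTTTCATT…"),
    ("Streptococcus pneumoniae R6", 2040000, 40.0, "NC_003098", "09/07/2001",
     "Eli Lilly and Company", "TTGAAAGAAAA…"),
]
conn.executemany("INSERT INTO Organism VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

# The DBMS can report the metadata back: (column name, declared type).
metadata = [
    (name, decl_type)
    for _, name, decl_type, *_ in conn.execute("PRAGMA table_info(Organism)")
]
for name, decl_type in metadata:
    print(name, decl_type)
```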
Relationships
• Used to connect tables
• Field(s) that have the same value in the related tables
• Organism.Accession=Gene.OAccession
• Organism.Accession
   – Unique
   – Primary key
• Gene.OAccession
   – Not unique
   – Foreign key (refers to Organism.Accession)
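The relationship Organism.Accession = Gene.OAccession can be sketched as a join. This example assumes SQLite through Python's `sqlite3`; the Gene table layout and the gene symbols are illustrative choices, not taken from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked

conn.executescript("""
    CREATE TABLE Organism (
        Accession TEXT PRIMARY KEY,   -- unique: one row per organism
        Name      TEXT NOT NULL
    );
    CREATE TABLE Gene (
        Symbol     TEXT NOT NULL,
        OAccession TEXT NOT NULL REFERENCES Organism(Accession)  -- not unique
    );
""")

conn.executemany("INSERT INTO Organism VALUES (?, ?)", [
    ("NC_000913", "Escherichia coli K12"),
    ("NC_003098", "Streptococcus pneumoniae R6"),
])
# Illustrative gene rows: several genes can point at the same organism.
conn.executemany("INSERT INTO Gene VALUES (?, ?)", [
    ("lacZ", "NC_000913"),
    ("recA", "NC_000913"),
    ("ply", "NC_003098"),
])

# Connect the two tables through the shared key value.
joined = conn.execute("""
    SELECT Organism.Name, Gene.Symbol
    FROM Organism
    JOIN Gene ON Organism.Accession = Gene.OAccession
    ORDER BY Organism.Name, Gene.Symbol
""").fetchall()
print(joined)

# A gene that refers to an unknown organism is rejected by the DBMS.
try:
    conn.execute("INSERT INTO Gene VALUES ('orphan', 'NC_999999')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
```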
Schema: Representation of Table Organization
SQL
• ANSI (American National Standards Institute)
  standard computer language for accessing and
  manipulating database systems.
• SQL statements are used to retrieve and
  update data in a database.
• Includes:
  – Data Manipulation Language (DML)
  – Data Definition Language (DDL)
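A minimal sketch of the two statement families, assuming SQLite through Python's `sqlite3` and a cut-down Organism table: DDL statements define or change the table structure, DML statements add, change, read, and remove rows (SELECT is usually grouped with DML).

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the structure, then alter it.
conn.execute("CREATE TABLE Organism (Accession TEXT PRIMARY KEY, Name TEXT, Size INTEGER)")
conn.execute("ALTER TABLE Organism ADD COLUMN Gc REAL")

# DML: insert, update, retrieve, and delete rows.
conn.execute(
    "INSERT INTO Organism (Accession, Name, Size) VALUES (?, ?, ?)",
    ("NC_000913", "Escherichia coli K12", 4640000),
)
conn.execute("UPDATE Organism SET Gc = ? WHERE Accession = ?", (50.0, "NC_000913"))
row = conn.execute(
    "SELECT Name, Size, Gc FROM Organism WHERE Accession = ?", ("NC_000913",)
).fetchone()
print(row)

conn.execute("DELETE FROM Organism WHERE Accession = ?", ("NC_000913",))
remaining = conn.execute("SELECT COUNT(*) FROM Organism").fetchone()[0]
```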
DBMS Advantages
• Program-data independence
• Minimal data redundancy
• Improved data consistency & quality
   – Access control
   – Transaction control
• Improved accessibility & data sharing
• Increased productivity of application development
• Enforced standards
DBMS
• Software package for defining and managing a
  database.
• Examples:
  – Proprietary: MS Access, MS SQL Server, DB2,
    Oracle, Sybase
  – Open source: MySQL, PostgreSQL
http://dendrome.ucdavis.edu
TreeGenes Database
Encompasses Dendrome Resources, DendromePlone, TreeGenes Database & DiversiTree
• Nine modules to store and interrelate data for query and analysis in PostgreSQL
   – Forest Geneticists Colleague module
   – Literature module
   – Transcriptome annotation pipeline and module
   – Comparative map module
   – Species module
   – Sequencing module
   – Primers module
   – Genotype/EST module
   – Phenotype/Expression module
   – Sample tracking module
• Direct resource for nearly 2,500 forest geneticists representing 800 organizations
  worldwide. Over 6,000 unique visitors in December 2011.
Genomic Resources
678 Species Representing 77 Genera
Generic Model Organism Database
CMAP: Obtaining TreeGenes (TG) Accession Number
• Add literature data and (first) map file
• (optional) Add additional map files
• Obtain TG Accession number!
[Screenshots: individual features and their locations on map; list of features on map]
GMOD Genome Browser
• Search and select data source
• Tracks can be reordered or hidden as necessary
Douglas-fir Transcriptome Resources in TreeGenes
Gene Ontology
• Gene annotation system

• Controlled vocabulary that can be applied
  to all organisms (protein/RNA)

• Used to describe gene products
[Images: "bud initiation" shown for Metazoa, Saccharomyces, and Viridiplantae]
What’s in a name?
• The same name can be used to describe
  different concepts
What's in a name?
•   Glucose synthesis
•   Glucose biosynthesis
•   Glucose formation
•   Glucose anabolism
•   Gluconeogenesis

• All refer to the process of making glucose from
  simpler components
How does GO work?
What information might we want to
capture about a gene product?

• What does the gene product do?
• Why does it perform these activities?
• Where does it act?
The 3 Gene Ontologies
• Molecular Function = elemental activity/task
   – the tasks performed by individual gene products; examples are carbohydrate
     binding and ATPase activity
• Biological Process = biological goal or objective
   – broad biological goals, such as mitosis or purine metabolism, that
     are accomplished by ordered assemblies of molecular functions
• Cellular Component = location or complex
   – subcellular structures, locations, and macromolecular complexes;
     examples include nucleus, telomere, and RNA polymerase II
     holoenzyme
Ontology Structure
Ontologies can be represented as graphs,
where the nodes are connected by edges

   Nodes = concepts in the ontology
   Edges = relationships between the concepts

                     node

                            edge

              node             node
Ontology Structure
• The Gene Ontology is structured as a
  hierarchical directed acyclic graph (DAG)

• Terms can have more than one parent and
  zero, one or more children

• Terms are linked by two relationships
   – is-a
   – part-of
True Path Rule

• The path from a child term all the way up to its
  top-level parent(s) must always be true

(Diagram legend: edges are either is-a or part-of)
cell
        cytoplasm
           chromosome
             nuclear chromosome
              cytoplasmic chromosome
              mitochondrial chromosome
           nucleus
              nuclear chromosome
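The DAG structure and the true path rule can be sketched with a small graph. The mini-ontology below is illustrative only: its term names come from the slide, but the edges are assumptions and it is not the real GO graph.

```python
# Illustrative mini-ontology: child term -> list of (relationship, parent term).
PARENTS = {
    "nuclear chromosome": [("is-a", "chromosome"), ("part-of", "nucleus")],
    "mitochondrial chromosome": [("is-a", "cytoplasmic chromosome")],
    "cytoplasmic chromosome": [("is-a", "chromosome"), ("part-of", "cytoplasm")],
    "chromosome": [("part-of", "cell")],
    "nucleus": [("part-of", "cell")],
    "cytoplasm": [("part-of", "cell")],
    "cell": [],
}


def ancestors(term):
    """Collect every term reachable by following parent edges upward."""
    seen = set()
    stack = [term]
    while stack:
        for _relationship, parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def implied_terms(term):
    """True path rule: annotating to a term implies all of its ancestors."""
    return {term} | ancestors(term)


print(sorted(implied_terms("nuclear chromosome")))
print(sorted(implied_terms("mitochondrial chromosome")))
```

A term with two parents ("nuclear chromosome") is what makes this a DAG rather than a tree. Note that "nucleus" is not implied for "mitochondrial chromosome": if "chromosome" were declared part-of "nucleus", the path would no longer be true for every child.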
What's in a GO term?
term: gluconeogenesis

id: GO:0006094

definition: The formation of glucose from
noncarbohydrate precursors, such as
pyruvate, amino acids and glycerol.
Source of Ontology Assignments
   IEA  Inferred from Electronic Annotation
   ISS  Inferred from Sequence Similarity
   IEP  Inferred from Expression Pattern
   IMP  Inferred from Mutant Phenotype
   IGI  Inferred from Genetic Interaction
   IPI  Inferred from Physical Interaction
   IDA  Inferred from Direct Assay
   RCA  Inferred from Reviewed Computational Analysis
   TAS  Traceable Author Statement
   NAS  Non-traceable Author Statement
   IC   Inferred by Curator
   ND   No biological Data available
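Evidence codes make it possible to filter annotations by how they were obtained. A minimal sketch, assuming hypothetical gene products (`geneA`, `geneB`, `geneC`) annotated to GO:0006094, and treating IDA, IPI, IMP, IGI, and IEP as the experimentally supported codes:

```python
# Evidence codes and their meanings, as listed above.
EVIDENCE_CODES = {
    "IEA": "Inferred from Electronic Annotation",
    "ISS": "Inferred from Sequence Similarity",
    "IEP": "Inferred from Expression Pattern",
    "IMP": "Inferred from Mutant Phenotype",
    "IGI": "Inferred from Genetic Interaction",
    "IPI": "Inferred from Physical Interaction",
    "IDA": "Inferred from Direct Assay",
    "RCA": "Inferred from Reviewed Computational Analysis",
    "TAS": "Traceable Author Statement",
    "NAS": "Non-traceable Author Statement",
    "IC": "Inferred by Curator",
    "ND": "No biological Data available",
}

# Assumption for this sketch: codes backed by a laboratory experiment.
EXPERIMENTAL = {"IDA", "IPI", "IMP", "IGI", "IEP"}

# Hypothetical annotations: (gene product, GO term id, evidence code).
annotations = [
    ("geneA", "GO:0006094", "IDA"),
    ("geneB", "GO:0006094", "IEA"),
    ("geneC", "GO:0006094", "IMP"),
]


def with_experimental_support(annos):
    """Keep only annotations whose evidence code is experimental."""
    return [a for a in annos if a[2] in EXPERIMENTAL]


for product, go_id, code in with_experimental_support(annotations):
    print(product, go_id, code, "=", EVIDENCE_CODES[code])
```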
Ontology Development
                     Plant Ontology and Trait Ontology



• Plant Ontology
   – Structure
      • Needle, Cambium
   – Growth stages
• Trait Ontology
   – Forest Tree Specific Phenotypes
      • Wood Density
• PATO
   – Phenotypic Qualities
Current Ontology Listings:
      OBO Foundry
Interwebs 101
• Web 1.0 – Hyperlinks
• Web 2.0 – Interactivity, information sharing, user
  centered design (wikis, blogs, social media)
• Web 3.0 – Semantic Web
   – Data focused
   – Answer the limitations of HTML
   – HTML describes documents and the links between them.
     RDF, OWL, and XML, by contrast, can describe specific
     things
   – Machine-readable data and relationships between the
     data – knowledge processing – deductive reasoning and
     inference
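The idea of machine-readable data plus inference can be sketched with subject-predicate-object triples. Plain Python tuples stand in for RDF here (real RDF identifies things with URIs and is normally handled by a dedicated toolkit), and the predicate names are assumptions made for illustration.

```python
# Facts as (subject, predicate, object) triples.
triples = {
    ("Pinus taeda", "member_of", "Pinus"),
    ("Pinus", "member_of", "Pinaceae"),
    ("Pinaceae", "member_of", "conifers"),
    ("Pinus taeda", "common_name", "loblolly pine"),
}


def infer_transitive(facts, predicate):
    """Add every triple that follows from treating `predicate` as transitive."""
    inferred = set(facts)
    changed = True
    while changed:
        changed = False
        for a, p1, b in list(inferred):
            for c, p2, d in list(inferred):
                new = (a, predicate, d)
                if p1 == p2 == predicate and b == c and new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred


knowledge = infer_transitive(triples, "member_of")

# This fact was never stated; it is deduced from the chain of memberships.
print(("Pinus taeda", "member_of", "conifers") in knowledge)
```

An HTML page linking these names together would carry none of this structure; the triples are what let a program deduce the unstated relationship.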
Web Services Development
                                  Communication within TreeGenes



   • Development of Web Services in cooperation with
     NSF’s iPlant Cyberinfrastructure Project
         – Software system to support interoperable machine-to-machine
           interaction over a network regardless of platform
           incompatibilities
         – Web Services Description Language (WSDL) is implemented to
           relate operations
• Service Oriented Architecture (SOA)
   – With SOAP, the basic unit of communication is a message.
• Remote Procedure Call (RPC)
   – RPC Web services define a call interface in which the basic unit is
     the WSDL operation.
• Representational State Transfer (REST)
   – REST uses HTTP by constraining the interface to standard operations
     (like GET, POST, PUT, DELETE for HTTP). The focus is on interacting
     with stateful resources, rather than messages or operations.
SSWAP Ontology
Creating and Contributing to Existing Servlets for Common Genomic Types
Forest Tree Genetic Stock Center
Bulk Retrieval Window Components
Data & Annotation Selection Fields
Bulk Retrieval Window
TreeGenes Sample Tracking System
• Accurately track samples through collection, DNA extraction, and genotyping
• Provide a standard and efficient method to collect and store phenotypic data
• Provide a public interface to readily query raw genotype, phenotype, and
  association results (DiversiTree)
• Provide interfaces and database backend to support a DNA distribution
  center (UCD)
Population Genetics
Association Studies, Landscape Genomics
• Currently no other repositories target association data with geo-referenced data
    • dbGaP
    • Dryad
• Starting with enforcement at the journal level: Tree Genetics and Genomes
GenSAS development with Content Management
Plone and Drupal
[Screenshot panels: login/signup panel, query sequence panel, data retrieval panel, tool selection panel, task queue panel]
GenSAS development
Multiple Gene Prediction Tracks
[Screenshot labels: overview track, control track, sequence track, evidence tracks, custom track, function track, message box]
GenSAS integration with GBrowse
Prototyped with Peach Genome in GDR
Analysis Resources
Custom Databases
Integrating Tools into TreeGenes
Galaxy
Genomic resources
Fluxes of CO2 and H2O: FLUXNET and Ameriflux
Free Air CO2 Enrichment (FACE)
TRY – Global Database of Plant Traits

• Scientists compiled three million traits for 69,000 out of the world's
  ~300,000 plant species.
• Worldwide collaboration of scientists from 106 research institutions
• TRY is hosted at the Max Planck Institute for Biogeochemistry in Jena
  (Germany)
    – Jointly coordinated with:
        • University of Leipzig (Germany)
        • IMBIV-CONICET (Argentina)
        • Macquarie University (Australia)
        • CNRS and University of Paris-Sud (France)
Database talk for Bits & Bites meeting
Database talk for Bits & Bites meeting

Weitere ähnliche Inhalte

Ähnlich wie Database talk for Bits & Bites meeting

The International Journal of Engineering and Science
The International Journal of Engineering and ScienceThe International Journal of Engineering and Science
The International Journal of Engineering and Science
theijes
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
c.titus.brown
 

Ähnlich wie Database talk for Bits & Bites meeting (20)

CWR US presentation PGOC 2012
CWR US presentation PGOC 2012CWR US presentation PGOC 2012
CWR US presentation PGOC 2012
 
2012 erin-crc-nih-seattle
2012 erin-crc-nih-seattle2012 erin-crc-nih-seattle
2012 erin-crc-nih-seattle
 
Use of DNA barcoding and its role in the plant species/varietal Identifica...
Use of DNA  barcoding  and its role in the plant species/varietal  Identifica...Use of DNA  barcoding  and its role in the plant species/varietal  Identifica...
Use of DNA barcoding and its role in the plant species/varietal Identifica...
 
Evaluation of the impact of error correction algorithms on SNP calling.
Evaluation of the impact of error correction algorithms on SNP calling.Evaluation of the impact of error correction algorithms on SNP calling.
Evaluation of the impact of error correction algorithms on SNP calling.
 
Real-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe ParkerReal-time Phylogenomics: Joe Parker
Real-time Phylogenomics: Joe Parker
 
Building the Atlas of Living Australia
Building the Atlas of Living AustraliaBuilding the Atlas of Living Australia
Building the Atlas of Living Australia
 
iPlant TNRS for digital collections - iDigBio Workshop
iPlant TNRS for digital collections - iDigBio WorkshopiPlant TNRS for digital collections - iDigBio Workshop
iPlant TNRS for digital collections - iDigBio Workshop
 
1_7_genome_1.ppt
1_7_genome_1.ppt1_7_genome_1.ppt
1_7_genome_1.ppt
 
Domestication, polyploidy and genomics of crops #PAGXXV Heslop-Harrison
Domestication, polyploidy and genomics of crops #PAGXXV Heslop-HarrisonDomestication, polyploidy and genomics of crops #PAGXXV Heslop-Harrison
Domestication, polyploidy and genomics of crops #PAGXXV Heslop-Harrison
 
DNA barcoding the vascular plant flora of the Canadian Arctic
DNA barcoding the vascular plant flora of the Canadian ArcticDNA barcoding the vascular plant flora of the Canadian Arctic
DNA barcoding the vascular plant flora of the Canadian Arctic
 
iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK
iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK
iEvoBio Keynote: Frontiers of discovery with Encyclopedia of Life -- TRAITBANK
 
Lecture 1,2
Lecture 1,2Lecture 1,2
Lecture 1,2
 
ECCMID 2015 - So I have sequenced my genome ... what now?
ECCMID 2015 - So I have sequenced my genome ... what now?ECCMID 2015 - So I have sequenced my genome ... what now?
ECCMID 2015 - So I have sequenced my genome ... what now?
 
The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)
 
The International Journal of Engineering and Science
The International Journal of Engineering and ScienceThe International Journal of Engineering and Science
The International Journal of Engineering and Science
 
Tetrahymena genome project update 2004 by Jonathan Eisen
Tetrahymena genome project update 2004 by Jonathan EisenTetrahymena genome project update 2004 by Jonathan Eisen
Tetrahymena genome project update 2004 by Jonathan Eisen
 
Next Generation Sequencing Technologies and Their Applications in Ornamental ...
Next Generation Sequencing Technologies and Their Applications in Ornamental ...Next Generation Sequencing Technologies and Their Applications in Ornamental ...
Next Generation Sequencing Technologies and Their Applications in Ornamental ...
 
O7 Coffey
O7 CoffeyO7 Coffey
O7 Coffey
 
The wheat genome sequence: a foundation for accelerating improvment of bread ...
The wheat genome sequence: a foundation for accelerating improvment of bread ...The wheat genome sequence: a foundation for accelerating improvment of bread ...
The wheat genome sequence: a foundation for accelerating improvment of bread ...
 
2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes2013 ucdavis-smbe-eukaryotes
2013 ucdavis-smbe-eukaryotes
 

Mehr von Keith Bradnam

The art of good science writing
The art of good science writingThe art of good science writing
The art of good science writing
Keith Bradnam
 

Mehr von Keith Bradnam (20)

13 questions you might have about galaxy
13 questions you might have about galaxy13 questions you might have about galaxy
13 questions you might have about galaxy
 
Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...Genome assembly: the art of trying to make one big thing from millions of ver...
Genome assembly: the art of trying to make one big thing from millions of ver...
 
This bioinformatics lesson is brought to you by the letter 'W'
This bioinformatics lesson is brought to you by the letter 'W'This bioinformatics lesson is brought to you by the letter 'W'
This bioinformatics lesson is brought to you by the letter 'W'
 
This bioinformatics lesson is brought to you by the letter 'T'
This bioinformatics lesson is brought to you by the letter 'T'This bioinformatics lesson is brought to you by the letter 'T'
This bioinformatics lesson is brought to you by the letter 'T'
 
This bioinformatics lesson is brought to you by the letter 'D'
This bioinformatics lesson is brought to you by the letter 'D'This bioinformatics lesson is brought to you by the letter 'D'
This bioinformatics lesson is brought to you by the letter 'D'
 
Thoughts on the feasibility of an Assemblathon 3 contest
Thoughts on the feasibility of an Assemblathon 3 contestThoughts on the feasibility of an Assemblathon 3 contest
Thoughts on the feasibility of an Assemblathon 3 contest
 
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...Genome Assembly: the art of trying to make one BIG thing from millions of ver...
Genome Assembly: the art of trying to make one BIG thing from millions of ver...
 
Genome assembly: then and now (with notes) — v1.2
Genome assembly: then and now (with notes) — v1.2Genome assembly: then and now (with notes) — v1.2
Genome assembly: then and now (with notes) — v1.2
 
Genome assembly: then and now — v1.2
Genome assembly: then and now — v1.2Genome assembly: then and now — v1.2
Genome assembly: then and now — v1.2
 
Genome assembly: then and now — v1.1
Genome assembly: then and now — v1.1Genome assembly: then and now — v1.1
Genome assembly: then and now — v1.1
 
Genome assembly: then and now — with notes — v1.1
Genome assembly: then and now — with notes — v1.1Genome assembly: then and now — with notes — v1.1
Genome assembly: then and now — with notes — v1.1
 
What's in a name? Better vocabularies = better bioinformatics?
What's in a name? Better vocabularies = better bioinformatics?What's in a name? Better vocabularies = better bioinformatics?
What's in a name? Better vocabularies = better bioinformatics?
 
The art of good science writing
The art of good science writingThe art of good science writing
The art of good science writing
 
Genome assembly: then and now — v1.0
Genome assembly: then and now — v1.0Genome assembly: then and now — v1.0
Genome assembly: then and now — v1.0
 
Polish that presentation! 25 tips to bring clarity to your slides
Polish that presentation! 25 tips to bring clarity to your slidesPolish that presentation! 25 tips to bring clarity to your slides
Polish that presentation! 25 tips to bring clarity to your slides
 
10 tips for adding polish to presentations
10 tips for adding polish to presentations10 tips for adding polish to presentations
10 tips for adding polish to presentations
 
Benchmarking short-read mapping programs
Benchmarking short-read mapping programsBenchmarking short-read mapping programs
Benchmarking short-read mapping programs
 
Thoughts on the recent announcements by Oxford Nanopore Technologies
Thoughts on the recent announcements by Oxford Nanopore TechnologiesThoughts on the recent announcements by Oxford Nanopore Technologies
Thoughts on the recent announcements by Oxford Nanopore Technologies
 
When is a genome finished?
When is a genome finished? When is a genome finished?
When is a genome finished?
 
Twitter 101 - an introduction to Twitter
Twitter 101  - an introduction to TwitterTwitter 101  - an introduction to Twitter
Twitter 101 - an introduction to Twitter
 

Kürzlich hochgeladen

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
Gardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch LetterGardella_PRCampaignConclusion Pitch Letter
Gardella_PRCampaignConclusion Pitch Letter
MateoGardella
 
An Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdfAn Overview of Mutual Funds Bcom Project.pdf
An Overview of Mutual Funds Bcom Project.pdf
SanaAli374401
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
QucHHunhnh
 

Kürzlich hochgeladen (20)

Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"
Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"Mattingly "AI & Prompt Design: Structured Data, Assistants, & RAG"

Database talk for Bits & Bites meeting

  • 1. Database talk for Bits & Bites meeting. Jill Wegrzyn, Department of Plant Sciences, University of California at Davis
  • 2. Forest Genomics (Conifers) • Phylogenetic Representation – None currently exists. The conifers (gymnosperms) are the oldest of the major plant clades, arising some 300 million years ago. They are key to our understanding of the origins of genetic diversity in higher plants. • Ecological Representation – Conifers are of immense ecological importance, comprising the dominant life forms in most of the temperate and boreal ecosystems in the Northern Hemisphere. • Fundamental Genetic Information – Reference sequences are the fundamental data necessary to understand conifer biology and aid in guiding management of genetic resources. • Development of Genomic Technologies – The analytical and computational challenge of building a reference sequence for such large genomes will drive development of tools, strategies, and human resources throughout the genomics community.
  • 3. Existing and Planned Angiosperm Tree Genome Sequences

       Species                    Common name        Genome size (1)   Number of genes (2)   Status (3)
       In Progress With Draft Assemblies
       Populus trichocarpa        Black Cottonwood   500 Mbp           ~40,000               2.0 / 2.2
       Eucalyptus grandis         Rose Gum           691 Mbp           ~36,000               1.0 / 1.1
       Malus domestica            Apple              881 Mbp           ~26,000               1.0 / 1.0
       Prunus persica             Peach              227 Mbp           ~28,000               1.0 / 1.0
       Citrus sinensis            Sweet Orange       319 Mbp           ~25,000               1.0 / 1.0
       Carica papaya              Papaya             372 Mbp           -
       Amborella trichopoda       Amborella          870 Mbp           -
       In Progress Or Planned – No Published Assemblies
       Castanea mollissima        Chinese Chestnut   800 Mbp           -
       Salix purpurea             Purple Willow      327 Mbp           -
       Quercus robur              Pedunculate Oak    740 Mbp           -
       Populus spp and ecotypes   Various            various           -
       Azadirachta indica         Neem               384 Mbp           -

       (1) Genome size: Approximate total size, not completely assembled.
       (2) Number of Genes: Approximate number of loci containing protein coding sequence.
       (3) Status: Assembly / Annotation versions; http://www.phytozome.net/ ; http://asgpb.mhpcc.hawaii.edu/papaya/ ; http://www.amborella.org ; purple willow: Http://www.poplar.ca/pdf/edomonton11smart.pdf ; neem: http://www.strandls.com/viewnews.php?param=5&param1=68
  • 4. Plant Genome Size Comparisons [Bar chart of 1C DNA content (Mb), main axis 0 to 40,000 with an inset from 0 to 3,000, comparing Arabidopsis, Oryza, Populus, Sorghum, Glycine and Zea with Pinus lambertiana, Pinus pinaster, Pinus taeda, Picea glauca, Picea abies, Pseudotsuga menziesii and Taxodium distichum]
  • 5. What can be discovered about a gene by a database search? • Best to have specific informational goals: – Evolutionary information: homologous genes, taxonomic distributions, allele frequencies, synteny, etc. – Genomic information: chromosomal location, introns, UTRs, regulatory regions, shared domains, etc. – Structural information: associated protein structures, fold types, structural domains – Expression information: expression specific to particular tissues, developmental stages, phenotypes, diseases, etc. – Functional information: enzymatic/molecular function, pathway/cellular role, localization, role in diseases
  • 6. Using a database • How to get information out of a database: – Summaries: how many entries, average or extreme values, rates of change, most recent entries, etc. – Browsing: getting a sense of the kind and quality of information available, e.g. checking familiar records – Search: looking for specific, predefined information • “Key” to searching a database: – Must somehow identify the element(s) of the database that are of interest: • Gene name, symbol, location or other identifying information. • Sequences of genes, mRNAs, proteins, etc. • A cross-reference from another database or a database-generated ID.
  • 7. NCBI and Entrez • One of the most useful and comprehensive database collections is the NCBI, part of the National Library of Medicine. – Home to GenBank, PubMed & many other familiar DBs. • NCBI provides interesting summaries, browsers, and search tools • Entrez is their database search interface http://www.ncbi.nlm.nih.gov/Entrez • Can search on gene names, chromosomal location, diseases, articles, keywords...
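An Entrez search like the one described on slide 7 can also be issued programmatically through NCBI's E-utilities. The sketch below only builds the request URL for the `esearch` endpoint and sends nothing over the network; the helper name `build_esearch_url` and the example query are illustrative assumptions, not part of the talk.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities ESearch service, the programmatic route into Entrez.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an ESearch URL for `term` in the Entrez database `db` (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Example query (an assumption for illustration): genes from loblolly pine.
url = build_esearch_url("gene", "Pinus taeda[Organism] AND cellulose synthase")
print(url)
```

Fetching that URL returns a list of matching record IDs, which is the "key" that later lookups in the same database would use.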
  • 8. Types of Databases • Primary Databases – Original submissions by experimentalists – Content controlled by the submitter • Examples: GenBank (nr and nt), SNP, GEO • Derivative Databases – Built from primary data – Content controlled by third party (NCBI) • Examples: RefSeq, Plant Protein, RefSNP, UniGene, NCBI Protein, Structure, Conserved Domain
  • 9.
  • 10. NCBI is not all there is... • Links to non-NCBI databases (see also “Link Out”) – Reactome for pathways (also KEGG) – HGNC for nomenclature – HPRD for protein information – Regulatory / binding site DBs (e.g. CREB; some not linked) – iHOP (information hyperlinked over proteins) • Other important gene/protein resources: – UniProt (most carefully annotated) – PDB (main macromolecular structure repository) – UCSC (best genome viewer & many useful 'tracks') – DIP / MINT (protein-protein interactions) – More: InterPro, MetaCyc, Enzyme, etc. – Species Databases: TAIR, Gramene, MGI, WormBase, FlyBase, GDR, TreeGenes • Alternatives – SRA versus DNAnexus
  • 11. Flat Files Characteristics: • Data is stored as records in regular files • Records usually have a simple structure and fixed number of fields • For fast access may support indexing of fields in the records • No mechanisms for relating data between files • One needs special programs in order to access and manipulate the data
  • 12. Limitations of Flat Files • Most applications require that specific information can be quickly and efficiently retrieved • Often critical that performance does not degrade as more entities are added • Flat text files don’t always fulfill these requirements, especially when there are many entities and/or relationships
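The trade-off on slides 11 and 12 can be sketched in a few lines: a plain scan reads every record until it finds a match, while an index over one field lets a lookup jump straight to the record. The two records are borrowed from the talk's Organism example; the tab-delimited layout and the function names are assumptions made for illustration.

```python
import io

# A tiny tab-delimited flat file held in memory (layout is an assumption).
FLAT_FILE = io.StringIO(
    "NC_000913\tEscherichia coli K12\t4640000\n"
    "NC_003098\tStreptococcus pneumoniae R6\t2040000\n"
)


def scan_for(handle, accession):
    """Linear scan: read records one by one until the accession matches."""
    handle.seek(0)
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == accession:
            return fields
    return None


def build_index(handle):
    """Map each accession (first field) to the offset where its record starts."""
    handle.seek(0)
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        index[line.split("\t", 1)[0]] = offset
    return index


def indexed_lookup(handle, index, accession):
    """Jump directly to a record using the index instead of rescanning."""
    if accession not in index:
        return None
    handle.seek(index[accession])
    return handle.readline().rstrip("\n").split("\t")


index = build_index(FLAT_FILE)
record = indexed_lookup(FLAT_FILE, index, "NC_003098")
print(record)
```

Even with the index, nothing here relates this file to any other file; that is the gap the relational model on the following slides fills.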
  • 13. Relational Database Characteristics: • Data is organized into tables: rows & columns • Each row represents an instance of an entity • Each column represents an attribute of an entity • Metadata describes each table column • Relationships between entities are represented by values stored in the columns of the corresponding tables (keys) • Accessible through Structured Query Language (SQL)
  • 14. Metadata & Data

       Metadata for table Organism
       Name        Type           Max Length   Description
       Name        Alphanumeric   100          Organism name
       Size        Integer        10           Genome length (bases)
       Gc          Float          5            Percent GC
       Accession   Alphanumeric   10           Accession number
       Release     Date           8            Release date
       Center      Alphanumeric   100          Genome center name
       Sequence    Alphanumeric   Variable     Sequence

       Data in table Organism
       Name                          Size        Gc   Accession   Release      Center                  Sequence
       Escherichia coli K12          4,640,000   50   NC_000913   09/05/1997   Univ. Wisconsin         AGCTTTTCATT…
       Streptococcus pneumoniae R6   2,040,000   40   NC_003098   09/07/2001   Eli Lilly and Company   TTGAAAGAAAA…
       …
  • 15. Relationships • Used to connect tables • Field(s) that have the same value in the related tables • Organism.Accession = Gene.OAccession • Organism.Accession – Unique – Primary key • Gene.OAccession – Not unique – Foreign key
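The relationship on slide 15 can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the `Organism.Accession` and `Gene.OAccession` columns come from the slide, while the `GeneName` column and the gene names `geneA` to `geneC` are placeholders invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # ask SQLite to enforce the relationship

conn.executescript("""
CREATE TABLE Organism (
    Accession TEXT PRIMARY KEY,   -- unique: one row per organism
    Name      TEXT NOT NULL
);
CREATE TABLE Gene (
    GeneName   TEXT NOT NULL,     -- hypothetical column for illustration
    OAccession TEXT NOT NULL REFERENCES Organism(Accession)  -- not unique
);
""")

conn.executemany(
    "INSERT INTO Organism (Accession, Name) VALUES (?, ?)",
    [("NC_000913", "Escherichia coli K12"),
     ("NC_003098", "Streptococcus pneumoniae R6")],
)
conn.executemany(
    "INSERT INTO Gene (GeneName, OAccession) VALUES (?, ?)",
    [("geneA", "NC_000913"), ("geneB", "NC_000913"), ("geneC", "NC_003098")],
)

# Connect the two tables through the shared accession value.
rows = conn.execute("""
    SELECT Organism.Name, Gene.GeneName
    FROM Organism
    JOIN Gene ON Organism.Accession = Gene.OAccession
    ORDER BY Gene.GeneName
""").fetchall()

for name, gene in rows:
    print(name, gene)
```

Because the accession is unique in `Organism` but repeats in `Gene`, one organism row can be matched by many gene rows, which is the one-to-many relationship the slide describes.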
  • 16. Schema: Representation of Table Organization
  • 17. SQL • ANSI (American National Standards Institute) standard computer language for accessing and manipulating database systems. • SQL statements are used to retrieve and update data in a database. • Includes: – Data Manipulation Language (DML) – Data Definition Language (DDL)
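The split between DDL and DML on slide 17 is easiest to see side by side. The sketch below again uses Python's `sqlite3` module with a cut-down version of the talk's Organism table; the updated GC value is only there to demonstrate `UPDATE` and should be read as illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL (Data Definition Language): define the structure of the table.
conn.execute("""
    CREATE TABLE Organism (
        Name      TEXT,
        Size      INTEGER,            -- genome length in bases
        Gc        REAL,               -- percent GC
        Accession TEXT PRIMARY KEY
    )
""")

# DML (Data Manipulation Language): insert, update, and retrieve rows.
conn.execute(
    "INSERT INTO Organism (Name, Size, Gc, Accession) VALUES (?, ?, ?, ?)",
    ("Escherichia coli K12", 4640000, 50.0, "NC_000913"),
)
conn.execute(
    "UPDATE Organism SET Gc = ? WHERE Accession = ?",
    (50.8, "NC_000913"),
)
row = conn.execute(
    "SELECT Name, Size, Gc FROM Organism WHERE Accession = ?",
    ("NC_000913",),
).fetchone()
print(row)
```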
  • 18. DBMS Advantages • Program-data independence • Minimal data redundancy • Improved data consistency & quality – Access control – Transaction control • Improved accessibility & data sharing • Increased productivity of application development • Enforced standards
  • 19. DBMS • Software package for defining and managing a database. • Examples: – Proprietary: MS Access, MS SQL Server, DB2, Oracle, Sybase – Open source: MySQL, PostgreSQL
  • 21. TreeGenes Database • Encompasses Dendrome Resources, Dendrome Plone, TreeGenes Database & DiversiTree • Nine modules to store and interrelate data for query and analysis in PostgreSQL • Direct resource for nearly 2,500 forest geneticists representing 800 organizations worldwide. Over 6,000 unique visitors in December 2011. • Forest Geneticists Colleague module • Literature module • Transcriptome annotation pipeline and module • Comparative map module • Species module • Sequencing module • Primers module • Genotype/EST module • Phenotype/Expression module • Sample tracking module
  • 22. Genomic Resources: 678 Species Representing 77 Genera
  • 24. CMAP: Obtaining TreeGenes (TG) Accession Number • Add literature data and (first) map file • Obtain TG Accession number! • (optional) Add additional map files
  • 25. Individual features and their locations on map List of features on map
  • 26. GMOD Genome Browser Search and Select data source Tracks can be reordered or hidden as necessary
  • 27. Douglas-fir Transcriptome Resources in TreeGenes
  • 28. Gene Ontology • Gene annotation system • Controlled vocabulary that can be applied to all organisms (protein/RNA) • Used to describe gene products
  • 29. [Three images, each labeled “bud initiation”: Metazoa, Saccharomyces, Viridiplantae]
  • 30. What’s in a name? • The same name can be used to describe different concepts
  • 31. What's in a name? • Glucose synthesis • Glucose biosynthesis • Glucose formation • Glucose anabolism • Gluconeogenesis • All refer to the process of making glucose from simpler components
  • 32. How does GO work? What information might we want to capture about a gene product? • What does the gene product do? • Why does it perform these activities? • Where does it act?
  • 33. The 3 Gene Ontologies • Molecular Function = elemental activity/task – the tasks performed by individual gene products; examples are carbohydrate binding and ATPase activity • Biological Process = biological goal or objective – broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions • Cellular Component = location or complex – subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and RNA polymerase II holoenzyme
  • 34. Ontology Structure • Ontologies can be represented as graphs, where the nodes are connected by edges – Nodes = concepts in the ontology – Edges = relationships between the concepts
  • 35. Ontology Structure • The Gene Ontology is structured as a hierarchical directed acyclic graph (DAG) • Terms can have more than one parent and zero, one or more children • Terms are linked by two relationships – is-a – part-of
  • 36. True Path Rule • The path from a child term all the way up to its top-level parent(s) must always be true • [Diagram: example terms cell, cytoplasm, nucleus, chromosome, nuclear chromosome, cytoplasmic chromosome and mitochondrial chromosome, linked by is-a and part-of edges; nuclear chromosome appears under both chromosome and nucleus]
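The DAG structure and the true path rule from slides 35 and 36 can be sketched with a plain dictionary. The term names come from the slide's diagram, but the specific is-a and part-of edges below are assumptions chosen for illustration, and `ancestors` is a hypothetical helper rather than anything from a GO library.

```python
# Each term maps to its parents as (relationship, parent) pairs.
# A term may have more than one parent, which makes this a DAG, not a tree.
PARENTS = {
    "cell": [],
    "chromosome": [],
    "cytoplasm": [("part-of", "cell")],
    "nucleus": [("part-of", "cell")],
    "nuclear chromosome": [("is-a", "chromosome"), ("part-of", "nucleus")],
    "cytoplasmic chromosome": [("is-a", "chromosome"), ("part-of", "cytoplasm")],
    "mitochondrial chromosome": [("is-a", "cytoplasmic chromosome")],
}


def ancestors(term):
    """Collect every term reachable by following parent edges upward."""
    found = set()
    stack = [term]
    while stack:
        for _relationship, parent in PARENTS[stack.pop()]:
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


# Under the true path rule, a gene product annotated to a term is
# implicitly annotated to all of that term's ancestors as well.
implied = ancestors("nuclear chromosome")
print(sorted(implied))
```

If any term in `implied` would be a false statement about the gene product, the rule says the ontology's structure (not the annotation) needs fixing.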
  • 37. What's in a GO term? • term: gluconeogenesis • id: GO:0006094 • definition: The formation of glucose from noncarbohydrate precursors, such as pyruvate, amino acids and glycerol.
  • 38. Source of Ontology Assignments • IEA: Inferred from Electronic Annotation • ISS: Inferred from Sequence Similarity • IEP: Inferred from Expression Pattern • IMP: Inferred from Mutant Phenotype • IGI: Inferred from Genetic Interaction • IPI: Inferred from Physical Interaction • IDA: Inferred from Direct Assay • RCA: Inferred from Reviewed Computational Analysis • TAS: Traceable Author Statement • NAS: Non-traceable Author Statement • IC: Inferred by Curator • ND: No biological Data available
  • 39. Ontology Development Plant Ontology and Trait Ontology • Plant Ontology – Structure • Needle, Cambium – Growth stages • Trait Ontology – Forest Tree Specific Phenotypes • Wood Density • PATO – Phenotypic Qualities
  • 41. Interwebs 101 • Web 1.0 – Hyperlinks • Web 2.0 – Interactivity, information sharing, user centered design (wikis, blogs, social media) • Web 3.0 – Semantic Web – Data focused – Answer the limitations of HTML – HTML describes documents and the links between them. RDF, OWL, and XML, by contrast, can describe specific things – Machine-readable data and relationships between the data – knowledge processing – deductive reasoning and inference
  • 42. Web Services Development: Communication within TreeGenes • Development of Web Services in cooperation with NSF’s iPlant Cyberinfrastructure Project – Software system to support interoperable machine-to-machine interaction over a network regardless of platform incompatibilities – Web Services Description Language (WSDL) is implemented to relate operations • Service Oriented Architecture (SOA): With SOAP, the basic unit of communication is a message. • Remote Procedure Call (RPC): RPC Web services define a call interface in which the basic unit is the WSDL operation. • Representational State Transfer (REST): REST uses HTTP by constraining the interface to standard operations (like GET, POST, PUT, DELETE for HTTP). The focus is on interacting with stateful resources, rather than messages or operations.
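The REST idea on slide 42, a fixed set of standard operations applied to addressable resources, can be sketched without any network code. Everything below is a hypothetical illustration: the `handle` function, the `/markers/m1` path, and the in-memory store are assumptions and do not describe the TreeGenes web services. POST is left out to keep the sketch short.

```python
# In-memory stand-in for a collection of stateful resources, keyed by path.
resources = {}


def handle(method, path, body=None):
    """Dispatch a request using only standard HTTP verbs; returns (status, body)."""
    if method == "GET":
        if path in resources:
            return 200, resources[path]
        return 404, None
    if method == "PUT":
        status = 200 if path in resources else 201  # replaced vs. created
        resources[path] = body
        return status, body
    if method == "DELETE":
        if path in resources:
            del resources[path]
            return 204, None
        return 404, None
    return 405, None  # verb outside the constrained interface


created = handle("PUT", "/markers/m1", {"species": "Pinus taeda"})
fetched = handle("GET", "/markers/m1")
deleted = handle("DELETE", "/markers/m1")
missing = handle("GET", "/markers/m1")
print(created, fetched, deleted, missing)
```

The client never names an operation such as "getMarker"; it only combines a standard verb with a resource path, which is the contrast with the RPC and SOAP styles listed on the slide.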
  • 43. SSWAP Ontology Creating and Contributing to Existing Servlets for Common Genomic Types
  • 44.
  • 45. Forest Tree Genetic Stock Center
  • 46. Bulk Retrieval Window Components • Data & Annotation Selection Fields • Bulk Retrieval Window
  • 47. TreeGenes Sample Tracking System • Accurately track samples through collection, DNA extraction, and genotyping • Provide a standard and efficient method to collect and store phenotypic data • Provide a public interface to readily query raw genotype, phenotype, and association results (DiversiTree) • Provide interfaces and database backend to support a DNA distribution center (UCD)
  • 48. Population Genetics: Association Studies, Landscape Genomics • Currently no other repositories to target association data with geo-referenced data • dbGaP • Dryad • Starting with enforcement at the journal level: Tree Genetics and Genomes
  • 49.
  • 50. GenSAS development with Content Management: Plone and Drupal • login/signup panel • query sequence panel • data retrieval panel • tool selection panel • task queue panel
  • 51. GenSAS development: Multiple Gene Prediction Tracks • overview track • control track • sequence track • evidence tracks • custom track • function track • message box
  • 52. GenSAS integration with Gbrowse Prototyped with Peach Genome in GDR
  • 53. Analysis Resources Custom Databases
  • 54. Integrating Tools into TreeGenes Galaxy
  • 56. Fluxes of CO2 and H2O: FLUXNET and AmeriFlux • Free Air CO2 Enrichment (FACE)
  • 57. TRY – Global Database of Plant Traits • Scientists compiled three million traits for 69,000 out of the world's ~300,000 plant species. • Worldwide collaboration of scientists from 106 research institutions • TRY is hosted at the Max Planck Institute for Biogeochemistry in Jena (Germany) – Jointly coordinated with: • University of Leipzig (Germany) • IMBIV-CONICET (Argentina) • Macquarie University (Australia) • CNRS and University of Paris-Sud (France)

Editor's notes

  1. How large is a typical genome? There is no simple answer, of course, because organisms vary widely in genome size. Arabidopsis, the tiny model species in the mustard family, was the first plant to have a fully sequenced genome. It sports a genome of 160 million bases. Poplar, the first tree sequenced, has about 480 million bases in its genome. The corn genome has nearly 2.5 billion bases, and the human genome around 3 billion. Genome size for conifers is substantially greater. In the figure above, genome size is given for 181 gymnosperms (mostly conifers). They vary in size from 6 to well over 30 billion bases. Some members of the lily family exceed 100 billion bases in size. Explanations for why genome size varies as it does among organisms are many and often speculative.