Prospects for enabling phylogenetically informed comparative biology on the web

Todd Vision (1,2) & Hilmar Lapp (1)
1 U.S. National Evolutionary Synthesis Center
2 Dept. of Biology, University of North Carolina at Chapel Hill

Suppose you have the sequence of a protein-coding gene, and are interested in its function. What is the first thing you would do?
• If it were me, I would search for conserved domains that match records in Pfam and other protein domain databases.
• Are these databases complete?
• Are they infallible?
• Are they still useful?




Why are these data useful?
• You needn’t have mastery of the specialist literature before the search
• A match connects you to a vast interconnected world of information
• Why not worry about completeness?
  ! A negative result is not expensive
  ! Many broadly useful records are already present
• Why not worry about fallibility?
  ! The user can weigh the evidence once a match is found
  ! Assertions should be exposed to scrutiny




Some observations
• This infrastructure is designed to disseminate data to non-specialists
• The relevant data may be derived from multiple “studies”, not all of which are published
• Data is hoarded neither by the researcher nor by the domain database
• The search service is as widely disseminated as the data
• Semantic-level machine-to-machine communication facilitates human comprehension

The case of phylogenetic data
• There is a broad audience for phylogenetic data
  ! Organismal phylogeny (e.g. Encyclopedia of Life)
  ! Gene/protein trees
• Many of the available resources are geared toward specialist researchers & students
• Non-specialists turn to taxonomic classifications when they need organismal phylogenetic information
• Few know where to find gene/protein trees at all




[Screenshots: TreeBASE; Tree of Life Web Project]




The NCBI taxonomy
• Provides
  ! A hierarchy for all species represented by DNA sequences in GenBank
  ! Names and IDs for internal nodes
  ! An FTP dump
• But does NOT
  ! Include unsequenced species
  ! Report confidence in topology or monophyly
  ! Capture taxonomic nuance (it has synonyms & common names)
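
The "hierarchy", "names and IDs", and "FTP dump" points can be made concrete with a small sketch. It assumes the dump layout as commonly documented: a nodes file whose first three fields are tax ID, parent tax ID, and rank; a names file whose fields are tax ID, name, unique name, and name class; fields separated by tab-pipe-tab; and a root node that is its own parent. The rows and the IDs 10, 20, and 30 are invented for illustration and are not real NCBI identifiers.

```python
# Toy rows in an NCBI-style dump layout (IDs 10/20/30 are made up for this sketch).
NODES_DMP = (
    "1\t|\t1\t|\tno rank\t|\n"
    "10\t|\t1\t|\tsuperorder\t|\n"
    "20\t|\t10\t|\tfamily\t|\n"
    "30\t|\t20\t|\tspecies\t|\n"
)
NAMES_DMP = (
    "1\t|\troot\t|\t\t|\tscientific name\t|\n"
    "10\t|\tOstariophysi\t|\t\t|\tscientific name\t|\n"
    "20\t|\tCyprinidae\t|\t\t|\tscientific name\t|\n"
    "30\t|\tDanio rerio\t|\t\t|\tscientific name\t|\n"
    "30\t|\tzebrafish\t|\t\t|\tgenbank common name\t|\n"
)


def parse_dmp(text):
    """Split each row of a .dmp-style dump into a list of stripped fields."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        body = line
        # Drop the trailing "\t|" row terminator before splitting on the separator.
        if body.endswith("\t|"):
            body = body[:-2]
        rows.append([field.strip() for field in body.split("\t|\t")])
    return rows


# child tax ID -> parent tax ID, and tax ID -> rank
parent = {}
rank = {}
for tax_id, parent_id, node_rank in (row[:3] for row in parse_dmp(NODES_DMP)):
    parent[tax_id] = parent_id
    rank[tax_id] = node_rank

# Keep only the scientific name for each ID; synonyms and common names are skipped.
sci_name = {
    row[0]: row[1]
    for row in parse_dmp(NAMES_DMP)
    if row[3] == "scientific name"
}


def lineage(tax_id):
    """Return the names from the root down to `tax_id`."""
    path = []
    while True:
        path.append(sci_name.get(tax_id, tax_id))
        if parent[tax_id] == tax_id:  # assumed convention: the root is its own parent
            break
        tax_id = parent[tax_id]
    return list(reversed(path))


print(lineage("30"))
```

Note what the sketch cannot answer: nothing in these two tables says how well supported an internal node is, which is the gap the slide points out.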




What if the NCBI taxonomy…
• Listed all taxa, including fossils?
• Allowed one to assess where there are conflicting topologies?
• Reported support values for clades?
• Reported divergence time estimates for nodes (e.g. from TimeTree)?
• Reported the provenance of the data?

Node-oriented web services from the Tree of Life Web Project
• Name
• Description
• Authority
• Date
• Other names
• Completeness of children
• Extinction status
• Confidence of position
• Monophyly




Further barriers to dissemination of phylogenetic information
• Technical obstacles
  ! Technology for storing and querying trees
  ! Difficulties with exchange standards
  ! Inference of consensus trees and supertrees
  ! Taxonomic intelligence
  ! Globally unique identifiers
• Social obstacles
  ! Reluctance to provide incomplete or fallible information

Outline
• Informatics @ NESCent
• An example of a phylogenetically-informed semantic web application for phenotype data
• Promoting interoperability and closing technical gaps in phyloinformatics through open development




NESCent sponsored science
• Catalysis Meetings (large, one-time events)
  ! To foster new collaborations and synthetic research
• Working Groups
  ! Smaller, focused, multiple meetings
• Sabbatical Scholars
• Postdoctoral fellows
• Short-term visitor program
  ! 2 weeks to 3 months
  ! Encourage collaborative projects
• Application info: http://www.nescent.org




Evolutionary Informatics WG
• Organizers: Arlin Stoltzfus and Rutger Vos
• Selected goals:
  ! XML serialization of NEXUS
  ! Formal grammar for validation and interconversion of NEXUS & other formats
  ! A transition model language for evolutionary models used in statistical inference
  ! An ontology for evolutionary comparative data analysis
• http://www.nescent.org/wg_evoinfo

NESCent Informatics
• Support for sponsored science and scientists
  ! Facilitating electronic collaboration
  ! Software/database development
  ! Providing HPC and other IT infrastructure
• Cyberinfrastructure for synthetic science
  ! Data sharing
  ! Software interoperability
  ! Training
  ! In partnership with major national and international efforts




GeoPhyloBuilder
“Putting the geography into phylogeography”
David Kidd & Xianhua Liu
• Extension for ArcGIS Software that creates a spatiotemporal GIS network model from a tree with georeferenced nodes.
• 3D visualizations are possible through ArcSCENE.
• http://www.nescent.org/informatics/software.php

Phylogenetic cyberinfrastructure to enable comparative biology
• Two traditions in the recording of phenotype data
  ! Natural language descriptions and character matrices
  ! Statements made using anatomical and trait ontologies, designed to capitalize on the semantic web
• NESCent WG on morphological evolution in fish
  ! Organized by Paula Mabee and Monte Westerfield
  ! Led to a larger project
• Aim is to integrate
  ! Mutant phenotype data for zebrafish
  ! Comparative morphology data for the Ostariophysi




Ontologies
• Defined terms with defined relationships
  ! e.g. Gene Ontology, Cell Ontology
[Diagram: membrane part_of cell; cell projection part_of cell; axolemma is_a membrane; axon is_a cell projection; axolemma part_of axon]

Describing phenotypes using ontologies
• Entity-Quality system (EQ)
• Entity term from an anatomy ontology
  ! zebrafish anatomy, cell ontology, etc.
• Quality term from Phenotype and Trait Ontology (PATO)
• e.g. Entity=dorsal fin, Shape=round
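
As a rough illustration of "defined terms with defined relationships", the sketch below stores the diagram's terms as typed edges and pairs an entity with a quality as in the EQ example. The edge list is one reading of the diagram (membrane and cell projection part_of cell, axolemma is_a membrane, axon is_a cell projection, axolemma part_of axon). The `ancestors` helper and the `EQ` tuple are hypothetical names, and following is_a and part_of interchangeably is a simplification of real ontology reasoning.

```python
from collections import namedtuple

# (child, relation, parent) edges, as read from the diagram on the slide.
EDGES = [
    ("membrane", "part_of", "cell"),
    ("cell projection", "part_of", "cell"),
    ("axolemma", "is_a", "membrane"),
    ("axon", "is_a", "cell projection"),
    ("axolemma", "part_of", "axon"),
]


def ancestors(term, relations=("is_a", "part_of")):
    """Return every term reachable from `term` by following the given relations."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child, relation, parent in EDGES:
            if child == current and relation in relations and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


# An EQ statement pairs an entity term with a quality term.
EQ = namedtuple("EQ", ["entity", "quality"])
phenotype = EQ(entity="dorsal fin", quality="round")

print(sorted(ancestors("axolemma")))
print(sorted(ancestors("axolemma", relations=("is_a",))))
print(phenotype)
```

Restricting the traversal to `is_a` answers "what kind of thing is this?", while including `part_of` answers "where does it sit?". The typed relations are what make a query by entity meaningful across annotations.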




Phenotype and Trait Ontology (PATO)
[Diagram: fragment of the PATO hierarchy, with terms including physical quality, optical quality, chromatic property, color, blue, green, bright blue, dark blue, amplitude, buoyancy]

Evolutionary character matrices
• Common phenotypic data format in evolutionary biology (e.g. NEXUS)
• Characters + character states, similar to EQ

                 dorsal fin shape   character 2
Species one      round              state
Species two      pointed            state
Species three    undulate           state
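
To show why characters and character states are "similar to EQ", here is a minimal sketch that rewrites the dorsal fin shape column of the example matrix as entity/attribute/value statements. The mapping from a character label to an (entity, attribute) pair is written by hand here and is an assumption of this sketch; in practice that step is curation work, not a string operation.

```python
# The example matrix from the slide, restricted to its one named character.
MATRIX = {
    "Species one": {"dorsal fin shape": "round"},
    "Species two": {"dorsal fin shape": "pointed"},
    "Species three": {"dorsal fin shape": "undulate"},
}

# Hypothetical, hand-curated decomposition of each character label.
CHARACTER_TO_ENTITY_ATTRIBUTE = {
    "dorsal fin shape": ("dorsal fin", "shape"),
}


def matrix_to_eq(matrix, character_map):
    """Turn each (taxon, character, state) cell into an entity/attribute/value record."""
    statements = []
    for taxon, states in matrix.items():
        for character, state in states.items():
            entity, attribute = character_map[character]
            statements.append(
                {"taxon": taxon, "entity": entity, "attribute": attribute, "value": state}
            )
    return statements


for statement in matrix_to_eq(MATRIX, CHARACTER_TO_ENTITY_ATTRIBUTE):
    print(statement)
```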




Character Matrix vs. EQ
[Diagram: a character and its state (e.g. dorsal fin shape: round) decompose into Entity, Attribute, and Value; the Entity comes from an anatomy ontology (AO), and the Attribute and Value together form the Quality, from PATO]

A scenario
• A geneticist observes a reduction in the number of a particular bone type (e.g. branchiostegal ray) in a zebrafish mutant of her favorite gene.
• She asks: is this bone variable in number among species in nature?
• She could query the evolutionary phenotype database using:
  ! Entity = Branchiostegal ray (from TAO)
  ! Qualities pertaining to attribute ‘count’ (from PATO)




• She could examine a visualization of the phylogenetic relationships of the taxa with the relevant character changes mapped.
• She would see that most Ostariophysi have 3 rays, but that reduction has occurred multiple times:
  ! solenostomids and syngnathids (ghost pipefishes and pipefishes)
  ! giganturids
  ! saccopharyngoid (gulper and swallower) eels

• By examining additional changes on these same branches, she sees several parallelisms:
  ! loss of the swimbladder, pelvic fins, and scales
  ! elongation of the mandibular or hyoid arches
  ! reduction or loss of the opercle in syngnathids and saccopharyngoids
  ! a variety of other bones and soft tissues are lost or greatly modified
• She might hypothesize that these trait correlations are all due to alterations in the expression of the same suite of morphogens.
• She can select appropriate species from these lineages to follow up experimentally.
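
The scenario's query (entity = branchiostegal ray, qualities pertaining to count) can be sketched against a toy list of annotations. Everything below is illustrative: the taxa are placeholders, the values are invented apart from the typical count of 3 taken from the slide, and the function names are hypothetical rather than part of any existing database interface.

```python
from collections import Counter

# Invented annotations; "taxon A/B/C" are placeholders, not real taxa.
ANNOTATIONS = [
    {"taxon": "taxon A", "entity": "branchiostegal ray", "attribute": "count", "value": 3},
    {"taxon": "taxon B", "entity": "branchiostegal ray", "attribute": "count", "value": 3},
    {"taxon": "taxon C", "entity": "branchiostegal ray", "attribute": "count", "value": 1},
    {"taxon": "taxon C", "entity": "opercle", "attribute": "size", "value": "reduced"},
    {"taxon": "taxon A", "entity": "dorsal fin", "attribute": "shape", "value": "round"},
]


def query(annotations, entity, attribute):
    """Select annotations for one entity term and one quality attribute."""
    return [a for a in annotations if a["entity"] == entity and a["attribute"] == attribute]


# Step 1: is the bone variable in number among species?
hits = query(ANNOTATIONS, "branchiostegal ray", "count")
counts = {a["taxon"]: a["value"] for a in hits}
is_variable = len(set(counts.values())) > 1

# Step 2: which taxa fall below the most common count?
typical = Counter(counts.values()).most_common(1)[0][0]
reduced_taxa = sorted(t for t, v in counts.items() if v < typical)

# Step 3: what else is annotated for those taxa (candidate parallel changes)?
co_occurring = [
    a for a in ANNOTATIONS
    if a["taxon"] in reduced_taxa and a["entity"] != "branchiostegal ray"
]

print(is_variable, typical, reduced_taxa)
print(co_occurring)
```

This flat lookup ignores the phylogeny. Deciding that reduction "has occurred multiple times" requires mapping the states onto a tree, which is the part the visualization in the scenario supplies.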




What data are needed to enable this scenario?
• Anatomy and trait ontologies
• Phenotypes in EQ syntax for
  ! Zebrafish mutants (already exist)
  ! Species/clades of Ostariophysi
• Phylogenetic relationships among the Ostariophysi
  ! Taxonomy ontology

Some anatomical ontologies
• Amphibia
• C. elegans
• Fish (zebrafish, medaka, teleosts)
• Insects (Drosophila, Mosquito, Hymenoptera)
• Mammals (mouse, human)
• Plants (Arabidopsis, cereals, maize, all plants)




[Diagram: project organization around the EQSYTE database]
• NESCent (Vision, Lapp, Software Developers): working groups, curator interface, EQSYTE database, EQSYTE public interface
• U. Oregon (Westerfield): usability testing, liaison to ZFIN, liaison to NCBO
• USD (Mabee, Data Curator)
• Tulane U. (Rios/Ontology Curator): liaison to CToL
• Morphology collaborators (Arratia, Coburn, Hilton, Lundberg, Mayden)
• Ichthyology community (DeepFin, Fishbase)
• NCBO: applications (Phenote, OBO-Edit)
• OBO (host of TAO, PATO, taxonomy ontology)
• Phenotype Ontologies for Evolutionary Biology Workshops
• EQSYTE contents: Ostariophysan phenotypic data; zebrafish phenotypic & genetic data; ontologies (taxonomy, TAO, PATO, homology)

Preserving published data for future integration efforts
• Sequence alignments (e.g. Treebase)
• Long-term population records (e.g. pedigrees)
• 2D and 3D images
• Collection and locality information
• Behavioral observations
• Numerical tables
• Etc.
• Most of these data are lost upon publication
• These are the stuff of comparative biology




Dryad: A digital repository for published data in evolutionary biology
NCSU Digital Library Initiative

Journals and societies involved so far
• American Naturalist (ASN)
• Evolution (SSE)
• Journal of Evolutionary Biology (ESEB)
• Integrative and Comparative Biology (SICB)
• Molecular Biology and Evolution (SMBE)
• Molecular Ecology
• Molecular Phylogenetics and Evolution
• Systematic Biology (SSB)




Open development
• Open source refers only to the licensing of the software code
• At NESCent, we have been experimenting with practices in open development
  ! Community contributes to a shared code base
  ! Higher barrier to entry
  ! Can be a substantial payoff in terms of interoperability, functionality, usability, maintenance
  ! Surprisingly rare in academia

2006 Phyloinformatics Hackathon
ATV, NCL, NESCent, HyPhy, PAUP*, CIPRES, GARLI, TreeBase, Bio::CDAT, Biojava, BioSQL, JEBL, Bioruby, BioPerl, Biopython




Hackathon mechanics
• Before the meeting
  ! Participants and users suggested integrative workflows
• At the meeting
  ! Gaps in existing toolkits were identified
  ! Subgroups collaborated on high priority targets
  ! Followed a “use case” model
  ! Subgroups and targets were allowed to be fluid
  ! Users were on hand to provide datasets, test code, provide their perspective
  ! Dedicated participants tasked with documentation
• All code is open-source and deposited in established repositories




Accomplishments
• Sequence family evolution
  ! BioPerl: Support for TribeMCL, QuickTree, ClustalW, Phylip, PAML
  ! BioPerl & Biopython: Support for dN/dS-based tests for selection in HyPhy
  ! Biojava: Parser for Phylip alignment format
  ! BioRuby: Support for T-Coffee, MAFFT, and Phylip
• Reconciling trees
  ! BioPerl: Support for NJTree
  ! Biopython: Wrapper for Softparsmap
  ! BioRuby: Model for phylogenetic trees and networks with graph algorithms
  ! BioSQL: Model for phylogenetic trees and networks with optimization methods and topological queries




• Phylogenetic inference on non-molecular characters
  ! BioPerl: Interoperability between Bio::Phylo and BioPerl APIs
  ! BioRuby: NEXUS-compliant data model and parser for PAUP and TNT results
• Phylogenetic footprinting
  ! BioPerl: Support for Footprinter, PhastCons, and using ClustalW over a sliding window
• Estimation of divergence times
  ! BioPerl: Draft design of r8s wrapper
• NEXUS compliance
  ! Biojava: Interoperability between Biojava and JEBL
  ! Biojava & BioRuby: Level II-compliant NEXUS parsers
  ! All:
    ! Evaluated major APIs
    ! Proposed compliance levels
    ! Gathered test files exposing common errors
    ! Fixed compliance issues in NCL and Bio::NEXUS reference implementations
    ! Worked on integrating those into GARLI and BioPerl, respectively




Next hackathon
• Comparative Phylogenetic Methods in R
• December 10-14, 2007
• Organizers: S. Kembel, H. Lapp, B. O'Meara, S. Price, T. Vision, A. Zanne
• http://hackathon.nescent.org/R_Hackathon_1
• Have an idea for a future event? Submit a whitepaper!

• Student internships in open-source software development
  ! Students work with any of a large number of established OS projects
  ! Students and mentors work & communicate remotely
• NESCent recruited mentors and oversaw student progress
  ! Eleven students worked on projects in visualization, usability, interoperability & implementation of new methods




NEXML
• Student: Jason Caravas
• Mentor: Rutger Vos
• Flexible serialization of phylogenetic objects
• Perl Bio::Phylo module tools for NEXML parsing and serialization

Command-line BioSQL
• Student: Jamie Estill
• Mentor: Hilmar Lapp
• Commands for
  ! Database initialization
  ! Bio::TreeIO import
  ! Bio::TreeIO export
  ! Tree query
  ! Tree optimization
  ! Tree manipulation
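
To give a feel for what a "tree query" against a relational store involves, here is a minimal sketch using Python's built-in sqlite3 module and a recursive query over a parent-pointer table. This is not the BioSQL schema or the student's command set: the `node` table, its columns, the toy tree, and the `descendant_tips` helper are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical adjacency-list layout: each node stores a pointer to its parent.
conn.execute(
    "CREATE TABLE node ("
    " node_id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES node(node_id),"
    " label TEXT)"
)
conn.executemany(
    "INSERT INTO node VALUES (?, ?, ?)",
    [
        (1, None, "root"),
        (2, 1, "clade X"),
        (3, 2, "species one"),
        (4, 2, "species two"),
        (5, 1, "species three"),
    ],
)


def descendant_tips(connection, node_id):
    """Return the labels of all leaf nodes at or below `node_id`, sorted by label."""
    rows = connection.execute(
        """
        WITH RECURSIVE subtree(node_id) AS (
            SELECT ?
            UNION ALL
            SELECT node.node_id
            FROM node JOIN subtree ON node.parent_id = subtree.node_id
        )
        SELECT label FROM node
        WHERE node_id IN (SELECT node_id FROM subtree)
          AND node_id NOT IN (SELECT parent_id FROM node WHERE parent_id IS NOT NULL)
        ORDER BY label
        """,
        (node_id,),
    ).fetchall()
    return [label for (label,) in rows]


print(descendant_tips(conn, 2))
print(descendant_tips(conn, 1))
```

A parent-pointer table is easy to load but makes every subtree query recursive. Precomputing extra indexes over the tree is one way to trade load time for query speed, which may be what "tree optimization" refers to, though the slide does not say.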




Conservation of phylogenetic diversity
• Student: Klaas Hartmann
• Mentor: Tobias Thierer
• Implementation of algorithm and GUI for optimal allocation of a finite budget to individual species to maximize phylogenetic diversity.




Bayesian calibration of divergence times
• Student: Michael Nowak
• Mentor: Derrick Zwickl
• Fossil occurrence data is used to construct informative priors on divergence times for Bayesian analysis in, e.g. BEAST

Phyloinformatics Summer Course
• Teaching advanced programming skills to phylogenetic methods developers
• Focus is on software technologies rather than methodology
• First year
   ! 10 days in July 2007
   ! Organized by Bill Piel of TreeBASE
   ! 8 co-instructors
   ! 23 students (11 female) in the first year




Conclusions
• The future of web-enabled comparative biology is beginning to become clearer.
   ! For a preview, see genomics!
• The facile exchange of phylogenetic data is what will enable it.
• Expect to be using technologies such as ontologies and web services, which are now largely foreign to phylogenetic researchers.
• Also expect a shift toward open development.
   ! This will necessitate new modes of training for academic phyloinformaticists.

Additional acknowledgements
• Hackathon participants
• GSoC mentors and students
• Summer course instructors
• Phenotype evolution project
   ! Jim Balhoff, Wasila Dahdul, John Lundberg, Paula Mabee, Peter Midford, Monte Westerfield
• Data depository:
   ! Ryan Scherle, Jane Greenberg





Data Mining GenBank for Phylogenetic inference - T. Vision
