Robinson bosc2010 bio_hdf


BOSC 2010

The document discusses BioHDF, an open-source project to develop binary file formats for storing next-generation sequencing (NGS) data. It addresses the challenges of very large and varied NGS data by proposing a flexible data model, an efficient file format, and a software toolkit. BioHDF uses the HDF5 file format and is led by Geospiza with involvement from The HDF Group. It aims to provide a portable, high-performance solution for NGS data storage and analysis.


The HDF Group

BioHDF
Open Binary File Formats for
Next-Generation Sequencing Data
Current Status and Future Directions

Dana Robinson
The HDF Group
derobins@hdfgroup.org

Copyright © 2010 The HDF Group. All Rights Reserved
July 9, 2010 1 www.hdfgroup.org

NGS Data Challenges

Very large quantities of data
(100s of GB)

"Drinking from the firehose"

Analysis methods vary greatly, so a flexible yet unified
data store would be useful.


What is Needed

A Data Model
A data model which accurately describes the data and can
be expanded to contain new types of data

A Data Store
A file format or data store which is efficient in access time
and storage size and which scales well

A Toolkit
A flexible software toolkit that can be used to create tools
and pipelines based on the data model and file format


What is BioHDF?
An open-source, community-driven project, funded by an NIH
SBIR grant and led by Geospiza, Inc. in collaboration with
The HDF Group.

BioHDF is a particular arrangement of objects in an HDF5
file (similar to a database schema)

BioHDF is a library and C API which can be used to write
applications (coming soon)

BioHDF is a set of command line tools for
storing, retrieving and manipulating data in BioHDF files

HDF = Hierarchical Data Format

An example of how data is stored in HDF5

somefile.h5
  /               (root group)
    Reads/        (group)
    Alignments/   (group; attribute: is_sorted)
    References    (dataset)

Diagram labels: groups, datasets, attributes
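The three kinds of HDF5 objects named on this slide can be sketched with a toy in-memory model. This is plain Python for illustration only, not the HDF5 or BioHDF API; the class names Group and Dataset, the methods, and the sample values are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Dataset:
    """Toy stand-in for an HDF5 dataset: stored values plus attributes."""
    values: List[object]
    attrs: Dict[str, object] = field(default_factory=dict)


@dataclass
class Group:
    """Toy stand-in for an HDF5 group: a container of groups and datasets."""
    children: Dict[str, Union["Group", Dataset]] = field(default_factory=dict)
    attrs: Dict[str, object] = field(default_factory=dict)

    def require_group(self, path: str) -> "Group":
        """Return the group at a slash-separated path, creating missing groups."""
        node = self
        for name in filter(None, path.split("/")):
            child = node.children.setdefault(name, Group())
            if not isinstance(child, Group):
                raise TypeError(f"{name!r} is a dataset, not a group")
            node = child
        return node

    def lookup(self, path: str) -> Union["Group", Dataset]:
        """Walk a slash-separated path down from this group."""
        node: Union["Group", Dataset] = self
        for name in filter(None, path.split("/")):
            if not isinstance(node, Group):
                raise KeyError(path)
            node = node.children[name]
        return node


# Rebuild the layout from the slide (sample values are made up).
root = Group()
root.require_group("Reads")
alignments = root.require_group("Alignments")
alignments.attrs["is_sorted"] = True          # attribute attached to a group
root.children["References"] = Dataset(values=["ACGT", "TTGA"])
```

Objects are addressed by path, much like files in a directory tree, and attributes are small pieces of metadata attached to a group or dataset.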


Benefits of BioHDF
• Portability and data sharing:
Platform independent, endian independent, self-describing,
common data models.

• High performance:
Fast random access and efficient, scalable, petabyte-level
compressed storage.

• Widespread adoption:
MATLAB, IDL, NASA Earth Observing System, Pacific
Biosciences, SOLiD, hundreds of products.

• 20-year history:
Robust, performance-tuned, and well supported by The HDF
Group, an independent non-profit entity.

HDF in Bioinformatics

• Baylor Imaging Group
• Life Technologies
• Pacific Biosciences
• Oxford Nanopore
• GenomeData (UW)
• Geospiza
• Others


Data Stored

The prototype BioHDF stores

Reads

Alignments

Annotations

Clusters of Aligned Reads

Reference Sequences

Indexes (NCList or simple)


Data Stored

Additional user-specific data can be stored without breaking
the library or tools.

This is similar to how adding additional tables to a
database schema does not invalidate existing queries.

(Diagram: BioHDF Data alongside User-Specific Data)
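The schema analogy can be illustrated with a small sketch. It uses nested dictionaries as a stand-in for a file's group hierarchy; the names `sequences`, `UserData`, and `lab_notes`, and the `count_reads` tool, are hypothetical and not part of BioHDF.

```python
def count_reads(file_tree: dict) -> int:
    """A hypothetical tool that only touches the paths it knows about."""
    return len(file_tree["Reads"]["sequences"])


# Stand-in for a file that holds only the standard layout.
biohdf_file = {
    "Reads": {"sequences": ["ACGT", "GGCA", "TTAG"]},
    "Alignments": {},
}
before = count_reads(biohdf_file)

# A user adds their own group next to the standard ones.
biohdf_file["UserData"] = {"lab_notes": ["run 12 repeated"]}
after = count_reads(biohdf_file)
```

Because the tool addresses its data by path, the extra group is simply never visited, in the same way an existing query ignores a newly added table.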


Project Stages

A "pipeline prototype" set of tools to demonstrate the
suitability of HDF5 for NGS data storage.

A version 1.0 release of a BioHDF library and C API targeting
the functionality of samtools.

A higher-level C API that abstracts out and hides the
underlying storage technology.


HDF5 API and Applications

BioHDF Applications and
Wrappers (e.g. Perl, Python)

High-Level API

BioHDF API

HDF5 API

Physical Storage


A Higher-Level API

A high-level API will encapsulate and hide the underlying
storage technology.

(Diagram labels: low-level C APIs (BioHDF API, BAM API),
samtools wrapper, high-level C API, tool)
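The idea of hiding the storage technology behind a higher-level API can be sketched as an abstract interface with interchangeable backends. Everything here is an assumption for illustration: the interface name `AlignmentStore`, the record shape, and both backends, which are in-memory stand-ins rather than BioHDF or BAM readers. The planned API is in C; Python is used only to keep the sketch short.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

# Hypothetical record: (reference name, start, end), half-open coordinates.
Alignment = Tuple[str, int, int]


class AlignmentStore(ABC):
    """Hypothetical high-level interface; tools depend only on this."""

    @abstractmethod
    def fetch(self, reference: str, start: int, end: int) -> List[Alignment]:
        """Return alignments on `reference` overlapping [start, end)."""


class FlatListStore(AlignmentStore):
    """Stand-in for one backend: a flat list scanned on every query."""

    def __init__(self, records: List[Alignment]) -> None:
        self._records = list(records)

    def fetch(self, reference: str, start: int, end: int) -> List[Alignment]:
        return [r for r in self._records
                if r[0] == reference and r[1] < end and r[2] > start]


class PerReferenceStore(AlignmentStore):
    """Stand-in for another backend: records grouped by reference name."""

    def __init__(self, records: List[Alignment]) -> None:
        self._by_ref: Dict[str, List[Alignment]] = {}
        for rec in records:
            self._by_ref.setdefault(rec[0], []).append(rec)

    def fetch(self, reference: str, start: int, end: int) -> List[Alignment]:
        return [r for r in self._by_ref.get(reference, [])
                if r[1] < end and r[2] > start]


def count_overlaps(store: AlignmentStore, reference: str,
                   start: int, end: int) -> int:
    """A tool written against the interface, unaware of the backend."""
    return len(store.fetch(reference, start, end))


records = [("chr1", 100, 150), ("chr1", 400, 450), ("chr2", 120, 170)]
flat_count = count_overlaps(FlatListStore(records), "chr1", 90, 200)
grouped_count = count_overlaps(PerReferenceStore(records), "chr1", 90, 200)
```

The tool function never names a backend, so swapping the storage layer does not require changing the tool, which is the property the slide describes.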


Acknowledgements

Geospiza
Todd Smith
Mark Welsh

The HDF Group
Mike Folk

BioHDF is supported by NIH SBIR Phase II grant HG003792
awarded to Geospiza, Inc.


The HDF Group

Thank you for your time!
If you are interested in using or contributing to
BioHDF, please contact us!

Dana Robinson (derobins@hdfgroup.org)

http://www.biohdf.org

BOSC BoF: Friday 5:10-6:00

ISMB Poster J18: Monday, July 12: 12:40-2:30

ISMB BoF: Tuesday, July 13 1-2 pm, room 306


Editor's notes

My goal here is to show people how data is stored in HDF5 (groups, datasets, attributes), not to speak about NGS data storage in BioHDF. I get the impression that people have little understanding of what HDF5 is so I'd like to give them a bare-bones overview.
People will be discouraged from using the HDF5 API directly because doing so would encourage them to meddle with low-level data elements that can change, which would make their software more brittle.
A first implementation of this will probably be at the linker level (e.g. samtools-biohdf and samtools-bam). Further down the road, we might implement a plugin architecture to handle this.