The document discusses BioHDF, an open-source project to develop binary file formats for storing next-generation sequencing data. It addresses the challenges of very large and varied NGS data by proposing a flexible data model, efficient file format, and software toolkit. BioHDF uses the HDF5 file format and is led by Geospiza with involvement from The HDF Group. It aims to provide a portable, high performance solution for NGS data storage and analysis.
2. NGS Data Challenges
Very large quantities of data
(100s of GB)
"Drinking from the firehose"
Analysis methods vary greatly, so a flexible yet unified
data store would be useful.
July 9, 2010 2 www.hdfgroup.org
3. What is Needed
A Data Model
A data model which accurately describes the data and can
be expanded to contain new types of data
A Data Store
A file format or data store which is efficient in access time
and storage size and which scales well
A Toolkit
A flexible software toolkit that can be used to create tools
and pipelines based on the data model and file format
4. What is BioHDF?
An open-source, community-driven project, funded by an NIH
SBIR grant and led by Geospiza, Inc. in collaboration with
The HDF Group.
BioHDF is a particular arrangement of objects in an HDF5
file (similar to a database schema)
BioHDF is a library and C API which can be used to write
applications (coming soon)
BioHDF is a set of command line tools for
storing, retrieving and manipulating data in BioHDF files
5. HDF = Hierarchical Data Format
An example of how data is stored in HDF5
[Diagram: the file somefile.h5 with the root group / containing Reads/, Alignments/ (labeled is_sorted), and References; callouts mark the groups, datasets, and attributes.]
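As a rough illustration of the three HDF5 concepts named on this slide (groups, datasets, attributes), the sketch below models the hierarchy in plain Python. It does not use the HDF5 library or the BioHDF API. The class names, the dataset names `sequences` and `positions`, and the reading of `is_sorted` as an attribute attached to `Alignments` are assumptions made for illustration only.

```python
class Node:
    """Base for objects in the hierarchy; every object can carry attributes."""

    def __init__(self):
        self.attrs = {}


class Dataset(Node):
    """A named array of values (modeled here as a plain list)."""

    def __init__(self, data):
        super().__init__()
        self.data = list(data)


class Group(Node):
    """A container of named children, similar to a directory."""

    def __init__(self):
        super().__init__()
        self.children = {}

    def create_group(self, name):
        group = Group()
        self.children[name] = group
        return group

    def create_dataset(self, name, data):
        dataset = Dataset(data)
        self.children[name] = dataset
        return dataset

    def lookup(self, path):
        """Resolve a slash-separated path such as '/Alignments/positions'."""
        node = self
        for part in path.strip("/").split("/"):
            if part:
                node = node.children[part]
        return node


# Build a layout loosely following the slide's somefile.h5 example.
root = Group()
reads = root.create_group("Reads")
alignments = root.create_group("Alignments")
alignments.attrs["is_sorted"] = True                  # attribute on a group (assumed)
reads.create_dataset("sequences", ["ACGT", "GGCA"])   # hypothetical dataset name
alignments.create_dataset("positions", [100, 250])    # hypothetical dataset name
```

Objects are addressed by path, much like files in a file system, and small pieces of metadata travel with the object they describe.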
6. Benefits of BioHDF
• Portability and data sharing:
Platform independent, endian independent,
self-describing, common data models.
• High performance:
Fast random access and efficient, scalable, petabyte-level
compressed storage.
• Widespread adoption:
MATLAB, IDL, NASA Earth Observing System, Pacific
Biosciences, SOLiD, hundreds of products.
• 20-year history:
Robust, performance-tuned, and well supported by The HDF
Group, an independent non-profit entity.
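Fast random access and compressed storage can coexist because HDF5 compresses datasets chunk by chunk, so a reader only has to decompress the chunk that holds the requested element. The sketch below imitates that idea with `zlib` from the Python standard library. It is not HDF5 code, and the chunk size, function names, and sample reads are illustrative assumptions.

```python
import zlib

CHUNK_SIZE = 4  # records per chunk; an arbitrary illustrative value


def write_chunks(records, chunk_size=CHUNK_SIZE):
    """Compress records in independent fixed-size chunks."""
    chunks = []
    for start in range(0, len(records), chunk_size):
        payload = "\n".join(records[start:start + chunk_size]).encode("ascii")
        chunks.append(zlib.compress(payload))
    return chunks


def read_record(chunks, index, chunk_size=CHUNK_SIZE):
    """Fetch one record by decompressing only the chunk that contains it."""
    chunk_no, offset = divmod(index, chunk_size)
    payload = zlib.decompress(chunks[chunk_no])
    return payload.decode("ascii").split("\n")[offset]


# Ten made-up reads, stored as three compressed chunks (4 + 4 + 2 records).
reads = ["READ%03d:" % i + "ACGT" * 5 for i in range(10)]
chunks = write_chunks(reads)
record = read_record(chunks, 7)  # touches only the second chunk
```

Smaller chunks make single-record lookups cheaper, while larger chunks usually compress better, so the chunk size is a tuning parameter.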
7. HDF in Bioinformatics
• Baylor Imaging Group
• Life Technologies
• Pacific Biosciences
• Oxford Nanopore
• GenomeData (UW)
• Geospiza
• Others
8. Data Stored
The prototype BioHDF stores
Reads
Alignments
Annotations
Clusters of Aligned Reads
Reference Sequences
Indexes (NCList or simple)
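The slide does not say what the "simple" index looks like. One plausible reading is a list of alignment start positions kept in sorted order, so that a region query becomes a binary search plus a short scan. The sketch below shows that idea in plain Python. It is an assumption, not the BioHDF index format, and it does not implement NCList. The class name and sample coordinates are hypothetical.

```python
from bisect import bisect_left


class SimpleIntervalIndex:
    """Sorted-start index over half-open integer intervals [start, end)."""

    def __init__(self, intervals):
        self.intervals = list(intervals)
        # Interval ids ordered by start position.
        self.order = sorted(range(len(self.intervals)),
                            key=lambda i: self.intervals[i][0])
        self.starts = [self.intervals[i][0] for i in self.order]
        # The longest interval bounds how far left a hit can begin.
        self.max_len = max((e - s for s, e in self.intervals), default=0)

    def overlapping(self, qstart, qend):
        """Return ids of intervals that overlap [qstart, qend)."""
        lo = bisect_left(self.starts, qstart - self.max_len + 1)
        hi = bisect_left(self.starts, qend)
        hits = []
        for k in range(lo, hi):
            i = self.order[k]
            if self.intervals[i][1] > qstart:
                hits.append(i)
        return hits


# Made-up alignment coordinates on a single reference.
alignments = [(100, 150), (120, 170), (300, 350), (160, 210)]
index = SimpleIntervalIndex(alignments)
hits = index.overlapping(140, 165)
```

This scheme works well when alignments have similar lengths, as short reads do. A nested containment list handles mixed lengths better, at the cost of a more involved build step.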
9. Data Stored
Additional user-specific data can be stored without breaking
the library or tools.
Similar to how adding additional tables to a database schema
does not invalidate existing queries.
[Diagram: a file containing BioHDF Data alongside User-Specific Data.]
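The point about user-specific data can be sketched with a toy store: a tool that only follows the paths it knows about keeps working when unrelated groups are added next to them. The example below uses nested dictionaries as a stand-in for a BioHDF file. The function `count_reads`, the group name `LabNotes`, and the dataset names are hypothetical.

```python
def count_reads(store):
    """Hypothetical tool that reads only the path it knows about."""
    return len(store["Reads"]["sequences"])


# Stand-in for a file holding the standard BioHDF groups.
store = {
    "Reads": {"sequences": ["ACGT", "GGCA", "TTAG"]},
    "Alignments": {"positions": [100, 250, 400]},
}
before = count_reads(store)

# A user adds a private group; nothing the tool depends on has moved.
store["LabNotes"] = {"operator": "hypothetical", "run_id": 42}
after = count_reads(store)
```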
10. Project Stages
A "pipeline prototype" set of tools to demonstrate the
suitability of HDF5 for NGS data storage.
A version 1.0 release of a BioHDF library and C API targeting
the functionality of samtools.
A higher-level C API that abstracts out and hides the
underlying storage technology.
11. HDF5 API and Applications
BioHDF Applications and
Wrappers (e.g. Perl, Python)
High-Level API
BioHDF API
HDF5 API
Physical Storage
12. A Higher-Level API
A high-level API will encapsulate and hide the underlying
storage technology.
[Diagram labels: low-level C APIs (BioHDF API, BAM API); high-level C API (wrapper); tool (samtools).]
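The idea of a high-level API that hides the storage technology can be sketched as one interface with interchangeable backends. The example below is written in Python for brevity, although the project describes a C API. Every name in it is hypothetical, and the two backends are in-memory stand-ins rather than real BioHDF or BAM readers.

```python
from abc import ABC, abstractmethod


class AlignmentStore(ABC):
    """Hypothetical high-level interface that tools are written against."""

    @abstractmethod
    def fetch(self, reference, start, end):
        """Return (reference, start, end) tuples overlapping [start, end)."""


class ColumnarStore(AlignmentStore):
    """Stand-in for a BioHDF-like backend: one column per field."""

    def __init__(self, references, starts, ends):
        self.references = list(references)
        self.starts = list(starts)
        self.ends = list(ends)

    def fetch(self, reference, start, end):
        rows = zip(self.references, self.starts, self.ends)
        return [(r, s, e) for r, s, e in rows
                if r == reference and s < end and e > start]


class RecordStore(AlignmentStore):
    """Stand-in for a BAM-like backend: one record per alignment."""

    def __init__(self, records):
        self.records = list(records)

    def fetch(self, reference, start, end):
        return [rec for rec in self.records
                if rec[0] == reference and rec[1] < end and rec[2] > start]


def count_alignments(store, reference, start, end):
    """A tool written once against the interface, not against a backend."""
    return len(store.fetch(reference, start, end))


records = [("chr1", 100, 150), ("chr1", 400, 450), ("chr2", 100, 150)]
columnar = ColumnarStore(*zip(*records))
row_based = RecordStore(records)
n_columnar = count_alignments(columnar, "chr1", 90, 200)
n_row_based = count_alignments(row_based, "chr1", 90, 200)
```

Because `count_alignments` never touches a backend's internal layout, either store can change its on-disk arrangement without affecting the tool.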
13. Acknowledgements
Geospiza
Todd Smith
Mark Welsh
The HDF Group
Mike Folk
BioHDF is supported by NIH SBIR Phase II grant HG003792
awarded to Geospiza, Inc.
14. The HDF Group
Thank you for your time!
If you are interested in using or contributing to
BioHDF, please contact us!
Dana Robinson (derobins@hdfgroup.org)
http://www.biohdf.org
BOSC BoF: Friday 5:10-6:00
ISMB Poster J18: Monday, July 12: 12:40-2:30
ISMB BoF: Tuesday, July 13 1-2 pm, room 306
Editor's Notes
My goal here is to show people how data is stored in HDF5 (groups, datasets, attributes), not to speak about NGS data storage in BioHDF. I get the impression that people have little understanding of what HDF5 is so I'd like to give them a bare-bones overview.
People will be discouraged from using the HDF5 API directly because doing so would encourage them to meddle with low-level data elements that can change, which would make their software more brittle.
A first implementation of this will probably be at the linker level (e.g. samtools-biohdf and samtools-bam). Further down the road, we might implement a plugin architecture to handle this.