4. What is HDF5?
HDF5 is a highly scalable way to organize and
store heterogeneous, multidimensional data
of user-defined types.
HDF5 also allows data relationships and
context to be stored using annotation and
linking.
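A minimal sketch of annotation and linking, using the third-party h5py binding. All file, group, dataset, and attribute names here are hypothetical, and the file lives only in memory. It illustrates the idea rather than any recommended layout.

```python
import h5py
import numpy as np

# In-memory file (core driver, no backing store): nothing is written to disk.
with h5py.File("links_demo.h5", "w", driver="core", backing_store=False) as f:
    spectrum = f.create_dataset("raw/spectrum_1", data=np.arange(5))

    # Hard link: a second name for the same underlying object.
    by_sample = f.create_group("by_sample")
    by_sample["sample_A"] = spectrum

    # Soft link: stores a path that is resolved when the link is followed.
    f["latest"] = h5py.SoftLink("/raw/spectrum_1")

    # Annotation: an attribute attached to the object itself,
    # so it is visible through every name that reaches the object.
    spectrum.attrs["instrument"] = "hypothetical-ms"

    via_hard = f["by_sample/sample_A"].attrs["instrument"]
    via_soft = f["latest"].attrs["instrument"]
    soft_values = f["latest"][...].tolist()
```

Both links reach the same dataset, which is why the attribute written through the original name can be read through either of the others.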
July 13, 2012 BOSC 2012 4
5. HDF5
The HDF5 technology suite includes:
• A structured binary file format
• An abstract data model for describing your data
• A data access library, written in C
(w/ bindings for C++, Fortran 95/2003, and Java)
6. HDF5 has characteristics of …
Directories and Files
• hierarchical
• collections of related information
PDF
• standard exchange format
• heterogeneous information
Databases
• subsetting
• random access
• high-performance
XML
• self-describing
• extensible types
• rich metadata
Binary Flat File
7. Advantages of HDF5
• Platform and architecture-independent
• Scalable in space and time
• File size only limited by OS and filesystem
• Data access time (esp. parallel) scales well
• Flexible (user-defined types and organization)
• Files are self-describing
8. Advantages of HDF5 (2)
• High-performance
• Parallel I/O via MPI-IO
• Supports compression and other filters
• Open source (BSD-style license)
• The HDF Group (THG) is committed to providing long-term support
9. HDF5 Data Objects
• Groups
• Datasets
• Datatypes
• Metadata (Attributes)
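The four object kinds can be sketched with the third-party h5py binding as follows. The names `run_001`, `intensities`, `peaks`, and `units` are made up for illustration, and the file is kept in memory only.

```python
import h5py
import numpy as np

# In-memory file (core driver, no backing store): nothing is written to disk.
with h5py.File("objects_demo.h5", "w", driver="core", backing_store=False) as f:
    # Group: a container, much like a directory.
    run = f.create_group("run_001")

    # Dataset: a multidimensional array with an explicit datatype.
    intensities = run.create_dataset(
        "intensities", data=np.arange(12, dtype=np.float32).reshape(3, 4)
    )

    # Datatype: here a user-defined compound (record) type.
    peak_type = np.dtype([("mz", np.float64), ("charge", np.int8)])
    peaks = run.create_dataset(
        "peaks", data=np.array([(445.12, 2), (512.30, 3)], dtype=peak_type)
    )

    # Attribute: a small piece of metadata attached to an object.
    intensities.attrs["units"] = "counts"

    summary = {
        "members": sorted(run.keys()),
        "shape": intensities.shape,
        "dtype": str(intensities.dtype),
        "units": intensities.attrs["units"],
        "first_peak_charge": int(peaks[0]["charge"]),
    }
```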
10. Example: LCMS Data
[Diagram labels: sample name; chromatography parameters; ms parameters; ms/ms parameters]
11. HDF5 Data Access
Unlike many data storage systems, HDF5 has no
built-in query engine or indexes.
You will have to write your own data access code,
usually using the HDF5 API.
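As one possible shape for such hand-written access code, the sketch below walks a file and collects the datasets whose attribute matches a value. It uses the third-party h5py binding. The helper `find_datasets`, the `scans/...` layout, and the `ms_level` attribute are all hypothetical, and the comparison assumes scalar attribute values.

```python
import h5py
import numpy as np

def find_datasets(h5file, attr_name, attr_value):
    """Return sorted paths of datasets whose scalar attribute equals attr_value."""
    matches = []

    def visitor(name, obj):
        # Called once per object in the file; returning None continues the walk.
        if isinstance(obj, h5py.Dataset) and obj.attrs.get(attr_name) == attr_value:
            matches.append(name)

    h5file.visititems(visitor)
    return sorted(matches)

# In-memory file (core driver, no backing store): nothing is written to disk.
with h5py.File("query_demo.h5", "w", driver="core", backing_store=False) as f:
    for name, level in [("scan_1", 1), ("scan_2", 2), ("scan_3", 2)]:
        dset = f.create_dataset(f"scans/{name}", data=np.zeros(4))
        dset.attrs["ms_level"] = level

    ms2_scans = find_datasets(f, "ms_level", 2)
```

This is a full traversal on every call. With no built-in index, anything faster has to be built and maintained by the application.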
12. Dataspaces
HDF5 has a rich set of data subsetting functionality.
Example: displaying a thumbnail of a high-resolution image.
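A rough sketch of the thumbnail case with the third-party h5py binding: a strided selection reads every 8th pixel, so only the selected elements are returned to the caller. The dataset name `image` and the 64x64 size are placeholders, and this is plain decimation, not a properly filtered downsample.

```python
import h5py
import numpy as np

# In-memory file (core driver, no backing store): nothing is written to disk.
with h5py.File("image_demo.h5", "w", driver="core", backing_store=False) as f:
    pixels = np.arange(64 * 64, dtype=np.uint16).reshape(64, 64)
    image = f.create_dataset("image", data=pixels)

    # Strided selection: every 8th row and every 8th column.
    thumbnail = image[::8, ::8]

    # Contiguous selection: one 8x8 tile from the middle of the image.
    tile = image[16:24, 32:40]
```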
13. Filters and Compression
HDF5 supports data filters, including compression,
which transform data as it enters or leaves the file.
[Diagram: compressed data in the file ↔ compression filter ↔ uncompressed data in user's buffer]
Note that HDF5 data objects are filtered individually,
not the entire file!
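The per-object nature of filters can be sketched with the third-party h5py binding: two datasets in the same file, one stored as-is and one chunked and gzip-compressed. The dataset names, chunk shape, and compression level below are arbitrary choices for illustration.

```python
import h5py
import numpy as np

data = np.zeros((1000, 1000), dtype=np.int32)

# In-memory file (core driver, no backing store): nothing is written to disk.
with h5py.File("filter_demo.h5", "w", driver="core", backing_store=False) as f:
    # No filters: stored contiguously, uncompressed.
    raw = f.create_dataset("raw", data=data)

    # Filters apply to this dataset only; filtered datasets must be chunked.
    packed = f.create_dataset(
        "packed",
        data=data,
        chunks=(100, 100),
        compression="gzip",
        compression_opts=4,
        shuffle=True,
    )
    f.flush()

    filters = (raw.compression, packed.compression)
    raw_bytes = raw.id.get_storage_size()
    packed_bytes = packed.id.get_storage_size()

    # Reading decompresses transparently into the caller's buffer.
    round_trip_ok = bool(np.array_equal(packed[...], data))
```

The reader never calls a decompression routine. The filter pipeline runs inside the library on each read and write.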
14. Higher Language Bindings
C++ Fortran (95 & 2003) Java .NET Python
• C++ & Fortran distributed with library
• Java distributed separately
• .NET distributed separately, not supported by THG (as-is)
• Python (PyTables, h5py) not distributed by THG
NOTE:
HDF5 bindings are thin wrappers over the C API.
• There is no object-oriented interface to HDF5
• Not pure Java, .NET, etc.
The second statement is what we mean by "rich"
High-level view, point out that the file format is NOT "HDF5" (mention VOL). Gerd is a little unhappy with "structured", but it should be ok for this audience.
HDF5 has the characteristics of other formats that are out there. It's hard to store metadata in a binary flat file, and it is not scalable.
Gerd points out that a library is properly a part of the self-describing representation
High performance can have many meanings
Again, note that links are named, not objects
Much more low-level than, say, an RDBMS, though the ease of use of a database can come at a performance cost. "Easy" access via Python, Gerd's PowerShell snap-in, etc. Can write your own data access API to create queries, etc.
Need to reword this! "These are called dataspaces" = bad.
Add resource links to this slide
Why should you listen to my talk?
Note that links are named, not objects! Gerd thinks of names as NAVIGATORS.
Wide variety of integer and floating point types, enum types, etc. Need to point out that variable-length strings have compression issues (fixable, with $$$).
Might mention sparsity for chunks here. Mike suggests not mentioning chunks, so perhaps that could be replaced with a note about sparse data.