Using HDF5 To Work With Large
Quantities of Rich Biological Data
    Dana Robinson (derobins @hdfgroup.org)
    The HDF Group




July 13, 2012       BOSC 2012
Today's Goal

You should walk away from this talk with a basic
understanding of the HDF5 technology stack.




Where is HDF5 used?




What is HDF5?

HDF5 is a highly scalable way to organize and
store heterogeneous, multidimensional data
of user-defined types.

HDF5 also allows data relationships and
context to be stored using annotation and
linking.
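
The annotation and linking described above can be sketched with h5py, one of the Python packages named later in this talk. This is a minimal illustration, not code from the talk: the file name, the group and dataset names, and the attribute values are all hypothetical, and the file is kept in memory so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory file (core driver, no backing store), so nothing is written to disk.
with h5py.File("links_demo.h5", "w", driver="core", backing_store=False) as f:
    # Hypothetical layout: one dataset of raw intensities.
    raw = f.create_dataset("raw/intensities", data=np.arange(12.0).reshape(3, 4))

    # Annotation: attributes attach small pieces of context to an object.
    raw.attrs["units"] = "counts"

    # Linking: a hard link gives the same dataset a second name ...
    analysis = f.create_group("analysis")
    analysis["input"] = raw
    # ... while a soft link stores a path that is resolved on access.
    f["latest"] = h5py.SoftLink("/raw/intensities")

    # All three names reach the same object, so they see the same attribute.
    units_via_hard_link = f["analysis/input"].attrs["units"]
    units_via_soft_link = f["latest"].attrs["units"]
    soft_target = f.get("latest", getlink=True).path

print(units_via_hard_link, units_via_soft_link, soft_target)
```

The attribute is written once, through the original name, and read back through both links, which is one way to see that links name an object rather than copy it.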



HDF5
The HDF5 technology suite includes:

• A structured binary file format

• An abstract data model for describing your data

• A data access library, written in C
  (w/ bindings for C++, Fortran 95/2003, and Java)


HDF5 has characteristics of …
• Directories and files
  • hierarchical
  • collections of related information
• PDF
  • standard exchange format
  • heterogeneous information
• Databases
  • subsetting
  • random access
• XML
  • self-describing
  • extensible types
  • rich metadata
• Binary flat file
  • high-performance
Advantages of HDF5
• Platform- and architecture-independent

• Scalable in space and time
  • File size limited only by OS and filesystem
  • Data access time (esp. parallel) scales well

• Flexible (user-defined types and organization)

• Files are self-describing
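
As a rough illustration of "self-describing", the sketch below lists the contents of a file using only what the file reports about itself (names, shapes, element types, attributes). It uses h5py; the helper name `describe` and the demo content are hypothetical, not part of HDF5 or of the talk.

```python
import h5py
import numpy as np


def describe(h5file):
    """Return one line per object, using only information stored in the file."""
    lines = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset):
            lines.append(f"/{name}: dataset shape={obj.shape} dtype={obj.dtype}")
        elif isinstance(obj, h5py.Group):
            lines.append(f"/{name}: group")
        else:
            lines.append(f"/{name}: committed datatype")
        for key, value in obj.attrs.items():
            lines.append(f"    @{key} = {value}")

    # visititems walks every object reachable from the root group.
    h5file.visititems(visit)
    return lines


# Hypothetical demo content, kept in memory.
with h5py.File("describe_demo.h5", "w", driver="core", backing_store=False) as f:
    run = f.create_group("run1")
    run.attrs["operator"] = "unknown"
    run.create_dataset("signal", data=np.zeros((2, 3), dtype="int32"))
    listing = describe(f)

print("\n".join(listing))
```

No schema or side file is consulted: the reader needs no prior knowledge of the layout to recover it.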
Advantages of HDF5 (2)
• High-performance

• Parallel I/O via MPI-IO

• Supports compression and other filters

• Open source (BSD-style license)

• The HDF Group (THG) is committed to providing long-term support
HDF5 Data Objects

• Groups                   • Datatypes
• Datasets                 • Metadata (Attributes)
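
The four object kinds listed above can be seen together in a short h5py sketch. Everything specific here is an assumption made for illustration: the compound "peak" record, the group name `sample_001`, and the attribute text do not come from the talk.

```python
import h5py
import numpy as np

# Hypothetical user-defined compound type: one record per detected peak.
peak_type = np.dtype([("mz", "f8"), ("intensity", "f4"), ("charge", "i1")])
records = np.array([(445.12, 1.5e4, 2), (512.30, 8.0e3, 1)], dtype=peak_type)

with h5py.File("objects_demo.h5", "w", driver="core", backing_store=False) as f:
    # Group: a container for other objects, comparable to a directory.
    sample = f.create_group("sample_001")

    # Datatype: assigning a NumPy dtype stores (commits) the type under a name.
    f["peak_type"] = peak_type

    # Dataset: an array of elements that all share one datatype.
    peaks = sample.create_dataset("peaks", data=records)

    # Attribute: small metadata attached to an object.
    peaks.attrs["description"] = "hypothetical peak list"

    stored = peaks[...]
    kinds = {name: type(obj).__name__ for name, obj in f.items()}

print(kinds)
print(stored["mz"], stored["charge"])
```

Reading the dataset back returns a NumPy structured array, so individual fields such as `mz` can be pulled out by name.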




Example: LCMS Data
[Figure: LCMS data example. Labels: sample name, chromatography parameters, ms parameters, ms/ms parameters]
HDF5 Data Access

Unlike many data storage systems, HDF5 has no
built-in query engine or indexes.

You will have to write your own data access code,
usually using the HDF5 API.
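
To make "write your own data access code" concrete, here is a minimal sketch of a hand-written query in h5py. The helper `find_datasets`, the attribute name `ms_level`, and the sample layout are hypothetical. Because there is no index, the function visits every object in the file; it also assumes the attribute being compared is a scalar.

```python
import h5py
import numpy as np


def find_datasets(h5file, attr_name, attr_value):
    """Return paths of datasets whose scalar attribute equals attr_value.

    There is no index to consult, so every object in the file is visited.
    """
    matches = []

    def visit(name, obj):
        if isinstance(obj, h5py.Dataset) and obj.attrs.get(attr_name) == attr_value:
            matches.append("/" + name)

    h5file.visititems(visit)
    return matches


# Hypothetical demo content, kept in memory.
with h5py.File("query_demo.h5", "w", driver="core", backing_store=False) as f:
    for sample, level in [("sample_a", 1), ("sample_b", 2), ("sample_c", 2)]:
        scan = f.create_dataset(sample + "/scan", data=np.zeros(4))
        scan.attrs["ms_level"] = level
    level_two = find_datasets(f, "ms_level", 2)
    missing = find_datasets(f, "no_such_attribute", 2)

print(level_two)
```

For large files, an application would typically maintain its own lookup structure (for example a small index dataset) rather than scan on every query; that design is left to the application.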




Dataspaces
HDF5 has a rich set of data subsetting functionality,
expressed as selections on a dataset's dataspace.
Example: displaying a thumbnail of a high-resolution image.
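
The thumbnail example can be sketched with a strided selection in h5py, which translates NumPy-style slices into selections on the dataset's dataspace. The image here is synthetic and its name and size are assumptions. Note that taking every 16th pixel is plain decimation, a crude stand-in for a properly resampled thumbnail.

```python
import h5py
import numpy as np

with h5py.File("image_demo.h5", "w", driver="core", backing_store=False) as f:
    # Hypothetical 1024 x 1024 grayscale image; pixel value = row + column.
    rows, cols = np.indices((1024, 1024))
    image = f.create_dataset("image", data=(rows + cols).astype("uint16"))

    # Strided selection: every 16th pixel in each dimension.
    # Only the selected elements are returned to the caller.
    step = 16
    thumbnail = image[::step, ::step]

    # A contiguous rectangular region is selected the same way.
    tile = image[100:104, 200:203]

print(thumbnail.shape, tile.shape)
```

In both cases the full 1024 x 1024 array is never materialized in the caller's buffer; only the 64 x 64 thumbnail and the 4 x 3 tile are.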




Filters and Compression
HDF5 supports data filters, including compression,
which transform data as it enters or leaves the file.

compressed data in the file  <->  compression filter  <->  uncompressed data in user's buffer

Note that HDF5 data objects are filtered individually,
not the entire file!
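
A small h5py sketch of per-object filtering: two datasets in the same file, only one of which is compressed. The dataset names, chunk shape, and gzip level are arbitrary choices for illustration, and the test data is deliberately repetitive so that it compresses well.

```python
import h5py
import numpy as np

# Repetitive data: 1000 identical rows of 0..999.
data = np.tile(np.arange(1000, dtype="int32"), (1000, 1))

with h5py.File("filters_demo.h5", "w", driver="core", backing_store=False) as f:
    # No filters: stored as-is.
    plain = f.create_dataset("plain", data=data)

    # Filters are set per dataset and applied chunk by chunk:
    # byte shuffle first, then gzip (deflate) at level 4.
    packed = f.create_dataset(
        "packed",
        data=data,
        chunks=(100, 1000),
        shuffle=True,
        compression="gzip",
        compression_opts=4,
    )

    plain_bytes = plain.id.get_storage_size()
    packed_bytes = packed.id.get_storage_size()
    settings = (plain.compression, packed.compression, packed.shuffle)

    # Reading decompresses transparently into the caller's buffer.
    roundtrip_ok = np.array_equal(packed[...], data)

print(plain_bytes, packed_bytes, settings, roundtrip_ok)
```

The reading code is identical for both datasets; the filter pipeline is recorded with the dataset, so the library knows how to reverse it.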
Higher-Level Language Bindings
    C++, Fortran (95 & 2003), Java, .NET, Python

•   C++ & Fortran distributed with library
•   Java distributed separately
•   .NET distributed separately, not supported by THG (as-is)
•   Python (PyTables, h5py) not distributed by THG

NOTE:
HDF5 bindings are thin wrappers over the C API.
   • There is no object-oriented interface to HDF5
   • Not pure Java, .NET, etc.; the bindings call the native C library
Questions?


                            Helpful links

THG                  www.hdfgroup.org
Downloads            www.hdfgroup.org/HDF5/release/obtain5.html
Documentation        www.hdfgroup.org/HDF5/doc/index.html
Bioinformatics       www.hdfgroup.org/projects/bioinformatics/
Tutorials            www.hdfgroup.org/HDF5/Tutor/index.html
Contact/help desk    www.hdfgroup.org/about/contact.html


Editor's Notes

  1. HDF is an ADJECTIVE
  2. Add Sony Pictures
  3. The second statement is what we mean by "rich"
  4. High-level view, point out that the file format is NOT "HDF5" (mention VOL). Gerd is a little unhappy with "structured", but it should be ok for this audience.
  5. HDF5 has the characteristics of other formats that are out there. It’s hard to store metadata in a binary flat file, and it is not scalable.
  6. Gerd points out that a library is properly a part of the self-describing representation
  7. High performance can have many meanings
  8. Again, note that links are named, not objects
  9. Much more low-level than, say, an RDBMS, though the ease of use of a database can come at a performance cost. "Easy" access via Python, Gerd's PowerShell snap-in, etc. Can write your own data access API to create queries, etc.
  10. Need to reword this! "These are called dataspaces" = bad.
  11. Add resource links to this slide
  12. Why should you listen to my talk?
  13. Note that links are named, not objects! Gerd thinks of names as NAVIGATORS.
  14. Wide variety of integer and floating point types, enum types, etc. Need to point out that variable-length strings have compression issues (fixable, with $$$).
  15. Might mention sparsity for chunks here. Mike suggests not mentioning chunks, so perhaps that could be replaced with a note about sparse data.