From a talk by Andrew Collette to the Boulder Earth and Space Science Informatics Group (BESSIG) on November 20, 2013.
This talk explores how researchers can use the scalable, self-describing HDF5 data format together with the Python programming language to improve the analysis pipeline, easily archive and share large datasets, and improve confidence in scientific results. The discussion will focus on real-world applications of HDF5 in experimental physics at two multimillion-dollar research facilities: the Large Plasma Device at UCLA, and the NASA-funded hypervelocity dust accelerator at CU Boulder. This event coincides with the launch of a new O’Reilly book, Python and HDF5: Unlocking Scientific Data.
As scientific datasets grow from gigabytes to terabytes and beyond, the use of standard formats for data storage and communication becomes critical. HDF5, the most recent version of the Hierarchical Data Format originally developed at the National Center for Supercomputing Applications (NCSA), has rapidly emerged as the mechanism of choice for storing and sharing large datasets. At the same time, many researchers who routinely deal with large numerical datasets have been drawn to Python by its ease of use and rapid development capabilities.
Over the past several years, Python has emerged as a credible alternative to scientific analysis environments like IDL or MATLAB. In addition to stable core packages for handling numerical arrays, analysis, and plotting, the Python ecosystem provides a huge selection of more specialized software, reducing the amount of work necessary to write scientific code while also increasing the quality of results. Python’s excellent support for standard data formats allows scientists to interact seamlessly with colleagues using other platforms.
3. What makes scientific data special?
It’s meant to be shared - collaborative
Ad-hoc or changing structure - flexible
Archived and preserved - robust
Python and HDF5 together address all three
5. (the platform)
Mature numerical, plotting and scientific modules
Hundreds of specialized science packages
Thousands more general-purpose
Python itself is “batteries included”
7. Thousands of others
Distribution - distutils/pip single-command installs
Unit testing - unittest module in stdlib
Interface: F2PY (Fortran), Cython (C), ctypes, others
Web servers and development - literally hundreds
Only need to write code for your problem
18. Hierarchical Data Format
3 things:
File specification and object model
C library
Ecosystem of users and developers
19. Objects
Datasets: homogeneous arrays of data
Groups: containers holding datasets and groups
Attributes: arbitrary metadata on groups & datasets
Standard constructs using these, or make your own!
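The three object types above map directly onto the h5py API. A minimal sketch (file and object names here are illustrative, not from the talk):

```python
import numpy as np
import h5py

# Build a file containing all three HDF5 object types.
with h5py.File("example.h5", "w") as f:
    grp = f.create_group("experiment")                         # Group: a container
    dset = grp.create_dataset("signal", data=np.arange(10.0))  # Dataset: homogeneous array
    dset.attrs["units"] = "volts"                              # Attribute on a dataset
    grp.attrs["run_id"] = 42                                   # Attributes work on groups too

# Objects are addressed by POSIX-style paths when reading back.
with h5py.File("example.h5", "r") as f:
    print(f["experiment/signal"].attrs["units"])
```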
20. Dataset features
Partial I/O: read and write just what you want
(In Python, we even use the array-access syntax!)
Automatic type conversion
On-the-fly compression
Parallel reads & writes with MPI
(Directly from Python!)
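The first three features on this slide can be sketched in a few lines of h5py; partial I/O really does use NumPy's array-access syntax (names and shapes below are illustrative):

```python
import numpy as np
import h5py

with h5py.File("features.h5", "w") as f:
    # On-the-fly compression, requested once at creation time
    dset = f.create_dataset("big", shape=(1000, 1000), dtype="f4",
                            compression="gzip")
    # Partial I/O on write: only this row is sent to disk
    dset[0, :] = np.linspace(0, 1, 1000)

with h5py.File("features.h5", "r") as f:
    # Partial I/O on read: only the requested slice is loaded into memory
    first_ten = f["big"][0, :10]
    # Automatic type conversion: the float64 input above was stored as
    # float32, per the dataset's declared dtype
    print(first_ten.dtype)
```

Parallel reads and writes with MPI use the same API, with the file opened via h5py's `mpio` driver under mpi4py.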
21. Metadata & Organization
Groups form a POSIX-style “filesystem” in the file
Attributes can store arbitrary data on arbitrary objects
How should the file be organized?
You decide!
Thousands of domain-specific “application formats”
Anyone can read them because HDF5 is self-describing!
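Self-describing means a reader can discover a file's layout with no prior knowledge of how it was organized. A minimal sketch using h5py's `visititems` (the file built here is purely illustrative):

```python
import numpy as np
import h5py

# Write a small file with one possible organization...
with h5py.File("walk.h5", "w") as f:
    g = f.create_group("run1")
    g.attrs["date"] = "2013-11-20"
    g.create_dataset("temperature", data=np.zeros(5))

# ...then walk it without knowing that organization in advance.
def show(name, obj):
    kind = "Group" if isinstance(obj, h5py.Group) else "Dataset"
    print(kind, "/" + name, dict(obj.attrs))

with h5py.File("walk.h5", "r") as f:
    f.visititems(show)  # visits run1, then run1/temperature
```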
34. LAPD Data Products
Acquisition file - “Planes” of data in HDF5
Metadata: timestamps, digitizer settings, probe positions, background plasma conditions…
Packaged into HDF5 following “lab layout”
Users take their data back home and analyze
37. Only 160 lines of code!
A. Collette et al., Phys. Rev. Lett. 105, 195003 (2010)
38. Python does 3D too!
“MayaVi” 3D visualizer
Development sponsored by Enthought
Both offline (scripted) and interactive modes
A. Collette et al., Phys. Plasmas 18, 055705 (2011)
45. Where to get Python
Distributions are the best way to get started
(they include HDF5/h5py!)
Anaconda (Windows, Mac, Linux):
http://continuum.io
PythonXY (Windows):
http://pythonxy.googlecode.com