This document discusses the need for standardized indexing in HDF5 to facilitate querying and subsetting large scientific datasets. It proposes an H5IN API with two functions: Create_index to build indexes on HDF5 datasets, and Query to search indexed datasets and return matching subsets. The initial prototype focuses on single-dataset projection indexes for simple boolean queries, storing indexes in separate datasets for portability. The goal is to prove the concept and pave the way for more advanced indexing capabilities and queries in HDF5.
3. Why Not Commercial DBMSs?
Proprietary format
Lack of portability
Low scalability
Lack of desirable access modes
Presence of expensive concurrency control and logging mechanisms
Expensive parallel versions
4. State of the Art Not Enough
Scientific file formats and associated I/O APIs
Concentrating on HDF5
Data retrieval is navigational
Subsetting only on a small set of attributes
6. Previous Indexing Efforts
Implicit indexing in HDF5
JPL use of HDF Vdatas
HDF-EOS point data
PyTables
HDF5 internal B-Tree structures
7. Why a Standard Indexing API?
Avoid duplication of effort
Standardize indexing in HDF5
PyTables
A standard API can be implemented in different ways
Make indexes portable
Store indexes in HDF5 files
8. H5IN API
Create_index
Parameters: location of index, location of data, binning information, memory limits
Returns: location of the index
Query
Parameters: dataset to query, query string
Returns: selection representing the subset of the data corresponding to the query
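The two calls above can be sketched in Python to show how they fit together. This is only an illustration under stated assumptions, not the H5IN implementation: the "file" is a plain dict, `create_index` and `query` are hypothetical names, the binning and memory-limit parameters are left out, and `query` takes the index location rather than the dataset. The index is assumed to be a projection index in the database-literature sense, that is, a row-ordered copy of one attribute's values.

```python
# Toy stand-in for an HDF5 file: a dict mapping a location (path) to a list.
# All names are hypothetical; none of this is the real H5IN or HDF5 API.

def create_index(store, index_loc, data_loc, attribute):
    """Build a projection index: a row-ordered copy of one attribute's
    values, stored as a separate dataset. Returns the index location."""
    store[index_loc] = [record[attribute] for record in store[data_loc]]
    return index_loc


def query(store, index_loc, query_string):
    """Evaluate one condition of the form 'low < Attribute < high' against
    the index and return the selection as a list of matching row positions.
    The attribute name in the string is not checked against the index."""
    parts = [part.strip() for part in query_string.split("<")]
    if len(parts) != 3:
        raise ValueError(f"unsupported query: {query_string!r}")
    low, high = float(parts[0]), float(parts[2])
    return [row for row, value in enumerate(store[index_loc])
            if low < value < high]
```

Because the result is a list of row positions rather than a copy of the data, the caller decides whether and when to read the matching records.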
9. Design Decisions
Limited scope of the prototype
Index stored in a separate dataset
Returns a selection
Projection index
Support for simple boolean queries
10. Limited Scope
First indexing prototype in HDF5
Presence of implicit indexing
Index on single datasets
Query over single datasets
Conditions should be over a single dataset
Result could be mapped to a separate dataset
18. Why Projection Index?
Data is read only
A dataset, once written, is mostly left unchanged
Index does not need to be updated
Projection indexes are well suited
Number of disk accesses is the same as for a B-Tree
Multidimensional queries are not being considered
19. Only Simple Boolean Queries
Query Format
SELECT
WHERE
SELECTION
c11 < Attribute1 < c12
AND c21 < Attribute2 < c22
…
Because results are selections, boolean operations can be done inside the library
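A minimal sketch of how AND-ed range conditions could be combined when each condition yields a selection. It assumes a selection can be modeled as a set of row positions and that each attribute's values are available as a row-ordered list; the function name `select` and the `columns` mapping are hypothetical and not part of HDF5 or H5IN.

```python
import re

# One condition of the form "low < Attribute < high".
_CONDITION = re.compile(r"^\s*(\S+)\s*<\s*(\w+)\s*<\s*(\S+)\s*$")


def select(columns, where):
    """Return the sorted row positions that satisfy every condition in
    `where`; conditions are joined by ' AND '. `columns` maps an attribute
    name to its row-ordered values (hypothetical stand-in for the indexes)."""
    selection = None
    for condition in where.split(" AND "):
        match = _CONDITION.match(condition)
        if match is None:
            raise ValueError(f"unsupported condition: {condition!r}")
        low, high = float(match.group(1)), float(match.group(3))
        name = match.group(2)
        rows = {row for row, value in enumerate(columns[name])
                if low < value < high}
        # AND over conditions is set intersection over selections.
        selection = rows if selection is None else selection & rows
    return sorted(selection)
```

Each condition is evaluated independently against its own attribute, so the boolean operator only has to combine selections and never touches the data itself.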
20. Conclusion
Developing a standard indexing API in HDF5
Creating a proof-of-concept prototype using projection indexes
Taking a first step towards developing a query language for HDF5