HDF5 Projection Indexes

The HDF-EOS Tools and Information Center

This document discusses the need for standardized indexing in HDF5 to facilitate querying and subsetting large scientific datasets. It proposes an H5IN API with two functions: Create_index to build indexes on HDF5 datasets, and Query to search indexed datasets and return matching subsets. The initial prototype focuses on single-dataset projection indexes for simple boolean queries, storing indexes in separate datasets for portability. The goal is to prove the concept and pave the way for more advanced indexing capabilities and queries in HDF5.

Projection Indexes in HDF5
Rishi Rakesh Sinha
The HDF Group

Science Produces Large Datasets
• Observation/experiment driven: 144 MB/hr
• Simulation driven: 200 GB/run
• Information driven: > 7 GB/expt

Why Not Commercial DBMSs?
• Proprietary format
• Lack of portability
• Low scalability
• Lack of desirable access modes
• Presence of expensive concurrency control and logging mechanisms
• Expensive parallel versions

State of the Art Not Enough
• Scientific file formats and associated I/O APIs
  ◦ Concentrating on HDF5
• Data retrieval is navigational
  ◦ Subsetting only on a small set of attributes

Why Indexes?
[Figure with two panels labeled "Easy" and "Not So Easy"]

Previous Indexing Efforts
• Implicit indexing in HDF5
• JPL use of HDF Vdatas
• HDF-EOS point data
• PyTables
• HDF5 internal B-tree structures

Why a Standard Indexing API?
• Avoid duplication of effort
• Standardize indexing in HDF5
  ◦ PyTables
  ◦ A standard API can be implemented in different ways
• Make indexes portable
  ◦ Store indexes in HDF5 files

H5IN API
• Create_index
  ◦ Parameters: location of index, location of data, binning information, memory limits
  ◦ Returns: location of the index
• Query
  ◦ Parameters: dataset to query, query string
  ◦ Returns: selection representing the subset of the data corresponding to the query
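
The shape of the two calls can be illustrated with a small Python sketch. It is only a stand-in under stated assumptions: the "file" is a plain dict mapping paths to lists, the function names, the `_index_of` bookkeeping key, and the `low < value < high` query syntax are all hypothetical, and none of this is the actual H5IN or HDF5 library API.

```python
import re

# Hypothetical stand-in for an HDF5 file: a dict mapping paths to lists.
# These functions only mirror the shape of the slide's API; they are not
# real H5IN or HDF5 library calls.

def create_index(store, index_path, data_path, binning=None, memory_limit=None):
    """Index store[data_path]; write the index to index_path and return that path."""
    # Unbinned projection index: the indexed values in element order.
    # binning and memory_limit are accepted to match the slide but unused here.
    store[index_path] = list(store[data_path])
    # Remember which index belongs to which dataset (assumed bookkeeping).
    store.setdefault("_index_of", {})[data_path] = index_path
    return index_path


def query(store, data_path, query_string):
    """Return a selection (ascending element positions) for 'low < value < high'."""
    number = r"(-?\d+(?:\.\d+)?)"
    match = re.fullmatch(rf"\s*{number}\s*<\s*value\s*<\s*{number}\s*", query_string)
    if match is None:
        raise ValueError(f"unsupported query: {query_string!r}")
    low, high = float(match.group(1)), float(match.group(2))
    index = store[store["_index_of"][data_path]]
    return [pos for pos, v in enumerate(index) if low < v < high]


store = {"/DAY3/Temperature": [52, 42, 57, 22, 67]}
index_path = create_index(store, "/T_IN", "/DAY3/Temperature")
selection = query(store, "/DAY3/Temperature", "40 < value < 60")
print(index_path, selection)
```

The query returns positions rather than values, which matches the slide's point that the result is a selection that can later be applied to a dataset.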

Design Decisions
• Limited scope of the prototype
• Index stored in a separate dataset
• Returns a selection
• Projection index
• Support for simple boolean queries

Limited Scope
• First indexing prototype in HDF5
  ◦ Presence of implicit indexing
• Index on single datasets
• Query over single datasets
  ◦ Conditions should be over a single dataset
  ◦ Result could be mapped to a separate dataset

Index Storage
[Figure: example file layout, reconstructed from the slide labels]
Root Group: /
  DAY1
  DAY2
  DAY3
    Location Data (fields F1, F2, F3)
    Temperature
    Pressure
  DAY4

Index Storage
[Figure: index on Location Data, reconstructed from the slide labels]
Root Group: /
  LD_INDEX (fields F1, F2)
  DAY3
    Location Data (fields F1, F2, F3)

Index Storage
[Figure: indexes on Temperature and Pressure, reconstructed from the slide labels]
Root Group: /
  T_IN (Temperature)
  P_IN (Pressure)
  DAY3
    Pressure
    Temperature

Returns a Selection
FIND PRESSURE WHERE TEMP IN [100, 200]
[Figure: Temperature and Pressure datasets]
• Concise storage
• Efficient Boolean operations
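
To make the two bullet points concrete, here is a minimal Python sketch of the query on this slide. It assumes that Temperature and Pressure are parallel datasets of equal length, that `IN [100, 200]` is inclusive, and that a selection can be modeled as a bitmask held in a Python integer. The sample values and helper names are made up, and HDF5's own selection objects are not modeled here.

```python
# Made-up sample data; Temperature and Pressure are assumed to be parallel datasets.
temperature = [90, 120, 250, 150, 199, 310]
pressure = [29, 30, 31, 32, 33, 34]


def select_in_range(values, low, high):
    """Return a bitmask with bit i set when low <= values[i] <= high."""
    mask = 0
    for i, v in enumerate(values):
        if low <= v <= high:
            mask |= 1 << i
    return mask


def apply_selection(values, mask):
    """Return only the elements whose bit is set in mask."""
    return [v for i, v in enumerate(values) if (mask >> i) & 1]


# FIND PRESSURE WHERE TEMP IN [100, 200]
temp_selection = select_in_range(temperature, 100, 200)
result = apply_selection(pressure, temp_selection)

# Boolean operations on two selections are single integer operations.
high_pressure = select_in_range(pressure, 32, 40)
both = temp_selection & high_pressure
either = temp_selection | high_pressure
print(result, apply_selection(pressure, both))
```

The selection computed on Temperature is applied unchanged to Pressure, which is what lets a query on one dataset return a subset of another.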

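Projection Index
[Figure: sample table and a projected column, reconstructed from the slide labels]
Temp   Category   Pressure
52     A          32
42     D          34
57     F          21
22     A          22
67     D          27
Separate column shown on the slide: A, F, D, D, F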
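
A projection index on a column is generally understood as that column's values stored on their own, in row order, so a condition on the column can be evaluated without reading the other columns. The sketch below illustrates the idea in plain Python. The sample rows are loosely based on the slide's values (the pairing of values into rows is an assumption), and the helper names are hypothetical.

```python
# Sample rows as (temp, category, pressure); the row pairing is assumed.
rows = [
    (52, "A", 32),
    (42, "D", 34),
    (57, "F", 21),
    (22, "A", 22),
    (67, "D", 27),
]


def build_projection_index(rows, column):
    """Copy one column out of the rows, keeping row order."""
    return [row[column] for row in rows]


def positions_equal(index, wanted):
    """Scan only the index and return the row positions that match."""
    return [pos for pos, value in enumerate(index) if value == wanted]


category_index = build_projection_index(rows, 1)
hits = positions_equal(category_index, "D")
# Only the matching rows are fetched from the full table.
pressures = [rows[pos][2] for pos in hits]
print(category_index, hits, pressures)
```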

Binning
Bin      Values
1-3      1, 2, 3
4-6      4, 5, 6
7-9      7, 8, 9
10-12    10, 11, 12
13-15    13, 14, 15
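
The grouping shown on this slide can be reproduced with a few lines of Python. This is a sketch assuming fixed-width bins of three integer values starting at 1; the function names and the default parameters are illustrative only.

```python
def bin_number(value, first=1, width=3):
    """Return the zero-based bin for value; bin 0 covers first..first+width-1."""
    return (value - first) // width


def bin_label(b, first=1, width=3):
    """Return the 'low-high' label of zero-based bin b."""
    low = first + b * width
    return f"{low}-{low + width - 1}"


# Group the values 1..15 by bin label, preserving insertion order.
bins = {}
for value in range(1, 16):
    bins.setdefault(bin_label(bin_number(value)), []).append(value)

for label, members in bins.items():
    print(label, members)
```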

Projection Index
Temp   Pressure
40     29
50     30
60     31
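
One way binning and projection indexes could fit together is sketched below: the index stores only a bin number per element, bins that lie entirely inside the queried range are accepted outright, and only elements in the two edge bins are checked against the raw data. The data extends the slide's 40, 50, 60 with made-up values, and the bin width of 10, the non-negative integer values, and all names are assumptions.

```python
temp = [40, 50, 60, 45, 58, 71]      # made-up values extending the slide's 40, 50, 60
pressure = [29, 30, 31, 28, 33, 35]  # made-up parallel dataset

BIN_WIDTH = 10  # assumed bin width


def build_binned_index(values, width):
    """Store only the bin number of each element, in element order."""
    return [v // width for v in values]


def range_query(values, binned, width, low, high):
    """Positions with low <= value <= high, reading raw values only in edge bins."""
    low_bin, high_bin = low // width, high // width
    hits = []
    for pos, b in enumerate(binned):
        if low_bin < b < high_bin:
            hits.append(pos)  # bin lies entirely inside the range
        elif b in (low_bin, high_bin) and low <= values[pos] <= high:
            hits.append(pos)  # edge bin: confirmed against the raw value
    return hits


temp_index = build_binned_index(temp, BIN_WIDTH)
hits = range_query(temp, temp_index, BIN_WIDTH, 45, 65)
selected_pressure = [pressure[p] for p in hits]
print(temp_index, hits, selected_pressure)
```

Wider bins make the index smaller but push more elements into edge bins, where the raw data has to be read.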

Why Projection Index?
• Data is read-only
  ◦ Mostly, a dataset is not changed once written
  ◦ Index does not need to be updated
• Projection indexes are well suited
  ◦ Number of disk accesses is the same as for a B-tree
  ◦ Multidimensional queries are not being considered

Only Simple Boolean Queries
• Query format
    SELECT SELECTION
    WHERE  c11 < Attribute1 < c12
      AND  c21 < Attribute2 < c22
      …
• Because results are selections, boolean operations can be done inside the library
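
A WHERE clause of this form can be evaluated by turning each range condition into a selection and intersecting the selections. The Python sketch below assumes strict inequalities, numeric constants, and attributes held as equal-length lists in a dict. The grammar accepted by the regular expression, the sample values, and the function names are all hypothetical.

```python
import re

NUMBER = r"(-?\d+(?:\.\d+)?)"
CONDITION = re.compile(rf"\s*{NUMBER}\s*<\s*(\w+)\s*<\s*{NUMBER}\s*")


def parse_where(where):
    """Split 'c1 < A < c2 AND ...' into (low, attribute, high) triples."""
    conditions = []
    for part in re.split(r"\s+AND\s+", where.strip()):
        match = CONDITION.fullmatch(part)
        if match is None:
            raise ValueError(f"unsupported condition: {part!r}")
        low, name, high = match.groups()
        conditions.append((float(low), name, float(high)))
    return conditions


def evaluate(where, attributes):
    """Return the ascending positions that satisfy every condition."""
    selection = None
    for low, name, high in parse_where(where):
        matches = {i for i, v in enumerate(attributes[name]) if low < v < high}
        # AND is a set intersection of the per-condition selections.
        selection = matches if selection is None else selection & matches
    return sorted(selection)


# Made-up attribute values, assumed to be aligned element by element.
attributes = {
    "Temperature": [90, 120, 250, 150, 199, 310],
    "Pressure": [29, 30, 31, 32, 33, 34],
}
hits = evaluate("100 < Temperature < 200 AND 30 < Pressure < 34", attributes)
print(hits)
```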

Conclusion
• Developing a standard indexing API in HDF5
• Creating a proof-of-concept prototype using projection indexes
• Taking a first step towards developing a query language for HDF5

Future Work
• Multi-dimensionality
• Multiple datasets in the same file
• Multiple datasets across files
• Indexes on attributes
• Allow the user to index a subset of datasets



Editor's notes

We also use the internal B-tree. JPL's work started in 1990. As soon as Vdata came out, people started building indexes.
Clarify how the storage works. Add an example.
Add details on what the datasets mean. The actual query would help a whole lot.
Unique values. Another value for binned indexes. Introduce the term "bins" there.
Slide on bins.
Replace DATASPACE with SELECTION. Add how the selection is created. Replace HDF with HDF5.
