From a talk by Andrew Collette to the Boulder Earth and Space Science Informatics Group (BESSIG) on November 20, 2013.
This talk explores how researchers can use the scalable, self-describing HDF5 data format together with the Python programming language to improve the analysis pipeline, easily archive and share large datasets, and improve confidence in scientific results. The discussion will focus on real-world applications of HDF5 in experimental physics at two multimillion-dollar research facilities: the Large Plasma Device at UCLA, and the NASA-funded hypervelocity dust accelerator at CU Boulder. This event coincides with the launch of a new O’Reilly book, Python and HDF5: Unlocking Scientific Data.
As scientific datasets grow from gigabytes to terabytes and beyond, the use of standard formats for data storage and communication becomes critical. HDF5, the most recent version of the Hierarchical Data Format originally developed at the National Center for Supercomputing Applications (NCSA), has rapidly emerged as the mechanism of choice for storing and sharing large datasets. At the same time, many researchers who routinely deal with large numerical datasets have been drawn to Python by its ease of use and rapid development capabilities.
Over the past several years, Python has emerged as a credible alternative to scientific analysis environments like IDL or MATLAB. In addition to stable core packages for handling numerical arrays, analysis, and plotting, the Python ecosystem provides a huge selection of more specialized software, reducing the amount of work necessary to write scientific code while also increasing the quality of results. Python’s excellent support for standard data formats allows scientists to interact seamlessly with colleagues using other platforms.
3. What makes scientific data special?
It’s meant to be shared - collaborative
Ad-hoc or changing structure - flexible
Archived and preserved - robust
Python and HDF5 together address all three
5. (the platform)
Mature numerical, plotting and scientific modules
Hundreds of specialized science packages
Thousands more general-purpose
Python itself is “batteries included”
7. Thousands of others
Distribution - distutils/pip single-command installs
Unit testing - unittest module in stdlib
Interface: F2PY (Fortran), Cython (C), ctypes, others
Web servers and development - literally hundreds
Only need to write code for your problem
18. Hierarchical Data Format
3 things:
File specification and object model
C library
Ecosystem of users and developers
19. Objects
Datasets: homogeneous arrays of data
Groups: containers holding datasets and groups
Attributes: arbitrary metadata on groups & datasets
Standard constructs using these, or make your own!
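The three object types above map directly onto the h5py API. A minimal sketch (file and object names here are illustrative, not from the talk):

```python
import numpy as np
import h5py

# Build a file containing all three HDF5 object types.
with h5py.File("example.h5", "w") as f:
    grp = f.create_group("experiment")                         # Group: a container
    dset = grp.create_dataset("signal", data=np.arange(10.0))  # Dataset: homogeneous array
    dset.attrs["units"] = "volts"                              # Attribute on a dataset
    grp.attrs["run_id"] = 42                                   # Attributes work on groups too

# Objects are addressed by POSIX-style paths when reading back.
with h5py.File("example.h5", "r") as f:
    print(f["experiment/signal"].attrs["units"])
```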
20. Dataset features
Partial I/O: read and write just what you want
(In Python, we even use the array-access syntax!)
Automatic type conversion
On-the-fly compression
Parallel reads & writes with MPI
(Directly from Python!)
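The first three features on this slide can be sketched in a few lines of h5py; partial I/O really does use NumPy's array-access syntax (names and shapes below are illustrative):

```python
import numpy as np
import h5py

with h5py.File("features.h5", "w") as f:
    # On-the-fly compression, requested once at creation time
    dset = f.create_dataset("big", shape=(1000, 1000), dtype="f4",
                            compression="gzip")
    # Partial I/O on write: only this row is sent to disk
    dset[0, :] = np.linspace(0, 1, 1000)

with h5py.File("features.h5", "r") as f:
    # Partial I/O on read: only the requested slice is loaded into memory
    first_ten = f["big"][0, :10]
    # Automatic type conversion: the float64 input above was stored as
    # float32, per the dataset's declared dtype
    print(first_ten.dtype)
```

Parallel reads and writes with MPI use the same API, with the file opened via h5py's `mpio` driver under mpi4py.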
21. Metadata & Organization
Groups form a POSIX-style “filesystem” in the file
Attributes can store arbitrary data on arbitrary objects
How should the file be organized?
You decide!
Thousands of domain-specific “application formats”
Anyone can read them because HDF5 is self-describing!
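Self-describing means a reader can discover a file's layout with no prior knowledge of how it was organized. A minimal sketch using h5py's `visititems` (the file built here is purely illustrative):

```python
import numpy as np
import h5py

# Write a small file with one possible organization...
with h5py.File("walk.h5", "w") as f:
    g = f.create_group("run1")
    g.attrs["date"] = "2013-11-20"
    g.create_dataset("temperature", data=np.zeros(5))

# ...then walk it without knowing that organization in advance.
def show(name, obj):
    kind = "Group" if isinstance(obj, h5py.Group) else "Dataset"
    print(kind, "/" + name, dict(obj.attrs))

with h5py.File("walk.h5", "r") as f:
    f.visititems(show)  # visits run1, then run1/temperature
```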
34. LAPD Data Products
Acquisition file - “Planes” of data in HDF5
Metadata: timestamps, digitizer settings, probe positions, background plasma conditions…
Packaged into HDF5 following “lab layout”
Users take their data back home and analyze
37. Only 160 lines of code!
A. Collette et al., Phys. Rev. Lett. 105, 195003 (2010)
38. Python does 3D too!
“MayaVi” 3D visualizer
Development sponsored by Enthought
Both offline (scripted) and interactive modes
A. Collette et al., Phys. Plasmas 18, 055705 (2011)
45. Where to get Python
Distributions are the best way to get started
(they include HDF5/h5py!)
Anaconda (Windows, Mac, Linux):
http://continuum.io
PythonXY (Windows):
http://pythonxy.googlecode.com