SlideShare ist ein Scribd-Unternehmen logo
1 von 14
Parallel Computing with
HDF Server
1
John Readey
The HDF5 data format 2
• Established 20 years ago the HDF5 file format is the most commonly used
format in Earth Science
• Note: NetCDF4 files are actually HDF5 “under the hood”
• HDF5 was designed with the (somewhat contradictory) goals of:
• Archival format – data that can stored for decades
• Analysis Ready -- data that can be directly utilized for analytics (no
conversion needed)
• There’s a rich set of tools and language SDKs:
• C/C++/Fortran
• Python
• Java, etc.
HDF5 File Format meets the Cloud 3
• Storing large HDF5 collection on AWS is almost always about utilizing S3:
• Cost effective
• Redundant
• Sharable
• It’s easy enough to store HDF5 files as S3 objects, but these files can’t be
read using the HDF5 library (which is expecting a POSIX filesystem)
• Experience using FUSE to read from S3 using HDF5Library has not tended
to work so well
• In practice users have been left with copying files to local disk first
• This has led to interest in alternative formats such as Zarr, TileDB, and
our own HSDS S3 Storage Schema (more on that later)
• Our HSDS server provides a means of efficiently accessing HDF5 data on
S3
HDF Server 4
• HSDS is an open source REST based service for HDF data
• Think of it as HDF gone cloud native. 
• HSDS Features:
• Runs as a set of containers on Kubernetes – so can scale beyond one machine
• Requests can be parallelized across multiple containers
• Feature compatible with the HDF library but is independent code base
• Supports multiple readers/writers
• Uses S3 as data store
• Existing HDF APIs (h5py, h5netcdf, xarray, etc.) work seamlessly with HSDS
• Available now as part of HDF Kita Lab (our hosted Jupyter environment):
https://hdflab.hdfgroup.org
• Available on AWS Marketplace as “Kita Server”
HDF Cloud Schema
Big Idea: Map individual
HDF5 objects (datasets,
groups, chunks) as Object
Storage Objects
• Limit maximum storage object size
• Support parallelism for read/write
• Only data that is modified needs to be
updated
• Multiple clients can be reading/updating
the same “file”
Legend:
• Dataset is partitioned into
chunks
• Each chunk stored as an S3
object
• Dataset meta data (type, shape,
attributes, etc.) stored in a
separate object (as JSON text)
How to store HDF5 content in S3?
Each chunk (heavy outlines) get
persisted as a separate object
6Dataset JSON Example
• creationProperties contains HDF5
dataset creation property list settings.
• Id is the objects UUID.
• Layout represents HDF5 dataspace.
• Root points back to the root group
• Created & lastModified are timestamps
• type represents HDF5 datatype.
• attributes holds a list of HDF5 attribute
JSON objects.
{
"creationProperties": {},
"id": "d-9a097486-58dd-11e8-a964-
0242ac110009",
"layout": {"dims": [10], "class":
"H5D_CHUNKED"},
"root": "g-952b0bfa-58dd-11e8-a964-
0242ac110009",
"created": 1526456944,
"lastModified": 1526456944,
"shape": {"dims": [10], "class":
"H5S_SIMPLE"},
"type": {"base": "H5T_STD_I32LE",
"class": "H5T_INTEGER"},
"attributes": {}
}
Schema Details 7
• Key Organization
• Objects are stored root_id
• All non-root objects are stored as sub-keys of root_id
• “flat” organization to support non-cycle links
• Each storage node is limited to about 300 req/s 5,500 req/s
• Note: rate limit raised last year by Amazon
• Chunks are stored in the same folder as dataset metadata
• Chunk key is determined based on chunk position in the data space
• E.g. c-<uuid>_0_0_0 Is the corner chunk of a 3-dimensional dataset
• Chunk objects get created as needed on first write
• Schema is currently used just by HDF Server, but could just as easily be
used directly by clients (assuming that writes don’t conflict)
• Several improvements have been made over the last year
• Read access for traditional HDF5 Files stored in S3
• More on this in the next slide
• Shuffle filter support
• Along with deflate
• Fast metadata loading
• Optionally load all metadata in one request
• Support for multiple buckets
• HSDS can access data stored in different buckets
• H5netcdf & xarray support
• Support for the REST API is built into these packages
• Additional CLI tools (hsmv, hscp, hsdiff)
• Variable Length data support with compression
• Schema V2
8New Features
Supporting traditional HDF5 files 9
• Downside of the HDF S3 Schema is that data needs be transmogrified*
• Since the bulk of the data is usually the chunk data it makes sense to just
leave the data in place and save pointers to the original file(s):
• Convert just the metadata of the source HDF5 file to the S3 Schema
• Store the source file as a S3 object
• For data reads, metadata provides offset and length into the HDF5 file
• S3 Range GET returns needed data
• This approach can be used either directly or with HDF Server
• Compared with accessing S3 directly, you reduce the number of S3
requests needed
• Performance is comparable to sharded data model
• Only read access is supported
Hybrid Approach: Metadata + HDF5 Files 1
0
S3://BIG_REPO/…/AN_HDF5_FILE.h5
Imported Metadata (JSON) HDF5 File stored as S3 object
Dset /dset1: chunk 0
Dset /dset1: chunk 1
Dset /dset1: chunk n
S3 Range GET(
• S3 Key
• Offset
• Num Bytes)
• HDF Kita Lab runs on AWS in a Kubernetes cluster
• Cluster can scale to handle different number of users
• Each user gets:
• 1 CPU Core (2.5GHz Xeon)
• 8 GB RAM
• 10 GB Disk
• 100 GB S3 Storage
• Access to HDF Kita Server
• Ability to read/write HDF data stored on S3
• User environment configured for commonly used Python Packages for
HDF users:
• H5py(d), pandas, h5netcdf, xarray, bokeh, dask
• HDF Kita Command Line tools:
• Hsinfo, hsls, hsget, hsload, etc.
1
1Kita Lab – playground for HDF Server
• JupyterLab and Kita Server both runs as a set of Docker containers
• Kubernetes transparently manages running these containers across multiple
machines
1
2Kubernetes Platform
AWS
Kubernetes
JupyterHub HDF Kita Server (HSDS)
{Containers
1
3Architecture
AWS S3
Kita Server (HSDS)
User
SN
SN
SN
SN
DN
DN
DN
DN
User Containers &
EBS Volumes
spawn
References 1
4
• HSDS: https://github.com/HDFGroup/hsds
• H5Pyd: https://github.com/HDFGroup/h5pyd
• Kita Lab: https://www.hdfgroup.org/hdfkitalab/
• SciPy2017 talk:
https://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy201
7.pdf
• AWS Big Data Blog article: https://aws.amazon.com/blogs/big-data/power-
from-wind-open-data-on-aws/
• Spark and HDF Blog article: https://www.hdfgroup.org/2015/04/putting-
some-spark-into-hdf-eos
• Notebook from this talk:
https://gist.github.com/jreadey/d1c67aee07451985397f48a50be2cdaa

Weitere ähnliche Inhalte

Was ist angesagt?

STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...The HDF-EOS Tools and Information Center
 

Was ist angesagt? (20)

HDF for the Cloud - Serverless HDF
HDF for the Cloud - Serverless HDFHDF for the Cloud - Serverless HDF
HDF for the Cloud - Serverless HDF
 
Leveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software TestingLeveraging the Cloud for HDF Software Testing
Leveraging the Cloud for HDF Software Testing
 
MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10MATLAB Modernization on HDF5 1.10
MATLAB Modernization on HDF5 1.10
 
Caching and Buffering in HDF5
Caching and Buffering in HDF5Caching and Buffering in HDF5
Caching and Buffering in HDF5
 
HDF Update for DAAC Managers (2017-02-27)
HDF Update for DAAC Managers (2017-02-27)HDF Update for DAAC Managers (2017-02-27)
HDF Update for DAAC Managers (2017-02-27)
 
Google Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOSGoogle Colaboratory for HDF-EOS
Google Colaboratory for HDF-EOS
 
HDF Update 2016
HDF Update 2016HDF Update 2016
HDF Update 2016
 
MODIS Land and HDF-EOS
MODIS Land and HDF-EOSMODIS Land and HDF-EOS
MODIS Land and HDF-EOS
 
HDF Product Designer
HDF Product DesignerHDF Product Designer
HDF Product Designer
 
Parallel HDF5 Developments
Parallel HDF5 DevelopmentsParallel HDF5 Developments
Parallel HDF5 Developments
 
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
STARE-PODS: A Versatile Data Store Leveraging the HDF Virtual Object Layer fo...
 
HDF-EOS 2/5 to netCDF Converter
HDF-EOS 2/5 to netCDF ConverterHDF-EOS 2/5 to netCDF Converter
HDF-EOS 2/5 to netCDF Converter
 
MATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and CapabilitiesMATLAB and Scientific Data: New Features and Capabilities
MATLAB and Scientific Data: New Features and Capabilities
 
Putting some Spark into HDF5
Putting some Spark into HDF5Putting some Spark into HDF5
Putting some Spark into HDF5
 
HDF for the Cloud
HDF for the CloudHDF for the Cloud
HDF for the Cloud
 
Easy Access of NASA HDF data via OPeNDAP
Easy Access of NASA HDF data via OPeNDAPEasy Access of NASA HDF data via OPeNDAP
Easy Access of NASA HDF data via OPeNDAP
 
HDF Cloud: HDF5 at Scale
HDF Cloud: HDF5 at ScaleHDF Cloud: HDF5 at Scale
HDF Cloud: HDF5 at Scale
 
NetCDF and HDF5
NetCDF and HDF5NetCDF and HDF5
NetCDF and HDF5
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
HDF Project Update
HDF Project UpdateHDF Project Update
HDF Project Update
 

Ähnlich wie Parallel Computing with HDF Server

Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete informationbhargavi804095
 
9.-dados e processamento distribuido-hadoop.pdf
9.-dados e processamento distribuido-hadoop.pdf9.-dados e processamento distribuido-hadoop.pdf
9.-dados e processamento distribuido-hadoop.pdfManoel Ribeiro
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY pptsravya raju
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaData Con LA
 
Hadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the expertsHadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the expertsDataWorks Summit
 

Ähnlich wie Parallel Computing with HDF Server (20)

Accessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDSAccessing HDF5 data in the cloud with HSDS
Accessing HDF5 data in the cloud with HSDS
 
HDF Cloud Services
HDF Cloud ServicesHDF Cloud Services
HDF Cloud Services
 
Highly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance FeaturesHighly Scalable Data Service (HSDS) Performance Features
Highly Scalable Data Service (HSDS) Performance Features
 
Cloud-Optimized HDF5 Files
Cloud-Optimized HDF5 FilesCloud-Optimized HDF5 Files
Cloud-Optimized HDF5 Files
 
Performance Tuning in HDF5
Performance Tuning in HDF5 Performance Tuning in HDF5
Performance Tuning in HDF5
 
HDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF ServiceHDF Kita Lab: JupyterLab + HDF Service
HDF Kita Lab: JupyterLab + HDF Service
 
Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
hadoop distributed file systems complete information
hadoop distributed file systems complete informationhadoop distributed file systems complete information
hadoop distributed file systems complete information
 
List of Engineering Colleges in Uttarakhand
List of Engineering Colleges in UttarakhandList of Engineering Colleges in Uttarakhand
List of Engineering Colleges in Uttarakhand
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Hadoop.pptx
Hadoop.pptxHadoop.pptx
Hadoop.pptx
 
Introduction to HDF5 Data Model, Programming Model and Library APIs
Introduction to HDF5 Data Model, Programming Model and Library APIsIntroduction to HDF5 Data Model, Programming Model and Library APIs
Introduction to HDF5 Data Model, Programming Model and Library APIs
 
HDF Update
HDF UpdateHDF Update
HDF Update
 
Hadoop ppt1
Hadoop ppt1Hadoop ppt1
Hadoop ppt1
 
9.-dados e processamento distribuido-hadoop.pdf
9.-dados e processamento distribuido-hadoop.pdf9.-dados e processamento distribuido-hadoop.pdf
9.-dados e processamento distribuido-hadoop.pdf
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
HADOOP TECHNOLOGY ppt
HADOOP  TECHNOLOGY pptHADOOP  TECHNOLOGY ppt
HADOOP TECHNOLOGY ppt
 
Aziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jhaAziksa hadoop architecture santosh jha
Aziksa hadoop architecture santosh jha
 
Hadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the expertsHadoop in the cloud – The what, why and how from the experts
Hadoop in the cloud – The what, why and how from the experts
 
Integrating HDF5 with SRB
Integrating HDF5 with SRBIntegrating HDF5 with SRB
Integrating HDF5 with SRB
 

Mehr von The HDF-EOS Tools and Information Center (11)

The State of HDF
The State of HDFThe State of HDF
The State of HDF
 
Creating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 FilesCreating Cloud-Optimized HDF5 Files
Creating Cloud-Optimized HDF5 Files
 
HDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance DiscussionHDF5 OPeNDAP Handler Updates, and Performance Discussion
HDF5 OPeNDAP Handler Updates, and Performance Discussion
 
Hyrax: Serving Data from S3
Hyrax: Serving Data from S3Hyrax: Serving Data from S3
Hyrax: Serving Data from S3
 
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLABAccessing Cloud Data and Services Using EDL, Pydap, MATLAB
Accessing Cloud Data and Services Using EDL, Pydap, MATLAB
 
HDF - Current status and Future Directions
HDF - Current status and Future DirectionsHDF - Current status and Future Directions
HDF - Current status and Future Directions
 
HDF-EOS Data Product Developer's Guide
HDF-EOS Data Product Developer's GuideHDF-EOS Data Product Developer's Guide
HDF-EOS Data Product Developer's Guide
 
HDF Status Update
HDF Status UpdateHDF Status Update
HDF Status Update
 
NASA Terra Data Fusion
NASA Terra Data FusionNASA Terra Data Fusion
NASA Terra Data Fusion
 
S3 VFD
S3 VFDS3 VFD
S3 VFD
 
HDF Data in the Cloud
HDF Data in the CloudHDF Data in the Cloud
HDF Data in the Cloud
 

Kürzlich hochgeladen

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 

Kürzlich hochgeladen (20)

Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 

Parallel Computing with HDF Server

  • 1. Parallel Computing with HDF Server 1 John Readey
  • 2. The HDF5 data format 2 • Established 20 years ago the HDF5 file format is the most commonly used format in Earth Science • Note: NetCDF4 files are actually HDF5 “under the hood” • HDF5 was designed with the (somewhat contradictory) goals of: • Archival format – data that can stored for decades • Analysis Ready -- data that can be directly utilized for analytics (no conversion needed) • There’s a rich set of tools and language SDKs: • C/C++/Fortran • Python • Java, etc.
  • 3. HDF5 File Format meets the Cloud 3 • Storing large HDF5 collection on AWS is almost always about utilizing S3: • Cost effective • Redundant • Sharable • It’s easy enough to store HDF5 files as S3 objects, but these files can’t be read using the HDF5 library (which is expecting a POSIX filesystem) • Experience using FUSE to read from S3 using HDF5Library has not tended to work so well • In practice users have been left with copying files to local disk first • This has led to interest in alternative formats such as Zarr, TileDB, and our own HSDS S3 Storage Schema (more on that later) • Our HSDS server provides a means of efficiently accessing HDF5 data on S3
  • 4. HDF Server 4 • HSDS is an open source REST based service for HDF data • Think of it as HDF gone cloud native.  • HSDS Features: • Runs as a set of containers on Kubernetes – so can scale beyond one machine • Requests can be parallelized across multiple containers • Feature compatible with the HDF library but is independent code base • Supports multiple readers/writers • Uses S3 as data store • Existing HDF APIs (h5py, h5netcdf, xarray, etc.) work seamlessly with HSDS • Available now as part of HDF Kita Lab (our hosted Jupyter environment): https://hdflab.hdfgroup.org • Available on AWS Marketplace as “Kita Server”
  • 5. HDF Cloud Schema Big Idea: Map individual HDF5 objects (datasets, groups, chunks) as Object Storage Objects • Limit maximum storage object size • Support parallelism for read/write • Only data that is modified needs to be updated • Multiple clients can be reading/updating the same “file” Legend: • Dataset is partitioned into chunks • Each chunk stored as an S3 object • Dataset meta data (type, shape, attributes, etc.) stored in a separate object (as JSON text) How to store HDF5 content in S3? Each chunk (heavy outlines) get persisted as a separate object
  • 6. 6Dataset JSON Example • creationProperties contains HDF5 dataset creation property list settings. • Id is the objects UUID. • Layout represents HDF5 dataspace. • Root points back to the root group • Created & lastModified are timestamps • type represents HDF5 datatype. • attributes holds a list of HDF5 attribute JSON objects. { "creationProperties": {}, "id": "d-9a097486-58dd-11e8-a964- 0242ac110009", "layout": {"dims": [10], "class": "H5D_CHUNKED"}, "root": "g-952b0bfa-58dd-11e8-a964- 0242ac110009", "created": 1526456944, "lastModified": 1526456944, "shape": {"dims": [10], "class": "H5S_SIMPLE"}, "type": {"base": "H5T_STD_I32LE", "class": "H5T_INTEGER"}, "attributes": {} }
  • 7. Schema Details 7 • Key Organization • Objects are stored root_id • All non-root objects are stored as sub-keys of root_id • “flat” organization to support non-cycle links • Each storage node is limited to about 300 req/s 5,500 req/s • Note: rate limit raised last year by Amazon • Chunks are stored in the same folder as dataset metadata • Chunk key is determined based on chunk position in the data space • E.g. c-<uuid>_0_0_0 Is the corner chunk of a 3-dimensional dataset • Chunk objects get created as needed on first write • Schema is currently used just by HDF Server, but could just as easily be used directly by clients (assuming that writes don’t conflict)
  • 8. • Several improvements have been made over the last year • Read access for traditional HDF5 Files stored in S3 • More on this in the next slide • Shuffle filter support • Along with deflate • Fast metadata loading • Optionally load all metadata in one request • Support for multiple buckets • HSDS can access data stored in different buckets • H5netcdf & xarray support • Support for the REST API is built into these packages • Additional CLI tools (hsmv, hscp, hsdiff) • Variable Length data support with compression • Schema V2 8New Features
  • 9. Supporting traditional HDF5 files 9 • Downside of the HDF S3 Schema is that data needs be transmogrified* • Since the bulk of the data is usually the chunk data it makes sense to just leave the data in place and save pointers to the original file(s): • Convert just the metadata of the source HDF5 file to the S3 Schema • Store the source file as a S3 object • For data reads, metadata provides offset and length into the HDF5 file • S3 Range GET returns needed data • This approach can be used either directly or with HDF Server • Compared with accessing S3 directly, you reduce the number of S3 requests needed • Performance is comparable to sharded data model • Only read access is supported
  • 10. Hybrid Approach: Metadata + HDF5 Files 1 0 S3://BIG_REPO/…/AN_HDF5_FILE.h5 Imported Metadata (JSON) HDF5 File stored as S3 object Dset /dset1: chunk 0 Dset /dset1: chunk 1 Dset /dset1: chunk n S3 Range GET( • S3 Key • Offset • Num Bytes)
  • 11. • HDF Kita Lab runs on AWS in a Kubernetes cluster • Cluster can scale to handle different number of users • Each user gets: • 1 CPU Core (2.5GHz Xeon) • 8 GB RAM • 10 GB Disk • 100 GB S3 Storage • Access to HDF Kita Server • Ability to read/write HDF data stored on S3 • User environment configured for commonly used Python Packages for HDF users: • H5py(d), pandas, h5netcdf, xarray, bokeh, dask • HDF Kita Command Line tools: • Hsinfo, hsls, hsget, hsload, etc. 1 1Kita Lab – playground for HDF Server
  • 12. • JupyterLab and Kita Server both runs as a set of Docker containers • Kubernetes transparently manages running these containers across multiple machines 1 2Kubernetes Platform AWS Kubernetes JupyterHub HDF Kita Server (HSDS) {Containers
  • 13. 1 3Architecture AWS S3 Kita Server (HSDS) User SN SN SN SN DN DN DN DN User Containers & EBS Volumes spawn
  • 14. References 1 4 • HSDS: https://github.com/HDFGroup/hsds • H5Pyd: https://github.com/HDFGroup/h5pyd • Kita Lab: https://www.hdfgroup.org/hdfkitalab/ • SciPy2017 talk: https://s3.amazonaws.com/hdfgroup/docs/hdf_data_services_scipy201 7.pdf • AWS Big Data Blog article: https://aws.amazon.com/blogs/big-data/power- from-wind-open-data-on-aws/ • Spark and HDF Blog article: https://www.hdfgroup.org/2015/04/putting- some-spark-into-hdf-eos • Notebook from this talk: https://gist.github.com/jreadey/d1c67aee07451985397f48a50be2cdaa

Hinweis der Redaktion

  1. Many users of HDF5 are now migrating data archives to public or private cloud systems. The access approaches and performance characteristics of cloud storage are fundamentally different than traditional data storage systems because 1) the data are accessed over http and 2) the data are stored in an object store and identified using unique keys. There are many different ways to organize and access data in the cloud. The HDF Group is currently exploring and developing approaches that will facilitate migration to the cloud and support many existing HDF5 data access use cases. Our goal is to protect data providers and users from disruption as their data and applications are migrated to the cloud.
  2. This idea has been kicking around for a while, but storing potentially millions of files on a Linux filesystem would be problematic. Using S3 as the storage vehicle is a natural fit since there’s no limit to the number of objects in a bucket. With NREL we’ve validated this approach to 50 TB’s of data over 27MM objects (see aws-big-data blog article: https://aws.amazon.com/blogs/big-data/power-from-wind-open-data-on-aws/ )