HDF5 in the Cloud
HDFCloud
Workshop
John Readey
The HDF Group
jreadey@hdfgroup.org
My Background
Sr. Architect at The HDF Group
Started in 2014
Have been exploring remote interfaces to HDF
Previously: Dev Manager at Amazon/AWS
More previously: Used HDF5 while a developer
at Intel
What is HDF5?
Depends on your point of view:
• a C-API
• a File Format
• a data model
Think of HDF5 as a file system
within a file.
Store arrays with chunking and compression.
Add NumPy style data selection.
• Native support for multidimensional
data
• Data and metadata in one place =>
streamlines data lifecycle & pipelines
• Portable, no vendor lock-in
• Maintains logical view while adapting to
storage context
• In-memory, over-the-wire, on-disk, parallel
FS, object store
• Pluggable filter pipeline for compression,
checksum, encryption, etc.
• High-performance I/O
• Large ecosystem (700+ GitHub projects)
Why is this concept so different + useful?
Why HDF in the Cloud
• It can provide a cost-effective infrastructure
• Pay for what you use vs pay for what you may need
• Lower overhead: no hardware setup, network configuration, etc.
• Potentially can benefit from cloud-based technologies:
• Elastic compute – scale compute resources dynamically
• Object based storage – low cost/built in redundancy
• Community platform (potentially)
• Enables interested users to bring their applications to the data
• Share data among many users
Cost Factors
• Most public clouds bill per usage
• For HDF in the cloud, there are three big cost drivers:
• Storage – What storage system will be used? (see next slide)
• Compute – Elastic compute on demand better than fixed cost
• Scale compute to usage not size of data
• Egress charges
• Ingress is free but getting data out will cost you ($0.09/GB)
• Enabling users to get just the data they need will tend to lower egress charges
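To make the egress point concrete, the sketch below compares the transfer charge for downloading a whole file against fetching only a selection. The $0.09/GB rate is the figure quoted above; the file and selection sizes are made-up illustrative numbers, and real pricing varies by provider, region, and tier.

```python
# Egress-cost sketch. The rate comes from the slide; sizes are hypothetical.
EGRESS_USD_PER_GB = 0.09


def egress_cost(gigabytes: float, rate: float = EGRESS_USD_PER_GB) -> float:
    """Return the transfer-out charge in USD for the given data volume."""
    return gigabytes * rate


full_file_gb = 500.0   # hypothetical: user downloads an entire 500 GB file
selection_gb = 2.0     # hypothetical: user reads only the 2 GB they need

print(f"full file: ${egress_cost(full_file_gb):.2f}")
print(f"selection: ${egress_cost(selection_gb):.2f}")
```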
Storage Costs
How much will it cost to store 1PB for one year on AWS?
Answer depends on the technology and tradeoffs you are willing to accept…
Technology            What it is                Cost for 1PB/1yr  Fine Print
Glacier               Offline (tape) Storage    $125K             4 hour latency for first read; additional costs for restore
S3 Infrequent Access  Nearline Object Storage   $157K             $0.01/GB data retrieval charge; $10K to read entire PB!
S3                    Online Object Storage     $289K             Request pricing $0.01 per 10K req; transfer out charge $0.01/GB
EBS                   Attachable Disk Storage   $629K             Extra charges for guaranteed IOPS; need backups
EFS                   Shared Network (NFS)      $3,774K           Not all NFSv4.1 features supported (e.g. file locking)
DynamoDB              NoSQL Database            $3,145K           Extra charge for IOPS
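The yearly totals in the table can be converted to the per-GB monthly rates that cloud providers usually quote. The sketch below assumes 1 PB = 1,000,000 GB (decimal units), which is consistent with the table's own "$10K to read entire PB" at $0.01/GB. The derived rates are just the table figures divided out, not current list prices.

```python
# Convert the table's 1 PB / 1 year totals into implied $/GB-month rates.
# Assumption: 1 PB = 1,000,000 GB (decimal units).
GB_PER_PB = 1_000_000
MONTHS_PER_YEAR = 12

yearly_cost_usd = {
    "Glacier": 125_000,
    "S3 Infrequent Access": 157_000,
    "S3": 289_000,
    "EBS": 629_000,
    "EFS": 3_774_000,
    "DynamoDB": 3_145_000,
}


def implied_rate(yearly_usd: float) -> float:
    """Implied storage price in USD per GB per month."""
    return yearly_usd / (GB_PER_PB * MONTHS_PER_YEAR)


for name, cost in yearly_cost_usd.items():
    print(f"{name:22s} ${implied_rate(cost):.4f} per GB-month")

# Retrieval charge for reading the whole PB from S3 Infrequent Access.
full_read_usd = 0.01 * GB_PER_PB
```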
Introducing Highly Scalable Data Service
(HSDS)
• RESTful interface to HDF5 using object storage
• Storage using AWS S3
• Built in redundancy
• Cost effective
• Scalable throughput
• Runs as a cluster of Docker containers
• Elastically scale compute with usage
• Feature compatible with HDF5 library
• Implemented in Python using asyncio
• Task oriented parallelism
HSDS Features
• Clients can interact with service using REST API
• SDKs provide language specific interface (e.g. h5pyd for Python)
• Can read/write just the data they need (as opposed to transferring entire
files)
• No limit to the amount of data that can be stored by the service
• Multiple clients can read/write to same data source
• Scalable performance:
• Can cache recently accessed data in RAM
• Can parallelize requests across multiple nodes
• More nodes -> better performance
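One way to picture "read just the data they need": a client turns a NumPy-style selection into a query parameter on a dataset's value request. The sketch below assumes the `select=[start:stop:step,...]` form used by the HDF REST API. The helper name `select_param` and the dataset id placeholder are hypothetical, and the exact syntax should be checked against the server documentation.

```python
from urllib.parse import quote


def select_param(slices, shape):
    """Format per-dimension slices as a hyperslab selection string.

    `slices` holds one Python slice per dimension; `shape` is the dataset shape
    used to resolve open-ended slices such as slice(None).
    """
    parts = []
    for sl, extent in zip(slices, shape):
        start, stop, step = sl.indices(extent)  # resolve None and negatives
        parts.append(f"{start}:{stop}:{step}")
    return "[" + ",".join(parts) + "]"


# Rows 0 to 9 and every second column of a hypothetical 100 x 200 dataset.
sel = select_param((slice(0, 10), slice(None, None, 2)), (100, 200))
# "<dataset-id>" is a placeholder, not a real identifier.
url_path = "/datasets/<dataset-id>/value?select=" + quote(sel)
print(sel)
print(url_path)
```

In h5pyd the same selection would be written with ordinary indexing on a dataset object, and the client library takes care of building the request.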
What makes it RESTful?
• Client-server model
• Stateless – (no client context stored on server)
• Cacheable – clients can cache responses
• Resources identified by URIs (datasets, groups, attributes, etc)
• Standard HTTP methods and behaviors:
Method  Safe  Idempotent  Description
GET     Y     Y           Get a description of a resource
POST    N     N           Create a new resource
PUT     N     Y           Create a new named resource
DELETE  N     Y           Delete a resource
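The Safe and Idempotent columns can be illustrated with a toy in-memory store. This is not how the service is implemented; the class and method names are hypothetical and only model the semantics in the table: repeating a PUT or DELETE leaves the same state, while repeating a POST creates another resource.

```python
import uuid


class ToyStore:
    """In-memory stand-in for a collection of REST resources."""

    def __init__(self):
        self.resources = {}

    def get(self, key):
        # Safe: reads never change server state.
        return self.resources.get(key)

    def post(self, body):
        # Not idempotent: the server picks a fresh id on every call.
        key = str(uuid.uuid4())
        self.resources[key] = body
        return key

    def put(self, key, body):
        # Idempotent: the client names the resource, so repeats overwrite.
        self.resources[key] = body

    def delete(self, key):
        # Idempotent: deleting an absent resource leaves the state unchanged.
        self.resources.pop(key, None)


store = ToyStore()
store.put("dset1", {"shape": 10})
store.put("dset1", {"shape": 10})   # same state as after the first PUT
count_after_puts = len(store.resources)
store.post({"shape": 10})
store.post({"shape": 10})           # each POST adds a new resource
count_after_posts = len(store.resources)
```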
Example POST Request – Create Dataset
Request:
POST /datasets HTTP/1.1
Content-Length: 39
User-Agent: python-requests/2.3.0 CPython/2.7.8 Darwin/14.0.0
Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
host: newdset.datasettest.test.hdfgroup.org
Accept: */*
Accept-Encoding: gzip, deflate

{ "shape": 10, "type": "H5T_IEEE_F32LE" }

Response:
HTTP/1.1 201 Created
Date: Thu, 29 Jan 2017 06:14:02 GMT
Content-Length: 651
Content-Type: application/json
Server: aiohttp/3.2.2

{ "id": "0568d8c5-a77e-11e4-9f7a-3c15c2da029e", "attributeCount": 0, "created":
"2017-01-29T06:14:02Z", "lastModified": "2017-01-29T06:14:02Z", … ] }
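The request above could be assembled from Python roughly as follows. This sketch uses only the standard library and builds the request without sending it. The host name and credentials are the example values from the slide (the Basic token is the base64 encoding of the well-known `Aladdin:open sesame` sample pair); the `Content-Type` header is an addition for illustration, and whether a given server accepts this exact body is an assumption.

```python
import base64
import json
from urllib.request import Request

# Example values taken from the request shown above.
HOST = "newdset.datasettest.test.hdfgroup.org"
body = json.dumps({"shape": 10, "type": "H5T_IEEE_F32LE"}).encode("utf-8")
token = base64.b64encode(b"Aladdin:open sesame").decode("ascii")

req = Request(
    url=f"http://{HOST}/datasets",
    data=body,
    method="POST",
    headers={
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",  # not in the slide; added here
        "Accept": "*/*",
    },
)

# Nothing is sent here; urllib.request.urlopen(req) would issue the call.
print(req.get_method(), req.full_url)
print(req.get_header("Authorization"))
```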
Object Storage Challenges for HDF
• Not POSIX!
• High latency (>0.1s) per request
• Not write/read consistent
• High throughput needs some tricks
• (use many async requests)
• Request charges can add up (public cloud)
For HDF5, using the HDF5 library
directly on an object storage
system is a non-starter. Will
need an alternative solution…
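A back-of-the-envelope calculation shows why many concurrent requests matter. It takes the 0.1 s per-request latency from the list above as a fixed cost and ignores transfer time and throttling; the chunk count and concurrency level are made-up numbers.

```python
import math

LATENCY_S = 0.1        # per-request latency quoted above (lower bound)
num_chunks = 1000      # hypothetical number of objects to read
in_flight = 100        # hypothetical number of concurrent requests


def total_latency(n_requests: int, concurrency: int) -> float:
    """Latency-only estimate: requests are issued in waves of `concurrency`."""
    waves = math.ceil(n_requests / concurrency)
    return waves * LATENCY_S


sequential_s = total_latency(num_chunks, 1)
concurrent_s = total_latency(num_chunks, in_flight)
print(f"one at a time: {sequential_s:.1f} s, {in_flight} in flight: {concurrent_s:.1f} s")
```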
HSDS S3 Schema
Big Idea: Map individual HDF5
objects (datasets, groups,
chunks) as Object Storage
Objects
• Limit maximum storage object size
• Support parallelism for read/write
• Only data that is modified needs to be updated
• (Potentially) Multiple clients can be
reading/updating the same “file”
Legend:
• Dataset is partitioned into
chunks
• Each chunk stored as an S3
object
• Dataset metadata (type,
shape, attributes, etc.) stored in
a separate object (as JSON
text)
How to store HDF5 content in S3?
Each chunk (heavy outlines) gets
persisted as a separate object
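The chunk-per-object idea can be sketched as a mapping from a selection to the set of chunk objects that must be read. The key layout (`<dataset-id>_<i>_<j>`) and the function name are hypothetical, chosen only for illustration; the actual HSDS key scheme may differ.

```python
from itertools import product


def chunks_for_selection(selection, chunk_shape, dset_id="d-example"):
    """List the object keys of chunks intersecting a hyperslab.

    `selection` is a sequence of (start, stop) pairs, one per dimension,
    with `stop` exclusive. `chunk_shape` gives the chunk extent per dimension.
    """
    index_ranges = []
    for (start, stop), extent in zip(selection, chunk_shape):
        first = start // extent          # chunk holding the first element
        last = (stop - 1) // extent      # chunk holding the last element
        index_ranges.append(range(first, last + 1))
    return [
        dset_id + "_" + "_".join(str(i) for i in idx)
        for idx in product(*index_ranges)
    ]


# Rows 5..24 and columns 0..9 of a dataset stored in 10 x 10 chunks.
keys = chunks_for_selection([(5, 25), (0, 10)], (10, 10))
print(keys)
```

Only the chunks whose keys are returned need to be fetched or rewritten, which is what allows partial reads and updates without touching the rest of the dataset.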
Client/Server Architecture
Architecture for HSDS
Legend:
• Client: Any user of the service
• Load balancer – distributes requests to Service nodes
• Service Nodes – processes requests from clients (with help from Data
Nodes)
• Data Nodes – responsible for partition of Object Store
• Object Store: Base storage service (e.g. AWS S3)
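One simple way a service node could decide which data node owns an object is to hash the object id and take it modulo the number of data nodes. The sketch below illustrates that idea only; it is an assumption for illustration, not a description of the routing HSDS actually uses, and the ids are invented.

```python
import hashlib


def data_node_for(object_id: str, num_data_nodes: int) -> int:
    """Map an object id to a data node index in [0, num_data_nodes)."""
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:4], "big") % num_data_nodes


object_ids = [f"c-example_{i}_0" for i in range(8)]  # invented chunk ids
assignment = {oid: data_node_for(oid, 4) for oid in object_ids}
for oid, node in assignment.items():
    print(oid, "-> DN", node)
```

Because the same id always maps to the same node, each data node sees a stable partition of the object store. With this naive scheme, changing the node count reassigns most objects.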
Implementing HSDS with asyncio
• HSDS relies heavily on Python’s new asyncio module
• Concurrency based on tasks (rather than say multithreading or
multiprocessing)
• Task switching occurs when process would otherwise wait on I/O
async def my_func():
    a_regular_function_call()
    # await suspends this task until the awaited call completes
    await a_blocking_call()
• Control will switch to another task when await is
encountered
• Result is the app can do other useful work vs. blocking
• Supporting 1000’s of concurrent tasks within a process
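A small runnable illustration of the task switching described above, using `asyncio.sleep` as a stand-in for I/O. The coroutine names and delays are invented for the example.

```python
import asyncio

events = []


async def fake_io(name: str, delay: float) -> None:
    events.append(f"{name} start")
    # Awaiting yields control to the event loop, so other tasks can run.
    await asyncio.sleep(delay)
    events.append(f"{name} done")


async def main() -> None:
    # Both coroutines run as tasks in one thread of one process.
    await asyncio.gather(fake_io("slow", 0.2), fake_io("fast", 0.05))


asyncio.run(main())
print(events)
```

The "fast" task finishes while the "slow" one is still waiting, even though neither uses a thread of its own.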
Parallelizing data access with asyncio
• SN node invoking parallel requests on DN nodes
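tasks = []
for chunk_id in my_chunk_list:
    # schedule one request per chunk; the tasks run concurrently
    task = asyncio.ensure_future(read_chunk_query(chunk_id))
    tasks.append(task)
# the loop argument was removed from asyncio.gather in Python 3.10
await asyncio.gather(*tasks)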
• read_chunk_query makes an HTTP request to a specific DN
node
• The set of DN nodes can be reading from S3, decompressing,
and selecting the requested data in parallel
• asyncio.gather waits for all tasks to complete before
continuing
• Meanwhile, new requests can be processed by SN node
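A self-contained version of the fan-out pattern above, with the HTTP call to a DN node replaced by a simulated delay. `read_chunk_query` here is a stand-in written for this example, not the HSDS function, and the chunk ids are invented.

```python
import asyncio
import time


async def read_chunk_query(chunk_id: str) -> str:
    """Stand-in for an HTTP request to the DN node holding `chunk_id`."""
    await asyncio.sleep(0.1)  # simulated network and S3 latency
    return f"data for {chunk_id}"


async def read_chunks(chunk_ids):
    tasks = [asyncio.ensure_future(read_chunk_query(c)) for c in chunk_ids]
    # gather returns results in the same order as the tasks were passed in.
    return await asyncio.gather(*tasks)


my_chunk_list = [f"c-example_{i}" for i in range(20)]
started = time.perf_counter()
results = asyncio.run(read_chunks(my_chunk_list))
elapsed = time.perf_counter() - started
print(f"{len(results)} chunks in {elapsed:.2f} s")
```

Twenty simulated reads of 0.1 s each complete in roughly the time of one, because all of them are waiting at the same time.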
Python and Docker
• Docker makes developing clustered applications sooo much easier
• Can run dozens of containers on a moderate laptop
• Containers communicate with each other just like on a physical network
• Use docker stats to check CPU, network I/O, and disk I/O usage per container
• Can try out different constraints for amount of memory, disk per container
• Same code “just works” on an actual cluster
• “Scale up” by launching more containers on production hardware
• AWS ECS enables running containers in a machine agnostic way
• Using docker does require a reversion to the edit/build/run paradigm
• The build step is now the creation of the docker image
• Run is launching the container(s)
Python package MVPs
• numpy – python arrays
• Used heavily in server and client stacks
• Great performance for common array operations
• Simplifies much of the logic needed for hyperslab selection
• aiohttp – async http client/server
• Use of asyncio requires async enabled packages
• aiohttp is used in HSDS as both web server and client
• aiobotocore – async AWS S3 client
• Enables async read/write to S3
• h5py – template for h5pyd package
H5pyd – Python client for HDF Server
• h5py is a popular Python package that provides a Pythonic interface to the
HDF5 library
• h5pyd (for h5py distributed) provides an h5py-compatible interface for accessing
the server
• Pure Python – uses the requests package to make HTTP calls to the server
• Compatible with h5serv (the reference implementation of the HDF REST API)
• Includes several extensions to h5py:
• List content in folders
• Get/Set ACLs (access control list)
• Pytables-like query interface
HDF REST VOL
• The HDF5 VOL architecture is a plugin layer for HDF5
• User API stays the same, but different backends can be
implemented
• REST VOL substitutes REST API requests for file i/o actions
• C/Fortran applications should be able to run as is
• Still in development – Beta expected this year
HSDS CLI (Command Line Interface)
• Accessing HDF via a service means you can’t use the usual shell commands:
ls, rm, chmod, etc.
• Command line tools are a set of simple apps to use instead:
• hsinfo: display server version, connect info
• hsls: list content of folder or file
• hstouch: create folder or file
• hsdel: delete a file
• hsload: upload an HDF5 file
• hsget: download content from server to an HDF5 file
• hsacl: create/list/update ACLs (Access Control Lists)
• Implemented in Python & uses h5pyd
Future Work
• Work planned for the next year
• Compression
• Variable length datatypes
• NetCDF support
• Auto Scaling
• Scalability and performance testing
Demo Time!
NREL (National Renewable Energy Laboratory) is using HSDS
to make 7TB of wind simulation data accessible to the public.
Datasets are three-dimensional covering the continental US:
• Time (one slice/hour)
• Lon (~2k resolution)
• Lat (~2k resolution)
Initial data covers one year (8760 slices), but will soon be
extended to 5 years (35 TB).
Rather than downloading TB’s of files, interested users can
now use the HSDS client libraries to explore the datasets.
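The figures on this slide can be cross-checked with a little arithmetic. The per-slice size is an average across all datasets derived from the quoted totals (assuming decimal terabytes), not a number from NREL.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365          # non-leap year, matching the 8760 figure
TB = 10**12                  # assumption: decimal terabytes

slices_per_year = HOURS_PER_DAY * DAYS_PER_YEAR
slices_five_years = 5 * slices_per_year
five_year_tb = 5 * 7         # 7 TB per year of hourly data

# Average bytes per hourly slice, summed over all datasets.
bytes_per_slice = 7 * TB / slices_per_year

print(slices_per_year, slices_five_years, five_year_tb)
print(f"{bytes_per_slice / 10**6:.0f} MB per hourly slice on average")
```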
To Find out More:
• H5serv: https://github.com/HDFGroup/h5serv
• Documentation: http://h5serv.readthedocs.io/
• H5pyd: https://github.com/HDFGroup/h5pyd
• RESTful HDF5 White Paper:
https://www.hdfgroup.org/pubs/papers/RESTful_HDF5.pdf
• Blog articles:
• https://hdfgroup.org/wp/2015/04/hdf5-for-the-web-hdf-server/
• https://hdfgroup.org/wp/2015/12/serve-protect-web-security-hdf5/
• https://www.hdfgroup.org/2017/04/the-gfed-analysis-tool-an-hdf-server-implementation/
HDF5 Community Support
• Documentation - https://support.hdfgroup.org/documentation/
• Tutorials, FAQs, examples
• HDF-Forum – mailing list and archive
• Great for specific questions
• Helpdesk Email – help@hdfgroup.org
• Issues with software and documentation
https://support.hdfgroup.org/services/community_support.html
Questions? Comments?
Dave Pearah
CEO
David.Pearah@hdfgroup.org
Dax Rodriguez
Director of Commercial Services and
Solutions
Dax.Rodriguez@hdfgroup.org
www.hdfgroup.org

Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

HDFCloud Workshop: HDF5 in the Cloud

  • 1. HDF5 in the Cloud HDFCloud Workshop John Readey The HDF Group jreadey@hdfgroup.org
  • 2. My Background
      • Sr. Architect at The HDF Group (started in 2014)
      • Have been exploring remote interfaces to HDF
      • Previously: Dev Manager at Amazon/AWS
      • Before that: used HDF5 while a developer at Intel
  • 3. What is HDF5?
      Depends on your point of view:
      • a C API
      • a file format
      • a data model
      Think of HDF5 as a file system within a file. Store arrays with chunking and compression. Add NumPy-style data selection.
  • 4. Why is this concept so different + useful?
      • Native support for multidimensional data
      • Data and metadata in one place => streamlines data lifecycle & pipelines
      • Portable, no vendor lock-in
      • Maintains logical view while adapting to storage context
          • In-memory, over-the-wire, on-disk, parallel FS, object store
      • Pluggable filter pipeline for compression, checksum, encryption, etc.
      • High-performance I/O
      • Large ecosystem (700+ GitHub projects)
  • 5. Why HDF in the Cloud
      • It can provide a cost-effective infrastructure
          • Pay for what you use vs. pay for what you may need
          • Lower overhead: no hardware setup, network configuration, etc.
      • Potentially can benefit from cloud-based technologies:
          • Elastic compute – scale compute resources dynamically
          • Object-based storage – low cost, built-in redundancy
      • Community platform (potentially)
          • Enables interested users to bring their applications to the data
          • Share data among many users
  • 6. Cost Factors
      • Most public clouds bill per usage
      • For HDF in the cloud, there are three big cost drivers:
          • Storage – What storage system will be used? (see next slide)
          • Compute – Elastic compute on demand is better than fixed cost
              • Scale compute to usage, not size of data
          • Egress charges
              • Ingress is free, but getting data out will cost you ($0.09/GB)
              • Enabling users to get just the data they need will tend to lower egress charges
  • 7. Storage Costs
      How much will it cost to store 1PB for one year on AWS?
      The answer depends on the technology and the tradeoffs you are willing to accept…

      Technology            What it is                Cost for 1PB/1yr   Fine print
      Glacier               Offline (tape) storage    $125K              4-hour latency for first read; additional costs for restore
      S3 Infrequent Access  Nearline object storage   $157K              $0.01/GB data retrieval charge ($10K to read entire PB!)
      S3                    Online object storage     $289K              Request pricing $0.01 per 10K requests; transfer-out charge $0.01/GB
      EBS                   Attachable disk storage   $629K              Extra charges for guaranteed IOPS; need backups
      EFS                   Shared network (NFS)      $3,774K            Not all NFSv4.1 features supported (e.g. file locking)
      DynamoDB              NoSQL database            $3,145K            Extra charge for IOPS
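The figures on the Cost Factors and Storage Costs slides can be sanity-checked with a few lines of arithmetic. The sketch below uses only the prices quoted on those slides and assumes 1 PB = 1,000,000 GB (decimal units; the slides do not say which convention was used). The function and variable names are illustrative.

```python
# Prices quoted on the slides (USD). GB_PER_PB is an assumption (decimal units).
GB_PER_PB = 1_000_000

yearly_cost_per_pb = {
    "Glacier": 125_000,
    "S3 Infrequent Access": 157_000,
    "S3": 289_000,
    "EBS": 629_000,
    "EFS": 3_774_000,
    "DynamoDB": 3_145_000,
}


def implied_price_per_gb_month(yearly_cost: float) -> float:
    """Back out the per-GB monthly price implied by a 1 PB / 1 year total."""
    return yearly_cost / GB_PER_PB / 12


def transfer_cost(gigabytes: float, price_per_gb: float) -> float:
    """Cost of moving `gigabytes` at a flat per-GB price."""
    return round(gigabytes * price_per_gb, 2)


if __name__ == "__main__":
    for name, cost in yearly_cost_per_pb.items():
        print(f"{name:22s} ~${implied_price_per_gb_month(cost):.4f}/GB-month")
    # Reading a full PB back from S3 Infrequent Access at $0.01/GB
    print("S3 IA full read:", transfer_cost(GB_PER_PB, 0.01))
    # Egress of a full PB at the $0.09/GB quoted on the Cost Factors slide
    print("Egress of 1 PB:", transfer_cost(GB_PER_PB, 0.09))
```

Under these assumptions, a single full read or a single full egress of the data is a noticeable fraction of the yearly storage bill, which is the motivation for letting clients fetch only the data they need.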
  • 8. Introducing Highly Scalable Data Service (HSDS)
      • RESTful interface to HDF5 using object storage
      • Storage using AWS S3
          • Built-in redundancy
          • Cost effective
          • Scalable throughput
      • Runs as a cluster of Docker containers
          • Elastically scale compute with usage
      • Feature compatible with the HDF5 library
      • Implemented in Python using asyncio
          • Task-oriented parallelism
  • 9. HSDS Features
      • Clients can interact with the service using a REST API
      • SDKs provide language-specific interfaces (e.g. h5pyd for Python)
      • Clients can read/write just the data they need (as opposed to transferring entire files)
      • No limit to the amount of data that can be stored by the service
      • Multiple clients can read/write to the same data source
      • Scalable performance:
          • Can cache recently accessed data in RAM
          • Can parallelize requests across multiple nodes
          • More nodes -> better performance
  • 10. What makes it RESTful?
      • Client-server model
      • Stateless (no client context stored on server)
      • Cacheable – clients can cache responses
      • Resources identified by URIs (datasets, groups, attributes, etc.)
      • Standard HTTP methods and behaviors:

      Method   Safe  Idempotent  Description
      GET      Y     Y           Get a description of a resource
      POST     N     N           Create a new resource
      PUT      N     Y           Create a new named resource
      DELETE   N     Y           Delete a resource
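The Safe and Idempotent columns are easier to see with a toy model. The sketch below is an in-memory stand-in for a REST collection, not HSDS code; the class and method names are hypothetical and exist only to show why POST differs from PUT and DELETE.

```python
import uuid


class ToyStore:
    """In-memory stand-in for a REST collection (not HSDS code)."""

    def __init__(self):
        self.resources = {}

    def get(self, key):
        # Safe: never changes server state
        return self.resources.get(key)

    def post(self, body):
        # Not idempotent: every call creates a new resource with a new id
        key = str(uuid.uuid4())
        self.resources[key] = body
        return key

    def put(self, key, body):
        # Idempotent: repeating the call leaves the same single resource
        self.resources[key] = body
        return key

    def delete(self, key):
        # Idempotent: deleting twice ends in the same state as deleting once
        self.resources.pop(key, None)
```

Idempotency matters for a service like this because a client can safely retry a PUT or DELETE after a timeout, while a retried POST may create a duplicate.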
  • 11. Example POST Request – Create Dataset

      Request:

          POST /datasets HTTP/1.1
          Content-Length: 39
          User-Agent: python-requests/2.3.0 CPython/2.7.8 Darwin/14.0.0
          Authorization: Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==
          host: newdset.datasettest.test.hdfgroup.org
          Accept: */*
          Accept-Encoding: gzip, deflate

          { "shape": 10, "type": "H5T_IEEE_F32LE" }

      Response:

          HTTP/1.1 201 Created
          Date: Thu, 29 Jan 2017 06:14:02 GMT
          Content-Length: 651
          Content-Type: application/json
          Server: aiohttp/3.2.2

          {
            "id": "0568d8c5-a77e-11e4-9f7a-3c15c2da029e",
            "attributeCount": 0,
            "created": "2017-01-29T06:14:02Z",
            "lastModified": "2017-01-29T06:14:02Z",
            … ]
          }
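As a rough illustration, the request on this slide could be assembled in Python with only the standard library. The sketch builds the request without sending it; the host, path, and JSON body come from the slide, the credentials are the placeholder pair that the slide's Authorization header decodes to, and the Content-Type header and function name are additions made for this example.

```python
import base64
import json
import urllib.request

# Host and path are taken from the slide; the request is built but never sent.
URL = "http://newdset.datasettest.test.hdfgroup.org/datasets"


def build_create_dataset_request(username: str, password: str) -> urllib.request.Request:
    """Build (without sending) a POST that asks for a 10-element float32 dataset."""
    body = json.dumps({"shape": 10, "type": "H5T_IEEE_F32LE"}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/json",  # not shown on the slide; added here
        "Authorization": f"Basic {token}",
    }
    return urllib.request.Request(URL, data=body, headers=headers, method="POST")


if __name__ == "__main__":
    req = build_create_dataset_request("Aladdin", "open sesame")
    print(req.get_method(), req.full_url)
    print(len(req.data), "bytes:", req.data.decode("utf-8"))
    # To actually send it (requires a reachable server):
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status, json.load(resp))
```

The serialized body is 39 bytes, which matches the Content-Length header in the slide's request.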
  • 12. Object Storage Challenges for HDF
      • Not POSIX!
      • High latency (>0.1s) per request
      • Not write/read consistent
      • High throughput needs some tricks (use many async requests)
      • Request charges can add up (public cloud)
      Using the HDF5 library directly on an object storage system is a non-starter. An alternative solution is needed…
  • 13. HSDS S3 Schema: How to store HDF5 content in S3?
      Big Idea: Map individual HDF5 objects (datasets, groups, chunks) to object storage objects
      • Limit maximum storage object size
      • Support parallelism for read/write
      • Only data that is modified needs to be updated
      • (Potentially) Multiple clients can be reading/updating the same “file”
      Legend:
      • Dataset is partitioned into chunks
      • Each chunk is stored as an S3 object
      • Dataset metadata (type, shape, attributes, etc.) is stored in a separate object (as JSON text)
      Each chunk (heavy outlines) gets persisted as a separate object
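One consequence of the chunk-per-object layout is that a read only needs the objects whose chunks overlap the requested region. The sketch below works out which chunks a rectangular selection touches. The key format is an illustrative assumption, not the actual HSDS naming scheme, and the function name is hypothetical.

```python
from itertools import product


def chunk_ids_for_selection(dset_id, selection, chunk_shape):
    """Return hypothetical object keys for every chunk a selection touches.

    selection   -- one (start, stop) pair per dimension, stop exclusive
    chunk_shape -- chunk extent per dimension
    """
    per_dim = []
    for (start, stop), extent in zip(selection, chunk_shape):
        if stop <= start:
            return []  # empty selection touches no chunks
        first = start // extent
        last = (stop - 1) // extent
        per_dim.append(range(first, last + 1))
    # Key layout "c-<dataset id>_<i>_<j>..." is an illustrative assumption
    return [
        f"c-{dset_id}_" + "_".join(str(i) for i in index)
        for index in product(*per_dim)
    ]
```

For example, with 100 x 100 chunks, the selection rows 0..149 and columns 90..109 spans two chunks in each dimension, so four objects would be fetched no matter how large the whole dataset is.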
  • 15. Architecture for HSDS
      Legend:
      • Client: any user of the service
      • Load balancer: distributes requests to Service Nodes
      • Service Nodes: process requests from clients (with help from Data Nodes)
      • Data Nodes: each responsible for a partition of the Object Store
      • Object Store: base storage service (e.g. AWS S3)
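The slide says each Data Node is responsible for a partition of the Object Store but does not say how objects are assigned. One simple possibility, shown purely as an assumption, is to hash the object key and take it modulo the number of data nodes; the function name and the choice of hash are hypothetical.

```python
import hashlib


def data_node_for(object_key: str, dn_count: int) -> int:
    """Pick the data node that owns an object key (illustrative scheme only)."""
    if dn_count < 1:
        raise ValueError("need at least one data node")
    digest = hashlib.sha256(object_key.encode("utf-8")).digest()
    # Use a stable hash so every service node computes the same owner
    return int.from_bytes(digest[:4], "big") % dn_count
```

Whatever the real scheme is, the property that matters is that every Service Node maps a given object to the same Data Node, so that one node can act as the single writer and cache for that object.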
  • 16. Implementing HSDS with asyncio
      • HSDS relies heavily on Python’s new asyncio module
      • Concurrency is based on tasks (rather than, say, multithreading or multiprocessing)
      • Task switching occurs when the process would otherwise wait on I/O

          async def my_func():
              a_regular_function_call()
              await a_blocking_call()

      • Control will switch to another task when await is encountered
      • Result is that the app can do other useful work vs. blocking
      • Supports thousands of concurrent tasks within a process
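A minimal runnable version of the idea on this slide, with `asyncio.sleep` standing in for the blocking I/O call. The worker names and delays are made up for illustration.

```python
import asyncio


async def worker(name: str, delay: float, log: list) -> None:
    log.append(f"{name} start")
    # asyncio.sleep stands in for a blocking I/O call such as an S3 read
    await asyncio.sleep(delay)
    log.append(f"{name} done")


async def main() -> list:
    log = []
    # Both workers run in one thread; control switches at each await
    await asyncio.gather(worker("slow", 0.2, log), worker("fast", 0.05, log))
    return log


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The "fast" worker finishes while the "slow" one is still waiting, even though both run in a single thread, because control returns to the event loop at each `await`.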
  • 17. Parallelizing data access with asyncio
      • SN node invoking parallel requests on DN nodes:

          tasks = []
          for chunk_id in my_chunk_list:
              task = asyncio.ensure_future(read_chunk_query(chunk_id))
              tasks.append(task)
          # the original slide passed loop=loop; that argument was removed in Python 3.10
          await asyncio.gather(*tasks)

      • read_chunk_query makes an HTTP request to a specific DN node
      • The set of DN nodes can be reading from S3, decompressing, and selecting requested data in parallel
      • asyncio.gather waits for all tasks to complete before continuing
      • Meanwhile, new requests can be processed by the SN node
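The fan-out pattern on this slide can be tried without a cluster by simulating the DN request. In the sketch below, `read_chunk_query` is a stand-in that sleeps instead of making an HTTP call; the 0.1 s delay and the chunk names are assumptions chosen for the demonstration.

```python
import asyncio
import time


async def read_chunk_query(chunk_id: str) -> bytes:
    """Stand-in for the HTTP request an SN node sends to a DN node."""
    await asyncio.sleep(0.1)  # simulated DN + S3 latency
    return f"data for {chunk_id}".encode("utf-8")


async def read_chunks(chunk_ids: list) -> dict:
    tasks = [asyncio.ensure_future(read_chunk_query(c)) for c in chunk_ids]
    # gather returns results in the same order as the tasks were passed in
    results = await asyncio.gather(*tasks)
    return dict(zip(chunk_ids, results))


if __name__ == "__main__":
    ids = [f"chunk-{i}" for i in range(20)]
    started = time.perf_counter()
    chunks = asyncio.run(read_chunks(ids))
    elapsed = time.perf_counter() - started
    print(f"read {len(chunks)} chunks in {elapsed:.2f}s")
```

Twenty simulated requests of 0.1 s each complete in roughly the time of one, rather than the two seconds a sequential loop would take, which is how many concurrent requests can hide per-request object storage latency.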
  • 18. Python and Docker
      • Docker makes developing clustered applications sooo much easier
      • Can run dozens of containers on a moderate laptop
      • Containers communicate with each other just like on a physical network
      • Use docker stats to check CPU, network I/O, and disk I/O usage per container
      • Can try out different constraints for the amount of memory and disk per container
      • Same code “just works” on an actual cluster
          • “Scale up” by launching more containers on production hardware
          • AWS ECS enables running containers in a machine-agnostic way
      • Using Docker does require a reversion to the edit/build/run paradigm
          • The build step is now the creation of the Docker image
          • Run is launching the container(s)
  • 19. Python package MVPs
      • numpy – Python arrays
          • Used heavily in server and client stacks
          • Great performance for common array operations
          • Simplifies much of the logic needed for hyperslab selection
      • aiohttp – async HTTP client/server
          • Use of asyncio requires async-enabled packages
          • aiohttp is used in HSDS as both web server and client
      • aiobotocore – async AWS S3 client
          • Enables async read/write to S3
      • h5py – template for the h5pyd package
  • 20. h5pyd – Python client for HDF Server
      • h5py is a popular Python package that provides a Pythonic interface to the HDF5 library
      • h5pyd (for h5py distributed) provides an h5py-compatible interface for accessing the server
      • Pure Python – uses the requests package to make HTTP calls to the server
      • Compatible with h5serv (the reference implementation of the HDF REST API)
      • Includes several extensions to h5py:
          • List content in folders
          • Get/set ACLs (access control lists)
          • PyTables-like query interface
  • 21. HDF REST VOL
      • The HDF5 VOL architecture is a plugin layer for HDF5
      • User API stays the same, but different backends can be implemented
      • REST VOL substitutes REST API requests for file I/O actions
      • C/Fortran applications should be able to run as is
      • Still in development – Beta expected this year
  • 22. HSDS CLI (Command Line Interface)
      • Accessing HDF via a service means you can’t use the usual shell commands: ls, rm, chmod, etc.
      • The command line tools are a set of simple apps to use instead:
          • hsinfo: display server version, connect info
          • hsls: list content of folder or file
          • hstouch: create folder or file
          • hsdel: delete a file
          • hsload: upload an HDF5 file
          • hsget: download content from server to an HDF5 file
          • hsacl: create/list/update ACLs (Access Control Lists)
      • Implemented in Python and uses h5pyd
  • 23. Future Work
      • Work planned for the next year:
          • Compression
          • Variable-length datatypes
          • NetCDF support
          • Auto scaling
          • Scalability and performance testing
  • 24. Demo Time!
      NREL (National Renewable Energy Laboratory) is using HSDS to make 7TB of wind simulation data accessible to the public.
      Datasets are three-dimensional, covering the continental US:
      • Time (one slice/hour)
      • Lon (~2k resolution)
      • Lat (~2k resolution)
      Initial data covers one year (8760 slices), but will soon be extended to 5 years (35 TB).
      Rather than downloading TBs of files, interested users can now use the HSDS client libraries to explore the datasets.
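To get a feel for why selecting a subset matters here, the sketch below compares the size of one location's hourly time series with the size of a full one-year dataset. Both inputs beyond the 8760 hourly slices are assumptions: it reads "~2k" as roughly 2000 grid points per axis (the slide may instead mean 2 km spacing) and assumes 4-byte values. The function name is hypothetical.

```python
def selection_bytes(counts, itemsize=4):
    """Bytes needed for a selection with `counts` elements per dimension."""
    total = itemsize  # assumed 4-byte (float32) values
    for n in counts:
        total *= n
    return total


if __name__ == "__main__":
    # One year of hourly values at a single grid point
    print("time series at one point:", selection_bytes((8760, 1, 1)), "bytes")
    # One year over an assumed 2000 x 2000 grid
    print("full dataset:", selection_bytes((8760, 2000, 2000)) / 1e9, "GB")
```

Under these assumptions, a single-point time series is about 35 KB while the full dataset is on the order of 140 GB, so a client that requests only the slice it needs transfers several orders of magnitude less data.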
  • 25. To Find Out More:
      • h5serv: https://github.com/HDFGroup/h5serv
      • Documentation: http://h5serv.readthedocs.io/
      • h5pyd: https://github.com/HDFGroup/h5pyd
      • RESTful HDF5 White Paper: https://www.hdfgroup.org/pubs/papers/RESTful_HDF5.pdf
      • Blog articles:
          • https://hdfgroup.org/wp/2015/04/hdf5-for-the-web-hdf-server/
          • https://hdfgroup.org/wp/2015/12/serve-protect-web-security-hdf5/
          • https://www.hdfgroup.org/2017/04/the-gfed-analysis-tool-an-hdf-server-implementation/
  • 26. HDF5 Community Support
      • Documentation: https://support.hdfgroup.org/documentation/
          • Tutorials, FAQs, examples
      • HDF-Forum: mailing list and archive
          • Great for specific questions
      • Helpdesk email: help@hdfgroup.org
          • Issues with software and documentation
      https://support.hdfgroup.org/services/community_support.html
  • 27. Questions? Comments?
      • Dave Pearah, CEO: David.Pearah@hdfgroup.org
      • Dax Rodriguez, Director of Commercial Services and Solutions: Dax.Rodriguez@hdfgroup.org
      www.hdfgroup.org