The recent explosion in the sizes of the data manipulated by distributed scientific applications has prompted the need for specialized storage systems capable of dealing with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and for writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data and metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges and shows promising results in our initial evaluation.
1. Pyramid: A large-scale array-oriented active storage system
Viet-Trung Tran, Bogdan Nicolae, Gabriel Antoniu, Luc Bougé
KerData Team, Inria, Rennes, France
02 09 2011
4. Context: Data-intensive large-scale HPC simulations
• The scalability of data management is becoming a critical issue
• Mismatch between the storage model and the application data model
• Application data model
- Multidimensional typed arrays, images, etc.
• Storage model
- Parallel file systems: a simple, flat I/O model (a sequence of bytes)
- Relational model: ill-suited for scientific data
• Additional layers are needed to map the application model onto the storage model
5. [M. Stonebraker] The one-storage-fits-all approach has reached its limits
• Parallel I/O stack:
- Performance of non-contiguous I/O vs. data atomicity
• Relational data model:
- Simulating arrays on top of tables performs poorly
- Poor scalability for join queries
• Need to specialize the I/O stack to match the applications' requirements
- Array-oriented storage for the array data model
• Example: SciDB with ArrayStore
[Figure: the parallel I/O stack: Application (VisIt, tornado simulation) → Data model (HDF5, NetCDF) → MPI-IO middleware → Parallel file systems]
7. Multi-dimensional-aware chunking
• Split the array into equal-sized chunks, distributed over the storage elements (see the sketch below)
- Simplifies load balancing among storage elements
- Keeps neighboring cells in the same chunk
• Shared-nothing architecture
- Easier to handle data consistency
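To make the chunk mapping concrete, here is a minimal sketch of how a sub-array access could be mapped to the equal-sized chunks it intersects; the function name and the chunk shape are our assumptions for illustration, not Pyramid's code.

```python
# Hypothetical sketch: map a sub-array access to the chunk coordinates it
# intersects, for an N-D array split into equal-sized chunks.
from itertools import product

def chunks_for_access(offsets, sizes, chunk_shape):
    """Chunk coordinates intersected by the region [offsets, offsets + sizes)."""
    lo = [o // c for o, c in zip(offsets, chunk_shape)]
    hi = [(o + s - 1) // c for o, s, c in zip(offsets, sizes, chunk_shape)]
    return list(product(*(range(l, h + 1) for l, h in zip(lo, hi))))

# With the 1024x1024-byte chunks used later in the evaluation, a 2048x1024
# region starting at (0, 1024) touches exactly two chunks:
print(chunks_for_access((0, 1024), (2048, 1024), (1024, 1024)))
# -> [(0, 1), (1, 1)]
```

Because neighbors along every dimension land in the same chunk, a local sub-array access touches only a few chunks, which is the point of multi-dimensional-aware chunking.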
8. Lock-free, distributed chunk indexing
• Indexing multi-dimensional information
- R-tree, KD-tree, quad-tree, etc.
- Designed and optimized for centralized management
• A centralized metadata management scheme may not scale
- Bottleneck under high concurrency
• Our approach:
- Port quad-tree-like structures to a distributed environment
- Use a shadowing technique on the quad-tree to enable lock-free concurrent updates
9. Array versioning
• Scientific applications need array versioning [VLDB 2009]
- Checkpointing
- Cloning
- Provenance
• Keep data and metadata immutable
- Updating a chunk is handled at the metadata level using a shadowing technique
10. Active storage support
• Move computation to the storage elements
- Conserves bandwidth
- Better workload parallelization
• Allow users to send user-defined handlers to the storage servers (see the sketch below)
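A hedged, runnable sketch of the active-storage idea: the client ships a handler to the servers, each server applies it to its locally stored chunks, and only the small results cross the network. All class and function names here are stand-ins invented for illustration, not Pyramid's actual interfaces.

```python
# Toy model of active storage: computation moves to where the chunks live.
from concurrent.futures import ThreadPoolExecutor

class StorageServer:
    def __init__(self, chunks):
        self.chunks = chunks                      # chunk id -> bytes

    def run_handler(self, handler, chunk_ids):
        # The handler runs next to the data it needs.
        return {c: handler(self.chunks[c]) for c in chunk_ids}

def send_computation(placement, handler):
    """placement: StorageServer -> list of chunk ids it holds."""
    with ThreadPoolExecutor() as pool:            # one request per server
        futures = [pool.submit(s.run_handler, handler, ids)
                   for s, ids in placement.items()]
    results = {}
    for f in futures:
        results.update(f.result())
    return results

s1 = StorageServer({(0, 0): b"\x01\x05", (0, 1): b"\x02\x02"})
s2 = StorageServer({(1, 0): b"\x09\x00"})
# Per-chunk maxima, computed where each chunk is stored:
print(send_computation({s1: [(0, 0), (0, 1)], s2: [(1, 0)]}, handler=max))
# -> {(0, 0): 5, (0, 1): 2, (1, 0): 9}
```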
11. Versioning array-oriented access interface
• Basic primitives
- id = CREATE(n, sizes[], defval)
- READ(id, v, offsets[], sizes[], buffer)
- w = WRITE(id, offsets[], sizes[], buffer)
- w = SEND_COMPUTATION(id, v, offsets[], sizes[], f)
• Other primitives, such as cloning and filtering, can mostly be implemented on top of these basic primitives (a toy illustration follows)
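To make the versioning semantics of these primitives concrete, here is a minimal in-memory mock, assuming numpy and naive full-copy snapshots (where Pyramid instead shares unchanged chunks through its metadata trees); it follows the signatures listed above but is ours, not Pyramid's implementation.

```python
import numpy as np

class MockPyramid:
    """Toy, single-process stand-in: every write produces a new
    immutable snapshot, addressed by its version number."""

    def __init__(self):
        self.arrays = {}                      # id -> list of snapshots

    def CREATE(self, n, sizes, defval):
        assert len(sizes) == n
        aid = len(self.arrays)
        self.arrays[aid] = [np.full(sizes, defval)]
        return aid

    def READ(self, aid, v, offsets, sizes, buffer):
        region = tuple(slice(o, o + s) for o, s in zip(offsets, sizes))
        buffer[...] = self.arrays[aid][v][region]

    def WRITE(self, aid, offsets, sizes, buffer):
        region = tuple(slice(o, o + s) for o, s in zip(offsets, sizes))
        snap = self.arrays[aid][-1].copy()    # naive copy-on-write
        snap[region] = buffer
        self.arrays[aid].append(snap)
        return len(self.arrays[aid]) - 1      # version produced by the write

    def SEND_COMPUTATION(self, aid, v, offsets, sizes, f):
        region = tuple(slice(o, o + s) for o, s in zip(offsets, sizes))
        return f(self.arrays[aid][v][region])  # runs server-side in Pyramid

p = MockPyramid()
aid = p.CREATE(n=2, sizes=[8, 8], defval=0)
w = p.WRITE(aid, offsets=[2, 2], sizes=[2, 2], buffer=np.ones((2, 2)))
out = np.empty((2, 2))
p.READ(aid, v=w, offsets=[2, 2], sizes=[2, 2], buffer=out)   # out: all ones
print(p.SEND_COMPUTATION(aid, v=w, offsets=[0, 0], sizes=[8, 8], f=np.sum))  # 4
```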
13. Architecture
• Pyramid is inspired by our previous work, BlobSeer [JPDC 2011]
• Version manager
- Ensures concurrency control
• Metadata managers
- Store index tree nodes
• Storage manager
- Monitors the storage servers
- Ensures a load-balancing strategy for chunks among storage servers
• Active storage servers
- Store chunks and run handlers on chunks
• Clients
- Perform I/O accesses
14. Read
[Sequence diagram: client, storage servers, metadata managers, version manager]
• I: optionally ask the version manager for the latest published version
• II: fetch the corresponding metadata from the metadata managers
• III: contact the storage servers in parallel and fetch the chunks into the local buffer (a toy walk-through follows)
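A runnable toy walk-through of these three steps; every class here is a stand-in invented for illustration, not Pyramid's actual interface.

```python
# Toy mock of the read path: I) resolve the version, II) look up chunk
# locations in the metadata, III) fetch the chunks in parallel.
from concurrent.futures import ThreadPoolExecutor

class VersionManager:
    def __init__(self): self.latest = 0
    def latest_published(self): return self.latest

class MetadataManagers:
    def __init__(self, index): self.index = index  # version -> {chunk: server}
    def lookup(self, v, chunk_ids):
        return {c: self.index[v][c] for c in chunk_ids}

class StorageServer:
    def __init__(self, chunks): self.chunks = chunks
    def get(self, chunk_id): return self.chunks[chunk_id]

def read(vm, mm, chunk_ids, v=None):
    if v is None:                                  # step I (optional)
        v = vm.latest_published()
    where = mm.lookup(v, chunk_ids)                # step II
    with ThreadPoolExecutor() as pool:             # step III, in parallel
        data = list(pool.map(lambda c: where[c].get(c), chunk_ids))
    return dict(zip(chunk_ids, data))

s = StorageServer({(0, 0): b"a", (0, 1): b"b"})
vm = VersionManager()
mm = MetadataManagers({0: {(0, 0): s, (0, 1): s}})
print(read(vm, mm, [(0, 0), (0, 1)]))   # {(0, 0): b'a', (0, 1): b'b'}
```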
15. Write
[Sequence diagram: client, storage servers, metadata managers, version manager, storage manager]
• I: get from the storage manager a list of storage servers able to store the chunks, one for each chunk
• II: contact the storage servers in parallel and write the chunks to the corresponding providers
• III: get a version number for the update
• IV: add new metadata to consolidate the new version
• V: report that the new version is ready for publication (mocked below)
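The same kind of toy mock for the five write steps (all component names invented). It also encodes the ordering rule used on the later slides: versions are assigned as updates complete, but published strictly in assignment order.

```python
# Toy mock of the write path; component interfaces are assumptions.
class StorageServer:
    def __init__(self): self.chunks = {}
    def put(self, c, data): self.chunks[c] = data

class StorageManager:
    def __init__(self, servers): self.servers = servers
    def allocate(self, chunk_ids):            # naive round-robin placement
        return {c: self.servers[i % len(self.servers)]
                for i, c in enumerate(chunk_ids)}

class VersionManager:
    def __init__(self):
        self.next_v, self.published, self.ready = 1, 0, set()
    def assign_version(self):
        v, self.next_v = self.next_v, self.next_v + 1
        return v
    def report_ready(self, v):                # publish strictly in order
        self.ready.add(v)
        while self.published + 1 in self.ready:
            self.published += 1

class MetadataManagers:
    def __init__(self): self.index = {}
    def consolidate(self, v, placement):      # new metadata for version v
        self.index[v] = dict(placement)

def write(chunks, sm, vm, mm):
    placement = sm.allocate(list(chunks))     # I:  one provider per chunk
    for c, data in chunks.items():            # II: write chunks (in parallel
        placement[c].put(c, data)             #     in the real system)
    v = vm.assign_version()                   # III: version for this update
    mm.consolidate(v, placement)              # IV: consolidate new version
    vm.report_ready(v)                        # V:  ready for publication
    return v
```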
16. Lock-free, distributed chunk indexing
• Organized as a quad-tree to index 2D arrays
• Each tree node has at most 4 children, each covering one of the four quadrants
• The root covers the whole array
• Each leaf corresponds to a chunk and holds information about its location
• Tree nodes are immutable and uniquely identified by the version number and the sub-domain they cover
• A DHT distributes the tree nodes over the metadata managers (sketched below)
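A sketch of how such immutable, self-identifying tree nodes could live in a DHT (modeled here as a plain dict); the key scheme is the essential point: a node is addressable purely from the version and the sub-domain it covers, so no pointers or locks are needed to find it. Coordinates are in chunk units, and the layout details are our assumptions.

```python
# Sketch: quad-tree nodes in a DHT, keyed by (version, x, y, size).
import hashlib

dht = {}   # stands in for the metadata managers

def node_key(version, x, y, size):
    return hashlib.sha1(f"{version}:{x}:{y}:{size}".encode()).hexdigest()

def put_leaf(version, x, y, location):
    # A leaf covers exactly one chunk and records where it is stored.
    dht[node_key(version, x, y, 1)] = {"leaf": True, "loc": location}

def put_inner(version, x, y, size, child_versions):
    # The four children cover the quadrants; a child may carry an older
    # version, which is how unchanged subtrees are shared across snapshots.
    half = size // 2
    quads = [(x, y), (x + half, y), (x, y + half), (x + half, y + half)]
    dht[node_key(version, x, y, size)] = {
        "leaf": False,
        "children": [node_key(cv, qx, qy, half)
                     for cv, (qx, qy) in zip(child_versions, quads)]}
```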
17. Tree shadowing on update
• Write the newly created chunks to the storage servers
• Build the quad-tree associated with the new snapshot in a bottom-up fashion (sketched below)
- Write the leaves to the DHT
- Inner nodes may point to nodes of previous snapshots (which would imply synchronizing the quad-tree generation)
- Avoid synchronization by feeding writers additional information about the other concurrent updaters (thanks to the computable IDs of the tree nodes)
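Reusing put_inner and node_key from the previous sketch, a shadowed update could look like the recursion below: only quadrants containing modified chunks get new nodes, everything else keeps pointing into earlier snapshots. This is our reading of the technique, not Pyramid's code.

```python
# Sketch of tree shadowing (chunk coordinates, size a power of two).
def shadow_update(prev_v, new_v, x, y, size, dirty):
    """dirty(x, y, size) -> True if the square contains updated chunks.
    Returns the version whose node now covers this square."""
    if not dirty(x, y, size):
        return prev_v                  # share the old snapshot's subtree
    if size == 1:
        return new_v                   # new leaf, already written via put_leaf()
    half = size // 2
    quads = [(x, y), (x + half, y), (x, y + half), (x + half, y + half)]
    child_versions = [shadow_update(prev_v, new_v, qx, qy, half, dirty)
                      for qx, qy in quads]
    put_inner(new_v, x, y, size, child_versions)   # one new inner node
    return new_v
```

Because node keys are computable from the version and the sub-domain, a writer that knows about a concurrent updater can reference that updater's nodes before they are even written, which is what removes the need for synchronization.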
18. Efficient parallel updating
[Sequence diagram: clients #1 and #2, storage servers, metadata managers, version manager; the two versions are published in order]
• Chunks are written concurrently
• Versions are assigned in the order the clients finish writing
• Clients get additional information about the other concurrent writers
• Tree nodes are written in a lock-free manner
• Versions are published in the order they were assigned (illustrated below)
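Using the VersionManager from the write-path mock above, the ordering rules play out as follows:

```python
vm = VersionManager()
v1, v2 = vm.assign_version(), vm.assign_version()  # two concurrent writers
vm.report_ready(v2)     # writer #2 finishes its metadata first...
print(vm.published)     # -> 0: v2 cannot be published before v1
vm.report_ready(v1)
print(vm.published)     # -> 2: v1 and then v2 are published in order
```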
19. Some more I/O primitives
• Easily implemented thanks to immutable data and metadata blocks
• Cheap I/O operators
• Cloning a sub-domain (see the sketch below)
- Follow the metadata tree of a specific snapshot
- Create a new metadata tree and publish it as a newly created array
• Filtering and compression can be done locally, in parallel, at the active storage servers by introducing user-defined handlers
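Because nothing is mutated in place, cloning reduces to metadata work; a sketch of the idea, with the array catalog invented for illustration:

```python
# Sketch: cloning shares the existing metadata tree instead of copying chunks.
# `catalog` maps an array id to its per-version root node keys in the DHT.
catalog = {}

def clone(src_id, src_version, new_id):
    # The clone's initial version points at the source snapshot's root;
    # no chunk and no tree node is copied. Later writes to the clone
    # shadow their own paths, leaving the original untouched.
    catalog[new_id] = {0: catalog[src_id][src_version]}
```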
21. Experimental setup
Simulate a common access pattern exhibited by scientific applications: array dicing
• Using at most 130 nodes of the Graphene cluster on the Grid'5000 (G5K) testbed
- 1 Gbps Ethernet interconnect
- 49 nodes used to deploy Pyramid and the competitor system, PVFS
• Array dicing
- Each client accesses a dedicated sub-array
- 1 GB per client, consisting of 32x32 chunks (1024x1024 bytes per chunk)
- Concurrent reading/writing
• Measure performance and scalability
22. Aggregated throughput achieved under concurrency
• PVFS suffers from the non-contiguous access pattern, due to serialization into a flat file
• Pyramid
- Throughput increases steadily
- Promising scalability of both the data and the metadata organization
24. Conclusion
• Pyramid is an array-oriented active storage system
• We proposed a system offering support for
- Parallel array processing, for both read and write workloads
- Data versioning
- Distributed metadata management, with shadowing to reflect updates
• Preliminary evaluation shows promising scalability
• Future work
- Integrate with HDF5
- Pyramid as a storage engine for SciDB?
- Investigate keeping data at quad-tree inner nodes
- Could be used to store the array at different resolutions (map-style applications)
25. Thank you
INRIA – KerData Research Team
www.irisa.fr/kerdata