The recent explosion in the sizes of the data manipulated by distributed scientific applications has prompted the need for specialized storage systems capable of dealing with specific access patterns in a scalable fashion. In this context, a large class of applications focuses on parallel array processing: small parts of huge multi-dimensional arrays are concurrently accessed by a large number of clients, both for reading and for writing. A specialized storage system that deals with such an access pattern faces several challenges at the level of data and metadata management. We introduce Pyramid, an active array-oriented storage system that addresses these challenges and shows promising results in our initial evaluation.
1. Pyramid: A large-scale array-oriented active storage system
Viet-Trung Tran, Bogdan Nicolae, Gabriel Antoniu, Luc Bougé
KerData Team, Inria, Rennes, France
02 09 2011
4. Context: Data-intensive large-scale HPC simulations
• The scalability of data management is becoming a critical issue
• Mismatch between the storage model and the application data model
• Application data model
- Multidimensional typed arrays, images, etc.
• Storage model
- Parallel file systems: a simple, flat I/O model (a sequence of bytes)
- Relational model: ill-suited for scientific data
• Additional layers are needed to map the application model onto the storage model
5. [M. Stonebraker] The one-storage-fits-all approach has reached its limits
• Parallel I/O stack:
- Performance of non-contiguous I/O vs. data atomicity
• Relational data model:
- Simulating arrays on top of tables performs poorly
- Poor scalability for join queries
• Need to specialize the I/O stack to match the applications' requirements
- Array-oriented storage for the array data model
• Example: SciDB with ArrayStore
[Figure: the parallel I/O stack: Application (VisIt, tornado simulation) → Data model (HDF5, NetCDF) → MPI-IO middleware → Parallel file systems]
7. Multi-dimensional-aware chunking
• Split the array into equal-sized chunks, distributed over the storage elements (see the sketch below)
- Simplifies load balancing among storage elements
- Keeps neighboring cells in the same chunk
• Shared-nothing architecture
- Easier to handle data consistency
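To make the chunk mapping concrete, here is a minimal sketch of how a sub-array access could be mapped to the equal-sized chunks it intersects; the function name and the chunk shape are our assumptions for illustration, not Pyramid's code.

```python
# Hypothetical sketch: map a sub-array access to the chunk coordinates it
# intersects, for an N-D array split into equal-sized chunks.
from itertools import product

def chunks_for_access(offsets, sizes, chunk_shape):
    """Chunk coordinates intersected by the region [offsets, offsets + sizes)."""
    lo = [o // c for o, c in zip(offsets, chunk_shape)]
    hi = [(o + s - 1) // c for o, s, c in zip(offsets, sizes, chunk_shape)]
    return list(product(*(range(l, h + 1) for l, h in zip(lo, hi))))

# With the 1024x1024-byte chunks used later in the evaluation, a 2048x1024
# region starting at (0, 1024) touches exactly two chunks:
print(chunks_for_access((0, 1024), (2048, 1024), (1024, 1024)))
# -> [(0, 1), (1, 1)]
```

Because neighbors along every dimension land in the same chunk, a local sub-array access touches only a few chunks, which is the point of multi-dimensional-aware chunking.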
8. Lock-free, distributed chunk indexing
• Indexing multi-dimensional information
- R-tree, KD-tree, quad-tree, etc.
- Designed and optimized for centralized management
• A centralized metadata management scheme may not scale
- Bottleneck under high concurrency
• Our approach:
- Port quad-tree-like structures to a distributed environment
- Use a shadowing technique on the quad-tree to enable lock-free concurrent updates
9. Array versioning
• Scientific applications need array versioning [VLDB 2009]
- Checkpointing
- Cloning
- Provenance
• Keep data and metadata immutable
- Updating a chunk is handled at the metadata level using a shadowing technique
10. Active storage support
• Move computation to the storage elements
- Conserves bandwidth
- Better workload parallelization
• Allow users to send user-defined handlers to the storage servers (see the sketch below)
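A hedged, runnable sketch of the active-storage idea: the client ships a handler to the servers, each server applies it to its locally stored chunks, and only the small results cross the network. All class and function names here are stand-ins invented for illustration, not Pyramid's actual interfaces.

```python
# Toy model of active storage: computation moves to where the chunks live.
from concurrent.futures import ThreadPoolExecutor

class StorageServer:
    def __init__(self, chunks):
        self.chunks = chunks                      # chunk id -> bytes

    def run_handler(self, handler, chunk_ids):
        # The handler runs next to the data it needs.
        return {c: handler(self.chunks[c]) for c in chunk_ids}

def send_computation(placement, handler):
    """placement: StorageServer -> list of chunk ids it holds."""
    with ThreadPoolExecutor() as pool:            # one request per server
        futures = [pool.submit(s.run_handler, handler, ids)
                   for s, ids in placement.items()]
    results = {}
    for f in futures:
        results.update(f.result())
    return results

s1 = StorageServer({(0, 0): b"\x01\x05", (0, 1): b"\x02\x02"})
s2 = StorageServer({(1, 0): b"\x09\x00"})
# Per-chunk maxima, computed where each chunk is stored:
print(send_computation({s1: [(0, 0), (0, 1)], s2: [(1, 0)]}, handler=max))
# -> {(0, 0): 5, (0, 1): 2, (1, 0): 9}
```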
11. Versioning array-oriented access interface
• Basic primitives
- id = CREATE(n, sizes[], defval)
- READ(id, v, offsets[], sizes[], buffer)
- w = WRITE(id, offsets[], sizes[], buffer)
- w = SEND_COMPUTATION(id, v, offsets[], sizes[], f)
• Other primitives, such as cloning and filtering, can mostly be implemented on top of these basic primitives (a toy illustration follows)
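To make the versioning semantics of these primitives concrete, here is a minimal in-memory mock, assuming numpy and naive full-copy snapshots (where Pyramid instead shares unchanged chunks through its metadata trees); it follows the signatures listed above but is ours, not Pyramid's implementation.

```python
import numpy as np

class MockPyramid:
    """Toy, single-process stand-in: every write produces a new
    immutable snapshot, addressed by its version number."""

    def __init__(self):
        self.arrays = {}                      # id -> list of snapshots

    def CREATE(self, n, sizes, defval):
        assert len(sizes) == n
        aid = len(self.arrays)
        self.arrays[aid] = [np.full(sizes, defval)]
        return aid

    def READ(self, aid, v, offsets, sizes, buffer):
        region = tuple(slice(o, o + s) for o, s in zip(offsets, sizes))
        buffer[...] = self.arrays[aid][v][region]

    def WRITE(self, aid, offsets, sizes, buffer):
        region = tuple(slice(o, o + s) for o, s in zip(offsets, sizes))
        snap = self.arrays[aid][-1].copy()    # naive copy-on-write
        snap[region] = buffer
        self.arrays[aid].append(snap)
        return len(self.arrays[aid]) - 1      # version produced by the write

    def SEND_COMPUTATION(self, aid, v, offsets, sizes, f):
        region = tuple(slice(o, o + s) for o, s in zip(offsets, sizes))
        return f(self.arrays[aid][v][region])  # runs server-side in Pyramid

p = MockPyramid()
aid = p.CREATE(n=2, sizes=[8, 8], defval=0)
w = p.WRITE(aid, offsets=[2, 2], sizes=[2, 2], buffer=np.ones((2, 2)))
out = np.empty((2, 2))
p.READ(aid, v=w, offsets=[2, 2], sizes=[2, 2], buffer=out)   # out: all ones
print(p.SEND_COMPUTATION(aid, v=w, offsets=[0, 0], sizes=[8, 8], f=np.sum))  # 4
```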
13. Architecture
• Pyramid is inspired by our previous work, BlobSeer [JPDC 2011]
• Version manager
- Ensures concurrency control
• Metadata managers
- Store index tree nodes
• Storage manager
- Monitors the storage servers
- Ensures a load-balancing strategy for chunks among storage servers
• Active storage servers
- Store chunks and run handlers on chunks
• Clients
- Perform I/O accesses
14. Read
[Sequence diagram: client, storage servers, metadata managers, version manager]
• I: optionally ask the version manager for the latest published version
• II: fetch the corresponding metadata from the metadata managers
• III: contact the storage servers in parallel and fetch the chunks into the local buffer (a toy walk-through follows)
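A runnable toy walk-through of these three steps; every class here is a stand-in invented for illustration, not Pyramid's actual interface.

```python
# Toy mock of the read path: I) resolve the version, II) look up chunk
# locations in the metadata, III) fetch the chunks in parallel.
from concurrent.futures import ThreadPoolExecutor

class VersionManager:
    def __init__(self): self.latest = 0
    def latest_published(self): return self.latest

class MetadataManagers:
    def __init__(self, index): self.index = index  # version -> {chunk: server}
    def lookup(self, v, chunk_ids):
        return {c: self.index[v][c] for c in chunk_ids}

class StorageServer:
    def __init__(self, chunks): self.chunks = chunks
    def get(self, chunk_id): return self.chunks[chunk_id]

def read(vm, mm, chunk_ids, v=None):
    if v is None:                                  # step I (optional)
        v = vm.latest_published()
    where = mm.lookup(v, chunk_ids)                # step II
    with ThreadPoolExecutor() as pool:             # step III, in parallel
        data = list(pool.map(lambda c: where[c].get(c), chunk_ids))
    return dict(zip(chunk_ids, data))

s = StorageServer({(0, 0): b"a", (0, 1): b"b"})
vm = VersionManager()
mm = MetadataManagers({0: {(0, 0): s, (0, 1): s}})
print(read(vm, mm, [(0, 0), (0, 1)]))   # {(0, 0): b'a', (0, 1): b'b'}
```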
15. Write
[Sequence diagram: client, storage servers, metadata managers, version manager, storage manager]
• I: get from the storage manager a list of storage servers able to store the chunks, one for each chunk
• II: contact the storage servers in parallel and write the chunks to the corresponding providers
• III: get a version number for the update
• IV: add new metadata to consolidate the new version
• V: report that the new version is ready for publication (mocked below)
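The same kind of toy mock for the five write steps (all component names invented). It also encodes the ordering rule used on the later slides: versions are assigned as updates complete, but published strictly in assignment order.

```python
# Toy mock of the write path; component interfaces are assumptions.
class StorageServer:
    def __init__(self): self.chunks = {}
    def put(self, c, data): self.chunks[c] = data

class StorageManager:
    def __init__(self, servers): self.servers = servers
    def allocate(self, chunk_ids):            # naive round-robin placement
        return {c: self.servers[i % len(self.servers)]
                for i, c in enumerate(chunk_ids)}

class VersionManager:
    def __init__(self):
        self.next_v, self.published, self.ready = 1, 0, set()
    def assign_version(self):
        v, self.next_v = self.next_v, self.next_v + 1
        return v
    def report_ready(self, v):                # publish strictly in order
        self.ready.add(v)
        while self.published + 1 in self.ready:
            self.published += 1

class MetadataManagers:
    def __init__(self): self.index = {}
    def consolidate(self, v, placement):      # new metadata for version v
        self.index[v] = dict(placement)

def write(chunks, sm, vm, mm):
    placement = sm.allocate(list(chunks))     # I:  one provider per chunk
    for c, data in chunks.items():            # II: write chunks (in parallel
        placement[c].put(c, data)             #     in the real system)
    v = vm.assign_version()                   # III: version for this update
    mm.consolidate(v, placement)              # IV: consolidate new version
    vm.report_ready(v)                        # V:  ready for publication
    return v
```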
16. Lock-free, distributed chunk indexing
• Organized as a quad-tree to index 2D arrays
• Each tree node has at most 4 children, each covering one of the four quadrants
• The root covers the whole array
• Each leaf corresponds to a chunk and holds information about its location
• Tree nodes are immutable and uniquely identified by the version number and the sub-domain they cover
• A DHT distributes the tree nodes over the metadata managers (sketched below)
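A sketch of how such immutable, self-identifying tree nodes could live in a DHT (modeled here as a plain dict); the key scheme is the essential point: a node is addressable purely from the version and the sub-domain it covers, so no pointers or locks are needed to find it. Coordinates are in chunk units, and the layout details are our assumptions.

```python
# Sketch: quad-tree nodes in a DHT, keyed by (version, x, y, size).
import hashlib

dht = {}   # stands in for the metadata managers

def node_key(version, x, y, size):
    return hashlib.sha1(f"{version}:{x}:{y}:{size}".encode()).hexdigest()

def put_leaf(version, x, y, location):
    # A leaf covers exactly one chunk and records where it is stored.
    dht[node_key(version, x, y, 1)] = {"leaf": True, "loc": location}

def put_inner(version, x, y, size, child_versions):
    # The four children cover the quadrants; a child may carry an older
    # version, which is how unchanged subtrees are shared across snapshots.
    half = size // 2
    quads = [(x, y), (x + half, y), (x, y + half), (x + half, y + half)]
    dht[node_key(version, x, y, size)] = {
        "leaf": False,
        "children": [node_key(cv, qx, qy, half)
                     for cv, (qx, qy) in zip(child_versions, quads)]}
```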
17. Tree shadowing on update
• Write the newly created chunks to the storage servers
• Build the quad-tree associated with the new snapshot in a bottom-up fashion (sketched below)
- Write the leaves to the DHT
- Inner nodes may point to nodes of previous snapshots (which would imply synchronizing the quad-tree generation)
- Avoid synchronization by feeding writers additional information about the other concurrent updaters (thanks to the computable IDs of the tree nodes)
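Reusing put_inner and node_key from the previous sketch, a shadowed update could look like the recursion below: only quadrants containing modified chunks get new nodes, everything else keeps pointing into earlier snapshots. This is our reading of the technique, not Pyramid's code.

```python
# Sketch of tree shadowing (chunk coordinates, size a power of two).
def shadow_update(prev_v, new_v, x, y, size, dirty):
    """dirty(x, y, size) -> True if the square contains updated chunks.
    Returns the version whose node now covers this square."""
    if not dirty(x, y, size):
        return prev_v                  # share the old snapshot's subtree
    if size == 1:
        return new_v                   # new leaf, already written via put_leaf()
    half = size // 2
    quads = [(x, y), (x + half, y), (x, y + half), (x + half, y + half)]
    child_versions = [shadow_update(prev_v, new_v, qx, qy, half, dirty)
                      for qx, qy in quads]
    put_inner(new_v, x, y, size, child_versions)   # one new inner node
    return new_v
```

Because node keys are computable from the version and the sub-domain, a writer that knows about a concurrent updater can reference that updater's nodes before they are even written, which is what removes the need for synchronization.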
18. Efficient parallel updating
[Sequence diagram: clients #1 and #2, storage servers, metadata managers, version manager; the two versions are published in order]
• Chunks are written concurrently
• Versions are assigned in the order the clients finish writing
• Clients get additional information about the other concurrent writers
• Tree nodes are written in a lock-free manner
• Versions are published in the order they were assigned (illustrated below)
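Using the VersionManager from the write-path mock above, the ordering rules play out as follows:

```python
vm = VersionManager()
v1, v2 = vm.assign_version(), vm.assign_version()  # two concurrent writers
vm.report_ready(v2)     # writer #2 finishes its metadata first...
print(vm.published)     # -> 0: v2 cannot be published before v1
vm.report_ready(v1)
print(vm.published)     # -> 2: v1 and then v2 are published in order
```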
19. Some more I/O primitives
• Easily implemented thanks to immutable data and metadata blocks
• Cheap I/O operators
• Cloning a sub-domain (see the sketch below)
- Follow the metadata tree of a specific snapshot
- Create a new metadata tree and publish it as a newly created array
• Filtering and compression can be done locally, in parallel, at the active storage servers by introducing user-defined handlers
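Because nothing is mutated in place, cloning reduces to metadata work; a sketch of the idea, with the array catalog invented for illustration:

```python
# Sketch: cloning shares the existing metadata tree instead of copying chunks.
# `catalog` maps an array id to its per-version root node keys in the DHT.
catalog = {}

def clone(src_id, src_version, new_id):
    # The clone's initial version points at the source snapshot's root;
    # no chunk and no tree node is copied. Later writes to the clone
    # shadow their own paths, leaving the original untouched.
    catalog[new_id] = {0: catalog[src_id][src_version]}
```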
21. Experimental setup
Simulate a common access pattern exhibited by scientific applications: array dicing
• Using at most 130 nodes of the Graphene cluster on the Grid'5000 (G5K) testbed
- 1 Gbps Ethernet interconnect
- 49 nodes used to deploy Pyramid and the competitor system, PVFS
• Array dicing
- Each client accesses a dedicated sub-array
- 1 GB per client, consisting of 32x32 chunks (1024x1024 bytes per chunk)
- Concurrent reading/writing
• Measure performance and scalability
22. Aggregated throughput achieved under concurrency
• PVFS suffers from the non-contiguous access pattern, due to serialization into a flat file
• Pyramid
- Throughput increases steadily
- Promising scalability of both the data and the metadata organization
24. Conclusion
• Pyramid is an array-oriented active storage system
• We proposed a system offering support for
- Parallel array processing, for both read and write workloads
- Data versioning
- Distributed metadata management, with shadowing to reflect updates
• Preliminary evaluation shows promising scalability
• Future work
- Integrate with HDF5
- Pyramid as a storage engine for SciDB?
- Investigate keeping data at quad-tree inner nodes
- Could be used to store the array at different resolutions (map-style applications)
25. Thank you
INRIA – KerData Research Team
www.irisa.fr/kerdata