
The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski

TileDB is an open-source storage manager for multi-dimensional sparse and dense array data. It has a novel architecture that addresses some of the pain points in storing array data on “big-data” and “cloud” storage architectures. This talk will highlight TileDB’s design and its ability to integrate with analysis environments relevant to the PyData community such as Python, R, Julia, etc.

Published in: Technology

  1. Data storage made fast and easy
  2. The Problem
     • We focus on persistent storage of massive data
     • Plethora of complex formats across many applications
       - Genomics (FastQ, BAM, VCF, CRAM, etc.), LiDAR (LAS, LAZ), Databases (proprietary formats, Parquet), …
     • Every format is associated with a library responsible for
       - Backend support (POSIX, HDFS, AWS S3, …), parallel IO, compression, other filters, …
     • Downstream computations (e.g., linear algebra) typically work on vectors and arrays
     • Two common problems:
       - redundant software engineering for high performance (parallel IO, compression, etc.)
       - expensive conversion to arrays for downstream computations
  3. What is Array Data? Goals: 1) Slicing, 2) Compression
  4. Applications: Genomics, Time Series, Tabular, LiDAR, Imaging. Source: NYU’s Center for Urban Science and Progress
  5. Storage Module vs. DBMS
     • Storage Module: IO, compression, access / slicing, APIs to higher-level modules, other filters (e.g., encryption)
     • DBMS: query language, query parser, query optimizer, query executor, on top of a storage module
     • A storage module can be integrated with other data science tools as well, without ODBC/JDBC
  6. What is TileDB? TileDB is a storage module for a novel multi-dimensional array data format. [Figure: Architecture]
  7. TileDB History
     • Team: Stavros, Jake, Tyler, Seth
     • 2015: TileDB research project kicks off at
     • 2016: VLDB paper on TileDB
     • 2017: TileDB, Inc. is incorporated backed by
     • 2018: We are hiring!
  8. The TileDB Format: Physical Organization
     • First layout: my_array/ contains __array_schema.tdb, __lock.tdb, and a fragment directory __<uuid>_<timestamp>/ holding a1.tdb, a2.tdb, a2_var.tdb, __fragment_metadata.tdb
     • Second layout: the same, with __coords.tdb added to the fragment directory
  9. The TileDB Format: Updates
     • LSM-tree-like updates and consolidation
     • my_array/ contains __array_schema.tdb, __lock.tdb, and one timestamped fragment directory per update:
       - __<uuid>_<timestamp1>/: a1.tdb, a2.tdb, a2_var.tdb, __fragment_metadata.tdb
       - __<uuid>_<timestamp2>/: a1.tdb, a2.tdb, a2_var.tdb, __fragment_metadata.tdb
       - __<uuid>_<timestamp3>/: a1.tdb, a2.tdb, a2_var.tdb, __fragment_metadata.tdb, __coords.tdb
  10. The TileDB Format: Filters
      • Binary data across an attribute is divided into tiles; the tile is the atomic unit of IO
      • Each tile is divided into chunks; the chunk is the atomic unit of filtering, and each chunk fits in L1 cache
      • Every chunk passes through the same filter sequence (Filter 1, Filter 2, …)
      • Filters: compression (gzip, zstd, …), byte/bit shuffle, encryption, delta encoding, bit-width reduction
  11. The TileDB Format: Cloud
      • TileDB works great on AWS S3
        - Just use s3://bucket-name/path/to/array instead of my_array
        - No concept of directories, natural use of / in the URI
        - aws s3 sync just works
        - LSM-tree-based updates are an excellent fit for such an object store
      • Adding Azure, Google Cloud and Alibaba Cloud soon
  12. TileDB Parallelism
      • Fully multi-threaded via Intel TBB
      • TileDB does not rely on an external engine for parallelism (e.g., Dask)
      • Thread-/process-safety, no need for locking, multiple reader/writer model
      • Parallel IO (good use of S3 multipart upload and byte range requests)
      • Parallel filters
      • Parallel sorting
      • Parallel slicing
  13. APIs and Integration
      • Lightweight interfaces between the TileDB C library and the high-level (HL) APIs
      • Zero-copying wherever possible
      • Predicate push-down
      • Effective partitioning (especially for sparse arrays)
  14. [Comparison table; column layout not recoverable from the transcript]
      • Rows: ND arrays, Sparse arrays, Compression/Filters, Parallel IO, Parallelism, S3 support, Updates, APIs
      • Named column: Zarr
      • Cell values, in extraction order: LSM-tree-like, chunk-based, chunk-based, file-based, SWMR, pushed to app, pushed to app, multiple, multiple, only Python, multiple, pushed to app, Blosc / pushed to app, pushed to app, open-source, closed-source, open-source, pushed to app
  15. • In-memory columnar format
      • DataFrames, limited ND array support
      • Designed for fast in-memory operations
      • Rich datatype support, complex objects
      • Persistence through virtual memory mapping or delegated to external on-disk formats
      • TileDB integration with Apache Arrow is on our roadmap!
  16. TileDB Value to
      • Manage dense and sparse data persistence using a single API
      • Get the most from your modern hardware! Concurrent IO, parallel compression, accelerated encryption and more
      • Easily interface with multiple different storage backends (including cloud storage) and get performance with little to no code changes
      • Common format that can be leveraged by “big data” / SQL platforms and Python, R, Julia, … ecosystems
  17. Thank You
      • We are hiring! tiledb.workable.com, careers@tiledb.io
      • https://github.com/TileDB-Inc
      • pip install tiledb
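
The chunk-based filtering described on slide 10 can be illustrated with a minimal sketch. This is not TileDB code and does not use the TileDB API: it uses only the Python standard library, and every name in it (`CHUNK_VALUES`, `filter_chunk`, `write_tile`, and so on) is hypothetical. It assumes a tile of 64-bit integers, a fixed number of values per chunk, and a two-stage pipeline of delta encoding followed by zlib compression, chosen only because both filter types appear on the slide.

```python
import struct
import zlib

# Hypothetical chunk size, in values. The slides only say a chunk fits in L1 cache.
CHUNK_VALUES = 4


def delta_encode(values):
    """Filter 1: keep the first value, then store differences between neighbours."""
    if not values:
        return []
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original values."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def filter_chunk(values):
    """Run one chunk through the pipeline: delta encoding, then zlib compression."""
    packed = struct.pack(f"<{len(values)}q", *delta_encode(values))
    return zlib.compress(packed)


def unfilter_chunk(blob):
    """Undo the pipeline in reverse order: decompress, then delta-decode."""
    packed = zlib.decompress(blob)
    deltas = struct.unpack(f"<{len(packed) // 8}q", packed)
    return delta_decode(deltas)


def write_tile(values, chunk_values=CHUNK_VALUES):
    """Split a tile into chunks and filter each chunk independently."""
    return [
        filter_chunk(values[i:i + chunk_values])
        for i in range(0, len(values), chunk_values)
    ]


def read_tile(chunks):
    """Reverse the filters chunk by chunk and reassemble the tile."""
    out = []
    for blob in chunks:
        out.extend(unfilter_chunk(blob))
    return out


tile = [100, 101, 103, 106, 110, 115, 121, 128, 136, 145]
stored = write_tile(tile)
assert read_tile(stored) == tile
```

Because each chunk is filtered on its own, chunks could in principle be handed to separate threads, which is one plausible reading of "parallel filters" on slide 12.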