SlideShare a Scribd company logo
1 of 17
Download to read offline
Data storage made 

fast and easy
The Problem
• We focus on persistent storage of massive data
• Plethora of complex formats across many applications

- Genomics (FastQ, BAM, VCF, CRAM, etc.), LiDAR (LAS, LAZ), Databases (proprietary formats, Parquet), …

• Every format is associated with a library responsible for

- Backend support (POSIX, HDFS, AWS S3, …), parallel IO, compression, other filters, …

• Downstream computations (e.g., Linear Algebra) typically work on vectors and arrays

• Two common problems:
- redundant software engineering for high performance (parallel IO, compression, etc.)
- expensive conversion to arrays for downstream computations
What is Array Data?
1) Slicing
2) Compression
Goals
Applications
Genomics Time Series Tabular
Source: NYU’s Center for Urban Science and Progress
LiDAR Imaging
Storage Module vs. DBMS
Storage Module
DBMS
Storage Module
IO
Compression
Access / Slicing
APIs to higher level modules
Other filters (e.g., encryption)
DBMS
Query language
Query optimizer
Query executor
Query parser
A storage module
can be integrated with other
data science tools as well,
without an ODBC/JDBC
What is TileDB?
Architecture
TileDB is a storage module for a novel
multi-dimensional array data format
TileDB History
Stavros Jake Tyler Seth
2016 VLDB paper on TileDB
2018 - We are hiring!
2017 TileDB, Inc. is incorporated backed by
2015 TileDB research project kicks off at
The TileDB Format

Physical Organization
a1.tdb
a2.tdb
a2_var.tdb
__fragment_metadata.tdb
__<uuid>_<timestamp>
__array_schema.tdb
__lock.tdb
my_array
a1.tdb
a2.tdb
a2_var.tdb
__fragment_metadata.tdb
__<uuid>_<timestamp>
__array_schema.tdb
__lock.tdb
__coords.tdb
my_array
The TileDB Format

Updates
a1.tdb
a2.tdb
a2_var.tdb
__fragment_metadata.tdb
__<uuid>_<timestmap2>
a1.tdb
a2.tdb
a2_var.tdb
__fragment_metadata.tdb
__coords.tdb
__<uuid>_<timestmap3>
LSM-tree-like updates
and consolidation
a1.tdb
a2.tdb
a2_var.tdb
__fragment_metadata.tdb
__<uuid>_<timestamp1>
__array_schema.tdb
__lock.tdb
my_array
The TileDB Format

Filters
Binary data across an attribute
Chunk Chunk Chunk Chunk
Each chunk fits in L1 cache
Atomic unit of filtering
Tile
Atomic unit of IO
Filters
Compression (gzip, zstd, …)
Byte/Bit Shuffle
Encryption
Delta encoding
Bit-width reduction
Filter 1
Filter 2
Filter 1
Filter 2
Filter 1
Filter 2
Filter 1
Filter 2
The TileDB Format

Cloud
• TileDB works great on AWS S3

- Just use s3://bucket-name/path/to/array instead of my_array

- No concept of directories, natural use of / in the URI

- aws s3 sync just works

- LSM-tree-based updates excellent fit for such an object store

• Adding Azure, Google Cloud and Alibaba Cloud soon
TileDB Parallelism
• Fully multi-threaded via Intel TBB

• TileDB does not rely on an external engine for parallelism (e.g., Dask)

• Thread-/Process-safety, no need for locking, multiple reader/writer model

• Parallel IO (good use of S3 multipart upload and byte range requests)

• Parallel filters

• Parallel sorting

• Parallel slicing
APIs and Integration
• Lightweight interfaces between the TileDB C library and HL APIs

• Zero-copying wherever possible

• Predicate push-down

• Effective partitioning (especially for sparse arrays)
ND arrays
Sparse arrays
Compression/Filters
Parallel IO
Parallelism
S3 support
Updates
Zarr
APIs
LSM-tree-like chunk-based chunk-based file-based
SWMR pushed to app pushed to app
multiple multiple only Python multiple
pushed to app Blosc / pushed to app pushed to app
open-source closed-source open-source pushed to app
• In-memory columnar format

• DataFrames, limited ND array support

• Designed for fast in-memory operations

• Rich datatype support, complex objects

• Persistence through virtual memory mapping or delegated to external on-disk formats

• TileDB integration with Apache Arrow is on our roadmap!
TileDB Value to
• Manage dense and sparse data persistence using a single API

• Get the most from you modern hardware! Concurrent IO, parallel
compression, accelerated encryption and more

• Easily interface with multiple different storage backends (including
cloud storage) and get performance with little to no code changes

• Common format that can be leveraged by “big data” / SQL
platforms and Python, R, Julia, … ecosystems
Thank You
We are Hiring !
tiledb.workable.com
careers@tiledb.io
https://github.com/TileDB-Inc
pip install tiledb

More Related Content

What's hot

Unix/Linux Basic Commands and Shell Script
Unix/Linux Basic Commands and Shell ScriptUnix/Linux Basic Commands and Shell Script
Unix/Linux Basic Commands and Shell Scriptsbmguys
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?confluent
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaGuozhang Wang
 
Procella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at YoutubeProcella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at YoutubeDataWorks Summit
 
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...SindhuVasireddy1
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
Introduction to TCP/IP
Introduction to TCP/IPIntroduction to TCP/IP
Introduction to TCP/IPMichael Lamont
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka IntroductionAmita Mirajkar
 
runC – Open Container Initiative
runC – Open Container InitiativerunC – Open Container Initiative
runC – Open Container InitiativeJeeva Chelladhurai
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed SystemsRupsee
 
Kafka and Avro with Confluent Schema Registry
Kafka and Avro with Confluent Schema RegistryKafka and Avro with Confluent Schema Registry
Kafka and Avro with Confluent Schema RegistryJean-Paul Azar
 
What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0Databricks
 
Performance Monitoring: Understanding Your Scylla Cluster
Performance Monitoring: Understanding Your Scylla ClusterPerformance Monitoring: Understanding Your Scylla Cluster
Performance Monitoring: Understanding Your Scylla ClusterScyllaDB
 
스타트업을 위한 Confluent 세미나
스타트업을 위한 Confluent 세미나스타트업을 위한 Confluent 세미나
스타트업을 위한 Confluent 세미나confluent
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeDatabricks
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on KubernetesDatabricks
 

What's hot (20)

Unix/Linux Basic Commands and Shell Script
Unix/Linux Basic Commands and Shell ScriptUnix/Linux Basic Commands and Shell Script
Unix/Linux Basic Commands and Shell Script
 
Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?Kafka Streams: What it is, and how to use it?
Kafka Streams: What it is, and how to use it?
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
 
Procella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at YoutubeProcella: A fast versatile SQL query engine powering data at Youtube
Procella: A fast versatile SQL query engine powering data at Youtube
 
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
Designing Data-Intensive Applications_ The Big Ideas Behind Reliable, Scalabl...
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Open ebs 101
Open ebs 101Open ebs 101
Open ebs 101
 
Chapter05
Chapter05Chapter05
Chapter05
 
Envoy and Kafka
Envoy and KafkaEnvoy and Kafka
Envoy and Kafka
 
Introduction to TCP/IP
Introduction to TCP/IPIntroduction to TCP/IP
Introduction to TCP/IP
 
Apache Kafka Introduction
Apache Kafka IntroductionApache Kafka Introduction
Apache Kafka Introduction
 
runC – Open Container Initiative
runC – Open Container InitiativerunC – Open Container Initiative
runC – Open Container Initiative
 
Amazon Redshift
Amazon Redshift Amazon Redshift
Amazon Redshift
 
Distributed Systems
Distributed SystemsDistributed Systems
Distributed Systems
 
Kafka and Avro with Confluent Schema Registry
Kafka and Avro with Confluent Schema RegistryKafka and Avro with Confluent Schema Registry
Kafka and Avro with Confluent Schema Registry
 
What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
 
Performance Monitoring: Understanding Your Scylla Cluster
Performance Monitoring: Understanding Your Scylla ClusterPerformance Monitoring: Understanding Your Scylla Cluster
Performance Monitoring: Understanding Your Scylla Cluster
 
스타트업을 위한 Confluent 세미나
스타트업을 위한 Confluent 세미나스타트업을 위한 Confluent 세미나
스타트업을 위한 Confluent 세미나
 
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta LakeNear Real-Time Data Warehousing with Apache Spark and Delta Lake
Near Real-Time Data Warehousing with Apache Spark and Delta Lake
 
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
Spark Operator—Deploy, Manage and Monitor Spark clusters on Kubernetes
 

Similar to The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski

Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiridatastack
 
Accesso ai dati con Azure Data Platform
Accesso ai dati con Azure Data PlatformAccesso ai dati con Azure Data Platform
Accesso ai dati con Azure Data PlatformLuca Di Fino
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?David P. Moore
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFAmazon Web Services
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...DataWorks Summit
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxRahul Borate
 
Aws for Startups Building Cloud Enabled Apps
Aws for Startups Building Cloud Enabled AppsAws for Startups Building Cloud Enabled Apps
Aws for Startups Building Cloud Enabled AppsAmazon Web Services
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h basehdhappy001
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empowerDurga Gadiraju
 
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Emprovise
 
Azure Cosmos DB - The Swiss Army NoSQL Cloud Database
Azure Cosmos DB - The Swiss Army NoSQL Cloud DatabaseAzure Cosmos DB - The Swiss Army NoSQL Cloud Database
Azure Cosmos DB - The Swiss Army NoSQL Cloud DatabaseBizTalk360
 
Using Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architectureUsing Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architectureOliver Buckley-Salmon
 

Similar to The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski (20)

Big Data Architecture Workshop - Vahid Amiri
Big Data Architecture Workshop -  Vahid AmiriBig Data Architecture Workshop -  Vahid Amiri
Big Data Architecture Workshop - Vahid Amiri
 
Accesso ai dati con Azure Data Platform
Accesso ai dati con Azure Data PlatformAccesso ai dati con Azure Data Platform
Accesso ai dati con Azure Data Platform
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Scaling horizontally on AWS
Scaling horizontally on AWSScaling horizontally on AWS
Scaling horizontally on AWS
 
Using Data Lakes
Using Data LakesUsing Data Lakes
Using Data Lakes
 
Using Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SFUsing Data Lakes: Data Analytics Week SF
Using Data Lakes: Data Analytics Week SF
 
Hadoop
HadoopHadoop
Hadoop
 
Deep Dive in Big Data
Deep Dive in Big DataDeep Dive in Big Data
Deep Dive in Big Data
 
Spark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir VolkSpark Summit EU talk by Shay Nativ and Dvir Volk
Spark Summit EU talk by Shay Nativ and Dvir Volk
 
An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
 
UNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptxUNIT I Introduction to NoSQL.pptx
UNIT I Introduction to NoSQL.pptx
 
Aws for Startups Building Cloud Enabled Apps
Aws for Startups Building Cloud Enabled AppsAws for Startups Building Cloud Enabled Apps
Aws for Startups Building Cloud Enabled Apps
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Using Data Lakes
Using Data Lakes Using Data Lakes
Using Data Lakes
 
Michael stack -the state of apache h base
Michael stack -the state of apache h baseMichael stack -the state of apache h base
Michael stack -the state of apache h base
 
Database Technologies
Database TechnologiesDatabase Technologies
Database Technologies
 
Big Data Introduction - Solix empower
Big Data Introduction - Solix empowerBig Data Introduction - Solix empower
Big Data Introduction - Solix empower
 
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
Highlights of AWS ReInvent 2023 (Announcements and Best Practices)
 
Azure Cosmos DB - The Swiss Army NoSQL Cloud Database
Azure Cosmos DB - The Swiss Army NoSQL Cloud DatabaseAzure Cosmos DB - The Swiss Army NoSQL Cloud Database
Azure Cosmos DB - The Swiss Army NoSQL Cloud Database
 
Using Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architectureUsing Hazelcast in the Kappa architecture
Using Hazelcast in the Kappa architecture
 

More from PyData

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...PyData
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshPyData
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...PyData
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerPyData
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaPyData
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...PyData
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroPyData
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...PyData
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottPyData
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroPyData
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...PyData
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPyData
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...PyData
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydPyData
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverPyData
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldPyData
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...PyData
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardPyData
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...PyData
 
Deprecating the state machine: building conversational AI with the Rasa stack...
Deprecating the state machine: building conversational AI with the Rasa stack...Deprecating the state machine: building conversational AI with the Rasa stack...
Deprecating the state machine: building conversational AI with the Rasa stack...PyData
 

More from PyData (20)

Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
Michal Mucha: Build and Deploy an End-to-end Streaming NLP Insight System | P...
 
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif WalshUnit testing data with marbles - Jane Stewart Adams, Leif Walsh
Unit testing data with marbles - Jane Stewart Adams, Leif Walsh
 
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...Using Embeddings to Understand the Variance and Evolution of Data Science... ...
Using Embeddings to Understand the Variance and Evolution of Data Science... ...
 
Deploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne BauerDeploying Data Science for Distribution of The New York Times - Anne Bauer
Deploying Data Science for Distribution of The New York Times - Anne Bauer
 
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam LermaGraph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
Graph Analytics - From the Whiteboard to Your Toolbox - Sam Lerma
 
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
Do Your Homework! Writing tests for Data Science and Stochastic Code - David ...
 
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo MazzaferroRESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
RESTful Machine Learning with Flask and TensorFlow Serving - Carlo Mazzaferro
 
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
Mining dockless bikeshare and dockless scootershare trip data - Stefanie Brod...
 
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven LottAvoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
Avoiding Bad Database Surprises: Simulation and Scalability - Steven Lott
 
Words in Space - Rebecca Bilbro
Words in Space - Rebecca BilbroWords in Space - Rebecca Bilbro
Words in Space - Rebecca Bilbro
 
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...End-to-End Machine learning pipelines for Python driven organizations - Nick ...
End-to-End Machine learning pipelines for Python driven organizations - Nick ...
 
Pydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica PuertoPydata beautiful soup - Monica Puerto
Pydata beautiful soup - Monica Puerto
 
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
1D Convolutional Neural Networks for Time Series Modeling - Nathan Janos, Jef...
 
Extending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will AydExtending Pandas with Custom Types - Will Ayd
Extending Pandas with Custom Types - Will Ayd
 
Measuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen HooverMeasuring Model Fairness - Stephen Hoover
Measuring Model Fairness - Stephen Hoover
 
What's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper SeaboldWhat's the Science in Data Science? - Skipper Seabold
What's the Science in Data Science? - Skipper Seabold
 
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
Applying Statistical Modeling and Machine Learning to Perform Time-Series For...
 
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-WardSolving very simple substitution ciphers algorithmically - Stephen Enright-Ward
Solving very simple substitution ciphers algorithmically - Stephen Enright-Ward
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 
Deprecating the state machine: building conversational AI with the Rasa stack...
Deprecating the state machine: building conversational AI with the Rasa stack...Deprecating the state machine: building conversational AI with the Rasa stack...
Deprecating the state machine: building conversational AI with the Rasa stack...
 

Recently uploaded

ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Zilliz
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMKumar Satyam
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdfSandro Moreira
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAnitaRaj43
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 

Recently uploaded (20)

ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf[BuildWithAI] Introduction to Gemini.pdf
[BuildWithAI] Introduction to Gemini.pdf
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
AI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by AnitarajAI in Action: Real World Use Cases by Anitaraj
AI in Action: Real World Use Cases by Anitaraj
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 

The TileDB Array Data Storage Manager - Stavros Papadopoulos, Jake Bolewski

  • 1. Data storage made 
 fast and easy
  • 2. The Problem • We focus on persistent storage of massive data • Plethora of complex formats across many applications - Genomics (FastQ, BAM, VCF, CRAM, etc.), LiDAR (LAS, LAZ), Databases (proprietary formats, Parquet), … • Every format is associated with a library responsible for - Backend support (POSIX, HDFS, AWS S3, …), parallel IO, compression, other filters, … • Downstream computations (e.g., Linear Algebra) typically work on vectors and arrays • Two common problems: - redundant software engineering for high performance (parallel IO, compression, etc.) - expensive conversion to arrays for downstream computations
  • 3. What is Array Data? 1) Slicing 2) Compression Goals
  • 4. Applications Genomics Time Series Tabular Source: NYU’s Center for Urban Science and Progress LiDAR Imaging
  • 5. Storage Module vs. DBMS Storage Module DBMS Storage Module IO Compression Access / Slicing APIs to higher level modules Other filters (e.g., encryption) DBMS Query language Query optimizer Query executor Query parser A storage module can be integrated with other data science tools as well, without an ODBC/JDBC
  • 6. What is TileDB? Architecture TileDB is a storage module for a novel multi-dimensional array data format
  • 7. TileDB History Stavros Jake Tyler Seth 2016 VLDB paper on TileDB 2018 - We are hiring! 2017 TileDB, Inc. is incorporated backed by 2015 TileDB research project kicks off at
  • 8. The TileDB Format
 Physical Organization a1.tdb a2.tdb a2_var.tdb __fragment_metadata.tdb __<uuid>_<timestamp> __array_schema.tdb __lock.tdb my_array a1.tdb a2.tdb a2_var.tdb __fragment_metadata.tdb __<uuid>_<timestamp> __array_schema.tdb __lock.tdb __coords.tdb my_array
  • 9. The TileDB Format
 Updates a1.tdb a2.tdb a2_var.tdb __fragment_metadata.tdb __<uuid>_<timestmap2> a1.tdb a2.tdb a2_var.tdb __fragment_metadata.tdb __coords.tdb __<uuid>_<timestmap3> LSM-tree-like updates and consolidation a1.tdb a2.tdb a2_var.tdb __fragment_metadata.tdb __<uuid>_<timestamp1> __array_schema.tdb __lock.tdb my_array
  • 10. The TileDB Format
 Filters Binary data across an attribute Chunk Chunk Chunk Chunk Each chunk fits in L1 cache Atomic unit of filtering Tile Atomic unit of IO Filters Compression (gzip, zstd, …) Byte/Bit Shuffle Encryption Delta encoding Bit-width reduction Filter 1 Filter 2 Filter 1 Filter 2 Filter 1 Filter 2 Filter 1 Filter 2
  • 11. The TileDB Format
 Cloud • TileDB works great on AWS S3 - Just use s3://bucket-name/path/to/array instead of my_array - No concept of directories, natural use of / in the URI - aws s3 sync just works - LSM-tree-based updates excellent fit for such an object store • Adding Azure, Google Cloud and Alibaba Cloud soon
  • 12. TileDB Parallelism • Fully multi-threaded via Intel TBB • TileDB does not rely on an external engine for parallelism (e.g., Dask) • Thread-/Process-safety, no need for locking, multiple reader/writer model • Parallel IO (good use of S3 multipart upload and byte range requests) • Parallel filters • Parallel sorting • Parallel slicing
  • 13. APIs and Integration • Lightweight interfaces between the TileDB C library and HL APIs • Zero-copying wherever possible • Predicate push-down • Effective partitioning (especially for sparse arrays)
  • 14. ND arrays Sparse arrays Compression/Filters Parallel IO Parallelism S3 support Updates Zarr APIs LSM-tree-like chunk-based chunk-based file-based SWMR pushed to app pushed to app multiple multiple only Python multiple pushed to app Blosc / pushed to app pushed to app open-source closed-source open-source pushed to app
  • 15. • In-memory columnar format • DataFrames, limited ND array support • Designed for fast in-memory operations • Rich datatype support, complex objects • Persistence through virtual memory mapping or delegated to external on-disk formats • TileDB integration with Apache Arrow is on our roadmap!
  • 16. TileDB Value to • Manage dense and sparse data persistence using a single API • Get the most from you modern hardware! Concurrent IO, parallel compression, accelerated encryption and more • Easily interface with multiple different storage backends (including cloud storage) and get performance with little to no code changes • Common format that can be leveraged by “big data” / SQL platforms and Python, R, Julia, … ecosystems
  • 17. Thank You We are Hiring ! tiledb.workable.com careers@tiledb.io https://github.com/TileDB-Inc pip install tiledb