TileDB webinars - Nov 4, 2021
Population Genomics is a Data Management Problem
Dr. Stavros Papadopoulos
Founder & CEO of TileDB, Inc.
Who we are
Deep roots at the intersection of HPC, databases and data science
Traction with telecoms, pharmas, hospitals and other scientific organizations
40 members with expertise across all applications and domains
WHERE IT ALL STARTED
TileDB was spun out from MIT and Intel Labs in 2017
Originally developed GenomicsDB (collaboration between Intel and Broad)
INVESTORS
Raised over $20M, we are very well capitalized
Disclaimer
I am the exclusive recipient of complaints
Email me at: stavros@tiledb.com
All the credit for our amazing work goes to our powerful team
Check it out at https://tiledb.com/about
Special Thanks
Helix: their mission is to empower every person to improve their life through DNA
Multi-year collaboration to scale population genomics workflows
Provided tons of ideas and optimizations to TileDB-VCF
Find more info about Helix at helix.com
Agenda
The problem in population genomics
A general solution
A concrete solution with TileDB
Dr. Stephen Kingsmore on genome-informed inpatient pediatric care
TileDB-VCF walkthrough (Dr. Aaron Wolen)
Work in progress
The Problem | On the Surface
A large collection of (single-sample) VCF files
Analysis is done by “slicing” a portion across files
Mainly two approaches:
● Separate single-sample VCFs
● Combined multi-sample VCFs
Specialized downstream tools expect VCF inputs
All solutions are built around files and environments
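
As a rough illustration of what "slicing a portion across files" means, the sketch below scans a set of single-sample VCFs for records in one genomic region. The sample names, the records, and the `slice_region` / `slice_across_files` helpers are hypothetical, and the VCF text is held in memory rather than read from disk. Real pipelines typically work on compressed, indexed files with dedicated tools.

```python
# Toy single-sample VCFs held in memory; sample names and records are made up.
HEADER = "##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n"
VCFS = {
    "sampleA": HEADER + "chr1\t100\t.\tA\tG\nchr1\t250\t.\tC\tT\nchr2\t100\t.\tG\tA\n",
    "sampleB": HEADER + "chr1\t260\t.\tG\tC\nchr1\t900\t.\tT\tA\n",
}


def slice_region(vcf_text, contig, start, end):
    """Return (pos, ref, alt) for records on `contig` with start <= POS <= end."""
    hits = []
    for line in vcf_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta lines and the column header
        chrom, pos, _id, ref, alt = line.split("\t")[:5]
        if chrom == contig and start <= int(pos) <= end:
            hits.append((int(pos), ref, alt))
    return hits


def slice_across_files(vcfs, contig, start, end):
    # One pass per file: this per-file cost is what adds up with many samples.
    return {
        sample: slice_region(text, contig, start, end)
        for sample, text in vcfs.items()
    }


result = slice_across_files(VCFS, "chr1", 200, 300)
print(result)
```

The work is proportional to the number of files even when the region of interest is tiny, which is the latency concern raised for the single-sample approach.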
The Problem | On the Surface
Problem with single-sample VCF
Latency from slicing each file separately adds up
Problems with multi-sample VCF
Storage space scales super-linearly with the number of samples
Multi-sample VCF file cannot be updated (N+1 problem)
Scaling population genomics is blocked
This is just a tiny fraction of the problem in Genomics
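
The two multi-sample problems can be seen in a toy merge. A combined VCF has one row per site and one genotype column per sample, so the number of genotype fields is sites times samples, and the set of sites itself tends to grow as samples are added. The sketch below is only an illustration under stated assumptions: the `merge` helper and the calls are made up, and `./.` stands for a missing call. It does not describe how any particular merging tool works.

```python
def merge(calls):
    """Build a toy multi-sample matrix.

    calls: {sample: {(contig, pos): genotype}}
    returns: (sorted sample names, {site: [genotype per sample]})
    """
    samples = sorted(calls)
    sites = sorted({site for per_sample in calls.values() for site in per_sample})
    matrix = {
        site: [calls[s].get(site, "./.") for s in samples]  # "./." = no call
        for site in sites
    }
    return samples, matrix


calls = {
    "s1": {("chr1", 100): "0/1", ("chr1", 200): "1/1"},
    "s2": {("chr1", 100): "0/1", ("chr1", 300): "0/1"},
}
samples, matrix = merge(calls)
cells_before = len(matrix) * len(samples)               # genotype fields stored
records_before = sum(len(c) for c in calls.values())    # actual variant calls

# The N+1 problem: one new sample adds a column to every row and may add
# new rows, so the whole matrix is rebuilt rather than appended to.
calls["s3"] = {("chr1", 400): "1/1"}
samples, matrix = merge(calls)
cells_after = len(matrix) * len(samples)
records_after = sum(len(c) for c in calls.values())

print(cells_before, records_before, cells_after, records_after)
```

In this toy case one extra variant call doubles the number of stored genotype fields, because the matrix grows in both directions at once.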
The Holistic Problem
Data management is nowhere in the picture
The whole data economics in genomics is flawed
Data Economics
Production: what format the data gets produced in and where it gets stored
Distribution: who has access to the data, what the means of access is, and how it is monetized
Consumption: how tools can compute on the data and where the computation happens
The Production Problem
Numerous VCF files, also some tables
Storage in some cloud bucket or file manager
Specialized applications, wrangling and fusion
Some analytics infra
slow & expensive
often custom & in-house
costly & time consuming
The Distribution Problem #1
wasteful re-invention
Numerous VCF files, also some tables
Storage in some cloud bucket or marketplace
Org #1: Download + Wrangle + Build analytics infra
Org #N: Download + Wrangle + Build analytics infra
The Distribution Problem #2
Data owner bears the distribution cost
Re-invention across data owners
Numerous VCF files, also some tables
wrangling, etc.
some analytics infra
Queries by consumer #1
Queries by consumer #N
The Consumption Problem
inefficient & costly, poor governance
Numerous VCF files, also some tables
Storage in some cloud bucket or server
Group #1: Wrangle + Copy - Use tool & infra #1
Group #N: Wrangle + Copy - Use tool & infra #N
The Solution
Universal data management platform
Data in a universal, analysis-ready format
User / group #1:
any tool, any scale
User / group #N:
any tool, any scale
All Data Science tools
No infrastructure hassles
No downloads or copies
Efficient and cloud-native
Solves N+1 problem
Unifies all data
Accessible by any tool
Global-scale governance
One infra, you own the data
Collaboration and reproducibility
Marketplace built-in
Cost shifted to consumer
Enter TileDB
Secure governance & collaboration
Scalable, serverless compute
Data & code sharing & monetization
Pay-as-you-go, consumer pays
Extreme interoperability
Zero infrastructure
multi-dimensional arrays
Universal data management platform
Data in a universal, analysis-ready format
User / group #1:
any tool, any scale
User / group #N:
any tool, any scale
The Secret Sauce | The Data Model
Store everything as dense or sparse multi-dimensional arrays
Dense array
Sparse array
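
To make the dense/sparse distinction concrete, here is a minimal sketch in plain Python. It is not TileDB code: the `slice_dense` and `slice_sparse` helpers and the values are hypothetical, chosen only to show that a dense array holds a value in every cell of its domain, while a sparse array stores only the non-empty cells together with their coordinates.

```python
# Dense 2D array: every cell in the 2 x 3 domain holds a value.
dense = [
    [1, 2, 3],
    [4, 5, 6],
]

# Sparse 2D array: only non-empty cells are stored, keyed by (row, col).
sparse = {(0, 2): 3.5, (7, 1): 1.0, (42, 9): 2.25}


def slice_dense(array, rows, cols):
    """Return the sub-array for inclusive (lo, hi) row and column ranges."""
    (r0, r1), (c0, c1) = rows, cols
    return [row[c0:c1 + 1] for row in array[r0:r1 + 1]]


def slice_sparse(cells, rows, cols):
    """Return the non-empty cells that fall inside the inclusive ranges."""
    (r0, r1), (c0, c1) = rows, cols
    return {
        (r, c): v
        for (r, c), v in cells.items()
        if r0 <= r <= r1 and c0 <= c <= c1
    }


print(slice_dense(dense, (0, 1), (1, 2)))
print(slice_sparse(sparse, (0, 10), (0, 5)))
```

A dense slice always returns a full block, whereas a sparse slice returns however many cells happen to exist in the range, possibly none.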
Population Genomics with TileDB
Store variant call data as 3D sparse arrays
[Figure: two panels, Storage and Retrieval. Dimensions: Sample (string), Contig (string; chr1, chr2, chr3, ...), Position (uint32; 1, 2, ...). Cells: v1 to v7 and a1. Labels: SNPs, Indel/CNVs, anchor, anchor_gap, query range, expansion, results.]
https://github.com/TileDB-Inc/TileDB-VCF
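
The figure labels (anchor, anchor_gap, query range, expansion) suggest the following reading, sketched here in plain Python rather than with the TileDB-VCF API. The assumptions are mine: each variant is stored as a cell at its start position, a variant longer than `anchor_gap` also gets anchor cells every `anchor_gap` positions along its span, and a query widens its range to the left by `anchor_gap` before filtering for true overlap. The `Variant` type, `build_cells`, `query`, and the gap value are hypothetical names for illustration.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    sample: str
    contig: str
    start: int  # first position covered (inclusive)
    end: int    # last position covered (inclusive)
    label: str


def build_cells(variants, anchor_gap):
    """Return (sample, contig, position, variant) cells, including anchors."""
    cells = []
    for v in variants:
        cells.append((v.sample, v.contig, v.start, v))
        pos = v.start + anchor_gap
        while pos <= v.end:  # anchors only for variants longer than the gap
            cells.append((v.sample, v.contig, pos, v))
            pos += anchor_gap
    return cells


def query(cells, samples, contig, q_start, q_end, anchor_gap):
    """Find variants overlapping [q_start, q_end] for the given samples."""
    lo = q_start - anchor_gap  # expansion: catch cells just left of the range
    hits = set()               # a set removes duplicates from multiple anchors
    for sample, c, pos, v in cells:
        if sample in samples and c == contig and lo <= pos <= q_end:
            if v.start <= q_end and v.end >= q_start:  # keep true overlaps only
                hits.add(v)
    return sorted(hits, key=lambda v: (v.sample, v.start))


ANCHOR_GAP = 1000  # illustrative value, not a documented default
variants = [
    Variant("s1", "chr1", 100, 3500, "long deletion"),
    Variant("s1", "chr1", 5000, 5000, "SNP"),
    Variant("s2", "chr1", 2500, 2500, "SNP"),
]
cells = build_cells(variants, ANCHOR_GAP)
results = query(cells, {"s1", "s2"}, "chr1", 3200, 3300, ANCHOR_GAP)
print(results)
```

Under these assumptions the long deletion is found through its anchor at position 3100 even though its start cell lies far outside the query range, and the expansion never needs to exceed one `anchor_gap`.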
Arrays Subsume Dataframes
Sparse array
Dataframe
Dense vector
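
One way to read "arrays subsume dataframes" is sketched below in plain Python, with made-up rows and a hypothetical `where` helper. The same table is viewed either as a 1D dense array, where the row number is the only dimension and every column is an attribute vector, or as a sparse array, where some columns are promoted to dimensions and the rest stay attributes.

```python
rows = [
    {"sample": "s1", "pos": 100, "alt": "G"},
    {"sample": "s2", "pos": 100, "alt": "T"},
    {"sample": "s1", "pos": 250, "alt": "A"},
]

# 1D dense view: the row number is the dimension, each column is a dense vector.
dense_1d = {col: [r[col] for r in rows] for col in rows[0]}

# 2D sparse view: (sample, pos) are dimensions, "alt" remains an attribute.
sparse_2d = {(r["sample"], r["pos"]): {"alt": r["alt"]} for r in rows}


def where(cells, sample, pos_range):
    """Range query on the dimensions of the sparse view (inclusive bounds)."""
    lo, hi = pos_range
    return {k: v for k, v in cells.items() if k[0] == sample and lo <= k[1] <= hi}


print(dense_1d["alt"][2])
print(where(sparse_2d, "s1", (0, 200)))
```

The sparse view turns a filter on the `sample` and `pos` columns into a slice on dimensions, which is the kind of query the population genomics layout relies on.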
The Secret Sauce | The Data Model
What can be modeled as an array
LiDAR (3D sparse)
SAR (2D or 3D dense)
Population genomics (3D sparse)
Single-cell genomics (2D dense or sparse)
Biomedical imaging (2D or 3D dense)
Even flat files!!! (1D dense)
Time series (ND dense or sparse)
Weather (2D or 3D dense)
Graphs (2D sparse)
Video (3D dense)
Key-values (1D or ND sparse)
Tables (1D dense or ND sparse)
How we built a Universal Database
TileDB Cloud
Unified data management and easy serverless compute at global scale
❏ Access control and logging
❏ Serverless SQL, UDFs, task graphs
❏ Jupyter notebooks and dashboards
Efficient APIs & tool integrations, zero-copy techniques
TileDB Embedded
Open-source interoperable storage with a universal open-spec array format
❏ Parallel IO, rapid reads & writes
❏ Columnar, cloud-optimized
❏ Data versioning & time traveling
TileDB Embedded
Open source: https://github.com/TileDB-Inc/TileDB
Superior performance
● Built in C++
● Fully-parallelized
● Columnar format
● Multiple compressors
● R-trees for sparse arrays
Rapid updates & data versioning
● Immutable writes
● Lock-free
● Parallel reader / writer model
● Time traveling
Extreme interoperability
● Numerous APIs
● Numerous integrations
● All backends
Optimized for the cloud
● Immutable writes
● Parallel IO
● Minimization of requests
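
The "immutable writes" and "time traveling" items can be pictured with a small append-only model. This is a conceptual sketch in plain Python, not TileDB's implementation or API: the `VersionedArray` class, its logical clock, and the cell values are hypothetical. Each write is kept as a separate timestamped batch that is never modified, and a read merges the batches written up to a chosen time.

```python
import itertools


class VersionedArray:
    """Toy sparse array where every write is an immutable, timestamped batch."""

    def __init__(self):
        self._fragments = []              # append-only list of (timestamp, cells)
        self._clock = itertools.count(1)  # logical clock standing in for real time

    def write(self, cells):
        ts = next(self._clock)
        self._fragments.append((ts, dict(cells)))  # earlier batches stay untouched
        return ts

    def read(self, at=None):
        """Merge batches written at or before `at`; later writes win."""
        view = {}
        for ts, cells in self._fragments:
            if at is None or ts <= at:
                view.update(cells)
        return view


arr = VersionedArray()
t1 = arr.write({("s1", 100): "0/1"})
t2 = arr.write({("s2", 100): "1/1"})   # adding a sample is just another batch
t3 = arr.write({("s1", 100): "1/1"})   # an update shadows the older value

print(arr.read(at=t1))
print(arr.read(at=t2))
print(arr.read())
```

Because adding a sample only appends a new batch, this model also shows why an append-only array avoids the N+1 rewrite described earlier for multi-sample VCF files.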
TileDB Cloud
Management. Collaboration. Scalability.
Universal storage
Universal tooling
Universal data (.vcf .csv .bam .fastq)
Universal scale
TileDB Cloud
Built to work anywhere
● Works as SaaS: https://cloud.tiledb.com
● Works on premises
● Currently on AWS, soon on any cloud
It is completely serverless
● Slicing, SQL, UDFs, task graphs
Can launch Jupyter notebooks
● On-demand JupyterHub instances
It is geo-aware
● Compute sent to the data
It is secure
● Authentication, compliance, etc.
TileDB Cloud
Everything is monetizable
● Full marketplace (via Stripe)
Everything is shareable at global scale
● Access control inside and outside your organization
● Make any data and code public
● Discover any public data and code (central catalog)
Everything is an array!
● Jupyter notebooks
● UDFs and task graphs
● ML models
● Dashboards (e.g., R Shiny apps)
● All types of data (even flat files)
Everything is logged
● Full auditability (data, code, any action)
Work in Progress
or why you should depart from file formats and
specialized solutions
RLE compression for strings
Compute directly on compressed data
Compute push-down
Fine-grained access policies
Constant perf optimizations
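
As a small illustration of the first two items, the sketch below run-length encodes a column of strings and answers a count query directly on the runs, without decoding. It is a generic textbook version written for this example (`rle_encode`, `rle_decode`, and `count_equal` are hypothetical helpers) and says nothing about how TileDB implements either feature.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]


def count_equal(runs, target):
    """Compute on the compressed form: add up run lengths, no decoding."""
    return sum(length for value, length in runs if value == target)


contigs = ["chr1"] * 4 + ["chr2"] * 2 + ["chr1"]
runs = rle_encode(contigs)

print(runs)
print(count_equal(runs, "chr1"))
```

Columns such as contig or sample names, which repeat for long stretches when the data is sorted, are the kind of string data where this encoding tends to pay off.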
The Universal Database
Thank you

Population genomics is a data management problem

  • 1. TileDB webinars - Nov 4, 2021 Population Genomics is a Data Management Problem Founder & CEO of TileDB, Inc. Dr. Stavros Papadopoulos
  • 2. Deep roots at the intersection of HPC, databases and data science Traction with telecoms, pharmas, hospitals and other scientific organizations 40 members with expertise across all applications and domains Who we are TileDB was spun out from MIT and Intel Labs in 2017 WHERE IT ALL STARTED Raised over $20M, we are very well capitalized INVESTORS Originally developed GenomicsDB (collaboration between Intel and Broad)
  • 3. Disclaimer I am the exclusive recipient of complaints Email me at: stavros@tiledb.com All the credit for our amazing work goes to our powerful team Check it out at https://tiledb.com/about
  • 4. Their mission is to empower every person to improve their life through DNA Multi-year collaboration to scale population genomics workflows Provided tons of ideas and optimizations to TileDB-VCF Find more info about Helix at helix.com Special Thanks
  • 5. Agenda The problem in population genomics A general solution A concrete solution with TileDB Dr. Stephen Kingsmore on genome-informed inpatient pediatric care TileDB-VCF walkthrough (Dr. Aaron Wolen) Work in progress
  • 6. The Problem | On the Surface A large collection of (single-sample) VCF files ... Analysis is done by “slicing” a portion across files Mainly two approaches: ● Separate single-sample VCFs ● Combined multi-sample VCFs Specialized downstream tools, expecting VCF inputs All solutions around files and environments
  • 7. The Problem | On the Surface Problem with single-sample VCF Latency from slicing each file separately adds up Problems with multi-sample VCF Storage space scales super-linearly with the number of samples Multi-sample VCF file cannot be updated (N+1 problem) Scaling population genomics is blocked This is just a tiny fraction of the problem in Genomics
  • 8. The Holistic Problem Data management is nowhere in the picture The whole data economics in genomics is flawed
  • 9. Data Economics Consumption How tools can compute on the data, where does the computation happen Distribution Who has access to the data, what is the means of access, and monetization Production What format does the data get produced in and where does it get stored
  • 10. The Production Problem. Diagram layers: numerous VCF files, also some tables; storage in some cloud bucket or file manager; specialized applications, wrangling and fusion; some analytics infra. Annotations: slow & expensive; often custom & in-house; costly & time consuming.
  • 11. The Distribution Problem #1: wasteful re-invention. Numerous VCF files, also some tables, sit in storage in some cloud bucket or marketplace. Org #1 through Org #N each: download + wrangle + build analytics infra.
  • 12. The Distribution Problem #2: Data owner bears the distribution cost; re-invention across data owners. Diagram: numerous VCF files, also some tables; wrangling, etc.; some analytics infra; queries by consumer #1 through consumer #N.
  • 13. The Consumption Problem: inefficient & costly, poor governance. Numerous VCF files, also some tables, sit in storage in some cloud bucket or server. Group #1 through Group #N each: wrangle + copy, then use their own tool & infra.
  • 14. The Solution: Universal data management platform. Data in a universal, analysis-ready format. User / group #1 through #N: any tool, any scale. Benefits: All Data Science tools; No infrastructure hassles; No downloads or copies; Efficient and cloud-native; Solves N+1 problem; Unifies all data; Accessible by any tool; Global-scale governance; One infra, you own the data; Collaboration and reproducibility; Marketplace built-in; Cost shifted to consumer.
  • 15. Enter TileDB: Universal data management platform. Data in a universal, analysis-ready format: multi-dimensional arrays. User / group #1 through #N: any tool, any scale. Secure governance & collaboration; Scalable, serverless compute; Data & code sharing & monetization; Pay-as-you-go, consumer pays; Extreme interoperability; Zero infrastructure.
  • 16. The Secret Sauce | The Data Model: Store everything as dense or sparse multi-dimensional arrays. [Figure: a dense array and a sparse array]
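To make the dense/sparse distinction on this slide concrete, here is a minimal pure-Python sketch. It does not use the TileDB API; the class names `DenseArray2D` and `SparseArray2D` are hypothetical and exist only for illustration. The point is the difference in what gets stored: a dense array materializes every cell of its domain, while a sparse array stores only the non-empty cells together with their coordinates.

```python
# Dense array: every cell in the domain holds a value, so a cell's
# coordinates are implied by its offset in a flat buffer.
class DenseArray2D:
    def __init__(self, rows, cols, fill=0):
        self.rows, self.cols = rows, cols
        self.cells = [fill] * (rows * cols)

    def __setitem__(self, coord, value):
        r, c = coord
        self.cells[r * self.cols + c] = value

    def __getitem__(self, coord):
        r, c = coord
        return self.cells[r * self.cols + c]


# Sparse array: only non-empty cells are stored, each with explicit coordinates.
class SparseArray2D:
    def __init__(self):
        self.cells = {}  # (row, col) -> value

    def __setitem__(self, coord, value):
        self.cells[coord] = value

    def slice(self, row_range, col_range):
        """Return non-empty cells inside the inclusive ranges, sorted by coordinate."""
        (r0, r1), (c0, c1) = row_range, col_range
        return sorted(
            (coord, value)
            for coord, value in self.cells.items()
            if r0 <= coord[0] <= r1 and c0 <= coord[1] <= c1
        )


dense = DenseArray2D(4, 4)
dense[1, 2] = 7

sparse = SparseArray2D()
sparse[1_000_000, 3] = "a"
sparse[5, 2_000_000] = "b"

dense_stored = len(dense.cells)    # all 16 cells of the 4x4 domain
sparse_stored = len(sparse.cells)  # only the 2 non-empty cells
```

The sparse array above spans a domain of millions of rows and columns yet stores two cells, which is why very sparse data such as variant calls fits the sparse model.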
  • 17. Population Genomics with TileDB: Store variant call data as 3D sparse arrays. [Figure, "Storage" panel: a 3D sparse array with dimensions Contig (string: chr1, chr2, chr3, ...), Position (uint32: 1, 2, ...) and Sample (string), holding SNPs and indels/CNVs (v1 to v7) plus an anchor (a1); labels: anchor, anchor_gap. "Retrieval" panel: the same array with labels: query range, query range expansion, results.] https://github.com/TileDB-Inc/TileDB-VCF
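The anchor and query-range-expansion labels in this figure can be read as the following scheme, sketched here in pure Python. This is an interpretation of the slide, not TileDB-VCF's actual code: the names `Variant`, `cells_for`, `build` and `query`, the `ANCHOR_GAP` value of 1000 and the sample data are all hypothetical, and a linear scan over a dict stands in for a real sparse-array index. Each variant is stored as a cell at its start position; a long variant (indel/CNV) also gets an anchor cell every `anchor_gap` bases, so a query only has to look `anchor_gap` to the left of its range to find every variant that overlaps it.

```python
from collections import namedtuple

# A variant call covering [start, end] on a contig for one sample.
Variant = namedtuple("Variant", "contig start end sample label")

ANCHOR_GAP = 1000  # hypothetical value, chosen only for this example


def cells_for(variant, anchor_gap=ANCHOR_GAP):
    """Yield one cell at the start position, plus an anchor cell every
    anchor_gap bases for as long as the variant extends."""
    pos = variant.start
    while pos <= variant.end:
        yield (variant.contig, pos, variant.sample), variant
        pos += anchor_gap


def build(variants, anchor_gap=ANCHOR_GAP):
    """Build the 3D sparse array: (contig, position, sample) -> variant."""
    cells = {}
    for variant in variants:
        for coord, record in cells_for(variant, anchor_gap):
            cells[coord] = record
    return cells


def query(cells, contig, lo, hi, anchor_gap=ANCHOR_GAP):
    """Return variants overlapping [lo, hi] on contig.

    The range is expanded to the left by anchor_gap so that a variant
    starting before lo is still found through its start cell or an anchor;
    cells picked up by the expansion that end before lo are filtered out."""
    expanded_lo = max(1, lo - anchor_gap + 1)
    hits = set()
    for (cell_contig, pos, _sample), variant in cells.items():
        if cell_contig == contig and expanded_lo <= pos <= hi and variant.end >= lo:
            hits.add(variant)
    return sorted(hits, key=lambda v: (v.start, v.sample))


variants = [
    Variant("chr1", 100, 100, "s1", "v1"),   # SNP
    Variant("chr1", 500, 4200, "s2", "v2"),  # long CNV, gets anchors
    Variant("chr1", 3900, 3900, "s1", "v3"), # SNP just left of the query
    Variant("chr2", 4000, 4000, "s1", "v4"), # different contig
    Variant("chr1", 2000, 2100, "s3", "v5"), # ends before the query
]
cells = build(variants)
labels = [v.label for v in query(cells, "chr1", 4000, 4100)]
```

Here `v2` starts at position 500, far outside the expanded range, and is found only through its anchor at 3500. The trade-off is that a larger `anchor_gap` stores fewer anchors but widens every query.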
  • 18. Arrays Subsume Dataframes. [Figure: a dataframe shown alongside a dense vector and a sparse array]
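One way to read "arrays subsume dataframes" is that the same table can be viewed either as a 1D dense vector (the row index is the only dimension and every column is an attribute) or as a sparse array (some columns become dimensions and the rest stay attributes). The sketch below illustrates both views in pure Python; the table contents and variable names are made up for the example, and the sparse view assumes the chosen dimension columns uniquely identify a row.

```python
# A small dataframe, written as a list of records (made-up values).
rows = [
    {"contig": "chr1", "pos": 100, "sample": "s1", "alt": "G"},
    {"contig": "chr1", "pos": 250, "sample": "s2", "alt": "T"},
    {"contig": "chr2", "pos": 100, "sample": "s1", "alt": "C"},
]

# View 1: 1D dense vector. The row index is the single dimension and
# each column is stored as its own attribute buffer (columnar layout).
dense_vector = {name: [r[name] for r in rows] for name in rows[0]}
row_1 = {name: column[1] for name, column in dense_vector.items()}

# View 2: sparse array. Two columns become dimensions; the remaining
# columns are attributes of the cell at those coordinates.
dims = ("contig", "pos")
sparse = {
    tuple(r[d] for d in dims): {k: v for k, v in r.items() if k not in dims}
    for r in rows
}

# Slicing on the dimensions replaces a filter over the whole table.
chr1_low = sorted(
    coord for coord in sparse if coord[0] == "chr1" and 1 <= coord[1] <= 200
)
```

Choosing which columns become dimensions is the design decision: it determines which range queries can be answered by slicing rather than by scanning every row.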
  • 19. The Secret Sauce | The Data Model: What can be modeled as an array: LiDAR (3D sparse); SAR (2D or 3D dense); Population genomics (3D sparse); Single-cell genomics (2D dense or sparse); Biomedical imaging (2D or 3D dense); Even flat files!!! (1D dense); Time series (ND dense or sparse); Weather (2D or 3D dense); Graphs (2D sparse); Video (3D dense); Key-values (1D or ND sparse); Tables (1D dense or ND sparse).
  • 20. How we built a Universal Database. TileDB Embedded: open-source interoperable storage with a universal open-spec array format (parallel IO, rapid reads & writes; columnar, cloud-optimized; data versioning & time traveling). Efficient APIs & tool integrations, zero-copy techniques. TileDB Cloud: unified data management and easy serverless compute at global scale (access control and logging; serverless SQL, UDFs, task graphs; Jupyter notebooks and dashboards).
  • 21. TileDB Embedded (open source: https://github.com/TileDB-Inc/TileDB). Superior performance: built in C++; fully parallelized; columnar format; multiple compressors; R-trees for sparse arrays. Rapid updates & data versioning: immutable writes; lock-free; parallel reader / writer model; time traveling.
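The combination of immutable writes and time traveling listed on this slide can be sketched as follows. This is a simplified pure-Python model, not the TileDB storage engine: the class `VersionedArray` and its methods are hypothetical, and the lock-free and parallel aspects are not modeled. Each write is kept as a separate timestamped batch that is never modified afterwards, and a read at a given time merges only the batches written up to that time, with later writes taking precedence.

```python
class VersionedArray:
    """Toy model of an array whose writes are immutable, timestamped batches."""

    def __init__(self):
        self._fragments = []  # append-only list of (timestamp, {coord: value})

    def write(self, timestamp, cells):
        # A write never touches earlier batches; it only appends a new one.
        self._fragments.append((timestamp, dict(cells)))

    def read(self, at=None):
        """Merge batches in timestamp order; if `at` is given, ignore
        batches written after that time (time traveling)."""
        merged = {}
        for timestamp, cells in sorted(self._fragments, key=lambda f: f[0]):
            if at is None or timestamp <= at:
                merged.update(cells)
        return merged


array = VersionedArray()
array.write(1, {(0, 0): "a", (0, 1): "b"})
array.write(2, {(0, 1): "B", (1, 1): "c"})  # updates one cell, adds another

latest = array.read()         # state after both writes
as_of_t1 = array.read(at=1)   # state as it was after the first write
```

Because earlier batches are never rewritten, adding new data does not require rebuilding what is already stored, which is the property the N+1 problem of multi-sample VCF files lacks.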
  • 22. TileDB Embedded (open source: https://github.com/TileDB-Inc/TileDB). Extreme interoperability: numerous APIs; numerous integrations; all backends. Optimized for the cloud: immutable writes; parallel IO; minimization of requests.
  • 23. TileDB Cloud: Management. Collaboration. Scalability. Universal storage; universal tooling; universal data (.vcf, .csv, .bam, .fastq); universal scale.
  • 24. TileDB Cloud. Built to work anywhere: works as SaaS (https://cloud.tiledb.com); works on premises; currently on AWS, soon on any cloud. It is completely serverless: slicing, SQL, UDFs, task graphs. Can launch Jupyter notebooks: on-demand JupyterHub instances. It is geo-aware: compute sent to the data. It is secure: authentication, compliance, etc.
  • 25. TileDB Cloud. Everything is monetizable: full marketplace (via Stripe). Everything is shareable at global scale: access control inside and outside your organization; make any data and code public; discover any public data and code (central catalog). Everything is an array!: Jupyter notebooks; UDFs and task graphs; ML models; dashboards (e.g., R shiny apps); all types of data (even flat files). Everything is logged: full auditability (data, code, any action).
  • 26. Work in Progress, or why you should depart from file formats and specialized solutions: RLE compression for strings; compute directly on compressed data; compute push-down; fine-grained access policies; constant perf optimizations.