25127
The Data Lake Engine
Spark + AI Summit 2020
Data Science Across Data Sources with Apache Arrow
Dremio is the Data Lake Engine Company
Powering the cloud data lakes of the world’s leading companies across all industries
Creators of
Over $100M raised
Tomer Shiran
Co-Founder & CPO, Dremio
tomer@dremio.com
Background
Your Data Lake is Exploding, Yet Your Data Remains Inaccessible
Data lakes are becoming the primary place that data lands
>100% YoY S3 data growth (1)
>50% of data will live on cloud data lake storage by 2025 (2)
But consuming the data is too slow & too difficult
[Diagram: SQL data consumers blocked (X) from data lake storage (S3, ADLS)]
1) Estimate based on historical growth https://aws.amazon.com/blogs/aws/amazon-s3-growth-for-2011-now-762-billion-objects/
2) Estimate based on trends around cloud migration plus growth in semi-structured and unstructured data
Data Movement is the Typical Workaround for Data Lake Storage
1. Brittle & complex ETL/ELT
2. Proprietary & expensive DW/Data Marts
3. Proliferating Cubes, BI Extracts, & Aggregation Tables
Decreasing Data Scope & Flexibility
[Diagram, built up over four slides: data moves from Data Lake Storage (ADLS, S3) through steps 1 to 3 before BI Users and Data Scientists reach it via SQL]
[Diagram: the same three data-movement layers (ETL/ELT, DW/Data Marts, Cubes/BI Extracts/Aggregation Tables) between Data Lake Storage (ADLS, S3) and BI Users / Data Scientists]
Query data lake storage directly with 4-100X performance
Powered by .
What is Apache Arrow?
Columnar In-Memory Representation (Row-based vs. Column-based)
Many Language Bindings
Broad Industry Adoption
10+ Downloads per Month
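The row-based versus column-based distinction can be illustrated with a small standard-library sketch. This is not Arrow's actual memory format; the sample records, the field names `id` and `price`, and the helper `to_columns` are made up for illustration. The point is only that a columnar layout keeps each field in one contiguous typed array, so scanning a single field does not touch the others.

```python
from array import array

# Row-based: each record carries every field together.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 3.0},
    {"id": 3, "price": 7.25},
]


def to_columns(records):
    """Pivot row records into one contiguous typed array per field."""
    return {
        "id": array("q", (r["id"] for r in records)),        # 64-bit signed ints
        "price": array("d", (r["price"] for r in records)),  # 64-bit floats
    }


# Column-based: one typed buffer per field.
columns = to_columns(rows)

# Summing one field walks every record in the row layout,
# but only a single buffer in the columnar layout.
row_total = sum(r["price"] for r in rows)
col_total = sum(columns["price"])
print(row_total, col_total)
```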
Apache Arrow Gandiva Improves CPU Efficiency
✓ A standalone C++ library for efficient evaluation of arbitrary SQL expressions on Arrow vectors using runtime code generation in LLVM
✓ Expressions are compiled to LLVM bytecode (IR), optimized & translated to machine code
✓ Gandiva enables vectorized execution with Intel SIMD instructions
[Diagram: a SQL expression and pre-compiled functions (.bs) feed the Gandiva compiler (IR Builder, Optimize), which produces a vectorized execution kernel that reads an input Arrow buffer and writes an output Arrow buffer]
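As a loose analogy for the compile-once, run-over-columns idea, the sketch below generates Python source for an expression at runtime, compiles it a single time, and then applies the resulting kernel to whole columns. It does not use Gandiva or LLVM and produces no SIMD code; `compile_projection` and the column names are hypothetical.

```python
def compile_projection(expression, column_names):
    """Generate Python source for one expression and compile it once."""
    args = ", ".join(f"col_{name}" for name in column_names)
    row_vars = ", ".join(column_names)
    source = (
        f"def kernel({args}):\n"
        f"    return [{expression} for ({row_vars},) in zip({args})]\n"
    )
    namespace = {}
    # Only compile trusted expressions: exec runs arbitrary code.
    exec(compile(source, "<generated kernel>", "exec"), namespace)
    return namespace["kernel"]


# The expression is compiled once...
kernel = compile_projection("price * quantity", ["price", "quantity"])

# ...and the compiled kernel is then reused on input columns.
totals = kernel([2.0, 3.0], [4, 5])
print(totals)
```

The cost of parsing and compiling the expression is paid once per query rather than once per row, which is the part of the design this sketch is meant to show.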
4.5x-90x Faster than Java-based Code Generation
Test                 Project time (secs)   Project time (secs)   Improvement
                     with Java JIT         with Gandiva LLVM
Sum                  3.805                 0.558                 6.8x
Project 5 columns    8.681                 1.689                 5.13x
Project 10 columns   24.923                3.476                 7.74x
CASE-10              4.308                 0.925                 4.66x
CASE-100             1361                  15.187                89.6x
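The improvement column appears to be the Java JIT time divided by the Gandiva LLVM time. The sketch below recomputes that ratio from the timings listed on the slide, under that assumption. Four of the five rows come out within rounding of the slide's figures; the "Project 10 columns" row comes out near 7.17x from the listed timings rather than 7.74x, so one of the slide's numbers for that row may be mistyped.

```python
# (Java JIT seconds, Gandiva LLVM seconds), copied from the slide.
timings = {
    "Sum": (3.805, 0.558),
    "Project 5 columns": (8.681, 1.689),
    "Project 10 columns": (24.923, 3.476),
    "CASE-10": (4.308, 0.925),
    "CASE-100": (1361.0, 15.187),
}


def speedup(java_secs, gandiva_secs):
    """Assumed definition of 'Improvement': baseline time / new time."""
    return java_secs / gandiva_secs


speedups = {name: speedup(*pair) for name, pair in timings.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```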
Dremio’s Arrow-based Columnar Cloud Cache (C3) Accelerates I/O
✓ Columnar cloud cache (C3) automatically provides NVMe-level I/O performance when reading from S3/ADLS
✓ Arrow persistence enables granular caching as Arrow buffers in local engine NVMe
✓ Bypass data deserialization and decompression
✓ Enables high-concurrency, low-latency BI workloads on cloud data lake storage
[Diagram: two engines (labeled XL and M), one with four executors and one with three, read from AWS S3; each executor has local NVMe used by C3 with Apache Arrow persistence]
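The general read-through caching pattern behind a local cache in front of object storage can be sketched as follows. This is not Dremio's C3 implementation: the class `ReadThroughCache`, the `(file, column, block)` key shape, the LRU eviction policy, and the `fake_object_store` function are all assumptions made for illustration.

```python
from collections import OrderedDict


class ReadThroughCache:
    """Serve blocks from a local store, fetching from remote storage on a miss."""

    def __init__(self, fetch_remote, capacity):
        self._fetch_remote = fetch_remote
        self._capacity = capacity
        self._blocks = OrderedDict()  # insertion order doubles as recency order
        self.hits = 0
        self.misses = 0

    def read(self, key):
        if key in self._blocks:
            self._blocks.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._blocks[key]
        self.misses += 1
        data = self._fetch_remote(key)
        self._blocks[key] = data
        if len(self._blocks) > self._capacity:
            self._blocks.popitem(last=False)  # evict least recently used block
        return data


remote_reads = []


def fake_object_store(key):
    """Hypothetical stand-in for a slow read from S3/ADLS."""
    remote_reads.append(key)
    return f"bytes for {key}".encode()


cache = ReadThroughCache(fake_object_store, capacity=2)
first = cache.read(("sales.parquet", "price", 0))   # miss: goes to remote storage
second = cache.read(("sales.parquet", "price", 0))  # hit: served locally
print(cache.hits, cache.misses, len(remote_reads))
```

Keying the cache by column and block rather than by whole file is what makes the caching granular: a query that touches one column only pulls and keeps that column's blocks.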
The Open Data Platform
[Diagram: four layers labeled Client, Compute, Data, and Storage]
Compute: Interactive SQL & BI | Data science & batch | Occasional SQL | Batch processing (services named: Athena, EMR)
Data: File formats: Text | JSON | Parquet | ORC
Data: Table formats: Glue | Hive Metastore | Delta Lake | Iceberg
Storage: AWS S3 | ADLS | HDFS
We Need Fast, Industry-Standard Data Exchange
[Diagram: the same Client, Compute, Data, and Storage layers as on the Open Data Platform slide, with numbered callouts 1 to 4]
Arrow Flight is an Arrow-based RPC Interface
✓ High-performance wire protocol
✓ Parallel streams of Arrow buffers are transferred
✓ Delivers on the interoperability promise of Apache Arrow
✓ Client-cluster and cluster-cluster communication
[Diagram labels: Arrow Flight, dataframe]
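Arrow Flight Python Client
import pyarrow.flight as flt
# Placeholder query; replace with SQL that is valid on your Flight server.
sql = "SELECT 1"
# Recent pyarrow releases connect with a location URI instead of FlightClient.connect(host, port).
c = flt.connect("grpc+tcp://localhost:47470")
# Wrap the query in a descriptor and ask the server where to fetch the results.
fd = flt.FlightDescriptor.for_command(sql)
fi = c.get_flight_info(fd)
# Read the first endpoint's stream and convert the Arrow table to a pandas DataFrame.
ticket = fi.endpoints[0].ticket
df = c.do_get(ticket).read_all().to_pandas()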
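The client code above reads only the first endpoint. When a server returns several endpoints, each one can be read as its own stream, which is where the parallel transfer comes from. The sketch below shows that fan-out pattern with the standard library only; `fetch_endpoint` is a hypothetical stand-in for a call such as `c.do_get(endpoint.ticket).read_all()`, and the endpoints are fake `(start, stop)` ranges so the example runs without a Flight server.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_endpoint(endpoint):
    """Hypothetical stand-in for reading one Flight endpoint's stream."""
    start, stop = endpoint
    return list(range(start, stop))


def fetch_all(endpoints, max_workers=4):
    """Read every endpoint concurrently and concatenate the results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() returns results in the same order as the input endpoints.
        chunks = list(pool.map(fetch_endpoint, endpoints))
    return [value for chunk in chunks for value in chunk]


# Three fake endpoints, each covering a slice of the result set.
endpoints = [(0, 3), (3, 6), (6, 9)]
result = fetch_all(endpoints)
print(result)
```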
Client-Cluster Communication
Cluster-Cluster Communication
Demo
Q&A
Dremio is the Data Lake Engine
A Next-Generation Data Lake Query Engine for Live, Interactive Analytics Directly on Data Lake Storage
[Diagram: Data Users (BI Users, Data Scientists) query via SQL through the Data Lake Engine, which reads Data Lake Storage (ADLS, S3) and Optional External Sources]
Accelerate Business
100X BI query speed
4X Ad-hoc query speed
0 cubes, extracts, or aggregation tables
Reduce Cost & Risk
10x lower AWS EC2 / Azure VM spend for same performance
0 lock-in, loss of control, and duplication of data
Powered by
