Ursa Labs and Apache Arrow in 2019
Infrastructure for Next-generation Data Science
Wes McKinney
PyData Miami
2019-01-11
Ursa Labs Mission
https://ursalabs.org
● Funding and employment for full-time open source developers
● Grow the Apache Arrow ecosystem
● Build cross-language, portable computational libraries for data science
● Not-for-profit, funded by multiple corporations
Led by key figures from R and Python worlds
Team
• 5 full-time remote engineers (US, Canada, Europe)
• Contributions from the RStudio tidyverse team
• We’re hiring!
  • Senior computational systems engineer
  • Build / test / packaging automation engineer
Ursa Labs Sponsors
Main sponsor and administrative partner
Sponsors help in many ways
Much of the data science stack’s computational foundation is severely dated, rooted in 1980s / 1990s FORTRAN-style semantics
● Single-core / single-threaded algorithms
● Naïve execution model, eager evaluation
● Primitive memory management, expensive data access
● Fragmented language ecosystems, “proprietary” memory models …
We can do so much better through modern systems techniques
● Multi-core algorithms, GPU acceleration, code generation (LLVM)
● Lazy evaluation, “query” optimization
● Sophisticated memory management, efficient access to huge data sets
● Interoperable memory models, zero-copy interchange between system components
Note 1: Moore’s Law (and small data) enabled us to get by for a long time without confronting some of these challenges
Note 2: Most of these methods have already been widely employed in analytic databases. Limited “novel” research needed
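As a rough illustration of the contrast between eager evaluation and lazy evaluation with a simple "query" optimization, the sketch below records operations on a column instead of running them, then executes the whole plan in one fused pass. The `LazyColumn` class and its methods are hypothetical names invented for this example; they belong to no Arrow or pandas API.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class LazyColumn:
    """Hypothetical lazy column: records operations, runs them on collect()."""
    values: List[float]
    ops: List[Tuple[str, Callable]] = field(default_factory=list)

    def map(self, fn: Callable[[float], float]) -> "LazyColumn":
        # Nothing is computed here; the step is only recorded in the plan.
        return LazyColumn(self.values, self.ops + [("map", fn)])

    def filter(self, pred: Callable[[float], bool]) -> "LazyColumn":
        return LazyColumn(self.values, self.ops + [("filter", pred)])

    def collect(self) -> List[float]:
        # Fused execution: one pass over the data, no intermediate lists.
        out = []
        for v in self.values:
            keep = True
            for kind, fn in self.ops:
                if kind == "map":
                    v = fn(v)
                elif not fn(v):
                    keep = False
                    break
            if keep:
                out.append(v)
        return out


def eager(values: List[float]) -> List[float]:
    # Eager equivalent: each step materializes a full intermediate list.
    doubled = [v * 2 for v in values]
    return [v for v in doubled if v > 4]


data = [1.0, 2.0, 3.0, 4.0]
plan = LazyColumn(data).map(lambda v: v * 2).filter(lambda v: v > 4)
result = plan.collect()
```

Both paths give the same answer here; the lazy version simply defers the work until `collect()`, which is what gives an engine room to reorder or fuse steps.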
Apache Arrow
• Open source project founded in 2016 by key developers of 13 major open source data projects
• Key ideas
  • Language agnostic, open standard in-memory format for columnar data (aka “data frames”)
  • Bring together database and data science communities to collaborate on shared computational technology
• 3 years old, over 200 unique contributors, > 1 million monthly installs
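To make the columnar idea concrete, here is a minimal sketch of one nullable integer column stored as a contiguous values buffer plus a validity bitmap (one bit per value, 1 meaning present). It follows the general shape of a columnar layout with a validity bitmap, but it is a simplified illustration with hypothetical helper names, not the Arrow specification or any Arrow library API.

```python
from array import array
from typing import List, Optional, Tuple


def build_column(values: List[Optional[int]]) -> Tuple[array, bytearray]:
    """Pack Python values into a values buffer and a validity bitmap."""
    data = array("q", [0] * len(values))          # contiguous 64-bit ints
    validity = bytearray((len(values) + 7) // 8)  # one bit per slot
    for i, v in enumerate(values):
        if v is not None:
            data[i] = v
            validity[i // 8] |= 1 << (i % 8)      # mark slot i as present
    return data, validity


def is_valid(validity: bytearray, i: int) -> bool:
    return bool(validity[i // 8] & (1 << (i % 8)))


def column_sum(data: array, validity: bytearray) -> int:
    # Scans one contiguous buffer; nulls are skipped via the bitmap.
    return sum(data[i] for i in range(len(data)) if is_valid(validity, i))


# Row-oriented records, as a typical application might hold them.
rows = [{"id": 1, "score": 10}, {"id": 2, "score": None}, {"id": 3, "score": 5}]

# Column-oriented: one buffer pair per field.
score_data, score_validity = build_column([r["score"] for r in rows])
total = column_sum(score_data, score_validity)
```

Because every value of a field sits in one buffer, an aggregate over that field touches only that buffer rather than every record.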
Defragmenting Data Access
Life without Arrow: up to 80-90% of CPU cycles spent on de/serialization
Life with Arrow: no de/serialization
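The difference between serializing data for a consumer and sharing it in place can be sketched with Python's built-in buffer protocol. This is a single-process illustration using only the standard library, not Arrow itself, and the variable names are ours.

```python
import pickle
from array import array

# A producer component holds float64 values in one contiguous buffer.
column = array("d", range(100_000))

# Without a shared format: serialize, then deserialize into a second copy.
copied = pickle.loads(pickle.dumps(column))

# With a shared memory layout: the consumer gets a view of the same bytes.
view = memoryview(column)
window = view[10:20]           # slicing a memoryview does not copy either

# The view reflects later writes by the producer; the copy does not.
column[10] = -1.0
shares_memory = window[0] == -1.0
copy_is_stale = copied[10] == 10.0
```

The serialized path pays for encoding, decoding, and a second allocation; the view path pays for none of them, which is the saving the slide refers to.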
The Arrow Development Platform
• Open source library stack offering some level of support for 11 different programming languages
• Focus
  • Reuse of runtime data and algorithms without copying or serialization
  • Fast data access (storage systems, file formats)
  • Efficient data interchange (IPC, RPC)
  • Accelerated in-memory computing
• Foundation of new systems, while accelerating existing ones
Worse Patterns            Better Patterns
Custom data structures    (Open) Standard data structures
Copy and convert          Zero copy
Custom algorithms         Standard algorithms
Custom file formats       Standard file formats
Custom wire protocols     Standard wire protocols
Analytic database architecture
Front end API
Computation Engine
In-memory storage
IO and Deserialization
● Vertically integrated / “Black Box”
● Internal components do not have a public API
● Users interact with front end
Analytic database, deconstructed
Front end API
Computation Engine
In-memory storage
IO and Deserialization
● Components have public APIs
● Use what you need
● Different front ends can be developed
Analytic database, deconstructed
Front end API
Computation Engine
In-memory storage
IO and Deserialization
Arrow is front end agnostic
Arrow Use Cases
● Data access
○ Read and write widely used storage formats
○ Interact with database protocols, other data sources
● Data movement
○ Zero-copy interprocess communication
○ Efficient RPC / client-server communications
● Computation libraries
○ Efficient in-memory / out-of-core data frame-type analytics
○ LLVM-compilation for vectorized expression evaluation
Some problems relevant to pandas users
• Memory-mapping large on-disk datasets
• Efficient string processing
• Chunked / non-contiguous tables
• Native nested types (structs, arrays, unions)
• Efficient interchange with other systems
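The memory-mapping item can be illustrated with the standard library alone. The sketch writes a column of float64 values to a temporary file, maps it, and reads a slice without loading the whole file into a Python object. The file layout (raw native-endian doubles, no header) and the file name are assumptions made for this example, not an Arrow file format.

```python
import mmap
import os
import tempfile
from array import array

n = 100_000

with tempfile.TemporaryDirectory() as tmp:
    # Write a raw float64 column to disk (hypothetical headerless layout).
    path = os.path.join(tmp, "column.f64")
    with open(path, "wb") as f:
        array("d", (float(i) for i in range(n))).tofile(f)

    with open(path, "rb") as f:
        mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        raw = memoryview(mapped)
        doubles = raw.cast("d")        # typed view over the mapping, no copy

        # Only the pages backing these elements need to be read from disk.
        tail = doubles[n - 3:].tolist()
        count = len(doubles)

        # Release the views before closing the mapping.
        doubles.release()
        raw.release()
        mapped.close()
```

The same pattern scales to files larger than RAM, since the operating system pages data in on demand.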
Example: Arrow-accelerated Python + Apache Spark
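● Joint work with Li Jin from Two Sigma, Bryan Cutler from IBM
● Vectorized user-defined functions, fast data export to pandas

import pandas as pd
from scipy import stats
from pyspark.sql.functions import pandas_udf

# Vectorized UDF: called with a pandas Series per batch, returns a Series
@pandas_udf('double')
def cdf(v):
    return pd.Series(stats.norm.cdf(v))

# df is an existing Spark DataFrame with a numeric column v
df.withColumn('cumulative_probability', cdf(df.v))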
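Example: Arrow-accelerated Python + Apache Spark
[Diagram: Spark SQL converts data to an Arrow columnar stream (input), which reaches the PySpark worker zero-copy via a socket; the worker converts from Arrow to pandas and back to Arrow, and the Arrow columnar stream (output) is converted from Arrow on the Spark SQL side]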
Example: NVIDIA RAPIDS libraries
Some Industry Contributors to Apache Arrow
ClearCode
2019 Ursa Labs Development Agenda
● File format ingest/export
● Arrow RPC: “Flight” Framework
● Gandiva: LLVM-based expression compiler
● In-memory Columnar Query Engine
● Language interop: Python and R
● Cloud file-system support
2018 Accomplishments
• 3 major releases (0.9, 0.10, 0.11)
• 1600+ resolved JIRA issues
• 7 codebase donations
• Major engineering efforts
  • Improved CI / CD tooling; packaging automation for releases
  • Bootstrap R library development
  • C++ CSV reader
  • Combine Arrow and Parquet C++ codebases
  • GPU support library
File format support
● CSV
● JSON
● Avro
● Parquet
● ORC
Arrow Flight RPC Framework
• Key idea: standardized high performance data transport
• A gRPC-based framework for defining custom data services that send and receive Arrow columnar data natively
• Uses Protocol Buffers v3 for client protocol
• Pluggable command execution layer, authentication
• Low-level gRPC optimizations (~ 10x faster than comparables)
  • Write Arrow memory directly onto outgoing gRPC buffer
  • Avoid any copying or deserialization
Arrow Flight - Efficient gRPC transport
[Diagram: Client issues DoGet to a Data Node, which streams back FlightData messages, each carrying a Row Batch]
Data transported in a Protocol Buffer, but reads can be made zero-copy by writing a custom gRPC “deserializer”
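One way to picture the zero-copy read path: a conventional deserializer copies each payload out of the receive buffer, while a custom one hands back views into it. The framing used below (a 4-byte little-endian length prefix per batch) is invented for this sketch and is not the Flight or Protocol Buffers wire format; `encode_batches` and `decode_views` are hypothetical helpers.

```python
import struct
from array import array
from typing import Iterator, List


def encode_batches(batches: List[array]) -> bytes:
    """Frame each batch as <uint32 length><payload bytes>."""
    out = bytearray()
    for batch in batches:
        payload = batch.tobytes()
        out += struct.pack("<I", len(payload)) + payload
    return bytes(out)


def decode_views(buffer: bytes) -> Iterator[memoryview]:
    """Yield each batch as a typed view into `buffer`, without copying."""
    view = memoryview(buffer)
    offset = 0
    while offset < len(view):
        (length,) = struct.unpack_from("<I", view, offset)
        offset += 4
        yield view[offset:offset + length].cast("d")
        offset += length


# Sender side: a stream of float64 batches, framed into one byte string.
batches = [array("d", [1.0, 2.0]), array("d", [3.0, 4.0, 5.0])]
wire = encode_batches(batches)

# Receiver side: batches are views over the received bytes.
received = list(decode_views(wire))
totals = [sum(b) for b in received]
```

Only the 4-byte headers are parsed; the payload bytes are never moved, which is the property the custom deserializer is after.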
Gandiva, LLVM-powered expression compiler
• Initially developed by Dremio, donated to Apache Arrow
• Efficient evaluation of projections, filters, and aggregates
• Uses LLVM for runtime code generation
• Dremio uses it to accelerate a Java-based distributed SQL engine
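The idea of compiling an expression once and then running the generated code over many rows can be sketched in pure Python. Here the code generation target is Python bytecode via the built-in `compile`, standing in for LLVM IR. The tuple-based expression format and `compile_expr` are hypothetical and unrelated to Gandiva's actual API, and column names are assumed to be valid Python identifiers.

```python
from typing import Callable, List, Tuple, Union

Expr = Union[str, int, float, Tuple]  # column name, literal, or (op, lhs, rhs)

_OPS = {"add": "+", "sub": "-", "mul": "*", "gt": ">", "lt": "<"}


def to_source(expr: Expr, columns: List[str]) -> str:
    """Translate an expression tree into a Python source fragment."""
    if isinstance(expr, tuple):
        op, lhs, rhs = expr
        return f"({to_source(lhs, columns)} {_OPS[op]} {to_source(rhs, columns)})"
    if isinstance(expr, str):
        if expr not in columns:
            raise KeyError(f"unknown column: {expr}")
        return expr
    return repr(expr)


def compile_expr(expr: Expr, columns: List[str]) -> Callable:
    """Generate and compile a row function once; reuse it for every row."""
    source = f"lambda {', '.join(columns)}: {to_source(expr, columns)}"
    return eval(compile(source, "<generated>", "eval"), {"__builtins__": {}})


# Projection: price * qty. Filter: price * qty > 20.
revenue = compile_expr(("mul", "price", "qty"), ["price", "qty"])
big_sale = compile_expr(("gt", ("mul", "price", "qty"), 20), ["price", "qty"])

prices, qtys = [5.0, 12.5, 3.0], [2, 4, 1]
projected = [revenue(p, q) for p, q in zip(prices, qtys)]
selected = [i for i, (p, q) in enumerate(zip(prices, qtys)) if big_sale(p, q)]
```

The tree is walked once at compile time rather than once per row, which is where an interpreter-style evaluator would otherwise spend its time.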
Using Gandiva from Java with zero-copy
SELECT year(timestamp), month(timestamp), …
FROM table
...
[Diagram: Input Table Fragment (Arrow Java) → JNI (zero-copy) → Evaluate Gandiva LLVM Function (Arrow C++) → Result Table Fragment]
Cloud Service Support
● Support data engineering workflows on AWS, GCP, Azure
● Optimized IO for cloud blob storage (S3, GCS, etc.)
● Implemented in C++, so it can be used from Python, R, Ruby, etc.
Looking ahead
• 2019 is likely to be a year of rapid growth for Apache Arrow
• Grow community, diversity of language and scope
• Join us: https://github.com/apache/arrow
