Ursa Labs and Apache Arrow in 2019

Infrastructure for Next-generation Data Science
Wes McKinney
PyData Miami
2019-01-11

Ursa Labs Mission
https://ursalabs.org
● Funding and employment for full-time open source developers
● Grow Apache Arrow ecosystem
● Build cross-language, portable computational libraries for data science
● Not-for-profit, funded by multiple corporations

Led by key figures from R and Python worlds

Team
• 5 full-time remote engineers (US, Canada, Europe)
• Contributions from the RStudio tidyverse team
• We’re hiring!
  • Senior computational systems engineer
  • Build / test / packaging automation engineer

Ursa Labs Sponsors
Main sponsor and administrative partner

Much of the data science stack’s computational foundation is severely dated, rooted in 1980s / 1990s FORTRAN-style semantics
● Single-core / single-threaded algorithms
● Naïve execution model, eager evaluation
● Primitive memory management, expensive data access
● Fragmented language ecosystems, “proprietary” memory models …

We can do so much better through modern systems techniques
● Multi-core algorithms, GPU acceleration, code generation (LLVM)
● Lazy evaluation, “query” optimization
● Sophisticated memory management, efficient access to huge data sets
● Interoperable memory models, zero-copy interchange between system components
Note 1: Moore’s Law (and small data) enabled us to get by for a long time without confronting some of these challenges
Note 2: Most of these methods have already been widely employed in analytic databases. Limited “novel” research needed
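To make the contrast between eager and lazy evaluation concrete, here is a minimal sketch in plain Python. The `LazyColumn` class and its methods are hypothetical names invented for this illustration; they are not part of Arrow or pandas. Operations are only recorded, then run in a single fused pass when the result is requested.

```python
# Illustrative only: a toy lazy pipeline, not an Arrow or pandas API.
class LazyColumn:
    """Records element-wise operations instead of running them eagerly."""

    def __init__(self, values, ops=()):
        self._values = values
        self._ops = tuple(ops)

    def map(self, fn):
        # Nothing is computed here; the operation is only recorded.
        return LazyColumn(self._values, self._ops + (("map", fn),))

    def filter(self, pred):
        # Likewise recorded for later.
        return LazyColumn(self._values, self._ops + (("filter", pred),))

    def collect(self):
        # One fused pass over the data: no intermediate list per step.
        out = []
        for v in self._values:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    v = fn(v)
                elif not fn(v):
                    keep = False
                    break
            if keep:
                out.append(v)
        return out


prices = LazyColumn([3.0, 12.5, 7.25, 40.0])
plan = prices.map(lambda x: x * 2).filter(lambda x: x > 10)  # no work done yet
result = plan.collect()  # the whole plan runs here, in one pass
```

An eager library would materialize a full intermediate column after each step. Deferring execution is also what gives a query optimizer the chance to reorder or fuse steps, although this sketch does no reordering.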

Apache Arrow
• Open source project founded in 2016 by key developers of 13 major open source data projects
• Key ideas
  • Language-agnostic, open standard in-memory format for columnar data (aka “data frames”)
  • Bring together database and data science communities to collaborate on shared computational technology
• 3 years old, over 200 unique contributors, > 1 million monthly installs
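As a rough picture of what a columnar in-memory format means: in Arrow's format, a primitive column is a contiguous values buffer plus a validity bitmap that marks nulls, with bits numbered from the least significant bit of each byte. The sketch below mimics that layout using only the Python standard library. The helper names `build_int64_column` and `get` are hypothetical, and real Arrow buffers also carry alignment and padding that are omitted here.

```python
from array import array


def build_int64_column(values):
    """Pack Python ints (or None) into a values buffer plus a validity bitmap."""
    # Null slots still occupy space in the values buffer; 0 is a placeholder.
    data = array("q", (0 if v is None else v for v in values))
    bitmap = bytearray((len(values) + 7) // 8)
    for i, v in enumerate(values):
        if v is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # bit set means "value present"
    return data, bitmap


def get(data, bitmap, i):
    """Read slot i, returning None when its validity bit is 0."""
    if bitmap[i // 8] >> (i % 8) & 1:
        return data[i]
    return None


data, bitmap = build_int64_column([10, None, 30, 40])
null_count = sum(get(data, bitmap, i) is None for i in range(len(data)))
```

Because the values sit in one typed, contiguous buffer, any language that understands the layout can read the column in place without converting it.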

Defragmenting Data Access
Life without Arrow: up to 80-90% of CPU cycles spent on de/serialization
Life with Arrow: no de/serialization
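The difference between the two pictures can be sketched with the standard library alone. Here `pickle` stands in for any serialization step between two components, and a `memoryview` stands in for two components agreeing on one memory layout. This is an analogy under those assumptions, not Arrow code.

```python
import pickle
from array import array

column = array("d", [1.5, 2.5, 3.5, 4.5])

# "Life without": each hand-off serializes the data and then rebuilds it.
wire_bytes = pickle.dumps(column)
copy = pickle.loads(wire_bytes)
copy[0] = 99.0  # the consumer owns a separate copy; the producer is unaffected

# "Life with" a shared layout: the consumer wraps the same memory, no copy.
view = memoryview(column)
view[1] = -1.0  # writes through to the producer's buffer
```

The write through the view is only there to show that both sides see the same bytes. A real consumer would usually treat shared data as read-only.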

The Arrow Development Platform
• Open source library stack offering some level of support for 11 different programming languages
• Focus
• Reuse of runtime data and algorithms without copying or serialization
• Fast data access (storage systems, file formats)
• Efficient data interchange (IPC, RPC)
• Accelerated in-memory computing
• Foundation of new systems, while accelerating existing ones

Worse Patterns            Better Patterns
Custom data structures    (Open) Standard data structures
Copy and convert          Zero copy
Custom algorithms         Standard algorithms
Custom file formats       Standard file formats
Custom wire protocols     Standard wire protocols

Analytic database architecture
Front end API
Computation Engine
In-memory storage
IO and Deserialization
● Vertically integrated / “Black Box”
● Internal components do not have a public API
● Users interact with front end

Analytic database, deconstructed
Front end API
Computation Engine
In-memory storage
IO and Deserialization
● Components have public APIs
● Use what you need
● Different front ends can be developed

Arrow is front end agnostic

Arrow Use Cases
● Data access
○ Read and write widely used storage formats
○ Interact with database protocols, other data sources
● Data movement
○ Zero-copy interprocess communication
○ Efficient RPC / client-server communications
● Computation libraries
○ Efficient in-memory / out-of-core data frame-type analytics
○ LLVM-compilation for vectorized expression evaluation

Some problems relevant to pandas users
• Memory-mapping large on-disk datasets
• Efficient string processing
• Chunked / non-contiguous tables
• Native nested types (structs, arrays, unions)
• Efficient interchange with other systems
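The first item, memory-mapping, can be illustrated without Arrow. The sketch below writes a raw float64 column to a temporary file (the file name `column.f64` is arbitrary) and then maps it, so that only the pages actually touched are read from disk. It assumes the file is read back on the same machine that wrote it, since the values are stored in native byte order.

```python
import mmap
import os
import tempfile
from array import array

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "column.f64")

    # Write one million float64 values to disk as a raw, native-endian column.
    with open(path, "wb") as f:
        array("d", range(1_000_000)).tofile(f)

    # Map the file instead of reading it: pages are loaded lazily on access.
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        # Typed view over the mapped bytes; nothing is copied or parsed.
        with memoryview(mapped) as raw, raw.cast("d") as column:
            n_rows = len(column)
            last_value = column[-1]
            window_total = sum(column[500_000:500_010])  # touches only a few pages
```

This works only because the on-disk bytes already have the layout the consumer wants, which is the property a standardized columnar format is meant to provide.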

Example: Arrow-accelerated Python + Apache Spark
● Joint work with Li Jin from Two Sigma, Bryan Cutler from IBM
● Vectorized user-defined functions, fast data export to pandas
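import pandas as pd
from pyspark.sql.functions import pandas_udf
from scipy import stats

# Vectorized UDF: receives a pandas Series per batch, returns a Series of doubles
@pandas_udf('double')
def cdf(v):
    return pd.Series(stats.norm.cdf(v))

# df is an existing Spark DataFrame with a numeric column 'v'
df.withColumn('cumulative_probability', cdf(df.v))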

Example: Arrow-accelerated Python + Apache Spark
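[Diagram: Spark SQL converts data to Arrow and sends an Arrow Columnar Stream Input to the PySpark Worker (zero copy via socket); the worker converts from Arrow to pandas and back to Arrow, and returns an Arrow Columnar Stream Output that Spark SQL converts from Arrow]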

Example: NVIDIA RAPIDS libraries

Some Industry Contributors to Apache Arrow
ClearCode

2019 Ursa Labs Development Agenda
● File format ingest/export
● Arrow RPC: “Flight” Framework
● Gandiva: LLVM-based expression compiler
● In-memory Columnar Query Engine
● Language interop: Python and R
● Cloud file-system support

2018 Accomplishments
• 3 major releases (0.9, 0.10, 0.11)
• 1600+ resolved JIRA issues
• 7 codebase donations
• Major engineering efforts
  • Improved CI / CD tooling; packaging automation for releases
  • Bootstrap R library development
  • C++ CSV reader
  • Combine Arrow and Parquet C++ codebases
  • GPU support library

File format support
● CSV
● JSON
● Avro
● Parquet
● ORC
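For a sense of what ingesting a row-oriented text format into columns involves, here is a small sketch using Python's built-in `csv` module. The column names and the fixed `converters` mapping are assumptions made for the example. It is not how the Arrow C++ CSV reader works internally.

```python
import csv
import io
from array import array

csv_text = "id,price\n1,9.5\n2,12.0\n3,7.25\n"

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)

# One typed, contiguous buffer per column instead of one Python object per row.
columns = {"id": array("q"), "price": array("d")}
converters = {"id": int, "price": float}  # column types are fixed up front here

for row in reader:
    for name, cell in zip(header, row):
        columns[name].append(converters[name](cell))

mean_price = sum(columns["price"]) / len(columns["price"])
```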

Arrow Flight RPC Framework
• Key idea: standardized high performance data transport
• A gRPC-based framework for defining custom data services that send and receive Arrow columnar data natively
• Uses Protocol Buffers v3 for client protocol
• Pluggable command execution layer, authentication
• Low-level gRPC optimizations (~ 10x faster than comparables)
  • Write Arrow memory directly onto outgoing gRPC buffer
  • Avoid any copying or deserialization

Arrow Flight - Efficient gRPC transport
[Diagram: Client sends DoGet to a Data Node, which returns a stream of FlightData messages, each carrying a Row Batch]
Data transported in a Protocol Buffer, but reads can be made zero-copy by writing a custom gRPC “deserializer”
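The idea of streaming batches while avoiding intermediate copies can be sketched with a plain socket pair. This is not gRPC and not the Flight API. The length-prefixed framing and the helper names `send_batches`, `recv_exact` and `recv_batches` are assumptions made for this illustration. The sender writes each batch's memory straight to the socket, and the receiver reads bytes directly into the buffer it then reinterprets as float64 values.

```python
import socket
import struct
from array import array


def send_batches(sock, batches):
    """Stream each batch as an 8-byte length prefix followed by its raw bytes."""
    for batch in batches:
        payload = memoryview(batch).cast("B")  # view of the column memory, no copy
        sock.sendall(struct.pack("<Q", payload.nbytes))
        sock.sendall(payload)
    sock.sendall(struct.pack("<Q", 0))  # zero length marks end of stream


def recv_exact(sock, view):
    """Fill `view` completely from the socket."""
    got = 0
    while got < len(view):
        n = sock.recv_into(view[got:])
        if n == 0:
            raise ConnectionError("stream closed early")
        got += n


def recv_batches(sock):
    """Yield float64 batches, reading bytes straight into their final buffers."""
    header = bytearray(8)
    while True:
        recv_exact(sock, memoryview(header))
        (nbytes,) = struct.unpack("<Q", header)
        if nbytes == 0:
            return
        buf = bytearray(nbytes)
        recv_exact(sock, memoryview(buf))
        yield memoryview(buf).cast("d")  # reinterpret in place, no decode step


client, server = socket.socketpair()
# The payload is tiny, so it fits in the socket buffer and one thread can do both sides.
send_batches(server, [array("d", [1.0, 2.0]), array("d", [3.0, 4.0, 5.0])])
received = [list(batch) for batch in recv_batches(client)]
client.close()
server.close()
```

A production protocol would also carry schema and metadata and would distinguish an empty batch from end of stream, which this framing does not.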

Gandiva, LLVM-powered expression compiler
• Initially developed by Dremio, donated to Apache Arrow
• Efficient evaluation of projections, filters, and aggregates
• Uses LLVM for runtime code generation
• Dremio uses it to accelerate a Java-based distributed SQL engine
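Runtime code generation for expressions can be shown in miniature with Python's built-in `compile()`, which here stands in for LLVM and produces Python bytecode rather than machine code. The function `compile_projection` and the generated `kernel` are hypothetical names for this sketch, which has no connection to Gandiva's actual API. The expression text is assumed to come from a trusted source.

```python
from array import array


def compile_projection(expression, column_names):
    """Generate and compile a function that evaluates `expression` per row.

    `expression` is trusted source text that uses the column names as variables.
    """
    params = ", ".join(f"_col_{name}" for name in column_names)
    row_vars = ", ".join(column_names)
    # Build specialized source for this one expression, then compile it once.
    source = (
        f"def kernel({params}):\n"
        f"    return [{expression} for ({row_vars},) in zip({params})]\n"
    )
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["kernel"]


price = array("d", [10.0, 20.0, 30.0])
qty = array("q", [1, 2, 3])

# Compile once, then reuse the kernel on every batch of columns.
scaled_revenue = compile_projection("price * qty * 1.5", ["price", "qty"])
totals = scaled_revenue(price, qty)
```

The design point carried over from the slide is that the expression is turned into specialized code once, instead of being interpreted again for every row.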

Using Gandiva from Java with zero-copy
SELECT year(timestamp), month(timestamp), …
FROM table
...
[Diagram: Input Table Fragment (Arrow Java) → JNI (zero-copy) → Evaluate with Gandiva LLVM Function (Arrow C++) → Result Table Fragment]

Cloud Service Support
● Support data engineering workflows on AWS, GCP, Azure
● Optimized IO for cloud blob storage (S3, GCS, etc.)
● Implemented in C++, so it can be used from Python, R, Ruby, etc.

Looking ahead
• 2019 likely to be a year of rapid growth for Apache Arrow
• Grow community, diversity of language and scope
• Join us: https://github.com/apache/arrow
