Next-generation Python Big Data Tools, powered by Apache Arrow

© Cloudera, Inc. All rights reserved.

Wes McKinney
@wesmckinn

SF Big Analytics Meetup, 2016-04-05

Me

•  Data Science Tools at Cloudera, formerly DataPad CEO/founder
•  Serial creator of structured data tools / user interfaces
•  Wrote bestseller Python for Data Analysis 2012
•  Open source projects
   • Python {pandas, Ibis, statsmodels}
   • Apache {Arrow, Parquet, Kudu (incubating)}
•  Mostly work in Python and Cython/C/C++

In process: Python for Data Analysis: 2nd Edition

Coming late 2016 / early 2017

Python + Big Data: The State of things

•  See “Python and Apache Hadoop: A State of the Union” from February 17
•  Areas where much more work needed
   • Binary file format read/write support (e.g. Parquet files)
   • File system libraries (HDFS, S3, etc.)
   • Client drivers (Spark, Hive, Impala, Kudu)
   • Compute system integration (Spark, Impala, etc.)

Apache Arrow

Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow

Arrow in a Slide

•  New Top-level Apache Software Foundation project
   • Announced Feb 17, 2016
•  Focused on Columnar In-Memory Analytics
   1.  10-100x speedup on many workloads
   2.  Common data layer enables companies to choose best of breed systems
   3.  Designed to work with any programming language
   4.  Support for both relational and complex data as-is
•  Developers from 13+ major open source projects involved
   • A significant % of the world’s data will be processed through Arrow!

Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R

Apache Arrow: What is it?

•  http://arrow.apache.org
•  Not a piece of software, exactly!
•  A standardized in-memory representation for columnar data
•  Enables
   • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
   • Cheap data interchange amongst systems, little or no serialization
   • Flexible support for complex JSON-like data
•  Targets: Impala, Kudu, Parquet, Spark
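To give a rough feel for what a columnar in-memory representation can look like, the sketch below stores a column of nullable strings as three flat buffers: a validity bitmap, an offsets array, and one contiguous bytes buffer. It is a simplified illustration, not the Arrow specification: alignment and padding are ignored, and the helper names `build_string_column` and `get_value` are hypothetical.

```python
from array import array

def build_string_column(values):
    """Encode a list of str-or-None into (validity, offsets, data) buffers."""
    validity = bytearray((len(values) + 7) // 8)  # one bit per slot, least significant bit first
    offsets = array("i", [0])                     # n + 1 integer offsets into data
    data = bytearray()
    for i, value in enumerate(values):
        if value is not None:
            validity[i // 8] |= 1 << (i % 8)      # mark slot i as non-null
            data += value.encode("utf-8")
        offsets.append(len(data))                 # null slots get zero length
    return validity, offsets, data

def get_value(column, i):
    """Read slot i directly from the buffers, without scanning other slots."""
    validity, offsets, data = column
    if not (validity[i // 8] >> (i % 8)) & 1:
        return None
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

column = build_string_column(["impala", None, "kudu", "spark"])
print(get_value(column, 2))
```

Because every value is located through an offset lookup, any system that agrees on the buffer layout can read the column without a parsing step.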

Focus on CPU Efficiency

session_id   timestamp         source_ip
1331246660   3/8/2012 2:44PM   99.155.155.225
1331246351   3/8/2012 2:38PM   65.87.165.114
1331244570   3/8/2012 2:09PM   71.10.106.181
1331261196   3/8/2012 6:46PM   76.102.156.138

[Figure: the same four rows laid out in a Traditional Memory Buffer (Row 1, Row 2, Row 3, Row 4 stored one after another) and in an Arrow Memory Buffer (all session_id values, then all timestamp values, then all source_ip values).]

•  Cache Locality
•  Super-scalar & vectorized operation
•  Minimal Structure Overhead
•  Constant value access
•  With minimal structure overhead
•  Operate directly on columnar compressed data
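The difference between the two buffer layouts can be sketched in plain Python using the four sample records from the slide. This is only an analogy under simple assumptions: the standard-library `array` type stands in for a packed column buffer, and nothing here uses Arrow itself.

```python
from array import array

# The four sample records from the slide: (session_id, timestamp, source_ip).
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Row-oriented scan: every record is visited even though only one field is used.
max_by_rows = max(row[0] for row in rows)

# Column-oriented layout: each field lives in its own contiguous sequence.
columns = {
    "session_id": array("q", (row[0] for row in rows)),  # packed 64-bit integers
    "timestamp": [row[1] for row in rows],
    "source_ip": [row[2] for row in rows],
}

# Columnar scan: only the session_id buffer is read.
max_by_columns = max(columns["session_id"])

print(max_by_rows == max_by_columns)
```

In the columnar form the values of one field sit next to each other in memory, which is what makes cache-friendly and vectorized processing possible.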

High Performance Sharing & Interchange

Today
•  Each system has its own internal memory format
•  70-80% CPU wasted on serialization and deserialization
•  Similar functionality implemented in multiple projects

With Arrow
•  All systems utilize the same memory format
•  No overhead for cross-system communication
•  Projects can share functionality (e.g., Parquet-to-Arrow reader)

[Figure: today, Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark exchange data through pairwise "Copy & Convert" steps; with Arrow, the same systems all connect to a shared Arrow Memory layer.]
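The contrast between "Copy & Convert" and a shared memory format can be illustrated inside a single Python process. This is a loose analogy, not a measurement of the 70-80% figure: `pickle` stands in for a serialization step between two systems, and `memoryview` stands in for a consumer that understands the producer's buffer layout.

```python
import pickle
from array import array

values = array("d", range(1_000_000))  # one million float64 values

# "Copy & Convert": the consumer receives a serialized copy and must decode it.
payload = pickle.dumps(values)
decoded = pickle.loads(payload)

# Shared memory format: the consumer gets a view over the producer's buffer.
view = memoryview(values)

# A write by the producer is visible through the view, because no copy was made;
# the decoded copy still holds the old value.
values[0] = 42.0
print(view[0], decoded[0])
```

The serialized path costs an encode, a decode, and a second copy of the data, while the view costs none of these.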

Big Data Systems: Poor Python IO performance

http://wesmckinney.com/blog/pandas-and-apache-arrow/

Real World Example: Feather File Format for Python and R

•  Problem: fast, language-agnostic binary data frame file format
•  Written by Wes McKinney (Python) Hadley Wickham (R)
•  Read speeds close to disk IO performance

[Figure: a Feather file consists of Arrow array 0, Arrow array 1, …, Arrow array n (Apache Arrow memory) and Feather metadata (Google flatbuffers).]

Real World Example: Feather File Format for Python and R

R

    library(feather)
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)

Python

    import feather
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)

Apache Parquet: Binary columnar storage format

•  I just became a Parquet committer!
•  github.com/apache/parquet-cpp
•  Python users will soon be able to read Parquet files via PyArrow
•  parquet-cpp <-> PyArrow <-> pandas

Language Bindings

•  Target Languages
   • Java (beta)
   • CPP (underway)
   • Python & Pandas (underway)
   • R
   • Julia
•  Initial Focus
   • Read a structure
   • Write a structure
   • Manage Memory

pandas and Arrow in context

RPC & IPC: Moving Data Between Systems

RPC
•  Avoid Serialization & Deserialization
•  Layer TBD: Focused on supporting vectored io
   • Scatter/gather reads/writes against socket

IPC
•  Alpha implementation using memory mapped files
   • Moving data between Python and Drill
•  Working on shared allocation approach
   • Shared reference counting and well-defined ownership semantics
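The memory-mapped-file idea behind the IPC bullets can be sketched with the Python standard library. This shows the general mechanism only, not Arrow's IPC implementation: the file name `column.bin` is hypothetical, and a real exchange would also need metadata describing the column's type and length.

```python
import mmap
import os
import tempfile
from array import array

# Producer: write a column of float64 values to a file as raw bytes.
column = array("d", [1.5, 2.5, 3.5, 4.5])
path = os.path.join(tempfile.mkdtemp(), "column.bin")  # hypothetical file name
with open(path, "wb") as f:
    column.tofile(f)

# Consumer: map the file and reinterpret the bytes in place, without parsing.
with open(path, "rb") as f:
    mapped = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    raw = memoryview(mapped)
    shared = raw.cast("d")      # view the mapped bytes as float64 values
    total = sum(shared)
    # Release both views before closing the mapping.
    shared.release()
    raw.release()
    mapped.close()

print(total)
```

The consumer never deserializes anything; it reads the producer's bytes directly, which only works because both sides agree on the layout.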

Executing data science languages in the compute layer

[Figure: a three-layer stack, with Python, R, Julia, …? shown alongside it.]

UI: Ibis, SQL, Spark API, …
Compute: Analytic SQL, Spark, MapReduce
Storage: HDFS, Kudu, HBase

Real World Example: Python With Spark, Drill, Impala

[Figure: a SQL Engine supplies input partitions (in partition 0 … in partition n - 1); each partition is passed as input to a Python function running user-supplied Python code; each function's output becomes an output partition (out partition 0 … out partition n - 1), which is handed back to the SQL Engine.]
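The partition-at-a-time flow in the diagram can be sketched as follows. Everything here is a stand-in: `run_udf` plays the role of the SQL engine's dispatch loop, `double_values` is a made-up user function, and `array` chunks stand in for columnar partitions. None of this is the API of Spark, Drill, or Impala.

```python
from array import array

def run_udf(partitions, func):
    """Apply a user-supplied function to each input partition independently."""
    return [func(partition) for partition in partitions]

def double_values(partition):
    """Example user code: transform one whole column chunk at a time."""
    return array("d", (2.0 * x for x in partition))

# Three input partitions of unequal size.
in_partitions = [array("d", [1.0, 2.0]), array("d", [3.0]), array("d", [4.0, 5.0])]
out_partitions = run_udf(in_partitions, double_values)
print([list(p) for p in out_partitions])
```

Handing the function a whole chunk rather than one value per call is what lets the per-call overhead be amortized, and a shared columnar format lets the chunk cross the engine/Python boundary without conversion.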

What’s Next

•  Parquet for Python & C++
   • Using Arrow as intermediary
•  Available IPC Implementation
•  Spark, Drill Integration
   • Faster UDFs, Storage interfaces
