3. @jreback
● Former quant
● Senior Engineer at Two Sigma, working on holistic approaches to
modeling
● Core committer to pandas for the last 5 years
● Managed pandas since 2013
12. pandas’s role in the Python Data Ecosystem
[Diagram: pandas sits at the center of the ecosystem, connecting users with numerical computing, IO / data access, data visualization, and statistics + machine learning libraries]
22. Performance
● Custom in-memory format
● Eager evaluation model, no query planning
● "Slow": limited multicore algorithms for large datasets
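To illustrate the eager-evaluation point, a minimal sketch (the DataFrame and column names are hypothetical):

```python
import pandas as pd

# Hypothetical data; every step below executes immediately
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Eager evaluation: each expression materializes a full intermediate
# result in memory -- there is no query planner to fuse these steps.
tmp = df["a"] + df["b"]   # intermediate Series allocated here
result = tmp * 2          # second full pass over the data

print(result.tolist())    # [22, 44, 66]
```

A query-planning engine could fuse both steps into a single pass over the data; pandas executes them one at a time, allocating each intermediate in full.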
23. Data Tooling Spectrum
Small Data: < 5 GB
“Medium” Data: 5-100 GB
“Big” Data: > 100 GB
pandas starts to fail as an effective tool somewhere around the 10 GB mark
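A quick way to check where a given dataset falls on this spectrum is pandas' own memory accounting (the columns here are illustrative):

```python
import pandas as pd

# Illustrative data: one numeric column, one object (string) column
df = pd.DataFrame({"ints": range(1000),
                   "strs": [str(i) for i in range(1000)]})

# deep=True counts the Python-object overhead of string columns,
# which is usually where in-memory size balloons past on-disk size.
total_bytes = int(df.memory_usage(deep=True).sum())
print(total_bytes)
```

Comparing `deep=True` against the default (shallow) count shows how much of the footprint is object overhead rather than raw values.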
27. Big Data Unfriendly
● Each system has its own internal memory format
● 70-80% of computation wasted on serialization and deserialization
● Similar functionality implemented in multiple projects
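One way to see the serialization cost: without a shared memory format, moving data across a system boundary means a full serialize/deserialize round trip. A minimal sketch using pickle as the stand-in boundary (the data is illustrative):

```python
import pickle
import pandas as pd

# Illustrative dataset
df = pd.DataFrame({"x": range(1000),
                   "y": [float(i) for i in range(1000)]})

# Without a shared in-memory format, crossing a system boundary means
# copying every byte out (serialize) and back in (deserialize).
blob = pickle.dumps(df)    # full copy #1: DataFrame -> bytes
df2 = pickle.loads(blob)   # full copy #2: bytes -> DataFrame

assert df.equals(df2)      # same data, paid for twice
```

Every such hop repeats both copies, which is where the wasted computation the slide describes comes from.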
42. Apache Arrow project
Fast: Arrow supports zero-copy reads and is optimized for data locality.
Flexible: Arrow acts as a new high-performance interface between various systems.
Standard: Apache Arrow is backed by key developers of 13 major open source projects.
43. Big Data Friendly
● All systems utilize the same memory format
● No overhead for cross-system communication
● Projects can share functionality (e.g., a Parquet-to-Arrow reader)