3. @jreback
● Former quant
● Senior Engineer at Two Sigma, working on holistic approaches to
modeling
● Core committer to pandas for the last 5 years
● Managed pandas since 2013
12. pandas’s role in the Python Data Ecosystem
[Diagram: pandas sits at the center of the ecosystem, connecting users with numerical computing, IO / data access, data visualization, and statistics + machine learning libraries]
22. Performance
● Custom in-memory format
● Eager evaluation model, no query planning
● "Slow": limited multicore algorithms for large datasets
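To illustrate the eager-evaluation point, a minimal sketch (the DataFrame and column names are hypothetical):

```python
import pandas as pd

# Hypothetical data; every step below executes immediately
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Eager evaluation: each expression materializes a full intermediate
# result in memory -- there is no query planner to fuse these steps.
tmp = df["a"] + df["b"]   # intermediate Series allocated here
result = tmp * 2          # second full pass over the data

print(result.tolist())    # [22, 44, 66]
```

A query-planning engine could fuse both steps into a single pass over the data; pandas executes them one at a time, allocating each intermediate in full.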
23. Data Tooling Spectrum
Small Data: < 5 GB
“Medium” Data: 5-100 GB
“Big” Data: > 100 GB
pandas starts to fail as an effective tool somewhere around the 10 GB mark
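A quick way to check where a given dataset falls on this spectrum is pandas' own memory accounting (the columns here are illustrative):

```python
import pandas as pd

# Illustrative data: one numeric column, one object (string) column
df = pd.DataFrame({"ints": range(1000),
                   "strs": [str(i) for i in range(1000)]})

# deep=True counts the Python-object overhead of string columns,
# which is usually where in-memory size balloons past on-disk size.
total_bytes = int(df.memory_usage(deep=True).sum())
print(total_bytes)
```

Comparing `deep=True` against the default (shallow) count shows how much of the footprint is object overhead rather than raw values.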
27. Big Data Unfriendly
● Each system has its own internal memory format
● 70-80% of computation wasted on serialization and deserialization
● Similar functionality implemented in multiple projects
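One way to see the serialization cost: without a shared memory format, moving data across a system boundary means a full serialize/deserialize round trip. A minimal sketch using pickle as the stand-in boundary (the data is illustrative):

```python
import pickle
import pandas as pd

# Illustrative dataset
df = pd.DataFrame({"x": range(1000),
                   "y": [float(i) for i in range(1000)]})

# Without a shared in-memory format, crossing a system boundary means
# copying every byte out (serialize) and back in (deserialize).
blob = pickle.dumps(df)    # full copy #1: DataFrame -> bytes
df2 = pickle.loads(blob)   # full copy #2: bytes -> DataFrame

assert df.equals(df2)      # same data, paid for twice
```

Every such hop repeats both copies, which is where the wasted computation the slide describes comes from.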
42. Apache Arrow project
Fast: Arrow supports zero-copy reads and is optimized for data locality.
Flexible: Arrow acts as a new high-performance interface between various systems.
Standard: Apache Arrow is backed by key developers of 13 major open source projects.
43. Big Data Friendly
● All systems utilize the same memory format
● No overhead for cross-system communication
● Projects can share functionality (e.g., a Parquet-to-Arrow reader)