SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Downloaden Sie, um offline zu lesen
www.twosigma.com
Python Data Wrangling: Preparing for
the Future
October 26, 2016All Rights Reserved
Wes McKinney @wesmckinn
PyCon HK 2016
Disclaimer
October 26, 2016
Important Legal Information
The information presented here is offered for informational purposes only and
should not be used for any other purpose (including, without limitation, the
making of investment decisions). Examples provided herein are for illustrative
purposes only and are not necessarily based on actual data. Nothing herein
constitutes: an offer to sell or the solicitation of any offer to buy any security or
other interest; tax advice; or investment advice. This presentation shall remain the
property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the
right to require the return of this presentation at any time. Copyright © 2016
TWO SIGMA INVESTMENTS, LP. All rights reserved.
All Rights Reserved
Me
October 26, 2016
• Creator of Python pandas project (ca. 2008)
• PMC member for Apache Arrow and Parquet
• Currently: Data technology at Two Sigma
• Past
• Cloudera: Python on Hadoop/Spark + Ibis project
• DataPad: Co-founder/CEO
All Rights Reserved
Python for Data Analysis, 2nd
Edition
October 26, 2016All Rights Reserved
Coming in 2017
• Updated for pandas 1.0
• Remove deprecated / out-of-date
functionality
• New content: Advanced pandas, intro
to statsmodels, scikit-learn
The Python data community
October 26, 2016
• Python has grown from a niche scientific computing language in 2011 to a
mainstream data science language now in 2016
• A language of choice for latest-gen ML: Keras, Tensorflow, Theano
• Worldwide ecosystem of conferences and meetups: PyData, SciPy, etc.
All Rights Reserved
How / why did the community grow?
October 26, 2016
• Confluence of important projects
• Array computing, linear algebra: NumPy, Cython, SciPy
• Data manipulation tools: pandas
• Statistics / Machine Learning: statsmodels, scikit-learn
• Programming interfaces: Jupyter notebooks
• Packaging: {Ana, mini}conda, portable wheels (.whl) for pip
All Rights Reserved
conda-forge: community-led conda packaging
October 26, 2016All Rights Reserved
conda config --add channels conda-forge
NumFOCUS: Not-for-profit open source sponsorship
October 26, 2016All Rights Reserved
pandas, the project
October 26, 2016
• https://github.com/pandas-dev/pandas
• Latest version: 0.19.0, released October 3, 2016
• Code base turned 8 years old in April 2016
• Over 600 distinct contributors, 14,000+ commits
• > 5300 pull requests, > 7200 issues closed on GitHub
• Only 11 developers with 100 or more commits
All Rights Reserved
Estimating the pandas user base size
October 26, 2016All Rights Reserved
pandas.pydata.org
50-70K unique
visitors per month
Sources of pandas’s popularity
October 26, 2016
• Responsive (volunteer) core developers: code reviewed and bugs fixed quickly
• Easy to use, good performance
• One of the only mature data preparation tools available. Most data is not clean
• Strong suite of time series / event analytics
• Library has been battle tested by tens of thousands of users
• Good learning resources (multiple books, etc.)
All Rights Reserved
pandas development, where to from here?
October 26, 2016
• pandas has accumulated much technical debt, problems stemming from early
software architecture decisions
• pandas being used increasingly as a building block in distributed systems like
Spark and dask
• Sprawling codebase: over 200K lines of code
• In works: pandas 1.0
• Instead of pandas v0.{n + 1}
• API stable release, focusing only on stability, bug fixes, performance
improvements
All Rights Reserved
Some known pandas problems
October 26, 2016
• Missing data inconsistencies
• Excess / unpredictable memory use
• Mutability issues, defensive memory copying
• Difficult to add new data types
• Mostly single-threaded, GIL-contention issues
• Degrading micro-performance from complex pure Python internals
• Costly interoperability with file formats, other systems (Dask, Spark)
See also: 2013 talk “10 Things I Hate About pandas”
All Rights Reserved
pandas 2.0
October 26, 2016
• Re-architecting of pandas Series / DataFrame internals
• Backwards compatible for 90+% of user code
• Goals
• Improved performance
• Lower and more predictable memory requirements
• Better utilize hardware with many cores and large amounts of RAM
• Tighter integration with external data
All Rights Reserved
libpandas: a modern C++ core “engine”
October 26, 2016All Rights Reserved
Pure Python
pandas.core
C C C
C
C
C
C C C
C
C
C
pandas 0.x/1.x
C++11
libpandas.so
Python wrapper layer
pandas 2.x
pandas 2.0: some libpandas benefits
October 26, 2016
• Encapsulate memory management and low-level details of Series and
DataFrame
• Isolate user from low-level data issues (e.g. missing data in integers, booleans)
• Better performance for “cheap” operations (e.g. indexing / selection)
• Easier access to multithreading primitives, hardware optimizations
• Note: Similar to the architecture used by Tensorflow and other modern
scientific tools targeting Python
All Rights Reserved
pandas 2.0: memory mapping and fast serialization
October 26, 2016
• Provide for memory-mapped DataFrames or “lazy-loading” of data from disk
• Improved performance with fast disk formats
• HDF5, feather, bcolz, zarr, etc.
• Faster data movement in distributed applications
• Dask, Spark
All Rights Reserved
Apache Arrow: Process and Move Data Fast
October 26, 2016
• New Top-level Apache project as of February 2016
• Collaboration amongst broad set of OSS projects around shared needs
• Language-independent columnar data structures
• Metadata for describing schemas / chunks of data
• Protocol for moving data between processes with minimal serialization
overhead
All Rights Reserved
High performance data interchange
October 26, 2016All Rights Reserved
Today With Arrow
Source: Apache Arrow
File formats and memory interoperability
October 26, 2016
• Improving IO throughput essential to good in-memory pandas performance
• Output data quickly to other tools/systems (R, Spark, Dask, etc.)
All Rights Reserved
Arrow in action: Feather file format
October 26, 2016All Rights Reserved
• https://github.com/wesm/feather
• Cross-language binary DataFrame format
• Write from Python, read in R, and vice versa
• Collaboration with Hadley Wickham from the R community
Arrow in action: Parquet files in Python
October 26, 2016
• Parquet C++ project https://github.com/apache/parquet-cpp
• Parquet files read into Arrow arrays
• Array arrays written back to Parquet files
All Rights Reserved
import pyarrow.parquet as pq
table = pq.read_table(path, **options)
Some serialization benchmarks
October 26, 2016
• Benchmark setup
• 1 GB DataFrame of floating point data, 100 columns
• Pickle, Parquet, bcolz, Feather format, Arrow IPC format
All Rights Reserved
Benchmark results
October 26, 2016All Rights Reserved
Tool Absolute time Bandwidth
Arrow IPC 148 ms 6.77 GB/s
pandas.HDFStore 178 ms 5.62 GB/s
pickle 296 ms 3.38 GB/s
bcolz 498 ms 2.01 GB/s
Wall clock RAM-to-RAM conversion time from source to 1 GB
pandas.DataFrame with 100 float64 columns (no missing data).
HDF5 using tmpfs
Benchmark results
October 26, 2016All Rights Reserved
Tool Absolute time Bandwidth
Arrow IPC 148 ms 6.77 GB/s
pandas.HDFStore 178 ms 5.62 GB/s
pickle 296 ms 3.38 GB/s
bcolz 498 ms 2.01 GB/s
pandas.read_csv 21247 ms 0.047 GB /s
Wall clock RAM-to-RAM conversion time from source to 1 GB
pandas.DataFrame with 100 float64 columns (no missing data).
HDF5 using tmpfs
Benefits of standard columnar formats
October 26, 2016
• Readable for other systems (even JVM-based), easier interoperability
• On-disk filtering / subsetting possible
• Benefit from development work by other communities
All Rights Reserved
Thank you
October 26, 2016
• Conference organizers
• pandas developer community
• Apache Software Foundation
● Note: Opinions are my own
All Rights Reserved

Weitere ähnliche Inhalte

Was ist angesagt?

Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeAmazon Web Services
 
ナレッジグラフ/LOD利用技術の入門(後編)
ナレッジグラフ/LOD利用技術の入門(後編)ナレッジグラフ/LOD利用技術の入門(後編)
ナレッジグラフ/LOD利用技術の入門(後編)KnowledgeGraph
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
 
SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 
Odoo Experience 2018 - Code Profiling in Odoo
Odoo Experience 2018 - Code Profiling in OdooOdoo Experience 2018 - Code Profiling in Odoo
Odoo Experience 2018 - Code Profiling in OdooElínAnna Jónasdóttir
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaAmazon Web Services
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best PracticesCloudera, Inc.
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal HealthSRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal HealthAmazon Web Services
 
An Introduction to SPARQL
An Introduction to SPARQLAn Introduction to SPARQL
An Introduction to SPARQLOlaf Hartig
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingDatabricks
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkDatabricks
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureDatabricks
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural searchDmitry Kan
 
Building and using ontologies
Building and using ontologies Building and using ontologies
Building and using ontologies Elena Simperl
 

Was ist angesagt? (20)

Migrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data LakeMigrating your traditional Data Warehouse to a Modern Data Lake
Migrating your traditional Data Warehouse to a Modern Data Lake
 
ナレッジグラフ/LOD利用技術の入門(後編)
ナレッジグラフ/LOD利用技術の入門(後編)ナレッジグラフ/LOD利用技術の入門(後編)
ナレッジグラフ/LOD利用技術の入門(後編)
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0SQL Performance Improvements at a Glance in Apache Spark 3.0
SQL Performance Improvements at a Glance in Apache Spark 3.0
 
Introduction to AWS Glue
Introduction to AWS GlueIntroduction to AWS Glue
Introduction to AWS Glue
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 
Odoo Experience 2018 - Code Profiling in Odoo
Odoo Experience 2018 - Code Profiling in OdooOdoo Experience 2018 - Code Profiling in Odoo
Odoo Experience 2018 - Code Profiling in Odoo
 
Data Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & AthenaData Catalog & ETL - Glue & Athena
Data Catalog & ETL - Glue & Athena
 
PySpark Best Practices
PySpark Best PracticesPySpark Best Practices
PySpark Best Practices
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted Malaska
 
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal HealthSRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health
 
Dive into PySpark
Dive into PySparkDive into PySpark
Dive into PySpark
 
An Introduction to SPARQL
An Introduction to SPARQLAn Introduction to SPARQL
An Introduction to SPARQL
 
Build Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks StreamingBuild Real-Time Applications with Databricks Streaming
Build Real-Time Applications with Databricks Streaming
 
Optimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache SparkOptimizing Delta/Parquet Data Lakes for Apache Spark
Optimizing Delta/Parquet Data Lakes for Apache Spark
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Vector databases and neural search
Vector databases and neural searchVector databases and neural search
Vector databases and neural search
 
Building and using ontologies
Building and using ontologies Building and using ontologies
Building and using ontologies
 
Apache Solr
Apache SolrApache Solr
Apache Solr
 

Ähnlich wie Python Data Wrangling: Preparing for the Future

Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latestWes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataWes McKinney
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FutureWes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Wes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
 
Lightning Fast Dataframes with Polars
Lightning Fast Dataframes with PolarsLightning Fast Dataframes with Polars
Lightning Fast Dataframes with PolarsAlberto Danese
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015Wes McKinney
 
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)DataPad Inc.
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityUwe Korn
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and RWes McKinney
 
Improving Data Interoperability for Python and R
Improving Data Interoperability for Python and RImproving Data Interoperability for Python and R
Improving Data Interoperability for Python and RWork-Bench
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopJim Dowling
 

Ähnlich wie Python Data Wrangling: Preparing for the Future (20)

Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 
Apache Arrow and Python: The latest
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Python Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the Future
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Lightning Fast Dataframes with Polars
Lightning Fast Dataframes with PolarsLightning Fast Dataframes with Polars
Lightning Fast Dataframes with Polars
 
An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
 
Data streaming
Data streamingData streaming
Data streaming
 
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
 
How Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperabilityHow Apache Arrow and Parquet boost cross-language interoperability
How Apache Arrow and Parquet boost cross-language interoperability
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
Improving data interoperability in Python and R
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
 
Improving Data Interoperability for Python and R
Improving Data Interoperability for Python and RImproving Data Interoperability for Python and R
Improving Data Interoperability for Python and R
 
Python for data science
Python for data sciencePython for data science
Python for data science
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
 

Mehr von Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionWes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data ScienceWes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Wes McKinney
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 KeynoteWes McKinney
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache SparkWes McKinney
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
 

Mehr von Wes McKinney (16)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
 
High Performance Python on Apache Spark
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
 
Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
 

Kürzlich hochgeladen

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Principled Technologies
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 

Kürzlich hochgeladen (20)

Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 

Python Data Wrangling: Preparing for the Future

  • 1. www.twosigma.com Python Data Wrangling: Preparing for the Future October 26, 2016All Rights Reserved Wes McKinney @wesmckinn PyCon HK 2016
  • 2. Disclaimer October 26, 2016 Important Legal Information The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. Copyright © 2016 TWO SIGMA INVESTMENTS, LP. All rights reserved. All Rights Reserved
  • 3. Me October 26, 2016 • Creator of Python pandas project (ca. 2008) • PMC member for Apache Arrow and Parquet • Currently: Data technology at Two Sigma • Past • Cloudera: Python on Hadoop/Spark + Ibis project • DataPad: Co-founder/CEO All Rights Reserved
  • 4. Python for Data Analysis, 2nd Edition October 26, 2016All Rights Reserved Coming in 2017 • Updated for pandas 1.0 • Remove deprecated / out-of-date functionality • New content: Advanced pandas, intro to statsmodels, scikit-learn
  • 5. The Python data community October 26, 2016 • Python has grown from a niche scientific computing language in 2011 to a mainstream data science language now in 2016 • A language of choice for latest-gen ML: Keras, Tensorflow, Theano • Worldwide ecosystem of conferences and meetups: PyData, SciPy, etc. All Rights Reserved
  • 6. How / why did the community grow? October 26, 2016 • Confluence of important projects • Array computing, linear algebra: NumPy, Cython, SciPy • Data manipulation tools: pandas • Statistics / Machine Learning: statsmodels, scikit-learn • Programming interfaces: Jupyter notebooks • Packaging: {Ana, mini}conda, portable wheels (.whl) for pip All Rights Reserved
  • 7. conda-forge: community-led conda packaging October 26, 2016All Rights Reserved conda config --add channels conda-forge
  • 8. NumFOCUS: Not-for-profit open source sponsorship October 26, 2016All Rights Reserved
  • 9. pandas, the project October 26, 2016 • https://github.com/pandas-dev/pandas • Latest version: 0.19.0, released October 3, 2016 • Code base turned 8 years old in April 2016 • Over 600 distinct contributors, 14,000+ commits • > 5300 pull requests, > 7200 issues closed on GitHub • Only 11 developers with 100 or more commits All Rights Reserved
  • 10. Estimating the pandas user base size October 26, 2016All Rights Reserved pandas.pydata.org 50-70K unique visitors per month
  • 11. Sources of pandas’s popularity October 26, 2016 • Responsive (volunteer) core developers: code reviewed and bugs fixed quickly • Easy to use, good performance • One of the only mature data preparation tools available. Most data is not clean • Strong suite of time series / event analytics • Library has been battle tested by tens of thousands of users • Good learning resources (multiple books, etc.) All Rights Reserved
  • 12. pandas development, where to from here? October 26, 2016 • pandas has accumulated much technical debt, problems stemming from early software architecture decisions • pandas being used increasingly as a building block in distributed systems like Spark and dask • Sprawling codebase: over 200K lines of code • In works: pandas 1.0 • Instead of pandas v0.{n + 1} • API stable release, focusing only on stability, bug fixes, performance improvements All Rights Reserved
  • 13. Some known pandas problems October 26, 2016 • Missing data inconsistencies • Excess / unpredictable memory use • Mutability issues, defensive memory copying • Difficult to add new data types • Mostly single-threaded, GIL-contention issues • Degrading micro-performance from complex pure Python internals • Costly interoperability with file formats, other systems (Dask, Spark) See also: 2013 talk “10 Things I Hate About pandas” All Rights Reserved
  • 14. pandas 2.0 October 26, 2016 • Re-architecting of pandas Series / DataFrame internals • Backwards compatible for 90+% of user code • Goals • Improved performance • Lower and more predictable memory requirements • Better utilize hardware with many cores and large amounts of RAM • Tighter integration with external data All Rights Reserved
  • 15. libpandas: a modern C++ core “engine” October 26, 2016All Rights Reserved Pure Python pandas.core C C C C C C C C C C C C pandas 0.x/1.x C++11 libpandas.so Python wrapper layer pandas 2.x
  • 16. pandas 2.0: some libpandas benefits October 26, 2016 • Encapsulate memory management and low-level details of Series and DataFrame • Isolate user from low-level data issues (e.g. missing data in integers, booleans) • Better performance for “cheap” operations (e.g. indexing / selection) • Easier access to multithreading primitives, hardware optimizations • Note: Similar to the architecture used by Tensorflow and other modern scientific tools targeting Python All Rights Reserved
  • 17. pandas 2.0: memory mapping and fast serialization October 26, 2016 • Provide for memory-mapped DataFrames or “lazy-loading” of data from disk • Improved performance with fast disk formats • HDF5, feather, bcolz, zarr, etc. • Faster data movement in distributed applications • Dask, Spark All Rights Reserved
  • 18. Apache Arrow: Process and Move Data Fast October 26, 2016 • New Top-level Apache project as of February 2016 • Collaboration amongst broad set of OSS projects around shared needs • Language-independent columnar data structures • Metadata for describing schemas / chunks of data • Protocol for moving data between processes with minimal serialization overhead All Rights Reserved
  • 19. High performance data interchange October 26, 2016All Rights Reserved Today With Arrow Source: Apache Arrow
  • 20. File formats and memory interoperability October 26, 2016 • Improving IO throughput essential to good in-memory pandas performance • Output data quickly to other tools/systems (R, Spark, Dask, etc.) All Rights Reserved
  • 21. Arrow in action: Feather file format October 26, 2016All Rights Reserved • https://github.com/wesm/feather • Cross-language binary DataFrame format • Write from Python, read in R, and vice versa • Collaboration with Hadley Wickham from the R community
  • 22. Arrow in action: Parquet files in Python October 26, 2016 • Parquet C++ project https://github.com/apache/parquet-cpp • Parquet files read into Arrow arrays • Array arrays written back to Parquet files All Rights Reserved import pyarrow.parquet as pq table = pq.read_table(path, **options)
  • 23. Some serialization benchmarks October 26, 2016 • Benchmark setup • 1 GB DataFrame of floating point data, 100 columns • Pickle, Parquet, bcolz, Feather format, Arrow IPC format All Rights Reserved
  • 24. Benchmark results October 26, 2016All Rights Reserved Tool Absolute time Bandwidth Arrow IPC 148 ms 6.77 GB/s pandas.HDFStore 178 ms 5.62 GB/s pickle 296 ms 3.38 GB/s bcolz 498 ms 2.01 GB/s Wall clock RAM-to-RAM conversion time from source to 1 GB pandas.DataFrame with 100 float64 columns (no missing data). HDF5 using tmpfs
  • 25. Benchmark results October 26, 2016All Rights Reserved Tool Absolute time Bandwidth Arrow IPC 148 ms 6.77 GB/s pandas.HDFStore 178 ms 5.62 GB/s pickle 296 ms 3.38 GB/s bcolz 498 ms 2.01 GB/s pandas.read_csv 21247 ms 0.047 GB /s Wall clock RAM-to-RAM conversion time from source to 1 GB pandas.DataFrame with 100 float64 columns (no missing data). HDF5 using tmpfs
  • 26. Benefits of standard columnar formats October 26, 2016 • Readable for other systems (even JVM-based), easier interoperability • On-disk filtering / subsetting possible • Benefit from development work by other communities All Rights Reserved
  • 27. Thank you October 26, 2016 • Conference organizers • pandas developer community • Apache Software Foundation ● Note: Opinions are my own All Rights Reserved