Wes McKinney @wesmckinn
PyCon Colombia 2020
Python for Data Analysis: Past, Present, and Future
Wes’s professional timeline
Timeline figure: pandas, DataPad, Apache Arrow (2008, 2013, 2014 to present)
Perspectives on the last 12 years
January 2020: pandas 1.0
● 26th major release after 10 years of development
● ~2000 unique contributors
Thanks, Indeed!
Dec 2009 - pandas 0.1
● First open source release after ~18 months of proprietary use
● Still on PyPI!
Funding pandas development
● pandas received its first formal grant in 2019 from the Chan Zuckerberg Initiative
● Core devs primarily volunteers, self-funded, or company-funded (Anaconda, others)
The early pandas gang (2011 - 2012)
Wes McKinney, Chang She, Adam Klein
pandas’s amazing Core Dev Team
Core Dev Meetup, 2019
Partial cast of characters: Jeff Reback, Tom Augspurger, Brock Mendel, Marc Garcia, Joris van den Bossche
Community engagement
Python’s journey to becoming a mainstream data language
"We believe that in the coming years there will be
great opportunity to attract users in need of
statistical data analysis tools to Python who might
have previously chosen R, MATLAB, or another
research environment. By designing robust,
easy-to-use data structures that cohere with the rest of the
scientific Python stack, we can make Python a
compelling choice for data analysis applications. In
our opinion, pandas provides a solid foundation upon
which a very powerful data analysis ecosystem can
be established."
Me, Proceedings of SciPy 2011
StackOverflow
data from
September 2017
Factors driving Python’s growth
Contributing factors
● Massive need for data wranglers + scientists
● “Perfect storm” of necessary packages
● New data science education
● Successful early adopters
● Packaging improvements
Perfect storm of packages
View from 2008
Confronting Fear, Uncertainty, Doubt
Common concerns
● Large codebase concerns
● Long-term software lifecycle
● Interpreted languages
○ ... unsafe?
○ ... slow?
● Open source… trustworthy?
May 2011 - “PyData” core dev meetings
"Need a toolset that is robust, fast, and suitable
for a production environment..."
"... but also good for interactive research... "
"... and easy / intuitive for non-software
engineers to use"
* also, we need to fix packaging
July 2011 - Concerns
"... the current state of affairs has me rather
anxious … these tools [e.g. pandas] have
largely not been integrated with any other tools
because of the community's collective
commitment anxiety"
http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/
Reading CSV files
Python for Data Analysis book - 2012
● A primer in data manipulation in Python
● Focus: NumPy, IPython/Jupyter, pandas, matplotlib
● 2 editions (2012, 2017)
● 8 translations so far
PyData NYC 2013: 10 Things I Hate About pandas
● November 2013
● Summary: “pandas is not designed like, or intended to be used as, a database query engine”
Fall 2014: Python in a Big Data World
Task: Helping Python become a first-class technology for Big Data
Some Problems
● File formats
● JVM interop
● Non-array-oriented interfaces
Difficulties in pandas (and R) dataframes
● Limited built-in data types
● Performance and memory use issues
● Challenges with larger-than-memory datasets
● Naive execution strategies (no “query optimization”)
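One way to picture the larger-than-memory difficulty: loading a whole file before aggregating needs memory proportional to the file size, while a single streaming pass needs memory proportional only to the number of groups. The following is a minimal sketch using only the Python standard library. It is not how pandas works internally, and the helper `streaming_group_mean`, the column names `key` and `value`, and the sample data are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def streaming_group_mean(lines, key_col, value_col):
    """Compute per-group means in one pass, holding only running totals."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(lines):
        raw = row[value_col]
        if raw == "":
            continue  # treat empty cells as missing values
        sums[row[key_col]] += float(raw)
        counts[row[key_col]] += 1
    return {k: sums[k] / counts[k] for k in sums}


# A tiny in-memory stand-in for a large CSV file (hypothetical data).
sample = io.StringIO("key,value\na,1.0\nb,4.0\na,3.0\nb,\n")
means = streaming_group_mean(sample, "key", "value")
print(means)
```

The same function accepts an open file object, so only one row is resident at a time regardless of file size.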
Does not cut down trees.
Out of memory on 10GB of CSVs
A
of doubt
Changing the tides
… and others
Fragmentation of data and code
Other thoughts
● Projects like pandas may be taking responsibility for too many things
● It would be more productive (long-term) to have a reusable computational foundation for data frames
● New data frame format designed for speed
● Computational foundation for data processing libraries
● Fast cross-language data interchange
Defragmenting Data
Figure labels: Arrow memory; JVM Data Ecosystem; Database Systems; Data Science Libraries
● https://github.com/apache/arrow
● Over 400 unique contributors
● Some level of support for 11 programming languages
Important features
● CPU/GPU-friendly columnar memory layout
● Memory map huge datasets
● Relocate data structures without serialization
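To make the columnar idea concrete, here is a small sketch using only Python's standard `array` and `memoryview` types. It is an illustration of the general technique, not Arrow's actual format or API, and the sample data is hypothetical. Each column lives in one contiguous typed buffer, slices are views rather than copies, and the raw bytes can be reinterpreted as a typed column without a parsing step.

```python
from array import array

# Row-oriented: one Python tuple per record (hypothetical sample data).
rows = [(1, 0.5), (2, 1.5), (3, 2.5)]

# Column-oriented: one contiguous typed buffer per column.
ids = array("q", (r[0] for r in rows))     # 64-bit signed integers
values = array("d", (r[1] for r in rows))  # 64-bit floats

# A memoryview exposes the same bytes without copying them.
view = memoryview(values)
tail = view[1:]       # zero-copy slice of the column
values[2] = 9.0       # mutate the underlying buffer
print(tail.tolist())  # the slice sees the change because it shares memory

# The buffer's raw bytes can be moved elsewhere and reinterpreted
# as a float column directly, with no per-value decoding.
raw = bytes(view)
restored = memoryview(raw).cast("d")
print(restored.tolist())
```

Here `bytes(view)` does copy the data once. Memory mapping a file is what avoids even that copy.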
Arrow C++ Platform
● Multi-core Work Scheduler
● Core Data Platform
● Query Engine
● Datasets Framework
● Arrow Flight RPC
● Network
● Storage
“New Data Frame” projects
● dask.dataframe
● Modin
● NVIDIA RAPIDS
● Vaex
● … and more surely in development
Learning from R
● Domain-specific language culture (“same code, different backends”)
● Non-standard evaluation
○ Inspect and manipulate unevaluated code fragments
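Python has no direct counterpart to R's non-standard evaluation, but operator overloading gives a similar effect: an expression can be captured as a tree and handed to whichever backend should run it. The sketch below is a minimal illustration under that assumption. The classes `Expr`, `Col`, `Lit`, and `BinOp` and the helper `wrap` are hypothetical names, not the API of pandas, Arrow, or any other library.

```python
import operator


class Expr:
    """A node in an unevaluated expression tree."""

    def __gt__(self, other):
        return BinOp(operator.gt, ">", self, wrap(other))

    def __or__(self, other):
        return BinOp(operator.or_, "|", self, wrap(other))


class Col(Expr):
    """Reference to a named column."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name

    def evaluate(self, row):
        return row[self.name]


class Lit(Expr):
    """A literal constant."""

    def __init__(self, value):
        self.value = value

    def __repr__(self):
        return repr(self.value)

    def evaluate(self, row):
        return self.value


class BinOp(Expr):
    """A binary operation over two sub-expressions."""

    def __init__(self, fn, symbol, left, right):
        self.fn, self.symbol, self.left, self.right = fn, symbol, left, right

    def __repr__(self):
        return f"({self.left!r} {self.symbol} {self.right!r})"

    def evaluate(self, row):
        return self.fn(self.left.evaluate(row), self.right.evaluate(row))


def wrap(value):
    """Turn plain Python values into literal nodes."""
    return value if isinstance(value, Expr) else Lit(value)


# Building the expression runs no computation; it only records structure.
predicate = (Col("arr") > 30) | (Col("dep") > 30)
print(predicate)

# One possible backend: evaluate row by row against plain dicts.
rows = [{"arr": 45.0, "dep": 10.0}, {"arr": 5.0, "dep": 12.0}]
kept = [r for r in rows if predicate.evaluate(r)]
print(kept)
```

Because `predicate` is data rather than a computed result, a different backend could inspect the same tree and translate it into SQL or a vectorized kernel instead of evaluating it in Python.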
Arrow’s relationship with dplyr and friends
flights %>%
  group_by(year, month, day) %>%
  select(arr_delay, dep_delay) %>%
  summarise(
    arr = mean(arr_delay, na.rm = TRUE),
    dep = mean(dep_delay, na.rm = TRUE)
  ) %>%
  filter(arr > 30 | dep > 30)
Can be a massive Arrow dataset
dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime
R expressions can be JIT-compiled with LLVM
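For readers more at home in Python, the dplyr pipeline above computes roughly the following. This is an eager, plain-Python sketch over a few hypothetical rows, meant only to show what the verbs do. It is not how Arrow or dplyr execute the query, and `mean_skip_missing` is a made-up helper standing in for `mean(..., na.rm = TRUE)`.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows; None marks a missing delay.
flights = [
    {"year": 2013, "month": 1, "day": 1, "arr_delay": 40.0, "dep_delay": 20.0},
    {"year": 2013, "month": 1, "day": 1, "arr_delay": None, "dep_delay": 10.0},
    {"year": 2013, "month": 1, "day": 2, "arr_delay": 5.0, "dep_delay": 8.0},
]

# group_by(year, month, day)
groups = defaultdict(list)
for f in flights:
    groups[(f["year"], f["month"], f["day"])].append(f)


def mean_skip_missing(records, column):
    """Mean of a column, ignoring missing values."""
    present = [r[column] for r in records if r[column] is not None]
    return mean(present) if present else float("nan")


# summarise(arr = ..., dep = ...)
summary = {
    key: {
        "arr": mean_skip_missing(recs, "arr_delay"),
        "dep": mean_skip_missing(recs, "dep_delay"),
    }
    for key, recs in groups.items()
}

# filter(arr > 30 | dep > 30)
delayed = {k: v for k, v in summary.items() if v["arr"] > 30 or v["dep"] > 30}
print(delayed)
```

Every step here materializes an intermediate result in memory, which is exactly what a translated computation graph on a parallel runtime is meant to avoid.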
Funding ambitious new open source projects
Some Partners
● https://ursalabs.org
● Apache Arrow-powered Data Science Tools
● Funded by corporate partners
● Built in collaboration with RStudio
Looking forward

Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Developing An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of BrazilDeveloping An App To Navigate The Roads of Brazil
Developing An App To Navigate The Roads of Brazil
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 

PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future

  • 1. Wes McKinney @wesmckinn PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
  • 2. Wes’s professional timeline pandas DataPad 2008 2013 2014 — Present Apache Arrow
  • 4. January 2020: pandas 1.0 ● 26th major release after 10 years of development ● ~2000 unique contributors Thanks, Indeed!
  • 5. Dec 2009 - pandas 0.1 ● First open source release after ~18 months of proprietary use ● Still on PyPI!
  • 6. Funding pandas development ● pandas received first formal grant in 2019 from Chan-Zuckerberg Initiative ● Core devs primarily volunteers, self-funded, or company-funded (Anaconda, others)
  • 7. The early pandas gang (2011 - 2012) Wes McKinney Chang She Adam Klein
  • 8. pandas’s amazing Core Dev Team (Core Dev Meetup, 2019) Partial cast of characters: Jeff Reback, Tom Augspurger, Brock Mendel, Marc Garcia, Joris van den Bossche
  • 11. "We believe that in the coming years there will be great opportunity to attract users in need of statistical data analysis tools to Python who might have previously chosen R, MATLAB, or another research environment. By designing robust, easy-to-use data structures that cohere with the rest of the scientific Python stack, we can make Python a compelling choice for data analysis applications. In our opinion, pandas provides a solid foundation upon which a very powerful data analysis ecosystem can be established." Me, Proceedings of SciPy 2011
  • 17. Contributing factors ● Massive need for data wranglers + scientists ● “Perfect storm” of necessary packages ● New data science education ● Successful early adopters ● Packaging improvements
  • 18. Perfect storm of packages
  • 21. Common concerns ● Large codebase concerns ● Long-term software lifecycle ● Interpreted languages ○ ... unsafe? ○ ... slow? ● Open source… trustworthy?
  • 22. May 2011 - “PyData” core dev meetings "Need a toolset that is robust, fast, and suitable for a production environment..."
  • 23. May 2011 - “PyData” core dev meetings "Need a toolset that is robust, fast, and suitable for a production environment..." "... but also good for interactive research..."
  • 24. May 2011 - “PyData” core dev meetings "Need a toolset that is robust, fast, and suitable for a production environment..." "... but also good for interactive research..." "... and easy / intuitive for non-software engineers to use"
  • 25. May 2011 - “PyData” core dev meetings * also, we need to fix packaging
  • 26. July 2011 - Concerns "... the current state of affairs has me rather anxious … these tools [e.g. pandas] have largely not been integrated with any other tools because of the community's collective commitment anxiety" http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/
  • 28. Python for Data Analysis book - 2012 ● A primer in data manipulation in Python ● Focus: NumPy, IPython/Jupyter, pandas, matplotlib ● 2 editions (2012, 2017) ● 8 translations so far
  • 29. PyData NYC 2013: 10 Things I Hate About pandas ● November 2013 ● Summary: “pandas is not designed like, or intended to be used as, a database query engine”
  • 30. Fall 2014: Python in a Big Data World Task: Helping Python become a first-class technology for Big Data Some Problems ● File formats ● JVM interop ● Non-array-oriented interfaces
  • 31. Difficulties in pandas (and R) dataframes ● Limited built-in data types ● Performance and memory use issues ● Challenges with larger-than-memory datasets ● Naive execution strategies (no “query optimization”)
  • 33. Does not cut down trees.
  • 34. Out of memory on 10GB of CSVs
  • 38. Other thoughts ● Projects like pandas may be taking responsibility for too many things ● It would be more productive (long-term) to have a reusable computational foundation for data frames
  • 39. ● New data frame format designed for speed ● Computational foundation for data processing libraries ● Fast cross-language data interchange (diagram labels: Arrow memory; JVM Data Ecosystem; Database Systems; Data Science Libraries)
  • 41. ● https://github.com/apache/arrow ● Over 400 unique contributors ● Some level of support for 11 programming languages
  • 42. Important features ● CPU/GPU-friendly columnar memory layout ● Memory map huge datasets ● Relocate data structures without serialization
  • 43. Arrow C++ Platform Multi-core Work Scheduler Core Data Platform Query Engine Datasets Framework Arrow Flight RPC Network Storage
  • 44. “New Data Frame” projects ● dask.dataframe ● Modin ● NVIDIA RAPIDS ● Vaex ● … and more surely in development
  • 45. Learning from R ● Domain-specific language culture (“same code, different backends”) ● Non-standard evaluation ○ Inspect and manipulate unevaluated code fragments
  • 46. Arrow’s relationship with dplyr and friends flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) Can be a massive Arrow dataset
  • 47. Arrow’s relationship with dplyr and friends flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime Can be a massive Arrow dataset
  • 48. Arrow’s relationship with dplyr and friends flights %>% group_by(year, month, day) %>% select(arr_delay, dep_delay) %>% summarise( arr = mean(arr_delay, na.rm = TRUE), dep = mean(dep_delay, na.rm = TRUE) ) %>% filter(arr > 30 | dep > 30) dplyr verbs can be translated to Arrow computation graphs, executed by parallel runtime R expressions can be JIT-compiled with LLVM Can be a massive Arrow dataset
  • 49. Funding ambitious new open source projects
  • 50. Some Partners ● https://ursalabs.org ● Apache Arrow-powered Data Science Tools ● Funded by corporate partners ● Built in collaboration with RStudio
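Slides 45 to 48 describe dplyr verbs being captured as unevaluated expressions and then handed to a backend for execution. The idea can be sketched in plain Python, with the caveat that this is only an illustration and not the Arrow or dplyr API: `Expr`, `col`, `evaluate`, and `summarise_delays` are hypothetical names, the sample rows are made up, and the "backend" is a simple row-wise interpreter over dicts rather than a parallel runtime.

```python
import operator
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Expr:
    """A node in an unevaluated expression tree (hypothetical, for illustration)."""
    op: str
    args: tuple

    # Operators build new tree nodes instead of computing a result.
    def __gt__(self, other):
        return Expr(">", (self, other))

    def __or__(self, other):
        return Expr("|", (self, other))


def col(name):
    """Refer to a column by name without touching any data."""
    return Expr("col", (name,))


_OPS = {">": operator.gt, "|": operator.or_}


def evaluate(expr, row):
    """One possible backend: walk the tree and compute a value for a single row."""
    if not isinstance(expr, Expr):
        return expr  # a literal such as 30
    if expr.op == "col":
        return row[expr.args[0]]
    left, right = (evaluate(arg, row) for arg in expr.args)
    return _OPS[expr.op](left, right)


def summarise_delays(flights):
    """Group by (year, month, day) and average the delays, skipping missing values."""
    groups = defaultdict(list)
    for flight in flights:
        groups[(flight["year"], flight["month"], flight["day"])].append(flight)
    return [
        {
            "year": y, "month": m, "day": d,
            "arr": mean(r["arr_delay"] for r in rows if r["arr_delay"] is not None),
            "dep": mean(r["dep_delay"] for r in rows if r["dep_delay"] is not None),
        }
        for (y, m, d), rows in groups.items()
    ]


# Made-up sample data standing in for the `flights` table on the slides.
flights = [
    {"year": 2013, "month": 1, "day": 1, "arr_delay": 40, "dep_delay": 10},
    {"year": 2013, "month": 1, "day": 1, "arr_delay": 30, "dep_delay": None},
    {"year": 2013, "month": 1, "day": 2, "arr_delay": 5, "dep_delay": 0},
]

# The predicate is only a description of the computation at this point.
# Parentheses are required because `|` binds tighter than `>` in Python.
predicate = (col("arr") > 30) | (col("dep") > 30)

late_days = [row for row in summarise_delays(flights) if evaluate(predicate, row)]
print(late_days)
```

Because `predicate` is a data structure rather than a computed value, a different `evaluate` could in principle translate the same tree into another engine's query plan, which is the "same code, different backends" point made on slide 45.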