PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"

Wes McKinney

Talk in Karlsruhe, Germany, on October 25, 2018

Looking backward, looking
forward
Wes McKinney @wesmckinn
PyCon DE / PyData Karlsruhe 2018

More fruitful open
source collaborations

April 2008 - Avant garde PyData
● Socializing Python inside AQR, a quantitative
hedge fund
● scipy.stats.models enabled some R ->
Python workload migration

Dec 2009 - pandas 0.1
● First open source release after ~18 months
of internal-only use

May 2011 - “PyData” core dev meetings
"Need a toolset that is robust, fast, and suitable
for a production environment..."

May 2011 - “PyData” core dev meetings
"Need a toolset that is robust, fast, and suitable
for a production environment..."
"... but also good for interactive research..."

May 2011 - “PyData” core dev meetings
"Need a toolset that is robust, fast, and suitable
for a production environment..."
"... but also good for interactive research..."
"... and easy / intuitive for non-software
engineers to use"

May 2011 - “PyData” core dev meetings
* also, we need to fix packaging

July 2011 - Concerns
"... the current state of affairs has me rather
anxious … these tools [e.g. pandas] have
largely not been integrated with any other tools
because of the community's collective
commitment anxiety"
http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/

July 2011 - Concerns
"Fragmentation is killing us"
http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/

Python for Data Analysis book - 2012
● A primer in data
manipulation in Python
● Focus: NumPy, IPython/Jupyter,
pandas, matplotlib
● 2 editions (2012, 2017)
● 8 translations so far

2013-2014 - An Entrepreneurial Detour
DataPad
Python-powered
Business Analytics
● Backend built with
PyData stack + custom
analytics
● Goal to contribute tech
back to OSS
ecosystem

DataPad learnings
● 200ms threshold for interactivity
● Multitenant query execution, resource management
● pandas performance / memory use problems

PyData NYC 2013: 10 Things I Hate About pandas
● November 2013
● Summary: “pandas is
not designed like, or
intended to be used
as, a database query
engine”

Vertical
Integration
The Good
● Control
● Development Speed
● Releases

Vertical
Integration
The Bad
● Large scope of code
ownership
● Lack of code reuse
● Bit rot

Fall 2014: Python in a Big Data World
Task: Helping Python
become a first-class
technology for Big Data
Some Problems
● File formats
● JVM interop
● Non-array-oriented
interfaces

Apache Arrow:
Defragmenting data systems
● Language-independent open
standard in-memory
representation for columnar data
(i.e. data frames)
● Easily reuse code targeting
Arrow memory
● Efficient memory interchange
[Diagram labels: Arrow memory; JVM Data Ecosystem; Database Systems; Data Science Libraries]
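
To make the "columnar in-memory representation" bullet concrete, here is a minimal sketch using only the Python standard library. It is not Arrow's API and not the Arrow format itself; the helper names (to_columns, is_valid, column_sum) and the sample data are hypothetical. It only illustrates the general idea of storing each column as one contiguous buffer of values plus a bitmap marking which slots hold real data, so that other code can read the same memory without copying it.

```python
from array import array


def to_columns(rows, names):
    """Pivot row tuples into one (values, validity) pair per column.

    values   : contiguous float64 buffer, one slot per row
    validity : bitmap, bit r is 1 when row r holds a real value
    (Illustrative layout only; an assumption, not the Arrow specification.)
    """
    columns = {}
    for i, name in enumerate(names):
        values = array("d")
        validity = bytearray((len(rows) + 7) // 8)
        for r, row in enumerate(rows):
            cell = row[i]
            if cell is None:
                values.append(0.0)  # placeholder slot, masked by the bitmap
            else:
                values.append(float(cell))
                validity[r // 8] |= 1 << (r % 8)
        columns[name] = (values, validity)
    return columns


def is_valid(validity, r):
    """Return True when bit r of the bitmap is set."""
    return bool(validity[r // 8] & (1 << (r % 8)))


def column_sum(column):
    """Sum a column, skipping slots the bitmap marks as missing."""
    values, validity = column
    return sum(v for r, v in enumerate(values) if is_valid(validity, r))


# Hypothetical sample data: the second row has a missing "qty".
rows = [(1.0, 10.0), (2.0, None), (3.0, 30.0)]
cols = to_columns(rows, ["price", "qty"])

# A memoryview exposes the values buffer to other code without copying it.
price_view = memoryview(cols["price"][0])
```

Any consumer that agrees on this layout can scan price_view directly, which is the kind of reuse and copy-free interchange the slide describes, here shown within a single process only.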

Apache Arrow:
Defragmenting data systems
● https://github.com/apache/arrow
● Over 200 unique contributors
● Some level of support for 11 programming
languages

Funding ambitious
new open source
projects

Early Partners
● https://ursalabs.org
● Apache Arrow-powered
Data Science Tools
● Funded by corporate
partners
● Built in collaboration with
RStudio


PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"

1. Looking backward, looking forward Wes McKinney @wesmckinn PyCon DE / PyData Karlsruhe 2018

2. Motivations

3. Guiding questions

4. How to make data analysis “easier”?

5. Making individuals more productive

6. More fruitful open source collaborations

7. Better hardware utilization

8. Examining the status quo

9. Change is difficult

10. From one existential crisis to another

11. April 2008 - Avant garde PyData ● Socializing Python inside AQR, a quantitative hedge fund ● scipy.stats.models enabled some R -> Python workload migration

12. Dec 2009 - pandas 0.1 ● First open source release after ~18 months of internal-only use

13. May 2011 - “PyData” core dev meetings "Need a toolset that is robust, fast, and suitable for a production environment..."

14. May 2011 - “PyData” core dev meetings "Need a toolset that is robust, fast, and suitable for a production environment..." "... but also good for interactive research..."

15. May 2011 - “PyData” core dev meetings "Need a toolset that is robust, fast, and suitable for a production environment..." "... but also good for interactive research..." "... and easy / intuitive for non-software engineers to use"

16. May 2011 - “PyData” core dev meetings * also, we need to fix packaging

17. July 2011 - Concerns "... the current state of affairs has me rather anxious … these tools [e.g. pandas] have largely not been integrated with any other tools because of the community's collective commitment anxiety" http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/

18. July 2011 - Concerns "Fragmentation is killing us" http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/

19. Reading CSV files

20. Python for Data Analysis book - 2012 ● A primer in data manipulation in Python ● Focus: NumPy, IPython/Jupyter, pandas, matplotlib ● 2 editions (2012, 2017) ● 8 translations so far

21. 2013-2014 - An Entrepreneurial Detour DataPad Python-powered Business Analytics ● Backend built with PyData stack + custom analytics ● Goal to contribute tech back to OSS ecosystem

22. DataPad learnings ● 200ms threshold for interactivity ● Multitenant query execution, resource management ● pandas performance / memory use problems

23. PyData NYC 2013: 10 Things I Hate About pandas ● November 2013 ● Summary: “pandas is not designed like, or intended to be used as, a database query engine”

24. Vertical Integration The Good ● Control ● Development Speed ● Releases

25. Vertical Integration The Bad ● Large scope of code ownership ● Lack of code reuse ● Bit rot

26. Fall 2014: Python in a Big Data World Task: Helping Python become a first-class technology for Big Data Some Problems ● File formats ● JVM interop ● Non-array-oriented interfaces

27. Fragmentation of data and code

28. Apache Arrow: Defragmenting data systems ● Language-independent open standard in-memory representation for columnar data (i.e. data frames) ● Easily reuse code targeting Arrow memory ● Efficient memory interchange [Diagram labels: Arrow memory; JVM Data Ecosystem; Database Systems; Data Science Libraries]

29. Apache Arrow: Defragmenting data systems ● https://github.com/apache/arrow ● Over 200 unique contributors ● Some level of support for 11 programming languages

30. Funding ambitious new open source projects

31. Early Partners ● https://ursalabs.org ● Apache Arrow-powered Data Science Tools ● Funded by corporate partners ● Built in collaboration with RStudio

32. Looking forward