SciPy 2011 pandas lightning talk

Wes McKinney

What’s new and awesome in pandas

pandas?
In [13]: foo
Out[13]:
   methyl1     age        edu  something  indic
0    38.36  30to39  geCollege          1  False
1    37.85    lt30  geCollege          1  False
2    38.57  30to39  geCollege          1  False
3    39.75  30to39  geCollege          1   True
4    43.83  30to39  geCollege          1   True
5    39.08  30to39       ltHS          1   True

Size-mutable “labeled arrays” that can handle heterogeneous data
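As a rough illustration of "size-mutable" and "heterogeneous", the sketch below rebuilds the first three rows of the frame shown on this slide, then adds and drops a column in place. The column name `methyl1_centered` is hypothetical and not part of the talk.

```python
import pandas as pd

# First three rows of the slide's example; each column keeps its own dtype.
foo = pd.DataFrame({
    "methyl1": [38.36, 37.85, 38.57],           # floats
    "age": ["30to39", "lt30", "30to39"],        # strings
    "edu": ["geCollege"] * 3,                   # strings
    "something": [1, 1, 1],                     # integers
    "indic": [False, False, False],             # booleans
})

# Size-mutable: columns can be added and removed after construction.
foo["methyl1_centered"] = foo["methyl1"] - foo["methyl1"].mean()  # hypothetical column
del foo["something"]

print(foo.dtypes)
```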

Kinda like a structured array??

•  Automatic data alignment with lots of reshaping and indexing methods

•  Implicit and explicit handling of missing data

•  Easy time series functionality
–  Far less fuss than scikits.timeseries

•  Lots of in-memory SQL-like operations (group by, join, etc.)
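The bullets above can be sketched with current pandas as follows. The series, tickers, and names are made-up illustration data, and `merge` stands in for the "join" the slide mentions.

```python
import pandas as pd

# Automatic alignment: arithmetic matches on labels, not positions.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])
total = a + b                      # labels without a partner become NaN (implicit)
filled = a.add(b, fill_value=0)    # explicit choice for the missing side

# SQL-like operations on hypothetical data: group by and join.
trades = pd.DataFrame({"ticker": ["AAA", "BBB", "AAA", "BBB"],
                       "qty": [10, 5, 20, 15]})
names = pd.DataFrame({"ticker": ["AAA", "BBB"], "name": ["Alpha", "Beta"]})
per_ticker = trades.groupby("ticker")["qty"].sum()
joined = trades.merge(names, on="ticker", how="left")

print(total)
print(per_ticker)
```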

pandas?
•  Extremely good for financial data
–  StackOverflow: “this is a beast of a financial analysis tool”

•  One of the better relational data munging tools in any language?

•  But also has maybe 60+% of what R users expect when they come to Python

1. Heavily redesigned internals
•  Merged old DataFrame and DataMatrix into a single DataFrame: retain optimal performance where possible

•  Internal BlockManager class manages homogeneous ndarrays for optimal performance and reshaping
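To give a feel for what "homogeneous ndarrays" could mean here, the toy sketch below groups same-dtype columns into one 2-D NumPy block per dtype. It is only an illustration of the idea and is not pandas's actual BlockManager; the function `consolidate` and the `score` column are hypothetical.

```python
from collections import defaultdict

import numpy as np


def consolidate(columns):
    """Group 1-D column arrays by dtype and stack each group into one 2-D block."""
    names_by_dtype = defaultdict(list)
    for name, values in columns.items():
        names_by_dtype[np.asarray(values).dtype].append(name)
    # Each block holds (column names, array of shape n_columns x n_rows).
    return {
        dtype: (names, np.vstack([np.asarray(columns[n]) for n in names]))
        for dtype, names in names_by_dtype.items()
    }


columns = {
    "methyl1": np.array([38.36, 37.85, 38.57]),
    "something": np.array([1, 1, 1]),
    "indic": np.array([False, False, True]),
    "score": np.array([0.5, 1.5, 2.5]),  # hypothetical extra float column
}
blocks = consolidate(columns)
float_names, float_block = blocks[np.dtype("float64")]
print(float_names, float_block.shape)
```

Storing the two float columns in a single array is what would let whole-block operations run as one vectorized NumPy call instead of one call per column.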

1. Heavily redesigned internals
•  Better handling of missing data for non-floating-point dtypes

•  Soon: DataFrame variant with N-dim “hyperslabs”

2. Fancier indexing
Mix boolean / integer / label / slice-based indexing

df.ix[0]
df.ix[date1:date2]
df.ix[:5, 'A':'F']

Setting works too

df.ix[df['A'] > 0, ['B', 'C', 'D']] = nan
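For readers on current pandas: the `.ix` indexer shown on this slide was later deprecated and removed, with `.loc` (labels and boolean masks) and `.iloc` (integer positions) taking over. A minimal sketch of roughly equivalent calls, assuming a date-indexed frame with columns A to F filled with made-up random data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2011-07-01", periods=8, freq="D")
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((8, 6)), index=dates, columns=list("ABCDEF"))

first_row = df.iloc[0]                        # like df.ix[0] (by position)
window = df.loc["2011-07-03":"2011-07-05"]    # like df.ix[date1:date2]; label slices include both ends
block = df.iloc[:5].loc[:, "A":"F"]           # like df.ix[:5, 'A':'F']: positions for rows, labels for columns

# Setting works too: boolean mask for rows, label list for columns.
mask = df["A"] > 0
df.loc[mask, ["B", "C", "D"]] = np.nan
```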

3. More robust IO
data_frame = read_csv('mydata.csv')

data_frame2 = read_table('mydata.txt', sep='\t',
                         skiprows=[1,2],
                         na_values=['#N/A NA'])

store = HDFStore('pytables.h5')
store['a'] = data_frame
store['b'] = data_frame2
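A small self-contained sketch of the parsing options above, reading from an in-memory string instead of a file so it runs anywhere. The file contents and the `missing` marker are invented for illustration; `HDFStore` is left out because it needs the optional PyTables dependency.

```python
import io

import pandas as pd

# Hypothetical tab-separated file: a header row, a units row to skip, then data.
raw = "id\tprice\nunits\tUSD\n1\t10.5\n2\tmissing\n3\t12.0\n"

data_frame2 = pd.read_csv(
    io.StringIO(raw),
    sep="\t",                 # tab-delimited input
    skiprows=[1],             # drop the units row (0-based line number)
    na_values=["missing"],    # extra string to treat as NA
)
print(data_frame2)
```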

4. Better pivoting / reshaping

   foo bar        A        B        C
0  one   a  -0.0524    1.664    1.171
1  one   b   0.2514   0.8306   -1.396
2  one   c   0.1256   0.3897   0.5227
3  one   d  -0.9301   0.6513  -0.2313
4  one   e    2.037    1.938  -0.3454
5  two   a   0.2073   0.7857   0.9051
6  two   b   -1.032  -0.8615    1.028
7  two   c  -0.7319   -1.846   0.9294
8  two   d   0.1004    -1.19   0.6043
9  two   e   -1.008  -0.3339  0.09522

4. Better pivoting / reshaping

In [29]: pivoted = df.pivot('bar', 'foo')

In [30]: pivoted['B']
Out[30]:
      one      two
a   1.664   0.7857
b  0.8306  -0.8615
c  0.3897   -1.846
d  0.6513    -1.19
e   1.938  -0.3339
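A hedged sketch of the same pivot in current pandas, where `pivot` takes keyword arguments and returns a DataFrame with hierarchical columns. It assumes the `bar` labels run from a to e within each `foo` group, as the pivoted output implies.

```python
import pandas as pd

# Values copied from the slides; bar labels a..e are assumed from the output above.
df = pd.DataFrame({
    "foo": ["one"] * 5 + ["two"] * 5,
    "bar": list("abcde") * 2,
    "A": [-0.0524, 0.2514, 0.1256, -0.9301, 2.037,
          0.2073, -1.032, -0.7319, 0.1004, -1.008],
    "B": [1.664, 0.8306, 0.3897, 0.6513, 1.938,
          0.7857, -0.8615, -1.846, -1.19, -0.3339],
    "C": [1.171, -1.396, 0.5227, -0.2313, -0.3454,
          0.9051, 1.028, 0.9294, 0.6043, 0.09522],
})

# Rows become 'bar' labels; columns become (value column, 'foo' label) pairs.
pivoted = df.pivot(index="bar", columns="foo")
b_table = pivoted["B"]   # one column per 'foo' label
print(b_table)
```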

4. Better pivoting / reshaping

In [31]: pivoted.major_xs('a')
Out[31]:
           A       B       C
one  -0.0524   1.664   1.171
two   0.2073  0.7857  0.9051

In [32]: pivoted.minor_xs('one')
Out[32]:
         A       B        C
a  -0.0524   1.664    1.171
b   0.2514  0.8306   -1.396
c   0.1256  0.3897   0.5227
d  -0.9301  0.6513  -0.2313
e    2.037   1.938  -0.3454
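`major_xs` and `minor_xs` are not DataFrame methods in current pandas releases. One way to get similar cross-sections from a pivoted frame with hierarchical columns is sketched below; it uses only the a and b rows from the slides and is an approximation of the calls above, not the talk's own code.

```python
import pandas as pd

df = pd.DataFrame({
    "foo": ["one", "one", "two", "two"],
    "bar": ["a", "b", "a", "b"],
    "A": [-0.0524, 0.2514, 0.2073, -1.032],
    "B": [1.664, 0.8306, 0.7857, -0.8615],
    "C": [1.171, -1.396, 0.9051, 1.028],
})
pivoted = df.pivot(index="bar", columns="foo")

# Similar to major_xs('a'): take row 'a', then move the value names to columns.
row_a = pivoted.loc["a"].unstack(level=0)

# Similar to minor_xs('one'): select the 'one' slice of the 'foo' column level.
one_slice = pivoted.xs("one", axis=1, level="foo")

print(row_a)
print(one_slice)
```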

5. Some other things
•  “Sparse” (mostly NA) versions of data structures
•  Time zone support in DateRange
•  Generic moving window function rolling_apply
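As a sketch of the generic moving-window idea: in current pandas the top-level `rolling_apply` function no longer exists, and the same thing is written as `.rolling(window).apply(func)`. The series values and the range function here are made up for illustration.

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 2.0, 8.0, 5.0])

# Apply an arbitrary function to each 3-element window; raw=True passes a NumPy array.
window_range = s.rolling(3).apply(lambda w: w.max() - w.min(), raw=True)
print(window_range)
```

The first two results are NaN because a full window of three values is not yet available.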

Near future
•  More powerful Group By

•  Flexible, fast frequency (time series) conversions

•  More integration with statsmodels

Thanks!
•  Hack: github.com/wesm/pandas

•  Twitter: @wesmckinn

•  Blog: blog.wesmckinney.com
